Data Warehousing and Data Mining

Q1: A data warehouse is a repository of an organization's current and historical data that management considers important for decision support. Its design covers how data is extracted from the organization's operational systems and how snapshots of that data are kept as history. The warehouse serves both ad hoc (unplanned) queries and scheduled reports, and it supports complex analysis. It does so by holding data in a form that is ready for analytical processing such as querying, reporting and data mining. A data warehouse has four defining characteristics: it is subject-oriented, integrated, time-variant and non-volatile.
Subject-oriented: data is organized by the subjects relevant to decision support, which lets users determine not only how the organization is performing but also why.
Integrated: data from different sources is placed in a consistent format. This requires the warehouse to resolve conflicts between sources, such as naming discrepancies.
Time-variant: the warehouse maintains historical data, so each record reflects the status of the data at a point in time. This makes it possible to detect trends, deviations and long-term relationships, which support the comparisons and forecasts that decision making depends on.
Non-volatile: once data has been entered into the warehouse, users cannot update or change it. Changes are recorded as new data, and obsolete data is discarded.
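The time-variant and non-volatile characteristics can be illustrated with a minimal Python sketch. The `Snapshot` and `Warehouse` names and the customer fields are hypothetical, chosen only for illustration; a real warehouse would sit on a database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded row cannot be modified (non-volatile)
class Snapshot:
    customer_id: int
    city: str
    as_of: date  # every row carries the date it describes (time-variant)


class Warehouse:
    """Hypothetical append-only store used only to illustrate the idea."""

    def __init__(self):
        self._rows = []

    def load(self, row):
        # A change in the source arrives as a new row, never as an update.
        self._rows.append(row)

    def history(self, customer_id):
        # All snapshots for one customer, oldest first.
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return sorted(rows, key=lambda r: r.as_of)

    def as_of(self, customer_id, when):
        # The most recent snapshot that was valid on the given date, if any.
        earlier = [r for r in self.history(customer_id) if r.as_of <= when]
        return earlier[-1] if earlier else None
```

Loading a second snapshot for the same customer leaves the first one intact, so a query for an earlier date still returns the earlier state.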
Q2: Natural language processing (NLP) is the part of text mining that takes the grammatical and semantic structure of language into account. Early text mining applications instead used a simpler representation, bag-of-words, to introduce structure to text-based documents so that they could be clustered into natural groupings or classified into predetermined classes. In the bag-of-words model, a text such as a sentence, a paragraph or a complete document is represented as a collection of words, and both the grammar of the text and the order in which the words appear are disregarded. Although this model is simplistic, some document classification tools still use it. One example is spam filtering, where an e-mail message is modeled as an unordered bag of words and compared with two predetermined bags. One bag is filled with words found in spam messages and the other with words found in legitimate e-mails. Some words are likely to appear in both bags, so the message is classified as spam or legitimate according to which bag its words match more closely.
NLP faces a number of challenges, including the following.
Word sense disambiguation: many words have more than one meaning, and choosing the intended meaning requires context.
Part-of-speech tagging: it is difficult to label a word as a noun, adjective, verb or adverb, because its part of speech depends on how it is used.
Syntactic ambiguity: the grammar of natural languages is ambiguous, so a sentence can often be parsed in more than one way.
Speech acts: a sentence can be considered an action by the speaker, but the sentence alone may not contain enough information to define that action.
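The two-bag spam filter described under Q2 can be sketched in a few lines of Python. This is a deliberately simple frequency-matching scheme, not a production filter, and the function names and sample messages are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    # Word counts only: grammar and word order are discarded.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_bag(messages):
    # Pool the words of many example messages into one reference bag.
    bag = Counter()
    for message in messages:
        bag.update(bag_of_words(message))
    return bag


def match_score(message_bag, reference_bag):
    # Sum of the relative frequencies, in the reference bag, of the message's words.
    total = sum(reference_bag.values())
    if total == 0:
        return 0.0
    return sum(n * reference_bag[word] / total for word, n in message_bag.items())


def classify(message, spam_bag, legit_bag):
    words = bag_of_words(message)
    if match_score(words, spam_bag) > match_score(words, legit_bag):
        return "spam"
    return "legitimate"


# Made-up training messages; "now" appears in both bags.
spam_bag = build_bag(["buy cheap stock now", "cheap pills buy now"])
legit_bag = build_bag(["meeting agenda for monday",
                       "please review the project report now"])

print(classify("buy stock now", spam_bag, legit_bag))
print(classify("project meeting on monday", spam_bag, legit_bag))
```

Because the representation is unordered, "buy now" and "now buy" produce the same bag, which is exactly the information loss that fuller NLP techniques try to avoid.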
A further challenge is imperfect or irregular input: foreign accents and unfamiliar vocabulary in speech, and grammatical or typographical errors in text, make language harder to process. Speech or text without clear boundaries between units is also difficult to analyze.
Q3: Text mining is the semi-automated process of extracting useful information and knowledge from large amounts of unstructured data. Data mining is the process of identifying novel, potentially useful, valid and understandable patterns in data stored in structured databases. Web mining is the process of discovering interesting and useful information from web data, which takes the form of linkage, textual and usage information.
The three are similar in that they share a purpose and follow the same overall process: each extracts useful information automatically or semi-automatically from a large body of data. They also share techniques. In text mining, for example, documents are grouped into categories and reduced to word frequencies, which can then be clustered or modeled probabilistically with the same methods that data mining applies to structured records. Web mining draws on both, since web data is partly textual.
The major difference lies in the data each one handles and, as a result, in how each is implemented. Data mining works on structured databases, text mining on unstructured text, and web mining on the links, content and usage records of the web, so unstructured sources must first be given structure before the shared tools can be applied.
Q4: Web 2.0 refers to the emerging technological trend in which the World Wide Web permits and enhances interaction among users so that they can share information on online platforms.
The major components of Web 2.0 include blogs, wikis and web services. The technology is considered an important part of online business, and its use has affected the business world in both positive and negative ways.
Wikis: these support collaboration across the structures of a company. Users post hypotheses and requests for help, and others respond with suggestions and commentary. This helps users find solutions to the problems they meet in their daily business activities.
Blogs: a blog is an interactive tool that can be centered on a particular kind of data or a key metric. It provides an interface for posting information and then collects comments on it. This matters to business intelligence because it allows information to be shared and accessed widely, which broadens the scope of business knowledge.
RSS feeds: these can be embedded in dashboards to support the inquiry process. A feed can surface items that prompt collective analysis of transaction characteristics and selling behaviors associated with high sales.
Virtual worlds: a virtual world gives users a sense of telepresence and allows participation at a distance. It is an artificial, computer-generated environment in which users create avatars that interact with other users' avatars and with computer-generated characters. Avatars can stand in for employees, managing business activities and carrying out functions designated by the user, which makes virtual worlds relevant to business intelligence.
Q5: How can data be divided between training and test sets? A data set can be divided into training data and test data, and several methods are available for making the division.
The division is made so that the accuracy of a model can be estimated on data it was not trained on. Two approaches to evaluation are residual evaluation and cross-validation, and each has drawbacks. Residual evaluation measures error only on the data used for training, so it gives no indication of how well the learner will perform when it is asked to make predictions about data it has not seen. This makes it inferior to cross-validation. The problem is avoided by not using the entire data set when training the learner.
Cross-validation is the better method. It divides the data set into a training set and a test set: some of the data is removed before training begins, and once training is complete the removed data is used to test the learner's performance on new data. There are several kinds of cross-validation, including the holdout method, k-fold cross-validation and leave-one-out cross-validation. The most common, and generally the best, of these is k-fold cross-validation.
K-fold cross-validation: the data is divided into k subsets and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are combined to form the training set. The average error across all k trials is then computed. The main advantage of this method is that it matters much less how the data happens to be divided, because every data point is used for testing exactly once.
Q6: ETL (extract, transform, load) is the data integration process that extracts data from external sources, transforms the extracted data into an appropriate format, and loads the transformed data into the warehouse repository. It involves the physical movement of data from its sources to the target. Extraction is the first stage, in which the data is collected from the source.
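The k-fold procedure described under Q5 can be sketched in plain Python as follows. The function names are hypothetical, the "learner" is a toy that always predicts the training mean, and squared error stands in for whatever error measure an analysis would actually use.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k subsets of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The other k-1 subsets are combined to form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(xs, ys, k, fit, predict):
    """Return the average of the k per-fold mean squared errors."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_error = sum((predict(model, xs[i]) - ys[i]) ** 2 for i in test) / len(test)
        errors.append(fold_error)
    return sum(errors) / len(errors)


# Toy learner for illustration: ignore x and predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


xs = list(range(10))
ys = [2.0 * x for x in xs]
print(cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean))
```

Each index appears in exactly one test fold, which is why the result depends less on any single split than a one-off holdout would.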
After extraction, the second stage of ETL is transformation, in which the data is converted into a form compatible with the target database. The final stage is loading, in which the transformed data is imported into the data warehouse. The three stages are described in more detail below.
Extraction: the ETL process connects to the source systems, then selects and collects the data needed for processing in the data warehouse or data mart. The data comes from various sources and may be stored in different forms, so extraction converts it into a single format suitable for the transformation stage. The complexity of this stage depends on the amount of data to be extracted.
Transformation: a series of functions is applied to the extracted data to convert it into the required format. Records are either validated or rejected at this stage. The amount of manipulation required depends on the data. The operations involved include data filtering, standardization, sorting, translation and consistency checking.
Loading: the data that has been extracted and transformed is imported into the target data warehouse. The load may insert each record as a new row. When loading is done in bulk, integrity checks may be bypassed, so records should be validated during transformation.
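The extract, transform and load stages of Q6 can be sketched with Python's standard library alone. The CSV content, the `orders` table and the column names are invented for illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Made-up source data: one name needs standardizing, one amount is invalid.
SOURCE_CSV = """order_id,customer,amount
3,carol,7.25
1, alice ,10.50
2,BOB,not-a-number
"""


def extract(source_text):
    # Extraction: read raw records from the source into one common format (dicts).
    return list(csv.DictReader(io.StringIO(source_text)))


def transform(rows):
    # Transformation: standardize, validate or reject, then sort.
    clean, rejected = [], []
    for row in rows:
        try:
            record = (int(row["order_id"]),
                      row["customer"].strip().title(),
                      float(row["amount"]))
        except ValueError:
            rejected.append(row)  # fails validation, so it is not loaded
        else:
            clean.append(record)
    clean.sort(key=lambda record: record[0])
    return clean, rejected


def load(conn, records):
    # Loading: insert each transformed record into the target as a new row.
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract(SOURCE_CSV))
load(conn, clean)
print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
print(len(rejected), "record(s) rejected")
```

Validation is placed in `transform` rather than `load` so that bad records are rejected before they reach the target, in line with the point that the load step itself may not check integrity.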
