Web Content Outlier Mining Through Using Web Datasets - Research Paper Example

Summary
The development of information technology, global networks and the internet, and the rapid growth of electronics engineering have made access to knowledge easy and fast, with huge volumes of information available at one's fingertips…

Extract of sample "Web Content Outlier Mining Through Using Web Datasets"

Web Content Outlier Mining: Using Web Datasets and Finding Outliers in Them

The development of information technology, global networks and the internet, and the rapid growth of electronics engineering have made access to knowledge easy and fast, with huge volumes of information available at one's fingertips. The knowledge sought by an individual, however, is always very specific, so searching for specific knowledge in huge databases and data warehouses has become an essential need. While surfing web content on the internet, knowledge seekers come across large amounts of information that is irrelevant to the subject of their search; such information is generally referred to as a web content outlier. This research investigates different methods of extracting outliers from web contents. Using web contents as datasets, it aims to find an algorithm that extracts and mines the varying contents of web documents of the same category. The structure of HTML is used in this paper, together with various available techniques, to build a model for mining web content outliers.

In modern times, information is overloaded across huge databases, data warehouses and websites. The growth of the internet, and of the uploading and storing of information in bulk on websites, is exponential. Access to information has also been made very easy for the common man through the internet and web-browser technology. The structure of the web is global, dynamic, and enormous, which has made it necessary to have tools for automated tracking and efficient analysis of web data. This necessity for automated tools has started the development of systems for mining web contents.

Extracting data is also referred to as knowledge discovery in datasets. The process of discovering patterns which are interesting and useful, together with the procedures for analyzing and establishing their relationships, is described as data mining. Most of the algorithms used today in data mining find patterns that are frequent and eliminate those which are rare. These rare patterns are described as noise, nuisance or outliers (Data mining, 2011).

The process of mining data involves three key steps of computation. The first step is model learning, the second is model evaluation, and the third is the use of the model. To clearly understand this division, it is necessary to classify data (Data mining, 2011).

The first step in data mining is model learning. It is the process in which unique attributes are found for a group of data. The attributes classify the group, and based on them an algorithm is built which defines the class of the group and establishes its relationship. A dataset whose attributes are known is used to test this algorithm, generally called a classifier. Results produced by the classifier assist in determining the minimum requirements for accepting data of the known class. They give the accuracy of the model, and if the accuracy is acceptable, the model is used to determine the similarity of each document or item in a dataset (Data mining, 2011).

The second step in data mining is model evaluation. Techniques used for evaluating the model depend largely on the known attributes of the data and the types of knowledge involved. The objectives of the data users determine the tasks for data mining and the types of analysis. These tasks include Exploratory Data Analysis (EDA), Descriptive Modeling, Predictive Modeling, Discovering Patterns and Rules, and Retrieval by Content.
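The learn–evaluate–use cycle described above can be illustrated with a short sketch. This is not the paper's model: it assumes scikit-learn, a generic labeled dataset, and an arbitrary 0.9 acceptance threshold, and simply shows a classifier being trained, evaluated on held-out data, and then applied.

```python
# Minimal sketch of the three data-mining steps described above
# (model learning, model evaluation, model use). Illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model learning - fit a classifier on data whose classes are known.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Step 2: model evaluation - test the classifier on held-out data with known labels.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"classification accuracy: {accuracy:.2f}")

# Step 3: model use - if the accuracy is acceptable, apply the model to new items.
if accuracy >= 0.9:  # acceptance threshold chosen by the analyst (assumption)
    new_item = [[5.1, 3.5, 1.4, 0.2]]
    print("predicted class:", clf.predict(new_item))
```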
Outliers are generally found through anomaly detection, which means finding instances of data that are unusual and do not fit the established pattern (Data mining, 2011). Exploratory Data Analysis (EDA) shows small datasets interactively and visually, for example in the form of a pie chart or coxcomb plot. Descriptive Modeling is the technique that shows the overall data distribution, through methods such as density estimation, cluster analysis and segmentation, and dependency modeling. Predictive Modeling uses variables with known values to predict the value of a single unknown variable; in classification the prediction is categorical, while in regression it is quantitative. The tasks related to exploratory data analysis, descriptive modeling and predictive modeling are all concerned with building a model (Hand, Mannila, & Smyth, 2001).

The task of discovering patterns and rules includes techniques for detecting similar behaviour in the data of a group, such as trends or patterns based on association rules. The task of retrieval by content is based firstly on defining or describing a pattern or topic and then using it as the basis to search for and retrieve similar contents from the given dataset. In the case of outliers, the dissimilar contents are searched for and retrieved. The percentage of similarity or dissimilarity is set by the user as a parameter, which is then used to decide the acceptance or rejection of contents (Hand, Mannila, & Smyth, 2001). The datasets used for content retrieval are mostly text and images. A set of keywords forms the pattern used to search and retrieve from a text dataset, whereas a description, a sketch or a sample image is used to search for and retrieve similar or dissimilar images from a dataset of images (Hand, Mannila, & Smyth, 2001).

The third and last step in data mining is the use of models: models are applied on datasets for data mining. The aim of this paper is to select a model and apply it to web content outlier mining (Data mining, 2011).

Methods

Data which are observed as largely contrasting with the rest of the observed data, and whose contradiction is such that it creates doubt about their generation by the same mechanism, are referred to as outliers. They are the data objects which are not consistent with the remaining data objects. Web outliers are those found in web data, and web outlier mining is the process in which outliers are extracted from web contents (Data mining, 2011).

Components of Data Mining

Data mining is the result of the integration of four components in an algorithm. These components are:
1. Structure of the pattern or model
2. Score function
3. Method of search and optimization
4. Strategy for managing data

The models or patterns referred to in data mining are basically the relationships and summaries obtained by an exercise of data mining; they are represented as tree structures, graphs, rules, clusters, linear equations and patterns of a recurring nature over a given time (Hand, Mannila, & Smyth, 2001). Results in the shape of models or patterns are produced by implementing and applying a well-defined procedure, usually termed a data mining algorithm. Whenever data is processed through such an algorithm, it produces either a model or a pattern, which can then be used for further analysis and extraction of data (Hand, Mannila, & Smyth, 2001). The score function determines the real benefit of a model or pattern: it quantifies how well a structure fits a given dataset.
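As a hedged illustration of retrieval by content with a user-set similarity parameter, the sketch below scores a few documents against a query using TF-IDF and cosine similarity, accepting those above a threshold and treating the rest as outliers. The query, the documents and the 0.2 threshold are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of retrieval by content with a user-set similarity threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "flight reservation and ticket booking"
documents = [
    "online booking of airline tickets and seat reservation",
    "current job openings and career opportunities in the airline",
    "flight schedules, fares and ticket booking conditions",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([query] + documents)
similarities = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

threshold = 0.2  # user-chosen acceptance/rejection parameter (assumption)
for doc, sim in zip(documents, similarities):
    label = "relevant" if sim >= threshold else "outlier"
    print(f"{sim:.2f}  {label}: {doc}")
```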
The precise utility of a model is very difficult to measure, and therefore score functions of a generic nature, such as least squares or classification accuracy, are used to quantify the benefits. The suitability and efficiency of models are compared through their score functions (Hand, Mannila, & Smyth, 2001).

Optimization and search aim at finding the optimum values of the parameters in models. The optimum value is the most efficient and effective value derived after applying all the prescribed restrictions and rules and utilizing all the available resources. Search is the process of finding the most relevant information in the shortest possible time, with minimum use of resources, using techniques of combinatorial and heuristic search (Hand, Mannila, & Smyth, 2001).

The strategy for managing data refers to the process of determining ways of storing, indexing and accessing data. Algorithms developed for analyzing data in statistics and machine learning traditionally assumed that the data points were rapidly accessible in random-access memory (Hand, Mannila, & Smyth, 2001). Huge datasets, however, are stored in secondary storage such as disks and tertiary storage such as tapes, because they cannot fit in RAM. Accessing data from secondary and tertiary storage reduces speed, since most of the data points are far from the processor and additional time is required to load data into RAM before any processing. Results achieved through processing must be saved back to external storage before new data is retrieved (Hand, Mannila, & Smyth, 2001). Hence the core issues in accessing massive datasets are speed and time.

Role of Statistics in Data Mining

The role of statistics in data mining is very important. Patterns and models based on the potential of future data are used to generalize over data which is massive and unknown. Massive databases are found in our daily life today. For example, Wal-Mart, one of the biggest retailers in the USA, makes over 20 million transactions a day, and in 1998 its database of transactions was 11 terabytes in size. Similarly, AT&T records 300 million long-distance calls every day from its 100 million customers (Hand, Mannila, & Smyth, 2001).

Massive datasets are mostly distributed and stored on several machines located at different sites. Obtaining a well-defined and rapidly accessible random sample from such widely split data points is one of the biggest challenges in statistics. This challenge is made worse by the fact that data on distributed networks are constantly evolving, changing with new information and technologies, which expands the data size in multiple directions. Many datasets have properties which are unknown or unaccounted for in the applications of statistics. Huge datasets also contain many other distortions, such as data values that are missing, contaminated, or corrupted (Hand, Mannila, & Smyth, 2001). The overlap between statistical techniques and data mining is considerable, yet many complicated problems have come out of the mammoth size and unaccounted-for properties of datasets (Hand, Mannila, & Smyth, 2001).

Measuring Data for Mining

A dataset is an integration of data items, each with its own value. The difference between two data items of a dataset is generally referred to as their distance: how far one item is from another. Measurement is the process of defining the properties of each data item based on its values and attributes.
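Returning to the score-function component discussed above, the following sketch shows a generic least-squares score used to compare two candidate models on the same data. The data points and the two candidate linear models are made-up values for illustration only.

```python
# Minimal sketch of a generic score function (least squares) used to
# compare two candidate models; lower score means a better fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def least_squares_score(predictions, targets):
    """Sum of squared errors between predictions and observed targets."""
    return float(np.sum((predictions - targets) ** 2))

model_a = 2.0 * x          # candidate model: y = 2x
model_b = 1.5 * x + 1.0    # candidate model: y = 1.5x + 1

print("score of model A:", least_squares_score(model_a, y))
print("score of model B:", least_squares_score(model_b, y))
# The model with the lower score is judged the more suitable structure.
```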
These properties are commonly referred to by names such as WEIGHT, SEVERITY or PREFERENCE. The properties of data items are assigned numeric values so that the difference, or distance, between any two items can be easily calculated and determined (Hand, Mannila, & Smyth, 2001).

There are two common ways of measuring the difference or distance between data items. One is by description of the distinction, achieved through understanding the working of a system; the other is by prediction of the difference, based on the potential of future information. The two techniques overlap, and yet they are completely different in how they determine the difference between two data items: measurements made by description use a representational approach through mappings, whereas measurements made by prediction use an operational approach through numerical values (Hand, Mannila, & Smyth, 2001).

In data mining, measurements are made by measuring the similarity or dissimilarity between any two data items. It is fundamental to define what similarity or dissimilarity means so that data items can be measured accordingly. Once either similarity or dissimilarity is defined, the other is found easily through transformation (Hand, Mannila, & Smyth, 2001). If the similarity between two data items is represented as s(i,j) and their dissimilarity as d(i,j), the simplest transformation is d(i,j) = 1 − s(i,j); other monotone decreasing functions of s(i,j) can be used in the same way (Hand, Mannila, & Smyth, 2001).

The measure of similarity or dissimilarity is generally termed proximity. In the context of measuring similarities, two other terms are also commonly used: distance and metric. A measure of dissimilarity obtained from the attributes of data items is informally termed a distance. A measure of dissimilarity that satisfies the following three conditions is termed a metric:
1. d(i,j) ≥ 0 for all i and j, and d(i,j) = 0 if and only if i = j;
2. d(i,j) = d(j,i) for all i and j;
3. d(i,j) ≤ d(i,k) + d(k,j) for all i, j and k.
The third condition is termed the triangle inequality.

In massive datasets the variables do not correspond in amount, magnitude or degree. The common strategy used to overcome this problem of commensurability is standardization of the data through division of each variable by its sample standard deviation. The items of a dataset may also not be equally important: some may be less important and some more important. To give relative importance to each variable, weights are assigned after standardizing the data, giving the weighted Euclidean measure of distance, which can be written as d(i,j) = √( Σ_k w_k (x_ik − x_jk)² ), where w_k is the weight of the k-th variable and x_ik is the value of the k-th variable for item i.

The quality of measurement is also very important: measurements must be accurate, that is, close to the true value. Accurate procedures have small variability and generate results that are near the true value; the term used for such consistency of measurement is reliability. However, the processes used for determining accuracy or precision are different from those used for determining reliability. Validity concerns verifying the accuracy and reliability of the measurements taken (Hand, Mannila, & Smyth, 2001).

Outliers in Datasets

Any model or pattern can be obscured by abnormal data items, generally termed outliers. For a single variable, outliers can be detected by plotting the data, for example in a histogram: the points that lie far out in the tails can be regarded as outliers. With multiple variables, however, the problem becomes complicated, and computers and sophisticated software are used when multiple variables are involved and datasets are massive, such as web contents.
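The standardization and weighted Euclidean distance just described can be sketched as follows; the sample data and the weight values are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch: standardize each variable by its sample standard deviation,
# then compute a weighted Euclidean distance between two items.
import numpy as np

data = np.array([
    [170.0, 65.0, 30.0],
    [180.0, 80.0, 45.0],
    [160.0, 55.0, 25.0],
])

# Standardization makes the variables commensurable.
standardized = data / data.std(axis=0, ddof=1)

# Weights expressing the relative importance of each variable (assumption).
weights = np.array([1.0, 2.0, 0.5])

def weighted_euclidean(x_i, x_j, w):
    """d(i,j) = sqrt( sum_k w_k * (x_ik - x_jk)^2 )"""
    return float(np.sqrt(np.sum(w * (x_i - x_j) ** 2)))

print(weighted_euclidean(standardized[0], standardized[1], weights))
```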
Large datasets have many distortions, such as incompleteness, measurement error, and contamination. Accurate, precise and reliable analysis of a dataset is possible only if the outliers and distortions are removed from it (Hand, Mannila, & Smyth, 2001). When the quality of data is poor, it pollutes clear thinking and correct decision making; data that are biased or of poor quality, and the analyses derived from them, can have serious consequences for improving quality of life and for evolving scientific theories (Hand, Mannila, & Smyth, 2001).

Analysis of numeric data plays a very important role even when the data are non-numeric. Mostly, items of non-numeric data, or their relationships, are transformed into numeric expressions which are then analyzed with standard methods. For example, while processing text data, the frequency of a word is calculated, or sometimes the probabilities of words in documents are calculated, for analysis (Hand, Mannila, & Smyth, 2001).

Systematic Data Mining Algorithms

Mining data requires systematic algorithms. An algorithm for data mining is a well-defined set of procedures that takes data as input and produces models or patterns as output. There are many algorithms for mining data using different mathematical techniques; however, they all share common components that are used to execute the algorithm. These components are the task, the structure, the score function, the search or optimization method, and the data management technique. There are three renowned algorithms for mining data, and Figure 1 shows the methods they use for each component of the data mining algorithm (Hand, Mannila, & Smyth, 2001).

Figure 1: Component-wise methods used by three renowned algorithms

Web Content Outlier

A website has many pages, each having contents related to its title. There are also several pages in each website which are not related to the main topic. An airline website, for example, has pages for reservations and bookings of flights; it also has pages for careers, related to the job opportunities available in the airline. For a person seeking to buy an air ticket and browsing the site to make a booking, the pages on careers are irrelevant and can be called outliers. For a person looking for a job with the airline, however, the career pages are very important, and the pages on flight information, reservations and booking are irrelevant; they are the outliers for the job seeker (Agyemang, Barker, & Alhajj, 2005).

Algorithm for Web Content Outlier Mining

The n-gram technique is used in this algorithm to convert non-numeric (text) data into a frequency profile: a frequency distribution is generated for each document separately based on its n-grams. An n-gram is a set of n contiguous letters picked from a word. For example, from the word “stadium” the substrings “stad”, “tadi”, “adiu”, and “dium” are all the possible 4-grams, excluding trailing and preceding blanks. The possible number of n-grams that can be generated from any string or word of length k is (k − n + 1), where n is the n-gram size (Agyemang, Barker, & Alhajj, 2005). Studies of the n-gram technique reveal that n-grams of 4 or 5 letters are an efficient way of finding the similarity of words; n-grams that are too short, such as 1 or 2 letters, or too long, such as 8 or 9 letters, do not give accurate results for finding similarity (Agyemang, Barker, & Alhajj, 2005).
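As a small illustration of the n-gram step described above (not the authors' implementation), the sketch below extracts character n-grams following the (k − n + 1) rule and builds a frequency profile for a document; the sample text is an illustrative assumption.

```python
# Minimal sketch of character n-gram extraction and an n-gram frequency profile.
from collections import Counter

def char_ngrams(word, n):
    """Return all contiguous substrings of length n (k - n + 1 of them)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("stadium", 4))
# ['stad', 'tadi', 'adiu', 'dium']  -> 7 - 4 + 1 = 4 n-grams

def ngram_profile(text, n=4):
    """Frequency distribution of n-grams over all words of a document."""
    profile = Counter()
    for word in text.lower().split():
        profile.update(char_ngrams(word, n))
    return profile

doc = "flight booking and flight reservation"
print(ngram_profile(doc).most_common(3))
```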
The second piece of information that this algorithm uses is the HTML tags …
