Efficient Data Mining Classification Technique Essay

With many real-world applications, such as spam filtering, climate monitoring, and fraud detection, the importance of developing a more general classification framework can hardly be overstated. The general context is that of labeling multiple, potentially high-volume data streams, which may contain evolving data and irrelevant information. Many recent efforts have focused separately on concept-evolution, class imbalance, and data reduction. A more general learning framework that considers the combined problem, that is, learning from multiple evolving big data streams with imbalanced classes, has been neglected. Therefore, this thesis addresses the concept-evolution problem together with the data reduction and class balancing issues.

This research project aims to study the shortcomings of existing novel class detection, data reduction, and class balancing techniques in terms of their accuracy, efficiency, and applicability to real-life multi-stream applications, and to provide alternative solutions that overcome those drawbacks. The thesis aims to propose a general model and algorithm that will be tested on synthetic data and on well-known real data sets, e.g., KDD Cup 99 network intrusion detection (KDD), Auslan [Kad02], and EMG [Kol05].

RESEARCH MOTIVATION

Classification, clustering, and aggregation are among the data mining topics of great value across engineering and scientific areas, such as the biological, physical, and biomedical sciences. The diversified nature of rapidly growing data, together with its composite aspects and multiple autonomous sources, is a major issue in data mining and creates the need for methods suited to real-life applications.
The motivation behind this study is presented in the following paragraphs.

Novel Class Detection

The first issue this thesis addresses is evolving data, which poses a challenge for classification. Growing and dynamic data streams require effective and efficient methods that differ considerably from static data mining methods. Infinite length and concept-drift are the well-studied features of data streams, and diverse methods have been suggested in the literature to address infinite length [Fan04] and concept-drift [Cha07][Kol05][Wan03]. Yet data streams have two other challenging characteristics, feature-evolution and concept-evolution, which are ignored by present methods. Concept-evolution occurs when new classes appear in the data. For instance, in intrusion detection over a network traffic stream, if each type of attack is treated as a class label, concept-evolution occurs when an entirely new kind of attack appears in the traffic.

The novel class detection problem has been addressed in [Mas09][Mas10]. However, these approaches have several drawbacks. First, the false alarm rate (i.e., detection of existing classes as novel) is high. Second, if there is more than one novel class, they are unable to distinguish among them. Third, they handle only numerical attributes, not categorical ones. Finally, and most importantly, they are unable to detect a novel class across multiple data streams. All existing works have ignored this dimension. If many instances from different streams in a data set form a new class, that class will be misclassified by all proposed classifiers, because each classifier focuses only on the classes with which it has been associated.
The research outcome will be useful to many applications in various domains, such as climate change, spam filtering, intrusion detection, and fault detection, to name a few.

Class Balancing

A classification model that is built on imbalanced training data and therefore favors the majority group is one of the serious problems of data stream mining [Jap02][Tan09]. It arises when classification techniques are applied to an extremely imbalanced data set and the minority class is effectively skipped during sampling [Cha04][Wan12]. Data sampling techniques can correct the skewed class distribution either by reducing the sample size of the majority class or by increasing the sample size of the minority class. However, algorithms that modify the sample distribution using greedy approaches can introduce unwanted bias. Domains in which class imbalance is common include medical diagnosis, where predicting a rare but important disease matters more than routine treatment, as well as intrusion detection, risk management, text classification, and information retrieval [Cha04][Wil09].

Data Reduction

Data reduction is a significant step in preparing the available data for the actual classification phase. It generally yields several gains in the economy of the data mining procedure: improved model interpretability through reduced complexity, faster execution through lower computational effort, and better performance through a smaller data set. Moreover, the integrity of the original data is maintained, and cost can be curtailed by avoiding expensive attributes [Das97]. Attribute subset selection reduces the size of the data set by eliminating redundant or irrelevant features.
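As a rough illustration of eliminating redundant features, the sketch below drops any numeric attribute that is almost perfectly correlated with an attribute already kept. This is only one simple filter-style heuristic and is not a method proposed in this essay. The function names, the 0.95 threshold, and the toy columns are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to measure
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept.

    columns: dict mapping attribute name -> list of numeric values.
    threshold: hypothetical cut-off on |correlation| (an assumption).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data, invented for illustration only.
data = {
    "duration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "duration_ms": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],  # rescaled copy
    "bytes": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_redundant(data)
print(selected)
```

Here the rescaled copy of the first column is discarded because it carries no additional information, while the weakly correlated third column is retained. The result depends on the order in which columns are visited, which is one reason such greedy filters are usually treated as a first pass rather than a complete selection method.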
The main aim of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. The literature has analyzed the feature (attribute) selection problem extensively, and current solutions include wrapper, filter, and embedded techniques. Wrappers use a search algorithm to explore the space of possible feature subsets and evaluate each subset by running a model on it. Wrappers can be computationally expensive and risk overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating a subset against a model, they evaluate a simpler filter measure. Generally, wrappers have offered the most suitable strategy for the desired output. However, the optimal wrapper depends on the characteristics of the problem and on the particular learner used for classification. A specific wrapper may produce substantial improvements for a particular algorithm on one problem, yet the same wrapper may fail on a different problem (or a different optimization goal), or on the same problem with a different learning procedure. That concern needs to be addressed.

LITERATURE REVIEW

Data stream classification and novelty detection have recently received significant attention in many practical real-world applications, such as intrusion detection, spam filtering, and climate change, where data distributions inherently change over time [Far12]. In data stream classification, the assumption that the total number of classes is fixed is invalid in a real streaming environment, where new concepts may evolve at any time. The arrival of a novel class has been ignored by most existing data stream classification techniques [Mas10].
Novel class detection in stream data classification falls into two categories: the incremental approach, also called the single-model approach, and the ensemble approach. The ensemble technique combines several classifiers with the aim of obtaining an improved composite model and uses voting to label unlabeled instances. In [Mas09], [11s], the authors give the definitions of existing and novel classes. Recent work in the data stream mining domain describes a clustering approach that can detect both concept-drift and novel classes, but it assumes that there is only one “normal” class and that all other classes are novel. Thus, it may not work well if more than one class is to be considered “normal” or non-novel.

Masud et al. proposed MineClass [Mas09][Mas10], which stands for Mining novel Classes in data streams, using two base learners, k-nearest neighbor and decision tree. MineClass can detect novel classes in the presence of concept-drift, even when the model consists of multiple “existing” classes. ActMiner [Mas10] extends MineClass and addresses the limited labeled data problem by applying an ensemble classification technique, in addition to addressing the other three problems, thereby reducing the labeling cost. It also applies active learning, but its data selection process differs from that of other approaches. An unsupervised novel concept detection technique for data streams has also been proposed, but it is not applicable to multi-class classification. In contrast, as mentioned above, MineClass addresses the concept-evolution problem in a multi-class classification framework. However, MineClass does not address the limited labeled data problem and requires that all instances in the stream be labeled and available for training.

ECSMiner [Mas11][Pra12][Sha12] stands for enhanced classifier for data streams with novel class miner. Novel class detection using ECSMiner differs from traditional class detection techniques.
This approach offers a “multiclass” framework for the novelty detection problem that can distinguish between different classes of data and discover the emergence of a novel class. The technique is nonparametric and is therefore not restricted to any specific data distribution. Like MineClass, it is applied with two different classifiers as base learners: decision tree and k-nearest neighbor. When a decision tree is used as the classifier, each training data chunk is used to build a decision tree. A k-NN strategy, by contrast, would lead to an inefficient classification model in terms of both memory and running time.

SCANR [Mas11] (Stream Classifier And Novel and Recurring class detector) is a technique for classification with novel and recurring classes. It addresses recurring classes, a special and more common case of concept-evolution in data streams, which occurs when a class reappears after a long absence from the stream. ECSMiner identifies recurring classes as novel classes.

In [Bis12][Far12], the authors proposed a new decision tree learning approach for novel class detection. In this approach, a threshold value is calculated based on the ratio of the percentage of data points between each leaf node in the tree and the training data set, and the data points of the training data set are clustered based on the similarity of attribute values. The paper [Fan04] describes ID3 (Iterative Dichotomiser 3), a decision tree learning algorithm that builds the tree using information theory. The C5.0 algorithm improves the performance of tree building using boosting, an approach to combining different classifiers. CART (classification and regression trees) generates a binary tree for decision making; it handles missing data and includes a pruning strategy.
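To make the information-theoretic criterion used by ID3 more concrete, the following minimal sketch computes Shannon entropy and information gain for categorical attributes and picks the attribute with the highest gain. It illustrates only the split-selection step, not a full tree builder. The function names and the toy records are hypothetical and are not taken from any of the cited works.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning on a categorical attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """ID3-style choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records, invented for illustration only.
rows = [
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "udp", "flag": "S0"},
]
labels = ["attack", "normal", "normal", "attack"]
chosen = best_attribute(rows, labels, ["protocol", "flag"])
print(chosen)
```

In this toy data the `flag` attribute separates the two classes perfectly, so it yields the full one bit of gain, whereas `protocol` leaves both partitions evenly mixed and yields none.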
The SPRINT (Scalable Parallelizable Induction of Decision Trees) algorithm uses an impurity function called the Gini index to find the best split. In this approach, the authors introduce decision-tree-based novel class detection for concept-drifting data stream classification, which builds a decision tree from the data stream. The decision tree is continuously updated with new data points so that the most recent tree represents the most recent concept in the data stream.

MCM [Mas13] stands for multiclass miner in data streams. It improves on previous novel class detection techniques by proposing an adaptive threshold for outlier detection, which is a vital part of novel class detection. In addition, it uses a probabilistic approach to novel class detection based on the discrete Gini coefficient. Moreover, it applies a graph-based approach to detect the simultaneous appearance of more than one novel class and to separate the instances of one novel class from the others.

DTNC [Mia13] stands for Detecting Novel Classes in Data Streams. It uses VFDTc as the base learner to speed up the process and reduce the model size. VFDTc has two major extensions to VFDT: a split test of the form attr_i ≤ cut_point for numeric attributes, and a naive Bayes classifier applied in the tree leaves to deal with concept drift.

C&CD [Div14] stands for Classification and novel Class Detection. It uses an outlier detection method to remove unwanted data from the data streams, and it uses the nearest neighbor algorithm and the naive Bayes classifier for novel class detection.

However, none of the above-mentioned works deals with detecting a new class across multiple streams. All existing works have ignored this dimension. If many instances from different streams in a data set form a new class, that class will be misclassified by all of the above classifiers, because each of them focuses only on the classes on which it has been trained from a single data stream.
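For reference, the Gini-based split search mentioned for SPRINT, combined with a numeric split test of the form attr_i ≤ cut_point as in VFDTc, can be sketched as follows. This is a simplified in-memory illustration and not the SPRINT or VFDTc implementation: it omits attribute lists, parallelism, and incremental updates. The use of midpoints between consecutive values as candidate cut points, the function names, and the toy values are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return (cut_point, weighted_gini) for the best test value <= cut_point."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate (assumption)
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Impurity of the two sides, weighted by their share of the instances.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (cut, score)
    return best


# Hypothetical toy attribute values and labels, invented for illustration only.
values = [2.0, 3.0, 10.0, 12.0]
labels = ["normal", "normal", "attack", "attack"]
cut, score = best_cut_point(values, labels)
print(cut, score)
```

A lower weighted impurity means a purer split; in the toy data a cut between 3.0 and 10.0 separates the two classes completely.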