Efficient Data Mining Classification Technique - Essay Example

Summary
This research project “Efficient Data Mining Classification Technique” aims to study the shortcomings of existing novel class detection, data reduction, and class balancing data mining techniques in terms of their accuracy and efficiency…

Extract of sample "Efficient Data Mining Classification Technique"

Efficient Data Mining Classification Technique
With many real-world applications, such as spam filtering, climate monitoring, and fraud detection, the importance of developing a more general classification framework can hardly be overstated. The general context is that of labeling multiple, potentially high-volume data streams, which may contain evolving data and irrelevant information. Many recent efforts have focused separately on concept-evolution, class imbalance, and data reduction. A more general learning framework that considers these problems together, that is, learning from multiple evolving big data streams that also have unbalanced classes, has been ignored. This thesis therefore addresses the concept-evolution problem along with the data reduction and class balancing issues. The research project aims to study the shortcomings of existing novel class detection, data reduction, and class balancing techniques in terms of their accuracy, efficiency, and applicability to real-life multi-stream applications, and to provide alternative solutions that overcome those drawbacks. The thesis aims to propose a general model and algorithm that will be tested on synthetic data and on well-known real data sets, e.g. KDD Cup 99 network intrusion detection (KDD), Auslan [Kad02], and EMG [Kol05].
RESEARCH MOTIVATION
Classification, clustering, and aggregation are data mining hot topics of great value in all engineering and scientific areas, such as the biological, physical, and biomedical sciences. The diverse nature of rapidly growing data, with its composite aspects and multiple autonomous sources, is a major issue in data mining and creates the need to develop real-life applications.
The motivation behind this study is set out in the following paragraphs.
Novel Class Detection
The first issue the thesis addresses is evolving data, which poses a challenge for classification. Growing, dynamic data streams need effective and efficient methods that differ considerably from static data mining methods. Infinite length and concept-drift are the well-studied features of data streams, and diverse methods have been suggested in the literature to address infinite length [Fan04] and concept-drift [Cha07][Kol05][Wan03]. Yet data streams have two other challenging characteristics, feature-evolution and concept-evolution, which present methods ignore. Concept-evolution occurs when new classes appear in the data. For instance, in intrusion detection on a network traffic stream, if each attack type is treated as a class label, then concept-evolution occurs when an entirely new attack appears in the traffic.
The novel class detection problem has been addressed in [Mas09][Mas10]. However, these approaches have several drawbacks. First, the false alarm rate (i.e., detection of existing classes as novel) is high. Second, if there is more than one novel class, they cannot distinguish among them. Third, they deal only with numerical attributes, not with categorical attributes. Finally, and most importantly, they are unable to detect a novel class across multiple data streams. All existing works have ignored this dimension. If many instances from different streams in a data set form a new class, that class will be misclassified by all proposed classifiers, because each classifier focuses only on the classes with which it has been associated.
The research outcome will be useful to applications in many domains, such as climate change, spam, intrusion detection, and fault detection, to name a few.
Class Balancing
A classification model that is built on training data and favors the majority class is one of the serious problems of data stream mining [Jap02][Tan09]. It arises when classification techniques are applied to an extremely imbalanced dataset and the minority class is largely skipped during sampling [Cha04][Wan12]. Data sampling can correct the skewed class distribution either by reducing the sample size of the majority class or by increasing the sample size of the minority class. However, algorithms that modify the sample distribution using greedy approaches can introduce unwanted bias. Domains in which class imbalance is common include medical diagnosis, where predicting a rare but important disease matters more than regular treatment, as well as intrusion detection, risk management, text classification, and information retrieval [Cha04][Wil09].
Data Reduction
Data reduction is a significant step in preparing the available data for the actual classification phase. It generally makes the data mining procedure more economical in several ways: improved model interpretability through lower complexity, faster execution through less computational effort, and better performance through a smaller data set. Moreover, the integrity of the original data is maintained, and cost can be cut by avoiding expensive attributes [Das97]. Attribute subset selection reduces the size of the data set by eliminating redundant or irrelevant features.
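As a rough illustration of eliminating redundant or irrelevant attributes, the sketch below drops two kinds of columns from a small table. The criteria are assumptions chosen for simplicity and are not the thesis's method: a column with a single constant value is treated as irrelevant, and a column that exactly duplicates an earlier kept column is treated as redundant. The function name `drop_uninformative_attributes` and the toy records are hypothetical.

```python
def drop_uninformative_attributes(rows):
    """Return (kept_column_indices, reduced_rows).

    Assumed criteria (illustrative only):
      - a column whose values are all identical is treated as irrelevant;
      - a column identical to an earlier kept column is treated as redundant.
    """
    if not rows:
        return [], []

    columns = list(zip(*rows))  # column-wise view of the table
    kept = []
    seen = set()
    for index, column in enumerate(columns):
        if len(set(column)) <= 1:
            continue  # constant column: carries no information about the class
        if column in seen:
            continue  # exact duplicate of a column that was already kept
        seen.add(column)
        kept.append(index)

    reduced = [[row[i] for i in kept] for row in rows]
    return kept, reduced


# Hypothetical records: column 1 is constant, column 2 duplicates column 0.
rows = [
    [1.0, 7, 1.0, "tcp"],
    [2.0, 7, 2.0, "udp"],
    [3.0, 7, 3.0, "tcp"],
]
kept, reduced = drop_uninformative_attributes(rows)
print(kept)     # [0, 3]
print(reduced)  # [[1.0, 'tcp'], [2.0, 'udp'], [3.0, 'tcp']]
```

Real attribute subset selection usually relies on softer notions of redundancy (for example, correlation or mutual information) rather than exact equality; the exact-match test here only keeps the sketch short.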
The main aim of attribute subset selection is to find the minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. The literature has analyzed the feature (attribute) selection problem extensively, and current solutions include wrapper, filter, and embedded techniques. Wrappers use a search algorithm to explore the space of possible feature subsets and evaluate each subset by running a model on it. Wrappers can be computationally expensive and risk overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating a subset against a model, they evaluate a simpler filter measure. Generally, wrappers have offered the most suitable strategy for obtaining the desired output. However, the optimal wrapper depends on the traits of the problem and on the particular learner used for classification. A specific wrapper can produce substantial improvements for a particular algorithm on one problem, while the same wrapper fails on a different problem (or a different optimization goal), or on the same problem with a different learning procedure. That concern needs to be addressed.
LITERATURE REVIEW
Data stream classification and novelty detection have recently received significant attention in many practical real-world applications, such as intrusion detection, spam filtering, and climate change, where data distributions inherently change over time [Far12]. In data stream classification, the assumption that the total number of classes is fixed does not hold in a real streaming environment, where new concepts may evolve at any time. Most existing data stream classification techniques have ignored the arrival of a novel class [Mas10].
Novel class detection in stream data classification falls into two categories: the incremental approach, also called the single-model approach, and the ensemble approach. The ensemble technique combines several classifiers to obtain an improved composite model and uses voting to label unlabeled instances. In [Mas09], [11s], the authors give definitions of the existing class and the novel class. Recent work in the data stream mining domain describes a clustering approach that can detect both concept-drift and novel classes, but it assumes that there is only one “normal” class and that all other classes are novel. Thus, it may not work well if more than one class is to be considered “normal” or non-novel.
Masud et al. proposed MineClass [Mas09][Mas10], which stands for Mining novel Classes in data streams, using two base learners: k-nearest neighbor and decision tree. MineClass can detect novel classes in the presence of concept-drift, even when the model consists of multiple “existing” classes. ActMiner [Mas10] extends MineClass by applying an ensemble classification technique; in addition to the other three problems, it addresses the limited labeled data problem, thereby reducing the labeling cost. It also applies active learning, but its data selection process is different from others. An unsupervised novel concept detection technique for data streams has also been proposed, but it is not applicable to multi-class classification. As mentioned above, MineClass addresses the concept-evolution problem in a multi-class classification framework. However, MineClass does not address the limited labeled data problem and requires that all instances in the stream be labeled and available for training.
ECSMiner [Mas11][Pra12][Sha12] stands for enhanced classifier for data streams with novel class miner. Novel class detection using ECSMiner differs from the traditional class detection technique.
This approach offers a “multiclass” framework for the novelty detection problem that can distinguish between different classes of data and discover the emergence of a novel class. The technique is nonparametric, and therefore it is not restricted to any specific data distribution. Like MineClass, it is applied with two different base learners: decision tree and k-nearest neighbor. When a decision tree is used as the classifier, each training data chunk is used to build a decision tree. A k-NN strategy would lead to an inefficient classification model in terms of both memory and running time.
SCANR [Mas11] (Stream Classifier and Novel and Recurring class detector) is a technique for classification with novel and recurring classes. It addresses the recurring class, a special and more common case of concept-evolution in data streams, which occurs when a class reappears after a long absence from the stream. ECSMiner identifies recurring classes as novel classes.
In [Bis12][Far12], the authors propose a new decision tree learning approach for novel class detection. This approach calculates a threshold value based on the ratio of the percentage of data points between each leaf node in the tree and the training dataset, and also clusters the data points of the training dataset based on the similarity of attribute values.
The paper [Fan04] describes the ID3 (Iterative Dichotomiser) technique, a decision tree learning algorithm that builds the tree using information theory. The C5.0 algorithm improves tree-building performance using boosting, an approach to combining different classifiers. CART (classification and regression trees) generates a binary tree for decision making; it handles missing data and includes a pruning strategy.
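ID3's use of information theory, mentioned above, is usually understood as choosing the split attribute with the highest information gain, i.e. the largest reduction in class entropy. The following is a minimal sketch of that criterion only, not a full ID3 implementation; the attribute names `protocol` and `flag` and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the partitions produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical network records: one attribute separates the classes, one does not.
labels = ["normal", "normal", "attack", "attack"]
protocol = ["tcp", "tcp", "udp", "udp"]
flag = ["a", "b", "a", "b"]

print(information_gain(protocol, labels))  # 1.0
print(information_gain(flag, labels))      # 0.0
```

Under this criterion a tree builder would split on `protocol` first, since it removes all uncertainty about the class in this toy data, while `flag` removes none.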
The SPRINT (Scalable Parallelizable Induction of Decision Trees) algorithm uses an impurity function called the gini index to find the best split. In this approach, the authors introduce decision-tree-based novel class detection for concept-drifting data stream classification, which builds a decision tree from the data stream. The decision tree is continuously updated with new data points, so that the most recent tree represents the most recent concept in the data stream.
MCM [Mas13] stands for multiclass miner in data streams. It improves on previous novel class detection techniques by proposing an adaptive threshold for outlier detection, which is a vital part of novel class detection. In addition, it uses a probabilistic approach to novel class detection based on the discrete Gini coefficient. Moreover, it applies a graph-based approach to detect the appearance of more than one novel class simultaneously and to separate the instances of one novel class from the others.
DTNC [Mia13] stands for Detecting Novel Classes in Data Streams. It uses the base learner VFDTc to speed up the process and reduce the model size. VFDTc has two major extensions to VFDT: a split-test condition of the form attr_i ≤ cut_point for numeric attributes, and a naive Bayes classifier applied in the tree leaves to deal with concept drift.
C&CD [Div14] stands for Classification and novel Class Detection. It uses an outlier detection method to remove unwanted data present in the data streams, and it uses the nearest neighbor algorithm and the naive Bayes classifier for novel class detection.
However, none of the above-mentioned works deals with detecting a new class across multiple streams. All existing works have ignored this dimension. If many instances from different streams in a data set form a new class, that class will be misclassified by all of the above classifiers, because each of them focuses only on the classes on which it has been trained from a single data stream.
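The gini index that SPRINT uses to score splits, mentioned earlier in this review, is commonly defined as one minus the sum of squared class proportions, with a split scored by the size-weighted gini of its two partitions. The sketch below scans candidate thresholds on one numeric attribute and keeps the one with the lowest score. It is a simplified illustration under stated assumptions: midpoints between consecutive distinct values are used as candidates, the attribute name `duration` and the data are hypothetical, and none of SPRINT's scalable attribute-list machinery is reproduced.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_of_split(left, right):
    """Size-weighted gini index of a binary partition."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


def best_threshold(values, labels):
    """Return (threshold, score) for the split `value <= threshold` with the lowest gini."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = gini_of_split(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical connection durations with their class labels.
duration = [1, 2, 3, 10, 11, 12]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

print(gini(labels))                      # 0.5
print(best_threshold(duration, labels))  # (6.5, 0.0)
```

A score of 0.0 means both partitions are pure; a tree builder would repeat the same search for every attribute and keep the overall best split.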
(“Efficient Data Mining Classification Technique Essay”, n.d.)
Efficient Data Mining Classification Technique Essay. Retrieved from https://studentshare.org/information-technology/1656517-towards-accurate-and-efficient-data-mining-classification-technique-for-multi-streaming-data
