Efficiency of Clustering Algorithms in Mining Biological Databases - Research Paper Example

Summary
The paper "Efficiency of Clustering Algorithms in Mining Biological Databases" finds that hierarchical clustering algorithms are the most efficient for analyzing biological databases, such as those involving gene sequences, because they provide an analysis of the sequences…
Extract of sample "Efficiency of Clustering Algorithms in Mining Biological Databases"

Introduction

Clustering analysis is increasingly used to mine biological databases such as collections of gene and protein sequences. Clustering is a common data mining technique in which the items of a data set are assigned to clusters on the basis of their similarity. Clustering algorithms are usually categorized according to how they form their clusters. For example, hierarchical algorithms work by either splitting or merging groups to build a hierarchy of clusters based on the similarity of the sequences. Partitioning algorithms, on the other hand, divide the data set into groups based on the distance between items (Fayyad, 2003, 346).

The choice of clustering algorithm should be based primarily on the nature of the sequences or clusters to be analyzed, the acceptable error, and the availability of computational resources. Each category of clustering algorithm has its own strengths and limitations and is therefore suited to different tasks. Biological databases, such as those of protein or gene sequences, are well suited to clustering because it provides a detailed exploratory analysis of the sequences. This paper critically analyzes the efficiency of clustering algorithms in the mining of biological databases such as gene sequences.

Applications of clustering algorithms in analyzing gene sequences

In the statistical analysis of biological databases, the choice of clustering algorithm often depends on the nature of the data sets as well as the intended application of the results.
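
To make the difference between the hierarchical and partitioning families described in the introduction concrete, the following is a minimal sketch on a toy one-dimensional data set. It is an illustration only, not the method of any cited study: the function names, the single-linkage merge rule, the starting centers and the data values are all assumptions chosen for brevity.

```python
from itertools import combinations


def single_linkage(points, k):
    """Agglomerative (bottom-up) clustering: start with one cluster per
    point and repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Distance between two clusters = smallest gap between any two members.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return sorted(sorted(c) for c in clusters)


def k_means_1d(points, centers, iterations=10):
    """Partitioning: assign each point to its nearest center, then move
    each center to the mean of the points assigned to it."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        centers = [
            sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)
        ]
    return sorted(sorted(g) for g in groups)


# Hypothetical toy data: two well-separated groups of values.
data = [1.0, 1.2, 1.1, 5.0, 5.3, 5.1]
hierarchical = single_linkage(data, k=2)
partitioned = k_means_1d(data, centers=[0.0, 6.0])
```

On data this well separated both approaches recover the same two groups. They differ in how they get there: the hierarchical routine builds the grouping by successive merges, while the partitioning routine refines a fixed number of centers.
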
In biological data mining, the sequences most often analyzed with clustering algorithms are genomic and protein sequences. According to Werner (2008, 52), recent advances in bioinformatics have led to the increasing use of clustering algorithms in the analysis of both protein and gene sequences. In the study of gene expression, clustering is one of the major exploratory techniques used to analyze microarray slides containing hundreds of thousands of genes. Here clustering groups similar genes together, which helps biologists identify relationships between particular genes and reduces the amount of information that needs to be analyzed. Genes clustered together are usually co-regulated or share similar functions. Additionally, when time series clustering methods are used, genes that behave similarly at given times may be grouped together, indicating possible co-regulation.

Clustering algorithms can also be used to group gene samples on the basis of similar expression patterns. Although expression patterns usually involve complex phenotypes, clustering analysis is one of the most effective techniques for identifying arrays with similar or different phenotype characteristics. This application is particularly important in medical research, where it allows scientists to distinguish pathologies on the basis of gene expression patterns rather than the usual histological methods. In unsupervised cluster analysis of gene expression arrays, the main assumption is that genes involved in the same biological process should cluster together, whether the data are condition dependent or time series.

Another important area of genetic database mining in which clustering algorithms are widely used is the analysis of gene profiles.
In this regard, clustering is potentially important both for analyzing subclasses of diseases and for detecting genes that are co-expressed. When disease subclasses are analyzed, data sets are usually grouped on the basis of similar gene expression, and the resulting partitions may indicate previously unknown subclasses of the disease being investigated.

Case study demonstrating the efficiency of clustering algorithms

As the case study below demonstrates, the efficiency of a clustering algorithm in mining biological databases, for example in gene sequence analysis, depends largely on the nature of the variables used, the type of clustering algorithm applied, and the distance measure. To determine the efficiency of clustering algorithms, researchers conducted an experiment using three basic types of clustering algorithm: K-means, fuzzy c-means, and hierarchical clustering (Datta, 2003, 461). The random process model used in the experiment employed a simple Gaussian distribution for each cluster. A fixed mean vector template was applied to all the clusters, which made the simulation results comparable.

Results

The results of the experiment were categorized by the number of dimensions (D), the number of clusters (K), the total number of points (n), and the variance within each cluster. The parameters of the cluster model used in the experiment are shown below:

Number of dimensions (D): 2, 4, 8, 16, 32, 64, 128
Number of points (n): 50, 100, 200, 500, 1000, 2000, 5000
Number of clusters (K): 2, 4, 8, 16, 32
Variance within each cluster (compactness): σ² = 0.1, 0.25, 0.5, 1, 2.5, 5

To determine clustering efficiency from the generated results, a table was created that sets the efficiency rate against each cluster variance for all sets of values.
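
The simulation model described above (Gaussian clusters drawn around fixed mean vectors, then scored by counting points assigned to the wrong cluster) can be sketched roughly as follows. This is not the code used in the cited experiment. The mean template, the sample size, the random seed, and the use of a plain k-means routine are assumptions made for illustration, and all function names are hypothetical.

```python
import random
from itertools import permutations


def make_clusters(means, n_per_cluster, variance, rng):
    """Draw points from an isotropic Gaussian around each mean vector."""
    sd = variance ** 0.5
    points, labels = [], []
    for label, mean in enumerate(means):
        for _ in range(n_per_cluster):
            points.append([rng.gauss(m, sd) for m in mean])
            labels.append(label)
    return points, labels


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, rng, iterations=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    assign = [0] * len(points)
    for _ in range(iterations):
        assign = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def misclassified(true_labels, predicted, k):
    """Smallest number of disagreements over all relabelings of the clusters
    (cluster indices are arbitrary, so every mapping is tried; small k only)."""
    return min(
        sum(t != perm[p] for t, p in zip(true_labels, predicted))
        for perm in permutations(range(k))
    )


rng = random.Random(0)                 # fixed seed so the run is repeatable
means = [(0.0, 0.0), (10.0, 10.0)]     # hypothetical mean template: K = 2, D = 2
points, labels = make_clusters(means, n_per_cluster=100, variance=0.25, rng=rng)
errors = misclassified(labels, k_means(points, k=2, rng=rng), k=2)
```

Repeating such a run over the listed values of D, n, K and σ², and over several algorithms, would give one misclassification count per cell of a results table.
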
The table records the number of misclassification errors, which were then used to evaluate the efficiency of the clustering algorithms used in the experiment.

Variance    KM   FCM   SOM  HCEc  HCEs  HCEa   HCC    EM

Experiment 1: K = 2, D = 2, n = 2000
0.25         0     0    20  4220  4280  4280  4360     0
1.0          0    20     0  4260  4180  4160  4260   140
5.0          0    80     0  4140  4140  4160   540

Experiment 2: K = 16, D = 2, n = 2000
0.25        16   184     8  5072  5128  5496  5920   592
1.0         16   840     0  5816  5872  6216  6440   936
5.0         24   944     0  6512  7576  6936  6904  1216

Experiment 3: K = 2, D = 128, n = 2000
0.25       142   114   114  5828  5800  5828  8057   742
1.0        142   171   114  5828  5828  5828  8057  5457
5.0        142   171   142  5828  5828  5828  8085  4828

(The 5.0 row of Experiment 1 lists only seven values in the source; they are shown in the order given.)

Discussion

The results of the experiment demonstrate the efficiency of each of the clustering algorithms used. In general, most of the algorithms showed more misclassification errors as the number of clusters increased. The results also suggested that algorithms such as K-means and fuzzy c-means do not discriminate clusters effectively. Hierarchical clustering algorithms, however, are reported to perform best at low variances, usually σ² = 0.1 or less. This is most likely because clusters often do not overlap at low variance. By contrast, algorithms such as K-means and fuzzy c-means are not efficient at low variances, particularly in higher dimensions, when compared with the hierarchical algorithms.

With regard to the number of samples, most clustering algorithms tend to perform worse as the number of samples increases. K-means (KM) algorithms, however, are known to be efficient even with large samples (Hartigan, 2000, 107). As noted earlier, dimension is another important factor that significantly affects the efficiency of clustering algorithms. For example, in low dimensions, increasing the number of samples analyzed does not usually improve the efficiency of the clustering algorithms.
On the other hand, in higher dimensions, some of the algorithms can effectively cluster samples with many features.

Conclusion

In conclusion, each category of clustering algorithm has its own strengths and limitations and is therefore suited to different tasks. The case study revealed that hierarchical clustering algorithms are the most efficient for the analysis of biological databases such as those involving gene sequences. This is because they provide a detailed exploratory analysis of the sequences and have minimal error even when the number of samples analyzed is increased.

Bibliography

Achuthsankar, S. Nair. "Computational Biology & Bioinformatics: A Gentle Overview." Communications of Computer Society of India, 2 (2007): 234-248.
Agarwal, M. Alam. "Issues, Challenges and Tools of Clustering Algorithms." International Journal of Computer Science Issues, 8.2 (2011): 523-528.
Brenner, S. Chothia. "Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships." Proc Natl Acad Sci, 95 (1998): 6073-6078.
Claverie, J. Mark. "Information enhancement methods for large scale sequence analysis." Comput. Chem., 17 (1993): 191-201.
Cheeseman, Peter and John Stutz. Bayesian Classification (AutoClass): Theory and Results. New York: Macmillan publishers, 1996.
Datta, S. Daniel. "Comparisons and validation of statistical clustering techniques for microarray gene expression data." Bioinformatics, 19.4 (2003): 459-466.
Fayyad, G. Smyth. Advances in Knowledge Discovery and Data Mining. New York: AAAI Press, 2003.
Fraley, C. Raftery. "How many clusters? Which clustering method? Answers via model-based Cluster Analysis." Computer Journal, 41 (1998): 578-588.
Hartigan, M. Wong. "A k-means clustering algorithm." Journal of the Royal Statistical Society, 28.1 (2000): 100-108.
Hirschberg, Donald S. "Algorithms for the longest common subsequence problem." Journal of the ACM, 24 (1997): 664-675.
Hinneburg, A., Keim. "An Efficient Approach to Clustering in Large Multimedia." Databases with Noise, 2 (1998): 67-88.
Huang, Z. "Extensions to the k-means algorithm for clustering large data sets with categorical values." Data Mining and Knowledge Discovery, 2 (1998): 283-304.
Jain, A., Dubes, B. Algorithms for Clustering Data. New Jersey: Prentice-Hall, 1988.
Karypis, G., Hanand and Kumar, Vion. "Chameleon: A hierarchical clustering algorithm using dynamic modelling." IEEE Computer, 32.8 (1999): 68-75.
Kerr, A. Churchill. "Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments." Proc. Natl. Acad. Sci. USA, 98.16 (2001): 8961-8965.
Kriegel, H., Sander. "A Density-Based Algorithm for Discovering Clusters in Large Spatial." Databases with Noise, 3 (1996): 456-510.
Waterman, Michael S. Introduction to Computational Biology: Sequences, Maps and Genomes. New York: CRC Press, 1995.
Reinherd, Kneser and Hermann, L. Ney. "Improved Clustering Techniques for Class-Based Statistical Language Modelling." European Conference on Speech Communication and Technology, 3 (1993): 973-976.
Werner, Thomas. "Clustering algorithms and Bioinformatics applications for pathway analysis of microarray data." Current Opinion in Biotechnology, 19.1 (2008): 50-54.
Yeung, D. Haynor, and Ruzzo, L. William. "Validating clustering for gene expression data." Bioinformatics, 17.4 (2001): 309-318.
Cite this document
(“Efficiency of Clustering Algorithms in Mining Biological databases Research Paper”, n.d.)
Efficiency of Clustering Algorithms in Mining Biological databases Research Paper. Retrieved from https://studentshare.org/information-technology/1453276-efficiency-of-clustering-algorithms-for-mining
