Efficiency of Clustering Algorithms in Mining Biological Databases

Introduction

Clustering analysis is increasingly used in the mining of biological databases such as gene and protein sequences. Clustering is a common data mining technique in which the data sets being examined are assigned to clusters on the basis of their similarities. Clustering algorithms are usually categorized according to how they form their clusters. For example, hierarchical algorithms work by either splitting or merging the groups being analyzed in order to develop a hierarchy of clusters based on the similarity of the sequences. Partitioning algorithms, on the other hand, divide the data sets being analyzed based on the distance between them (Fayyad, 2003, 346).

The selection of a clustering algorithm should, however, be based primarily on the nature of the sequences or clusters to be analyzed, the acceptable error, and the availability of computational resources. This is because each category of clustering algorithm has its own strengths and limitations and is therefore suited to different tasks. Biological databases such as those involving protein or gene sequences are best analyzed using clustering algorithms because these provide a detailed exploratory analysis of the sequences. This paper critically analyzes the efficiency of clustering algorithms in the mining of biological databases such as gene sequences.

Applications of clustering algorithms in analyzing gene sequences

During the statistical analysis of biological databases, the choice of clustering algorithm often depends on the nature of the data sets as well as the intended application of the results.
In biological data mining, the sequences most often analyzed using clustering algorithms are genomic and protein sequences. According to Werner (2008, 52), recent advances in bioinformatics have resulted in the increasing use of clustering algorithms in the analysis of both protein and gene sequences. In the study of gene expression, clustering is one of the major exploratory techniques used in the analysis of microarray slides containing hundreds of thousands of genes. In such cases, clustering is employed to group similar genes together, which enables biologists to identify the relationships between particular genes and reduces the amount of information that needs to be analyzed. Genes clustered together are usually co-regulated or share similar functions. Additionally, when time series clustering methods are used, genes that exhibit similar characteristics at given times may be grouped together to indicate possible co-regulation.

Clustering algorithms can also be used efficiently to analyze gene samples on the basis of similar expression patterns. Although expression patterns usually involve complex phenotypes, clustering analysis is one of the most effective techniques for identifying arrays with similar or different phenotype characteristics. This application is particularly important in medical research, where it allows scientists to identify different pathologies on the basis of gene expression patterns rather than the common histological methods. In unsupervised cluster analysis of gene expression arrays, the major assumption is usually that genes belonging to the same biological process should be clustered together, whether the data are condition dependent or time series.

Another important area in the mining of genetic databases in which clustering algorithms are widely used is the analysis of gene profiles.
In this regard, clustering algorithms are potentially important in the analysis of subclasses of diseases as well as in the detection of genes that are co-expressed. During the analysis of disease subclasses, data sets are usually grouped together on the basis of similar gene expression, and the resulting partitions may indicate previously unknown subclasses of the disease being investigated.

Case study demonstrating the efficiency of clustering algorithms

As demonstrated by the case study below, the efficiency of a clustering algorithm in the mining of biological databases, for example in gene sequence analysis, largely depends on the nature of the variables used, the type of clustering algorithm applied, and the distance measure. To determine the efficiency of clustering algorithms, researchers conducted an experiment using three basic types of clustering algorithm: K-means, fuzzy c-means, and hierarchical clustering (Datta, 2003, 461). The random process model used in the experiment employed a simple Gaussian distribution for each of the clusters being examined. A fixed mean vector template was applied to all the clusters, which enabled the results of the simulations to be compared.

Results

The results of the experiment were categorized by the number of dimensions (D), the number of clusters (K), the total number of points (n), and the variance within each cluster. The parameters of the cluster model used in the experiment are shown below:

Number of dimensions (D): 2, 4, 8, 16, 32, 64, 128
Number of points (n): 50, 100, 200, 500, 1000, 2000, 5000
Number of clusters (K): 2, 4, 8, 16, 32
Variance within each cluster (compactness): σ² = 0.1, 0.25, 0.5, 1, 2.5, 5

To determine the clustering efficiency from the generated results, a table was created that sets the efficiency rate against each cluster variance for all the sets of values.
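The simulation setup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the cited experiment: the cluster centres (cluster j sits at 10·j on every axis), the function names, and the plain Lloyd-style K-means loop are all hypothetical choices made for the example. It draws points from spherical Gaussian clusters with a chosen variance, clusters them with K-means, and counts how many points end up in a cluster other than the one that generated them.

```python
import itertools
import random


def simulate_clusters(k, dim, n, variance, rng):
    """Draw n points from k spherical Gaussian clusters.

    Assumption: cluster j is centred at 10 * j on every axis. This mean
    template is illustrative only, not the one from the cited experiment.
    """
    sigma = variance ** 0.5
    points, labels = [], []
    for i in range(n):
        label = i % k  # spread the points evenly over the clusters
        centre = 10.0 * label
        points.append([rng.gauss(centre, sigma) for _ in range(dim)])
        labels.append(label)
    return points, labels


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, rng, max_iterations=100):
    """Plain Lloyd-style K-means; returns one cluster index per point."""
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the algorithm has converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def misclassified(true_labels, assignment, k):
    """Smallest error count over all mappings of cluster index to true label.

    Tries every permutation, so it is only practical for small k.
    """
    best = len(true_labels)
    for perm in itertools.permutations(range(k)):
        errors = sum(1 for t, a in zip(true_labels, assignment) if perm[a] != t)
        best = min(best, errors)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    for variance in (0.25, 1.0, 5.0):
        points, labels = simulate_clusters(k=2, dim=2, n=2000, variance=variance, rng=rng)
        assignment = k_means(points, k=2, rng=rng)
        print(variance, misclassified(labels, assignment, k=2))
```

Because cluster indices returned by K-means are arbitrary, the error count is taken over the best matching between cluster indices and true labels. The absolute numbers depend entirely on the assumed centres and will not match the figures reported for the experiment.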
The table revealed a number of misclassification errors, which were then used to evaluate the efficiency of the clustering algorithms used in the experiment.

Variance  KM   FCM  SOM  HCEc  HCEs  HCEa  HCC   EM
Experiment 1: K = 2, D = 2, n = 2000
0.25      0    0    20   4220  4280  4280  4360  0
1.0       0    20   0    4260  4180  4160  4260  140
5.0       0    80   0    4140  4140  4160  540
Experiment 2: K = 16, D = 2, n = 2000
0.25      16   184  8    5072  5128  5496  5920  592
1.0       16   840  0    5816  5872  6216  6440  936
5.0       24   944  0    6512  7576  6936  6904  1216
Experiment 3: K = 2, D = 128, n = 2000
0.25      142  114  114  5828  5800  5828  8057  742
1.0       142  171  114  5828  5828  5828  8057  5457
5.0       142  171  142  5828  5828  5828  8085  4828

Discussion

The results of the experiment demonstrate the efficiency of each of the clustering algorithms used. In general, most of the clustering algorithms showed more misclassification errors as the number of clusters increased. The results also suggested that algorithms such as K-means and fuzzy c-means do not discriminate clusters effectively. Hierarchical clustering algorithms, however, are indicated as performing best with low variances, usually σ² = 0.1 or less. This is most likely because clusters with low variance often do not overlap. It is worth noting that algorithms such as K-means and fuzzy c-means are, by contrast, not efficient at low variances, particularly in larger dimensions, when compared with the hierarchical algorithms.

With regard to the number of samples, most of the clustering algorithms tend to perform poorly as the number of samples increases. K-means (KM) algorithms are, however, known to be efficient even with large samples (Hartigan, 2000, 107). As noted earlier, dimension is another important factor that significantly affects the efficiency of the various types of clustering algorithms. For example, in low dimensions, an increase in the number of samples being analyzed does not usually improve the efficiency of the clustering algorithms.
On the other hand, in higher dimensions, some of the algorithms can effectively cluster samples with many features.

Conclusion

In conclusion, each category of clustering algorithm has its own strengths and limitations and is therefore suited to different tasks. The case study revealed that hierarchical clustering algorithms are the most efficient for the analysis of biological databases such as those involving gene sequences. This is because they provide a detailed exploratory analysis of the sequences and have minimal error even when the number of samples being analyzed is increased.

Bibliography

Achuthsankar, S. Nair. "Computational Biology & Bioinformatics: A Gentle Overview." Communications of Computer Society of India, 2 (2007): 234-248.
Agarwal, M. Alam. "Issues, Challenges and Tools of Clustering Algorithms." International Journal of Computer Science Issues, 8.2 (2011): 523-528.
Brenner, S. Chothia. "Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships." Proc Natl Acad Sci, 95 (1998): 6073-6078.
Claverie, J. Mark. "Information enhancement methods for large scale sequence analysis." Comput. Chem., 17 (1993): 191-201.
Cheeseman, Peter and John Stutz. Bayesian Classification (AutoClass): Theory and Results. New York: Macmillan publishers, 1996.
Datta S. Daniel. "Comparisons and validation of statistical clustering techniques for microarray gene expression data." Bioinformatics, 19.4 (2003): 459-466.
Fayyad, G. Smyth. Advances in Knowledge Discovery and Data Mining. New York: AAAI Press, 2003.
Fraley, C. Raftery. "How many clusters? Which clustering method? Answers via model-based cluster analysis." Computer Journal, 41 (1998): 578-588.
Hartigan M. Wong. "A k-means clustering algorithm." Journal of the Royal Statistical Society, 28.1 (2000): 100-108.
Hirschberg, Donald S. "Algorithms for the longest common subsequence problem." Journal of the ACM, 24 (1997): 664-675.
Hinneburg A., Keim. "An Efficient Approach to Clustering in Large Multimedia Databases with Noise." 2 (1998): 67-88.
Huang, Z. "Extensions to the k-means algorithm for clustering large data sets with categorical values." Data Mining and Knowledge Discovery, 2 (1998): 283-304.
Jain, A., Dubes, B. Algorithms for Clustering Data. New Jersey: Prentice-Hall, 1988.
Karypis, G., Hanand and Kumar, Vion. "Chameleon: A hierarchical clustering algorithm using dynamic modelling." IEEE Computer, 32.8 (1999): 68-75.
Kerr, A. Churchill. "Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments." Proc. Natl. Acad. Sci. USA, 98.16 (2001): 8961-8965.
Kriegel H., Sander. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." 3 (1996): 456-510.
Michael S. Waterman. Introduction to Computational Biology: Sequences, Maps and Genomes. New York: CRC Press, 1995.
Reinherd, Kneser and Hermann, L. Ney. "Improved Clustering Techniques for Class-Based Statistical Language Modelling." European Conference on Speech Communication and Technology, 3 (1993): 973-976.
Werner, Thomas. "Clustering algorithms and bioinformatics applications for pathway analysis of microarray data." Current Opinion in Biotechnology, 19.1 (2008): 50-54.
Yeung, D. Haynor, and Ruzzo, L. William. "Validating clustering for gene expression data." Bioinformatics, 17.4 (2001): 309-318.
