Cluster Analysis Technique and Profiles of the Clusters

Cluster analysis is a technique used to classify cases into groups, or clusters, that are homogeneous. The cases in each cluster are similar to each other and different from cases in other clusters. No prior information about group membership is required. The technique is used mainly for segmentation: here, respondents can be classified into clusters on the basis of the importance they give to the various facilities provided. Clustering methods are relatively simple procedures based on algorithms rather than on statistical reasoning. Several statistics associated with cluster analysis are used for analyzing the data.
Clustering can be hierarchical or non-hierarchical, and each type is further divided into various methods. Hierarchical clustering develops a tree-like structure and can be either agglomerative or divisive. In agglomerative clustering, each object starts out as a separate cluster; clusters are then grouped into bigger and bigger clusters until all the cases are members of a single cluster.
Agglomerative methods include linkage methods, error sum of squares or variance methods, and centroid methods. Linkage methods include single linkage, complete linkage and average linkage. Single linkage is based on the minimum distance between two clusters, complete linkage on the maximum distance, and average linkage on the average distance between all pairs of objects in which one member of the pair comes from each cluster. Variance methods seek to minimize the within-cluster variance. Ward's procedure is a variance method in which the squared Euclidean distance to the cluster means is minimized. In the centroid method, the distance between two clusters is computed as the distance between their centroids. Average linkage and Ward's method are generally considered to perform better than the other procedures.
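The linkage and centroid definitions above can be illustrated with a small sketch. The two clusters and their points are hypothetical values chosen for easy arithmetic, not data from the study, and the function names are illustrative only.

```python
from itertools import product
from math import dist

def single_linkage(a, b):
    # Minimum distance over all pairs with one member from each cluster.
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    # Maximum distance over all pairs with one member from each cluster.
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    # Average distance over all pairs with one member from each cluster.
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    # Mean of each variable across the cases in the cluster.
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_distance(a, b):
    # Distance between the two cluster centroids.
    return dist(centroid(a), centroid(b))

# Hypothetical clusters (two variables per case).
cluster_a = [(1, 1), (2, 1)]
cluster_b = [(4, 1), (6, 1)]

print("single  :", single_linkage(cluster_a, cluster_b))
print("complete:", complete_linkage(cluster_a, cluster_b))
print("average :", average_linkage(cluster_a, cluster_b))
print("centroid:", centroid_distance(cluster_a, cluster_b))
```

The four measures can disagree on which pair of clusters is closest, which is one reason the choice of method can change the resulting clusters.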
Now we shall discuss the statistics associated with cluster analysis. The agglomeration schedule gives information on the cases or clusters being combined at each stage of a hierarchical clustering. The cluster centroid is the mean value of the variables for all the cases in a cluster. A dendrogram is a tree-like graph that displays the results of the cluster analysis. Vertical lines represent clusters that are joined together, and the position of a line on the scale indicates the distance at which the clusters were joined. The dendrogram is generally read from left to right. The distances between cluster centers indicate how separated the individual pairs of clusters are; clusters that are widely separated are distinct and therefore desirable. An icicle diagram is another graph that displays the clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns represent the cases being clustered and the rows correspond to the number of clusters. This diagram is read from bottom to top.
In this case, the Chestnut Ridge club data are clustered on respondents' attitudes toward joining a club, expressed on a scale of 1 to 5. The objective is to group similar cases, which requires a measure of how similar or different the cases are. The approach is to measure similarity in terms of the distance between pairs of objects. There are different ways to measure distance, and the results obtained with different measures can be compared. Here agglomerative hierarchical clustering is selected, with Ward's procedure as the clustering method and squared Euclidean distance as the distance measure. The choice of clustering method and the choice of distance measure are generally related. Here the variables are all measured on the same five-point scale. Ward's method is a variance method: rather than averaging the distances between all pairs of objects, as average linkage does, it minimizes the squared Euclidean distance to the cluster means.
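As a rough illustration of Ward's procedure, the sketch below merges, at each step, the pair of clusters whose union gives the smallest increase in the error sum of squares (the squared Euclidean distance of cases to their cluster mean). It is a naive version for small data, not the routine used by SPSS, and the six ratings are hypothetical five-point-scale responses rather than respondents from the study.

```python
from itertools import combinations

def centroid(cluster):
    # Mean of each variable across the cases in the cluster.
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def ess(cluster):
    # Error sum of squares: squared Euclidean distance of each case to the cluster mean.
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(case, c)) for case in cluster)

def ward_cost(a, b):
    # Increase in total within-cluster sum of squares if a and b are merged.
    return ess(a + b) - ess(a) - ess(b)

def ward_step(clusters):
    # Merge the pair of clusters with the smallest Ward cost.
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

def ward_clustering(cases, n_clusters):
    # Start with every case as its own cluster and merge until n_clusters remain.
    clusters = [[case] for case in cases]
    while len(clusters) > n_clusters:
        clusters = ward_step(clusters)
    return clusters

# Hypothetical ratings: (golf, tennis, pool, dining, social events), each on a 1-5 scale.
ratings = [
    (5, 1, 1, 4, 3), (5, 1, 2, 4, 3),
    (1, 4, 4, 2, 3), (1, 5, 4, 2, 3),
    (1, 1, 1, 5, 3), (2, 1, 1, 5, 3),
]

for cluster in ward_clustering(ratings, 3):
    print(cluster)
```

Recomputing every pairwise cost at each step is slow for 252 respondents but keeps the link to the variance criterion easy to see.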
The first important output is the agglomeration schedule, which shows the cases or clusters combined at each stage. The first column gives the stage; with 252 respondents there are 251 stages. The next two columns, under Cluster Combined, show the respondents or clusters combined at that stage. The column labeled Coefficients gives the criterion value, based on squared Euclidean distance, at which the two are combined. The columns under Stage Cluster First Appears indicate the stage at which each of the two clusters was first formed; a 0 means the entry is a single respondent that has not been clustered before. The last column, Next Stage, indicates the stage at which another case or cluster is combined with this one. For example, the first line of the last column is 34, and at stage 34 respondent 53 is combined with the cluster formed at stage 1 (respondents 170 and 249).
Agglomeration Schedule
(C1, C2 = Cluster 1 and Cluster 2 under "Cluster Combined"; F1, F2 = Cluster 1 and Cluster 2 under "Stage Cluster First Appears")
Stage  C1    C2    Coefficients  F1    F2    Next Stage
1      170   249   .000          0     0     34
2      233   248   .000          0     0     70
3      92    245   .000          0     0     156
4      195   244   .000          0     0     145
5      217   240   .000          0     0     13
6      124   237   .000          0     0     178
7      223   232   .000          0     0     12
8      59    230   .000          0     0     66
9      226   227   .000          0     0     10
10     90    226   .000          0     9     31
11     121   225   .000          0     0     51
12     52    223   .000          0     7     193
13     193   217   .000          0     5     122
14     211   216   .000          0     0     17
15     156   213   .000          0     0     42
16     202   212   .000          0     0     22
17     70    211   .000          0     14    20
18     204   210   .000          0     0     20
19     158   208   .000          0     0     40
20     70    204   .000          17    18    32
21     155   203   .000          0     0     43
22     49    202   .000          0     16    64
23     154   198   .000          0     0     195
24     113   197   .000          0     0     151
25     178   196   .000          0     0     31
26     177   191   .000          0     0     32
27     187   189   .000          0     0     72
28     176   188   .000          0     0     111
29     75    184   .000          0     0     112
30     22    179   .000          0     0     119
31     90    178   .000          10    25    48
32     70    177   .000          20    26    200
33     67    173   .000          0     0     64
34     53    170   .000          0     1     125
35     86    167   .000          0     0     60
36     140   165   .000          0     0     46
37     66    163   .000          0     0     113
38     129   161   .000          0     0     48
39     27    160   .000          0     0     67
40     13    158   .000          0     19    213
41     143   157   .000          0     0     114
42     104   156   .000          0     15    160
43     3     155   .000          0     21    147
44     137   153   .000          0     0     142
45     80    144   .000          0     0     168
46     60    140   .000          0     36    53
47     126   131   .000          0     0     49
48     90    129   .000          31    38    184
49     11    126   .000          0     47    160
50     35    125   .000          0     0     142
51     25    121   .000          0     11    148
52     16    119   .000          0     0     157
53     60    112   .000          46    0     183
54     72    111   .000          0     0     109
55     105   107   .000          0     0     56
56     102   105   .000          0     55    185
57     77    100   .000          0     0     116
58     38    99    .000          0     0     148
59     56    97    .000          0     0     115
60     37    86    .000          0     35    124
61     33    81    .000          0     0     117
62     46    78    .000          0     0     118
63     43    71    .000          0     0     147
64     49    67    .000          22    33    203
65     45    63    .000          0     0     110
66     31    59    .000          0     8     218
67     21    27    .000          0     39    123
68     48    252   .000          0     0     145
69     47    138   .010          0     0     146
70     83    233   .025          0     2     167
71     42    139   .118          0     0     162
72     159   187   .271          0     27    74
73     68    201   .455          0     0     108
74     62    159   .851          0     72    156
75     247   250   1.351         0     0     127
76     222   243   1.851         0     0     129
77     234   241   2.351         0     0     128
78     12    239   2.851         0     0     158
79     4     236   3.351         0     0     135
80     26    235   3.851         0     0     155
81     228   229   4.351         0     0     186
82     39    224   4.851         0     0     151
83     185   220   5.351         0     0     150
84     30    215   5.851         0     0     133
85     136   207   6.351         0     0     131
86     93    194   6.851         0     0     161
87     7     192   7.351         0     0     187
88     95    190   7.851         0     0     130
89     51    180   8.351         0     0     180
90     172   174   8.851         0     0     169
91     15    150   9.351         0     0     163
92     116   149   9.851         0     0     159
93     76    142   10.351        0     0     132
94     101   141   10.851        0     0     182
95     55    135   11.351        0     0     173
96     88    132   11.851        0     0     146
97     57    130   12.351        0     0     134
98     122   123   12.851        0     0     152
99     108   117   13.351        0     0     178
100    23    110   13.851        0     0     185
101    14    69    14.351        0     0     153
102    24    61    14.851        0     0     162
103    40    50    15.351        0     0     136
104    36    41    15.851        0     0     182
105    17    28    16.351        0     0     164
106    115   221   16.861        0     0     161
107    9     181   17.393        0     0     137
108    68    209   17.982        73    0     201
109    72    186   18.648        54    0     126
110    45    183   19.315        65    0     172
111    171   176   19.982        0     28    163
112    75    169   20.648        29    0     194
113    66    166   21.315        37    0     164
114    84    143   21.982        0     41    189
115    56    109   22.648        59    0     198
116    77    98    23.315        57    0     168
117    33    64    23.982        61    0     150
118    34    46    24.648        0     62    179
119    20    22    25.315        0     30    167
120    199   214   25.999        0     0     212
121    65    182   26.684        0     0     176
122    29    193   27.434        0     13    175
123    21    162   28.184        67    0     192
124    37    58    28.934        60    0     172
125    44    53    29.684        0     34    155
126    72    79    30.517        109   0     189
127    94    247   31.351        0     75    188
128    175   234   32.184        0     77    181
129    152   222   33.017        0     76    157
130    95    148   33.851        88    0     170
131    73    136   34.684        0     85    154
132    5     76    35.517        0     93    220
133    30    74    36.351        84    0     184
134    32    57    37.184        0     97    199
135    4     54    38.017        79    0     205
136    19    40    38.851        0     103   153
137    9     85    39.695        107   0     214
138    238   242   40.695        0     0     196
139    151   205   41.695        0     0     223
140    10    168   42.695        0     0     217
141    6     164   43.695        0     0     166
142    35    137   44.695        50    44    183
143    114   128   45.695        0     0     219
144    18    91    46.695        0     0     165
145    48    195   47.695        68    4     170
146    47    88    48.810        69    96    210
147    3     43    50.010        43    63    173
148    25    38    51.210        51    58    188
149    8     120   52.442        0     0     177
150    33    185   53.675        117   83    202
151    39    113   54.925        82    24    191
152    122   206   56.262        98    0     169
153    14    19    57.628        101   136   241
154    73    246   59.045        131   0     224
155    26    44    60.462        80    125   207
156    62    92    61.911        74    3     237
157    16    152   63.378        52    129   193
158    12    218   64.878        78    0     202
159    116   200   66.378        92    0     179
160    11    104   67.878        49    42    216
161    93    115   69.383        86    106   210
162    24    42    70.894        102   71    174
163    15    171   72.527        91    111   180
164    17    66    74.161        105   113   226
165    18    231   75.827        144   0     204
166    6     146   77.494        141   0     214
167    20    83    79.211        119   70    192
168    77    80    80.945        116   45    201
169    122   172   82.812        152   90    208
170    48    95    84.770        145   130   222
171    103   147   86.770        0     0     197
172    37    45    88.782        124   110   215
173    3     55    90.796        147   95    207
174    24    82    92.841        162   0     195
175    29    145   94.891        122   0     191
176    65    118   96.972        121   0     212
177    8     251   99.103        149   0     221
178    108   124   101.353       99    6     211
179    34    116   103.686       118   159   211
180    15    51    106.100       163   89    209
181    134   175   108.551       0     128   217
182    36    101   111.051       104   94    205
183    35    60    113.551       142   53    194
184    30    90    116.117       133   48    200
185    23    102   118.817       100   56    196
186    133   228   121.651       0     81    218
187    7     89    124.484       87    0     219
188    25    94    127.451       148   127   229
189    72    84    130.427       126   114   221
190    106   219   133.427       0     0     206
191    29    39    136.432       175   151   216
192    20    21    139.460       167   123   203
193    16    52    142.535       157   12    215
194    35    75    145.641       183   112   230
195    24    154   148.755       174   23    232
196    23    238   151.983       185   138   199
197    1     103   155.316       0     171   225
198    56    87    158.650       115   0     230
199    23    32    162.088       196   134   228
200    30    70    165.835       184   32    222
201    68    77    169.848       108   168   208
202    12    33    173.948       158   150   227
203    20    49    178.190       192   64    227
204    18    96    182.484       165   0     220
205    4     36    186.793       135   182   236
206    106   127   191.127       190   0     223
207    3     26    195.515       173   155   229
208    68    122   200.042       201   169   231
209    2     15    204.703       0     180   213
210    47    93    209.578       146   161   224
211    34    108   214.528       179   178   239
212    65    199   219.744       176   120   234
213    2     13    225.369       209   40    242
214    6     9     231.376       166   137   238
215    16    37    237.539       193   172   237
216    11    29    243.817       160   191   233
217    10    134   250.237       140   181   232
218    31    133   257.071       66    186   228
219    7     114   263.937       187   143   225
220    5     18    270.941       132   204   234
221    8     72    278.169       177   189   226
222    30    48    285.789       200   170   233
223    106   151   293.856       206   139   235
224    47    73    302.624       210   154   231
225    1     7     312.216       197   219   241
226    8     17    322.059       221   164   238
227    12    20    332.477       202   203   240
228    23    31    344.173       199   218   244
229    3     25    355.999       207   188   240
230    35    56    368.326       194   198   248
231    47    68    381.322       224   208   236
232    10    24    395.019       217   195   235
233    11    30    410.662       216   222   243
234    5     65    427.813       220   212   239
235    10    106   445.789       232   223   247
236    4     47    464.709       205   231   244
237    16    62    485.968       215   156   243
238    6     8     507.566       214   226   246
239    5     34    536.341       234   211   245
240    3     12    565.951       229   227   242
241    1     14    599.550       225   153   245
242    2     3     637.879       213   240   246
243    11    16    679.049       233   237   248
244    4     23    727.121       236   228   249
245    1     5     789.104       241   239   249
246    2     6     875.159       242   238   247
247    2     10    1012.059      246   235   250
248    11    35    1200.521      243   230   250
249    1     4     1390.909      245   244   251
250    2     11    1637.982      247   248   251
251    1     2     2226.893      249   250   0
The next important output is the icicle plot, in which the columns correspond to the cases being clustered. This figure is read from bottom to top. Another output is the dendrogram, which is useful in deciding on the number of clusters. In hierarchical clustering, the criterion used is the distance at which clusters are combined.
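One way to read the schedule when deciding on the number of clusters is to look at how much the coefficient increases at each of the final stages, since a large increase suggests that two fairly dissimilar clusters were merged. The sketch below does this for the last ten stages, using coefficients copied from the agglomeration schedule above; the function names are illustrative, and the output is a descriptive aid rather than a rule that fixes the number of clusters.

```python
# (stage, coefficient) for the final stages, copied from the agglomeration schedule.
N_CASES = 252
last_stages = [
    (242, 637.879), (243, 679.049), (244, 727.121), (245, 789.104),
    (246, 875.159), (247, 1012.059), (248, 1200.521), (249, 1390.909),
    (250, 1637.982), (251, 2226.893),
]

def clusters_after(stage, n_cases=N_CASES):
    # Each stage merges two clusters into one, so one cluster disappears per stage.
    return n_cases - stage

def coefficient_jumps(stages):
    # Increase in the coefficient relative to the previous stage.
    jumps = []
    for (_, prev_coef), (stage, coef) in zip(stages, stages[1:]):
        jumps.append((stage, clusters_after(stage), round(coef - prev_coef, 3)))
    return jumps

jumps = coefficient_jumps(last_stages)
for stage, n_clusters, jump in jumps:
    print(f"stage {stage}: {n_clusters} cluster(s) remain, coefficient rose by {jump}")
```

In this reading, stage 244 is the point at which eight clusters remain.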
[Figure: Dendrogram using Ward Method. Horizontal axis: Rescaled Distance Cluster Combine (0 to 25); rows: cases by label and number.]
Hence, when hierarchical clustering is applied to this data, the agglomeration schedule, icicle plot, dendrogram and pseudo t-squared statistics indicate that the respondents can be grouped into 8 clusters, whose profiles are as follows.
Profiles of the Clusters
Cluster 1: People belonging to this cluster place a high importance on tennis and pool facilities. They do not give importance to golf and dining facilities. They do not care about the social events being organized.
Cluster 2: People belonging to this cluster place a high importance on golf and dining facilities. They do not give importance to tennis and pool facilities. They do not care about the social events being organized.
Cluster 3: People belonging to this cluster place a very high importance on dining facilities. They do not give any importance at all to golf, tennis and pool facilities. They do not care about the social events being organized.
Cluster 4: People belonging to this cluster place a high importance on tennis and golf facilities. They do not give importance to pool facilities, social events or dining facilities.
Cluster 5: People belonging to this cluster place a very high importance on dining and pool facilities and a high importance on tennis. They do not give importance to golf facilities. They do not care about the social events being organized.
Cluster 6: People belonging to this cluster place a high importance on dining and golf facilities. They do not give importance to tennis facilities. They do not care about the pool facilities or the social events being organized.
Cluster 7: People belonging to this cluster place a very high importance on golf facilities. They do not give importance to tennis, pool or dining facilities, or to the social events being organized.
Cluster 8: People belonging to this cluster place a high importance on golf, tennis, dining and pool facilities. They do not care about the social events being organized.
All five variables significantly differentiate between the clusters, so the importance of these facilities cannot be overemphasized. There is an even distribution of cases across the eight clusters. Cluster 1 and Cluster 2 are the most clearly differentiated clusters.
Several procedures are available for assessing the reliability and validity of a cluster solution:
1) Cluster analysis can be performed on the same data using different distance measures, and the results compared to determine the stability of the solutions.
2) The data can be split into two halves, cluster analysis performed on each half, and the cluster centroids for the two subsamples compared.
3) Variables can be deleted at random, cluster analysis performed on the reduced set of variables, and the results compared with those obtained from the entire set.
Additional Research
Once these clusters are formed, one can run k-means clustering to find the importance given to each of the factors in each cluster, as shown in the table of final cluster centers.
Final Cluster Centers
Cluster                          1  2  3  4  5  6  7  8
Importance of challenge of golf  1  5  1  5  1  5  5  5
Importance of tennis facilities  4  1  1  4  4  2  1  4
Importance of pool facilities    4  1  1  2  5  3  2  4
Importance of dining facilities  2  4  5  2  5  4  3  5
Importance of social events      3  3  3  2  3  3  2  3
The distances between the final cluster centers indicate that the pairs of clusters are well separated.
Distances between Final Cluster Centers
Cluster  1      2      3      4      5      6      7      8
1               5.936  4.920  4.336  2.756  4.671  5.201  3.932
2        5.936         3.722  3.502  5.733  2.125  2.562  4.257
3        4.920  3.722         5.047  4.427  4.045  4.329  5.426
4        4.336  3.502  5.047         5.495  3.497  2.723  3.646
5        2.756  5.733  4.427  5.495         4.200  5.879  3.396
6        4.671  2.125  4.045  3.497  4.200         2.971  2.551
7        5.201  2.562  4.329  2.723  5.879  2.971         4.512
8        3.932  4.257  5.426  3.646  3.396  2.551  4.512
From this table it is clear that clusters 1 and 2 are well separated (5.936, the largest distance in the table).
The F test for each variable is presented below. These F tests should be used only for descriptive purposes, because the clusters have been chosen to maximize the differences among cases in different clusters.
ANOVA
                                 Cluster          Error
                                 Mean Square  df  Mean Square  df   F        Sig.
Importance of challenge of golf  64.866       7   .368         244  176.265  .000
Importance of tennis facilities  53.481       7   .475         244  112.480  .000
Importance of pool facilities    57.653       7   .552         244  104.381  .000
Importance of dining facilities  29.567       7   .516         244  57.252   .000
Importance of social events      15.233       7   .880         244  17.311   .000
Since the F values obtained here are highly significant, all five variables differentiate between the clusters. The importance of these variables cannot be overemphasized.
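Each F value in the ANOVA table is the cluster (between-groups) mean square divided by the error (within-groups) mean square. The sketch below recomputes the ratios from the mean squares printed in the table; because those mean squares are rounded to three decimals, the recomputed values agree with the reported F values only approximately.

```python
# (variable, cluster mean square, error mean square, F as reported), from the ANOVA table.
anova_rows = [
    ("Importance of challenge of golf", 64.866, 0.368, 176.265),
    ("Importance of tennis facilities", 53.481, 0.475, 112.480),
    ("Importance of pool facilities", 57.653, 0.552, 104.381),
    ("Importance of dining facilities", 29.567, 0.516, 57.252),
    ("Importance of social events", 15.233, 0.880, 17.311),
]

def f_ratio(ms_cluster, ms_error):
    # F statistic: between-cluster mean square over within-cluster (error) mean square.
    return ms_cluster / ms_error

recomputed = {name: f_ratio(msc, mse) for name, msc, mse, _ in anova_rows}

for name, _, _, reported in anova_rows:
    print(f"{name:<33} recomputed F = {recomputed[name]:8.3f}   reported F = {reported:8.3f}")
```

The ordering of the ratios matches the table: the golf variable separates the clusters most strongly and social events least.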
There is an even distribution of cases across the eight clusters. Further, the researcher can use cluster membership to profile the clusters on other variables such as age, income, marital status and gender. This can be done by grouping the respondents in each cluster and finding the average age and income, and by classifying the respondents in each cluster according to marital status and gender. In this way these other variables can be analyzed against the eight clusters formed from the five chosen variables. The analysis can also be carried out using SAS Enterprise Guide. The results obtained above can be strengthened using the pseudo t-squared statistic and the pseudo F statistic, which help determine the number of clusters; when this analysis was done, the same number of clusters was obtained.
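The profiling step described above amounts to grouping respondents by cluster membership and summarizing the other variables within each group. A minimal sketch follows; the respondent records, field names and values are entirely hypothetical and stand in for the survey's demographic variables, which are not shown in this essay.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical respondents with an assigned cluster and demographic variables.
respondents = [
    {"cluster": 1, "age": 34, "income": 52000, "marital_status": "married"},
    {"cluster": 1, "age": 40, "income": 60000, "marital_status": "single"},
    {"cluster": 2, "age": 58, "income": 91000, "marital_status": "married"},
    {"cluster": 2, "age": 62, "income": 99000, "marital_status": "married"},
]

def profile_clusters(rows):
    # Group respondents by cluster, then summarize each group.
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)
    profiles = {}
    for cluster, members in sorted(by_cluster.items()):
        profiles[cluster] = {
            "n": len(members),
            "mean_age": mean(m["age"] for m in members),
            "mean_income": mean(m["income"] for m in members),
            "marital_status": Counter(m["marital_status"] for m in members),
        }
    return profiles

profiles = profile_clusters(respondents)
for cluster, summary in profiles.items():
    print(cluster, summary)
```

Averages suit variables such as age and income, while counts suit categorical variables such as marital status and gender.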
