6. Difference between K Means and Hierarchical clustering

- Hierarchical clustering can’t handle big data well, but K Means clustering can. This is because the time complexity of K Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
- In K Means clustering, since we start with a random choice of cluster centers, the results produced by running the algorithm multiple times might differ. In hierarchical clustering, the results are reproducible.
- K Means is found to work well when the shape of the clusters is hyper-spherical (like a circle in 2D or a sphere in 3D).
- K Means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, you can instead stop at whatever number of clusters you find appropriate by interpreting the dendrogram, as the sketch below shows.
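To make the contrast concrete, here is a minimal R sketch of both approaches on the same features, using the built-in iris data purely for illustration: kmeans() needs K and a random start up front, while hclust() builds the dendrogram once, after which you can cut it at any K.

```r
# K Means vs hierarchical clustering on the built-in iris data
features <- iris[, 1:4]

# K Means: K must be chosen in advance; results depend on the random start
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 25)
table(km$cluster)

# Hierarchical: build the dendrogram once, then cut it at any K you like
hc <- hclust(dist(features), method = "ward.D2")
plot(hc)                 # inspect the dendrogram to pick a sensible K
table(cutree(hc, k = 3))
```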
7. Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:
- Recommendation engines
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection
8. Improving Supervised Learning Algorithms with Clustering

Clustering is an unsupervised machine learning approach, but can it also be used to improve the accuracy of supervised machine learning algorithms, by clustering the data points into similar groups and using the cluster labels as independent variables in the supervised model? Let’s find out.
Let’s check the impact of clustering on model performance for a classification problem in R: 3,000 observations of stock data with 100 predictors, where the task is to predict whether a stock will go up or down. The dataset contains 100 independent variables, X1 to X100, representing the profile of a stock, and one outcome variable Y with two levels: 1 for a rise in the stock price and -1 for a drop.
The dataset is available for download here.
Let’s first try applying random forest without clustering.
```r
# loading required libraries
library('randomForest')
library('Metrics')

# set random seed
set.seed(101)

# loading dataset
data <- read.csv("train.csv", stringsAsFactors = T)

# checking dimensions of data
dim(data)
## [1] 3000  101

# specifying outcome variable as factor
data$Y <- as.factor(data$Y)

# dividing the dataset into train and test
train <- data[1:2000, ]
test  <- data[2001:3000, ]

# applying randomForest
model_rf <- randomForest(Y ~ ., data = train)
preds <- predict(object = model_rf, test[, -101])
table(preds)
## preds
##  -1   1
## 453 547

# checking accuracy
auc(preds, test$Y)
## [1] 0.4522703
```
So the score we get, the AUC computed by Metrics::auc, is about 0.45. Now let’s create five clusters from the independent variables using K Means clustering and reapply random forest.
```r
# combining test and train
all <- rbind(train, test)

# creating 5 clusters using K Means clustering
Cluster <- kmeans(all[, -101], 5)

# adding the clusters as an independent variable to the dataset
all$cluster <- as.factor(Cluster$cluster)

# dividing the dataset into train and test
train <- all[1:2000, ]
test  <- all[2001:3000, ]

# applying randomForest
model_rf <- randomForest(Y ~ ., data = train)
preds2 <- predict(object = model_rf, test[, -101])
table(preds2)
## preds2
##  -1   1
## 548 452

auc(preds2, test$Y)
## [1] 0.5345908
```
Whoo! In the above example, even though the final score is still poor, clustering has given our model a significant boost, from about 0.45 to slightly above 0.53.
This shows that clustering can indeed be helpful for supervised machine learning tasks.
End Notes

In this article, we discussed the various ways of performing clustering, which finds applications for unsupervised learning in a large number of domains. You also saw how you can improve the accuracy of a supervised machine learning algorithm by using clustering.
Although clustering is easy to implement, you need to take care of some important aspects, like treating outliers in your data and making sure each cluster has a sufficient population; a minimal sketch of both checks follows. These aspects of clustering are dealt with in great detail in this article.
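Here is a minimal R sketch of those two checks, again using the built-in iris data purely as a stand-in: clip extreme outliers before clustering, then verify that no cluster is left with too few points.

```r
# Handle outliers first, then check cluster populations
set.seed(7)
features <- scale(iris[, 1:4])            # scale first: outliers dominate raw distances
keep <- apply(abs(features) < 3, 1, all)  # drop rows with any |z-score| >= 3
km <- kmeans(features[keep, ], centers = 3, nstart = 25)
table(km$cluster)                         # every cluster should have a healthy count
```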
Did you enjoy reading this article? Do share your views in the comment section below.