What Is The Agglomerative Clustering Algorithm
The agglomerative clustering algorithm is an unsupervised machine learning algorithm used to group data that is not labelled. In agglomerative clustering, each data point initially forms its own cluster, and at every iteration the two closest clusters merge to form a new cluster. This process repeats until all data points have been merged into a single cluster or the desired number of clusters is reached.
Agglomerative clustering follows a bottom-up approach to merging clusters. The algorithm generates a hierarchy of data points based on their nearest neighbours, which is represented as a dendrogram.
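The merge loop described above can be sketched in a few lines. This is a minimal illustration only, not the implementation used by scipy or sklearn: the function name naive_agglomerative, the sample points, and the choice of measuring closeness by the nearest pair of members are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    # Start with every point in its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Assumed closeness measure: distance between nearest members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, then drop b (b > a, so a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Made-up 2-D points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0]])
result = naive_agglomerative(points, 2)
print(result)
```

The loop rescans every pair of clusters at each step, so it is only suitable for tiny inputs. Library implementations use much faster algorithms.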
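Agglomerative clustering works on the distances between data points. To calculate the distance between two points, use the Euclidean distance formula: the square root of the sum of the squared differences between the coordinates of the two points.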
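As a small worked example, the Euclidean distance between two points p and q is sqrt((p1 - q1)^2 + ... + (pn - qn)^2). The helper name euclidean_distance and the sample points below are made up for illustration.

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

# The differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.
d = euclidean_distance([1, 2], [4, 6])
print(d)  # 5.0
```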
Methods Of Calculating Distance Between Two Clusters
Several methods are available for calculating the distance between clusters. Some of them are explained below.
- Single Nearest Distance: The single nearest distance method calculates the distance between two clusters based on the nearest points of the two clusters. This is also called Single Linkage or the Nearest Point Algorithm.
- Complete Farthest Distance: The complete farthest distance method calculates the distance between two clusters based on the farthest points of the two clusters. This is also called Complete Linkage, the Farthest Point Algorithm, or the Voor Hees Algorithm.
- Average Distance: The average distance method calculates the distance as the average of the distances between all pairs of points, one from each cluster. This is also called average linkage or the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm.
- Centroid Distance: The centroid distance method calculates the distance between the centroids of the two clusters. This is also called centroid linkage or UPGMC (Unweighted Pair Group Method with Centroid).
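To make the four definitions concrete, the sketch below computes each distance by hand for two tiny clusters. The cluster coordinates are made up, and the code only illustrates the definitions. It is not how scipy computes linkages internally.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters of 2-D points.
cluster_a = np.array([[0.0, 0.0], [0.0, 2.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between members of the two clusters.
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()     # nearest pair of points
complete = pairwise.max()   # farthest pair of points
average = pairwise.mean()   # mean over all pairs
# Distance between the mean points (centroids) of the two clusters.
centroid = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(single, complete, average, centroid)
```

For any pair of clusters, the single distance is never larger than the average distance, which in turn is never larger than the complete distance.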
Packages To Install To Implement The Agglomerative Clustering Algorithm
pip install numpy
pip install pandas
pip install matplotlib
pip install scikit-learn
pip install scipy
Here we implement an example of car classification with agglomerative clustering, using the sklearn and scipy packages.
Data Set On Which The Clustering Algorithm Is Applied
Import Required Packages
import numpy as np
# used to read data from the CSV file
import pandas as pd
# used to plot graphs
import matplotlib.pyplot as plt
# used to pick colors from a colormap
import matplotlib.cm as cm
# used to generate the distance matrix
from scipy.spatial import distance_matrix
# used to perform agglomerative clustering
from scipy.cluster.hierarchy import linkage
# used to draw dendrograms
from scipy.cluster.hierarchy import dendrogram
# used to cut the dendrogram into N clusters
from scipy.cluster.hierarchy import fcluster
# used in preprocessing
from sklearn.preprocessing import MinMaxScaler
# used to perform clustering
from sklearn.cluster import AgglomerativeClustering
Read Data From CSV File And Data Pre-processing
To read data from a CSV file, use the pandas package, whose read_csv method reads the file and returns a DataFrame object. The pre-processing step removes null values from the data set and scales all values to the range 0 to 1. It also generates a distance matrix for the data set.
data = pd.read_csv('cars_clus.csv')
# convert every column to numeric; values that cannot be parsed become NaN
data = data.apply(pd.to_numeric, errors='coerce')
# drop rows that contain null values
data = data.dropna()
# reset the index after dropping rows
data = data.reset_index(drop=True)
# keep a copy of the cleaned data set for use with the different methods below
dataf = data.copy()
data.head()
# scale all values into the (0, 1) range
data = MinMaxScaler().fit_transform(data)
# generate the distance matrix
dm = distance_matrix(data, data)
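The effect of each pre-processing step is easier to see on a tiny table. The sketch below uses a made-up DataFrame as a stand-in for the CSV file. The column names engine_s and horsepow come from the example, while the values and the '$null$' placeholder for a missing entry are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the car data; '$null$' mimics a non-numeric entry.
toy = pd.DataFrame({
    'engine_s': ['1.8', '3.2', '$null$', '2.5'],
    'horsepow': ['140', '225', '210', '150'],
})

# Non-numeric entries become NaN, then rows containing NaN are dropped.
toy = toy.apply(pd.to_numeric, errors='coerce').dropna().reset_index(drop=True)

# Each column is rescaled so its minimum maps to 0 and its maximum to 1.
scaled = MinMaxScaler().fit_transform(toy)

# Entry (i, j) is the Euclidean distance between scaled rows i and j.
dm_toy = distance_matrix(scaled, scaled)

print(toy.shape, scaled.min(), scaled.max(), dm_toy.shape)
```

Scaling matters here because horsepower values are roughly a hundred times larger than engine sizes, so without it the distances would be dominated by horsepower.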
Perform Agglomerative Clustering Using Scipy
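# perform agglomerative clustering with complete linkage
# note: linkage() reads a 2-D array as observation vectors, so each row of dm is treated as one observation here;
# to cluster on the distances themselves, pass the scaled data or a condensed matrix (scipy.spatial.distance.squareform(dm))
model = linkage(dm, 'complete')
# cut the hierarchy into at most 5 flat clusters
clusters = fcluster(model, 5, criterion='maxclust')
clusters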
array([1, 3, 3, 3, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 4, 1, 3, 3, 3, 3, 2, 1, 5, 3, 3, 3, 3, 3, 3, 3, 1, 3], dtype=int32)
fig = plt.figure(figsize=(20, 30))
# this function labels each data point (leaf) in the dendrogram
def leaf_label(i):
    return '[%s|%s|%s]' % (dataf['engine_s'][i], dataf['horsepow'][i], dataf['wheelbas'][i])
# plot the dendrogram
dendog = dendrogram(model, leaf_label_func=leaf_label, leaf_rotation=90, leaf_font_size=18, orientation='top')
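A point worth checking when using scipy's linkage function is the form of its input. It accepts either the raw observations or a condensed (one-dimensional) distance matrix, and a square two-dimensional matrix is interpreted as observations. The sketch below, on made-up random data, shows one way to convert a square matrix with squareform and checks that this gives the same hierarchy as passing the observations directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
points = rng.random((12, 3))  # made-up observations

# Square matrix of pairwise Euclidean distances, shape (12, 12).
square = distance_matrix(points, points)
# Condensed form keeps only the upper triangle: 12 * 11 / 2 = 66 values.
condensed = squareform(square, checks=False)

from_condensed = linkage(condensed, 'complete')
from_points = linkage(points, 'complete')

# Both calls describe the same sequence of merges.
print(condensed.shape, np.allclose(from_condensed, from_points))
```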
Perform Agglomerative Clustering Using Sklearn
# create an AgglomerativeClustering object
modelc = AgglomerativeClustering(n_clusters=5, linkage='complete')
# fit the model to the data
modelc.fit(data)
# print the label of each data point
modelc.labels_
# append the labels to the data set
dataf['cluster'] = modelc.labels_
dataf.head()
array([1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 1, 0, 0, 0, 4, 1, 2, 0, 1, 1, 0, 1, 0, 0, 1, 1], dtype=int64)
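Label numbers from scipy and sklearn are not directly comparable: fcluster numbers clusters from 1, sklearn numbers them from 0, and the same group may receive a different number in each. One way to compare two labelings regardless of numbering is the adjusted Rand index from sklearn.metrics. The sketch below uses made-up, well-separated points rather than the car data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
# Made-up data: three well-separated blobs of 10 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])

# Complete linkage with scipy, cut into 3 flat clusters (labels start at 1).
scipy_labels = fcluster(linkage(points, 'complete'), 3, criterion='maxclust')
# Complete linkage with sklearn (labels start at 0).
sklearn_labels = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(points)

# A score of 1.0 means both labelings describe the same grouping.
score = adjusted_rand_score(scipy_labels, sklearn_labels)
print(score)
```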
Plot Graph Of Clusters
Use a scatter plot to show the clusters on a graph, with a different color for each cluster.
# identify the number of clusters
no_clusters = max(modelc.labels_) + 1
# choose one color per cluster from a matplotlib colormap
colors = cm.rainbow(np.linspace(0, 1, no_clusters))
# create the list of cluster labels
c_label = list(range(0, no_clusters))
for color, label in zip(colors, c_label):
    subset = dataf[dataf.cluster == label]
    plt.scatter(subset.engine_s, subset.horsepow, color=color, label="cluster" + str(label))
plt.legend()
plt.xlabel("Engine size")
plt.ylabel("Horse power")