Hierarchical Agglomerative Clustering Algorithm Solution In Python

What Is The Agglomerative Clustering Algorithm

Agglomerative clustering is an unsupervised machine learning algorithm used to group unlabelled data. It starts by treating each data point as its own cluster. At every iteration, the two closest clusters are merged to form a new cluster. This process repeats until all data points belong to a single cluster or the desired number of clusters is reached.

Agglomerative clustering follows a bottom-up approach to merging clusters. The algorithm generates a hierarchy of data points based on which points are nearest to each other, and this hierarchy is usually represented as a dendrogram.
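The merge loop described above can be sketched in plain Python. This is only an illustration, not how scipy or sklearn implement it: the function names `single_link` and `agglomerate` are made up for this example, and the nearest-point distance between clusters is an assumed choice.

```python
import math

def single_link(a, b):
    """Distance between two clusters: the closest pair of points (assumed choice)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, n_clusters):
    """Naive bottom-up clustering: merge the two closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # start with every point in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        # find the pair of clusters with the smallest distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # merge cluster j into cluster i and drop cluster j
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, 3)
print(result)
```

This version rescans every pair of clusters on each iteration, so it is only practical for very small inputs. The library functions used later in this article are much faster.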

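Agglomerative clustering works on the distances between data points. To calculate the distance between two points, use the Euclidean distance formula: the square root of the sum of the squared differences between corresponding coordinates.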
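As a small illustration of the Euclidean distance, here is a sketch in plain Python. The helper name `euclidean_distance` is chosen for this example only.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(d)  # 5.0
```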

Methods Of Calculating Distance Between Two Clusters

Many methods are available for calculating the distance between clusters. Some of them are explained below.

  • Single Nearest Distance: The single nearest distance method calculates the distance between two clusters based on their two nearest points. This is also called Single Linkage or the Nearest Point Algorithm.
  • Complete Farthest Distance: The complete farthest distance method calculates the distance between two clusters based on their two farthest points. This is also called Complete Linkage, the Farthest Point Algorithm, or the Voor Hees Algorithm.
  • Average Distance: The average distance method calculates the distance as the average of the distances over all pairs of points, one from each cluster. This is also called average linkage or the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm.
  • Centroid Distance: The centroid distance method calculates the distance between the centroid points of the two clusters. This is also called centroid linkage or UPGMC (Unweighted Pair Group Method using Centroids).
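To make the four methods concrete, the sketch below computes each of them for two tiny clusters in plain Python. The function names and the sample points are made up for illustration. The library implementations update distances incrementally instead of recomputing them like this.

```python
import math
from itertools import product

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def cluster_distance(a, b, method):
    """Distance between clusters a and b under the given linkage method."""
    # distances over all pairs with one point from each cluster
    pair_distances = [math.dist(p, q) for p, q in product(a, b)]
    if method == "single":
        return min(pair_distances)
    if method == "complete":
        return max(pair_distances)
    if method == "average":
        return sum(pair_distances) / len(pair_distances)
    if method == "centroid":
        return math.dist(centroid(a), centroid(b))
    raise ValueError("unknown method: " + method)

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for method in ("single", "complete", "average", "centroid"):
    print(method, cluster_distance(cluster_a, cluster_b, method))
```

For any pair of clusters the single distance is never larger than the average distance, and the average distance is never larger than the complete distance.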

To Implement The Agglomerative Clustering Algorithm, Install The Required Packages

pip install numpy
pip install pandas
pip install matplotlib
pip install scikit-learn
pip install scipy

Here we implement an example that groups cars using agglomerative clustering with the sklearn and scipy packages.

Data Set On Which The Clustering Algorithm Is Applied

Import Required Packages


import numpy as np
#this package is used to read data from the csv file
import pandas as pd
#this is used to plot graphs
import matplotlib.pyplot as plt
#this is used to pick colors from a colormap
import matplotlib.cm as cm
#this function is used to generate the distance matrix
from scipy.spatial import distance_matrix
#this function is used to perform agglomerative clustering
from scipy.cluster.hierarchy import linkage
#this function is used to generate dendrograms
from scipy.cluster.hierarchy import dendrogram
#this is used to generate N clusters from the dendrogram
from scipy.cluster.hierarchy import fcluster
#this is used in preprocessing
from sklearn.preprocessing import MinMaxScaler
#this is used to perform clustering
from sklearn.cluster import AgglomerativeClustering

Read Data From CSV File And Data Pre-processing

To read data from a csv file, use the pandas read_csv function, which reads the file and returns a data frame object. The pre-processing step removes rows with null values from the data set and scales all values to the range 0 to 1. It also generates a distance matrix for the data set.


data=pd.read_csv('cars_clus.csv')
#convert all columns to numeric values; entries that cannot be converted become NaN
data=data.apply(pd.to_numeric,errors='coerce')
#drop rows that contain null values
data=data.dropna()
#reset the index after dropping the null rows
data=data.reset_index(drop=True)
#keep a copy of the cleaned data frame for labelling and plotting later
dataf=data.copy()
data.head()
#scale all the values to the (0,1) range
data=MinMaxScaler().fit_transform(data)
#generate the distance matrix
dm=distance_matrix(data,data)

Perform Agglomerative Clustering Using Scipy

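SciPy provides a linkage function that implements the agglomerative clustering algorithm. It takes either the observations or a condensed (one-dimensional) distance matrix, together with the linkage method, and performs the clustering. A square distance matrix must first be converted with squareform, otherwise SciPy treats each of its rows as an observation. The method can be single, complete, average, or centroid. The fcluster function cuts the resulting hierarchy into N clusters and returns an array of labels, one for each data point.


#this converts the square distance matrix into the condensed form that linkage expects
from scipy.spatial.distance import squareform
#this performs agglomerative clustering
model=linkage(squareform(dm,checks=False),'complete')
#this cuts the hierarchy into at most 5 clusters
clusters=fcluster(model,5,criterion='maxclust')
clusters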
array([1, 3, 3, 3, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 4, 1, 3, 3, 3,
       3, 2, 1, 5, 3, 3, 3, 3, 3, 3, 3, 1, 3], dtype=int32)
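SciPy's linkage accepts either the raw observations or a condensed distance matrix. The sketch below checks on synthetic data that both inputs produce the same hierarchy. The random points, the seed, and the variable names are assumptions made for this example and are not part of the car data set.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform

# two well-separated synthetic groups of 2-D points (illustrative data only)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),
])

# square n x n matrix of pairwise Euclidean distances
square = distance_matrix(points, points)
# condensed form: the upper triangle flattened into a 1-D array
condensed = squareform(square, checks=False)

from_points = linkage(points, method="complete")
from_condensed = linkage(condensed, method="complete")
print(np.allclose(from_points, from_condensed))

# cut the hierarchy into 2 clusters; fcluster labels start at 1
labels = fcluster(from_condensed, t=2, criterion="maxclust")
print(labels)
```

Passing the square matrix directly would make linkage treat each row of distances as a point, which generally yields a different hierarchy.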

Plot Dendrogram


fig=plt.figure(figsize=(20,30))
#this function labels each leaf of the dendrogram with engine size, horsepower and wheelbase
def leaf_label(i):
    return '[%s|%s|%s]' % (dataf['engine_s'][i],dataf['horsepow'][i],dataf['wheelbas'][i])
#this plots the dendrogram
dendog=dendrogram(model,leaf_label_func=leaf_label,leaf_rotation=90,leaf_font_size=18,orientation='top')

Perform Agglomerative Clustering Using Sklearn

The sklearn package has an AgglomerativeClustering class that implements the agglomerative clustering algorithm. It takes the number of clusters and the linkage method as parameters. If you don't specify these parameters, the number of clusters defaults to 2 and the linkage defaults to ward.


#create an object of the agglomerative clustering algorithm
modelc=AgglomerativeClustering(n_clusters=5,linkage='complete')
#fit the model to the data
modelc.fit(data)
#print the labels of the data points
modelc.labels_
#append the labels to the data set
dataf['cluster']=modelc.labels_
dataf.head()
array([1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 1, 0, 0,
       0, 4, 1, 2, 0, 1, 1, 0, 1, 0, 0, 1, 1], dtype=int64)
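The scipy labels start at 1 and the sklearn labels start at 0, so the two arrays cannot be compared element by element. One way to check whether the two libraries found the same grouping is the adjusted Rand index, which ignores how the clusters are numbered. The sketch below uses synthetic points and made-up variable names, not the car data set.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# three well-separated synthetic groups of 2-D points (illustrative data only)
rng = np.random.default_rng(42)
centers = [(0.0, 0.0), (4.0, 4.0), (8.0, 0.0)]
points = np.vstack([rng.normal(loc=c, scale=0.2, size=(10, 2)) for c in centers])

# scipy: build the hierarchy, then cut it into 3 clusters (labels start at 1)
scipy_labels = fcluster(linkage(points, method="complete"), t=3, criterion="maxclust")

# sklearn: fit and return labels in one call (labels start at 0)
sklearn_labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(points)

# 1.0 means the two partitions are identical up to renaming of the clusters
score = adjusted_rand_score(scipy_labels, sklearn_labels)
print(score)
```

On real data with ties or overlapping groups the two libraries may not agree exactly, so a score below 1.0 does not necessarily point to a bug.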

Plot Graph Of Cluster

Use a scatter plot to draw the clusters, and use a colormap to generate a different color for each cluster.


#identify the number of clusters
no_clusters=max(modelc.labels_)+1

#choose one color per cluster from matplotlib's rainbow colormap
colors=cm.rainbow(np.linspace(0,1,no_clusters))

#create the list of cluster labels
c_label=list(range(0,no_clusters))


for color , label in zip(colors,c_label):
    subset=dataf[dataf.cluster == label]
    #pass the single RGBA color with color= so it is not read as per-point values
    plt.scatter(subset.engine_s,subset.horsepow,color=color,label="cluster"+str(label))

plt.legend()
plt.xlabel("Engine size")
plt.ylabel("Horse power")

Zala Digvijaysinh

MCA student at Dharamsinh Desai University

