Overview:
-
The linkage() function of the cluster.hierarchy module performs agglomerative clustering and creates hierarchical clusters.
-
Agglomerative clustering starts with individual nodes of an n-dimensional data. Each individual node of the data is considered a cluster. The algorithm works on a bottom-up approach and creates a hierarchy.
-
Based on a criterion or an objective function two nodes are merged together to form a cluster. A newly formed cluster is further merged into an individual node or a cluster.
-
The clustering process is continued by merging smaller clusters until the whole data is merged as a single cluster.
The linkage() function of SciPy:
-
The linkage() function accepts data points or distances between them (a.k.a a distance matrix) (scipy.spatial.distancematrix()) and returns a linkage matrix. The linkage matrix represents how the data is clustered and how they are connected as a hierarchy.
-
The hierarchy of clusters can be visualized by feeding the linkage matrix to graph routines like the dendrogram() function of hierarchy module to get a dendrogram.
-
Many methods are supported by the linkage() function in determining which two clusters are to be merged which include single, complete, average, weighted, centroid, median and Ward”s method.
The linkage matrix:
-
The linkage matrix describes how the hierarchical cluster is formed.
-
It is a two dimensional array containing four columns. For each linkage between the nodes or clusters, there exists a row in the linkage matrix.
-
Each row in the matrix describes a linkage between any of the following
-
a node and another node
-
a cluster and node
-
a cluster and another cluster.
-
-
The first entry of the row is the index of the data-point which is connected to another data-point pointed to by the second entry.
-
The third entry of the row is the distance between them.
-
The fourth entry is the number of elements present in the resultant cluster. The number of elements in the cluster will be two when just two individual data-points are merged.
-
The number of elements in the cluster will be greater than two, when the merging involves linking with another cluster.
-
When the index values(i.e., the first and the second entry of a row) in the linkage matrix exceeds the maximum index of the data-points, it means the index is referring to a cluster formed earlier during the agglomerative clustering, rather than an individual node. The index of such clusters being merged is given by actualIndex = index – len(data_points), which can be traced through the earlier rows of the linkage matrix.
Example:
-
The Python example takes data-points corresponding to six birds consisting of their body-weight and brain-weight. In real-world applications hierarchical clustering is used when the number of data points is large. The example uses a small dataset so that the clustering algorithm can be understood as two nodes are combined into a cluster which is further combined into a node or cluster to form another cluster.
-
The example Python code performs agglomerative hierarchical clustering using the scipy linkage() function from the scipy.clusters.hierarchy module.
-
The hierarchical cluster is visualized using the scipy function dendrogram() from the scipy.clusters.hierarchy module.
# Example Python program that performs # The names of the birds # The data points # Perform agglomerative clustering using the Ward"s method # Draw a denrdogram of the hierarchical cluster |
Output:
The linkage matrix: |