Free cookie consent management tool by TermsFeed SciPy hierarchical clustering using UPGMA method | Pythontic.com

SciPy hierarchical clustering using UPGMA method

Overview:

  • Agglomerative hierarchical clustering using minimum distance between the clusters has several variants including single linkage clustering, complete linkage clustering and the UPGMA – the Unweighted Pair Group method with Arithmetic Mean method.

  • The UPGMA method calculates the inter-cluster distance between a non-singleton cluster and another cluster by taking average distance between the cluster members. The other cluster can be a singleton cluster or non-singleton cluster.

Example:

  • The Python example code below performs hierarchical clustering using the UPGMA through the scipy linkage() function.

  • The code and the data-points are same as given in the example for the complete-linkage clustering. The invocation of the SciPy linkage() function has to be replaced as given below to perform hierarchical clustering using UPGMA method.

cluster_linkage = hc.linkage(brain_weights_birds, method="average")

  • The linkage() function accepts the actual data-points or the condensed pairwise distances between the data points as a parameter. It also accepts the method of hierarchical clustering through the “method” parameter.

  • The UPGMA method is selected by passing the string value “average” as the argument to the parameter “method” of the scipy linkage() function.

  • The linkage() function returns the cluster linkages as the result - which is called the linkage matrix. It is of the form that describes which two indexes of the data-points are connected to form a cluster followed by the inter-cluster distance and the number of elements in a cluster. This information is presented as a row containing four columns. The first two columns are the indexes that are linked to form a cluster. The third column is the inter-cluster distance. The fourth column is the number of elements in a cluster. There will be as many rows as the number of linkages required to link the whole set of data-points into one single cluster.

  • When an index value from any of the first two columns of a linkage matrix is greater than the actual index the value refers to an existing cluster formed earlier and is represented by one of the preceding rows of the linkage matrix. The index of the row is given by Linkage Index = Number of data points – index value.

  • How the algorithm works schematically is explained using distance matrices from iteration 1 to iteration 6.

Output:

[[0.         1.         0.05040794 2.        ]
 [2.         6.         0.1752649  3.        ]
 [3.         4.         0.20024984 2.        ]
 [5.         7.         0.58619103 4.        ]
 [8.         9.         1.33786883 6.        ]]

Iteration 1:

 

Parrot

Crow

Sparrow

Seagull

Snowy Owl

Duck

Parrot

0

0.05040794

0.20039966

1.30000105

1.50002324

0.50262991

Crow

0.05040794

0

0.15013015

1.35000836

1.55007018

0.55302356

Sparrow

0.20039966

0.15013015

0

1.50004033

1.7001297

0.70291963

Seagull

1.30000105

1.35000836

1.50004033

0

0.20024984

0.8017537

Snowy Owl

1.50002324

1.55007018

1.7001297

0.20024984

0

1.00092407

Duck

0.50262991

0.55302356

0.70291963

0.8017537

1.00092407

0

Inter-cluster distance = min(0.05040794, 0.20039966, 1.30000105, 1.50002324, 0.50262991, 0.15013015, 1.35000836, 1.55007018, 0.55302356, 1.50004033, 1.7001297, 0.70291963, 0.20024984, 0.8017537, 1.00092407) = 0.05040794


Indexes zero and one are linked creating the cluster containing items parrot and crow.

Number of items in the resultant cluster is two. The inter-cluster distance is 0.05040794.

Iteration 2:

The cluster containing the birds parrot and crow is created during the first iteration.

To calculate the distance between this cluster of (Parrot, Crow) and other singleton clusters pairwise distances are added together and the mean value is taken. The distances (Parrot, Sparrow) and (Crow, Sparrow) are added together and divided by two to get the mean distance between the cluster(Parrot, Crow) and Sparrow. In the same way distances between the cluster (Parrot, Sparrow) and Seagull, cluster (Parrot, Sparrow) and Snowy Owl and cluster (Parrot, Sparrow) and Duck are calculated using the UPGMA method.

After finding all the pairwise distances in the distance matrix the minimum value is found. The pair that corresponds to the minimum distance is linked together as a cluster.

The iteration results in the linkage of indexes corresponding to Parrot, Crow and Sparrow. The value of the first column in the second row of the linkage matrix returned by the linkage() function is 2. The value of the second column is 6. To get the actual linkage index, it is subtracted from the total number of data points which gives the value of 0. This value 0 refers to the first row of the linkage matrix.

Thus at the end of the second iteration sparrow is linked with another cluster containing parrot and crow.

 

(Parrot, Crow)

Sparrow

Seagull

Snowy Owl

Duck

(Parrot, Crow)

0

(0.20039966 + 0.15013015)/2=

0.175264905

(1.30000105+1.35000836)/2=

1.325004705

(1.50002324+

1.55007018)/2=1.52504671

(0.50262991+

0.55302356)/2

=0.527826735

Sparrow

(0.20039966 + 0.15013015)/2

=0.175264905

0

1.35000836

1.55007018

0.55302356

Seagull

(1.30000105+1.35000836)/2=

1.325004705

1.35000836

0

0.20024984

0.8017537

Snowy Owl

(1.50002324+

1.55007018)/2=

1.52504671

1.55007018

0.20024984

0

1.00092407

Duck

(0.50262991+

0.55302356)/2=

0.527826735

0.55302356

0.8017537

1.00092407

0

Inter-cluster distance = Min(0.175264905, 1.325004705, 1.52504671, 0.527826735,1.35000836, 1.55007018, 0.55302356, 0.20024984, 0.8017537, 1.00092407) = 0.175264905. The number of elements in the cluster is three.

Iteration 3:

During the third iteration Seagull and Snowy Owl are linked together to form another cluster. The number of items in the cluster is two. The inter-cluster distance is 0.20024984.

 

(Parrot, Crow, Sparrow)

Seagull

Snowy Owl

Duck

(Parrot, Crow, Sparrow)

0

=(1.30000105+ 1.35000836,

1.50004033)/3

=1.383349913

=(1.50002324+

1.55007018+

1.7001297)/3 = 1.583407706666667

=(0.50262991+

0.55302356+

0.70291963)/3=0.586191033333333

Seagull

1.383349913

0

0.20024984

0.8017537

Snowy Owl

1.583407706666667

0.20024984

0

1.00092407

Duck

0.586191033333333

0.55302356+

0.70291963)/3=0.586191033333333

0.8017537

1.00092407

0

Inter-cluster distance = Min(1.383349913, 1.583407706666667, 0.586191033333333, 0.20024984, 0.8017537, 1.00092407)=0.20024984. The number of elements in the cluster is two.

Iteration 4:

During the iteration four the index five is linked with the existing linkage index seven. Actual linkage index = linkage matrix index – count. which gives the actual linkage index = 7 - 6 = 1, that points to the cluster formed at the iteration two. Thus the clustering process links the birds Parrot, Crow and Sparrow with the Duck.

 

(Parrot, Crow, Sparrow)

(Seagull, Snowy Owl)

Duck

(Parrot, Crow, Sparrow)

0

=(1.30000105+1.35000836+1.50004033+1.50002324+1.55007018+

1.7001297)/6 = 1.48337881

=(0.50262991+

0.55302356+

0.70291963)/3 = 0.586191033333333

(Seagull, Snowy Owl)

1.48337881

0

=(0.8017537+

1.00092407)/2=0.901338885

Duck

0.586191033333333

0.901338885

0

Inter-cluster distance = Min(1.48337881, 0.586191033333333, 0.901338885) = 0.586191033333333

Iteration 6:

The linkage matrix values of the first two columns are five and seven. The cluster with birds parrot, crow, sparrow, duck are linked with the cluster that consists of seagull and snowy owl. The inter-cluster distance between them is 1.33786882875. The number of items in the cluster is six.

 

(Parrot, Crow, Sparrow, Duck)

(Seagull, Snowy Owl)

(Parrot, Crow, Sparrow, Duck)

0

=(1.30000105+

1.35000836+

1.50004033+

0.8017537+1.50002324+

1.55007018+

1.7001297+1.00092407)/8=1.33786882875

(Seagull, Snowy Owl)

=(1.30000105+

1.35000836+

1.50004033+

0.8017537+1.50002324+

1.55007018+

1.7001297+1.00092407)/8=1.33786882875

0

 


Copyright 2025 © pythontic.com