Finding Mahalanobis distance using Python

Overview:

Mahalanobis distance is the multivariate equivalent of the z-score. It gives the distance of a point from a probability distribution.
Understanding the Mahalanobis distance requires an understanding of a few statistical definitions and concepts.
- Mean: The mean is the arithmetic average of the data and a measure of central tendency. It indicates the expected value of an outcome.
- Variance: The variance is the average of the squared distances of the outcomes from the mean, that is, the expected value of the squared deviation from the mean. It is equal to the square of the standard deviation.
- Standard Deviation: The standard deviation describes how far the outcomes typically lie from the mean. It is computed by taking the square root of the variance.
- The z-score: The formula for the z-score is given by

z = (x - μ) / σ

z score = Difference from mean / Standard Deviation

The z-score describes how many standard deviations a data point lies from the mean.
- While the z-score applies to univariate data, the Mahalanobis distance applies to multivariate data.
- The Mahalanobis distance measures the distance between a data point and a probability distribution.
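
The univariate definitions above can be checked numerically. The following is a minimal sketch using NumPy; the sample values are made up for illustration and are not part of the data set used later in this article. It uses the population variance (dividing by the number of values), which is NumPy's default for var().

```python
import numpy as np

# Illustrative sample (made-up values)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()           # arithmetic average
variance = values.var()        # average of the squared deviations from the mean
std_dev = np.sqrt(variance)    # square root of the variance

# z-score of a single observation: (x - mean) / standard deviation
x = 9.0
z = (x - mean) / std_dev

print("Mean:", mean)
print("Variance:", variance)
print("Standard deviation:", std_dev)
print("z-score of", x, ":", z)
```

For this sample the mean is 5, the variance is 4 and the standard deviation is 2, so the observation 9 lies two standard deviations above the mean.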
The Mahalanobis distance D of an observation x is given by

D² = (x - m)^T . C^-1 . (x - m)

x is the observation, a vector with one value per variable.

C is the covariance matrix.

C^-1 is the inverse of the covariance matrix.

m is the mean of the distribution.

The superscript ^T denotes the transpose and the superscript ^-1 denotes the matrix inverse.
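
To make the formula concrete, here is a minimal sketch that evaluates it term by term with NumPy and compares the result with SciPy's mahalanobis() function. The data and the observation are made-up illustrative values, and the variable names simply mirror the symbols used above.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative 2-D sample (made-up values); each row is one observation
data = np.array([[1.0, 2.0],
                 [2.0, 1.0],
                 [3.0, 4.0],
                 [4.0, 3.0],
                 [5.0, 5.0]])

m = data.mean(axis=0)              # m: mean of the distribution
C = np.cov(data, rowvar=False)     # C: covariance matrix
C_inv = np.linalg.inv(C)           # C^-1: inverse of the covariance matrix

x = np.array([4.0, 1.0])           # x: the observation
diff = x - m

# D² = (x - m)^T . C^-1 . (x - m)
d_squared = diff @ C_inv @ diff
d = np.sqrt(d_squared)

print("Squared distance:", d_squared)
print("Distance:", d)
print("SciPy result:", mahalanobis(x, m, C_inv))
```

Note that SciPy returns D itself, so it should agree with the square root of the formula's value rather than with D².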

Covariance describes how two variables vary together. For multivariate data, the covariance is given as a square matrix whose elements are the pairwise covariances of the variables; the diagonal elements are the variances of the individual variables.
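
As a small illustration of the covariance matrix, the sketch below builds one with numpy.cov() and checks it against a manual computation. The three-variable data is made up for the example. Note that numpy.cov() divides by n - 1 (the sample covariance) by default.

```python
import numpy as np

# Illustrative data (made-up): 4 observations (rows) of 3 variables (columns)
data = np.array([[1.0, 10.0, 0.5],
                 [2.0,  8.0, 0.7],
                 [3.0,  7.0, 0.6],
                 [4.0,  3.0, 0.9]])

# rowvar=False tells NumPy that the columns, not the rows, are the variables
cov_matrix = np.cov(data, rowvar=False)

# The same matrix computed by hand: center the columns, then average
# the pairwise products of the deviations using n - 1 in the denominator
centered = data - data.mean(axis=0)
manual = centered.T @ centered / (len(data) - 1)

print("Shape:", cov_matrix.shape)      # one row and one column per variable
print(cov_matrix)
print("Matches manual computation:", np.allclose(cov_matrix, manual))
```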
Mahalanobis distance is used in clustering, classification, and outlier detection problems.
The SciPy function mahalanobis() from the scipy.spatial.distance module computes the Mahalanobis distance between a multivariate point and the mean of a distribution, given the inverse of the covariance matrix. It returns D, the square root of the expression given above for D².

Example:

# Example Python program that finds the Mahalanobis distance between
# a multivariate distribution and a given point.
import scipy.spatial.distance as dist
import numpy

# Set the floating-point printing format for numpy
numpy.set_printoptions(precision=4, suppress=True)

# The data points
brain_weights_birds = numpy.array([[0.5,  0.01365],  # Parrot(GWA)
                                   [0.45, 0.00725],  # Crow
                                   [0.3,  0.001],    # Sparrow
                                   [1.8,  0.012],    # Seagull
                                   [2,    0.022],    # Snowy Owl
                                   [1,    0.065]     # Duck
                                   ])

# Find the covariance
covariance = numpy.cov(brain_weights_birds, rowvar=False)
print("Covariance matrix:")
print(covariance)

# Find the inverse of the covariance matrix
# (pinv() is used instead of inv() so that the computation also works
# if the covariance matrix is singular or nearly singular)
invCovariance = numpy.linalg.pinv(covariance)
print("Inverse covariance matrix:")
print(invCovariance)

# Find the centroid of the distribution through mean
centroid = numpy.mean(brain_weights_birds, axis=0)
print("Centroid of the distribution:")
print(centroid)

# Find how far a hen lies from the given distribution
hen = [2.5, 0.003]
mahalanobis_distance = dist.mahalanobis(hen, centroid, invCovariance)
print("Mahalanobis_distance:")
print(mahalanobis_distance)

Output:

Covariance matrix:
[[0.5364 0.0038]
 [0.0038 0.0005]]
Inverse covariance matrix:
[[   1.9646  -14.1091]
 [ -14.1091 1983.3263]]
Centroid of the distribution:
[1.0083 0.0202]
Mahalanobis_distance:
2.382557193435126