Overview:
-
The Mahalanobis distance is the multivariate equivalent of the z-score. It measures the distance from a point to a probability distribution.
-
Understanding the Mahalanobis distance requires understanding a few statistical definitions and concepts.
- Mean: The mean is the arithmetic average of the data and a measure of central tendency. It indicates the expected value of an outcome.
-
Variance: The variance is the average of the squared distances of the outcomes from the mean, that is, the expected value of the squared deviation from the mean. The variance is equal to the square of the standard deviation.
- Standard Deviation: The standard deviation describes how far the value of an outcome typically differs from the mean. It is computed by taking the square root of the variance.
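The three definitions above can be checked numerically. The following is a minimal sketch using NumPy; the sample values are hypothetical and chosen only for illustration. It uses the population form of the variance (dividing by n), which is an assumption; the sample variance divides by n - 1 instead.

```python
import numpy as np

# Hypothetical sample of five outcomes (illustrative values only)
data = np.array([2.0, 4.0, 4.0, 6.0, 9.0])

# Mean: arithmetic average of the data
mean = data.mean()

# Variance: average of the squared deviations from the mean
# (population variance, dividing by n)
variance = ((data - mean) ** 2).mean()

# Standard deviation: square root of the variance
std_dev = np.sqrt(variance)

print("Mean:", mean)
print("Variance:", variance)
print("Standard deviation:", std_dev)
```

NumPy's built-in data.var() and data.std() return the same population values by default; passing ddof=1 switches to the sample versions.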
-
The z-score: The formula for the z-score is given by
z = (x - μ) / σ
z-score = Difference from mean / Standard deviation
-
The z-score describes how many standard deviations a data point lies from the mean.
- While the z-score applies to univariate data, the Mahalanobis distance applies to multivariate data.
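As an illustration of the z-score formula, here is a minimal sketch in Python. The data values and the observation x are hypothetical, and the population standard deviation is assumed.

```python
import numpy as np

# Hypothetical univariate data (illustrative values only)
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = data.mean()    # mean of the data: 5.0
sigma = data.std()  # population standard deviation: 2.0

# z-score of a single observation: difference from the mean
# divided by the standard deviation
x = 9.0
z = (x - mu) / sigma

print("z-score:", z)  # 2.0, i.e. x lies two standard deviations above the mean
```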
-
The Mahalanobis distance measures how far a data point lies from the mean of a probability distribution, scaled by the covariance of that distribution.
-
The squared Mahalanobis distance is given by
D^2 = (x - m)^T C^{-1} (x - m)
x is the observation vector.
C is the covariance matrix.
C^{-1} is the inverse of the covariance matrix.
m is the mean of the distribution.
The superscript T denotes the transpose. The superscript -1 denotes the matrix inverse. The distance D is the square root of D^2.
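The formula can be evaluated directly with NumPy. The sketch below assumes a hypothetical two-variable distribution with mean m and a diagonal covariance matrix C; the values are chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical distribution parameters (illustrative values only)
m = np.array([0.0, 0.0])        # mean of the distribution
C = np.array([[4.0, 0.0],       # covariance matrix: variances 4 and 1,
              [0.0, 1.0]])      # no correlation between the variables

x = np.array([2.0, 1.0])        # the observation

diff = x - m                    # (x - m)
C_inv = np.linalg.inv(C)        # inverse of the covariance matrix

# (x - m)^T C^{-1} (x - m); for 1-D arrays the @ operator
# handles the transpose implicitly
d_squared = diff @ C_inv @ diff
d = np.sqrt(d_squared)

print("Squared Mahalanobis distance:", d_squared)
print("Mahalanobis distance:", d)
```

Because the covariance matrix here is diagonal, the squared distance is simply the sum of the squared z-scores of each variable: (2/2)^2 + (1/1)^2 = 2.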
-
Covariance describes how two variables vary together. For multivariate data, the covariance is given as a square matrix whose elements are the pairwise covariances of the variables; the diagonal elements are the variances of the individual variables.
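A covariance matrix can be computed with numpy.cov(). The sketch below uses hypothetical data with observations in rows and variables in columns, hence rowvar=False. Note that numpy.cov() returns the sample covariance (dividing by n - 1) by default.

```python
import numpy as np

# Hypothetical data: 4 observations (rows) of 2 variables (columns)
data = np.array([[1.0, 2.0],
                 [2.0, 1.0],
                 [3.0, 4.0],
                 [4.0, 3.0]])

# rowvar=False tells NumPy that each column is a variable
C = np.cov(data, rowvar=False)

print("Covariance matrix:")
print(C)

# The matrix is square and symmetric: C[0, 1] and C[1, 0] both hold
# the covariance between the first and the second variable,
# while C[0, 0] and C[1, 1] hold the variances.
```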
-
Mahalanobis distance is used in clustering, classification and outlier detection problems.
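As a rough illustration of the outlier detection use case, the sketch below scores candidate points against a reference sample and flags those whose distance exceeds a cutoff. The reference data, the candidate points, and the cutoff of 3.0 are all hypothetical assumptions made for this example, not a recommended setting.

```python
import numpy as np

# Hypothetical reference sample: 6 observations of 2 variables
reference = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
                      [4.0, 3.0], [2.0, 3.0], [3.0, 2.0]])

# Hypothetical candidate points to be scored
candidates = np.array([[2.5, 2.5],
                       [10.0, 0.0]])

centroid = reference.mean(axis=0)
C_inv = np.linalg.inv(np.cov(reference, rowvar=False))

# Row-wise (x - m)^T C^{-1} (x - m) for every candidate
diffs = candidates - centroid
d_squared = np.einsum("ij,jk,ik->i", diffs, C_inv, diffs)
distances = np.sqrt(d_squared)

cutoff = 3.0  # assumed threshold, chosen only for illustration
is_outlier = distances > cutoff

print("Distances:", distances)
print("Outlier flags:", is_outlier)
```

The first candidate coincides with the centroid and gets a distance of zero, while the second lies far outside the spread of the reference sample and is flagged.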
-
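The SciPy function mahalanobis() from the scipy.spatial.distance module computes the Mahalanobis distance between two points, given the inverse of the covariance matrix. Passing the mean of the distribution as the second point gives the distance of a multivariate point to that distribution.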
Example:
# Example Python program that finds the Mahalanobis distance between
# Set the floating-point printing format for numpy
# The data points
# Find the covariance
# Find the inverse of the covariance matrix
# Find the centroid of the distribution through mean
# Let us find where a Hen is from the given distribution
Output:
Covariance matrix:
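The steps named in the example's comments can be sketched as follows. The data points and the "hen" observation below are hypothetical stand-ins chosen for illustration, and the print format is an assumption; only the sequence of steps follows the example.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Set the floating-point printing format for numpy
np.set_printoptions(precision=3, suppress=True)

# The data points (hypothetical): rows are observations, columns are variables
data = np.array([[1.0, 2.0],
                 [2.0, 1.0],
                 [3.0, 4.0],
                 [4.0, 3.0]])

# Find the covariance
cov = np.cov(data, rowvar=False)
print("Covariance matrix:")
print(cov)

# Find the inverse of the covariance matrix
cov_inv = np.linalg.inv(cov)

# Find the centroid of the distribution through the mean
centroid = data.mean(axis=0)

# Find how far a hypothetical observation lies from the distribution
hen = np.array([4.0, 1.0])
distance = mahalanobis(hen, centroid, cov_inv)
print("Mahalanobis distance:", distance)
```

mahalanobis() expects the inverse of the covariance matrix as its third argument, not the covariance matrix itself, which is why the inversion is a separate step.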