Finding correlation coefficients between columns of a pandas DataFrame

Overview:

  • Correlation coefficients measure how strongly two variables are related to each other. The relationship can be linear and positive, linear and negative (i.e., inversely related), or monotonic. In a monotonic relationship the variables move consistently in the same or in opposite directions, but not necessarily at a constant rate.
  • The pandas DataFrame class has the method corr(), which computes pairwise correlation coefficients between columns using any one of three methods: Pearson, Kendall Tau, or Spearman. The coefficients calculated by all three methods range from -1 to +1.
  • While corr() finds the correlation coefficients between the columns of a single DataFrame instance, corrwith() computes correlation coefficients between the rows or columns of two different DataFrame instances. The one-dimensional pandas.Series also has a corr() method, which finds the correlation between the variables represented by two pandas.Series objects.
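As a minimal sketch of the Series case mentioned above, the snippet below calls Series.corr() on two Series objects. The variable names x, y and r are illustrative, and the values are the same X and Y data used in the DataFrame example later in this article.

```python
import pandas as pd

# Two variables held in separate Series objects
x = pd.Series([20, 25, 30, 35, 40, 45])
y = pd.Series([10, 9, 9, 8, 8, 7])

# Series.corr() returns a single float; the default method is "pearson"
r = x.corr(y)
print(round(r, 5))  # -0.96833
```

Unlike DataFrame.corr(), which returns a matrix of pairwise coefficients, Series.corr() returns one number for the pair.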

Pearson correlation coefficient:

  • The Pearson correlation coefficient is defined as the covariance of two variables divided by the product of their standard deviations. It evaluates the linear relationship between two variables and takes a value between -1 and +1.
  • The value +1 indicates a perfect positive linear correlation between the variables x and y. The value 0 indicates that there is no linear relationship between x and y (they may still be related in a nonlinear way). The value -1 indicates a perfect negative (inverse) linear correlation between x and y.
  • The Pearson correlation coefficient is also called the Pearson product-moment correlation coefficient.
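The definition above can be checked by hand. The following sketch, using illustrative variable names, divides the sample covariance by the product of the sample standard deviations and compares the result with what pandas reports.

```python
import pandas as pd

x = pd.Series([20, 25, 30, 35, 40, 45])
y = pd.Series([10, 9, 9, 8, 8, 7])

# Sample covariance divided by the product of the sample standard deviations
# (Series.cov() and Series.std() both use ddof=1 by default)
manual_r = x.cov(y) / (x.std() * y.std())

# Built-in Pearson coefficient for comparison
pandas_r = x.corr(y, method="pearson")

print(manual_r, pandas_r)
```

The two values should agree up to floating-point rounding.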

Kendall Tau correlation coefficient:

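  • The Kendall Tau correlation coefficient quantifies the difference between the number of concordant pairs (pairs of observations that both variables rank in the same order) and discordant pairs (pairs that the two variables rank in opposite order).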
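To make the idea of concordant and discordant pairs concrete, here is a small pure-Python sketch that counts them directly. It assumes the tau-b variant, which corrects for ties; this appears to be the variant pandas reports, since the result matches the Kendall output shown later in this article. All names are illustrative.

```python
from itertools import combinations
from math import sqrt

x = [20, 25, 30, 35, 40, 45]
y = [10, 9, 9, 8, 8, 7]

concordant = discordant = ties_x = ties_y = 0

# Examine every unordered pair of observations
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    if x1 == x2:
        ties_x += 1
    if y1 == y2:
        ties_y += 1
    product = (x2 - x1) * (y2 - y1)
    if product > 0:
        concordant += 1   # both variables move in the same direction
    elif product < 0:
        discordant += 1   # the variables move in opposite directions

total_pairs = len(x) * (len(x) - 1) // 2

# Tau-b: the denominator discounts pairs tied in x and pairs tied in y
tau_b = (concordant - discordant) / sqrt(
    (total_pairs - ties_x) * (total_pairs - ties_y)
)
print(concordant, discordant, round(tau_b, 6))  # 0 13 -0.930949
```

Of the 15 pairs, 13 are discordant and 2 are tied in y, which is why the coefficient is strongly negative but not exactly -1.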

Spearman correlation coefficient:

  • The Spearman correlation method is a nonparametric measure of the strength and direction of the monotonic relationship between two variables.
  • It is typically used when the data is not normally distributed or when the sample size is small (as a rule of thumb, fewer than 30 observations).
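The Spearman coefficient can be viewed as the Pearson coefficient computed on ranks rather than on raw values. The sketch below illustrates this with a hypothetical monotonic but nonlinear relationship (y = x cubed), and then with the X and Y data used in the example later in this article. Variable names are illustrative.

```python
import pandas as pd

# Monotonic but nonlinear relationship
x = pd.Series([1, 2, 3, 4, 5, 6])
y = x ** 3

pearson_r = x.corr(y)                 # below 1, since the relation is not linear
spearman_r = x.rank().corr(y.rank())  # Pearson on ranks: 1 for a monotonic relation
print(pearson_r, spearman_r)

# Same approach on data that contains ties; rank() assigns average ranks to ties
a = pd.Series([20, 25, 30, 35, 40, 45])
b = pd.Series([10, 9, 9, 8, 8, 7])
rank_r = a.rank().corr(b.rank())
print(round(rank_r, 6))  # -0.971008
```

Because ranking discards the spacing between values, the Spearman coefficient reaches 1 for any strictly increasing relationship, while the Pearson coefficient does so only for a linear one.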

Example - Finding correlation coefficients between columns of the same DataFrame instance:

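import pandas as pd

# Note: pandas relies on SciPy internally for the Kendall Tau method,
# so SciPy must be installed even though it is not imported here.

values = {"X": [20, 25, 30, 35, 40, 45],
          "Y": [10, 9, 9, 8, 8, 7]}

dataFrame = pd.DataFrame(data=values)
print("DataFrame:")
print(dataFrame)

# Pearson: strength of the linear relationship between the columns
correlation = dataFrame.corr(method="pearson")
print("Pearson correlation coefficient:")
print(correlation)

# Kendall Tau: based on concordant and discordant pairs
correlation = dataFrame.corr(method="kendall")
print("Kendall Tau correlation coefficient:")
print(correlation)

# Spearman: strength of the monotonic relationship, based on ranks
correlation = dataFrame.corr(method="spearman")
print("Spearman rank correlation:")
print(correlation)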

Output:

DataFrame:
    X   Y
0  20  10
1  25   9
2  30   9
3  35   8
4  40   8
5  45   7
Pearson correlation coefficient:
         X        Y
X  1.00000 -0.96833
Y -0.96833  1.00000
Kendall Tau correlation coefficient:
          X         Y
X  1.000000 -0.930949
Y -0.930949  1.000000
Spearman rank correlation:
          X         Y
X  1.000000 -0.971008
Y -0.971008  1.000000

Example - Finding correlation coefficients between rows of two different DataFrame instances:

import pandas as pd

# Each tuple becomes one row of the DataFrame
dataValues1 = [(8, 9, 10, 11, 12, 13, 14, 15, 16),
               (8.5, 9.5, 10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5)]

dataValues2 = [(2, 1.5, 1, 1.5, 3, 3, 2, 2.5, 3),
               (2.1, 1.5, 1.2, 1.4, 3.2, 3.1, 2.2, 2.53, 3.2)]

dataFrame1 = pd.DataFrame(data=dataValues1)
dataFrame2 = pd.DataFrame(data=dataValues2)
print("DataFrame1:")
print(dataFrame1)

print("DataFrame2:")
print(dataFrame2)

# Find the Pearson correlation coefficient between rows of the two DataFrames
# (axis=1 correlates row-wise; "pearson" is the default method)
pearsonCorrelation = dataFrame1.corrwith(dataFrame2, axis=1)
print("Pearson correlation coefficient between rows of dataFrame1 and dataFrame2: ")
print(pearsonCorrelation)

# Find the Kendall Tau correlation coefficient between rows of the two DataFrames
kendallCorrelation = dataFrame1.corrwith(dataFrame2, axis=1, method="kendall")
print("Kendall Tau correlation coefficient between rows of dataFrame1 and dataFrame2: ")
print(kendallCorrelation)

# Find the Spearman rank correlation between rows of the two DataFrames
spearmanCorrelation = dataFrame1.corrwith(dataFrame2, axis=1, method="spearman")
print("Spearman rank correlation between rows of dataFrame1 and dataFrame2: ")
print(spearmanCorrelation)

Output:

DataFrame1:
     0    1     2     3     4     5     6     7     8
0  8.0  9.0  10.0  11.0  12.0  13.0  14.0  15.0  16.0
1  8.5  9.5  10.5  11.5  12.5  13.5  14.5  15.5  16.5
DataFrame2:
     0    1    2    3    4    5    6     7    8
0  2.0  1.5  1.0  1.5  3.0  3.0  2.0  2.50  3.0
1  2.1  1.5  1.2  1.4  3.2  3.1  2.2  2.53  3.2
Pearson correlation coefficient between rows of dataFrame1 and dataFrame2:
0    0.639010
1    0.645101
dtype: float64
Kendall Tau correlation coefficient between rows of dataFrame1 and dataFrame2:
0    0.449013
1    0.422577
dtype: float64
Spearman rank correlation between rows of dataFrame1 and dataFrame2:
0    0.632687
1    0.669462
dtype: float64
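
The example above correlates rows by passing axis=1. As a complementary sketch, corrwith() can also be called with its default axis=0, in which case columns with matching labels are correlated with each other. The DataFrames and the column labels "A" and "B" below are hypothetical and chosen so that the expected coefficients are easy to see.

```python
import pandas as pd

# Hypothetical data: df2["A"] is 2 * df1["A"] + 1, and df2["B"] is -df1["B"]
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                    "B": [2, 4, 1, 5, 3]})
df2 = pd.DataFrame({"A": [3, 5, 7, 9, 11],
                    "B": [-2, -4, -1, -5, -3]})

# Default axis=0: correlate column "A" with "A" and column "B" with "B"
byColumn = df1.corrwith(df2)
print(byColumn)
```

Column "A" should yield a coefficient of about +1 (perfect positive linear relationship) and column "B" about -1 (perfect inverse relationship), up to floating-point rounding.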


Copyright 2024 © pythontic.com