Compute the correlation coefficient for the variables represented by two pandas.Series objects

Overview:

  • Most data analysis done with the Python library pandas involves the data structures Series and DataFrame. A pandas.Series is a one-dimensional, labelled array whose values share a single dtype (which may be object, allowing mixed Python values), and a pandas.DataFrame is a two-dimensional, size-mutable table whose columns may have different dtypes. Both typically store their data in NumPy ndarray objects.
  • The classes pandas.Series and pandas.DataFrame provide methods for holding and reshaping data and for performing statistical and mathematical operations on it.
  • The method Series.corr() finds the correlation between two variables represented by two pandas.Series instances. The DataFrame.corr() method computes the pairwise correlation coefficients between the columns of a pandas.DataFrame.
  • Correlation is a statistical measure of how strongly two variables are related, if a relationship exists between them at all. Examples include per capita income and life expectancy, or forest coverage and annual rainfall of a region. Correlation is measured by the correlation coefficient (r).
  • The value of the correlation coefficient is always in the range of -1 to +1.
  • When the correlation coefficient is +1, the two variables are perfectly correlated in the positive direction: whenever one variable increases, the other increases too, and (for the Pearson method) the points fall exactly on a straight line with a positive slope. The size of the slope does not matter. If one variable increases by 1 and the other always increases by 0.5, the coefficient is still +1. When the correlation coefficient is -1, the two variables are perfectly negatively correlated: whenever one variable increases, the other decreases. Values between -1 and +1 indicate a weaker relationship, and a value near 0 indicates little or no linear relationship.
  • There are several methods to measure the correlation coefficient. The pandas method Series.corr() supports calculating the correlation coefficient using the methods Pearson (the default), Kendall and Spearman, selected through the parameter method. The same parameter also accepts a callable, which allows any other custom method. The custom function calculating the correlation coefficient should take two one-dimensional ndarray objects as parameters and should return a float.
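The points about +1, -1 and custom methods can be checked with a small sketch. The data and the names x, y_up, y_down and centred_cosine are made up for illustration and are not part of pandas. The custom function is only one possible choice: it computes the cosine of the angle between the mean-centred values, which is mathematically the same quantity as the Pearson coefficient.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# y_up grows by 0.5 for every unit of x: a perfect positive linear relation
y_up = 0.5 * x + 10
# y_down falls by 2 for every unit of x: a perfect negative linear relation
y_down = -2.0 * x + 3

r_up = x.corr(y_up)        # Pearson is the default method
r_down = x.corr(y_down)

# Hypothetical custom method: takes two 1-D ndarrays and returns a float
def centred_cosine(a: np.ndarray, b: np.ndarray) -> float:
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A callable is passed through the same "method" parameter
r_custom = x.corr(y_up, method=centred_cosine)

print(round(r_up, 2), round(r_down, 2), round(r_custom, 2))
```

Up to floating-point rounding, r_up and r_custom should both come out as +1 and r_down as -1, even though the slopes of the two lines differ.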

Example:

# Python example to find the correlation coefficient
# of two variables represented by two pandas Series instances
import pandas as pd

# Prices of houses
housePriceList = [250, 265, 270, 262, 268, 272]

# The years
yearList = [2014, 2015, 2016, 2017, 2018, 2019]

# House prices loaded into a pandas Series
housePrices = pd.Series(housePriceList)

# Years loaded into a pandas Series
years = pd.Series(yearList)

# Find the correlation coefficient between house price and year
corr_value = housePrices.corr(years, method="pearson")
print("Correlation coefficient between house price and year (Method:Pearson)")
print(round(corr_value, 2))

corr_value = housePrices.corr(years, method="kendall")
print("Correlation coefficient between house price and year (Method:Kendall rank correlation coefficient)")
print(round(corr_value, 2))

corr_value = housePrices.corr(years, method="spearman")
print("Correlation coefficient between house price and year (Method:Spearman rank correlation coefficient)")
print(round(corr_value, 2))

Output:

Correlation coefficient between house price and year (Method:Pearson)
0.75
Correlation coefficient between house price and year (Method:Kendall rank correlation coefficient)
0.6
Correlation coefficient between house price and year (Method:Spearman rank correlation coefficient)
0.71
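
The overview also mentions DataFrame.corr(). As a rough sketch, the same house price data can be placed in a DataFrame (the column names "price" and "year" are chosen here only for illustration) to obtain the pairwise correlation matrix. The Pearson value is then cross-checked by hand from the textbook formula using NumPy.

```python
import numpy as np
import pandas as pd

# Same data as in the example above; column names are illustrative
df = pd.DataFrame({
    "price": [250, 265, 270, 262, 268, 272],
    "year":  [2014, 2015, 2016, 2017, 2018, 2019],
})

# Pairwise Pearson correlation between all columns (a 2 x 2 matrix here)
corr_matrix = df.corr(method="pearson")
print(corr_matrix.round(2))

# Manual cross-check: r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = df["year"].to_numpy() - df["year"].mean()
dy = df["price"].to_numpy() - df["price"].mean()
r_manual = float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))
print(round(r_manual, 2))
```

The diagonal of the matrix is 1 because every column is perfectly correlated with itself, and the off-diagonal entry matches the Pearson value printed by Series.corr() above.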

 


Copyright 2024 © pythontic.com