Overview:
- Most data analysis done with the Python library pandas involves the data structures Series and DataFrame. A pandas.Series is a one-dimensional labeled array whose values can be changed and which normally holds a single data type (mixed values are possible only with the object dtype). A pandas.DataFrame is a two-dimensional, size-mutable table whose columns can have different data types. Both are built largely on top of NumPy's ndarray.
- The classes pandas.Series and pandas.DataFrame provide methods for holding and reshaping data and for performing statistical and mathematical operations on it.
- The method Series.corr() computes the correlation between two variables represented by two pandas.Series instances. The DataFrame.corr() method computes the pairwise correlation coefficients between the columns of a pandas.DataFrame and returns them as a correlation matrix.
- Correlation is a statistical measure of how strongly two variables are related, if a relationship exists between them at all. Examples include per capita income and life expectancy, or forest coverage and annual rainfall of a region. Correlation is measured by the correlation coefficient (r).
- The value of the correlation coefficient is always in the range of -1 to +1.
- When the correlation coefficient is +1, the two variables are perfectly positively correlated: as one increases, the other increases in exact proportion (for Pearson's r, the points lie on a straight line with a positive slope). The size of the increase does not matter: if one variable increases by 0.5 whenever the other increases by 1, the coefficient is still +1. When the correlation coefficient is -1, the two variables are perfectly negatively correlated: as one increases, the other decreases in exact proportion. Values between these extremes indicate weaker relationships, and a value near 0 indicates little or no linear relationship.
- There are several ways to compute a correlation coefficient. The pandas method Series.corr() supports the Pearson, Kendall and Spearman methods, selected through the method parameter ("pearson", which is the default, "kendall" or "spearman"). It also accepts a custom method: pass a callable as the method argument. The callable should take two one-dimensional ndarray objects as parameters and return a float.
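The custom-callable option described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name pearson_by_hand and the sample values in x and y are made up for the example. The callable simply recomputes Pearson's r from its definition so that the result can be compared with the built-in default.

```python
import numpy as np
import pandas as pd


def pearson_by_hand(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical custom method: Pearson's r computed from its definition."""
    da = a - a.mean()  # deviations from the mean of the first variable
    db = b - b.mean()  # deviations from the mean of the second variable
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))


# Made-up sample data, used only for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

r_builtin = x.corr(y)                         # method="pearson" is the default
r_custom = x.corr(y, method=pearson_by_hand)  # callable passed through `method`

print(round(r_builtin, 4), round(r_custom, 4))
```

For this data both calls give a coefficient of 0.8. The rank-based methods are selected the same way with method="kendall" or method="spearman"; depending on the pandas version, those two may require SciPy to be installed.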
Example:
    # Python example to find the correlation coefficient

    # Prices of house
    # The years
    # House prices loaded into a pandas series
    # Years loaded into a pandas series

    # Find the correlation coefficient between house price and year
    corr_value = housePrices.corr(years, method="kendall")
    corr_value = housePrices.corr(years, method="spearman")
Output:
    Correlation coefficient between house price and year (Method:Pearson)
    0.75
    Correlation coefficient between house price and year (Method:Kendall rank correlation coefficient)
    0.6
    Correlation coefficient between house price and year (Method:Spearman rank correlation coefficient)
    0.71
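To make the points about the range of r and about DataFrame.corr() more concrete, here is a small sketch with made-up data. The column names x, half_slope and falling are hypothetical and are not related to the house price example. Two columns are exact linear functions of x, one rising and one falling, so the Pearson coefficients should sit at the two ends of the range.

```python
import pandas as pd

# Made-up columns, used only for illustration.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "half_slope": [10.5, 11.0, 11.5, 12.0, 12.5],  # 0.5 * x + 10
    "falling": [1.0, -1.0, -3.0, -5.0, -7.0],      # -2 * x + 3
})

# Pairwise Pearson correlation between every pair of columns.
corr_matrix = df.corr()

print(corr_matrix.round(2))
```

The result is a 3 x 3 matrix with 1 on the diagonal. The entry for x and half_slope is +1 even though half_slope grows by only 0.5 per unit of x, and the entry for x and falling is -1.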