Computing quantile values for a pandas DataFrame

Overview:

A classroom of students can be grouped by height: students up to 160 cm tall, up to 170 cm tall, and so on. Cut-off values such as 160 cm, 170 cm and 180 cm divide the population into groups, each holding a certain percentage of the whole. When this mechanism is applied to a probability distribution, the locations that divide the distribution into groups are called quantiles. A quantile is identified by the fraction of the distribution that lies at or below it; for example, the 0.25 quantile is the value below which 25 percent of the data falls. Quantiles that divide a distribution into 100 equal portions are called percentiles.
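As a quick check of this definition, the following sketch computes the 0.25 quantile of some evenly spaced heights and then measures the fraction of values at or below it. The heights are synthetic and the variable names are illustrative only.

```python
import numpy as np
import pandas as pds

# Synthetic heights (in cm), evenly spaced; illustrative data only
heights = pds.Series(np.linspace(164, 182, 50))

# Value below which about 25 percent of the heights lie
cutoff = heights.quantile(0.25)

# Fraction of the heights at or below that value
fraction_below = (heights <= cutoff).mean()

print("0.25 quantile: {:.2f}".format(cutoff))
print("Fraction at or below: {:.2f}".format(fraction_below))
```

With a finite sample the fraction is only approximately 0.25, because pandas interpolates between neighbouring data points by default.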

Popular quantiles:

Some quantiles are common enough in statistics to have their own names, based on the percentage split they create. If a distribution is divided into four groups by the values below which 25 percent, 50 percent and 75 percent of the data lies, these three values are called the first quartile, the second quartile and the third quartile. The second quartile is also the median.
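The three quartiles can be requested in one call by passing a list of fractions to quantile(). The sketch below assumes a small, made-up set of heights in a column named height (both the data and the column name are hypothetical), and checks that the second quartile matches the median.

```python
import pandas as pds

# Made-up heights (in cm) for illustration
df = pds.DataFrame({"height": [160, 164, 168, 172, 176, 180, 184]})

# Passing a list returns one row per requested fraction
quartiles = df.quantile([0.25, 0.50, 0.75])
print(quartiles)

# The second quartile is the same value as the median
second_quartile = quartiles.loc[0.50, "height"]
median = df["height"].median()
print(second_quartile == median)
```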

Python example 1 loads the heights of students in a class into a pandas DataFrame and calculates the quartiles of the distribution. The quartiles are plotted using matplotlib, as shown in the output below. Quantiles that divide the distribution into ten equal parts (at 10 percent, 20 percent and so on up to 90 percent) are called deciles.

Quartiles calculated using pandas

The quantile() function of the pandas DataFrame class computes the value below which a given fraction of the data lies. The fraction is passed as a number between 0 and 1. Example 2 plots a normal curve for a set of student scores and marks the deciles on the curve, as shown in the diagram below.

Deciles of a normal distribution
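When the fraction passed to quantile() falls between two data points, the function has to decide which value to report. By default it interpolates linearly, and the interpolation parameter selects other rules. The sketch below uses four made-up scores in a hypothetical score column to show the difference.

```python
import pandas as pds

# Four made-up scores; the 0.5 quantile falls between 20 and 30
df = pds.DataFrame({"score": [10, 20, 30, 40]})

# Default: linear interpolation between the two neighbours
linear = df.quantile(0.5)["score"]

# Take the lower or the higher neighbour instead
lower  = df.quantile(0.5, interpolation="lower")["score"]
higher = df.quantile(0.5, interpolation="higher")["score"]

print(linear, lower, higher)
```

The linear result lies halfway between the two middle scores, while lower and higher each return a value that actually occurs in the data.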

Example 1:

The Python example prints the heights of the students of a class at the first quartile (the 25th percentile), the second quartile (the 50th percentile) and the third quartile (the 75th percentile). It then draws a normal curve centered on the second quartile, using the standard deviation of the heights, and marks the quartiles on it.

# Example Python program that finds
# the first quartile, second quartile and the
# third quartile for the data points in a
# pandas DataFrame. The quartiles are plotted using
# matplotlib
import pandas as pds
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Heights of students in a class
heights = np.linspace(164, 182, 50)

# Load heights into DataFrame
df         = pds.DataFrame(heights)

# Compute the 25th percentile
q1 = df.quantile(0.25)

# Compute the 50th percentile
q2 = df.quantile(0.50)

# Compute the 75th percentile
q3 = df.quantile(0.75)

print("First quartile:{:.2f}".format(q1[0]))
print("Second quartile:{:.2f}".format(q2[0]))
print("Third quartile:{:.2f}".format(q3[0]))

# Find the standard deviation
stdDev = np.std(heights)
print("Standard deviation:{:.2f}".format(stdDev))

# Plot a normal curve centered on the median (second quartile).
# q2 is a Series, so pass its scalar value as loc.
heights = np.sort(heights)
pdf     = stats.norm.pdf(heights, loc = q2[0], scale = stdDev)

plt.plot(heights, pdf, color = 'blue')

plt.axvline(q1[0], color='green', linestyle='--', label=f'Q1={q1[0]:.2f}')
plt.axvline(q2[0], color='red', linestyle='solid', label=f'Q2={q2[0]:.2f}')
plt.axvline(q3[0], color='green', linestyle='--', label=f'Q3={q3[0]:.2f}')

plt.title('Student heights - First quartile, Median, Third quartile')
plt.xlabel('Student heights')
plt.ylabel('Probability density')

plt.legend()
plt.show()

Output:

First quartile:168.50
Second quartile:173.00
Third quartile:177.50
Standard deviation:5.30

Example 2 - Deciles marked on a normal curve:

# Example Python program that plots the deciles 
# of a probability distribution
import pandas as pds
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Student test scores
scores = np.linspace(0, 100, 50)

# Student scores loaded into a DataFrame
df        = pds.DataFrame(scores)

fractions = np.arange(0.1, 1.0, 0.1)
deciles = []
for i in fractions:
    # Use quantile to get deciles
    dn = df.quantile(i)

    decile = int(round(dn[0]))
    deciles.append(decile)

    decileLoc = int(round(i * 10))
    print("Decile{} location:{}".format(decileLoc , decile))

# Find the standard deviation
stdDev = np.std(scores)
print("Standard deviation:{:.2f}".format(stdDev))

# Plot a normal curve using the mean and standard deviation
scores = np.sort(scores)
pdf    = stats.norm.pdf(scores, loc = np.mean(scores), scale = stdDev)

plt.plot(scores, pdf, color = 'blue')
colors = ['blue', 'orange', 'green',
          'red', 'purple', 'brown',
          'pink', 'gray', 'olive']
# Number the deciles by position rather than deriving
# the number from the decile value
for loc, decile in enumerate(deciles, start = 1):
    plt.axvline(x = decile, color = colors[loc-1],
                linestyle = '--',
                label = f'Decile{loc}')

plt.title('Student scores - deciles')
plt.xlabel('Student scores')
plt.ylabel('Probability density')

plt.legend()
plt.show()

Output:

Decile1 location:10
Decile2 location:20
Decile3 location:30
Decile4 location:40
Decile5 location:50
Decile6 location:60
Decile7 location:70
Decile8 location:80
Decile9 location:90
Standard deviation:29.45
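
The loop in Example 2 calls quantile() once per decile. As an alternative sketch, all nine deciles can be computed in a single call by passing an array of fractions. The fractions are built from integers here to avoid accumulated floating-point error. This is a variation on the example under the same synthetic scores, not a replacement for it.

```python
import numpy as np
import pandas as pds

# Same synthetic scores as in Example 2
scores = np.linspace(0, 100, 50)
df = pds.DataFrame(scores)

# Fractions 0.1, 0.2, ..., 0.9
fractions = np.arange(1, 10) / 10

# One call returns a DataFrame with one row per fraction
deciles = df.quantile(fractions)

# Column 0 holds the scores, as in Example 2
for number, value in enumerate(deciles[0], start=1):
    print("Decile{} location:{}".format(number, int(round(value))))
```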

 


Copyright 2025 © pythontic.com