Removing missing values and duplicate values from a pandas DataFrame

Overview:

A DataFrame in pandas is a two-dimensional container with rows and columns. The data can have column labels and row index. In data processing, it is a common occurence for the data to have duplicate values and empty values. These empty values and duplicate value can occur in so many ways and patterns. Pandas DataFrame class provides the methods dropna(), drop_duplicates() to handle these cases in a comprehensive manner.

Removing of Missing Values:

  • The dropna() method of the DataFrame class is comprehensive in providing multiple means to remove missing values of various patterns.
  • Missing values can be removed in column-wise and row-wise fashions.
  • A row or column can be removed, if any one of the value is missing or all of the values are missing. Also, a threshold specifying the number of missing values can be used to remove a row or column.
  • A subset of columns or rows can be specified that defines the scope of the missing value removal. Missing values occur in other columns or in the rows beyond this specified scope will not be removed.

Example - Removing pandas DataFrame rows containing missing values:

# Example Python program that removes rows containing empty values
# from a DataFrame 
import pandas as pds

# Construct example data with empty values
columnHeaders = {"exchange_ref_no", "name", "qty", "price", "timestamp"};

priceStream = {(0x202cb962ac59075b964b07152d234b70, "FUNCOMP", 12, 1200.45, 1619876978),
              (0xc8ffe9a587b126f152ed3d89a146b445, "FUNCOMP", None, 1201.45, 1619876979),
              (0x3def184ad8f4755ff269862ea77393dd, "FUNCOMP", 10, 1100.40, 1619876980),
              (0x069059b7ef840f0c74a814ec9237b6ec, "FUNCOMP", 25, 1090.00, 1619876971),
              (0xec5decca5ed3d6b8079e2e7e7bacc9f2, "FUNCOMP", 25, None, 1619876972)
              };

# Construct a DataFrame with raw data
priceTicks = pds.DataFrame(data=priceStream, columns=columnHeaders);
print("Data with empty values:");
print(priceTicks);

# Remove rows with empty values
cleanPriceTicks = priceTicks.dropna(axis = "index", how = "any");
print("Data with empty values removed:");
print(cleanPriceTicks);

Output:

Data with empty values:

                                 timestamp exchange_ref_no   qty     name       price

0   42767516990368493138776584305024125808         FUNCOMP  12.0  1200.45  1619876978

1  267174373771988661416381715658526078021         FUNCOMP   NaN  1201.45  1619876979

2  314185493295186862902690342039947364850         FUNCOMP  25.0      NaN  1619876972

3    8724878429673542145727510873258833644         FUNCOMP  25.0  1090.00  1619876971

4   82324359399928500054185503234815398877         FUNCOMP  10.0  1100.40  1619876980

Data with empty values removed:

                                timestamp exchange_ref_no   qty     name       price

0  42767516990368493138776584305024125808         FUNCOMP  12.0  1200.45  1619876978

3   8724878429673542145727510873258833644         FUNCOMP  25.0  1090.00  1619876971

4  82324359399928500054185503234815398877         FUNCOMP  10.0  1100.40  1619876980

Example - Removing pandas DataFrame columns containing missing values:

# Example Python program that removes columns containing empty values
# from a DataFrame 
import pandas as pds

# Example data
headers     = ("Sensor1", "Sensor2", "Sensor3", "Sensor4", "Sensor5");
readings    = {(77.0, 77.1, 77.0, 77.2, 77.5),
               (76.1, 76.5, None, 76.4, 76.7),
               (64.2, 64.8, 64.5, 64.1, 64.0),
               (81.3, 81.4, 80.9, 81.7, None),
               (78.2, 78.4, 78.6, 78.2, 78.3)};

# Construct a DataFrame with raw data
sensorData = pds.DataFrame(data=readings, columns=headers);
print("Data with few empty readings:");
print(sensorData);

# Remove columns with empty values from the DataFrame
processedData = sensorData.dropna(axis = "columns", how = "any");
print("Data with columns containing empty values removed:");
print(processedData);

Output:

Data with few empty readings:

   Sensor1  Sensor2  Sensor3  Sensor4  Sensor5

0     64.2     64.8     64.5     64.1     64.0

1     77.0     77.1     77.0     77.2     77.5

2     78.2     78.4     78.6     78.2     78.3

3     76.1     76.5      NaN     76.4     76.7

4     81.3     81.4     80.9     81.7      NaN

Data with columns containing empty values removed:

   Sensor1  Sensor2  Sensor4

0     64.2     64.8     64.1

1     77.0     77.1     77.2

2     78.2     78.4     78.2

3     76.1     76.5     76.4

4     81.3     81.4     81.7

Example-Removing rows/columns with missing values based on threshold:

# Example Python program that removes a column
# if the column contains at least two missing values

import pandas as pds

# Example data
binaryLists = {"A":[1, 0, 1, 1, 0],
               "B":[0, 1, 1, 0, None],
               "C":[None, 0, 0, 1, 1],
               "D":[None, 1, 0, 0, None]
               };
               
binaryDataFrame = pds.DataFrame(data=binaryLists);
print("Data with missing values present:");
print(binaryDataFrame);

binaryDataFrame_Cln = binaryDataFrame.dropna();
print("Data with missing values removed based on a threshold of counts:");
print(binaryDataFrame_Cln);

Output:

Data with missing values present:

   A    B    C    D

0  1  0.0  NaN  NaN

1  0  1.0  0.0  1.0

2  1  1.0  0.0  0.0

3  1  0.0  1.0  0.0

4  0  NaN  1.0  NaN

Data with missing values removed based on a threshold of counts:

   A    B    C    D

1  0  1.0  0.0  1.0

2  1  1.0  0.0  0.0

3  1  0.0  1.0  0.0

Example-Remove rows if specified column values are missing:

# Example Python program that removes the rows
# with missing values, only if specified columns 
# contain None values 
import pandas as pds

# Data
voteData = [(1620223791,  "A", 51, 34),
            (1620223792,  "B", 22, 35),
            (1620223793,  "C", 67, 33),
            (1620223794,  "D", 7,  None),
            (1620223795,  "E", None, 35)];
columns = ["timestamp", "candidate", "count", "age"];
voteFrame = pds.DataFrame(data=voteData, columns = columns);

print("DataFrame object with None values:");
print(voteFrame);

# Process the data and remove rows with missing values 
# only if those rows have the count column missing
processedFrame = voteFrame.dropna(subset=["count"]);
print(processedFrame);

Output:

DataFrame object with None values:

    timestamp candidate  count   age

0  1620223791         A   51.0  34.0

1  1620223792         B   22.0  35.0

2  1620223793         C   67.0  33.0

3  1620223794         D    7.0   NaN

4  1620223795         E    NaN  35.0

    timestamp candidate  count   age

0  1620223791         A   51.0  34.0

1  1620223792         B   22.0  35.0

2  1620223793         C   67.0  33.0

3  1620223794         D    7.0   NaN

Removing of Duplicate Values:    

  • The drop_duplicates() method removes duplicates entries of data present in a DataFrame.
  • During the duplicate removal process the first or the last occurence of the duplicate value in the row can be retained with the parameter "keep".
  • With the "inplace" parameter the duplicate values can be removed in the original DataFrame on which drop_duplicates() is called or the method can be made to return a new copy with the duplicate values removed.

Example - Removing duplicate rows from a pandas DataFrame:

# Example Python program to remove duplicate rows
# from a pandas DataFrame instance
import pandas as pds

# Boolean data as a pandas DataFrame instance
booleans = [(1, 1, 1, 0, 1),
            (1, 1, 1, 0, 1),
            (1, 0, 1, 0, 1),
            (1, 0, 0, 0, 1),
            (1, 1, 0, 0, 0)];
            
pdf = pds.DataFrame(data=booleans, columns = ["A", "B", "C", "D", "E"]);
print("DataFrame with duplicate rows:");
print(pdf);

# Remove the duplicates and print the processed pandas DataFrame
print("DataFrame with duplicate rows removed:");
ppdf = pdf.drop_duplicates();
print(ppdf);

Output:

DataFrame with duplicate rows:

   A  B  C  D  E

0  1  1  1  0  1

1  1  1  1  0  1

2  1  0  1  0  1

3  1  0  0  0  1

4  1  1  0  0  0

DataFrame with duplicate rows removed:

   A  B  C  D  E

0  1  1  1  0  1

2  1  0  1  0  1

3  1  0  0  0  1

4  1  1  0  0  0

Example:

# Example Python program that removes duplicate rows
# from a pandas DataFrame, based on specified columns
import pandas as pds

# Example data
bookData = [("978-9380816715", "Alice's Adventures in the Wonderland", 10), 
            ("978-1975675691", "Through the Looking-Glass", 9),
            ("978-9382616597", "Gone with the Wind", 4),
            ("978-8175993259", "Gulliver's Travels", 3),
            ("978-9381607701", "Gulliver's Travels", 1),
            ("978-1408845646", "Harry Potter and the Philosopher's Stone", 2),
            ("978-1408883761", "Harry Potter and the Philosopher's Stone", 1),
            ("978-1408883761", "Harry Potter and the Philosopher's Stone", 5)];
bookFrame   = pds.DataFrame(data=bookData, columns=["ISBN", "Title", "Copies"]);  
print(bookFrame);

# Keep only two publications - (no to same ISBNs AND same Titles) 
conciseFrame   = bookFrame.drop_duplicates(subset=["ISBN", "Title"], keep="last");
print(conciseFrame);

Output:

             ISBN                                     Title  Copies

0  978-9380816715      Alice's Adventures in the Wonderland      10

1  978-1975675691                 Through the Looking-Glass       9

2  978-9382616597                        Gone with the Wind       4

3  978-8175993259                        Gulliver's Travels       3

4  978-9381607701                        Gulliver's Travels       1

5  978-1408845646  Harry Potter and the Philosopher's Stone       2

6  978-1408883761  Harry Potter and the Philosopher's Stone       1

7  978-1408883761  Harry Potter and the Philosopher's Stone       5

             ISBN                                     Title  Copies

0  978-9380816715      Alice's Adventures in the Wonderland      10

1  978-1975675691                 Through the Looking-Glass       9

2  978-9382616597                        Gone with the Wind       4

3  978-8175993259                        Gulliver's Travels       3

4  978-9381607701                        Gulliver's Travels       1

5  978-1408845646  Harry Potter and the Philosopher's Stone       2

7  978-1408883761  Harry Potter and the Philosopher's Stone       5

 


Copyright 2024 © pythontic.com