Removing missing values and duplicate values from a pandas DataFrame

Overview:

A DataFrame in pandas is a two-dimensional container with rows and columns. The data can have column labels and row index. In data processing, it is a common occurence for the data to have duplicate values and empty values. These empty values and duplicate value can occur in so many ways and patterns. Pandas DataFrame class provides the methods dropna(), drop_duplicates() to handle these cases in a comprehensive manner.

Removing of Missing Values:

The dropna() method of the DataFrame class is comprehensive in providing multiple means to remove missing values of various patterns.
Missing values can be removed in column-wise and row-wise fashions.
A row or column can be removed, if any one of the value is missing or all of the values are missing. Also, a threshold specifying the number of missing values can be used to remove a row or column.
A subset of columns or rows can be specified that defines the scope of the missing value removal. Missing values occur in other columns or in the rows beyond this specified scope will not be removed.

Example - Removing pandas DataFrame rows containing missing values:

# Example Python program that removes rows containing empty values
# from a DataFrame
import pandas as pds

# Construct example data with empty values
columnHeaders = {"exchange_ref_no", "name", "qty", "price", "timestamp"};

priceStream = {(0x202cb962ac59075b964b07152d234b70, "FUNCOMP", 12, 1200.45, 1619876978),
(0xc8ffe9a587b126f152ed3d89a146b445, "FUNCOMP", None, 1201.45, 1619876979),
(0x3def184ad8f4755ff269862ea77393dd, "FUNCOMP", 10, 1100.40, 1619876980),
(0x069059b7ef840f0c74a814ec9237b6ec, "FUNCOMP", 25, 1090.00, 1619876971),
(0xec5decca5ed3d6b8079e2e7e7bacc9f2, "FUNCOMP", 25, None, 1619876972)
};

# Construct a DataFrame with raw data
priceTicks = pds.DataFrame(data=priceStream, columns=columnHeaders);
print("Data with empty values:");
print(priceTicks);

# Remove rows with empty values
cleanPriceTicks = priceTicks.dropna(axis = "index", how = "any");
print("Data with empty values removed:");
print(cleanPriceTicks);

Output:

Data with empty values:

timestamp exchange_ref_no qty name price

0 42767516990368493138776584305024125808 FUNCOMP 12.0 1200.45 1619876978

1 267174373771988661416381715658526078021 FUNCOMP NaN 1201.45 1619876979

2 314185493295186862902690342039947364850 FUNCOMP 25.0 NaN 1619876972

3 8724878429673542145727510873258833644 FUNCOMP 25.0 1090.00 1619876971

4 82324359399928500054185503234815398877 FUNCOMP 10.0 1100.40 1619876980

Data with empty values removed:

timestamp exchange_ref_no qty name price

0 42767516990368493138776584305024125808 FUNCOMP 12.0 1200.45 1619876978

3 8724878429673542145727510873258833644 FUNCOMP 25.0 1090.00 1619876971

4 82324359399928500054185503234815398877 FUNCOMP 10.0 1100.40 1619876980

Example - Removing pandas DataFrame columns containing missing values:

# Example Python program that removes columns containing empty values
# from a DataFrame
import pandas as pds

# Example data
headers = ("Sensor1", "Sensor2", "Sensor3", "Sensor4", "Sensor5");
readings = {(77.0, 77.1, 77.0, 77.2, 77.5),
(76.1, 76.5, None, 76.4, 76.7),
(64.2, 64.8, 64.5, 64.1, 64.0),
(81.3, 81.4, 80.9, 81.7, None),
(78.2, 78.4, 78.6, 78.2, 78.3)};

# Construct a DataFrame with raw data
sensorData = pds.DataFrame(data=readings, columns=headers);
print("Data with few empty readings:");
print(sensorData);

# Remove columns with empty values from the DataFrame
processedData = sensorData.dropna(axis = "columns", how = "any");
print("Data with columns containing empty values removed:");
print(processedData);

Output:

Data with few empty readings:

Sensor1 Sensor2 Sensor3 Sensor4 Sensor5

0 64.2 64.8 64.5 64.1 64.0

1 77.0 77.1 77.0 77.2 77.5

2 78.2 78.4 78.6 78.2 78.3

3 76.1 76.5 NaN 76.4 76.7

4 81.3 81.4 80.9 81.7 NaN

Data with columns containing empty values removed:

Sensor1 Sensor2 Sensor4

0 64.2 64.8 64.1

1 77.0 77.1 77.2

2 78.2 78.4 78.2

3 76.1 76.5 76.4

4 81.3 81.4 81.7

Example-Removing rows/columns with missing values based on threshold:

# Example Python program that removes a column
# if the column contains at least two missing values

import pandas as pds

# Example data
binaryLists = {"A":[1, 0, 1, 1, 0],
"B":[0, 1, 1, 0, None],
"C":[None, 0, 0, 1, 1],
"D":[None, 1, 0, 0, None]
};

binaryDataFrame = pds.DataFrame(data=binaryLists);
print("Data with missing values present:");
print(binaryDataFrame);

binaryDataFrame_Cln = binaryDataFrame.dropna();
print("Data with missing values removed based on a threshold of counts:");
print(binaryDataFrame_Cln);

Output:

Data with missing values present:

A B C D

0 1 0.0 NaN NaN

1 0 1.0 0.0 1.0

2 1 1.0 0.0 0.0

3 1 0.0 1.0 0.0

4 0 NaN 1.0 NaN

Data with missing values removed based on a threshold of counts:

A B C D

1 0 1.0 0.0 1.0

2 1 1.0 0.0 0.0

3 1 0.0 1.0 0.0

Example-Remove rows if specified column values are missing:

# Example Python program that removes the rows
# with missing values, only if specified columns
# contain None values
import pandas as pds

# Data
voteData = [(1620223791, "A", 51, 34),
(1620223792, "B", 22, 35),
(1620223793, "C", 67, 33),
(1620223794, "D", 7, None),
(1620223795, "E", None, 35)];
columns = ["timestamp", "candidate", "count", "age"];
voteFrame = pds.DataFrame(data=voteData, columns = columns);

print("DataFrame object with None values:");
print(voteFrame);

# Process the data and remove rows with missing values
# only if those rows have the count column missing
processedFrame = voteFrame.dropna(subset=["count"]);
print(processedFrame);

Output:

DataFrame object with None values:

timestamp candidate count age

0 1620223791 A 51.0 34.0

1 1620223792 B 22.0 35.0

2 1620223793 C 67.0 33.0

3 1620223794 D 7.0 NaN

4 1620223795 E NaN 35.0

timestamp candidate count age

0 1620223791 A 51.0 34.0

1 1620223792 B 22.0 35.0

2 1620223793 C 67.0 33.0

3 1620223794 D 7.0 NaN

Removing of Duplicate Values:

The drop_duplicates() method removes duplicates entries of data present in a DataFrame.
During the duplicate removal process the first or the last occurence of the duplicate value in the row can be retained with the parameter "keep".
With the "inplace" parameter the duplicate values can be removed in the original DataFrame on which drop_duplicates() is called or the method can be made to return a new copy with the duplicate values removed.

Example - Removing duplicate rows from a pandas DataFrame:

# Example Python program to remove duplicate rows
# from a pandas DataFrame instance
import pandas as pds

# Boolean data as a pandas DataFrame instance
booleans = [(1, 1, 1, 0, 1),
(1, 1, 1, 0, 1),
(1, 0, 1, 0, 1),
(1, 0, 0, 0, 1),
(1, 1, 0, 0, 0)];

pdf = pds.DataFrame(data=booleans, columns = ["A", "B", "C", "D", "E"]);
print("DataFrame with duplicate rows:");
print(pdf);

# Remove the duplicates and print the processed pandas DataFrame
print("DataFrame with duplicate rows removed:");
ppdf = pdf.drop_duplicates();
print(ppdf);

Output:

DataFrame with duplicate rows:

A B C D E

0 1 1 1 0 1

1 1 1 1 0 1

2 1 0 1 0 1

3 1 0 0 0 1

4 1 1 0 0 0

DataFrame with duplicate rows removed:

A B C D E

0 1 1 1 0 1

2 1 0 1 0 1

3 1 0 0 0 1

4 1 1 0 0 0

Example:

# Example Python program that removes duplicate rows
# from a pandas DataFrame, based on specified columns
import pandas as pds

# Example data
bookData = [("978-9380816715", "Alice's Adventures in the Wonderland", 10),
("978-1975675691", "Through the Looking-Glass", 9),
("978-9382616597", "Gone with the Wind", 4),
("978-8175993259", "Gulliver's Travels", 3),
("978-9381607701", "Gulliver's Travels", 1),
("978-1408845646", "Harry Potter and the Philosopher's Stone", 2),
("978-1408883761", "Harry Potter and the Philosopher's Stone", 1),
("978-1408883761", "Harry Potter and the Philosopher's Stone", 5)];
bookFrame = pds.DataFrame(data=bookData, columns=["ISBN", "Title", "Copies"]);
print(bookFrame);

# Keep only two publications - (no to same ISBNs AND same Titles)
conciseFrame = bookFrame.drop_duplicates(subset=["ISBN", "Title"], keep="last");
print(conciseFrame);

Output:

ISBN Title Copies

0 978-9380816715 Alice's Adventures in the Wonderland 10

1 978-1975675691 Through the Looking-Glass 9

2 978-9382616597 Gone with the Wind 4

3 978-8175993259 Gulliver's Travels 3

4 978-9381607701 Gulliver's Travels 1

5 978-1408845646 Harry Potter and the Philosopher's Stone 2

6 978-1408883761 Harry Potter and the Philosopher's Stone 1

7 978-1408883761 Harry Potter and the Philosopher's Stone 5

ISBN Title Copies

0 978-9380816715 Alice's Adventures in the Wonderland 10

1 978-1975675691 Through the Looking-Glass 9

2 978-9382616597 Gone with the Wind 4

3 978-8175993259 Gulliver's Travels 3

4 978-9381607701 Gulliver's Travels 1

5 978-1408845646 Harry Potter and the Philosopher's Stone 2

7 978-1408883761 Harry Potter and the Philosopher's Stone 5

Removing missing values and duplicate values from a pandas DataFrame

Overview:

Removing of Missing Values:

Example - Removing pandas DataFrame rows containing missing values:

Output:

Data with empty values:

Example - Removing pandas DataFrame columns containing missing values:

Output:

Example-Removing rows/columns with missing values based on threshold:

Output:

Example-Remove rows if specified column values are missing:

Output:

Removing of Duplicate Values:

Example - Removing duplicate rows from a pandas DataFrame:

Output:

Example:

# Example Python program that removes duplicate rows # from a pandas DataFrame, based on specified columns import pandas as pds

Output:

# Example Python program that removes duplicate rows
# from a pandas DataFrame, based on specified columns
import pandas as pds