Overview:
A DataFrame in pandas is a two-dimensional container with rows and columns. The data can have column labels and a row index. In data processing, it is common for data to contain duplicate values and empty (missing) values, and these can occur in many different patterns. The pandas DataFrame class provides the methods dropna() and drop_duplicates() to handle these cases.
Removing Missing Values:
- The dropna() method of the DataFrame class provides several ways to remove missing values that occur in various patterns.
- Missing values can be removed row-wise or column-wise.
- A row or column can be removed if any one of its values is missing, or only if all of its values are missing. Alternatively, a threshold can be given: the minimum number of non-missing values a row or column must contain in order to be kept.
- A subset of columns or rows can be specified to define the scope of the missing value check. Missing values that occur in columns or rows outside this scope do not cause a removal.
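The options listed above correspond to the axis, how, thresh and subset parameters of dropna(). As a minimal sketch of the "any" versus "all" behaviour, the following uses a small made-up frame; the column names a, b, c and the values are illustrative assumptions, not data from this article.

```python
import pandas as pd

# Hypothetical frame: row 1 is partly empty, row 2 is entirely empty
df = pd.DataFrame({
    "a": [1.0, None, None],
    "b": [2.0, 5.0, None],
    "c": [3.0, 6.0, None],
})

# how="any" (the default): drop a row if at least one value is missing
any_missing_removed = df.dropna(how="any")

# how="all": drop a row only if every value in it is missing
all_missing_removed = df.dropna(how="all")

print(any_missing_removed)
print(all_missing_removed)
```

With how="any" only the complete row survives, while how="all" also keeps the partly filled row.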
Example - Removing pandas DataFrame rows containing missing values:
# Example Python program that removes rows containing empty values
# Construct example data with empty values
priceStream = {(0x202cb962ac59075b964b07152d234b70, "FUNCOMP", 12, 1200.45, 1619876978),
# Construct a DataFrame with raw data
# Remove rows with empty values
Output:
Data with empty values:
                                 timestamp  exchange_ref_no   qty     name       price
0   42767516990368493138776584305024125808          FUNCOMP  12.0  1200.45  1619876978
1  267174373771988661416381715658526078021          FUNCOMP   NaN  1201.45  1619876979
2  314185493295186862902690342039947364850          FUNCOMP  25.0      NaN  1619876972
3    8724878429673542145727510873258833644          FUNCOMP  25.0  1090.00  1619876971
4   82324359399928500054185503234815398877          FUNCOMP  10.0  1100.40  1619876980
Data with empty values removed:
                                 timestamp  exchange_ref_no   qty     name       price
0   42767516990368493138776584305024125808          FUNCOMP  12.0  1200.45  1619876978
3    8724878429673542145727510873258833644          FUNCOMP  25.0  1090.00  1619876971
4   82324359399928500054185503234815398877          FUNCOMP  10.0  1100.40  1619876980
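The row removal shown in this output can be sketched as follows. This is a reconstruction from the printed values, not the original program: the variable names are hypothetical, a list of tuples is used to keep the row order fixed, and the column order is an assumption chosen so that each label sits over the values it describes.

```python
import pandas as pd

# Rows as (exchange_ref_no, name, qty, price, timestamp); None marks a missing value
price_stream = [
    (42767516990368493138776584305024125808, "FUNCOMP", 12, 1200.45, 1619876978),
    (267174373771988661416381715658526078021, "FUNCOMP", None, 1201.45, 1619876979),
    (314185493295186862902690342039947364850, "FUNCOMP", 25, None, 1619876972),
    (8724878429673542145727510873258833644, "FUNCOMP", 25, 1090.00, 1619876971),
    (82324359399928500054185503234815398877, "FUNCOMP", 10, 1100.40, 1619876980),
]
# Assumed column order (see the note above)
columns = ["exchange_ref_no", "name", "qty", "price", "timestamp"]

# Construct a DataFrame with the raw data
price_data = pd.DataFrame(price_stream, columns=columns)
print("Data with empty values:")
print(price_data)

# Default dropna(): axis=0 and how="any", so every row with a missing value is dropped
price_data_clean = price_data.dropna()
print("Data with empty values removed:")
print(price_data_clean)
```

The surviving rows keep their original index labels, which is why the cleaned frame is indexed 0, 3, 4 rather than 0, 1, 2.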
Example - Removing pandas DataFrame columns containing missing values:
# Example Python program that removes columns containing empty values
# Example data
# Construct a DataFrame with raw data
# Remove columns with empty values from the DataFrame
Output:
Data with few empty readings:
   Sensor1  Sensor2  Sensor3  Sensor4  Sensor5
0     64.2     64.8     64.5     64.1     64.0
1     77.0     77.1     77.0     77.2     77.5
2     78.2     78.4     78.6     78.2     78.3
3     76.1     76.5      NaN     76.4     76.7
4     81.3     81.4     80.9     81.7      NaN
Data with columns containing empty values removed:
   Sensor1  Sensor2  Sensor4
0     64.2     64.8     64.1
1     77.0     77.1     77.2
2     78.2     78.4     78.2
3     76.1     76.5     76.4
4     81.3     81.4     81.7
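A column-wise removal that produces this output could look like the sketch below. The readings are taken from the printed table; the variable names are hypothetical and the original program may have been written differently.

```python
import numpy as np
import pandas as pd

# Example data: one list of readings per sensor, np.nan marks a missing reading
readings = {
    "Sensor1": [64.2, 77.0, 78.2, 76.1, 81.3],
    "Sensor2": [64.8, 77.1, 78.4, 76.5, 81.4],
    "Sensor3": [64.5, 77.0, 78.6, np.nan, 80.9],
    "Sensor4": [64.1, 77.2, 78.2, 76.4, 81.7],
    "Sensor5": [64.0, 77.5, 78.3, 76.7, np.nan],
}

# Construct a DataFrame with the raw data
sensor_data = pd.DataFrame(readings)
print("Data with few empty readings:")
print(sensor_data)

# axis=1 switches dropna() from rows to columns
sensor_data_clean = sensor_data.dropna(axis=1)
print("Data with columns containing empty values removed:")
print(sensor_data_clean)
```

Passing axis="columns" is equivalent to axis=1 and may read more clearly.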
Example - Removing rows/columns with missing values based on threshold:
# Example Python program that removes a column
import pandas as pds
# Example data
binaryDataFrame_Cln = binaryDataFrame.dropna()
Output:
Data with missing values present:
   A    B    C    D
0  1  0.0  NaN  NaN
1  0  1.0  0.0  1.0
2  1  1.0  0.0  0.0
3  1  0.0  1.0  0.0
4  0  NaN  1.0  NaN
Data with missing values removed based on a threshold of counts:
   A    B    C    D
1  0  1.0  0.0  1.0
2  1  1.0  0.0  0.0
3  1  0.0  1.0  0.0
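A threshold-based removal can be sketched with the thresh parameter, which gives the minimum number of non-missing values a row (or column) needs in order to be kept. The data below is rebuilt from the printed table; the threshold values 3 and 4 are assumptions, since the value used by the original program is not shown.

```python
import numpy as np
import pandas as pds

# Example data rebuilt from the printed table
binary_data = {
    "A": [1, 0, 1, 1, 0],
    "B": [0, 1, 1, 0, np.nan],
    "C": [np.nan, 0, 0, 1, 1],
    "D": [np.nan, 1, 0, 0, np.nan],
}
binaryDataFrame = pds.DataFrame(binary_data)
print("Data with missing values present:")
print(binaryDataFrame)

# Rows: keep only rows with at least 3 non-missing values (assumed threshold)
binaryDataFrame_Cln = binaryDataFrame.dropna(thresh=3)
print("Data with missing values removed based on a threshold of counts:")
print(binaryDataFrame_Cln)

# Columns: keep only columns with at least 4 non-missing values (assumed threshold)
columns_kept = binaryDataFrame.dropna(axis=1, thresh=4)
print(columns_kept)
```

Rows 0 and 4 each hold only two non-missing values, so they fall below the row threshold; column D holds three and falls below the column threshold.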
Example - Removing rows if values in specified columns are missing:
# Example Python program that removes the rows
# Data
print("DataFrame object with None values:")
# Process the data and remove rows with missing values
Output:
DataFrame object with None values:
    timestamp candidate  count   age
0  1620223791         A   51.0  34.0
1  1620223792         B   22.0  35.0
2  1620223793         C   67.0  33.0
3  1620223794         D    7.0   NaN
4  1620223795         E    NaN  35.0

    timestamp candidate  count   age
0  1620223791         A   51.0  34.0
1  1620223792         B   22.0  35.0
2  1620223793         C   67.0  33.0
3  1620223794         D    7.0   NaN
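In this output the row with a missing count is removed while the row with a missing age is kept, which is consistent with restricting the check to the count column. A minimal sketch under that assumption, with hypothetical variable names and data taken from the printed table:

```python
import pandas as pd

# Data rebuilt from the printed table; None marks a missing value
votes = {
    "timestamp": [1620223791, 1620223792, 1620223793, 1620223794, 1620223795],
    "candidate": ["A", "B", "C", "D", "E"],
    "count": [51, 22, 67, 7, None],
    "age": [34, 35, 33, None, 35],
}
vote_data = pd.DataFrame(votes)
print("DataFrame object with None values:")
print(vote_data)

# Only the "count" column is checked; the missing "age" in row 3 is ignored
vote_data_clean = vote_data.dropna(subset=["count"])
print(vote_data_clean)
```

Several labels can be passed in subset, in which case how and thresh are evaluated over those columns only.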
Removing Duplicate Values:
- The drop_duplicates() method removes duplicate rows present in a DataFrame.
- During duplicate removal, either the first or the last occurrence of a duplicated row can be retained, using the "keep" parameter.
- With the "inplace" parameter, the duplicates can be removed from the original DataFrame on which drop_duplicates() is called; otherwise the method returns a new copy with the duplicates removed.
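The effect of the keep and inplace parameters can be sketched on a small made-up frame; the column names key and value and the data are illustrative assumptions. Besides "first" and "last", keep also accepts False, which drops every copy of a duplicated row.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 are identical
df = pd.DataFrame({"key": ["x", "y", "x"], "value": [1, 2, 1]})

first_kept = df.drop_duplicates(keep="first")  # keeps row 0, drops row 2
last_kept = df.drop_duplicates(keep="last")    # keeps row 2, drops row 0
none_kept = df.drop_duplicates(keep=False)     # drops both row 0 and row 2

# inplace=True modifies df itself and returns None
result = df.drop_duplicates(inplace=True)

print(first_kept)
print(last_kept)
print(none_kept)
print(df)
```

Because the in-place call returns None, assigning its result back to the same variable would discard the DataFrame.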
Example - Removing duplicate rows from a pandas DataFrame:
# Example Python program to remove duplicate rows
# Boolean data as a pandas DataFrame instance
# Remove the duplicates and print the processed pandas DataFrame
Output:
DataFrame with duplicate rows:
   A  B  C  D  E
0  1  1  1  0  1
1  1  1  1  0  1
2  1  0  1  0  1
3  1  0  0  0  1
4  1  1  0  0  0
DataFrame with duplicate rows removed:
   A  B  C  D  E
0  1  1  1  0  1
2  1  0  1  0  1
3  1  0  0  0  1
4  1  1  0  0  0
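One way to reproduce this output is sketched below, with the data rebuilt from the printed table and hypothetical variable names. The sketch also calls duplicated(), which returns a boolean mask of the rows that drop_duplicates() would remove and can be handy for inspecting duplicates before deleting them.

```python
import pandas as pd

# Boolean data rebuilt from the printed table; rows 0 and 1 are identical
flags = pd.DataFrame(
    [
        [1, 1, 1, 0, 1],
        [1, 1, 1, 0, 1],
        [1, 0, 1, 0, 1],
        [1, 0, 0, 0, 1],
        [1, 1, 0, 0, 0],
    ],
    columns=list("ABCDE"),
)
print("DataFrame with duplicate rows:")
print(flags)

# True marks a row that repeats an earlier row
is_repeat = flags.duplicated()
print(is_repeat)

# Remove the duplicates; by default the first occurrence is kept
deduplicated = flags.drop_duplicates()
print("DataFrame with duplicate rows removed:")
print(deduplicated)
```

Selecting flags[~is_repeat] gives the same rows as drop_duplicates() with its default arguments.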
Example:
# Example Python program that removes duplicate rows
Output:
             ISBN                                     Title  Copies
0  978-9380816715      Alice's Adventures in the Wonderland      10
1  978-1975675691                 Through the Looking-Glass       9
2  978-9382616597                        Gone with the Wind       4
3  978-8175993259                        Gulliver's Travels       3
4  978-9381607701                        Gulliver's Travels       1
5  978-1408845646  Harry Potter and the Philosopher's Stone       2
6  978-1408883761  Harry Potter and the Philosopher's Stone       1
7  978-1408883761  Harry Potter and the Philosopher's Stone       5

             ISBN                                     Title  Copies
0  978-9380816715      Alice's Adventures in the Wonderland      10
1  978-1975675691                 Through the Looking-Glass       9
2  978-9382616597                        Gone with the Wind       4
3  978-8175993259                        Gulliver's Travels       3
4  978-9381607701                        Gulliver's Travels       1
5  978-1408845646  Harry Potter and the Philosopher's Stone       2
7  978-1408883761  Harry Potter and the Philosopher's Stone       5
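In this output only row 6 disappears: it shares its ISBN with row 7, and the later of the two rows is the one kept. That is consistent with comparing rows on the ISBN column alone and retaining the last occurrence. The sketch below assumes exactly that call (subset=["ISBN"], keep="last"); the program that produced the output is not shown, so the arguments and variable names are inferred.

```python
import pandas as pd

# Data rebuilt from the printed table
books = pd.DataFrame({
    "ISBN": [
        "978-9380816715", "978-1975675691", "978-9382616597", "978-8175993259",
        "978-9381607701", "978-1408845646", "978-1408883761", "978-1408883761",
    ],
    "Title": [
        "Alice's Adventures in the Wonderland",
        "Through the Looking-Glass",
        "Gone with the Wind",
        "Gulliver's Travels",
        "Gulliver's Travels",
        "Harry Potter and the Philosopher's Stone",
        "Harry Potter and the Philosopher's Stone",
        "Harry Potter and the Philosopher's Stone",
    ],
    "Copies": [10, 9, 4, 3, 1, 2, 1, 5],
})
print(books)

# Rows count as duplicates when their ISBN matches; the last occurrence is kept
unique_books = books.drop_duplicates(subset=["ISBN"], keep="last")
print(unique_books)
```

Rows with the same title but different ISBNs are left alone, because only the columns named in subset take part in the comparison.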