In This Article, You Will Learn About Python Pandas Data Cleaning
Python Pandas Cleaning Data: before moving ahead, make sure you are familiar with Pandas DataFrames Analyzing.
Data Cleaning
Data cleaning means fixing or removing empty (null), incorrect, duplicate, and irrelevant data in a data set.
Cleaning Empty Cells
Empty Cell
While analyzing a data set, empty cells can lead to wrong results.
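Before cleaning, it can help to see where the empty cells are. The following is a minimal sketch, assuming a small in-memory DataFrame instead of the CSV file; the "Name" column and all values are made up for illustration, and only "Score" mirrors the column used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; np.nan marks an empty cell.
data = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Score": [90.0, np.nan, 70.0, np.nan],
})

# isna() flags every empty cell as True; sum() counts them per column.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one empty cell.
rows_with_missing = data[data.isna().any(axis=1)]
print(rows_with_missing)
```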
Remove Rows
One way to deal with empty cells is to remove the rows that contain them.
This is usually acceptable: data sets can be large, so removing a few rows will not significantly affect the overall result.
To remove rows or columns that contain empty or null values, use the dropna() method.
Example – Return a new DataFrame with no empty cells.
import pandas as pd
data = pd.read_csv('data.csv')
updated_data = data.dropna()
print(updated_data.to_string())
Note: By default, the dropna() method returns a new DataFrame and leaves the original DataFrame unchanged.
To make changes in the original DataFrame, use the inplace = True argument.
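The default behavior can be checked with a small sketch. It assumes an in-memory DataFrame with made-up values (the "Name" column is hypothetical) and also shows the axis=1 argument, which drops columns rather than rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the CSV file.
data = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Score": [90.0, np.nan, 70.0],
})

# Default: drop every row that has at least one empty cell.
rows_dropped = data.dropna()

# axis=1 drops the columns that contain empty cells instead.
columns_dropped = data.dropna(axis=1)

print(rows_dropped)
print(columns_dropped)

# The original DataFrame still has all of its rows and its empty cell.
print(len(data))
```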
Example – Remove all rows with NULL values:
import pandas as pd
data = pd.read_csv('data.csv')
data.dropna(inplace = True)
print(data.to_string())
Note: Since inplace is True, dropna() does not return a new DataFrame. Instead, it removes the rows with null values from the existing DataFrame.
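A short sketch of the in-place behavior, assuming made-up in-memory data rather than the CSV file. The pandas documentation describes the return value of dropna() with inplace=True as None, so the result of the call is not assigned to anything here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the CSV file.
data = pd.DataFrame({"Score": [90.0, np.nan, 70.0]})

rows_before = len(data)
data.dropna(inplace=True)  # modifies data itself; the result is not assigned
rows_after = len(data)

print(rows_before, rows_after)

# The remaining rows keep their original index labels.
print(data.index.tolist())
```

If consecutive index labels are needed afterwards, data.reset_index(drop=True) returns a renumbered copy.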
Replace Empty Values
Another way to deal with empty cells is to insert a new value in place of each empty cell.
This way, you do not have to delete entire rows just because some of their cells are empty.
To insert a new value into empty cells, use the fillna() method.
Example – Replace all null values with a new value, here 94:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(94, inplace = True)
print(df.to_string())
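To see what this does across a whole DataFrame, here is a minimal sketch with made-up in-memory data; the "Bonus" column is hypothetical. It suggests that fillna() with a single value fills the empty cells of every column, which may not be what you want when columns hold different kinds of data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with empty cells in two columns.
df = pd.DataFrame({
    "Score": [90.0, np.nan, 70.0],
    "Bonus": [np.nan, 5.0, np.nan],
})

# A single value is applied to the empty cells of all columns.
filled = df.fillna(94)
print(filled)

# Without inplace=True, the original DataFrame keeps its empty cells.
print(df.isna().sum().sum())
```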
Replace Only Specified Columns
To replace empty values in one column only, specify the column name on the DataFrame.
Example – Replace null values in the "Score" column with the number 94.
import pandas as pd
df = pd.read_csv('data.csv')
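# Assign the result back: in recent pandas versions, fillna(inplace = True)
# on a selected column may not update the original DataFrame.
df["Score"] = df["Score"].fillna(94)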
print(df.to_string())
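An alternative sketch for filling only selected columns passes a dictionary to fillna(), mapping each column name to its replacement value. The data and the "Bonus" column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with empty cells in two columns.
df = pd.DataFrame({
    "Score": [90.0, np.nan, 70.0],
    "Bonus": [np.nan, 5.0, np.nan],
})

# Only the columns named in the dictionary are filled.
df = df.fillna({"Score": 94})
print(df)
```

Columns that are not listed in the dictionary, such as "Bonus" here, keep their empty cells.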
Replace Using Mean, Median, or Mode
Empty cells can also be replaced with the mean, median, or mode of the column.
Pandas provides the mean(), median(), and mode() methods to calculate these values for a specified column:
Example – Calculate the MEAN, and replace empty values with it.
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Score"].mean()
# Assign the result back so the DataFrame itself is updated.
df["Score"] = df["Score"].fillna(x)
print(df.to_string())
Example – Calculate the MEDIAN, and replace empty values with it.
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Score"].median()
# Assign the result back so the DataFrame itself is updated.
df["Score"] = df["Score"].fillna(x)
print(df.to_string())
Example – Calculate the MODE, and replace empty values with it.
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Score"].mode()[0]
# Assign the result back so the DataFrame itself is updated.
df["Score"] = df["Score"].fillna(x)
print(df.to_string())
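The three statistics can give quite different fill values. The sketch below compares them on made-up scores that include one unusually high value; the numbers are illustrative only and do not come from the data set above.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one empty cell and one outlier (150).
df = pd.DataFrame({"Score": [80.0, 80.0, 90.0, 100.0, 150.0, np.nan]})

# All three methods skip empty cells by default.
# mode() returns a Series because several values can tie; [0] takes the first.
fill_values = {
    "mean": df["Score"].mean(),
    "median": df["Score"].median(),
    "mode": df["Score"].mode()[0],
}

for name, value in fill_values.items():
    filled = df["Score"].fillna(value)
    print(name, value, filled.tolist())
```

With this data, the mean is 100.0, the median is 90.0, and the mode is 80.0. The outlier pulls the mean upward, so the median is often the safer choice for skewed columns.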