Introduction to Python Pandas Data Cleaning

In This Article, You Will Learn About Python Pandas Data Cleaning

Before moving ahead, it helps to know how to analyze Pandas DataFrames.

Data Cleaning

Data cleaning means fixing or removing empty (null), wrong, duplicate, and irrelevant data in the data set.

Cleaning Empty Cells

Empty Cell

While analyzing a data set, empty cells can give you the wrong result.
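
As a rough illustration of how empty cells can distort a result, the sketch below counts the empty cells in each column and compares two ways of averaging a column. The DataFrame and its column names are hypothetical sample data, not the article's data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; it stands in for a real data set.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Score": [90.0, np.nan, 85.0, np.nan],
})

# isna() marks each empty cell as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# The same column gives different averages depending on how
# the empty cells are handled.
mean_ignoring_empty = df["Score"].mean()            # NaN values are skipped
mean_empty_as_zero = df["Score"].fillna(0).mean()   # NaN values counted as 0
print(mean_ignoring_empty, mean_empty_as_zero)
```

Neither average is automatically the right one; the point is only that the handling of empty cells changes the answer.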

Remove Rows

One way to deal with empty cells is to remove the rows that contain them.

This is usually acceptable: data sets can be very large, so removing a few rows with empty cells will not significantly affect the overall result.

To remove rows or columns that contain empty or null values, we will use the dropna() method.

Example – Return a new DataFrame with no empty cells.

				
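import pandas as pd

# Load the data set from a CSV file
data = pd.read_csv('data.csv')

# Build a new DataFrame without the rows that contain empty cells
updated_data = data.dropna()

# Print the whole cleaned DataFrame
print(updated_data.to_string())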
				
			

Note: By default, the dropna() method returns a new DataFrame and leaves the original DataFrame unchanged.
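
The note can be checked with a small sketch. It uses a hypothetical in-memory DataFrame instead of data.csv, and it also shows the axis argument, which makes dropna() remove columns rather than rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv.
data = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Score": [90.0, np.nan, 85.0],
})

# Default behavior: drop every row that has at least one empty cell.
rows_dropped = data.dropna()

# axis="columns": drop every column that has at least one empty cell.
cols_dropped = data.dropna(axis="columns")

# The original DataFrame still has all of its rows and its empty cell.
print(len(data), len(rows_dropped))
print(list(cols_dropped.columns))
```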

To change the original DataFrame instead, use the inplace = True argument.

Example – Remove all rows with NULL values:

				
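import pandas as pd

# Load the data set from a CSV file
data = pd.read_csv('file.csv')

# Remove the rows with null values from data itself
data.dropna(inplace = True)

# Print the whole cleaned DataFrame
print(data.to_string())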

				
			

Note: Because inplace is True, dropna() does not return a new DataFrame; it removes the rows with null values from the existing DataFrame directly.

Replace Empty Values

Another way to deal with empty cells is to insert a new value in their place.

This way, you do not have to delete entire rows just because some of their cells are empty.

To insert a new value into empty cells, we will use the fillna() method.

Example – Replace all null values with a new value, 94.

				
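import pandas as pd

# Load the data set from a CSV file
df = pd.read_csv('data.csv')

# Replace every empty cell in the DataFrame with 94
df.fillna(94, inplace = True)

# Print the whole DataFrame
print(df.to_string())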
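
The sketch below runs the same idea on a hypothetical in-memory DataFrame (the "Attempts" column is made up for illustration). It shows that fillna() with a single value fills the empty cells of every column, which is not always what you want.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with empty cells in two different columns.
df = pd.DataFrame({
    "Score": [90.0, np.nan, 85.0],
    "Attempts": [1.0, 2.0, np.nan],
})

# A single fill value is applied to every empty cell in every column.
filled = df.fillna(94)
print(filled.to_string())
```

Here 94 also lands in "Attempts", where it is probably not a sensible value.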

				
			

Replace Only Specified Columns

To replace empty values in only one column, specify the column name for the DataFrame.

Example – Replace null values in the “Score” column with the number 94.

				
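import pandas as pd

# Load the data set from a CSV file
df = pd.read_csv('data.csv')

# Fill only the "Score" column and assign the result back;
# fillna(inplace = True) on df["Score"] may not update df in recent pandas versions
df["Score"] = df["Score"].fillna(94)

# Print the whole DataFrame
print(df.to_string())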
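
An alternative is to pass fillna() a dictionary that maps column names to fill values. The sketch below assumes a hypothetical DataFrame with a "Score" column and a made-up "Attempts" column; only the column named in the dictionary is filled.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with empty cells in two different columns.
df = pd.DataFrame({
    "Score": [90.0, np.nan, 85.0],
    "Attempts": [1.0, 2.0, np.nan],
})

# Keys are column names, values are the fill value for that column.
# Columns that are not listed keep their empty cells.
df = df.fillna({"Score": 94})
print(df.to_string())
```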

				
			

Replace Using Mean, Median, or Mode

Empty cells can also be replaced with the mean, median, or mode of the column.

Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column:

Example – Calculate the MEAN, and replace empty values with it.

				
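import pandas as pd

df = pd.read_csv('data.csv')

# Mean (average) of the non-empty values in the column
x = df["Score"].mean()

# Assign the filled column back so the change is applied to df
df["Score"] = df["Score"].fillna(x)

print(df.to_string())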

				
			

Example – Calculate the MEDIAN, and replace empty values with it.

				
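import pandas as pd

df = pd.read_csv('data.csv')

# Median (middle value) of the non-empty values in the column
x = df["Score"].median()

# Assign the filled column back so the change is applied to df
df["Score"] = df["Score"].fillna(x)

print(df.to_string())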

				
			

Example – Calculate the MODE, and replace empty values with it.

				
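import pandas as pd

df = pd.read_csv('data.csv')

# mode() can return several values; [0] takes the first one
x = df["Score"].mode()[0]

# Assign the filled column back so the change is applied to df
df["Score"] = df["Score"].fillna(x)

print(df.to_string())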
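
To compare the three choices side by side, here is a minimal sketch on a hypothetical "Score" series (the numbers are made up). It also shows why the mode example indexes with [0]: mode() returns a Series, because more than one value can be tied for most frequent.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one empty cell.
scores = pd.Series([50.0, np.nan, 50.0, 90.0, 100.0, 110.0], name="Score")

# All three methods skip the empty cell when calculating.
mean_value = scores.mean()
median_value = scores.median()
mode_value = scores.mode()[0]
print(mean_value, median_value, mode_value)

# Each choice fills the empty cell with a different value.
filled_with_mean = scores.fillna(mean_value)
filled_with_median = scores.fillna(median_value)
filled_with_mode = scores.fillna(mode_value)

# When two values are tied, mode() returns both of them.
tied_modes = pd.Series([70.0, 70.0, 80.0, 80.0]).mode().tolist()
print(tied_modes)
```

Which of the three is the better fill value depends on the data; the median, for example, is less affected by a few extreme scores than the mean.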

				
			
