In this article, you will learn about Python pandas DataFrame operations for cleaning data.
Data of Wrong Format
Cells with data in an incorrect format can make it difficult, or even impossible, to analyze the data.
To correct the issue, you have two options: either delete the affected rows or convert all cells in the column to the same format.
Convert Into a Correct Format
Let’s try to convert all cells in the ‘Date’ column into dates.
Pandas has a function for converting date-like data into dates: to_datetime().
Example – Convert the ‘Date’ column to the correct date format.
import pandas as pd
file = pd.read_csv('data.csv')
file['Date'] = pd.to_datetime(file['Date'])
print(file.to_string())
In the output of the previous example, a missing date appears as NaT (Not a Time), which pandas treats as a null value. You can delete such rows with the dropna() method.
import pandas as pd
file = pd.read_csv('data.csv')
file['Date'] = pd.to_datetime(file['Date'])
file.dropna(subset=['Date'], inplace=True)
print(file.to_string())
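The same conversion and clean-up can be tried without a CSV file. The following is a minimal sketch, assuming a small made-up DataFrame in place of data.csv (the values and the name cleaned are hypothetical). It also assumes you want unparseable text to become NaT rather than raise an error, which is what the errors="coerce" argument of to_datetime() does.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", None, "not a date"],
    "Marks": [60, 72, 55, 68],
})

# errors="coerce" turns values that cannot be parsed into NaT
# instead of raising an exception
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Keep only the rows that have a valid date; the result is a new DataFrame
cleaned = df.dropna(subset=["Date"])
print(cleaned.to_string())
```

Here the missing date and the text "not a date" both end up as NaT, so only the first two rows remain in cleaned.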
Fixing Wrong Data
Wrong data does not have to be an “empty cell” or a “wrong format”; it can simply be a value that is incorrect, for example one that breaks a specific rule.
For example, in our data you can clearly see that the date in row 4 is “NaN”, and the same is true of row 6.
A simple way to correct wrong data is to replace it with the correct value.
Example – Replace a value with new data. Set “Marks” to 65 in row 3.
import pandas as pd
file = pd.read_csv('data.csv')
file.loc[3, 'Marks'] = 65
print(file.to_string())
The output shows the DataFrame with the value 65 in the “Marks” column of row 3.
In small data sets you can correct wrong values one at a time, but this is not practical for large data sets.
To replace wrong data in larger data sets, you can define rules, e.g., set boundaries for legal values, and then replace any value that falls outside those boundaries.
Example – Loop through all values in the “Marks” column. If the value is higher than 73, set it to 80.
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Marks"] > 73:
        df.loc[x, "Marks"] = 80
print(df.to_string())
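The loop above checks one row at a time. As an alternative sketch, the same rule can be written with a Boolean mask, which applies the condition to the whole column at once. The DataFrame below is made-up sample data and the name too_high is hypothetical; the threshold 73 and the replacement value 80 are taken from the example above.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({"Marks": [60, 75, 73, 90]})

# Boolean mask: True for every row where "Marks" is higher than 73
too_high = df["Marks"] > 73

# Set "Marks" to 80 in all rows selected by the mask
df.loc[too_high, "Marks"] = 80
print(df.to_string())
```

On large data sets this form is usually shorter and faster than looping over the index.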
Another way of dealing with wrong data is to delete the rows that contain it.
That way, you do not have to work out what to replace the values with, and there is a good chance that you do not need those rows for your analysis.
Example – Delete rows where “Marks” is higher than 73.
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Marks"] > 73:
        df.drop(x, inplace=True)
print(df.to_string())
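Rows can also be removed without a loop by keeping only the rows that do not match the condition. The following is a minimal sketch on made-up sample data (the column values and the name kept are hypothetical). It negates the same "higher than 73" test with ~, so that rows the loop above would keep are kept here too.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [60, 75, 73, 90],
})

# ~ inverts the mask: keep every row where "Marks" is NOT higher than 73
kept = df[~(df["Marks"] > 73)]
print(kept.to_string())
```

This returns a new DataFrame and leaves df unchanged, so the original rows are still available if you need them.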
Duplicate rows are rows that appear more than once in the data.
To discover duplicate rows, use the duplicated() method.
The duplicated() method returns a Boolean value for each row.
Example – Return True for every row that is a duplicate, otherwise False.
import pandas as pd
file = pd.read_csv('data.csv')
print(file.duplicated())
As a result, it returns True for each row that duplicates an earlier row.
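To see which rows are flagged, here is a small sketch on made-up sample data (the names and values are hypothetical). By default duplicated() marks only the later copies of a row; passing keep=False marks every copy, including the first.

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 0
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cy"],
    "Marks": [60, 72, 60, 72],
})

# Default (keep="first"): the first occurrence is not marked as a duplicate
flags = df.duplicated()
print(flags)

# keep=False marks all rows that have at least one identical copy
all_copies = df.duplicated(keep=False)
print(all_copies)
```

Note that a row counts as a duplicate only when all of its columns match; rows 1 and 3 share the same "Marks" value but are not duplicates.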
To remove duplicate rows, use the drop_duplicates() method.
Example – Use the drop_duplicates() method to remove duplicate rows.
import pandas as pd
file = pd.read_csv('data.csv')
file.drop_duplicates(inplace=True)
print(file.to_string())
The output shows the DataFrame with the duplicate rows removed.
Note: With the argument inplace = True, the method does not return a new DataFrame; instead, it removes the duplicate rows from the original DataFrame.
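The difference between the two calling styles can be checked directly. This is a minimal sketch on made-up sample data (the names deduped and result are hypothetical): without inplace, drop_duplicates() returns a new DataFrame; with inplace=True, it returns None and changes the original.

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 0
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Marks": [60, 72, 60],
})

# Without inplace: a new DataFrame is returned and df still has 3 rows
deduped = df.drop_duplicates()
print(len(deduped), len(df))

# With inplace=True: nothing is returned and df itself now has 2 rows
result = df.drop_duplicates(inplace=True)
print(result, len(df))
```

Assigning the returned DataFrame to a new name keeps the original data intact, which can be useful while you are still exploring the data.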