Pandas Fixing Wrong Data

Wrong data refers to values that are incorrect, unrealistic, or invalid, even though they may be in the correct format. Examples include negative ages, impossible dates, incorrect categories, or out-of-range values. Pandas provides several ways to detect and fix such data.


1. Identify Wrong Data

Check Summary Statistics

df.describe()

Inspect Unique Values

df["Age"].unique()
df["City"].value_counts()
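The checks above can be read together: the min and max rows of describe() tend to expose out-of-range numbers, and value_counts() tends to expose spelling variants. A minimal sketch, assuming a small DataFrame invented for illustration:

```python
import pandas as pd

# Sample data invented for illustration
df = pd.DataFrame({
    "Age": [22, -5, 150, 31],
    "City": ["Pune", "pune", "Delhi", "Pune"],
})

# describe() summarizes numeric columns; min and max reveal impossible values
bounds = df.describe().loc[["min", "max"]]
print(bounds)

# value_counts() lists every distinct label, so "Pune" and "pune" show up separately
city_counts = df["City"].value_counts()
print(city_counts)
```

Here a minimum age of -5 and a maximum of 150 both suggest bad entries, and the two spellings of the same city point to an inconsistent category.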

2. Fixing Values Using Conditions

Example: Fix Negative or Zero Age

invalid_age = df["Age"] <= 0  # compute the mean from valid ages only
df.loc[invalid_age, "Age"] = df.loc[~invalid_age, "Age"].mean()
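When a wrong value is replaced with a statistic, it may help to compute that statistic from the valid rows only, because invalid values pull the mean toward themselves. A small sketch with ages invented for illustration:

```python
import pandas as pd

# Sample ages invented for illustration; -10 is the invalid entry
df = pd.DataFrame({"Age": [20.0, 30.0, -10.0, 40.0]})

invalid = df["Age"] <= 0

# Mean over every row, including the invalid one
mean_all = df["Age"].mean()
# Mean over the valid rows only
mean_valid = df.loc[~invalid, "Age"].mean()

# Fill the invalid entry with the mean of the valid ages
df.loc[invalid, "Age"] = mean_valid

print(mean_all, mean_valid)
print(df)
```

The two means differ noticeably here, which is why the valid-rows version is usually the safer fill value.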

3. Replace Wrong Values

df.replace(-1, pd.NA, inplace=True)
df.replace("Unknown", pd.NA, inplace=True)
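Replacing a value across the whole DataFrame can also touch columns where that value is legitimate. One possible refinement is to scope each placeholder to its own column with a nested dictionary. The sketch below assumes invented sample data and uses np.nan as the missing marker so the numeric column stays numeric:

```python
import numpy as np
import pandas as pd

# Sample data invented for illustration; -1 and "Unknown" are placeholders
df = pd.DataFrame({
    "Age": [25, -1, 40],
    "City": ["Pune", "Unknown", "Delhi"],
})

# Nested dict: in column "Age" replace -1, in column "City" replace "Unknown"
df = df.replace({"Age": {-1: np.nan}, "City": {"Unknown": np.nan}})

# Count the missing values per column after the replacement
print(df.isna().sum())
```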

4. Removing Wrong Rows

# Keep only rows with Age below 100 (rows with a missing Age are dropped too)
df = df[df["Age"] < 100]

5. Fixing Out-of-Range Data

df.loc[df["Marks"] > 100, "Marks"] = 100
df.loc[df["Marks"] < 0, "Marks"] = 0
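The two assignments above clamp Marks to the range 0 to 100. The same idea can be sketched with Series.clip, which handles both bounds in one call. The sample data and the MarksAdjusted flag column are hypothetical, added only to show which rows were changed:

```python
import pandas as pd

# Sample marks invented for illustration
df = pd.DataFrame({"Marks": [85, 110, -10]})

# Flag the out-of-range rows before overwriting them (hypothetical column name)
df["MarksAdjusted"] = ~df["Marks"].between(0, 100)

# clip() raises values below 0 to 0 and lowers values above 100 to 100
df["Marks"] = df["Marks"].clip(lower=0, upper=100)

print(df)
```

Keeping a flag column is a design choice: it preserves a record of which values were altered, which can matter when the cleaned data is audited later.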

6. Fixing Inconsistent Categories

df["Gender"] = df["Gender"].replace({
    "M": "Male",
    "F": "Female",
    "male": "Male",
    "female": "Female"
})
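A fixed mapping like the one above only catches the spellings it lists. One way to cover more variants is to trim whitespace and lowercase the values before mapping. This is a sketch under the assumption that case and surrounding spaces carry no meaning in the column; the sample values are invented:

```python
import pandas as pd

# Sample values invented for illustration
df = pd.DataFrame({"Gender": [" M", "female ", "MALE", "f", "Other"]})

# Normalize first so a single mapping covers every spelling
normalized = df["Gender"].str.strip().str.lower()

mapping = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# map() returns NaN for labels missing from the mapping;
# fillna() restores the original value in those rows
df["Gender"] = normalized.map(mapping).fillna(df["Gender"])

print(df["Gender"].tolist())
```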

7. Fixing Date Errors

df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

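Remove future dates (rows whose date failed to parse and became NaT are dropped as well, because NaT fails the comparison):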

df = df[df["Date"] <= pd.Timestamp.today()]
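It can be useful to count how many dates were unparseable and how many were in the future before dropping anything. A minimal sketch, assuming invented sample strings in year-month-day order:

```python
import pandas as pd

# Sample dates invented for illustration: one valid, one impossible, one in the future
df = pd.DataFrame({"Date": ["2023-01-15", "2023-02-30", "2100-01-01"]})

# Impossible dates such as February 30 become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

unparseable = df["Date"].isna()
in_future = df["Date"] > pd.Timestamp.today()

n_unparseable = int(unparseable.sum())
n_future = int(in_future.sum())
print(n_unparseable, n_future)

# Keep only rows with a parsed date that is not in the future
df = df[~unparseable & ~in_future]
print(df)
```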

8. Filling Corrected Missing Values

df["Age"] = df["Age"].fillna(df["Age"].median())

9. Real-World Example

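import pandas as pd

data = {
    "Name": ["Amit", "Riya", "Karan"],
    "Age": [22, -5, 150],
    "Marks": [85, 110, -10]
}

df = pd.DataFrame(data)

# Replace negative ages with the median of the valid ages (0 to 100)
valid_age_median = df.loc[df["Age"].between(0, 100), "Age"].median()
df.loc[df["Age"] < 0, "Age"] = valid_age_median
# Cap unrealistically high ages at 100
df.loc[df["Age"] > 100, "Age"] = 100

# Clamp marks to the range 0 to 100
df.loc[df["Marks"] > 100, "Marks"] = 100
df.loc[df["Marks"] < 0, "Marks"] = 0

print(df)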


10. Best Practices

✔ Define valid data ranges
✔ Use conditional checks
✔ Replace or remove invalid values
✔ Recheck data after fixing
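
The last point, rechecking after fixing, can be sketched as a small helper that reports any column still holding out-of-range values. The function name check_ranges, the rules dictionary, and the sample data are all hypothetical:

```python
import pandas as pd

def check_ranges(df, rules):
    """Return one message per column that still has values outside its range.

    Missing values also count as outside the range.
    """
    problems = []
    for col, (low, high) in rules.items():
        bad = ~df[col].between(low, high)
        if bad.any():
            problems.append(f"{col}: {int(bad.sum())} value(s) outside [{low}, {high}]")
    return problems

# Hypothetical valid ranges (inclusive) and an already-cleaned sample
rules = {"Age": (0, 100), "Marks": (0, 100)}
cleaned = pd.DataFrame({"Age": [22, 22, 100], "Marks": [85, 100, 0]})

problems = check_ranges(cleaned, rules)
print(problems)
```

An empty list means every checked column is within its declared range; any message points to a column that needs another pass.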


Conclusion

Fixing wrong data ensures your dataset reflects realistic and meaningful values. Pandas makes it easy to detect, correct, or remove invalid data for reliable analysis.
