Data Cleaning in Pandas

Carson West

This note covers data cleaning techniques in the Pandas library, with a focus on practical application and common issues.

Key areas covered in the example below: handling missing values, handling outliers, and string manipulation.

import pandas as pd
import numpy as np

# Sample DataFrame with a missing value (NaN), an outlier (100), and inconsistent casing ('Apple')
data = {'A': [1, 2, np.nan, 4, 5, 100],
        'B': ['apple', 'banana', 'orange', 'apple', 'banana', 'Apple']}
df = pd.DataFrame(data)

# Handling missing values
df_cleaned = df.dropna().copy()  # Remove rows with NaN; copy so the later column assignment is unambiguous
df_filled = df.fillna({'A': df['A'].mean()})  # Fill NaN in column A with the column mean

# Handling outliers (example: keeping only rows where column A is <= 10)
df_no_outliers = df_filled[df_filled['A'] <= 10]

# String manipulation: lowercase and remove leading/trailing whitespace
df_cleaned['B'] = df_cleaned['B'].str.lower().str.strip()

print(df)
print(df_cleaned)
print(df_filled)
print(df_no_outliers)
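
One thing to watch in the example above: the mean of column A is 22.4 because it includes the outlier 100, so the filled value is itself above the threshold of 10 and gets dropped by the outlier filter. A minimal sketch of one possible alternative is to fill with the median, which is less sensitive to extreme values. The helper name fill_then_filter and its threshold parameter are hypothetical and used only for illustration. This swaps the median in for the mean and is not the only reasonable choice.

```python
import numpy as np
import pandas as pd


def fill_then_filter(values, threshold=10):
    """Fill NaN with the median, then keep only values <= threshold.

    Hypothetical helper for illustration; the threshold of 10 mirrors
    the example above and is an assumption, not a general rule.
    """
    s = pd.Series(values, dtype="float64")
    filled = s.fillna(s.median())  # median() skips NaN by default
    return filled[filled <= threshold]


a = [1, 2, np.nan, 4, 5, 100]
s = pd.Series(a)

mean_fill = s.mean()      # pulled upward by the outlier 100
median_fill = s.median()  # robust to the outlier

result = fill_then_filter(a)
print(mean_fill, median_fill)
print(result)
```

With the median, the filled row survives the outlier filter and only the row containing 100 is removed. Removing outliers first and then filling with the mean is another option, depending on the analysis.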

Remember to always explore and understand your data before applying any cleaning techniques. The best approach depends heavily on the specific dataset and the intended analysis.
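
As a rough sketch of what that exploration might look like on the sample data from this note, the snippet below counts missing values, summarizes the numeric column, and compares category counts before and after normalizing case. The variable names are illustrative assumptions. Which checks matter will vary by dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5, 100],
    "B": ["apple", "banana", "orange", "apple", "banana", "Apple"],
})

# How many values are missing in each column?
missing_per_column = df.isna().sum()

# Summary statistics: a max far above the upper quartile hints at an outlier
summary_a = df["A"].describe()

# Category counts before and after normalizing whitespace and case
raw_counts = df["B"].value_counts()
normalized_counts = df["B"].str.strip().str.lower().value_counts()

print(missing_per_column)
print(summary_a)
print(raw_counts)
print(normalized_counts)
```

Here the raw counts treat 'apple' and 'Apple' as different categories, which is the kind of inconsistency worth spotting before deciding how to clean column B.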