# Handling Duplicates: The Pythonic Way
Welcome to **Day 123**. Today we're cleaning again. Duplicates are the enemy of accuracy.
## Finding Duplicates

```python
# Returns True for every row after its first occurrence (i.e. the repeats)
is_duplicate = df.duplicated()

# Find duplicates based only on the email column
dupe_emails = df[df.duplicated(subset=['email'])]
```
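Here is a minimal, self-contained example (with hypothetical user data) showing exactly which rows `duplicated()` flags:

```python
import pandas as pd

# Hypothetical sample data with one repeated email
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana"],
    "email": ["ana@example.com", "ben@example.com", "ana@example.com"],
})

# duplicated() marks every occurrence AFTER the first as True
print(df.duplicated().tolist())  # [False, False, True]

# Restricting the check to one column works the same way
print(df[df.duplicated(subset=["email"])]["name"].tolist())  # ['Ana']
```

Note that the first `Ana` row is *not* flagged: only subsequent repeats count as duplicates by default.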
## Removing Duplicates

```python
# Keep the FIRST occurrence, drop the rest
df_clean = df.drop_duplicates(subset=['email'], keep='first')

# Keep the LAST (useful if the last row has the latest data)
df_latest = df.drop_duplicates(subset=['email'], keep='last')
```
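A quick sketch (with hypothetical order data, assumed to be sorted oldest-to-newest) of why `keep='last'` retains the most recent row:

```python
import pandas as pd

# Hypothetical orders, appended chronologically: the later row is newer
orders = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "total": [10, 20, 30],
})

# keep='last' drops the older a@x.com row (total=10)
latest = orders.drop_duplicates(subset=["email"], keep="last")
print(latest["total"].tolist())  # [20, 30]
```

This only gives you "the latest data" if rows really are appended in time order; if not, sort by a timestamp column first.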
## Comparison to SQL

In SQL (Day 80), we used `DISTINCT ON` for this. In Pandas, `drop_duplicates` expresses the same idea more directly: you name the subset of columns and which occurrence to keep, without needing an `ORDER BY` clause for the simple cases.
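As a rough equivalence (a sketch, not an exact translation), Postgres's `SELECT DISTINCT ON (email) ... ORDER BY email, created_at DESC` maps to a sort followed by `drop_duplicates`:

```python
import pandas as pd

# Hypothetical table with a created_at timestamp per row
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "created_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
})

# Sort newest-first, then keep the first (= newest) row per email,
# mirroring DISTINCT ON (email) ... ORDER BY email, created_at DESC
newest = df.sort_values("created_at", ascending=False).drop_duplicates(subset=["email"])
print(sorted(newest["created_at"].dt.strftime("%Y-%m-%d")))  # ['2024-01-15', '2024-02-01']
```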
## Your Task for Today

Identify rows that share the same 'User ID' but have different 'Order IDs', then decide which one to keep using `keep='first'`.
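If you want to check your answer, here is a starter sketch, assuming the columns are literally named `User ID` and `Order ID` (adjust to your own dataset):

```python
import pandas as pd

# Hypothetical data: user 1 appears with two different orders
df = pd.DataFrame({
    "User ID": [1, 1, 2],
    "Order ID": [101, 102, 201],
})

# keep=False flags ALL rows in a duplicated group, not just the repeats
repeat_users = df[df.duplicated(subset=["User ID"], keep=False)]
print(repeat_users["Order ID"].tolist())  # [101, 102]

# Keep only the first order per user
first_orders = df.drop_duplicates(subset=["User ID"], keep="first")
print(first_orders["Order ID"].tolist())  # [101, 201]
```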
*Day 124: Sample and Shuffle for Data Evaluation.*