# Handling Duplicates: The Pythonic Way
Welcome to **Day 123**. Today we're cleaning again. Duplicates are the enemy of accuracy.
## Finding Duplicates

```python
# Returns True for every row after its first occurrence (i.e. the repeats)
is_duplicate = df.duplicated()

# Find duplicates based only on the email column
dupe_emails = df[df.duplicated(subset=['email'])]
```
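Here is a minimal, self-contained example (with hypothetical user data) showing exactly which rows `duplicated()` flags:

```python
import pandas as pd

# Hypothetical sample data with one repeated email
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana"],
    "email": ["ana@example.com", "ben@example.com", "ana@example.com"],
})

# duplicated() marks every occurrence AFTER the first as True
print(df.duplicated().tolist())  # [False, False, True]

# Restricting the check to one column works the same way
print(df[df.duplicated(subset=["email"])]["name"].tolist())  # ['Ana']
```

Note that the first `Ana` row is *not* flagged: only subsequent repeats count as duplicates by default.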
## Removing Duplicates

```python
# Keep the FIRST occurrence, drop the rest
df_clean = df.drop_duplicates(subset=['email'], keep='first')

# Keep the LAST (useful if the last row has the latest data)
df_latest = df.drop_duplicates(subset=['email'], keep='last')
```
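A quick sketch (with hypothetical order data, assumed to be sorted oldest-to-newest) of why `keep='last'` retains the most recent row:

```python
import pandas as pd

# Hypothetical orders, appended chronologically: the later row is newer
orders = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "total": [10, 20, 30],
})

# keep='last' drops the older a@x.com row (total=10)
latest = orders.drop_duplicates(subset=["email"], keep="last")
print(latest["total"].tolist())  # [20, 30]
```

This only gives you "the latest data" if rows really are appended in time order; if not, sort by a timestamp column first.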
## Comparison to SQL

In SQL (Day 80), we used `DISTINCT ON` for this. In Pandas, `drop_duplicates` expresses the same idea more directly: you name the subset of columns and which occurrence to keep, without needing an `ORDER BY` clause for the simple cases.
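As a rough equivalence (a sketch, not an exact translation), Postgres's `SELECT DISTINCT ON (email) ... ORDER BY email, created_at DESC` maps to a sort followed by `drop_duplicates`:

```python
import pandas as pd

# Hypothetical table with a created_at timestamp per row
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "created_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
})

# Sort newest-first, then keep the first (= newest) row per email,
# mirroring DISTINCT ON (email) ... ORDER BY email, created_at DESC
newest = df.sort_values("created_at", ascending=False).drop_duplicates(subset=["email"])
print(sorted(newest["created_at"].dt.strftime("%Y-%m-%d")))  # ['2024-01-15', '2024-02-01']
```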
## Your Task for Today

Identify rows that share the same 'User ID' but have different 'Order IDs', then decide which one to keep using `keep='first'`.
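If you want to check your answer, here is a starter sketch, assuming the columns are literally named `User ID` and `Order ID` (adjust to your own dataset):

```python
import pandas as pd

# Hypothetical data: user 1 appears with two different orders
df = pd.DataFrame({
    "User ID": [1, 1, 2],
    "Order ID": [101, 102, 201],
})

# keep=False flags ALL rows in a duplicated group, not just the repeats
repeat_users = df[df.duplicated(subset=["User ID"], keep=False)]
print(repeat_users["Order ID"].tolist())  # [101, 102]

# Keep only the first order per user
first_orders = df.drop_duplicates(subset=["User ID"], keep="first")
print(first_orders["Order ID"].tolist())  # [101, 201]
```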
*Day 124: Sample and Shuffle for Data Evaluation.*