Data Science

Handling Duplicates: The Pythonic Way

SQL Mastery Team
May 24, 2026
5 min read

Welcome to **Day 123**. Today we're cleaning again. Duplicates are the enemy of accuracy.

Finding Duplicates

```python
# Returns True for every row that repeats an earlier row
is_duplicate = df.duplicated()

# Find duplicates based only on the email column
dupe_emails = df[df.duplicated(subset=['email'])]
```
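To see this end to end, here is a minimal, self-contained sketch; the DataFrame and its column names are illustrative, not from a real dataset:

```python
import pandas as pd

# Toy data with one repeated email
df = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
    'order': [1, 2, 3],
})

# duplicated() flags each row that repeats an earlier one
flags = df.duplicated(subset=['email'])
print(flags.tolist())  # → [False, False, True]

# keep=False flags ALL copies, handy for inspecting the whole group
all_copies = df[df.duplicated(subset=['email'], keep=False)]
```

Note `keep=False`: the default only flags the second and later occurrences, while `keep=False` surfaces every member of a duplicate group so you can compare them before deleting anything.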

Removing Duplicates

```python
# Keep the FIRST occurrence, delete the rest
df_clean = df.drop_duplicates(subset=['email'], keep='first')

# Keep the LAST (useful if the last row has the latest data)
df_latest = df.drop_duplicates(subset=['email'], keep='last')
```
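There is a third option worth knowing: `keep=False` removes *every* row that has a duplicate, rather than keeping one representative. A small sketch with made-up data:

```python
import pandas as pd

# Toy data: 'a@x.com' appears twice, 'b@x.com' once
df = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'order': [1, 2, 3],
})

# keep=False drops EVERY row involved in a duplicate, not just the extras
unique_only = df.drop_duplicates(subset=['email'], keep=False)
print(unique_only['email'].tolist())  # → ['b@x.com']
```

This is useful when a duplicated key signals bad data and you would rather exclude the whole group than guess which copy is right.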

Comparison to SQL

In SQL (Day 80), we used PostgreSQL's `DISTINCT ON`. In Pandas, `drop_duplicates` is more intuitive, faster to write, and works regardless of which database the data came from.
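For the curious, the `DISTINCT ON` pattern translates to Pandas as a sort followed by `drop_duplicates`. A sketch, with assumed column names (`email`, `signup`):

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'signup': pd.to_datetime(['2026-01-01', '2026-03-01', '2026-02-01']),
})

# Equivalent of:
#   SELECT DISTINCT ON (email) * FROM df ORDER BY email, signup DESC;
distinct_on = (df.sort_values(['email', 'signup'], ascending=[True, False])
                 .drop_duplicates(subset=['email'], keep='first'))
print(distinct_on['email'].tolist())  # → ['a@x.com', 'b@x.com']
```

Sorting descending by `signup` and then keeping the first occurrence per email is what makes this pick the most recent row for each key.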

Your Task for Today

Identify rows that share the same `User ID` but have different `Order ID` values, then decide which row to keep using `keep='first'`.
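If you want a starting point, here is a hedged sketch on toy data; the column names `User ID` and `Order ID` are assumed from the task wording, so adjust them to your actual dataset:

```python
import pandas as pd

# Hypothetical orders table: user 1 placed two orders
orders = pd.DataFrame({
    'User ID': [1, 1, 2],
    'Order ID': [101, 102, 201],
})

# Rows that share a User ID (so their Order IDs can be compared)
repeat_users = orders[orders.duplicated(subset=['User ID'], keep=False)]

# Keep only the first order per user
first_orders = orders.drop_duplicates(subset=['User ID'], keep='first')
print(first_orders['Order ID'].tolist())  # → [101, 201]
```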

*Day 124: Sample and Shuffle for Data Evaluation.*

Ready to put your knowledge into practice?

Join SQL Mastery and learn through interactive exercises.