Common Table Expressions for Data Cleaning

It's **Day 56**, and we're cleaning. Data cleaning is often a "Multi-Pass" process. You can't fix everything in one `SELECT`.

The Cleaning Pipeline

Imagine your data has:

1. Missing emails.

2. Duplicate usernames.

3. Messy capitalisation.

Using CTEs, you can solve these one-by-one:

WITH step1_lowercase AS (

SELECT id, LOWER(TRIM(email)) as email, username FROM users

step2_remove_invalid AS (

SELECT * FROM step1_lowercase

WHERE email LIKE '%@%'

step3_handle_duplicates AS (

-- Your logic here (e.g., Row Numbering)

SELECT DISTINCT ON (email) * FROM step2_remove_invalid

)

SELECT * FROM step3_handle_duplicates;

Why this is better than one big query

If you try to do `DISTINCT`, `LOWER`, `TRIM`, and `WHERE` all in one block, the order of operations can lead to confusing results. CTEs force a specific order.

Your Task for Today

Build a 2-step cleaning pipe: First, handle NULLs with `COALESCE`, then filter the results in the second CTE.

*Day 57: Subqueries in the SELECT Clause—The Silent Performance Killer.*

The Cleaning Pipeline

Why this is better than one big query

Your Task for Today

Ready to put your knowledge into practice?