Common Table Expressions for Data Cleaning
It's **Day 56**, and we're cleaning. Data cleaning is often a "Multi-Pass" process. You can't fix everything in one `SELECT`.
The Cleaning Pipeline
Imagine your data has:
1. Missing emails.
2. Duplicate usernames.
3. Messy capitalisation.
Using CTEs, you can solve these one-by-one:
WITH step1_lowercase AS (
SELECT id, LOWER(TRIM(email)) as email, username FROM users
),
step2_remove_invalid AS (
SELECT * FROM step1_lowercase
WHERE email LIKE '%@%'
),
step3_handle_duplicates AS (
-- Your logic here (e.g., Row Numbering)
SELECT DISTINCT ON (email) * FROM step2_remove_invalid
)
SELECT * FROM step3_handle_duplicates;
Why this is better than one big query
If you try to do `DISTINCT`, `LOWER`, `TRIM`, and `WHERE` all in one block, the order of operations can lead to confusing results. CTEs force a specific order.
Your Task for Today
Build a 2-step cleaning pipe: First, handle NULLs with `COALESCE`, then filter the results in the second CTE.
*Day 57: Subqueries in the SELECT Clause—The Silent Performance Killer.*