Data Engineering

Project: The Automated Data Cleanup Pipeline

SQL Mastery Team
April 16, 2026
8 min read

Congratulations! You've completed **Phase 6**. You now know more about data sanitization than most general-purpose developers.

Today, we're building the **"Golden Source" Pipeline**.

The Challenge

You've inherited a CSV import table with:

  • Multiple spellings of "United Kingdom" & "UK".
  • Prices that are sometimes negative (errors).
  • Usernames with leading/trailing spaces.
  • Dates stored as text in the format `DD/MM/YYYY`.
  • The Master Cleaning Script

    WITH base_cleanup AS (

    SELECT

    id,

    TRIM(LOWER(username)) as username,

    ABS(price) as price, -- Fix negative prices

    TO_DATE(date_text, 'DD/MM/YYYY') as order_date,

    CASE

    WHEN country IN ('UK', 'U.K.', 'Great Britain') THEN 'United Kingdom'

    ELSE country

    END as country

    FROM raw_imports

    ),

    deduplicated AS (

    SELECT DISTINCT ON (username, order_date) *

    FROM base_cleanup

    ORDER BY username, order_date, id DESC

    )

    SELECT * FROM deduplicated;

    Why this is professional grade

    We didn't just "Fix" the data; we built a reproducible pipeline. Every time a new CSV is imported, this query will produce a perfect, clean result that the Phase 2, 3, and 5 queries can trust.

    **Next Phase (Days 86–95): Performance Optimization**. We're going to learn about Indexes, Execution Plans, and how to make a query 100x faster.

    See you in Phase 7!

    Ready to put your knowledge into practice?

    Join SQL Mastery and learn through interactive exercises.