Project: The Automated Data Cleanup Pipeline

Congratulations! You've completed **Phase 6**. You now know more about data sanitization than most general-purpose developers.

Today, we're building the **"Golden Source" Pipeline**.

The Challenge

You've inherited a CSV import table with:

Multiple spellings of "United Kingdom" & "UK".

Prices that are sometimes negative (errors).

Usernames with leading/trailing spaces.

Dates stored as text in the format `DD/MM/YYYY`.

The Master Cleaning Script

WITH base_cleanup AS (

SELECT

id,

TRIM(LOWER(username)) as username,

ABS(price) as price, -- Fix negative prices

TO_DATE(date_text, 'DD/MM/YYYY') as order_date,

CASE

WHEN country IN ('UK', 'U.K.', 'Great Britain') THEN 'United Kingdom'

ELSE country

END as country

FROM raw_imports

deduplicated AS (

SELECT DISTINCT ON (username, order_date) *

FROM base_cleanup

ORDER BY username, order_date, id DESC

)

SELECT * FROM deduplicated;

Why this is professional grade

We didn't just "Fix" the data; we built a reproducible pipeline. Every time a new CSV is imported, this query will produce a perfect, clean result that the Phase 2, 3, and 5 queries can trust.

**Next Phase (Days 86–95): Performance Optimization**. We're going to learn about Indexes, Execution Plans, and how to make a query 100x faster.

See you in Phase 7!

The Challenge

The Master Cleaning Script

Why this is professional grade

Ready to put your knowledge into practice?