Project: The Automated Data Cleanup Pipeline
Congratulations! You've completed **Phase 6**. You now know more about data sanitization than most general-purpose developers.
Today, we're building the **"Golden Source" Pipeline**.
The Challenge
You've inherited a CSV import table with:
The Master Cleaning Script
WITH base_cleanup AS (
SELECT
id,
TRIM(LOWER(username)) as username,
ABS(price) as price, -- Fix negative prices
TO_DATE(date_text, 'DD/MM/YYYY') as order_date,
CASE
WHEN country IN ('UK', 'U.K.', 'Great Britain') THEN 'United Kingdom'
ELSE country
END as country
FROM raw_imports
),
deduplicated AS (
SELECT DISTINCT ON (username, order_date) *
FROM base_cleanup
ORDER BY username, order_date, id DESC
)
SELECT * FROM deduplicated;
Why this is professional grade
We didn't just "Fix" the data; we built a reproducible pipeline. Every time a new CSV is imported, this query will produce a perfect, clean result that the Phase 2, 3, and 5 queries can trust.
**Next Phase (Days 86–95): Performance Optimization**. We're going to learn about Indexes, Execution Plans, and how to make a query 100x faster.
See you in Phase 7!