Data Science

Project: The Multi-Source Cleanup Pipeline

SQL Mastery Team
May 26, 2026
8 min read

Congratulations! You've completed **Phase 2: Data Wrangling**. You are now officially a "Pandas Power User."

Today, we build a **Data Consolidation Engine**.

The Challenge

You have:

1. `users.csv` (Basic info).

2. `transactions.csv` (Price info).

3. `web_logs.csv` (Dates and categories).
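Before merging, each file needs to be loaded into its own DataFrame. Here's a minimal sketch using in-memory CSV text as stand-in sample data; the column names (`uid`, `name`, `price`, `date`, `category`, `shipping`) are assumptions chosen to match the skeleton later in this lesson, not the real datasets:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the three files;
# the columns are assumptions, not taken from the real datasets.
users_csv = io.StringIO("uid,name\n1,Ana\n2,Ben\n3,Cy\n")
tx_csv = io.StringIO("uid,price\n1,120.0\n2,\n3,45.5\n")
logs_csv = io.StringIO(
    "uid,date,category,shipping\n"
    "1,2026-05-01,books,2\n"
    "2,2026-05-03,toys,5\n"
    "3,2026-05-04,books,3\n"
)

users = pd.read_csv(users_csv)   # basic info
tx = pd.read_csv(tx_csv)         # price info (note the missing price for uid 2)
logs = pd.read_csv(logs_csv)     # dates and categories

print(users.shape, tx.shape, logs.shape)
```

In a real project you would point `pd.read_csv` at the file paths instead of `io.StringIO` buffers; everything downstream stays the same.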

The Roadmap

1. **Join**: Merge these into one DataFrame.

2. **Clean**: Fix DateTime formats and handle missing prices.

3. **Transform**: Create a spending "segment" column based on transaction amounts.

4. **Aggregate**: Show the average shipping time per category for the Top 10% of spenders.

The Solution (Skeleton)

```python
import pandas as pd

# 1. Load & merge on the shared user ID
users = pd.read_csv('users.csv')
tx = pd.read_csv('transactions.csv')
logs = pd.read_csv('web_logs.csv')
df = users.merge(tx, on='uid').merge(logs, on='uid')

# 2. DateTime & nulls
df['date'] = pd.to_datetime(df['date'])
df['price'] = df['price'].fillna(0)

# 3. Logic: bucket transactions into spending deciles (0–9)
df['segment'] = pd.qcut(df['price'], q=10, labels=False)

# 4. Filter to the top decile, then aggregate
high_value = df[df['segment'] == 9]
result = high_value.groupby('category')['shipping'].mean()
print(result)
```
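One subtlety worth knowing: `pd.qcut` on the `price` column ranks individual *transactions*, not *spenders*. If a user splits their spending across many small purchases, no single row may land in the top decile. To rank spenders, aggregate per user first. A sketch with hypothetical data (the column names mirror the skeleton above):

```python
import pandas as pd

# Hypothetical merged frame; columns mirror the skeleton above.
df = pd.DataFrame({
    'uid':      [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'price':    [500, 300, 20, 15, 10, 8, 30, 25, 12, 40, 900],
    'category': ['books', 'toys', 'books', 'toys', 'books',
                 'toys', 'books', 'toys', 'books', 'toys', 'books'],
    'shipping': [2, 3, 5, 4, 6, 7, 3, 4, 5, 2, 1],
})

# Rank users by their TOTAL spend, not by individual transaction price.
user_spend = df.groupby('uid')['price'].sum()
cutoff = user_spend.quantile(0.9)            # 90th-percentile total spend
top_spenders = user_spend[user_spend >= cutoff].index

high_value = df[df['uid'].isin(top_spenders)]
result = high_value.groupby('category')['shipping'].mean()
print(result)
```

Either definition of "Top 10%" can be defensible; just make sure the one you compute matches the one the business question asks for.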

Why this is professional grade

You've taken raw, disconnected files and turned them into actionable business intelligence. This merge, clean, transform, aggregate workflow mirrors the consolidation pipelines data teams run at companies like Amazon and Netflix.

**Phase 3 (Days 126–140): Visualization & EDA**. We're going to turn these numbers into beautiful, interactive charts.

See you in Phase 3!

Ready to put your knowledge into practice?

Join SQL Mastery and learn through interactive exercises.