Project: The Multi-Source Cleanup Pipeline
Congratulations! You've completed **Phase 2: Data Wrangling**. You are now officially a "Pandas Power User."
Today, we build a **Data Consolidation Engine**.
The Challenge
You have:
1. `users.csv` (Basic info).
2. `transactions.csv` (Price info).
3. `web_logs.csv` (Dates and categories).
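Before wiring up the real files, it helps to sketch the shapes involved. Here is a minimal toy version of the three sources; the column names (`uid`, `price`, `date`, `category`, `shipping`) are assumptions stand-ing in for whatever your actual CSVs contain:

```python
import pandas as pd

# Hypothetical stand-ins for the three CSVs (column names are assumptions)
users = pd.DataFrame({'uid': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cy']})
tx = pd.DataFrame({'uid': [1, 2, 3], 'price': [10.0, None, 99.5]})
logs = pd.DataFrame({'uid': [1, 2, 3],
                     'date': ['2024-01-05', '2024-01-06', '2024-01-07'],
                     'category': ['books', 'toys', 'books'],
                     'shipping': [2, 4, 3]})

# Chained merge on the shared key: 3 rows, 6 columns total
df = users.merge(tx, on='uid').merge(logs, on='uid')
print(df.shape)  # (3, 6)
```

With tiny frames like this you can eyeball every intermediate result before pointing the same code at the full files.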
The Roadmap
1. **Join**: Merge these into one DataFrame.
2. **Clean**: Fix DateTime formats and handle missing prices.
3. **Transform**: Create a spend "Status" (segment) column based on transaction amounts.
4. **Aggregate**: Show the average shipping time per category for the Top 10% of spenders.
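Step 2 hides a common gotcha: real-world date columns usually contain a few malformed strings. Passing `errors='coerce'` to `pd.to_datetime` turns unparseable entries into `NaT` instead of raising, so the pipeline keeps running. A quick sketch with made-up data:

```python
import pandas as pd

# One deliberately broken date string
dates = pd.Series(['2024-01-05', 'not-a-date', '2024-02-10'])

# errors='coerce' maps bad entries to NaT rather than raising a ValueError
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().sum())  # 1
```

You can then decide explicitly what to do with the `NaT` rows (drop them, fill them, or flag them) instead of crashing mid-pipeline.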
The Solution (Skeleton)
```python
import pandas as pd

# 0. Load the three sources
users = pd.read_csv('users.csv')
tx = pd.read_csv('transactions.csv')
logs = pd.read_csv('web_logs.csv')

# 1. Merge on the shared user key
df = users.merge(tx, on='uid').merge(logs, on='uid')

# 2. DateTime & Nulls (coerce bad dates to NaT instead of raising)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['price'] = df['price'].fillna(0)

# 3. Logic: decile-based spend status (0 = lowest, 9 = highest)
df['segment'] = pd.qcut(df['price'], q=10, labels=False)

# 4. Filter & Aggregate: the top decile is the Top 10% of spenders
high_value = df[df['segment'] == 9]
result = high_value.groupby('category')['shipping'].mean()
print(result)
```
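The `qcut` call deserves a closer look: with `labels=False` it returns integer decile indices 0 through 9, so filtering on `segment == 9` keeps exactly the top 10%. A small self-contained sketch with synthetic spend values:

```python
import pandas as pd

prices = pd.Series(range(100))              # 100 evenly spread "spend" values
deciles = pd.qcut(prices, q=10, labels=False)

# Each decile holds 10 values; the top decile (index 9) covers 90..99
top = prices[deciles == 9]
print(top.min(), top.max())  # 90 99
```

Note that `qcut` splits by *quantiles* (equal-sized groups), unlike `pd.cut`, which splits by equal-width value ranges; that is exactly what "Top 10% of spenders" asks for.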
Why this is professional grade
You've taken raw, disconnected files and turned them into actionable business intelligence. This merge-clean-transform-aggregate loop mirrors the daily workflow of data teams at companies like Amazon and Netflix.
**Phase 3 (Days 126–140): Visualization & EDA**. We're going to turn these numbers into beautiful, interactive charts.
See you in Phase 3!