Sample and Shuffle for Evaluation
It's **Day 124**. In data science, we rarely work with 100% of the data during the prototype phase. Instead, we use **sampling**: drawing a random subset of rows that is small enough to iterate on quickly.
How to Sample
# Get 50 random rows
small_sample = df.sample(n=50)
# Get 10% of the data
ten_percent = df.sample(frac=0.1)
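To try both calls end to end, here is a minimal sketch. The toy `df` (1,000 rows with `id` and `value` columns) is a made-up stand-in for your own dataset, not part of the original lesson.

```python
import pandas as pd

# Hypothetical toy dataset: 1,000 rows. Replace with your own DataFrame.
df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

# n= asks for an exact number of rows (drawn without replacement by default).
small_sample = df.sample(n=50)

# frac= asks for a fraction of the rows: 10% of 1,000 is 100 rows.
ten_percent = df.sample(frac=0.1)

print(len(small_sample), len(ten_percent))

# Sampled rows keep their original index labels, so you can trace them back.
print(small_sample.index.isin(df.index).all())
```

Use `n` when you need a fixed number of rows and `frac` when the sample should scale with the dataset. Pass one or the other, not both.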
Shuffling (Randomizing Order)
Why shuffle? If your data is sorted by date and you take the first 100 rows, you only see the oldest records, not a cross-section of the whole period. Shuffling removes that ordering, so any slice you take behaves like a random sample instead of a biased one.
# Shuffling the whole dataset
df_shuffled = df.sample(frac=1.0).reset_index(drop=True)
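The sketch below illustrates the ordering problem on a hypothetical date-sorted dataset. The `date` and `sales` columns and the seed `0` are assumptions added for illustration. The seed is only there so the example gives the same shuffle on every run.

```python
import pandas as pd

# Hypothetical dataset sorted by date: one row per day of 2024 (first 365 days).
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=365, freq="D"),
    "sales": range(365),
})

# Taking the first 100 rows of sorted data only covers the start of the year.
head_dates = df.head(100)["date"]

# Shuffle every row, then rebuild a clean 0..n-1 index.
df_shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
shuffled_dates = df_shuffled.head(100)["date"]

print(head_dates.max())      # latest date among the first 100 sorted rows
print(shuffled_dates.max())  # latest date among the first 100 shuffled rows
```

Without `reset_index(drop=True)`, the shuffled rows would keep their old index labels, and `drop=True` stops pandas from adding the old index back as a new column. Shuffling changes only the row order: no rows are added or lost.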
Reproducibility: random_state
If you want the same "random" sample every time (so a coworker can reproduce your experiment), set a seed with `random_state`:
df.sample(n=10, random_state=42)
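A quick way to convince yourself the seed works is to draw the sample twice and compare. This is a small sketch on a hypothetical 200-row `df`. The names `first`, `second`, and `unseeded` are only for illustration.

```python
import pandas as pd

# Hypothetical dataset: replace with your own DataFrame.
df = pd.DataFrame({"id": range(200)})

# Same seed, same data: both calls return the same rows in the same order.
first = df.sample(n=10, random_state=42)
second = df.sample(n=10, random_state=42)
print(first.equals(second))  # True

# No seed: the rows will generally differ from run to run.
unseeded = df.sample(n=10)
print(len(unseeded))
```

The seed only guarantees the same result when the input DataFrame is also the same (same rows, same order). If your coworker loads the data differently, the sample can differ even with `random_state=42`.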
Your Task for Today
Pull a 5% sample of your dataset using a fixed `random_state`.
*Day 125: Phase 2 Project—The Multi-Source Cleanup Pipeline.*