Sample and Shuffle for Evaluation

It's **Day 124**. In data science, we rarely work with 100% of the data during the prototyping phase. Instead, we use **sampling**: pulling a random subset of rows that is quick to inspect and test against.

How to Sample

# Get 50 random rows
small_sample = df.sample(n=50)

# Get 10% of the data
ten_percent = df.sample(frac=0.1)
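To try the two calls above end to end, here is a minimal sketch. The toy `df` and its `id` and `value` columns are made up for illustration; swap in your own DataFrame.

```python
import pandas as pd

# Hypothetical toy dataset (not from the lesson): 1,000 rows
df = pd.DataFrame({"id": range(1000), "value": [i * 2 for i in range(1000)]})

# Fixed number of rows
small_sample = df.sample(n=50)

# Fixed fraction of rows
ten_percent = df.sample(frac=0.1)

print(len(small_sample))             # 50
print(len(ten_percent))              # 100 (10% of 1,000)
print(small_sample.index.is_unique)  # True: no row is picked twice by default
```

By default `sample` draws without replacement, so a row can appear at most once. Use either `n` or `frac`, not both in the same call.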

Shuffling (Randomizing Order)

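Why shuffle? If your data is sorted by date and you take the first 100 rows, you only see the oldest records, which may look nothing like the rest of the dataset. Shuffling first means that any slice you take, including the first 100 rows, is a random mix drawn from the whole date range.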

# Shuffling the whole dataset
df_shuffled = df.sample(frac=1.0).reset_index(drop=True)
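The effect is easiest to see on date-sorted data. The sketch below uses a made-up year of daily `sales` rows (the column names and values are assumptions for illustration) and passes `random_state=0` only so the output is repeatable.

```python
import pandas as pd

# Hypothetical date-sorted dataset: one row per day of 2023
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "sales": range(365),
})

# Without shuffling, the first 100 rows cover only January to early April
first_100 = df.head(100)
print(first_100["date"].min(), first_100["date"].max())

# Shuffle every row, then rebuild a clean 0..n-1 index
df_shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)

# After shuffling, the first 100 rows are spread across the whole year
shuffled_100 = df_shuffled.head(100)
print(shuffled_100["date"].min(), shuffled_100["date"].max())
```

`reset_index(drop=True)` discards the old row labels. Without it, the shuffled frame keeps its original index values in scrambled order, which can be confusing when you later select rows by label.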

Reproducibility: random_state

If you want the SAME "random" sample every time (so a coworker can reproduce your experiment), pass a fixed seed through `random_state`:

df.sample(n=10, random_state=42)
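A quick way to convince yourself that the seed works is to draw twice with the same `random_state` and compare the results. This is a minimal sketch on a made-up single-column `df`.

```python
import pandas as pd

# Hypothetical toy dataset for illustration
df = pd.DataFrame({"id": range(100)})

# Two draws with the same seed return exactly the same rows
a = df.sample(n=10, random_state=42)
b = df.sample(n=10, random_state=42)
print(a.equals(b))  # True

# A draw without a seed will generally differ from run to run
c = df.sample(n=10)
print(c["id"].tolist())
```

The value 42 has no special meaning. Any integer works, as long as everyone running the experiment uses the same one on the same data.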

Your Task for Today

Pull a 5% sample of your dataset using a fixed `random_state`.

*Day 125: Phase 2 Project—The Multi-Source Cleanup Pipeline.*
