Data Science

Sample and Shuffle for Evaluation

SQL Mastery Team
May 25, 2026

It's **Day 124**. In Data Science, we rarely work with 100% of the data during the prototype phase. Instead, we pull a random subset of rows, which is called **Sampling**.

How to Sample

# Get 50 random rows

small_sample = df.sample(n=50)

# Get 10% of the data

ten_percent = df.sample(frac=0.1)
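To try both calls end to end, here is a minimal sketch on a made-up table. The 1,000-row DataFrame and its `id` and `value` columns are assumptions for illustration only; swap in your own `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 1,000 rows with made-up columns
df = pd.DataFrame({
    "id": np.arange(1000),
    "value": np.arange(1000) * 0.5,
})

small_sample = df.sample(n=50)      # exactly 50 rows
ten_percent = df.sample(frac=0.1)   # 10% of 1,000 rows = 100 rows

print(len(small_sample), len(ten_percent))

# sample() draws without replacement by default, so no row repeats
print(small_sample["id"].is_unique)
```

Note that the sampled rows keep their original index labels, which is handy if you need to trace a row back to the full dataset.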

Shuffling (Randomizing Order)

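Why shuffle? If your data is sorted by date and you take the first 100 rows, you see only the earliest (or latest) records, not a cross-section of the whole period. Shuffling randomizes the row order, so the first 100 rows of the shuffled data form a random sample that is much more likely to be representative.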

# Shuffling the whole dataset

df_shuffled = df.sample(frac=1.0).reset_index(drop=True)
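The difference is easy to see on a small example. The sketch below assumes a hypothetical year of daily data (the `date` and `sales` columns are invented for illustration) and compares the first 100 rows before and after shuffling. A seed is passed only so the printed result is repeatable.

```python
import pandas as pd

# Hypothetical daily data for one year, sorted by date
df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=365, freq="D"),
    "sales": range(365),
})

# First 100 rows of the sorted data: only the earliest dates
head_sample = df.head(100)

# Shuffle every row, then rebuild a clean 0..n-1 index
df_shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
shuffled_sample = df_shuffled.head(100)

# Compare the latest date each 100-row slice reaches
print(head_sample["date"].max())
print(shuffled_sample["date"].max())
```

Shuffling changes only the order: `df_shuffled` still contains every original row exactly once.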

Reproducibility: random_state

If you want the SAME "Random" sample every time (so your coworker can reproduce your experiment), set a seed with `random_state`:

df.sample(n=10, random_state=42)
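A quick way to convince yourself that the seed works is to draw the sample twice and compare. This is a minimal sketch on a hypothetical 1,000-row table; the seed values 42 and 7 are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000)})  # hypothetical data

a = df.sample(n=10, random_state=42)
b = df.sample(n=10, random_state=42)
c = df.sample(n=10, random_state=7)

# Same seed: identical rows in the same order
print(a.equals(b))

# Different seed: a different draw (a match is possible but wildly unlikely)
print(a.index.equals(c.index))
```

The seed makes the draw repeatable only when the input is the same too. If the rows or their order change, the same `random_state` can pick different records.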

Your Task for Today

Pull a 5% sample of your dataset using a fixed `random_state`.

*Day 125: Phase 2 Project—The Multi-Source Cleanup Pipeline.*
