
Training vs Testing Sets

SQL Mastery Team
June 14, 2026

It's **Day 144**. Today we talk about **Honesty** in Machine Learning.

The Exam Analogy

Imagine a teacher gives you the answers to a test on Monday, and then gives you the EXACT same test on Friday. If you get a 100%, are you smart, or did you just memorize the answers?

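This is called **Overfitting**: a model scores well on data it has already seen because it memorized that data, not because it learned patterns that carry over to new data.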

The Solution: Train/Test Split

We split our rows into two buckets:

1. **Training Set (80%)**: The computer sees this data and learns the patterns.

2. **Testing Set (20%)**: The model NEVER sees this data during training. We use it to estimate how well the model works on new information.

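```python
from sklearn.model_selection import train_test_split

# X (features) and y (target) are assumed to be loaded already.
# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```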
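To see the exam analogy in numbers, here is a minimal sketch. The data is made up: the labels are pure noise, so there is no real pattern to learn. The decision tree is just one convenient model that can memorize its training rows; it is an assumption for this illustration, not part of the split itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 1,000 rows of random features with labels that are
# pure noise, so there is nothing genuine for a model to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)

# An unrestricted decision tree can memorize every training row.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Score on the rows the model studied, then on rows it has never seen.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

The training score should come out perfect, while the test score should sit near a coin flip. Only the test score tells you the honest story.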

Why the `random_state`?

It's the seed for the randomizer (Day 124). Fixing it to any integer, such as `42`, ensures that every time you run the script you get the same split, which makes your results reproducible. The value `42` itself is arbitrary; any fixed integer works.
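A quick way to convince yourself is to split the same data twice with the same seed and compare. This is a small sketch on a made-up array of ten numbered rows; the variable names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: ten rows, numbered so the shuffle is easy to see.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# Same seed twice: the two splits should be identical.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_test_a, X_test_b)
print("same seed, same test rows:", same_split)

# No seed: the test rows can change from one run to the next.
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2)
print("unseeded test rows:", X_test_c.ravel())
```

Run the script a few times: the seeded comparison should always agree, while the unseeded rows will usually differ between runs.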

Your Task for Today

Practice splitting a small DataFrame into train/test subsets, first 70/30 and then 80/20.
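If you want a starting point, here is one possible sketch. The DataFrame and its column names are invented for the exercise; swap in your own data. `train_test_split` accepts a DataFrame directly and returns two DataFrames.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame with ten rows; the column names are made up.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "passed": [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
})

# 70/30 split: 30% of the rows go to the test set.
train_70, test_30 = train_test_split(df, test_size=0.3, random_state=42)
print(len(train_70), len(test_30))  # 7 3

# 80/20 split: 20% of the rows go to the test set.
train_80, test_20 = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_80), len(test_20))  # 8 2
```

Try changing `test_size` and the number of rows to see how the bucket sizes move.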

*Day 145: Classification with Logistic Regression.*
