Training vs Testing Sets
It's **Day 144**. Today we talk about **Honesty** in Machine Learning.
The Exam Analogy
Imagine a teacher gives you the answers to a test on Monday, then gives you the EXACT same test on Friday. If you score 100%, are you smart, or did you just memorize the answers?

In machine learning, this is the danger behind **overfitting**: a model that has memorized its training data can score perfectly on those rows while telling you nothing about how it handles new data. That's why we never grade a model on the data it trained on.
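The memorization trap is easy to demonstrate. In this sketch (peeking ahead to scikit-learn's splitter, which we meet below), the labels are pure random noise, so there is genuinely nothing to learn. The exact dataset here is made up for illustration. An unconstrained decision tree still "aces" the training data, and the test set exposes the bluff:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Pure noise: the features carry NO information about the labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A deep, unconstrained tree can memorize every training row
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(model.score(X_train, y_train))  # 1.0 -- a perfect "exam" score
print(model.score(X_test, y_test))    # ~0.5 -- coin-flip on unseen data
```

The held-out 20% is what reveals that the perfect training score was memorization, not learning.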
The Solution: Train/Test Split
We split our rows into two buckets:
1. **Training Set (80%)**: The computer sees this data and learns the patterns.
2. **Testing Set (20%)**: The computer has NEVER seen this data. We use it to see if the model actually works on new information.
```python
from sklearn.model_selection import train_test_split

# 80% of the rows go to training, 20% to testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Why the `random_state`?
It seeds the random number generator (see Day 124). Fixing it to `42` (or any constant) means every run of the script produces the same shuffle, and therefore the same split, making your results reproducible.
Your Task for Today
Practice splitting a small DataFrame into training and testing subsets using both a 70/30 and an 80/20 ratio.
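One possible shape for that exercise, using a small made-up DataFrame (the column names and values are placeholders, not part of the lesson):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny practice DataFrame: hours studied vs. pass/fail
df = pd.DataFrame({
    "hours":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "passed": [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})
X = df[["hours"]]
y = df["passed"]

# 70/30 split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_tr), len(X_te))  # 7 3

# 80/20 split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 8 2
```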
*Day 145: Classification with Logistic Regression.*