Text Cleanup: Fundamental String Functions
The Messy Database
I was merging two customer lists when I realized the problem. One list had "LONDON," another had "london," and a third had " London " (with a space). To a computer, those are three different cities. To a human, they are the same place.
My reports were a mess. I needed to standardize the data.
The Quest: The Polishers
In the real world, data is rarely "Clean." People make typos, caps-lock gets left on, and legacy systems add extra spaces. As an analyst, you must "Polish" the text before you present it.
Three essential tools are:
1. **UPPER / LOWER**: Standardize case.
2. **TRIM**: Remove accidental spaces at the start or end.
3. **CONCAT**: Glue two pieces of text together.
The Implementation: The Standardizer
I wrote a query to "Fix" the city names on the fly.
-- Cleaning up messy customer data
SELECT
TRIM(UPPER(city)) AS clean_city,
CONCAT(first_name, ' ', last_name) AS full_name
FROM
customers;
Why This is Critical
If you want to count customers in London, and half are lowercase, your count will be wrong. By using `UPPER()`, you ensure that everything matches perfectly.
The "Oops" Moment
I once forgot to include a space in my `CONCAT` function: `CONCAT(first, last)`. The result was `JohnDoe`. Not exactly a professional directory.
**Pro Tip**: Always remember the space! `CONCAT(first, ' ', last)`.
The Victory
The "Dirty" list of 100 variations became a clean list of 5 standardized cities. My aggregation was accurate for the first time. I learned that data analysis is 80% cleaning and 20% calculating.
Your Task for Today
Choose a text column that looks inconsistent. Use `UPPER(TRIM(column_name))` and see how much more consistent the results become.
*Day 13: The Mystery of NULL—Unknown vs. Empty.*