Strings and String Manipulation
Strings are text data: names, emails, addresses, product descriptions. Business analysts spend a large share of their time cleaning messy strings: trimming whitespace, fixing case, extracting codes, validating formats. Master string manipulation and you can turn dirty data into clean insights quickly.
Estimated reading time: 25–30 minutes
Essential String Operations
- strip() → remove leading/trailing whitespace
- lower() / upper() → normalize case for comparisons
- split() / join() → parse CSV, build paths
- replace() → fix typos, standardize formats
Great for: cleaning user input, normalizing data
Advanced Techniques
- String formatting → readable output with expressions
- in / startswith / endswith → fast checks
- Slicing [start:end] → extract substrings
- isdigit() / isalpha() → validate input types
Great for: validation, reporting, ETL pipelines
Creating and Accessing Strings
Strings are immutable sequences of characters. Use quotes (single, double, or triple for multi-line).
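To make the immutability point concrete, here is a minimal sketch showing that item assignment fails and that "editing" a string really means building a new one. The variable names are illustrative only.

```python
name = "Ana Garcia"

# Strings do not support item assignment, so this raises TypeError.
try:
    name[0] = "a"
    assignment_worked = True
except TypeError:
    assignment_worked = False

# To "change" a string, build a new one from slices of the old one.
renamed = "Eva" + name[3:]

print(assignment_worked)  # False
print(renamed)            # Eva Garcia
print(name)               # Ana Garcia (the original is untouched)
```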
# Creating strings
name = "Ana Garcia"
sku = 'SKU-12345'
note = """This is a
multi-line comment
for documentation"""
# Indexing (0-based)
first = name[0] # "A"
last = name[-1] # "a"
# Slicing [start:end]
first_name = name[:3] # "Ana"
last_name = name[4:] # "Garcia"
Essential String Methods
These methods return new strings (originals are immutable).
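Because each method returns a new string, calls can be chained. The sketch below uses a made-up product label to normalize a value in one expression and to confirm that the original is left unchanged; `raw_label` and `slug` are illustrative names.

```python
raw_label = "  Product Name  "

# Each call returns a new string, so the calls can be chained left to right.
slug = raw_label.strip().lower().replace(" ", "_")

print(slug)             # product_name
print(repr(raw_label))  # '  Product Name  ' (unchanged)
```

Chaining reads well for two or three steps; for longer pipelines, intermediate variables are usually easier to debug.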
text = " Product Name "
# Cleaning
clean = text.strip() # "Product Name"
lower = text.lower() # " product name "
upper = text.upper() # " PRODUCT NAME "
# Replacing
fixed = "color".replace("or", "our") # "colour" (replace() changes every match, so replacing "o" alone would give "coulour")
# Splitting and joining
parts = "apple,banana,cherry".split(",") # ["apple", "banana", "cherry"]
csv = ",".join(parts) # "apple,banana,cherry"
# Checking
print("Name" in text) # True
print(text.startswith(" Prod")) # True
print(text.endswith("me ")) # True
String Formatting
Embed expressions directly in strings for cleaner output.
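One way to embed expressions directly is an f-string (available in Python 3.6 and later). The sketch below uses the same example values as this section; `pct_over` and `summary` are illustrative names.

```python
name = "Ana"
sales = 125000
target = 100000

# Expressions inside {} are evaluated; the part after ":" is a format spec.
pct_over = (sales / target - 1) * 100
summary = f"{name} hit ${sales:,} in sales, {pct_over:.1f}% over target"

print(summary)  # Ana hit $125,000 in sales, 25.0% over target
```

The `:,` spec adds thousands separators and `:.1f` rounds to one decimal place, so no `str()` conversion is needed.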
name = "Ana"
sales = 125000
target = 100000
# Basic formatting
msg = name + " hit $" + str(sales) + " in sales!"
print(msg)
# With format method
report = "{} exceeded target by {:.1f}%".format(name, (sales/target - 1) * 100)
print(report)
Validation and Type Checks
Use built-in methods to validate input before processing.
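One caveat worth knowing: `isdigit()` returns False for negative numbers, decimals, and empty strings, because `-` and `.` are not digits. A minimal sketch of a quantity parser built on that behavior follows; `parse_quantity` is a hypothetical helper, not part of the lesson's own code.

```python
def parse_quantity(raw):
    """Return a non-negative int, or None if raw is not a plain whole number."""
    text = raw.strip()
    # isdigit() is False for "", "-5", and "4.2", so those fall through to None.
    # isascii() (Python 3.7+) excludes digit-like characters that int() rejects.
    if text.isascii() and text.isdigit():
        return int(text)
    return None

print(parse_quantity(" 42 "))  # 42
print(parse_quantity("-5"))    # None
print(parse_quantity("4.2"))   # None
print(parse_quantity(""))      # None
```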
code = "SKU12345"
email = "ana@company.com"
qty = "42"
# Type checks
print(qty.isdigit()) # True
print(code.isalnum()) # True (alphanumeric)
print(email.isalpha()) # False (has @ and .)
# Validation patterns
def is_valid_sku(s):
    # "SKU" prefix followed by exactly five digits, 8 characters in total
    return s.startswith("SKU") and len(s) == 8 and s[3:].isdigit()
print(is_valid_sku("SKU12345")) # True
print(is_valid_sku("ABC12345")) # False
Cornerstone Project: Email List Cleaner (step-by-step)
Build a tool to clean a messy email list: trim whitespace, normalize case, remove duplicates, validate formats, and flag suspicious entries. This is a real task analysts face when importing CRM data or preparing campaigns.
Step 1: Define the messy input
Simulate real-world data: extra spaces, mixed case, duplicates, invalid entries.
raw_emails = [
    " Ana@Company.COM ",
    "bob@company.com",
    "CAROL@COMPANY.COM",
    "ana@company.com", # duplicate (different case)
    "invalid-email", # missing @
    "dave@", # incomplete
    " eve@company.com ",
]
Step 2: Clean and normalize
Strip whitespace and convert to lowercase for consistent comparisons.
cleaned = []
for email in raw_emails:
    normalized = email.strip().lower()
    cleaned.append(normalized)
print("Cleaned:", cleaned)
Step 3: Validate format
Simple check: the address must contain exactly one @, with a dot somewhere in the domain after it.
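The simple check accepts some strings that are clearly not addresses, such as "@company.com" (empty name) or "ana@.com". A slightly stricter sketch is shown here; `is_plausible_email` is a hypothetical name, and its rules are illustrative assumptions rather than a full implementation of the email address standard.

```python
def is_plausible_email(email):
    """Stricter structural check; still not a full email-standard validator."""
    # partition splits on the first "@" and always returns three parts.
    local, sep, domain = email.partition("@")
    if not sep or not local or "@" in domain:
        return False  # no "@", empty name, or more than one "@"
    if " " in email:
        return False  # spaces are not expected after cleaning
    # The domain needs a dot that is neither its first nor its last character.
    return "." in domain and not domain.startswith(".") and not domain.endswith(".")

print(is_plausible_email("ana@company.com"))  # True
print(is_plausible_email("@company.com"))     # False
print(is_plausible_email("dave@"))            # False
print(is_plausible_email("ana@.com"))         # False
```

For production-grade validation, a regular expression or a dedicated library is usually a better fit than hand-written string checks.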
def is_valid_email(email):
    if "@" not in email:
        return False
    parts = email.split("@")
    if len(parts) != 2:
        return False
    domain = parts[1]
    return "." in domain
valid = [e for e in cleaned if is_valid_email(e)]
invalid = [e for e in cleaned if not is_valid_email(e)]
print("Valid:", valid)
print("Invalid:", invalid)
Step 4: Remove duplicates
Use a set to deduplicate, then convert back to a list.
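A set does not keep the original order, so the output order can differ from the input. If order matters, one common alternative is `dict.fromkeys()`, since dictionaries preserve insertion order in Python 3.7 and later. A minimal sketch with a small stand-in list (`valid_sample` is made up for the example):

```python
valid_sample = ["ana@company.com", "bob@company.com", "ana@company.com", "eve@company.com"]

# dict keys are unique and keep first-seen order, so this deduplicates in order.
unique_ordered = list(dict.fromkeys(valid_sample))

print(unique_ordered)
# ['ana@company.com', 'bob@company.com', 'eve@company.com']
print("Removed", len(valid_sample) - len(unique_ordered), "duplicates")  # Removed 1 duplicates
```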
unique = list(set(valid))
print("Removed", len(valid) - len(unique), "duplicates")
print("Unique emails:", unique)
Step 5: Flag suspicious patterns
Check for common issues: free email providers, short domains, etc.
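These checks can also be wrapped in a small function that returns a reason or None, which keeps the calling loop short. This is only a sketch: `flag_reason` is a hypothetical helper, and the provider names and the 6-character threshold are the ones used in this step.

```python
FREE_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com"}  # a set gives fast membership tests

def flag_reason(email):
    """Return a short reason string if the address looks suspicious, else None."""
    # rpartition splits on the last "@"; the domain is everything after it.
    domain = email.rpartition("@")[2]
    if domain in FREE_PROVIDERS:
        return "free provider"
    if len(domain) < 6:
        return "short domain"
    return None

print(flag_reason("ana@gmail.com"))    # free provider
print(flag_reason("bob@x.io"))         # short domain
print(flag_reason("eve@company.com"))  # None
```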
FREE_PROVIDERS = ["gmail.com", "yahoo.com", "hotmail.com"]
flagged = []
for email in unique:
    domain = email.split("@")[1]
    if domain in FREE_PROVIDERS:
        flagged.append((email, "free provider"))
    elif len(domain) < 6:
        flagged.append((email, "short domain"))
if flagged:
    print("Flagged for review:")
    for email, reason in flagged:
        print(" •", email, ":", reason)
Step 6: Generate final report
Combine everything into a summary function.
def clean_email_list(raw_list):
    # Clean and normalize
    cleaned = [e.strip().lower() for e in raw_list]
    # Validate
    valid = [e for e in cleaned if is_valid_email(e)]
    invalid = [e for e in cleaned if not is_valid_email(e)]
    # Deduplicate
    unique = list(set(valid))
    # Flag (same checks as Step 5)
    flagged = []
    for e in unique:
        domain = e.split("@")[1]
        if domain in FREE_PROVIDERS:
            flagged.append((e, "free provider"))
        elif len(domain) < 6:
            flagged.append((e, "short domain"))
    return {
        "valid": unique,
        "invalid": invalid,
        "duplicates_removed": len(valid) - len(unique),
        "flagged": flagged
    }
report = clean_email_list(raw_emails)
print("Valid:", len(report['valid']))
print("Invalid:", len(report['invalid']))
print("Duplicates removed:", report['duplicates_removed'])
print("Flagged:", len(report['flagged']))
How this helps at work
- Campaign prep → clean lists before sending, avoid bounces
- Data quality → catch bad imports before they break reports
- Compliance → flag personal emails in B2B lists
- Reusable → adapt for phone numbers, SKUs, addresses
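As one illustration of that reuse, the same strip, normalize, validate pattern can be applied to phone numbers. The sketch below assumes 10-digit numbers with optional punctuation, which is an assumption made purely for the example; `normalize_phone` is a hypothetical helper.

```python
def normalize_phone(raw):
    """Keep only the digits; return None unless exactly 10 remain (assumed format)."""
    digits = "".join(ch for ch in raw if ch.isascii() and ch.isdigit())
    return digits if len(digits) == 10 else None

raw_phones = [" (555) 123-4567 ", "555.123.4567", "12345"]
cleaned_phones = [normalize_phone(p) for p in raw_phones]

print(cleaned_phones)  # ['5551234567', '5551234567', None]
```

Once the values are normalized, the deduplication and reporting steps from the email cleaner carry over unchanged.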
Key Takeaways
- Strings are immutable → methods return new strings
- strip() / lower() → essential for cleaning user input
- String formatting → modern, readable output with expressions
- split() / join() → parse and build structured text
- Validation methods → isdigit(), startswith(), in operator
- Cornerstone → email cleaner solves real data quality problems
Next Steps
You now have the core string manipulation tools. Next, explore file handling to read and write CSV and JSON files, or dive into regular expressions for advanced pattern matching and validation.