Strings are text data: names, emails, addresses, product descriptions. Business analysts spend a large share of their time cleaning messy strings by trimming whitespace, fixing case, extracting codes, and validating formats. Master string manipulation and you can turn dirty data into clean insights quickly.

Estimated reading time: 25–30 minutes

Essential String Operations

strip() → remove leading/trailing whitespace
lower() / upper() → normalize case for comparisons
split() / join() → parse CSV, build paths
replace() → fix typos, standardize formats

Great for: cleaning user input, normalizing data

Advanced Techniques

String formatting → readable output with expressions
in / startswith / endswith → fast checks
Slicing [start:end] → extract substrings
isdigit() / isalpha() → validate input types

Great for: validation, reporting, ETL pipelines

Creating and Accessing Strings

Strings are immutable sequences of characters. Use quotes (single, double, or triple for multi-line).

python

# Creating strings
name = "Ana Garcia"
sku = 'SKU-12345'
note = """This is a
multi-line comment
for documentation"""

# Indexing (0-based)
first = name[0]      # "A"
last = name[-1]      # "a"

# Slicing [start:end]
first_name = name[:3]   # "Ana"
last_name = name[4:]    # "Garcia"
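
Because strings are immutable, assigning to an index fails; the usual pattern is to build a new string from slices. The short sketch below illustrates this (the variable names `mutated` and `renamed` are only for illustration):

```python
name = "Ana Garcia"

# Item assignment is not allowed on strings and raises TypeError
try:
    name[0] = "E"
    mutated = True
except TypeError:
    mutated = False

# Build a new string from slices instead
renamed = "Eva" + name[3:]

print(mutated)   # False
print(renamed)   # Eva Garcia
print(name)      # Ana Garcia (original unchanged)
```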

Essential String Methods

These methods return new values and never modify the original string, because strings are immutable.

python

text = "  Product Name  "

# Cleaning
clean = text.strip()           # "Product Name"
lower = text.lower()           # "  product name  "
upper = text.upper()           # "  PRODUCT NAME  "

# Replacing (replace() changes every match, so use a specific pattern)
fixed = "color".replace("or", "our")  # "colour"

# Splitting and joining
parts = "apple,banana,cherry".split(",")  # ["apple", "banana", "cherry"]
csv = ",".join(parts)                     # "apple,banana,cherry"

# Checking
print("Name" in text)              # True
print(text.startswith("  Prod"))   # True
print(text.endswith("me  "))       # True
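
Since each method returns a new string, calls can be chained. As a rough sketch, the hypothetical helper `normalize_name` below (not a built-in) trims the ends, collapses repeated inner whitespace with split() and join(), and applies title case:

```python
def normalize_name(raw):
    """Trim, collapse repeated whitespace, and title-case a product name."""
    # split() with no argument splits on any run of whitespace
    words = raw.strip().split()
    return " ".join(words).title()

print(normalize_name("  wireless   MOUSE  "))  # Wireless Mouse
print(normalize_name("usb\tcable"))            # Usb Cable
```

Whether title case is the right target depends on the data; for comparisons, lower() is usually the safer choice.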

String Formatting

Embed expressions directly in strings for cleaner output.

python

name = "Ana"
sales = 125000
target = 100000

# Concatenation (works, but gets verbose)
msg = name + " hit $" + str(sales) + " in sales!"
print(msg)

# With format method
report = "{} exceeded target by {:.1f}%".format(name, (sales/target - 1) * 100)
print(report)
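
For embedding expressions directly in a string, f-strings (Python 3.6 and later) are one option. The sketch below reuses the same values; `pct_over` and `summary` are illustrative names:

```python
name = "Ana"
sales = 125000
target = 100000

# Anything inside {} is evaluated, then formatted with the spec after the colon
pct_over = (sales / target - 1) * 100
report = f"{name} exceeded target by {pct_over:.1f}%"
summary = f"{name} hit ${sales:,} in sales!"

print(report)   # Ana exceeded target by 25.0%
print(summary)  # Ana hit $125,000 in sales!
```

The `:.1f` and `:,` format specs are the same ones accepted by the format() method.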

Validation and Type Checks

Use built-in methods to validate input before processing.

python

code = "SKU12345"
email = "ana@company.com"
qty = "42"

# Type checks
print(qty.isdigit())       # True
print(code.isalnum())      # True (alphanumeric)
print(email.isalpha())     # False (has @ and .)

# Validation patterns
def is_valid_sku(s):
    return s.startswith("SKU") and len(s) == 8 and s[3:].isdigit()

print(is_valid_sku("SKU12345"))  # True
print(is_valid_sku("ABC12345"))  # False
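
One caveat: isdigit() returns False for negative numbers and decimals. When those are legitimate input, a common alternative is to attempt the conversion and catch the failure. A minimal sketch, assuming a hypothetical helper named `parse_number`:

```python
def parse_number(text):
    """Return the value as a float, or None if text is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None

print("-5".isdigit())          # False
print("3.14".isdigit())        # False
print(parse_number("-5"))      # -5.0
print(parse_number(" 3.14 "))  # 3.14
print(parse_number("n/a"))     # None
```

Note that float() also accepts strings such as "nan" and "inf", so a production check may need to exclude those.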

Cornerstone Project — Email List Cleaner (step-by-step)

Build a tool to clean a messy email list: trim whitespace, normalize case, remove duplicates, validate formats, and flag suspicious entries. This is a real task analysts face when importing CRM data or preparing campaigns.

Step 1 — Define the messy input

Simulate real-world data: extra spaces, mixed case, duplicates, invalid entries.

python

raw_emails = [
    "  Ana@Company.COM  ",
    "bob@company.com",
    "CAROL@COMPANY.COM",
    "ana@company.com",      # duplicate (different case)
    "invalid-email",         # missing @
    "dave@",                 # incomplete
    "  eve@company.com  ",
]

Step 2 — Clean and normalize

Strip whitespace and convert to lowercase for consistent comparisons.

python

cleaned = []
for email in raw_emails:
    normalized = email.strip().lower()
    cleaned.append(normalized)

print("Cleaned:", cleaned)

Step 3 — Validate format

Simple check: the address must contain exactly one @ and a dot somewhere in the domain after it.

python

def is_valid_email(email):
    if "@" not in email:
        return False
    parts = email.split("@")
    if len(parts) != 2:
        return False
    domain = parts[1]
    return "." in domain

valid = [e for e in cleaned if is_valid_email(e)]
invalid = [e for e in cleaned if not is_valid_email(e)]

print("Valid:", valid)
print("Invalid:", invalid)
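
The simple check still accepts entries such as "@company.com" or "ana@.com". If that matters for your list, a somewhat stricter variant might look like the sketch below. The name `is_valid_email_strict` is hypothetical, and this is still far from full address validation:

```python
def is_valid_email_strict(email):
    """Stricter variant of the simple check; not full RFC validation."""
    if " " in email or email.count("@") != 1:
        return False
    local, domain = email.split("@")
    if not local or "." not in domain:
        return False
    # Reject domains like ".com" or "company." that contain an empty label
    return all(domain.split("."))

candidates = ["ana@company.com", "@company.com", "ana@.com",
              "a b@company.com", "dave@"]
for candidate in candidates:
    print(candidate, "->", is_valid_email_strict(candidate))
```

Only the first candidate passes; the others fail on an empty local part, an empty domain label, a space, or a missing domain.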

Step 4 — Remove duplicates

Use a set to deduplicate, then convert back to a list. Note that a set does not preserve the original order of the entries.

python

unique = list(set(valid))
print("Removed", len(valid) - len(unique), "duplicates")
print("Unique emails:", unique)
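
If the original order matters (for example, to keep the first sign-up first), one alternative is dict.fromkeys(), since dictionaries keep insertion order in Python 3.7 and later. The `valid` list below is written out by hand to mirror the values produced by the earlier steps:

```python
valid = ["ana@company.com", "bob@company.com", "carol@company.com",
         "ana@company.com", "eve@company.com"]

# Dictionary keys are unique and remember insertion order (Python 3.7+)
unique_ordered = list(dict.fromkeys(valid))

print(unique_ordered)
# ['ana@company.com', 'bob@company.com', 'carol@company.com', 'eve@company.com']
```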

Step 5 — Flag suspicious patterns

Check for common issues: free email providers, short domains, etc.

python

FREE_PROVIDERS = ["gmail.com", "yahoo.com", "hotmail.com"]

flagged = []
for email in unique:
    domain = email.split("@")[1]
    if domain in FREE_PROVIDERS:
        flagged.append((email, "free provider"))
    elif len(domain) < 6:
        flagged.append((email, "short domain"))

if flagged:
    print("Flagged for review:")
    for email, reason in flagged:
        print(" •", email, ":", reason)

Step 6 — Generate final report

Combine everything into a summary function.

python

def clean_email_list(raw_list):
    # Clean and normalize
    cleaned = [e.strip().lower() for e in raw_list]
    
    # Validate
    valid = [e for e in cleaned if is_valid_email(e)]
    invalid = [e for e in cleaned if not is_valid_email(e)]
    
    # Deduplicate
    unique = list(set(valid))
    
    # Flag (same rules as Step 5)
    flagged = []
    for e in unique:
        domain = e.split("@")[1]
        if domain in FREE_PROVIDERS:
            flagged.append((e, "free provider"))
        elif len(domain) < 6:
            flagged.append((e, "short domain"))
    
    return {
        "valid": unique,
        "invalid": invalid,
        "duplicates_removed": len(valid) - len(unique),
        "flagged": flagged
    }

report = clean_email_list(raw_emails)
print("Valid:", len(report['valid']))
print("Invalid:", len(report['invalid']))
print("Duplicates removed:", report['duplicates_removed'])
print("Flagged:", len(report['flagged']))
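
To share the result in an email or a log, the summary dictionary can be rendered as a single text block. The sketch below assumes a hypothetical helper `format_report` and uses made-up sample data with the same keys as the dictionary returned above:

```python
def format_report(report):
    """Render the summary dictionary as a multi-line string."""
    lines = [
        f"Valid: {len(report['valid'])}",
        f"Invalid: {len(report['invalid'])}",
        f"Duplicates removed: {report['duplicates_removed']}",
    ]
    for email, reason in report["flagged"]:
        lines.append(f"Flagged: {email} ({reason})")
    return "\n".join(lines)

# Illustrative sample only, not output from a real list
sample = {
    "valid": ["ana@company.com", "bob@gmail.com"],
    "invalid": ["dave@"],
    "duplicates_removed": 1,
    "flagged": [("bob@gmail.com", "free provider")],
}
print(format_report(sample))
```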

How this helps at work

Campaign prep → clean lists before sending, avoid bounces
Data quality → catch bad imports before they break reports
Compliance → flag personal emails in B2B lists
Reusable → adapt for phone numbers, SKUs, addresses
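
As an illustration of the "reusable" point, the same clean-then-validate pattern can be sketched for phone numbers. The helper `normalize_phone` is hypothetical, and the 10-digit rule is an assumption that will not fit every country's format:

```python
def normalize_phone(raw):
    """Keep digits only; return None unless exactly 10 digits remain."""
    # Assumption: a valid number has exactly 10 digits
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits if len(digits) == 10 else None

print(normalize_phone("(555) 123-4567"))  # 5551234567
print(normalize_phone("555.123.4567"))    # 5551234567
print(normalize_phone("123-4567"))        # None
```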

Key Takeaways

Strings are immutable → methods return new strings
strip() / lower() → essential for cleaning user input
String formatting → modern, readable output with expressions
split() / join() → parse and build structured text
Validation methods → isdigit(), startswith(), in operator
Cornerstone → email cleaner solves real data quality problems

Next Steps

You now have the core tools for string manipulation. Next, explore file handling to read and write CSV and JSON files, or dive into regular expressions for advanced pattern matching and validation.
