Strings and String Manipulation

Strings hold text data such as names, emails, addresses, and product descriptions. Business analysts spend much of their time cleaning messy strings: trimming whitespace, fixing case, extracting codes, validating formats. Master string manipulation and you can turn dirty data into clean, analysis-ready data quickly.

Estimated reading time: 25–30 minutes

Essential String Operations

  • strip() → remove leading/trailing whitespace
  • lower() / upper() → normalize case for comparisons
  • split() / join() → parse CSV, build paths
  • replace() → fix typos, standardize formats

Great for: cleaning user input, normalizing data

Advanced Techniques

  • String formatting → readable output with expressions
  • in / startswith / endswith → fast checks
  • Slicing [start:end] → extract substrings
  • isdigit() / isalpha() → validate input types

Great for: validation, reporting, ETL pipelines

Creating and Accessing Strings

Strings are immutable sequences of characters. Create them with single, double, or triple quotes; triple quotes let a string span multiple lines.

python
# Creating strings
name = "Ana Garcia"
sku = 'SKU-12345'
note = """This is a
multi-line string
for documentation"""

# Indexing (0-based)
first = name[0]      # "A"
last = name[-1]      # "a"

# Slicing [start:end]
first_name = name[:3]   # "Ana"
last_name = name[4:]    # "Garcia"
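
Because strings are immutable, "changing" a character really means building a new string. The following is a minimal sketch of that idea; the variable names `renamed` and `error_message` are illustrative only.

```python
name = "Ana Garcia"

# Strings cannot be changed in place; item assignment raises TypeError.
try:
    name[0] = "a"
except TypeError as exc:
    error_message = str(exc)
    print("Cannot modify in place:", error_message)

# Build a new string instead and bind it to a new name.
renamed = "a" + name[1:]
print(renamed)        # ana Garcia

# Slices tolerate out-of-range bounds; a single index like name[100] does not.
print(name[4:100])    # Garcia
```

The original `name` is untouched after all of this, which is why string methods always hand back a new value.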

Essential String Methods

These methods return new strings (originals are immutable).

python
text = "  Product Name  "

# Cleaning
clean = text.strip()           # "Product Name"
lower = text.lower()           # "  product name  "
upper = text.upper()           # "  PRODUCT NAME  "

# Replacing
fixed = "color".replace("or", "our")  # "colour" (replace() changes every match, so "o" -> "ou" would give "coulour")

# Splitting and joining
parts = "apple,banana,cherry".split(",")  # ["apple", "banana", "cherry"]
csv = ",".join(parts)                     # "apple,banana,cherry"

# Checking
print("Name" in text)              # True
print(text.startswith("  Prod"))   # True
print(text.endswith("me  "))       # True
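
Since each method returns a new string, calls can be chained. The sketch below also shows a common cleaning idiom, assuming the goal is to collapse runs of internal whitespace; the sample value `messy` is made up for illustration.

```python
messy = "  premium   WIDGET \t (blue)  "

# split() with no argument splits on any run of whitespace and drops empties,
# so joining with a single space collapses internal gaps as well.
collapsed = " ".join(messy.split())
print(collapsed)        # premium WIDGET (blue)

# Methods return new strings, so they can be chained left to right.
tidy = collapsed.lower().replace("(", "").replace(")", "")
print(tidy)             # premium widget blue

# title() capitalizes each word, which is handy for display names.
print(tidy.title())     # Premium Widget Blue
```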

String Formatting

Embed expressions directly in strings for cleaner output.

python
name = "Ana"
sales = 125000
target = 100000

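# Concatenation works but needs str() and gets noisy
msg = name + " hit $" + str(sales) + " in sales!"
print(msg)                      # Ana hit $125000 in sales!

# format() method: placeholders are filled in order
report = "{} exceeded target by {:.1f}%".format(name, (sales / target - 1) * 100)
print(report)                   # Ana exceeded target by 25.0%

# f-string (Python 3.6+): expressions embedded directly in the string
report = f"{name} exceeded target by {(sales / target - 1) * 100:.1f}%"
print(report)                   # Ana exceeded target by 25.0%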
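
Format specifiers can do more than fix decimal places. The sketch below, assuming Python 3.6 or later for f-strings, shows thousands separators, percentages, and column alignment; the names `over`, `summary`, and `row` are illustrative.

```python
name = "Ana"
sales = 125000
target = 100000

# Expressions go inside braces; the part after the colon is a format spec.
print(f"{name} hit ${sales:,} in sales!")    # Ana hit $125,000 in sales!

# The % spec multiplies by 100 and appends a percent sign.
over = sales / target - 1
summary = f"{name} exceeded target by {over:.1%}"
print(summary)                               # Ana exceeded target by 25.0%

# Width and alignment (< left, > right) help line up simple text reports.
row = f"{name:<10}{sales:>12,}"
print(row)
```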

Validation and Type Checks

Use built-in methods to validate input before processing.

python
code = "SKU12345"
email = "ana@company.com"
qty = "42"

# Type checks
print(qty.isdigit())       # True
print(code.isalnum())      # True (alphanumeric)
print(email.isalpha())     # False (has @ and .)

# Validation patterns
def is_valid_sku(s):
    return s.startswith("SKU") and len(s) == 8 and s[3:].isdigit()  # "SKU" + 5 digits = 8 characters

print(is_valid_sku("SKU12345"))  # True
print(is_valid_sku("ABC12345"))  # False
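
One caveat worth knowing: isdigit() checks characters, not whether the text is a usable number. The sketch below illustrates a few edge cases and one possible alternative, a hypothetical helper named `parse_quantity` that simply attempts the conversion.

```python
# isdigit() only accepts strings made entirely of digit characters.
print("42".isdigit())      # True
print("-5".isdigit())      # False (minus sign)
print("3.14".isdigit())    # False (decimal point)
print("".isdigit())        # False (empty string)
print("²".isdigit())       # True, although int("²") raises ValueError

def parse_quantity(text):
    """Hypothetical helper: return an int, or None if text is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        return None

print(parse_quantity(" 42 "))   # 42
print(parse_quantity("-5"))     # -5
print(parse_quantity("4.2"))    # None
print(parse_quantity("²"))      # None
```

Trying the conversion and catching the error is often the more reliable choice when the value will be used as a number afterwards.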

Cornerstone Project — Email List Cleaner (step-by-step)

Build a tool to clean a messy email list: trim whitespace, normalize case, remove duplicates, validate formats, and flag suspicious entries. This is a real task analysts face when importing CRM data or preparing campaigns.

Step 1 — Define the messy input

Simulate real-world data: extra spaces, mixed case, duplicates, invalid entries.

python
raw_emails = [
    "  Ana@Company.COM  ",
    "bob@company.com",
    "CAROL@COMPANY.COM",
    "ana@company.com",      # duplicate (different case)
    "invalid-email",         # missing @
    "dave@",                 # incomplete
    "  eve@company.com  ",
]

Step 2 — Clean and normalize

Strip whitespace and convert to lowercase for consistent comparisons.

python
cleaned = []
for email in raw_emails:
    normalized = email.strip().lower()
    cleaned.append(normalized)

print("Cleaned:", cleaned)
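
Real exports sometimes contain blank cells or missing values, which the loop above would either pass along as empty strings or fail on. The following sketch shows one way to guard against that; the `raw_rows` data is made up, and whether to skip or report such rows is a design choice.

```python
# Hypothetical rows as they might come out of a spreadsheet export.
raw_rows = ["  Ana@Company.COM  ", "", None, "   ", "bob@company.com"]

cleaned_rows = []
for row in raw_rows:
    # Skip missing values; None has no strip() method.
    if not isinstance(row, str):
        continue
    normalized = row.strip().lower()
    # Skip rows that were only whitespace.
    if normalized:
        cleaned_rows.append(normalized)

print(cleaned_rows)   # ['ana@company.com', 'bob@company.com']
```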

Step 3 — Validate format

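Simple check: exactly one @, something before it, and a domain after it that contains a dot (but does not start or end with one).

python
def is_valid_email(email):
    # Exactly one "@" separating a local part and a domain
    parts = email.split("@")
    if len(parts) != 2:
        return False
    local, domain = parts
    # Reject an empty local part ("@company.com") and domains like ".com" or "company."
    if not local or domain.startswith(".") or domain.endswith("."):
        return False
    return "." in domain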

valid = [e for e in cleaned if is_valid_email(e)]
invalid = [e for e in cleaned if not is_valid_email(e)]

print("Valid:", valid)
print("Invalid:", invalid)
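
A check built from split() and in is deliberately simple; for example, it accepts "a@b..com". If a stricter test is needed, a regular expression is one option. The pattern below is a heuristic sketch, not full RFC validation, and the names `EMAIL_PATTERN` and `looks_like_email` are hypothetical.

```python
import re

# Local part, "@", then two or more dot-separated domain labels with no empty label.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s.]+(\.[^@\s.]+)+")

def looks_like_email(email):
    """Hypothetical stricter check; still a heuristic, not full RFC validation."""
    return EMAIL_PATTERN.fullmatch(email) is not None

candidates = ["ana@company.com", "@company.com", "dave@",
              "a@b..com", "eve@mail.company.com"]
for candidate in candidates:
    print(candidate, "->", looks_like_email(candidate))
```

Here only the first and last candidates pass. Even a stricter pattern cannot confirm that a mailbox exists; it only filters out obviously malformed text.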

Step 4 — Remove duplicates

Use a set to deduplicate, then convert back to a list. Sets are unordered, so the result may not keep the original order.

python
unique = list(set(valid))
print("Removed", len(valid) - len(unique), "duplicates")
print("Unique emails:", unique)
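
If the original order matters, or you want to know which addresses were repeated, two standard-library tools can help. This is a sketch with a hard-coded `valid` list standing in for the output of Step 3.

```python
from collections import Counter

valid = ["ana@company.com", "bob@company.com", "carol@company.com",
         "ana@company.com", "eve@company.com"]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7).
unique_ordered = list(dict.fromkeys(valid))
print(unique_ordered)

# Counter shows which addresses were repeated and how often.
counts = Counter(valid)
repeated = {email: n for email, n in counts.items() if n > 1}
print(repeated)   # {'ana@company.com': 2}
```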

Step 5 — Flag suspicious patterns

Check for common issues: free email providers, short domains, etc.

python
FREE_PROVIDERS = ["gmail.com", "yahoo.com", "hotmail.com"]

flagged = []
for email in unique:
    domain = email.split("@")[1]
    if domain in FREE_PROVIDERS:
        flagged.append((email, "free provider"))
    elif len(domain) < 6:
        flagged.append((email, "short domain"))

if flagged:
    print("Flagged for review:")
    for email, reason in flagged:
        print(" •", email, ":", reason)
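
The sample list from Step 1 contains only company.com addresses, so nothing gets flagged. To see both rules fire, the sketch below wraps them in a hypothetical helper, `flag_reason`, and runs it on made-up addresses.

```python
FREE_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com"}  # a set gives fast lookups

def flag_reason(email):
    """Hypothetical helper: return a reason string, or None if nothing looks off."""
    # rpartition splits on the last "@" and always returns three parts.
    _, _, domain = email.rpartition("@")
    if domain in FREE_PROVIDERS:
        return "free provider"
    if len(domain) < 6:
        return "short domain"
    return None

# Made-up sample addresses for illustration.
samples = ["ana@company.com", "bob@gmail.com", "carol@x.io"]
for email in samples:
    print(email, "->", flag_reason(email))
```

Returning a reason (or None) keeps the rule logic in one place, so adding a new rule does not require touching the reporting loop.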

Step 6 — Generate final report

Combine everything into a summary function.

python
def clean_email_list(raw_list):
    # Relies on is_valid_email (Step 3) and FREE_PROVIDERS (Step 5)

    # Clean and normalize
    cleaned = [e.strip().lower() for e in raw_list]

    # Validate
    valid = [e for e in cleaned if is_valid_email(e)]
    invalid = [e for e in cleaned if not is_valid_email(e)]

    # Deduplicate (set order is arbitrary)
    unique = list(set(valid))

    # Flag, using the same rules as Step 5
    flagged = []
    for e in unique:
        domain = e.split("@")[1]
        if domain in FREE_PROVIDERS:
            flagged.append((e, "free provider"))
        elif len(domain) < 6:
            flagged.append((e, "short domain"))

    return {
        "valid": unique,
        "invalid": invalid,
        "duplicates_removed": len(valid) - len(unique),
        "flagged": flagged
    }

report = clean_email_list(raw_emails)
print("Valid:", len(report['valid']))                     # Valid: 4
print("Invalid:", len(report['invalid']))                 # Invalid: 2
print("Duplicates removed:", report['duplicates_removed'])  # Duplicates removed: 1
print("Flagged:", len(report['flagged']))                 # Flagged: 0

How this helps at work

  • Campaign prep → clean lists before sending, avoid bounces
  • Data quality → catch bad imports before they break reports
  • Compliance → flag personal emails in B2B lists
  • Reusable → adapt for phone numbers, SKUs, addresses

Key Takeaways

  • Strings are immutable → methods return new strings
  • strip() / lower() → essential for cleaning user input
  • String formatting → modern, readable output with expressions
  • split() / join() → parse and build structured text
  • Validation methods → isdigit(), startswith(), in operator
  • Cornerstone → email cleaner solves real data quality problems

Next Steps

You now have the core tools for string manipulation. Next, explore file handling to read and write CSV and JSON files, or dive into regular expressions for advanced pattern matching and validation.