8.2. Regex#

import sys
from pathlib import Path

current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent  # ← Add project root, not chapters
        break
else:
    project_root = Path.cwd().parent.parent

sys.path.insert(0, str(project_root))

from shared import thinkpython, diagram, jupyturtle
from shared.download import download

# Register as top-level modules so direct imports work in subsequent cells
sys.modules['thinkpython'] = thinkpython
sys.modules['diagram'] = diagram
sys.modules['jupyturtle'] = jupyturtle

String methods let you search and manipulate text, but only for fixed patterns. A regular expression (regex) is a sequence of characters that defines a search pattern, which lets you search, match, and manipulate text in ways that simple string methods cannot. For example:

Task                                     Use
Check if string starts with ‘http’       str.startswith()
Replace all spaces with underscores      str.replace()
Extract all email addresses from text    regex
Validate a phone number format           regex
Find words matching a complex pattern    regex

The rule of thumb: if the pattern is fixed and simple, use string methods. If the pattern is variable or complex, use regex.

For example, to search for a fixed pattern in a text, we can use the find() string method, which returns the index of the first occurrence, or -1 if the pattern is not found.

text = "I am Dracula; and I bid you welcome, Mr. Harker,\
    to my house."
pattern = 'Dracula'
text.find(pattern)
5
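
To make the return values concrete, here is a minimal sketch that reuses the sentence above. The names find_hit, find_miss, and has_harker are illustrative only and do not appear elsewhere in this chapter.

```python
text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."

# find() returns the index of the first occurrence ...
find_hit = text.find('Dracula')
# ... and -1 when the substring is absent
find_miss = text.find('Count')
# the in operator answers the yes/no question directly
has_harker = 'Harker' in text

print(find_hit, find_miss, has_harker)   # 5 -1 True
```

Because -1 is also a valid (negative) index in Python, it is safer to compare the result with -1 explicitly than to use it directly as a position.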

8.2.1. Escape Sequences and Raw Strings#

Before using the regular expression (re) functions, we need to understand regex escapes and raw strings.

Regex patterns use backslashes heavily. In regex, the backslash \ introduces escape sequences, which create special patterns or allow metacharacters to be treated as literal characters.

For example, common escape sequences include:

Pattern   Meaning                                        Example match
\d        digit                                          7
\w        word character (letters, digits, underscore)   A, x, 9, _ ([a-zA-Z0-9_])
\s        whitespace character                           space, tab, newline
\.        literal dot                                    .
\$        literal dollar sign                            $
\\        literal backslash                              \

8.2.1.1. Raw Strings#

A raw string is a string literal prefixed with r, which tells Python to treat backslashes \ as literal characters rather than as the start of an escape sequence.

  • Prefix a pattern with r so that Python does not interpret its backslashes before the regex engine sees them; r'\.' reaches the engine as \. unchanged.

  • Raw strings avoid double escaping and make patterns easier to read.

  • Without raw strings, you often need extra backslashes, like '\\d+'.

regular = "\n"   # newline character
raw      = r"\n" # literally backslash + n (two characters)

print(regular)  # prints a newline
print(raw)      # prints \n
\n

Use escapes for literal special regex characters too (for example \., \$, \?).

print('\\\\')           # two backslashes, written in a normal string
print(r'\\')            # two backslashes, written in a raw string (same result)
print(r'\\\n')          # raw string: three backslashes + n (four characters)
print(r'\\\n' == '\\\\n')   # False: the raw string is three backslashes + n; the normal string is two backslashes + n
\\
\\
\\\n
False

A raw string tells Python to leave each backslash \ as a literal character, so the regex engine receives the pattern exactly as written.
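
As a quick check of the double-escaping point, the sketch below compares a normal string with doubled backslashes against its raw-string counterpart. The sample text is made up for illustration.

```python
import re

normal = '\\d+'    # normal string: the backslash must be doubled
raw = r'\d+'       # raw string: written exactly as the regex engine should see it

print(normal == raw)            # True: both hold backslash, d, plus
print(len(raw))                 # 3

sample = "Room 101, floor 3"    # made-up sample text
print(re.findall(raw, sample))  # ['101', '3']
```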

8.2.2. The re Module#

Python’s built-in re module provides regex support. The 6 most commonly used regex functions are:

Function         Description                                   Sample Syntax                 Return
re.search()      Find first match anywhere in the string       re.search(pattern, text)      Match object or None
re.match()       Match only at the start of the string         re.match(pattern, text)       Match object or None
re.findall()     Find all matches; return as a list            re.findall(pattern, text)     list of strings
re.sub()         Find and replace matches                      re.sub(pattern, repl, text)   str
re.split()       Split string on a pattern                     re.split(pattern, text)       list of strings
re.fullmatch()   Match the entire string against the pattern   re.fullmatch(pattern, text)   Match object or None
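
The difference between the three matching functions is easiest to see on a single string. The sketch below applies re.search(), re.match(), and re.fullmatch() to the same made-up input; the variable names are illustrative.

```python
import re

code = "ID 4821 confirmed"   # made-up input

# search: the digits may appear anywhere in the string
found_anywhere = re.search(r'\d+', code)
# match: the digits would have to begin at index 0, so this fails
found_at_start = re.match(r'\d+', code)
# fullmatch: the whole string would have to be digits, so this fails too
found_whole = re.fullmatch(r'\d+', code)

print(found_anywhere.group())                 # 4821
print(found_at_start)                         # None
print(found_whole)                            # None
print(re.fullmatch(r'\d+', '4821').group())   # 4821
```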

8.2.2.1. The Match object#

re.search(), re.match(), and re.fullmatch() return a Match object when the pattern matches.

For example, re.search(pattern, text) scans through text and returns:

  • a Match object for the first location where pattern is found, or

  • None if the pattern is not found anywhere in the string.

A Match object has the following commonly used attributes and methods:

Attribute / Method   Description                                  Example
.group()             Returns the matched substring                m.group() → 'Dracula'
.start()             Index where the match begins                 m.start() → 5
.end()               Index just past the last matched character   m.end() → 12
.span()              Tuple of (start, end)                        m.span() → (5, 12)
.string              The original string that was searched        m.string → 'I am Dracula...'

import re

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
pattern = 'Dracula'

result = re.search(pattern, text)     ### pattern: Dracula; text: the line
result                              ### the Match object
<re.Match object; span=(5, 12), match='Dracula'>

If the pattern appears in the text, search returns a Match object that contains the results of the search.

  1. String: Among other information, it has an attribute named string that contains the text that was searched.

  2. Group: It provides a method called group that returns the part of the text that matched the pattern.

  3. Span and Start/End: It provides a method called span that returns the indices where the match starts and ends, along with start and end methods that return each index separately.

print(result.string)
print(result.group())
print(result.start())
print(result.end())
print(result.span())
I am Dracula; and I bid you welcome, Mr. Harker, to my house.
Dracula
5
12
(5, 12)

Note

.group() returns the matched substring from the text — the portion of the text that the pattern matched against. In simple cases like re.search('Dracula', text), the match equals the pattern string. But with a regex like r'\$[\d.]+', .group() would return something like '$42.99' — the actual text that matched, not the pattern expression itself.

If the pattern doesn’t appear in the text, the return value from search is None. So we can check whether the search was successful by checking whether the result is None.

result = re.search('Count', text)
print(result)

result is None
None
True
s = "This is a test of the regular expression system."
print(re.findall('is', s))  # ['is', 'is']
print(re.findall('is.', s)) # ['is ', 'is ']    ### 'is' followed by any character (space in this case)
print(re.findall('is.?', s)) # ['is ', 'is ']   ### 'is' followed by zero or one character (space in this case)
print(re.findall('is.?', s, re.IGNORECASE)) # ['is ', 'is '] ### same as above, but case-insensitive   
print(re.findall('is.?', s, re.IGNORECASE | re.DOTALL)) # ['is ', 'is ']    ### same as above, but also makes '.' match newline characters (not relevant in this case since there are no newlines)
['is', 'is']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']

In the next example, the + in the pattern means one or more occurrences of the preceding element; here that element is the character class [\d.].

import re

text = "The price is $42.99 and $7.50"

# findall — get all matches
print(re.findall(r"\$[\d.]+", text))         # ['$42.99', '$7.50']

# search — first match object
m = re.search(r"\$[\d.]+", text)
print(m.group())                              # '$42.99'
print(m.start(), m.end())                     # position in string

# sub — replace
print(re.sub(r"\$[\d.]+", "PRICE", text))    # 'The price is PRICE and PRICE'

# split
print(re.split(r"\s+", "one  two   three"))  # ['one', 'two', 'three']
['$42.99', '$7.50']
$42.99
13 19
The price is PRICE and PRICE
['one', 'two', 'three']
### EXERCISE: Regex Escape Sequences
# Difficulty: Basic
s = "Price: $19.95, code=A_7, spaces here"
# 1. Extract all digit sequences
# 2. Extract all word tokens
# 3. Extract literal '$' and literal '.' matches
### Your code starts here:



### Your code ends here.

import re

# Solution
s = "Price: $19.95, code=A_7, spaces here"
print(re.findall(r'\d+', s))
print(re.findall(r'\w+', s))
print(re.findall(r'\$', s))
print(re.findall(r'\.', s))
['19', '95', '7']
['Price', '19', '95', 'code', 'A_7', 'spaces', 'here']
['$']
['.']
### EXERCISE: The Match Object
# Difficulty: Basic
import re
text = "Customer ID: 4892, Order date: 2024-03-15"
# 1. Use re.search() to find the first 4-digit number in text
# 2. Print the matched string, start index, end index, and span
### Your code starts here:



### Your code ends here.

# Solution
import re
text = "Customer ID: 4892, Order date: 2024-03-15"
m = re.search(r'\d{4}', text)
print(m.group())
print(m.start())
print(m.end())
print(m.span())
4892
13
17
(13, 17)

8.2.3. Metacharacters#

Metacharacters are characters that carry special meaning inside a regex pattern — instead of matching themselves literally, they instruct the regex engine to do something specific, like match any character, mark a boundary, or repeat a pattern. There are 14 of them in Python’s re module. You need to escape them if you want them to be regular characters.

Type                         Character   Meaning                                             Example
Wildcard                     .           Matches any character except newline                c.t → cat, cot
Anchor                       ^           Start of string                                     ^Hello
Anchor                       $           End of string                                       end$
Quantifier                   *           0 or more repetitions                               a*
Quantifier                   +           1 or more repetitions                               a+
Quantifier                   ?           Optional (0 or 1) / makes quantifier lazy           colou?r
Quantifier                   {}          Specific repetition range                           \d{3}
Character Class Delimiters   []          Defines a set of allowed characters                 [a-z]
Grouping Delimiters          ()          Groups patterns and captures matches                (cat|dog)
Escape                       \           Escapes metacharacters or forms special sequences   \d, \w, \.
Alternation                  |           Logical OR between patterns                         cat|dog

To match any of these characters literally, you must escape it with a backslash (for example \. or \?). The following sections look at the metacharacters by type.
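
When the text to match comes from data rather than from a hand-written pattern, escaping each metacharacter by hand is error-prone. The sketch below shows what happens without escaping, then manual escaping, then the standard library's re.escape(), which adds the backslashes for you. The text and price values are made up.

```python
import re

text = "Total: $5.00 (approx.) or $5x00"   # made-up sample
price = "$5.00"

# Unescaped: '$' is an end-of-string anchor, so nothing can follow it
unescaped = re.findall(r'$5.00', text)
print(unescaped)       # []

# Only '$' escaped: '.' is still a wildcard and also accepts 'x'
half_escaped = re.findall(r'\$5.00', text)
print(half_escaped)    # ['$5.00', '$5x00']

# Escaped by hand: '$' and '.' are both literal
manual = re.findall(r'\$5\.00', text)
print(manual)          # ['$5.00']

# Escaped automatically from a plain string
automatic = re.findall(re.escape(price), text)
print(automatic)       # ['$5.00']
```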

8.2.4. Quantifiers#

Quantifiers tell the regex engine how many times the preceding character, group, or character class should match.

Quantifier   Meaning           Example   Matches
*            0 or more         ab*       a, ab, abb, abbb
+            1 or more         ab+       ab, abb, abbb (not a)
?            0 or 1            ab?       a or ab only
{n}          Exactly n         \d{3}     123, 456
{n,}         n or more         \d{2,}    12, 123, 1234…
{n,m}        Between n and m   \d{2,4}   12, 123, 1234

By default quantifiers are greedy (match as much as possible). Add ? to make them lazy.

text = "<b>bold</b> and <i>italic</i>"

# Greedy — matches as much as possible
print(re.findall(r"<.+>", text))     # ['<b>bold</b> and <i>italic</i>']

# Lazy — matches as little as possible
print(re.findall(r"<.+?>", text))    # ['<b>', '</b>', '<i>', '</i>']

# Exact and ranged quantifiers
print(re.findall(r"\d{3}", "123 4567 89"))     # ['123', '456']
print(re.findall(r"\d{2,4}", "1 12 123 1234")) # ['12', '123', '1234']
['<b>bold</b> and <i>italic</i>']
['<b>', '</b>', '<i>', '</i>']
['123', '456']
['12', '123', '1234']

In <.+>, the + is greedy. It matches as many characters as possible while still allowing the overall pattern to succeed. So it gobbles everything from the first < all the way to the last >.

Adding ? after a quantifier switches it to lazy mode — instead of matching as much as possible, it now matches as little as possible. So <.+?> still needs at least one character (that’s the +), but stops at the earliest > it can find.
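
A negated character class is a common alternative to a lazy quantifier for this kind of tag matching. The sketch below is one way to write it, not the only one: [^>]+ matches one or more characters that are not >, so the match cannot run past the first closing bracket.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

lazy = re.findall(r"<.+?>", text)       # lazy quantifier
negated = re.findall(r"<[^>]+>", text)  # negated character class

print(lazy)             # ['<b>', '</b>', '<i>', '</i>']
print(negated == lazy)  # True
```

Both patterns give the same result on this input; the negated class states the stopping condition explicitly instead of relying on lazy matching.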

8.2.4.1. Greedy vs Non-greedy#

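Quantifiers like * and + are greedy by default; add ? to make them non-greedy. The cell below repeats the tag example with * in place of +, and also shows a case-insensitive email search that uses the re.IGNORECASE flag (flags are covered under Advanced Topics).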

text_block = """Title: Notes
Email: ALICE@example.com
Email: bob@Example.org"""

# IGNORECASE
emails = re.findall(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}', text_block, flags=re.IGNORECASE)
print(emails)

# Greedy vs non-greedy on tags
html = "<b>bold</b><i>italic</i>"
print(re.findall(r'<.*>', html))     # greedy
print(re.findall(r'<.*?>', html))    # non-greedy
['ALICE@example.com', 'bob@Example.org']
['<b>bold</b><i>italic</i>']
['<b>', '</b>', '<i>', '</i>']
### EXERCISE: Flags and Quantifiers
# Difficulty: Challenge
text_block = """Task: clean logs
ERROR: Disk full
error: retry failed
INFO: done"""
# 1. Extract all lines that start with 'error' (case-insensitive) using MULTILINE
# 2. From '<x>1</x><x>2</x>', extract tags non-greedily
### Your code starts here:



### Your code ends here.
# Solution
text_block = """Task: clean logs
ERROR: Disk full
error: retry failed
INFO: done"""

errs = re.findall(r'^error:.*$', text_block, flags=re.IGNORECASE | re.MULTILINE)
print(errs)

print(re.findall(r'<.*?>', '<x>1</x><x>2</x>'))
['ERROR: Disk full', 'error: retry failed']
['<x>', '</x>', '<x>', '</x>']

8.2.5. Anchors#

Anchors don’t match characters — they match positions in the string.

Anchor   Meaning
^        Start of string (or line with re.MULTILINE)
$        End of string (or line)
\b       Word boundary
\B       Non-word boundary

# ^ and $
print(re.findall(r"^\w+", "Hello world"))     # ['Hello'] — only at start
print(re.findall(r"\w+$", "Hello world"))     # ['world'] — only at end

# Word boundary \b
text = "cat catfish concatenate"
print(re.findall(r"\bcat\b", text))           # ['cat'] — whole word only
print(re.findall(r"cat", text))               # ['cat', 'cat', 'cat'] — anywhere

# Multiline
multi = "line1\nline2\nline3"
print(re.findall(r"^\w+", multi, re.MULTILINE))  # ['line1', 'line2', 'line3']
['Hello']
['world']
['cat']
['cat', 'cat', 'cat']
['line1', 'line2', 'line3']
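
The table also lists \B, which the examples above do not exercise. As a small sketch, \B matches only where there is no word boundary, so cat\B finds cat only when another word character follows it. The variable names inside and starts are illustrative.

```python
import re

text = "cat catfish concatenate"

# 'cat' followed by another word character (no boundary after the 't')
inside = re.findall(r"cat\B", text)
print(inside)    # ['cat', 'cat']  (from 'catfish' and 'concatenate')

# Whole words that start with 'cat'
starts = re.findall(r"\bcat\w*", text)
print(starts)    # ['cat', 'catfish']
```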

8.2.6. Character Classes#

Before writing larger patterns, it helps to know the core building blocks. Character classes match one character from a defined set. They’re written with square brackets [ ].

Pattern       Matches
[aeiou]       any single vowel
[a-z]         any lowercase letter
[A-Z]         any uppercase letter
[0-9]         any digit
[a-zA-Z0-9]   any alphanumeric character
[^aeiou]      any character not a vowel (^ negates)

Shorthand classes (work outside brackets too):

Pattern   Meaning                               Example Match
.         Any character (except newline)
\d        digit                                 7
\w        word char (letter/digit/underscore)   A, x, 9, _
\s        whitespace                            space, tab

Observe the escape sequence '\w'.

import re
s = "This is a regular expression."
print(re.findall(r'\w', s))     ### \w matches any alphanumeric character (letters, digits, and underscore)
print(re.findall(r'\w+', s))    ### + means "one or more occurrences of the preceding pattern"
print(re.findall(r'\w*', s))    ### * means "zero or more occurrences of the preceding pattern"
['T', 'h', 'i', 's', 'i', 's', 'a', 'r', 'e', 'g', 'u', 'l', 'a', 'r', 'e', 'x', 'p', 'r', 'e', 's', 's', 'i', 'o', 'n']
['This', 'is', 'a', 'regular', 'expression']
['This', '', 'is', '', 'a', '', 'regular', '', 'expression', '', '']

\s matches these whitespace characters:

Character   Name
\n          newline
\t          tab
\r          carriage return
' '         space
\f          form feed
\v          vertical tab
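
As a quick check of the whitespace list above, this sketch runs \s over a made-up string that contains one of each listed character.

```python
import re

# space, tab, newline, carriage return, form feed, vertical tab
sample = "a b\tc\nd\re\ff\vg"

whitespace = re.findall(r'\s', sample)
print(whitespace)   # [' ', '\t', '\n', '\r', '\x0c', '\x0b']

pieces = re.split(r'\s', sample)
print(pieces)       # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

Python displays form feed and vertical tab as '\x0c' and '\x0b'; they are the same characters as '\f' and '\v'.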

Use raw strings like r'\d+' for regex patterns so backslashes are interpreted correctly.

import re

text = "Hello World 123! foo_bar"

print(re.findall(r"\d", text))        # individual digits
print(re.findall(r"\d+", text))       # consecutive digits
print(re.findall(r"\w+", text))       # words (incl. underscore)
print(re.findall(r"[A-Z][a-z]+", text))  # capitalized words
print(re.findall(r"[^a-zA-Z\s]+", text)) # non-alpha, non-space
['1', '2', '3']
['123']
['Hello', 'World', '123', 'foo_bar']
['Hello', 'World']
['123!', '_']
sample = "User_42 logged in at 09:30 on 2026-03-11"

print(re.findall(r'\d+', sample))                  # all digit runs
print(re.findall(r'[A-Za-z_]+', sample))            # word-like alphabetic tokens
print(re.findall(r'\d{2}:\d{2}', sample))        # HH:MM time
print(re.findall(r'\d{4}-\d{2}-\d{2}', sample))  # YYYY-MM-DD date
['42', '09', '30', '2026', '03', '11']
['User_', 'logged', 'in', 'at', 'on']
['09:30']
['2026-03-11']
### EXERCISE: Regex Syntax Essentials
# Difficulty: Basic
s = "IDs: A12, B7, C999"
# 1. Extract all uppercase letters
# 2. Extract all digit sequences
# 3. Extract letter+digit tokens like A12, B7, C999
### Your code starts here:



### Your code ends here.

# Solution
s = "IDs: A12, B7, C999"
print(re.findall(r'[A-Z]', s))
print(re.findall(r'\d+', s))
print(re.findall(r'[A-Z]\d+', s))
['I', 'D', 'A', 'B', 'C']
['12', '7', '999']
['A12', 'B7', 'C999']

8.2.7. Groups & Capturing#

Parentheses () group part of a pattern into a single unit. A capturing group also saves the matched text so you can extract or reuse it afterward. Use non-capturing groups (?:...) when you need grouping for structure but don’t need to extract the text. Named groups (?P<name>...) let you refer to captured text by name instead of number.

Syntax          Meaning
(...)           Capturing group
(?:...)         Non-capturing group
(?P<name>...)   Named group
|               Alternation (OR)

# Capturing groups
dates = "2024-01-15 and 2023-12-31"
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", dates))
# [('2024', '01', '15'), ('2023', '12', '31')]

# Named groups: m is one object from the search, so gives only one match, not all matches
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", dates)
print(m.group('year'), m.group('month'), m.group('day'))

# Alternation
print(re.findall(r"cat|dog", "I have a cat and a dog"))  # ['cat', 'dog']

# Using groups in sub()
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", dates))
# '01/15/2024 and 12/31/2023'
[('2024', '01', '15'), ('2023', '12', '31')]
2024 01 15
['cat', 'dog']
01/15/2024 and 12/31/2023
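
Capturing groups change what re.findall() returns: when the pattern contains a group, findall returns the group contents rather than the whole match. The sketch below contrasts a capturing group with a non-capturing one and also shows Match.groupdict() for named groups. The log string and the group names level and code are made up for illustration.

```python
import re

log = "err-404 err-500 warn-301"   # made-up sample

# Capturing group: findall returns only what the group captured
captured = re.findall(r"(err|warn)-\d+", log)
print(captured)    # ['err', 'err', 'warn']

# Non-capturing group: findall returns the whole match
whole = re.findall(r"(?:err|warn)-\d+", log)
print(whole)       # ['err-404', 'err-500', 'warn-301']

# Named groups can be collected into a dictionary
m = re.search(r"(?P<level>err|warn)-(?P<code>\d+)", log)
print(m.groupdict())   # {'level': 'err', 'code': '404'}
```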

8.2.7.1. Groups and Extraction#

Parentheses create capture groups. You can extract parts of a match with .group(1), .group(2), etc. Named groups can make patterns more readable.

record = "OrderID=4821; Customer=Alice; Total=$39.50"
pattern = r'OrderID=(\d+); Customer=([A-Za-z]+); Total=\$(\d+(?:\.\d{2})?)'
m = re.search(pattern, record)

print(m.group(0))  # full match
print(m.group(1))  # order id
print(m.group(2))  # customer
print(m.group(3))  # total amount
OrderID=4821; Customer=Alice; Total=$39.50
4821
Alice
39.50
### EXERCISE: Capture Groups
# Difficulty: Intermediate
line = "name=Bob,age=27,dept=Sales"
# 1. Use one regex with 3 capture groups to extract name, age, dept
# 2. Print each extracted value on its own line
### Your code starts here:



### Your code ends here.

# Solution
line = "name=Bob,age=27,dept=Sales"
m = re.search(r'name=([A-Za-z]+),age=(\d+),dept=([A-Za-z]+)', line)
print(m.group(1))
print(m.group(2))
print(m.group(3))
Bob
27
Sales

8.2.8. Alternation (OR)#

Use | to match one of multiple patterns.

import re

# pattern using alternation
pattern = r"\.(jpg|png|gif)$"

files = [
    "photo.jpg",
    "diagram.png",
    "animation.gif",
    "document.pdf",
    "archive.zip"
]

for file in files:
    if re.search(pattern, file):
        print(f"{file} -> valid image file")
    else:
        print(f"{file} -> not an image")
photo.jpg -> valid image file
diagram.png -> valid image file
animation.gif -> valid image file
document.pdf -> not an image
archive.zip -> not an image
import re

pattern = r"\b(coffee|tea)\b"

sentences = [
    "I like coffee in the morning.",
    "She prefers tea at night.",
    "He drinks water.",
    "Coffee is my favorite."
]

for sentence in sentences:
    match = re.search(pattern, sentence, re.IGNORECASE)

    if match:
        print(f"Found beverage: {match.group()}")
    else:
        print("No beverage found")
Found beverage: coffee
Found beverage: tea
No beverage found
Found beverage: Coffee
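
Both patterns above wrap the alternatives in parentheses, and that matters: alternation has the lowest precedence of the regex operators, so without grouping an anchor binds to one alternative only. A brief sketch of the difference, with made-up inputs:

```python
import re

words = ["cat", "dog", "catalog", "hotdog"]

# Without grouping this reads as: '^cat' OR 'dog$'
loose = [w for w in words if re.search(r"^cat|dog$", w)]
print(loose)    # ['cat', 'dog', 'catalog', 'hotdog']

# With grouping, both anchors apply to both alternatives
strict = [w for w in words if re.search(r"^(?:cat|dog)$", w)]
print(strict)   # ['cat', 'dog']
```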

8.2.9. Advanced Topics#

8.2.9.1. Flags#

Flags change matching behavior:

Flag            Shorthand   Meaning
re.IGNORECASE   re.I        Case-insensitive matching
re.MULTILINE    re.M        ^/$ match line start/end
re.DOTALL       re.S        . matches newline too
re.VERBOSE      re.X        Allow comments/whitespace in pattern

# IGNORECASE
print(re.findall(r"hello", "Hello HELLO hello", re.I))  # ['Hello', 'HELLO', 'hello']

# DOTALL — dot matches newline
text = "<div>\nsome content\n</div>"
print(re.findall(r"<div>.*</div>", text, re.DOTALL))  # matches across lines

# VERBOSE — write readable patterns with comments
email_pattern = re.compile(r"""
    [\w.+-]+       # username
    @              # at sign
    [\w-]+         # domain name
    \.             # dot
    [\w.]+         # TLD
""", re.VERBOSE)

print(email_pattern.findall("Contact us at hello@example.com or support@test.org"))
['Hello', 'HELLO', 'hello']
['<div>\nsome content\n</div>']
['hello@example.com', 'support@test.org']
### EXERCISE: Regex Flags
# Difficulty: Basic
import re
log = """INFO: Server started
error: disk full
WARNING: low memory
ERROR: connection lost"""
# 1. Use re.findall() with re.IGNORECASE | re.MULTILINE to
#    extract every line that begins with 'error'
# 2. Print the list of matches
### Your code starts here:



### Your code ends here.

# Solution
import re
log = """INFO: Server started
error: disk full
WARNING: low memory
ERROR: connection lost"""
results = re.findall(r'^error.*$', log, flags=re.IGNORECASE | re.MULTILINE)
print(results)
['error: disk full', 'ERROR: connection lost']

8.2.9.2. Compiled Patterns#

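Use re.compile() when reusing the same pattern multiple times. The pattern is compiled once and stored in a reusable object, which keeps the code cleaner and skips the per-call pattern lookup (the re module already caches recently used patterns, so the speed gain is usually small).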

# Compile once, use many times
phone_pattern = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

texts = [
    "Call me at 123-456-7890",
    "My number is 987.654.3210",
    "No phone here",
    "Reach us at 555-123-4567 or 800-999-0000"
]

for t in texts:
    matches = phone_pattern.findall(t)
    if matches:
        print(f"Found: {matches} in '{t}'")
Found: ['123-456-7890'] in 'Call me at 123-456-7890'
Found: ['987.654.3210'] in 'My number is 987.654.3210'
Found: ['555-123-4567', '800-999-0000'] in 'Reach us at 555-123-4567 or 800-999-0000'
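
A compiled pattern object offers the same operations as the module-level functions, as methods. The sketch below is illustrative only: it reuses the phone pattern from above on an invented sample string.

```python
import re

phone_pattern = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
sample = "Office: 555-123-4567, fax: 555.987.6543"   # invented sample

print(phone_pattern.pattern)          # the original pattern string

first = phone_pattern.search(sample)
print(first.group(), first.span())    # 555-123-4567 (8, 20)

masked = phone_pattern.sub("[phone]", sample)
print(masked)                         # Office: [phone], fax: [phone]

# The module-level call with the same pattern string gives the same result
same = re.findall(phone_pattern.pattern, sample) == phone_pattern.findall(sample)
print(same)                           # True
```
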
### EXERCISE: Compiled Patterns
# Difficulty: Basic
import re
emails = ['alice@example.com', 'not-an-email', 'bob@company.org', 'charlie_at_test.net']
# 1. Compile a regex pattern that matches a simple email address
# 2. Print each email with True or False using the compiled pattern
### Your code starts here:



### Your code ends here.

# Solution
import re
emails = ['alice@example.com', 'not-an-email', 'bob@company.org', 'charlie_at_test.net']
email_re = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')
for e in emails:
    print(e, bool(email_re.fullmatch(e)))
alice@example.com True
not-an-email False
bob@company.org True
charlie_at_test.net False

8.2.9.3. Lookahead & Lookbehind#

Match a pattern only if it is (or isn’t) preceded/followed by another pattern — without including that other pattern in the match.

Syntax     Type                  Meaning
(?=...)    Positive lookahead    Followed by
(?!...)    Negative lookahead    NOT followed by
(?<=...)   Positive lookbehind   Preceded by
(?<!...)   Negative lookbehind   NOT preceded by

# Positive lookahead — prices followed by USD
text = "100USD 200EUR 300USD"
print(re.findall(r"\d+(?=USD)", text))     # ['100', '300']

# Negative lookahead: \d+ can backtrack, so the partial numbers '10' and '30' also match
print(re.findall(r"\d+(?!USD)", text))     # ['10', '200', '30']

# Positive lookbehind — extract amount after $
text2 = "Price: $42.99, discount: $5.00"
print(re.findall(r"(?<=\$)[\d.]+", text2)) # ['42.99', '5.00']
['100', '300']
['10', '200', '30']
['42.99', '5.00']
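
The negative lookahead above returns '10' and '30' because \d+ can give back a digit: after 10 the next character is 0, not USD, so the lookahead is satisfied. One way to keep whole numbers only, sketched here under the assumption that a number should never be split, is to forbid a following digit as well. The second part illustrates a negative lookbehind on a made-up string (text3 is not part of the original example).

```python
import re

text = "100USD 200EUR 300USD"

# Forbid both a following digit and a following 'USD'
not_usd = re.findall(r"\d+(?!\d|USD)", text)
print(not_usd)      # ['200']

# Negative lookbehind: numbers NOT preceded by '$' (or by another digit)
text3 = "$40 off 15 items"
no_dollar = re.findall(r"(?<![$\d])\d+", text3)
print(no_dollar)    # ['15']
```

Inside a character class such as [$\d], the dollar sign is a literal character, so it needs no backslash there.
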
### EXERCISE: Lookahead & Lookbehind
# Difficulty: Intermediate
import re
text = "Alice scored 95pts, Bob scored 80pts, Charlie scored 73pts"
# 1. Use a positive lookahead to extract all numbers followed by 'pts'
# 2. Use a positive lookbehind to extract numbers preceded by 'scored '
### Your code starts here:



### Your code ends here.

# Solution
import re
text = "Alice scored 95pts, Bob scored 80pts, Charlie scored 73pts"
print(re.findall(r'\d+(?=pts)', text))        # positive lookahead
print(re.findall(r'(?<=scored )\d+', text))   # positive lookbehind
['95', '80', '73']
['95', '80', '73']

8.2.10. Applications#

8.2.10.1. Cleaning Text#

Before we can search the text of Dracula, we need to download it from Project Gutenberg and remove the header and footer information.

We’ll save the downloaded text to the data folder, then clean the file and save the cleaned version in the same folder. All subsequent analysis will use these files.

from pathlib import Path
from urllib.request import urlretrieve

data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)

# Download Dracula text to the project data folder
url = 'https://www.gutenberg.org/files/345/345-0.txt'
raw_path = data_dir / 'pg345.txt'
clean_path = data_dir / 'pg345_cleaned.txt'
if not raw_path.exists():
    urlretrieve(url, raw_path)
    print('Downloaded Dracula to', raw_path)
else:
    print('Dracula already downloaded:', raw_path)
Dracula already downloaded: /Users/tychen/workspace/py/data/pg345.txt
# download('https://www.gutenberg.org/cache/epub/345/pg345.txt');
def clean_file(input_file, output_file):
    """Copy the book text from input_file to output_file, dropping the Gutenberg header and footer."""
    with open(input_file, encoding='utf-8') as reader, \
         open(output_file, 'w', encoding='utf-8') as writer:
        # Skip the header: everything up to and including the first special line
        for line in reader:
            if is_special_line(line):
                break

        # Copy the book text until the next special line, which starts the footer
        for line in reader:
            if is_special_line(line):
                break
            writer.write(line)
def is_special_line(line):
    """Return True if the line marks the start or end of the Gutenberg content."""
    return line.startswith('***')
# def is_special_line(line):
#     return line.strip().startswith('*** ')
# Clean the Dracula text and save to data/pg345_cleaned.txt
clean_file(raw_path, clean_path)
print('Cleaned file saved to', clean_path)
Cleaned file saved to /Users/tychen/workspace/py/data/pg345_cleaned.txt

Putting all that together, here’s a function that loops through the lines in the book until it finds one that matches the given pattern, and returns the Match object.

def find_first(pattern, path=clean_path):
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result is not None:
                return result

We can use it to find the first mention of a character.

result = find_first('Harker')
result.string
'CHAPTER I. Jonathan Harker’s Journal\n'

For this example, we didn’t have to use regular expressions – we could have done the same thing more easily with the in operator. But regular expressions can do things the in operator cannot.

For example, if the pattern includes the vertical bar character, '|', it can match either the sequence on the left or the sequence on the right. Suppose we want to find the first mention of Mina Murray in the book, but we are not sure whether she is referred to by first name or last. We can use the following pattern, which matches either name.

pattern = 'Mina|Murray'
result = find_first(pattern)
result.string
'CHAPTER V. Letters—Lucy and Mina\n'

We can use a pattern like this to see how many times a character is mentioned by either name. Here’s a function that loops through the book and counts the number of lines that match the given pattern.

def count_matches(pattern, path=clean_path):
    count = 0
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result is not None:
                count += 1
    return count

Now let’s see how many lines mention Mina by either name.

count_matches('Mina|Murray')
229
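
Note that count_matches counts lines, not occurrences: a line that contains both names still adds only one. The sketch below shows the difference on a few made-up lines (they are not quotations from the book), using re.findall() to count every occurrence.

```python
import re

# Made-up lines for illustration, not text from the book
lines = [
    "Mina Murray wrote to Lucy.",
    "Lucy replied the next day.",
    "Mina read the letter twice; Mina smiled.",
]
pattern = 'Mina|Murray'

# Lines with at least one match (what count_matches does)
line_count = sum(1 for line in lines if re.search(pattern, line))
# Total number of matches across all lines
occurrence_count = sum(len(re.findall(pattern, line)) for line in lines)

print(line_count, occurrence_count)   # 2 4
```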

The special character '^' matches the beginning of a string, so we can find a line that starts with a given pattern.

result = find_first('^Dracula')
result.string
'Dracula, jumping to his feet, said:--\n'

And the special character '$' matches the end of a string, so we can find a line that ends with a given pattern (ignoring the newline at the end).

result = find_first('Harker$')
result.string
'by five o’clock, we must start off; for it won’t do to leave Mrs. Harker\n'
### EXERCISE: Download and Clean Text
# Difficulty: Intermediate
# 1. Use raw_path and clean_path to print whether each file exists
# 2. If clean_path does not exist, run clean_file(raw_path, clean_path)
# 3. Print the size (in bytes) of clean_path
### Your code starts here:



### Your code ends here.

# Solution
print(raw_path.exists(), clean_path.exists())
if not clean_path.exists():
    clean_file(raw_path, clean_path)
print(clean_path.stat().st_size)
True True
855112

8.2.10.2. String substitution#

Bram Stoker was born in Ireland, and when Dracula was published in 1897, he was living in England. So we would expect him to use the British spelling of words like “centre” and “colour”. To check, we can use the following pattern, which matches either “centre” or the American spelling “center”.

pattern = 'cent(er|re)'

In this pattern, the parentheses enclose the part of the pattern the vertical bar applies to. So this pattern matches a sequence that starts with 'cent' and ends with either 'er' or 're'.

result = find_first(pattern)
result.string
'horseshoe of the Carpathians, as if it were the centre of some sort of\n'

As expected, he used the British spelling.

We can also check whether he used the British spelling of “colour”. The following pattern uses the special character '?', which means that the previous character is optional.

pattern = 'colou?r'

This pattern matches either “colour” with the 'u' or “color” without it.

result = find_first(pattern)
line = result.string
line
'undergarment with long double apron, front, and back, of coloured stuff\n'

Again, as expected, he used the British spelling.

Now suppose we want to produce an edition of the book with American spellings. We can use the sub function in the re module, which does string substitution.

re.sub(pattern, 'color', line)
'undergarment with long double apron, front, and back, of colored stuff\n'

The first argument is the pattern we want to find and replace, the second is what we want to replace it with, and the third is the string we want to search. In the result, you can see that “colour” has been replaced with “color”.
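
When converting a longer text it can help to know how many replacements were made. As a sketch, re.subn() works like re.sub() but returns the new string together with the substitution count, and the count argument of re.sub() limits how many matches are replaced. The sample sentence is made up.

```python
import re

sample = "The colour faded, and the coloured glass lost its colour."

# subn returns a tuple: (new string, number of substitutions)
new_text, n = re.subn(r'colou?r', 'color', sample)
print(new_text)   # The color faded, and the colored glass lost its color.
print(n)          # 3

# Replace only the first occurrence
first_only = re.sub(r'colou?r', 'color', sample, count=1)
print(first_only)  # The color faded, and the coloured glass lost its colour.
```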

# I used this function to search for lines to use as examples

def all_matches(pattern, path=clean_path):
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result:
                print(line.strip())
### e.g., 

all_matches('weather')
weather. As I stood, the driver jumped again into his seat and shook the
weatherworn, was still complete; but it was evidently many a day since
it is a buoy with a bell, which swings in bad weather, and sends in a
am awakened by her moving about the room. Fortunately, the weather is so
learn the weather signs. To-day is a grey day, and the sun as I write is
experienced here, with results both strange and unique. The weather had
kept watch on weather signs from the East Cliff, foretold in an emphatic
_22 July_.--Rough weather last three days, and all hands busy with
weather. Passed Gibralter and out through Straits. All well.
and entering on the Bay of Biscay with wild weather ahead, and yet last
weather influences as we know that the Count can bring to bear; and if
that I am fully armed as there may be wolves; the weather is getting
# Here's the pattern I used (which uses some features we haven't seen)

# names = r'(?<!\.\s)[A-Z][a-zA-Z]+'

# all_matches(names)
### EXERCISE: String Substitution
# Difficulty: Intermediate
sample = "The colour of the city centre changed overnight."
# 1. Replace British spellings with American spellings using regex:
#    colour -> color, centre -> center
# 2. Print the transformed sentence
### Your code starts here:



### Your code ends here.

# Solution
import re

sample = "The colour of the city centre changed overnight."
sample = re.sub(r'colou?r', 'color', sample)
sample = re.sub(r'cent(er|re)', 'center', sample)
print(sample)
The color of the city center changed overnight.

8.2.10.3. re.fullmatch() for Validation#

re.fullmatch(pattern, text) succeeds only if the entire string matches the pattern. This is the right tool for validation tasks (IDs, simple emails, phone formats, etc.).

employee_id_pattern = r'EMP-\d{4}'
ids = ['EMP-0001', 'EMP-12', 'AEMP-0001', 'EMP-12345']

for emp_id in ids:
    print(emp_id, bool(re.fullmatch(employee_id_pattern, emp_id)))
EMP-0001 True
EMP-12 False
AEMP-0001 False
EMP-12345 False
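
To see why re.fullmatch() is the safer choice for validation, compare it with re.search() and re.match() on the same IDs. This sketch reuses the pattern and list from above; the results dictionary is an illustrative helper.

```python
import re

employee_id_pattern = r'EMP-\d{4}'
ids = ['EMP-0001', 'EMP-12', 'AEMP-0001', 'EMP-12345']

# For each ID: (search anywhere, match at start, match whole string)
results = {
    emp_id: (
        bool(re.search(employee_id_pattern, emp_id)),
        bool(re.match(employee_id_pattern, emp_id)),
        bool(re.fullmatch(employee_id_pattern, emp_id)),
    )
    for emp_id in ids
}

for emp_id, flags in results.items():
    print(emp_id, *flags)
# EMP-0001 True True True
# EMP-12 False False False
# AEMP-0001 True False False
# EMP-12345 True True False
```

Only fullmatch rejects both the extra leading letter in AEMP-0001 and the extra trailing digit in EMP-12345.
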
### EXERCISE: Full String Validation
# Difficulty: Intermediate
codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']
# A valid course code must be: 2-4 uppercase letters, a dash, then 3 digits.
# 1. Write the regex pattern
# 2. Print each code with True/False using re.fullmatch
### Your code starts here:



### Your code ends here.

# Solution
codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']
pattern = r'[A-Z]{2,4}-\d{3}'
for code_str in codes:
    print(code_str, bool(re.fullmatch(pattern, code_str)))
CS-101 True
MATH-240 True
CS101 False
EE-7 False

8.2.11. Quick Reference#

Characters

Pattern   Meaning                                       Example match
.         Any character (except newline)                c.t → cat, cot
\d        Digit                                         7
\w        Word character (letter, digit, underscore)    A, x, 9, _
\s        Whitespace (space, tab, newline)
[abc]     Character class: any one of a, b, c           a
[^abc]    Negated class: any character except a, b, c   d

Quantifiers

Pattern   Meaning                              Example
*         0 or more                            ab* → a, ab, abb
+         1 or more                            ab+ → ab, abb
?         0 or 1 (optional)                    colou?r → color, colour
{n}       Exactly n                            \d{3} → 123
{n,m}     Between n and m                      \d{2,4} → 12, 123
*? +?     Lazy (match as little as possible)   <.+?>

Anchors

Pattern   Meaning
^         Start of string (or line with re.M)
$         End of string (or line)
\b        Word boundary

Groups

Syntax          Meaning
(...)           Capturing group
(?:...)         Non-capturing group
(?P<name>...)   Named group
(?=...)         Lookahead
(?<=...)        Lookbehind

Flags

Flag            Shorthand   Meaning
re.IGNORECASE   re.I        Case-insensitive matching
re.MULTILINE    re.M        ^/$ match line start/end
re.DOTALL       re.S        . matches newline too
re.VERBOSE      re.X        Allow comments/whitespace in pattern