8.2. Regex#
String methods allow you to search and manipulate text to a certain extent. A regular expression (regex) is a sequence of characters that defines a search pattern, letting you search, match, and manipulate text in ways that go beyond simple string methods. For example:
| Task | Use |
|---|---|
| Check if string starts with ‘http’ | `str.startswith()` |
| Replace all spaces with underscores | `str.replace()` |
| Extract all email addresses from text | regex |
| Validate a phone number format | regex |
| Find words matching a complex pattern | regex |
The rule of thumb: if the pattern is fixed and simple, use string methods. If the pattern is variable or complex, use regex.
For example, to search for a fixed pattern in a text, we can use the find() string method, which returns the index of the first occurrence (or -1 if the pattern is not found).
text = "I am Dracula; and I bid you welcome, Mr. Harker,\
to my house."
pattern = 'Dracula'
text.find(pattern)
5
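The rule of thumb above can be illustrated with a small sketch. The URL and contact text are made-up sample data, and the email pattern is a deliberately simple assumption, not a complete validator.

```python
import re

url = "http://example.com/my page"   # made-up sample data

# Fixed, simple patterns: string methods are enough
starts_with_http = url.startswith("http")
underscored = url.replace(" ", "_")
missing = url.find("ftp")            # -1 means "not found"

# Variable pattern: regex is the better fit
contacts = "Write to alice@example.com or bob@test.org"
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", contacts)

print(starts_with_http, underscored, missing)
print(emails)
```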
8.2.1. Escape Sequences and Raw Strings#
Before using the regular expression (re) functions, we need to understand regex escapes and raw strings.
Regex patterns use backslashes heavily. In regex, the backslash \ introduces escape sequences, which create special patterns or allow metacharacters to be treated as literal characters.
For example, common escape sequences include:
| Pattern | Meaning | Example match |
|---|---|---|
| `\d` | digit | `7` |
| `\w` | word character (letters, digits, underscore) | `a`, `7`, `_` |
| `\s` | whitespace character | space, tab, new line |
| `\.` | literal dot | `.` |
| `\$` | literal dollar sign | `$` |
| `\\` | literal backslash | `\` |
8.2.1.1. Raw Strings#
A raw string, on the other hand, is a string prefixed with r, which tells Python to treat backslashes \ as literal characters rather than escape sequences.
Prefix a pattern with r to prevent Python from interpreting backslashes before the regex engine sees the pattern; in effect, every \ behaves as if it had been written \\.
Raw strings avoid double escaping and make patterns easier to read.
Without raw strings, you often need extra backslashes, like '\\d+'.
regular = "\n" # newline character
raw = r"\n" # literally backslash + n (two characters)
print(regular) # prints a newline
print(raw) # prints \n
\n
Use escapes for literal special regex characters too (for example \., \$, \?).
print('\\\\') # two backslashes, written in a normal string
print(r'\\') # two backslashes, written in a raw string (same result)
print(r'\\\n') # three backslashes + n in a raw string (no newline)
print(r'\\\n' == '\\\\n') # False: the raw string is three backslashes + n; the normal string is two backslashes + n
\\
\\
\\\n
False
A raw string tells Python to treat each backslash \ as a literal character, so the backslash reaches the regex engine unchanged.
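A short sketch of why the r prefix matters for regex patterns; the sample text is made up. In a normal string, Python turns \b into a backspace character before the regex engine ever sees it, whereas the raw string passes \b through as a word boundary.

```python
import re

sample = "cat catfish"   # made-up sample text

# '\\d+' and r'\d+' are the same three-character pattern
same_pattern = ("\\d+" == r"\d+")

# Normal string: "\b" is a backspace character, so nothing matches
no_raw = re.findall("\bcat\b", sample)

# Raw string: \b reaches the regex engine as a word boundary
with_raw = re.findall(r"\bcat\b", sample)

print(same_pattern, no_raw, with_raw)
```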
8.2.2. The re Module#
Python’s built-in re module provides regex support. The 6 most commonly used regex functions are:
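| Function | Description | Sample Syntax | Return |
|---|---|---|---|
| `re.search()` | Find first match anywhere in the string | `re.search(pattern, text)` | Match object or None |
| `re.match()` | Match only at the start of the string | `re.match(pattern, text)` | Match object or None |
| `re.findall()` | Find all matches; return as a list | `re.findall(pattern, text)` | list of strings (or of tuples when the pattern has several groups) |
| `re.sub()` | Find and replace matches | `re.sub(pattern, repl, text)` | new string |
| `re.split()` | Split string on a pattern | `re.split(pattern, text)` | list of strings |
| `re.fullmatch()` | Match the entire string against the pattern | `re.fullmatch(pattern, text)` | Match object or None |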
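The difference between search, match, and fullmatch is easiest to see side by side. This is a minimal sketch with a made-up sample string.

```python
import re

text = "order 66 shipped"   # made-up sample string

found_anywhere = re.search(r"\d+", text)       # digits occur in the middle
at_start = re.match(r"\d+", text)              # the string does not start with digits
whole_ok = re.fullmatch(r"\d+", "66")          # the entire string is digits
whole_bad = re.fullmatch(r"\d+", "66 units")   # extra text, so no full match

print(found_anywhere.group())
print(at_start, whole_ok.group(), whole_bad)
```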
8.2.2.1. The Match object#
The re.search(), re.match(), and re.fullmatch() functions return a Match object when the pattern matches.
For example, re.search(pattern, text) scans through text and returns a Match object for the first location where pattern is found. If the pattern is not found anywhere in the string, it returns None.
A Match object has the following commonly used attributes and methods:
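| Attribute / Method | Description | Example |
|---|---|---|
| `.group()` | Returns the matched substring | `result.group()` |
| `.start()` | Index where the match begins | `result.start()` |
| `.end()` | Index where the match ends (exclusive) | `result.end()` |
| `.span()` | Tuple of (start, end) | `result.span()` |
| `.string` | The original string that was searched | `result.string` |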
import re
text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
pattern = 'Dracula'
result = re.search(pattern, text) ### pattern: Dracula; text: the line
result ### the Match object
<re.Match object; span=(5, 12), match='Dracula'>
If the pattern appears in the text, search returns a Match object that contains the results of the search.
String: Among other information, it has an attribute named string that contains the text that was searched.
Group: It also provides a method called group that returns the part of the text that matched the pattern.
Span and Start/End: It provides a method called span that returns the indices in the text where the match starts and ends; start and end return those two indices individually.
print(result.string)
print(result.group())
print(result.start())
print(result.end())
print(result.span())
I am Dracula; and I bid you welcome, Mr. Harker, to my house.
Dracula
5
12
(5, 12)
Note
.group() returns the matched substring from the text — the portion of the text that the pattern matched against. In simple cases like re.search('Dracula', text), the match equals the pattern string. But with a regex like r'\$[\d.]+', .group() would return something like '$42.99' — the actual text that matched, not the pattern expression itself.
If the pattern doesn’t appear in the text, the return value from search is None. So we can check whether the search was successful by checking whether the result is None.
result = re.search('Count', text)
print(result)
result is None
None
True
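Because a failed search returns None, calling .group() on the result without checking raises an AttributeError. One common guard is sketched below; first_match is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first matched substring, or default if there is no match."""
    m = re.search(pattern, text)
    return m.group() if m is not None else default

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
print(first_match("Dracula", text))
print(first_match("Count", text, "no match"))
```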
s = "This is a test of the regular expression system."
print(re.findall('is', s)) # ['is', 'is']
print(re.findall('is.', s)) # ['is ', 'is '] ### 'is' followed by any character (space in this case)
print(re.findall('is.?', s)) # ['is ', 'is '] ### 'is' followed by zero or one character (space in this case)
print(re.findall('is.?', s, re.IGNORECASE)) # ['is ', 'is '] ### same as above, but case-insensitive
print(re.findall('is.?', s, re.IGNORECASE | re.DOTALL)) # ['is ', 'is '] ### same as above, but also makes '.' match newline characters (not relevant in this case since there are no newlines)
['is', 'is']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']
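In the patterns below, + means one or more occurrences of the preceding element, so \$[\d.]+ matches a dollar sign followed by one or more digits or dots.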
import re
text = "The price is $42.99 and $7.50"
# findall — get all matches
print(re.findall(r"\$[\d.]+", text)) # ['$42.99', '$7.50']
# search — first match object
m = re.search(r"\$[\d.]+", text)
print(m.group()) # '$42.99'
print(m.start(), m.end()) # position in string
# sub — replace
print(re.sub(r"\$[\d.]+", "PRICE", text)) # 'The price is PRICE and PRICE'
# split
print(re.split(r"\s+", "one two three")) # ['one', 'two', 'three']
['$42.99', '$7.50']
$42.99
13 19
The price is PRICE and PRICE
['one', 'two', 'three']
### EXERCISE: Regex Escape Sequences
# Difficulty: Basic
s = "Price: $19.95, code=A_7, spaces here"
# 1. Extract all digit sequences
# 2. Extract all word tokens
# 3. Extract literal '$' and literal '.' matches
### Your code starts here:
### Your code ends here.
['19', '95', '7']
['Price', '19', '95', 'code', 'A_7', 'spaces', 'here']
['$']
['.']
### EXERCISE: The Match Object
# Difficulty: Basic
import re
text = "Customer ID: 4892, Order date: 2024-03-15"
# 1. Use re.search() to find the first 4-digit number in text
# 2. Print the matched string, start index, end index, and span
### Your code starts here:
### Your code ends here.
4892
13
17
(13, 17)
8.2.3. Metacharacters#
Metacharacters are characters that carry special meaning inside a regex pattern — instead of matching themselves literally, they instruct the regex engine to do something specific, like match any character, mark a boundary, or repeat a pattern. There are 14 of them in Python’s re module. You need to escape them if you want them to be regular characters.
| Type | Character | Meaning | Example |
|---|---|---|---|
| Wildcard | `.` | Matches any character except newline | `is.` |
| Anchor | `^` | Start of string | `^\w+` |
| Anchor | `$` | End of string | `\w+$` |
| Quantifier | `*` | 0 or more repetitions | `ab*` |
| Quantifier | `+` | 1 or more repetitions | `ab+` |
| Quantifier | `?` | Optional (0 or 1) / makes quantifier lazy | `colou?r`, `<.+?>` |
| Quantifier | `{}` | Specific repetition range | `\d{2,4}` |
| Character Class Delimiters | `[]` | Defines a set of allowed characters | `[aeiou]` |
| Grouping Delimiters | `()` | Groups patterns and captures matches | `cent(er\|re)` |
| Escape | `\` | Escapes metacharacters or forms special sequences | `\.`, `\d` |
| Alternation | `\|` | Logical OR between patterns | `cat\|dog` |
If you want to match the character literally, you must escape it. Now let us look at the metacharacters in groups.
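When the text to match comes from data rather than from a hand-written pattern, escaping every metacharacter by hand is error-prone. The standard library's re.escape() adds the backslashes for you. The sample strings below are made up.

```python
import re

literal = "$42.99 (USD)"             # contains the metacharacters $ . ( )
text = "Total: $42.99 (USD) today"   # made-up sample text

# Unescaped, $ . ( ) keep their special meanings and the search fails
unescaped = re.search(literal, text)

# re.escape() inserts the backslashes needed to match the text literally
escaped = re.search(re.escape(literal), text)

print(unescaped)
print(escaped.group())
```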
8.2.4. Quantifiers#
Quantifiers tell the regex engine how many times the preceding character, group, or character class should match.
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| `*` | 0 or more | `ab*` | a, ab, abb, abbb |
| `+` | 1 or more | `ab+` | ab, abb, abbb (not a) |
| `?` | 0 or 1 | `ab?` | a or ab only |
| `{n}` | Exactly n | `\d{3}` | 123, 456 |
| `{n,}` | n or more | `\d{2,}` | 12, 123, 1234… |
| `{n,m}` | Between n and m (inclusive) | `\d{2,4}` | 12, 123, 1234 |
By default quantifiers are greedy (match as much as possible). Add ? to make them lazy.
text = "<b>bold</b> and <i>italic</i>"
# Greedy — matches as much as possible
print(re.findall(r"<.+>", text)) # ['<b>bold</b> and <i>italic</i>']
# Lazy — matches as little as possible
print(re.findall(r"<.+?>", text)) # ['<b>', '</b>', '<i>', '</i>']
# Exact and ranged quantifiers
print(re.findall(r"\d{3}", "123 4567 89")) # ['123', '456']
print(re.findall(r"\d{2,4}", "1 12 123 1234")) # ['12', '123', '1234']
['<b>bold</b> and <i>italic</i>']
['<b>', '</b>', '<i>', '</i>']
['123', '456']
['12', '123', '1234']
In <.+>, the + is greedy. It matches as many characters as possible while still allowing the overall pattern to succeed. So it gobbles everything from the first < all the way to the last >.
Adding ? after a quantifier switches it to lazy mode — instead of matching as much as possible, it now matches as little as possible. So <.+?> still needs at least one character (that’s the +), but stops at the earliest > it can find.
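A common alternative to a lazy quantifier, sketched here on the same text, is a negated character class: [^>]+ cannot run past a closing >, so greediness is no longer a problem.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

lazy_tags = re.findall(r"<.+?>", text)      # lazy quantifier
class_tags = re.findall(r"<[^>]+>", text)   # negated character class

print(lazy_tags)
print(class_tags)
```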
8.2.4.1. Greedy vs Non-greedy#
Quantifiers like * and + are greedy by default. Add ? to make them non-greedy.
text_block = """Title: Notes
Email: ALICE@example.com
Email: bob@Example.org"""
# IGNORECASE
emails = re.findall(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}', text_block, flags=re.IGNORECASE)
print(emails)
# Greedy vs non-greedy on tags
html = "<b>bold</b><i>italic</i>"
print(re.findall(r'<.*>', html)) # greedy
print(re.findall(r'<.*?>', html)) # non-greedy
['ALICE@example.com', 'bob@Example.org']
['<b>bold</b><i>italic</i>']
['<b>', '</b>', '<i>', '</i>']
### EXERCISE: Flags and Quantifiers
# Difficulty: Challenge
text_block = """Task: clean logs
ERROR: Disk full
error: retry failed
INFO: done"""
# 1. Extract all lines that start with 'error' (case-insensitive) using MULTILINE
# 2. From '<x>1</x><x>2</x>', extract tags non-greedily
### Your code starts here:
### Your code ends here.
# Solution
text_block = """Task: clean logs
ERROR: Disk full
error: retry failed
INFO: done"""
errs = re.findall(r'^error:.*$', text_block, flags=re.IGNORECASE | re.MULTILINE)
print(errs)
print(re.findall(r'<.*?>', '<x>1</x><x>2</x>'))
['ERROR: Disk full', 'error: retry failed']
['<x>', '</x>', '<x>', '</x>']
8.2.5. Anchors#
Anchors don’t match characters — they match positions in the string.
| Anchor | Meaning |
|---|---|
| `^` | Start of string (or line with `re.MULTILINE`) |
| `$` | End of string (or line with `re.MULTILINE`) |
| `\b` | Word boundary |
| `\B` | Non-word boundary |
# ^ and $
print(re.findall(r"^\w+", "Hello world")) # ['Hello'] — only at start
print(re.findall(r"\w+$", "Hello world")) # ['world'] — only at end
# Word boundary \b
text = "cat catfish concatenate"
print(re.findall(r"\bcat\b", text)) # ['cat'] — whole word only
print(re.findall(r"cat", text)) # ['cat', 'cat', 'cat'] — anywhere
# Multiline
multi = "line1\nline2\nline3"
print(re.findall(r"^\w+", multi, re.MULTILINE)) # ['line1', 'line2', 'line3']
['Hello']
['world']
['cat']
['cat', 'cat', 'cat']
['line1', 'line2', 'line3']
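Two anchor behaviors are not exercised above: \B, and $ in multiline mode. A brief sketch, reusing the same sample strings:

```python
import re

text = "cat catfish concatenate"
# \B requires a non-boundary position, so only the 'cat' inside 'concatenate' matches
inner = re.findall(r"\Bcat", text)

multi = "line1\nline2\nline3"
last_only = re.findall(r"\w+$", multi)                 # $ matches only at the very end
each_line = re.findall(r"\w+$", multi, re.MULTILINE)   # $ also matches before each newline

print(inner)
print(last_only)
print(each_line)
```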
8.2.6. Character Classes#
Before writing larger patterns, it helps to know the core building blocks. Character classes match one character from a defined set. They’re written with square brackets [ ].
| Pattern | Matches |
|---|---|
| `[aeiou]` | any single vowel |
| `[a-z]` | any lowercase letter |
| `[A-Z]` | any uppercase letter |
| `[0-9]` | any digit |
| `[a-zA-Z0-9]` | any alphanumeric character |
| `[^aeiou]` | any character not a vowel (^ negates) |
Shorthand classes (work outside brackets too):
| Pattern | Meaning | Example Match |
|---|---|---|
| `.` | Any character (except newline) | `a`, `7`, `!` |
| `\d` | digit | `7` |
| `\w` | word char (letter/digit/underscore) | `a`, `7`, `_` |
| `\s` | whitespace | space, tab |
Observe the escape sequence '\w'.
import re
s = "This is a regular expression."
print(re.findall(r'\w', s)) ### \w matches any alphanumeric character (letters, digits, and underscore)
print(re.findall(r'\w+', s)) ### + means "one or more occurrences of the preceding pattern"
print(re.findall(r'\w*', s)) ### * means "zero or more occurrences of the preceding pattern"
['T', 'h', 'i', 's', 'i', 's', 'a', 'r', 'e', 'g', 'u', 'l', 'a', 'r', 'e', 'x', 'p', 'r', 'e', 's', 's', 'i', 'o', 'n']
['This', 'is', 'a', 'regular', 'expression']
['This', '', 'is', '', 'a', '', 'regular', '', 'expression', '', '']
\s matches these whitespace characters:
| Character | Name |
|---|---|
| `\n` | newline |
| `\t` | tab |
| `\r` | carriage return |
| `' '` | space |
| `\f` | form feed |
| `\v` | vertical tab |
Use raw strings like r'\d+' for regex patterns so backslashes are interpreted correctly.
import re
text = "Hello World 123! foo_bar"
print(re.findall(r"\d", text)) # individual digits
print(re.findall(r"\d+", text)) # consecutive digits
print(re.findall(r"\w+", text)) # words (incl. underscore)
print(re.findall(r"[A-Z][a-z]+", text)) # capitalized words
print(re.findall(r"[^a-zA-Z\s]+", text)) # non-alpha, non-space
['1', '2', '3']
['123']
['Hello', 'World', '123', 'foo_bar']
['Hello', 'World']
['123!', '_']
sample = "User_42 logged in at 09:30 on 2026-03-11"
print(re.findall(r'\d+', sample)) # all digit runs
print(re.findall(r'[A-Za-z_]+', sample)) # word-like alphabetic tokens
print(re.findall(r'\d{2}:\d{2}', sample)) # HH:MM time
print(re.findall(r'\d{4}-\d{2}-\d{2}', sample)) # YYYY-MM-DD date
['42', '09', '30', '2026', '03', '11']
['User_', 'logged', 'in', 'at', 'on']
['09:30']
['2026-03-11']
### EXERCISE: Regex Syntax Essentials
# Difficulty: Basic
s = "IDs: A12, B7, C999"
# 1. Extract all uppercase letters
# 2. Extract all digit sequences
# 3. Extract letter+digit tokens like A12, B7, C999
### Your code starts here:
### Your code ends here.
['I', 'D', 'A', 'B', 'C']
['12', '7', '999']
['A12', 'B7', 'C999']
8.2.7. Groups & Capturing#
Parentheses () group part of a pattern into a single unit. A capturing group also saves the matched text so you can extract or reuse it afterward. Use non-capturing groups (?:...) when you need grouping for structure but don’t need to extract the text. Named groups (?P<name>...) let you refer to captured text by name instead of number.
| Syntax | Meaning |
|---|---|
| `(...)` | Capturing group |
| `(?:...)` | Non-capturing group |
| `(?P<name>...)` | Named group |
| `a\|b` | Alternation (OR) |
# Capturing groups
dates = "2024-01-15 and 2023-12-31"
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", dates))
# [('2024', '01', '15'), ('2023', '12', '31')]
# Named groups: m is one object from the search, so gives only one match, not all matches
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", dates)
print(m.group('year'), m.group('month'), m.group('day'))
# Alternation
print(re.findall(r"cat|dog", "I have a cat and a dog")) # ['cat', 'dog']
# Using groups in sub()
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", dates))
# '01/15/2024 and 12/31/2023'
[('2024', '01', '15'), ('2023', '12', '31')]
2024 01 15
['cat', 'dog']
01/15/2024 and 12/31/2023
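The choice between capturing and non-capturing groups changes what re.findall() returns: with a capturing group it returns the group's text, not the whole match. A small sketch with made-up input:

```python
import re

text = "ababab ab"   # made-up sample

# Capturing group: findall returns what the group captured (its last repetition)
captured = re.findall(r"(ab)+", text)

# Non-capturing group: findall returns the whole match
whole = re.findall(r"(?:ab)+", text)

print(captured)
print(whole)
```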
8.2.7.1. Groups and Extraction#
Parentheses create capture groups. You can extract parts of a match with .group(1), .group(2), etc.
Named groups can make patterns more readable.
record = "OrderID=4821; Customer=Alice; Total=$39.50"
pattern = r'OrderID=(\d+); Customer=([A-Za-z]+); Total=\$(\d+(?:\.\d{2})?)'
m = re.search(pattern, record)
print(m.group(0)) # full match
print(m.group(1)) # order id
print(m.group(2)) # customer
print(m.group(3)) # total amount
OrderID=4821; Customer=Alice; Total=$39.50
4821
Alice
39.50
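The same record can also be parsed with named groups, which is what the readability remark above refers to. This sketch rewrites the pattern with illustrative group names (order_id, customer, total are assumptions, not names from the original) and collects them with Match.groupdict().

```python
import re

record = "OrderID=4821; Customer=Alice; Total=$39.50"

# Same pattern as before, with each capture group given an illustrative name
named_pattern = (
    r"OrderID=(?P<order_id>\d+); "
    r"Customer=(?P<customer>[A-Za-z]+); "
    r"Total=\$(?P<total>\d+(?:\.\d{2})?)"
)

m = re.search(named_pattern, record)
fields = m.groupdict()   # dict keyed by group name

print(m.group("customer"))
print(fields)
```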
### EXERCISE: Capture Groups
# Difficulty: Intermediate
line = "name=Bob,age=27,dept=Sales"
# 1. Use one regex with 3 capture groups to extract name, age, dept
# 2. Print each extracted value on its own line
### Your code starts here:
### Your code ends here.
Bob
27
Sales
8.2.8. Alternation (OR)#
Use | to match one of multiple patterns.
import re
# pattern using alternation
pattern = r"\.(jpg|png|gif)$"
files = [
    "photo.jpg",
    "diagram.png",
    "animation.gif",
    "document.pdf",
    "archive.zip"
]
for file in files:
    if re.search(pattern, file):
        print(f"{file} -> valid image file")
    else:
        print(f"{file} -> not an image")
photo.jpg -> valid image file
diagram.png -> valid image file
animation.gif -> valid image file
document.pdf -> not an image
archive.zip -> not an image
import re
pattern = r"\b(coffee|tea)\b"
sentences = [
    "I like coffee in the morning.",
    "She prefers tea at night.",
    "He drinks water.",
    "Coffee is my favorite."
]
for sentence in sentences:
    match = re.search(pattern, sentence, re.IGNORECASE)
    if match:
        print(f"Found beverage: {match.group()}")
    else:
        print("No beverage found")
Found beverage: coffee
Found beverage: tea
No beverage found
Found beverage: Coffee
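Alternation has the lowest precedence of the regex operators, so | splits the entire pattern unless parentheses limit its scope. A sketch with made-up input:

```python
import re

text = "gray grey"   # made-up sample

# Without a group, the two alternatives are 'gra' and 'ey'
unscoped = re.findall(r"gra|ey", text)

# A non-capturing group limits the alternation to the single letter
scoped = re.findall(r"gr(?:a|e)y", text)

print(unscoped)
print(scoped)
```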
8.2.9. Advanced Topics#
8.2.9.1. Flags#
Flags change matching behavior:
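| Flag | Shorthand | Meaning |
|---|---|---|
| `re.IGNORECASE` | `re.I` | Case-insensitive matching |
| `re.MULTILINE` | `re.M` | `^` and `$` match at the start and end of each line |
| `re.DOTALL` | `re.S` | `.` also matches newline |
| `re.VERBOSE` | `re.X` | Allow comments/whitespace in pattern |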
# IGNORECASE
print(re.findall(r"hello", "Hello HELLO hello", re.I)) # ['Hello', 'HELLO', 'hello']
# DOTALL — dot matches newline
text = "<div>\nsome content\n</div>"
print(re.findall(r"<div>.*</div>", text, re.DOTALL)) # matches across lines
# VERBOSE — write readable patterns with comments
email_pattern = re.compile(r"""
[\w.+-]+ # username
@ # at sign
[\w-]+ # domain name
\. # dot
[\w.]+ # TLD
""", re.VERBOSE)
print(email_pattern.findall("Contact us at hello@example.com or support@test.org"))
['Hello', 'HELLO', 'hello']
['<div>\nsome content\n</div>']
['hello@example.com', 'support@test.org']
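To see what re.DOTALL changes, the same pattern can be run with and without the flag. A minimal sketch:

```python
import re

text = "<div>\nsome content\n</div>"

without_flag = re.findall(r"<div>.*</div>", text)            # . stops at the newline
with_flag = re.findall(r"<div>.*</div>", text, re.DOTALL)    # . crosses newlines

print(without_flag)
print(with_flag)
```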
### EXERCISE: Regex Flags
# Difficulty: Basic
import re
log = """INFO: Server started
error: disk full
WARNING: low memory
ERROR: connection lost"""
# 1. Use re.findall() with re.IGNORECASE | re.MULTILINE to
# extract every line that begins with 'error'
# 2. Print the list of matches
### Your code starts here:
### Your code ends here.
['error: disk full', 'ERROR: connection lost']
8.2.9.2. Compiled Patterns#
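Use re.compile() when the same pattern is reused many times. The compiled pattern object offers the same methods (search, findall, sub, and so on), keeps the pattern in one named place, and skips the pattern-cache lookup that the module-level functions perform on each call.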
# Compile once, use many times
phone_pattern = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
texts = [
    "Call me at 123-456-7890",
    "My number is 987.654.3210",
    "No phone here",
    "Reach us at 555-123-4567 or 800-999-0000"
]
for t in texts:
    matches = phone_pattern.findall(t)
    if matches:
        print(f"Found: {matches} in '{t}'")
Found: ['123-456-7890'] in 'Call me at 123-456-7890'
Found: ['987.654.3210'] in 'My number is 987.654.3210'
Found: ['555-123-4567', '800-999-0000'] in 'Reach us at 555-123-4567 or 800-999-0000'
### EXERCISE: Compiled Patterns
# Difficulty: Basic
import re
emails = ['alice@example.com', 'not-an-email', 'bob@company.org', 'charlie_at_test.net']
# 1. Compile a regex pattern that matches a simple email address
# 2. Print each email with True or False using the compiled pattern
### Your code starts here:
### Your code ends here.
alice@example.com True
not-an-email False
bob@company.org True
charlie_at_test.net False
8.2.9.3. Lookahead & Lookbehind#
Match a pattern only if it is (or isn’t) preceded/followed by another pattern — without including that other pattern in the match.
| Syntax | Type | Meaning |
|---|---|---|
| `(?=...)` | Positive lookahead | Followed by `...` |
| `(?!...)` | Negative lookahead | NOT followed by `...` |
| `(?<=...)` | Positive lookbehind | Preceded by `...` |
| `(?<!...)` | Negative lookbehind | NOT preceded by `...` |
# Positive lookahead — prices followed by USD
text = "100USD 200EUR 300USD"
print(re.findall(r"\d+(?=USD)", text)) # ['100', '300']
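# Negative lookahead: the match must not be immediately followed by USD.
# Backtracking lets \d+ give up its last digit, so '10' and '30' also match.
print(re.findall(r"\d+(?!USD)", text)) # ['10', '200', '30']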
# Positive lookbehind — extract amount after $
text2 = "Price: $42.99, discount: $5.00"
print(re.findall(r"(?<=\$)[\d.]+", text2)) # ['42.99', '5.00']
['100', '300']
['10', '200', '30']
['42.99', '5.00']
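The negative-lookahead result above contains '10' and '30' because \d+ can backtrack to a shorter run of digits that is followed by a digit rather than by USD. One way to get only whole numbers, sketched here, is to forbid a following digit as well. The second example applies a negative lookbehind to made-up text.

```python
import re

text = "100USD 200EUR 300USD"

# Forbid both a following digit and a following 'USD'
not_usd = re.findall(r"\d+(?!\d|USD)", text)

# Negative lookbehind: whole numbers not preceded by '$'
text2 = "cost $40 for 3 items"   # made-up sample
not_dollar = re.findall(r"(?<!\$)\b\d+", text2)

print(not_usd)
print(not_dollar)
```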
### EXERCISE: Lookahead & Lookbehind
# Difficulty: Intermediate
import re
text = "Alice scored 95pts, Bob scored 80pts, Charlie scored 73pts"
# 1. Use a positive lookahead to extract all numbers followed by 'pts'
# 2. Use a positive lookbehind to extract numbers preceded by 'scored '
### Your code starts here:
### Your code ends here.
['95', '80', '73']
['95', '80', '73']
8.2.10. Applications#
8.2.10.1. Cleaning Text#
Before we can search the text of Dracula, we need to download it from Project Gutenberg and remove the header and footer information.
We’ll download the Dracula text from Project Gutenberg and save it to the data folder. Then we’ll clean the file and save the cleaned version in the same folder. All subsequent analysis will use these files.
from pathlib import Path
from urllib.request import urlretrieve
# project_root is assumed to have been defined earlier as a Path to the project folder
data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)
# Download Dracula text to the project data folder
url = 'https://www.gutenberg.org/files/345/345-0.txt'
raw_path = data_dir / 'pg345.txt'
clean_path = data_dir / 'pg345_cleaned.txt'
if not raw_path.exists():
    urlretrieve(url, raw_path)
    print('Downloaded Dracula to', raw_path)
else:
    print('Dracula already downloaded:', raw_path)
Dracula already downloaded: /Users/tychen/workspace/py/data/pg345.txt
# download('https://www.gutenberg.org/cache/epub/345/pg345.txt');
def clean_file(input_file, output_file):
    """Copy the text between the first two '***' marker lines of input_file to output_file."""
    with open(input_file, encoding='utf-8') as reader, open(output_file, 'w', encoding='utf-8') as writer:
        # Skip the header, up to and including the first marker line
        for line in reader:
            if is_special_line(line):
                break
        # Copy the body until the next marker line, where the footer starts
        for line in reader:
            if is_special_line(line):
                break
            writer.write(line)
def is_special_line(line):
    """Return True if the line marks the start or end of the Gutenberg content."""
    return line.startswith('***')
# def is_special_line(line):
# return line.strip().startswith('*** ')
# Clean the Dracula text and save to data/pg345_cleaned.txt
clean_file(raw_path, clean_path)
print('Cleaned file saved to', clean_path)
Cleaned file saved to /Users/tychen/workspace/py/data/pg345_cleaned.txt
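For a file small enough to read into memory, the same header and footer removal can be sketched as a single regex search. This is an alternative illustration, not the approach used by clean_file, and the sample text is made up to resemble a Project Gutenberg file.

```python
import re

# Made-up sample that mimics the layout of a Project Gutenberg file
sample = (
    "Header text\n"
    "*** START OF THE EBOOK ***\n"
    "DRACULA\n"
    "by Bram Stoker\n"
    "*** END OF THE EBOOK ***\n"
    "Footer text\n"
)

# ^\*\*\*[^\n]*\n  the first line that starts with ***
# (.*?)            the body, as short as possible (DOTALL lets . cross newlines)
# ^\*\*\*          the next line that starts with ***
body_pattern = re.compile(r"^\*\*\*[^\n]*\n(.*?)^\*\*\*", re.MULTILINE | re.DOTALL)

m = body_pattern.search(sample)
body = m.group(1) if m else ""
print(body)
```

The first marker line is matched with [^\n]* rather than .* because, under re.DOTALL, a greedy .* would run across line breaks and swallow the body.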
Putting all that together, here’s a function that loops through the lines in the book until it finds one that matches the given pattern, and returns the Match object.
def find_first(pattern, path=clean_path):
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result is not None:
                return result
We can use it to find the first mention of a character.
result = find_first('Harker')
result.string
'CHAPTER I. Jonathan Harker’s Journal\n'
For this example, we didn’t have to use regular expressions – we could have done the same thing more easily with the in operator.
But regular expressions can do things the in operator cannot.
For example, if the pattern includes the vertical bar character, '|', it can match either the sequence on the left or the sequence on the right.
Suppose we want to find the first mention of Mina Murray in the book, but we are not sure whether she is referred to by first name or last.
We can use the following pattern, which matches either name.
pattern = 'Mina|Murray'
result = find_first(pattern)
result.string
'CHAPTER V. Letters—Lucy and Mina\n'
We can use a pattern like this to see how many times a character is mentioned by either name. Here’s a function that loops through the book and counts the number of lines that match the given pattern.
def count_matches(pattern, path=clean_path):
    count = 0
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result is not None:
                count += 1
    return count
Now let’s see how many lines mention Mina by either name.
count_matches('Mina|Murray')
229
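Note that count_matches counts matching lines, not individual mentions: a line that names the character twice still adds only one. The difference can be sketched with re.findall() on a few made-up lines:

```python
import re

lines = [   # made-up sample lines
    "Mina wrote to Lucy.",
    "Mina Murray, said Mina.",
    "No one else was there.",
]
pattern = "Mina|Murray"

# One per line that contains at least one match
matching_lines = sum(1 for line in lines if re.search(pattern, line))

# Every individual match on every line
total_mentions = sum(len(re.findall(pattern, line)) for line in lines)

print(matching_lines, total_mentions)
```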
The special character '^' matches the beginning of a string, so we can find a line that starts with a given pattern.
result = find_first('^Dracula')
result.string
'Dracula, jumping to his feet, said:--\n'
And the special character '$' matches the end of a string, so we can find a line that ends with a given pattern (ignoring the newline at the end).
result = find_first('Harker$')
result.string
'by five o’clock, we must start off; for it won’t do to leave Mrs. Harker\n'
### EXERCISE: Download and Clean Text
# Difficulty: Intermediate
# 1. Use raw_path and clean_path to print whether each file exists
# 2. If clean_path does not exist, run clean_file(raw_path, clean_path)
# 3. Print the size (in bytes) of clean_path
### Your code starts here:
### Your code ends here.
True True
855112
8.2.10.2. String substitution#
Bram Stoker was born in Ireland, and when Dracula was published in 1897, he was living in England. So we would expect him to use the British spelling of words like “centre” and “colour”. To check, we can use the following pattern, which matches either “centre” or the American spelling “center”.
pattern = 'cent(er|re)'
In this pattern, the parentheses enclose the part of the pattern the vertical bar applies to.
So this pattern matches a sequence that starts with 'cent' and ends with either 'er' or 're'.
result = find_first(pattern)
result.string
'horseshoe of the Carpathians, as if it were the centre of some sort of\n'
As expected, he used the British spelling.
We can also check whether he used the British spelling of “colour”.
The following pattern uses the special character '?', which means that the previous character is optional.
pattern = 'colou?r'
This pattern matches either “colour” with the 'u' or “color” without it.
result = find_first(pattern)
line = result.string
line
'undergarment with long double apron, front, and back, of coloured stuff\n'
Again, as expected, he used the British spelling.
Now suppose we want to produce an edition of the book with American spellings.
We can use the sub function in the re module, which does string substitution.
re.sub(pattern, 'color', line)
'undergarment with long double apron, front, and back, of colored stuff\n'
The first argument is the pattern we want to find and replace, the second is what we want to replace it with, and the third is the string we want to search. In the result, you can see that “colour” has been replaced with “color”.
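re.sub() replaces every match by default. Its optional count argument limits the number of replacements, and re.subn() also reports how many were made. A sketch on a made-up sentence:

```python
import re

line = "The colour of the coloured glass matched the colours outside."   # made-up sentence

all_replaced = re.sub("colou?r", "color", line)            # every match
first_only = re.sub("colou?r", "color", line, count=1)     # only the first match
new_line, n_replaced = re.subn("colou?r", "color", line)   # result plus number of replacements

print(all_replaced)
print(first_only)
print(n_replaced)
```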
# I used this function to search for lines to use as examples
def all_matches(pattern, path=clean_path):
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result:
                print(line.strip())
### e.g.,
all_matches('weather')
weather. As I stood, the driver jumped again into his seat and shook the
weatherworn, was still complete; but it was evidently many a day since
it is a buoy with a bell, which swings in bad weather, and sends in a
am awakened by her moving about the room. Fortunately, the weather is so
learn the weather signs. To-day is a grey day, and the sun as I write is
experienced here, with results both strange and unique. The weather had
kept watch on weather signs from the East Cliff, foretold in an emphatic
_22 July_.--Rough weather last three days, and all hands busy with
weather. Passed Gibralter and out through Straits. All well.
and entering on the Bay of Biscay with wild weather ahead, and yet last
weather influences as we know that the Count can bring to bear; and if
that I am fully armed as there may be wolves; the weather is getting
# Here's the pattern I used (which uses some features we haven't seen)
# names = r'(?<!\.\s)[A-Z][a-zA-Z]+'
# all_matches(names)
### EXERCISE: String Substitution
# Difficulty: Intermediate
sample = "The colour of the city centre changed overnight."
# 1. Replace British spellings with American spellings using regex:
# colour -> color, centre -> center
# 2. Print the transformed sentence
### Your code starts here:
### Your code ends here.
The color of the city center changed overnight.
8.2.10.3. re.fullmatch() for Validation#
re.fullmatch(pattern, text) succeeds only if the entire string matches the pattern.
This is the right tool for validation tasks (IDs, simple emails, phone formats, etc.).
employee_id_pattern = r'EMP-\d{4}'
ids = ['EMP-0001', 'EMP-12', 'AEMP-0001', 'EMP-12345']
for emp_id in ids:
    print(emp_id, bool(re.fullmatch(employee_id_pattern, emp_id)))
EMP-0001 True
EMP-12 False
AEMP-0001 False
EMP-12345 False
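Why not simply use re.search() with ^ and $? As noted earlier, $ also matches just before a trailing newline, so an anchored search can accept input that re.fullmatch() rejects. A minimal sketch:

```python
import re

pattern = r"EMP-\d{4}"

# Unanchored search finds the pattern inside a longer string
unanchored = bool(re.search(pattern, "AEMP-0001"))

# Anchored search: $ tolerates one trailing newline
anchored = bool(re.search(r"^EMP-\d{4}$", "EMP-0001\n"))

# fullmatch: the newline is not part of the pattern, so validation fails
strict = bool(re.fullmatch(pattern, "EMP-0001\n"))

print(unanchored, anchored, strict)
```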
### EXERCISE: Full String Validation
# Difficulty: Intermediate
codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']
# A valid course code must be: 2-4 uppercase letters, a dash, then 3 digits.
# 1. Write the regex pattern
# 2. Print each code with True/False using re.fullmatch
### Your code starts here:
### Your code ends here.
CS-101 True
MATH-240 True
CS101 False
EE-7 False
8.2.11. Quick Reference#
Characters
| Pattern | Meaning | Example match |
|---|---|---|
| `.` | Any character (except newline) | `a`, `7`, `!` |
| `\d` | Digit | `7` |
| `\w` | Word character (letter, digit, underscore) | `a`, `7`, `_` |
| `\s` | Whitespace (space, tab, newline) | space |
| `[abc]` | Character class: any one of a, b, or c | `b` |
| `[^abc]` | Negated class: any character except a, b, or c | `d` |
Quantifiers
| Pattern | Meaning | Example |
|---|---|---|
| `*` | 0 or more | `ab*` |
| `+` | 1 or more | `ab+` |
| `?` | 0 or 1 (optional) | `colou?r` |
| `{n}` | Exactly n | `\d{3}` |
| `{n,m}` | Between n and m | `\d{2,4}` |
| `*?`, `+?` | Lazy (match as little as possible) | `<.+?>` |
Anchors
| Pattern | Meaning |
|---|---|
| `^` | Start of string (or line with `re.MULTILINE`) |
| `$` | End of string (or line with `re.MULTILINE`) |
| `\b` | Word boundary |
Groups
| Syntax | Meaning |
|---|---|
| `(...)` | Capturing group |
| `(?:...)` | Non-capturing group |
| `(?P<name>...)` | Named group |
| `(?=...)` | Lookahead |
| `(?<=...)` | Lookbehind |
Flags
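| Flag | Shorthand | Meaning |
|---|---|---|
| `re.IGNORECASE` | `re.I` | Case-insensitive matching |
| `re.MULTILINE` | `re.M` | `^` and `$` match at the start and end of each line |
| `re.DOTALL` | `re.S` | `.` also matches newline |
| `re.VERBOSE` | `re.X` | Allow comments/whitespace in pattern |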