{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3a7d8b47",
   "metadata": {},
   "source": [
    "# Regex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "38f16749",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent  # ← Add project root, not chapters\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "from shared import thinkpython, diagram, jupyturtle\n",
    "from shared.download import download\n",
    "\n",
    "# Register as top-level modules so direct imports work in subsequent cells\n",
    "sys.modules['thinkpython'] = thinkpython\n",
    "sys.modules['diagram'] = diagram\n",
    "sys.modules['jupyturtle'] = jupyturtle"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8f141fd",
   "metadata": {},
   "source": [
    "String methods allow you to search and manipulate text to a certain extent. A **regular expression** (regex) is a sequence of characters that defines a **search pattern** to search, match, and manipulate text in more powerful ways that go beyond simple string methods. For example:\n",
    "\n",
    "| Task | Use |\n",
    "|---|---|\n",
    "| Check if string starts with 'http' | `str.startswith()` |\n",
    "| Replace all spaces with underscores | `str.replace()` |\n",
    "| Extract all email addresses from text | regex |\n",
    "| Validate a phone number format | regex |\n",
    "| Find words matching a complex pattern | regex |\n",
    "\n",
    "The rule of thumb: if the pattern is **fixed and simple**, use string methods. If the pattern is **variable or complex**, use regex."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ecec41c",
   "metadata": {},
   "source": [
    "For example, to search a pattern in a text, we may use the `find()` string method and an index is returned if the pattern is found."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "0aa5f24b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text = \"I am Dracula; and I bid you welcome, Mr. Harker,\\\n",
    "    to my house.\"\n",
    "pattern = 'Dracula'\n",
    "text.find(pattern)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b992f30",
   "metadata": {},
   "source": [
    "## Escape Sequences and Raw Strings\n",
    "\n",
    "Before using the regular expression (`re`) functions, we need to understand regex escapes and raw strings.\n",
    "\n",
    "Regex patterns use backslashes heavily. In regex, the backslash `\\` introduces escape sequences, which create special patterns or allow metacharacters to be treated as literal characters.\n",
    "\n",
    "For example, common escape sequences include:\n",
    "\n",
    "| Pattern | Meaning | Example match |\n",
    "|---|---|---|\n",
    "| **`\\d`** | **digit** | `7` |\n",
    "| **`\\w`** | **word character** (letters, digits, underscore) | `A`, `x`, `9`, `_`  (`[a-zA-Z0-9_]`) |\n",
    "| **`\\s`** | **whitespace character** | **space**, **tab**, **new line** |\n",
    "| `\\.` | literal dot | `.` |\n",
    "| `\\$` | literal dollar sign | `$` |\n",
    "| `\\\\` | literal backslash | `\\` |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6500b3b8",
   "metadata": {},
   "source": [
    "### Raw Strings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6826c201",
   "metadata": {},
   "source": [
    "A raw string, on the other hand, is a string prefixed with **`r`**, which tells Python to treat backslashes `\\` as literal characters rather than escape sequences. \n",
    "\n",
    "- Prefix with `r` to prevent Python from interpreting backslashes before the regex engine sees the pattern; works like `\\`. \n",
    "- Raw strings avoid double escaping and make patterns easier to read.\n",
    "- Without raw strings, you often need extra backslashes, like `'\\\\d+'`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c37dbeaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\\n\n"
     ]
    }
   ],
   "source": [
    "regular = \"\\n\"   # newline character\n",
    "raw      = r\"\\n\" # literally backslash + n (two characters)\n",
    "\n",
    "print(regular)  # prints a newline\n",
    "print(raw)      # prints \\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "018d11be",
   "metadata": {},
   "source": [
    "\n",
    "Use escapes for literal special regex characters too (for example `\\.`, `\\$`, `\\?`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1e887253",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\\\\\n",
      "\\\\\n",
      "\\\\\\n\n",
      "False\n"
     ]
    }
   ],
   "source": [
    "print('\\\\\\\\')           # double backslash in a normal string\n",
    "print(r'\\\\')            # double backslash in a raw string (same result)\n",
    "print(r'\\\\\\n')          # backslash + n in a raw string\n",
    "print(r'\\\\\\n' == '\\\\\\\\n')   # False: raw string is backslash + n, normal string is backslash + backslash + n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90a7cb99",
   "metadata": {},
   "source": [
    "Raw string tells Python we are treating this backslash `\\` as a special character. "
   ]
  },
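  {
   "cell_type": "markdown",
   "id": "e2b7c410",
   "metadata": {},
   "source": [
    "As a minimal sketch of why raw strings matter for patterns, the snippet below compares a doubled-backslash pattern with its raw-string form, then shows what happens when the `r` prefix is left off `\\b` (the word-boundary anchor covered under Anchors). It uses `re.findall`, which is introduced in the next section; the sample strings and variable names are made up for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# A normal string needs a doubled backslash to hold a backslash followed by d.\n",
    "escaped = '\\\\d+'\n",
    "raw = r'\\d+'\n",
    "\n",
    "same_pattern = escaped == raw    # both hold the three characters: backslash, d, +\n",
    "digits = re.findall(raw, 'Room 101, floor 3')\n",
    "\n",
    "# Pitfall: in a normal string '\\b' is the backspace character, not a word boundary.\n",
    "without_raw = re.findall('\\bcat\\b', 'cat catfish')\n",
    "with_raw = re.findall(r'\\bcat\\b', 'cat catfish')\n",
    "\n",
    "print(same_pattern)   # True\n",
    "print(digits)         # ['101', '3']\n",
    "print(without_raw)    # []\n",
    "print(with_raw)       # ['cat']\n",
    "```\n",
    "\n",
    "Writing every pattern as a raw string avoids having to remember which backslash sequences Python itself interprets."
   ]
  },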
  {
   "cell_type": "markdown",
   "id": "3dab4e57",
   "metadata": {},
   "source": [
    "## The `re` Module\n",
    "\n",
    "Python's built-in `re` module provides regex support. The 6 most commonly used regex functions are:\n",
    "\n",
    "| Function | Description | Sample Syntax | Return |\n",
    "|---|---|---|---|\n",
    "| `re.search()` | Find **first** match anywhere in the string | `re.search(pattern, text)` | **`Match` object** or `None` |\n",
    "| `re.match()` | Match only at the **start** of the string | `re.match(pattern, text)` | **`Match` object** or `None` |\n",
    "| `re.findall()` | Find **all** matches; return as a **list** | `re.findall(pattern, text)` | `list` of strings |\n",
    "| `re.sub()` | Find and replace matches | `re.sub(pattern, repl, text)` | `str` |\n",
    "| `re.split()` | Split string on a pattern | `re.split(pattern, text)` | `list` of strings |\n",
    "| `re.fulmatch()` | Match the entire string against the pattern | re.fullmatch(pattern, text) | **`Match` object** or `None` |"
   ]
  },
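  {
   "cell_type": "markdown",
   "id": "9d3f6a52",
   "metadata": {},
   "source": [
    "The difference between `re.search()`, `re.match()`, and `re.fullmatch()` is easiest to see by running one pattern through all three. The following is a small sketch under that assumption; the sample text and variable names are hypothetical.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = 'Order 4892 shipped'\n",
    "pattern = r'\\d+'\n",
    "\n",
    "found_anywhere = re.search(pattern, text)     # scans the whole string\n",
    "found_at_start = re.match(pattern, text)      # only tries the start of the string\n",
    "whole_string = re.fullmatch(pattern, text)    # the pattern must cover all of text\n",
    "only_digits = re.fullmatch(pattern, '4892')   # succeeds: the string is all digits\n",
    "\n",
    "print(found_anywhere.group())   # 4892\n",
    "print(found_at_start)           # None\n",
    "print(whole_string)             # None\n",
    "print(only_digits.group())      # 4892\n",
    "```\n",
    "\n",
    "`re.fullmatch()` is usually the natural choice for validation, while `re.search()` suits extraction from longer text."
   ]
  },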
  {
   "cell_type": "markdown",
   "id": "d1950d45",
   "metadata": {},
   "source": [
    "### The `Match` object\n",
    "\n",
    "`re.search()`, `re.match()`, and `re.fullmatch()` functions return a `Match object` when pattern is matched. \n",
    "\n",
    "For example, \n",
    "- returns: `re.search(pattern, text)` scans through `text` and returns \n",
    "  - a `Match` object for the **first** location where `pattern` is found. \n",
    "  - If the pattern is not found anywhere in the string, it returns `None`. \n",
    "\n",
    "- `Match`: A `Match` object has the following commonly used attributes and methods: \n",
    "\n",
    "| Attribute / Method | Description | Example |\n",
    "|---|---|---|\n",
    "| `.group()` | Returns the matched substring | `m.group()` → `'Dracula'` |\n",
    "| `.start()` | Index where the match begins | `m.start()` → `5` |\n",
    "| `.end()` | Index where the match ends | `m.end()` → `12` |\n",
    "| `.span()` | Tuple of `(start, end)` | `m.span()` → `(5, 12)` |\n",
    "| `.string` | The original string that was searched | `m.string` → `'I am Dracula...'` |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "62713517",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(5, 12), match='Dracula'>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = \"I am Dracula; and I bid you welcome, Mr. Harker, to my house.\"\n",
    "pattern = 'Dracula'\n",
    "\n",
    "result = re.search(pattern, text)     ### pattern: Dracula; text: the line\n",
    "result                              ### the Match object"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecf6690b",
   "metadata": {},
   "source": [
    "If the pattern appears in the text, `search` returns a `Match` object that contains the results of the search. \n",
    "\n",
    "1. String: Among other information, it has a variable named `string` that contains the text that was searched.\n",
    "2. Group: It also provides a method called `group` that returns the part of the text that **matched** the pattern.\n",
    "3. Span and Start/End: And it provides a method called `span` that returns the index in the text where the pattern starts and ends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "6fdd12a9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I am Dracula; and I bid you welcome, Mr. Harker, to my house.\n",
      "Dracula\n",
      "5\n",
      "12\n",
      "(5, 12)\n"
     ]
    }
   ],
   "source": [
    "print(result.string)\n",
    "print(result.group())\n",
    "print(result.start())\n",
    "print(result.end())\n",
    "print(result.span())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d38ad650",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "`.group()` returns the **matched substring from the text** — the portion of the text that the pattern matched against. In simple cases like `re.search('Dracula', text)`, the match equals the pattern string. But with a regex like `r'\\$[\\d.]+'`, `.group()` would return something like `'$42.99'` — the actual text that matched, not the pattern expression itself.\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fb89ac5",
   "metadata": {},
   "source": [
    "If the pattern doesn't appear in the text, the return value from `search` is `None`. So we can check whether the search was successful by checking whether the result is `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "613a304d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = re.search('Count', text)\n",
    "print(result)\n",
    "\n",
    "result is None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3a3b88a7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['is', 'is']\n",
      "['is ', 'is ']\n",
      "['is ', 'is ']\n",
      "['is ', 'is ']\n",
      "['is ', 'is ']\n"
     ]
    }
   ],
   "source": [
    "s = \"This is a test of the regular expression system.\"\n",
    "print(re.findall('is', s))  # ['is', 'is']\n",
    "print(re.findall('is.', s)) # ['is ', 'is ']    ### 'is' followed by any character (space in this case)\n",
    "print(re.findall('is.?', s)) # ['is ', 'is ']   ### 'is' followed by zero or one character (space in this case)\n",
    "print(re.findall('is.?', s, re.IGNORECASE)) # ['is ', 'is '] ### same as above, but case-insensitive   \n",
    "print(re.findall('is.?', s, re.IGNORECASE | re.DOTALL)) # ['is ', 'is ']    ### same as above, but also makes '.' match newline characters (not relevant in this case since there are no newlines)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffef1c34",
   "metadata": {},
   "source": [
    "The `+` in the pattern means one or more occurrence. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "daa3fba6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['$42.99', '$7.50']\n",
      "$42.99\n",
      "13 19\n",
      "The price is PRICE and PRICE\n",
      "['one', 'two', 'three']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = \"The price is $42.99 and $7.50\"\n",
    "\n",
    "# findall — get all matches\n",
    "print(re.findall(r\"\\$[\\d.]+\", text))         # ['$42.99', '$7.50']\n",
    "\n",
    "# search — first match object\n",
    "m = re.search(r\"\\$[\\d.]+\", text)\n",
    "print(m.group())                              # '$42.99'\n",
    "print(m.start(), m.end())                     # position in string\n",
    "\n",
    "# sub — replace\n",
    "print(re.sub(r\"\\$[\\d.]+\", \"PRICE\", text))    # 'The price is PRICE and PRICE'\n",
    "\n",
    "# split\n",
    "print(re.split(r\"\\s+\", \"one  two   three\"))  # ['one', 'two', 'three']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "96205219",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Regex Escape Sequences\n",
    "# Difficulty: Basic\n",
    "s = \"Price: $19.95, code=A_7, spaces here\"\n",
    "# 1. Extract all digit sequences\n",
    "# 2. Extract all word tokens\n",
    "# 3. Extract literal '$' and literal '.' matches\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "8b965a41",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['19', '95', '7']\n",
      "['Price', '19', '95', 'code', 'A_7', 'spaces', 'here']\n",
      "['$']\n",
      "['.']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# Solution\n",
    "s = \"Price: $19.95, code=A_7, spaces here\"\n",
    "print(re.findall(r'\\d+', s))\n",
    "print(re.findall(r'\\w+', s))\n",
    "print(re.findall(r'\\$', s))\n",
    "print(re.findall(r'\\.', s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c5d4f62",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: The Match Object\n",
    "# Difficulty: Basic\n",
    "import re\n",
    "text = \"Customer ID: 4892, Order date: 2024-03-15\"\n",
    "# 1. Use re.search() to find the first 4-digit number in text\n",
    "# 2. Print the matched string, start index, end index, and span\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6408257c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "import re\n",
    "text = \"Customer ID: 4892, Order date: 2024-03-15\"\n",
    "m = re.search(r'\\d{4}', text)\n",
    "print(m.group())\n",
    "print(m.start())\n",
    "print(m.end())\n",
    "print(m.span())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a245d84a",
   "metadata": {},
   "source": [
    "## Metacharacters\n",
    "\n",
    "Metacharacters are characters that carry special meaning inside a regex pattern — instead of matching themselves literally, they instruct the regex engine to do something specific, like match any character, mark a boundary, or repeat a pattern. There are 14 of them in Python's re module. You need to escape them if you want them to be regular characters.\n",
    "\n",
    "| Type            | Character | Meaning      | Example                     | \n",
    "| -------------------------- | --------- | ------------------------------------------------- | --------------------------- | \n",
    "| Wildcard                   | `.`       | Matches any character except newline              | `c.t` → cat, cot            |      \n",
    "| Anchor                     | `^`       | Start of string                                   | `^Hello`                    |      \n",
    "| Anchor                     | `$`       | End of string                                     | `end$`            |   \n",
    "| Quantifier                 | `*`       | 0 or more repetitions                             | `a*`                        |      \n",
    "| Quantifier                 | `+`       | 1 or more repetitions   | `a+`                        |      \n",
    "| Quantifier                 | `?`       | Optional (0 or 1) / makes quantifier lazy   | `colou?r`                   |       \n",
    "| Quantifier                 | `{}`      | Specific repetition range  | `\\d{3}`  |    |    |\n",
    "| Character Class Delimiters | `[]`      | Defines a set of allowed characters   | `[a-z]`                     |      \n",
    "| Grouping Delimiters        | `()`  | Groups patterns and captures matches  | `(cat\\| dog)` |  |\n",
    "| Escape           | `\\`  | Escapes metacharacters or forms special sequences | `\\d`, `\\w`, `\\.`            |       \n",
    "| Alternation                | `\\|`        | Logical OR between patterns | `cat  \\| dog` |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d07fdcb",
   "metadata": {},
   "source": [
    "If you want to match the character literally, you must escape it. Now let us look at the metacharacters in groups."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ce72136",
   "metadata": {},
   "source": [
    "## Quantifiers\n",
    "\n",
    "Quantifiers tell the regex engine how many times the preceding character, group, or character class should match.\n",
    "\n",
    "| Quantifier | Meaning       | Example  | Matches                        |\n",
    "|------------|---------------|----------|--------------------------------|\n",
    "| *          | **0 or more**     | ab*      | a, ab, abb, abbb           |\n",
    "| +          | **1 or more**    | ab+      | ab, abb, abbb  (not a)      |\n",
    "| ?          | **0 or 1**        | ab?      | a  or  ab  only            |\n",
    "| {n}        | Exactly n     | \\d{3}    | 123, 456                       |\n",
    "| {n,}       | n or more     | \\d{2,}   | 12, 123, 1234...               |\n",
    "| {n,m}      | Between n & m | \\d{2,4}  | 12, 123, 1234                  |\n",
    "\n",
    "By default quantifiers are **greedy** (match as much as possible). Add `?` to make them **lazy**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90c996be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['<b>bold</b> and <i>italic</i>']\n",
      "['<b>', '</b>', '<i>', '</i>']\n",
      "['123', '456']\n",
      "['12', '123', '1234']\n"
     ]
    }
   ],
   "source": [
    "text = \"<b>bold</b> and <i>italic</i>\"\n",
    "\n",
    "# Greedy — matches as much as possible\n",
    "print(re.findall(r\"<.+>\", text))     # ['<b>bold</b> and <i>italic</i>']\n",
    "\n",
    "# Lazy — matches as little as possible\n",
    "print(re.findall(r\"<.+?>\", text))    # ['<b>', '</b>', '<i>', '</i>']\n",
    "\n",
    "# Exact and ranged quantifiers\n",
    "print(re.findall(r\"\\d{3}\", \"123 4567 89\"))     # ['123', '456']\n",
    "print(re.findall(r\"\\d{2,4}\", \"1 12 123 1234\")) # ['12', '123', '1234']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b76fbc8",
   "metadata": {},
   "source": [
    "In `<.+>`, the `+` is **greedy**. It matches as many characters as possible while still allowing the overall pattern to succeed. So it gobbles everything from the first `<` all the way to the last `>`.\n",
    "\n",
    "Adding `?` after a quantifier switches it to **lazy** mode — instead of matching as much as possible, it now matches **as little as possible**. So `<.+?>` still needs at least one character (that's the `+`), but stops at the earliest `>` it can find."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de7246f4",
   "metadata": {},
   "source": [
    "### Greedy vs Non-greedy\n",
    "\n",
    "Quantifiers like `*` and `+` are greedy by default. Add `?` to make them non-greedy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "453800d5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ALICE@example.com', 'bob@Example.org']\n",
      "['<b>bold</b><i>italic</i>']\n",
      "['<b>', '</b>', '<i>', '</i>']\n"
     ]
    }
   ],
   "source": [
    "text_block = \"\"\"Title: Notes\n",
    "Email: ALICE@example.com\n",
    "Email: bob@Example.org\"\"\"\n",
    "\n",
    "# IGNORECASE\n",
    "emails = re.findall(r'[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}', text_block, flags=re.IGNORECASE)\n",
    "print(emails)\n",
    "\n",
    "# Greedy vs non-greedy on tags\n",
    "html = \"<b>bold</b><i>italic</i>\"\n",
    "print(re.findall(r'<.*>', html))     # greedy\n",
    "print(re.findall(r'<.*?>', html))    # non-greedy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33dbf4fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "### EXERCISE: Flags and Quantifiers\n",
    "# Difficulty: Challenge\n",
    "text_block = \"\"\"Task: clean logs\n",
    "ERROR: Disk full\n",
    "error: retry failed\n",
    "INFO: done\"\"\"\n",
    "# 1. Extract all lines that start with 'error' (case-insensitive) using MULTILINE\n",
    "# 2. From '<x>1</x><x>2</x>', extract tags non-greedily\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93c34729",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ERROR: Disk full', 'error: retry failed']\n",
      "['<x>', '</x>', '<x>', '</x>']\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "text_block = \"\"\"Task: clean logs\n",
    "ERROR: Disk full\n",
    "error: retry failed\n",
    "INFO: done\"\"\"\n",
    "\n",
    "errs = re.findall(r'^error:.*$', text_block, flags=re.IGNORECASE | re.MULTILINE)\n",
    "print(errs)\n",
    "\n",
    "print(re.findall(r'<.*?>', '<x>1</x><x>2</x>'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e80681c",
   "metadata": {},
   "source": [
    "## Anchors\n",
    "\n",
    "Anchors don't match characters — they match positions in the string.\n",
    "\n",
    "| Anchor | Meaning |\n",
    "|---|---|\n",
    "| `^` | Start of string (or line with `re.MULTILINE`) |\n",
    "| `$` | End of string (or line) |\n",
    "| `\\b` | Word boundary |\n",
    "| `\\B` | Non-word boundary |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "2b48b6ae",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Hello']\n",
      "['world']\n",
      "['cat']\n",
      "['cat', 'cat', 'cat']\n",
      "['line1', 'line2', 'line3']\n"
     ]
    }
   ],
   "source": [
    "# ^ and $\n",
    "print(re.findall(r\"^\\w+\", \"Hello world\"))     # ['Hello'] — only at start\n",
    "print(re.findall(r\"\\w+$\", \"Hello world\"))     # ['world'] — only at end\n",
    "\n",
    "# Word boundary \\b\n",
    "text = \"cat catfish concatenate\"\n",
    "print(re.findall(r\"\\bcat\\b\", text))           # ['cat'] — whole word only\n",
    "print(re.findall(r\"cat\", text))               # ['cat', 'cat', 'cat'] — anywhere\n",
    "\n",
    "# Multiline\n",
    "multi = \"line1\\nline2\\nline3\"\n",
    "print(re.findall(r\"^\\w+\", multi, re.MULTILINE))  # ['line1', 'line2', 'line3']"
   ]
  },
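  {
   "cell_type": "markdown",
   "id": "4c8e1b7f",
   "metadata": {},
   "source": [
    "The table above also lists `\\B`, the opposite of `\\b`. As a small illustrative sketch (the variable names are hypothetical), the snippet below reuses the same sample text to contrast the two: `\\b` requires a word boundary at that position, while `\\B` requires that there is none.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = 'cat catfish concatenate'\n",
    "\n",
    "# \\b before 'cat': the match must begin at the start of a word.\n",
    "starts_with_cat = re.findall(r'\\bcat\\w*', text)\n",
    "\n",
    "# \\B before 'cat': the match must begin inside a word.\n",
    "inside_word = re.findall(r'\\Bcat\\w*', text)\n",
    "\n",
    "print(starts_with_cat)   # ['cat', 'catfish']\n",
    "print(inside_word)       # ['catenate']\n",
    "```"
   ]
  },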
  {
   "cell_type": "markdown",
   "id": "dd02837e",
   "metadata": {},
   "source": [
    "## Character Classes\n",
    "\n",
    "Before writing larger patterns, it helps to know the core building blocks. **Character classes** match one character from a defined set. They're written with square brackets `[ ]`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60663719",
   "metadata": {},
   "source": [
    "| Pattern |\tMatches |\n",
    "| --- | --- | \n",
    "| [aeiou]\t| any single vowel |\n",
    "| [a-z]\t| any lowercase letter |\n",
    "| [A-Z]\t| any uppercase letter |\n",
    "| [0-9]\t| any digit |\n",
    "| [a-zA-Z0-9] |\tany alphanumeric character |\n",
    "| [^aeiou]\t| any character not a vowel (^ negates) |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "230aa9bf",
   "metadata": {},
   "source": [
    "Shorthand classes (work outside brackets too):"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d70de733",
   "metadata": {},
   "source": [
    "| Pattern | Meaning | Example Match |\n",
    "|---|---|---|\n",
    "| `.` | Any character (except newline) \n",
    "| `\\d` | digit | `7` |\n",
    "| `\\w` | word char (letter/digit/underscore) | `A`, `x`, `9`, `_` |\n",
    "| `\\s` | whitespace | space, tab |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44b5903b",
   "metadata": {},
   "source": [
    "Observe the escape sequence `'\\w'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbe068a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['T', 'h', 'i', 's', 'i', 's', 'a', 'r', 'e', 'g', 'u', 'l', 'a', 'r', 'e', 'x', 'p', 'r', 'e', 's', 's', 'i', 'o', 'n']\n",
      "['This', 'is', 'a', 'regular', 'expression']\n",
      "['This', '', 'is', '', 'a', '', 'regular', '', 'expression', '', '']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "s = \"This is a regular expression.\"\n",
    "print(re.findall(r'\\w', s))     ### \\w matches any alphanumeric character (letters, digits, and underscore)\n",
    "print(re.findall(r'\\w+', s))    ### + means \"one or more occurrences of the preceding pattern\"\n",
    "print(re.findall(r'\\w*', s))    ### * means \"zero or more occurrences of the preceding pattern\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a7268cf",
   "metadata": {},
   "source": [
    "`\\\\s` matches these whitespace characters:\n",
    "\n",
    "| Character | Name |\n",
    "|---|---|\n",
    "| `\\\\n` | newline |\n",
    "| `\\\\t` | tab |\n",
    "| `\\\\r` | carriage return |\n",
    "| ` ` | space |\n",
    "| `\\\\f` | form feed |\n",
    "| `\\\\v` | vertical tab |\n",
    "\n",
    "Use raw strings like `r'\\d+'` for regex patterns so backslashes are interpreted correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6dfe973",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1', '2', '3']\n",
      "['123']\n",
      "['Hello', 'World', '123', 'foo_bar']\n",
      "['Hello', 'World']\n",
      "['123!', '_']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = \"Hello World 123! foo_bar\"\n",
    "\n",
    "print(re.findall(r\"\\d\", text))        # individual digits\n",
    "print(re.findall(r\"\\d+\", text))       # consecutive digits\n",
    "print(re.findall(r\"\\w+\", text))       # words (incl. underscore)\n",
    "print(re.findall(r\"[A-Z][a-z]+\", text))  # capitalized words\n",
    "print(re.findall(r\"[^a-zA-Z\\s]+\", text)) # non-alpha, non-space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16a6a5c6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['42', '09', '30', '2026', '03', '11']\n",
      "['User_', 'logged', 'in', 'at', 'on']\n",
      "['09:30']\n",
      "['2026-03-11']\n"
     ]
    }
   ],
   "source": [
    "sample = \"User_42 logged in at 09:30 on 2026-03-11\"\n",
    "\n",
    "print(re.findall(r'\\d+', sample))                  # all digit runs\n",
    "print(re.findall(r'[A-Za-z_]+', sample))            # word-like alphabetic tokens\n",
    "print(re.findall(r'\\d{2}:\\d{2}', sample))        # HH:MM time\n",
    "print(re.findall(r'\\d{4}-\\d{2}-\\d{2}', sample))  # YYYY-MM-DD date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf8a0af3",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Regex Syntax Essentials\n",
    "# Difficulty: Basic\n",
    "s = \"IDs: A12, B7, C999\"\n",
    "# 1. Extract all uppercase letters\n",
    "# 2. Extract all digit sequences\n",
    "# 3. Extract letter+digit tokens like A12, B7, C999\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30189825",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['I', 'D', 'A', 'B', 'C']\n",
      "['12', '7', '999']\n",
      "['A12', 'B7', 'C999']\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "s = \"IDs: A12, B7, C999\"\n",
    "print(re.findall(r'[A-Z]', s))\n",
    "print(re.findall(r'\\d+', s))\n",
    "print(re.findall(r'[A-Z]\\d+', s))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "161e1a6b",
   "metadata": {},
   "source": [
    "## Groups & Capturing\n",
    "\n",
    "Parentheses `()` group part of a pattern into a single unit. A **capturing group** also saves the matched text so you can extract or reuse it afterward. Use **non-capturing groups** `(?:...)` when you need grouping for structure but don't need to extract the text. **Named groups** `(?P<name>...)` let you refer to captured text by name instead of number.\n",
    "\n",
    "| Syntax | Meaning |\n",
    "|---|---|\n",
    "| `(...)` | Capturing group |\n",
    "| `(?:...)` | Non-capturing group |\n",
    "| `(?P<name>...)` | Named group |\n",
    "| `\\|` | Alternation (OR) |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "7f845756",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('2024', '01', '15'), ('2023', '12', '31')]\n",
      "2024 01 15\n",
      "['cat', 'dog']\n",
      "01/15/2024 and 12/31/2023\n"
     ]
    }
   ],
   "source": [
    "# Capturing groups\n",
    "dates = \"2024-01-15 and 2023-12-31\"\n",
    "print(re.findall(r\"(\\d{4})-(\\d{2})-(\\d{2})\", dates))\n",
    "# [('2024', '01', '15'), ('2023', '12', '31')]\n",
    "\n",
    "# Named groups: m is one object from the search, so gives only one match, not all matches\n",
    "m = re.search(r\"(?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})\", dates)\n",
    "print(m.group('year'), m.group('month'), m.group('day'))\n",
    "\n",
    "# Alternation\n",
    "print(re.findall(r\"cat|dog\", \"I have a cat and a dog\"))  # ['cat', 'dog']\n",
    "\n",
    "# Using groups in sub()\n",
    "print(re.sub(r\"(\\d{4})-(\\d{2})-(\\d{2})\", r\"\\2/\\3/\\1\", dates))\n",
    "# '01/15/2024 and 12/31/2023'"
   ]
  },
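  {
   "cell_type": "markdown",
   "id": "a15d92e6",
   "metadata": {},
   "source": [
    "The cell above does not show a non-capturing group in action, so here is a minimal sketch of how `(?:...)` changes what `re.findall()` returns. The file names and variable names are invented for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = 'img01.png, img02.jpg, notes.txt'\n",
    "\n",
    "# Capturing group: findall returns only the text captured by the group.\n",
    "captured = re.findall(r'\\w+\\.(png|jpg)', text)\n",
    "\n",
    "# Non-capturing group: same grouping, but findall returns the whole match.\n",
    "whole = re.findall(r'\\w+\\.(?:png|jpg)', text)\n",
    "\n",
    "print(captured)   # ['png', 'jpg']\n",
    "print(whole)      # ['img01.png', 'img02.jpg']\n",
    "```\n",
    "\n",
    "A non-capturing group is a reasonable default whenever parentheses are needed only to scope an alternation or a quantifier."
   ]
  },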
  {
   "cell_type": "markdown",
   "id": "b006ebdc",
   "metadata": {},
   "source": [
    "### Groups and Extraction\n",
    "\n",
    "Parentheses create capture groups. You can extract parts of a match with `.group(1)`, `.group(2)`, etc.\n",
    "Named groups can make patterns more readable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "6e827051",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "OrderID=4821; Customer=Alice; Total=$39.50\n",
      "4821\n",
      "Alice\n",
      "39.50\n"
     ]
    }
   ],
   "source": [
    "record = \"OrderID=4821; Customer=Alice; Total=$39.50\"\n",
    "pattern = r'OrderID=(\\d+); Customer=([A-Za-z]+); Total=\\$(\\d+(?:\\.\\d{2})?)'\n",
    "m = re.search(pattern, record)\n",
    "\n",
    "print(m.group(0))  # full match\n",
    "print(m.group(1))  # order id\n",
    "print(m.group(2))  # customer\n",
    "print(m.group(3))  # total amount"
   ]
  },
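  {
   "cell_type": "markdown",
   "id": "4d7a92be",
   "metadata": {},
   "source": [
    "The same record can also be parsed with named groups, which lets `groupdict()` return the fields as a dictionary. This is only a sketch: the group names `order_id`, `customer`, and `total` are labels chosen here for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "record = 'OrderID=4821; Customer=Alice; Total=$39.50'\n",
    "\n",
    "# Group names are hypothetical labels chosen for this sketch\n",
    "pattern = (\n",
    "    r'OrderID=(?P<order_id>\\d+); '\n",
    "    r'Customer=(?P<customer>[A-Za-z]+); '\n",
    "    r'Total=\\$(?P<total>\\d+(?:\\.\\d{2})?)'\n",
    ")\n",
    "\n",
    "m = re.search(pattern, record)\n",
    "fields = m.groupdict()  # maps each group name to its matched text\n",
    "print(fields)\n",
    "# {'order_id': '4821', 'customer': 'Alice', 'total': '39.50'}\n",
    "print(m.group('customer'))  # Alice\n",
    "```\n",
    "\n",
    "Names tend to pay off once a pattern has more than two or three groups, because adding a group no longer shifts the numbers of the ones after it."
   ]
  },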
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "13f21af9",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Capture Groups\n",
    "# Difficulty: Intermediate\n",
    "line = \"name=Bob,age=27,dept=Sales\"\n",
    "# 1. Use one regex with 3 capture groups to extract name, age, dept\n",
    "# 2. Print each extracted value on its own line\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "2224328f",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bob\n",
      "27\n",
      "Sales\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "line = \"name=Bob,age=27,dept=Sales\"\n",
    "m = re.search(r'name=([A-Za-z]+),age=(\\d+),dept=([A-Za-z]+)', line)\n",
    "print(m.group(1))\n",
    "print(m.group(2))\n",
    "print(m.group(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cce69516",
   "metadata": {},
   "source": [
    "## Alternation (OR)"
   ]
  },
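  {
   "cell_type": "markdown",
   "id": "c9a02878",
   "metadata": {},
   "source": [
    "Use `|` to match any one of several alternatives. Wrap the alternatives in parentheses when they are only part of a larger pattern, as in `\\.(jpg|png|gif)$`."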
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c3e51f4a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "photo.jpg -> valid image file\n",
      "diagram.png -> valid image file\n",
      "animation.gif -> valid image file\n",
      "document.pdf -> not an image\n",
      "archive.zip -> not an image\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# pattern using alternation\n",
    "pattern = r\"\\.(jpg|png|gif)$\"\n",
    "\n",
    "files = [\n",
    "    \"photo.jpg\",\n",
    "    \"diagram.png\",\n",
    "    \"animation.gif\",\n",
    "    \"document.pdf\",\n",
    "    \"archive.zip\"\n",
    "]\n",
    "\n",
    "for file in files:\n",
    "    if re.search(pattern, file):\n",
    "        print(f\"{file} -> valid image file\")\n",
    "    else:\n",
    "        print(f\"{file} -> not an image\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "ff59c4e8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found beverage: coffee\n",
      "Found beverage: tea\n",
      "No beverage found\n",
      "Found beverage: Coffee\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "pattern = r\"\\b(coffee|tea)\\b\"\n",
    "\n",
    "sentences = [\n",
    "    \"I like coffee in the morning.\",\n",
    "    \"She prefers tea at night.\",\n",
    "    \"He drinks water.\",\n",
    "    \"Coffee is my favorite.\"\n",
    "]\n",
    "\n",
    "for sentence in sentences:\n",
    "    match = re.search(pattern, sentence, re.IGNORECASE)\n",
    "\n",
    "    if match:\n",
    "        print(f\"Found beverage: {match.group()}\")\n",
    "    else:\n",
    "        print(\"No beverage found\")\n"
   ]
  },
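  {
   "cell_type": "markdown",
   "id": "e05c6f18",
   "metadata": {},
   "source": [
    "One detail worth checking: `|` has the lowest precedence of the regex operators, so anchors and neighboring text attach to a single alternative unless you group them. A minimal sketch with a made-up word list:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "words = ['cat', 'dog', 'catalog', 'hotdog']  # made-up sample words\n",
    "\n",
    "# Without grouping, this reads as (^cat) OR (dog$)\n",
    "loose = [w for w in words if re.search(r'^cat|dog$', w)]\n",
    "\n",
    "# Grouping makes both anchors apply to every alternative\n",
    "strict = [w for w in words if re.search(r'^(?:cat|dog)$', w)]\n",
    "\n",
    "print(loose)   # ['cat', 'dog', 'catalog', 'hotdog']\n",
    "print(strict)  # ['cat', 'dog']\n",
    "```\n",
    "\n",
    "The non-capturing group is used here because the parentheses are needed only for structure, not for extraction."
   ]
  },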
  {
   "cell_type": "markdown",
   "id": "e1c7e284",
   "metadata": {},
   "source": [
    "## Advanced Topics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ee22e3c",
   "metadata": {},
   "source": [
    "### Flags\n",
    "\n",
    "Flags change matching behavior:\n",
    "\n",
    "| Flag | Shorthand | Meaning |\n",
    "|---|---|---|\n",
    "| `re.IGNORECASE` | `re.I` | Case-insensitive matching |\n",
    "| `re.MULTILINE` | `re.M` | `^`/`$` match line start/end |\n",
    "| `re.DOTALL` | `re.S` | `.` matches newline too |\n",
    "| `re.VERBOSE` | `re.X` | Allow comments/whitespace in pattern |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "4d3245a1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Hello', 'HELLO', 'hello']\n",
      "['<div>\\nsome content\\n</div>']\n",
      "['hello@example.com', 'support@test.org']\n"
     ]
    }
   ],
   "source": [
    "# IGNORECASE\n",
    "print(re.findall(r\"hello\", \"Hello HELLO hello\", re.I))  # ['Hello', 'HELLO', 'hello']\n",
    "\n",
    "# DOTALL — dot matches newline\n",
    "text = \"<div>\\nsome content\\n</div>\"\n",
    "print(re.findall(r\"<div>.*</div>\", text, re.DOTALL))  # matches across lines\n",
    "\n",
    "# VERBOSE — write readable patterns with comments\n",
    "email_pattern = re.compile(r\"\"\"\n",
    "    [\\w.+-]+       # username\n",
    "    @              # at sign\n",
    "    [\\w-]+         # domain name\n",
    "    \\.             # dot\n",
    "    [\\w.]+         # TLD\n",
    "\"\"\", re.VERBOSE)\n",
    "\n",
    "print(email_pattern.findall(\"Contact us at hello@example.com or support@test.org\"))"
   ]
  },
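  {
   "cell_type": "markdown",
   "id": "72c9d0a4",
   "metadata": {},
   "source": [
    "`re.MULTILINE` does not appear in the example above, so here is a minimal sketch of its effect on `^`. The sample text is invented for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = 'first line\\nsecond line\\nthird line'  # sample text invented for this sketch\n",
    "\n",
    "# By default, ^ matches only at the start of the whole string\n",
    "default_hits = re.findall(r'^\\w+', text)\n",
    "\n",
    "# With re.MULTILINE, ^ also matches right after each newline\n",
    "multiline_hits = re.findall(r'^\\w+', text, re.MULTILINE)\n",
    "\n",
    "print(default_hits)    # ['first']\n",
    "print(multiline_hits)  # ['first', 'second', 'third']\n",
    "```\n",
    "\n",
    "Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`."
   ]
  },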
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7afa4c2",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Regex Flags\n",
    "# Difficulty: Basic\n",
    "import re\n",
    "log = \"\"\"INFO: Server started\n",
    "error: disk full\n",
    "WARNING: low memory\n",
    "ERROR: connection lost\"\"\"\n",
    "# 1. Use re.findall() with re.IGNORECASE | re.MULTILINE to\n",
    "#    extract every line that begins with 'error'\n",
    "# 2. Print the list of matches\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15265b18",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "import re\n",
    "log = \"\"\"INFO: Server started\n",
    "error: disk full\n",
    "WARNING: low memory\n",
    "ERROR: connection lost\"\"\"\n",
    "results = re.findall(r'^error.*$', log, flags=re.IGNORECASE | re.MULTILINE)\n",
    "print(results)"
   ]
  },
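  {
   "cell_type": "markdown",
   "id": "e898c620",
   "metadata": {},
   "source": [
    "### Compiled Patterns\n",
    "\n",
    "Use `re.compile()` when you reuse the same pattern many times. The compiled object offers the same methods as the module (`search`, `findall`, `sub`, and so on), keeps the pattern in one named place, and avoids looking the pattern up in the module's cache on every call."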
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "1c2b8f29",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found: ['123-456-7890'] in 'Call me at 123-456-7890'\n",
      "Found: ['987.654.3210'] in 'My number is 987.654.3210'\n",
      "Found: ['555-123-4567', '800-999-0000'] in 'Reach us at 555-123-4567 or 800-999-0000'\n"
     ]
    }
   ],
   "source": [
    "# Compile once, use many times\n",
    "phone_pattern = re.compile(r\"\\b\\d{3}[-.]\\d{3}[-.]\\d{4}\\b\")\n",
    "\n",
    "texts = [\n",
    "    \"Call me at 123-456-7890\",\n",
    "    \"My number is 987.654.3210\",\n",
    "    \"No phone here\",\n",
    "    \"Reach us at 555-123-4567 or 800-999-0000\"\n",
    "]\n",
    "\n",
    "for t in texts:\n",
    "    matches = phone_pattern.findall(t)\n",
    "    if matches:\n",
    "        print(f\"Found: {matches} in '{t}'\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c84d8f4",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Compiled Patterns\n",
    "# Difficulty: Basic\n",
    "import re\n",
    "emails = ['alice@example.com', 'not-an-email', 'bob@company.org', 'charlie_at_test.net']\n",
    "# 1. Compile a regex pattern that matches a simple email address\n",
    "# 2. Print each email with True or False using the compiled pattern\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d023474f",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "import re\n",
    "emails = ['alice@example.com', 'not-an-email', 'bob@company.org', 'charlie_at_test.net']\n",
    "email_re = re.compile(r'[\\w.+-]+@[\\w-]+\\.[\\w.]+')\n",
    "for e in emails:\n",
    "    print(e, bool(email_re.fullmatch(e)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f90ad03b",
   "metadata": {},
   "source": [
    "### Lookahead & Lookbehind\n",
    "\n",
    "Lookarounds match a pattern only if it is (or isn't) preceded or followed by another pattern, without including that other pattern in the match.\n",
    "\n",
    "| Syntax | Type | Meaning |\n",
    "|---|---|---|\n",
    "| `(?=...)` | Positive lookahead | Followed by |\n",
    "| `(?!...)` | Negative lookahead | NOT followed by |\n",
    "| `(?<=...)` | Positive lookbehind | Preceded by |\n",
    "| `(?<!...)` | Negative lookbehind | NOT preceded by |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "c84820d8",
   "metadata": {},
   "outputs": [
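    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['100', '300']\n",
      "['10', '200', '30']\n",
      "['200']\n",
      "['42.99', '5.00']\n"
     ]
    }
   ],
   "source": [
    "# Positive lookahead: prices followed by USD\n",
    "text = \"100USD 200EUR 300USD\"\n",
    "print(re.findall(r\"\\d+(?=USD)\", text))     # ['100', '300']\n",
    "\n",
    "# Negative lookahead: the naive pattern backtracks, so '10' and '30' slip through\n",
    "print(re.findall(r\"\\d+(?!USD)\", text))     # ['10', '200', '30']\n",
    "\n",
    "# Also reject a following digit to keep only whole numbers not followed by USD\n",
    "print(re.findall(r\"\\d+(?!\\d|USD)\", text))  # ['200']\n",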
    "\n",
    "# Positive lookbehind — extract amount after $\n",
    "text2 = \"Price: $42.99, discount: $5.00\"\n",
    "print(re.findall(r\"(?<=\\$)[\\d.]+\", text2)) # ['42.99', '5.00']"
   ]
  },
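  {
   "cell_type": "markdown",
   "id": "b8e41f63",
   "metadata": {},
   "source": [
    "Negative lookbehind is the one form in the table that the examples above do not exercise. The sketch below, on a made-up sentence, extracts the numbers that are not prices. The `\\b` anchors are there so the engine cannot match only part of a number.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = 'Item 42 costs $15 and ships in 3 days'  # made-up sentence\n",
    "\n",
    "# Numbers NOT preceded by a dollar sign.\n",
    "# The \\b anchors force whole numbers, so the engine cannot skip the '1'\n",
    "# in '$15' and report '5' on its own.\n",
    "non_prices = re.findall(r'(?<!\\$)\\b\\d+\\b', text)\n",
    "print(non_prices)  # ['42', '3']\n",
    "```\n",
    "\n",
    "Note that Python's `re` module requires the pattern inside a lookbehind to have a fixed width."
   ]
  },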
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60e69911",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Lookahead & Lookbehind\n",
    "# Difficulty: Intermediate\n",
    "import re\n",
    "text = \"Alice scored 95pts, Bob scored 80pts, Charlie scored 73pts\"\n",
    "# 1. Use a positive lookahead to extract all numbers followed by 'pts'\n",
    "# 2. Use a positive lookbehind to extract numbers preceded by 'scored '\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5ae44a4",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "import re\n",
    "text = \"Alice scored 95pts, Bob scored 80pts, Charlie scored 73pts\"\n",
    "print(re.findall(r'\\d+(?=pts)', text))        # positive lookahead\n",
    "print(re.findall(r'(?<=scored )\\d+', text))   # positive lookbehind"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5861fdc",
   "metadata": {},
   "source": [
    "## Applications\n",
    "\n",
    "### Cleaning Text\n",
    "\n",
    "Before we can search the text of *Dracula*, we need to download it from Project Gutenberg and remove the header and footer information."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1afc4a3",
   "metadata": {},
   "source": [
    "The cells below save the raw text to the project's `data` folder, then write a cleaned copy next to it. The search functions later in this section read the cleaned file by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "68b04060",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dracula already downloaded: /Users/tcn85/workspace/py/data/pg345.txt\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "from urllib.request import urlretrieve\n",
    "\n",
    "data_dir = project_root / 'data'\n",
    "data_dir.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Download Dracula text to the project data folder\n",
    "url = 'https://www.gutenberg.org/files/345/345-0.txt'\n",
    "raw_path = data_dir / 'pg345.txt'\n",
    "clean_path = data_dir / 'pg345_cleaned.txt'\n",
    "if not raw_path.exists():\n",
    "    urlretrieve(url, raw_path)\n",
    "    print('Downloaded Dracula to', raw_path)\n",
    "else:\n",
    "    print('Dracula already downloaded:', raw_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "1dfd4fd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# download('https://www.gutenberg.org/cache/epub/345/pg345.txt');"
   ]
  },
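  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "f110f8bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_file(infile, outfile):\n",
    "    \"\"\"First pass: copy infile to outfile, dropping only the '***' marker lines.\n",
    "\n",
    "    The header and footer text is kept; the next cell redefines clean_file to drop it too.\n",
    "    \"\"\"\n",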
    "    with open(infile, encoding='utf8') as fin, open(outfile, 'w', encoding='utf8') as fout:\n",
    "        for line in fin:\n",
    "            if not is_special_line(line):\n",
    "                fout.write(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d759ef63",
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_file(input_file, output_file):\n",
    "    \"\"\"Copy only the lines between the first two '***' marker lines.\"\"\"\n",
    "    with open(input_file, encoding='utf-8') as reader, open(output_file, 'w', encoding='utf-8') as writer:\n",
    "        # Skip the header, up to and including the start marker\n",
    "        for line in reader:\n",
    "            if is_special_line(line):\n",
    "                break\n",
    "\n",
    "        # Copy the book text, stopping at the end marker\n",
    "        for line in reader:\n",
    "            if is_special_line(line):\n",
    "                break\n",
    "            writer.write(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "28577dec",
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_special_line(line):\n",
    "    \"\"\"Return True if the line marks the start or end of the Gutenberg content.\"\"\"\n",
    "    return line.startswith('***')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "d6fb49c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# def is_special_line(line):\n",
    "#     return line.strip().startswith('*** ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "9f689533",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cleaned file saved to /Users/tcn85/workspace/py/data/pg345_cleaned.txt\n"
     ]
    }
   ],
   "source": [
    "# Clean the Dracula text and save to data/pg345_cleaned.txt\n",
    "clean_file(raw_path, clean_path)\n",
    "print('Cleaned file saved to', clean_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "521a7c5a",
   "metadata": {},
   "source": [
    "With the cleaned file in place, here's a function that loops through the lines in the book until it finds one that matches the given pattern, and returns the `Match` object. If no line matches, it returns `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "dd3afac4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_first(pattern, path=clean_path):\n",
    "    with open(path, encoding='utf8') as f:\n",
    "        for line in f:\n",
    "            result = re.search(pattern, line)\n",
    "            if result is not None:\n",
    "                return result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52b080be",
   "metadata": {},
   "source": [
    "We can use it to find the first mention of a character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "a74228d0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CHAPTER I. Jonathan Harker’s Journal\\n'"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first('Harker')\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "565413cf",
   "metadata": {},
   "source": [
    "For this example, we didn't have to use regular expressions; we could have done the same thing more easily with the `in` operator.\n",
    "But regular expressions can do things the `in` operator cannot.\n",
    "\n",
    "For example, if the pattern includes the vertical bar character, `'|'`, it can match either the sequence on the left or the sequence on the right.\n",
    "Suppose we want to find the first mention of Mina Murray in the book, but we are not sure whether she is referred to by first name or last.\n",
    "We can use the following pattern, which matches either name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "2d042bed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CHAPTER V. Letters—Lucy and Mina\\n'"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern = 'Mina|Murray'\n",
    "result = find_first(pattern)\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b62fd404",
   "metadata": {},
   "source": [
    "We can use a pattern like this to estimate how often a character is mentioned by either name.\n",
    "Here's a function that loops through the book and counts the number of lines that match the given pattern. A line that mentions the character more than once still counts once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "35dd291b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_matches(pattern, path=clean_path):\n",
    "    count = 0\n",
    "    with open(path, encoding='utf8') as f:\n",
    "        for line in f:\n",
    "            result = re.search(pattern, line)\n",
    "            if result is not None:\n",
    "                count += 1\n",
    "    return count"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adaf5bb1",
   "metadata": {},
   "source": [
    "Now let's see how many lines mention Mina by either name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "585882de",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "229"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_matches('Mina|Murray')"
   ]
  },
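  {
   "cell_type": "markdown",
   "id": "1fa6d3c9",
   "metadata": {},
   "source": [
    "Because `count_matches` counts lines, a line that contains two matches still adds one to the total. If you want total occurrences instead, one option is to sum `len(re.findall(...))` over the lines. The sketch below uses three invented lines rather than the book, so the numbers are illustrative only.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Three invented lines standing in for the book\n",
    "lines = [\n",
    "    'Mina Murray wrote to Lucy.\\n',\n",
    "    'Lucy replied to Mina.\\n',\n",
    "    'No names here.\\n',\n",
    "]\n",
    "pattern = 'Mina|Murray'\n",
    "\n",
    "# Lines with at least one match (what count_matches reports)\n",
    "line_count = sum(1 for line in lines if re.search(pattern, line))\n",
    "\n",
    "# Total matches, counting every occurrence on every line\n",
    "occurrence_count = sum(len(re.findall(pattern, line)) for line in lines)\n",
    "\n",
    "print(line_count, occurrence_count)  # 2 3\n",
    "```\n",
    "\n",
    "Neither number is a count of distinct mentions: \"Mina Murray\" on one line is a single mention but two matches."
   ]
  },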
  {
   "cell_type": "markdown",
   "id": "81a35c6f",
   "metadata": {},
   "source": [
    "The special character `'^'` matches the beginning of a string, so we can find a line that starts with a given pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "de17fdda",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Dracula, jumping to his feet, said:--\\n'"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first('^Dracula')\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c7cfb3f",
   "metadata": {},
   "source": [
    "And the special character `'$'` matches the end of a string, so we can find a line that ends with a given pattern (ignoring the newline at the end)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "b7dbd7ef",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'by five o’clock, we must start off; for it won’t do to leave Mrs. Harker\\n'"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first('Harker$')\n",
    "result.string"
   ]
  },
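  {
   "cell_type": "markdown",
   "id": "c3570e8d",
   "metadata": {},
   "source": [
    "The phrase \"ignoring the newline\" deserves a quick check. By default, `$` matches at the very end of the string and also just before a trailing newline, while `\\Z` matches only at the very end. A small sketch, using a shortened stand-in for a line of the book:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "line = 'do to leave Mrs. Harker\\n'  # shortened stand-in for a line read from the file\n",
    "\n",
    "# $ matches just before the trailing newline\n",
    "dollar = re.search(r'Harker$', line)\n",
    "\n",
    "# \\Z matches only at the very end of the string, after the newline\n",
    "strict_end = re.search(r'Harker\\Z', line)\n",
    "\n",
    "print(dollar is not None)      # True\n",
    "print(strict_end is not None)  # False\n",
    "```\n",
    "\n",
    "This is why `find_first('Harker$')` works on lines that still carry their newline."
   ]
  },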
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "5f14fc19",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Download and Clean Text\n",
    "# Difficulty: Intermediate\n",
    "# 1. Use raw_path and clean_path to print whether each file exists\n",
    "# 2. If clean_path does not exist, run clean_file(raw_path, clean_path)\n",
    "# 3. Print the size (in bytes) of clean_path\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "92f55807",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True True\n",
      "852703\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "print(raw_path.exists(), clean_path.exists())\n",
    "if not clean_path.exists():\n",
    "    clean_file(raw_path, clean_path)\n",
    "print(clean_path.stat().st_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f67418ba",
   "metadata": {},
   "source": [
    "### String Substitution\n",
    "\n",
    "Bram Stoker was born in Ireland, and when *Dracula* was published in 1897, he was living in England.\n",
    "So we would expect him to use the British spelling of words like \"centre\" and \"colour\".\n",
    "To check, we can use the following pattern, which matches either \"centre\" or the American spelling \"center\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "id": "a2557856",
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = 'cent(er|re)'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e197ea79",
   "metadata": {},
   "source": [
    "In this pattern, the parentheses enclose the part of the pattern the vertical bar applies to.\n",
    "So this pattern matches a sequence that starts with `'cent'` and ends with either `'er'` or `'re'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "id": "9912bca3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'horseshoe of the Carpathians, as if it were the centre of some sort of\\n'"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first(pattern)\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "994b3902",
   "metadata": {},
   "source": [
    "As expected, he used the British spelling.\n",
    "\n",
    "We can also check whether he used the British spelling of \"colour\".\n",
    "The following pattern uses the special character `'?'`, which means that the previous character is optional."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "5648ad9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = 'colou?r'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64633322",
   "metadata": {},
   "source": [
    "This pattern matches either \"colour\" with the `'u'` or \"color\" without it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "2caa4b8c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'undergarment with long double apron, front, and back, of coloured stuff\\n'"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first(pattern)\n",
    "line = result.string\n",
    "line"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dbb91ce",
   "metadata": {},
   "source": [
    "Again, as expected, he used the British spelling.\n",
    "\n",
    "Now suppose we want to produce an edition of the book with American spellings.\n",
    "We can use the `sub` function in the `re` module, which does **string substitution**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "id": "c252a3b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'undergarment with long double apron, front, and back, of colored stuff\\n'"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(pattern, 'color', line)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2baef97d",
   "metadata": {},
   "source": [
    "The first argument is the pattern we want to find and replace, the second is what we want to replace it with, and the third is the string we want to search.\n",
    "In the result, you can see that \"colour\" has been replaced with \"color\"."
   ]
  },
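  {
   "cell_type": "markdown",
   "id": "5e2b7d90",
   "metadata": {},
   "source": [
    "`re.sub` also accepts a function as the replacement, which can help when several spellings need different replacements in one pass. The sketch below is an illustration rather than part of the book's workflow: the spelling map, the sample sentence, and the helper name `to_american` are all assumptions. `subn` behaves like `sub` but also returns the number of replacements made.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical spelling map; extend it as needed\n",
    "american = {'grey': 'gray', 'theatre': 'theater', 'harbour': 'harbor'}\n",
    "pattern = re.compile(r'\\b(?:' + '|'.join(american) + r')\\b')\n",
    "\n",
    "\n",
    "def to_american(match):\n",
    "    \"\"\"Return the American spelling for the matched word.\"\"\"\n",
    "    return american[match.group()]\n",
    "\n",
    "\n",
    "sample = 'A grey theatre stood beside the grey harbour.'\n",
    "result, n = pattern.subn(to_american, sample)\n",
    "print(result)  # A gray theater stood beside the gray harbor.\n",
    "print(n)       # 4\n",
    "```\n",
    "\n",
    "The map here is lowercase only, so capitalized words such as \"Grey\" would be left unchanged."
   ]
  },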
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "35c380ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# I used this function to search for lines to use as examples\n",
    "\n",
    "def all_matches(pattern, path=clean_path):\n",
    "    with open(path, encoding='utf8') as f:\n",
    "        for line in f:\n",
    "            result = re.search(pattern, line)\n",
    "            if result:\n",
    "                print(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "53b797ca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "weather. As I stood, the driver jumped again into his seat and shook the\n",
      "weatherworn, was still complete; but it was evidently many a day since\n",
      "it is a buoy with a bell, which swings in bad weather, and sends in a\n",
      "am awakened by her moving about the room. Fortunately, the weather is so\n",
      "learn the weather signs. To-day is a grey day, and the sun as I write is\n",
      "experienced here, with results both strange and unique. The weather had\n",
      "kept watch on weather signs from the East Cliff, foretold in an emphatic\n",
      "_22 July_.--Rough weather last three days, and all hands busy with\n",
      "weather. Passed Gibralter and out through Straits. All well.\n",
      "and entering on the Bay of Biscay with wild weather ahead, and yet last\n",
      "weather influences as we know that the Count can bring to bear; and if\n",
      "that I am fully armed as there may be wolves; the weather is getting\n"
     ]
    }
   ],
   "source": [
    "# Example: print every line that contains 'weather'\n",
    "\n",
    "all_matches('weather')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "1832203b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here's the pattern I used; the negative lookbehind skips capitalized words that follow a period and a space\n",
    "\n",
    "# names = r'(?<!\\.\\s)[A-Z][a-zA-Z]+'\n",
    "\n",
    "# all_matches(names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "656f4723",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: String Substitution\n",
    "# Difficulty: Intermediate\n",
    "sample = \"The colour of the city centre changed overnight.\"\n",
    "# 1. Replace British spellings with American spellings using regex:\n",
    "#    colour -> color, centre -> center\n",
    "# 2. Print the transformed sentence\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "ec5f54c7",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The color of the city center changed overnight.\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "import re\n",
    "\n",
    "sample = \"The colour of the city centre changed overnight.\"\n",
    "sample = re.sub(r'colou?r', 'color', sample)\n",
    "sample = re.sub(r'cent(er|re)', 'center', sample)\n",
    "print(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fd5be89",
   "metadata": {},
   "source": [
    "### `re.fullmatch()` for Validation\n",
    "\n",
    "`re.fullmatch(pattern, text)` succeeds only if the **entire** string matches the pattern.\n",
    "This is the right tool for validation tasks (IDs, simple emails, phone formats, etc.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "a50bff32",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "EMP-0001 True\n",
      "EMP-12 False\n",
      "AEMP-0001 False\n",
      "EMP-12345 False\n"
     ]
    }
   ],
   "source": [
    "employee_id_pattern = r'EMP-\\d{4}'\n",
    "ids = ['EMP-0001', 'EMP-12', 'AEMP-0001', 'EMP-12345']\n",
    "\n",
    "for emp_id in ids:\n",
    "    print(emp_id, bool(re.fullmatch(employee_id_pattern, emp_id)))"
   ]
  },
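  {
   "cell_type": "markdown",
   "id": "a46f08d1",
   "metadata": {},
   "source": [
    "To see why `fullmatch` suits validation, it helps to compare it with `re.search` and `re.match` on the same inputs. A minimal sketch that reuses the employee ID pattern from above; the helper name `check` is made up for this example.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "pattern = r'EMP-\\d{4}'\n",
    "\n",
    "\n",
    "def check(candidate):\n",
    "    \"\"\"Return (search, match, fullmatch) results as booleans.\"\"\"\n",
    "    return (\n",
    "        bool(re.search(pattern, candidate)),     # match anywhere in the string\n",
    "        bool(re.match(pattern, candidate)),      # match at the start only\n",
    "        bool(re.fullmatch(pattern, candidate)),  # match the entire string\n",
    "    )\n",
    "\n",
    "\n",
    "for candidate in ['EMP-0001', 'AEMP-0001', 'EMP-12345']:\n",
    "    print(candidate, check(candidate))\n",
    "# EMP-0001 (True, True, True)\n",
    "# AEMP-0001 (True, False, False)\n",
    "# EMP-12345 (True, True, False)\n",
    "```\n",
    "\n",
    "Only `fullmatch` rejects both the extra leading letter and the extra trailing digit."
   ]
  },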
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "4f4e253b",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Full String Validation\n",
    "# Difficulty: Intermediate\n",
    "codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']\n",
    "# A valid course code must be: 2-4 uppercase letters, a dash, then 3 digits.\n",
    "# 1. Write the regex pattern\n",
    "# 2. Print each code with True/False using re.fullmatch\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "98b2b2e6",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CS-101 True\n",
      "MATH-240 True\n",
      "CS101 False\n",
      "EE-7 False\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']\n",
    "pattern = r'[A-Z]{2,4}-\\d{3}'\n",
    "for code_str in codes:\n",
    "    print(code_str, bool(re.fullmatch(pattern, code_str)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62aa23e7",
   "metadata": {},
   "source": [
    "## Quick Reference\n",
    "\n",
    "**Characters**\n",
    "\n",
    "| Pattern | Meaning | Example match |\n",
    "|---|---|---|\n",
    "| `.` | Any character (except newline) | `c.t` → `cat`, `cot` |\n",
    "| `\\d` | Digit | `7` |\n",
    "| `\\w` | Word character (letter, digit, underscore) | `A`, `x`, `9`, `_` |\n",
    "| `\\s` | Whitespace (space, tab, newline) | ` ` |\n",
    "| `[abc]` | Character class: any one of `a`, `b`, `c` | `a` |\n",
    "| `[^abc]` | Negated class: any character except `a`, `b`, `c` | `d` |\n",
    "\n",
    "**Quantifiers**\n",
    "\n",
    "| Pattern | Meaning | Example |\n",
    "|---|---|---|\n",
    "| `*` | 0 or more | `ab*` → `a`, `ab`, `abb` |\n",
    "| `+` | 1 or more | `ab+` → `ab`, `abb` |\n",
    "| `?` | 0 or 1 (optional) | `colou?r` → `color`, `colour` |\n",
    "| `{n}` | Exactly n | `\\d{3}` → `123` |\n",
    "| `{n,m}` | Between n and m (inclusive) | `\\d{2,4}` → `12`, `123` |\n",
    "| `*?` `+?` | Lazy (match as little as possible) | `<.+?>` |\n",
    "\n",
    "**Anchors**\n",
    "\n",
    "| Pattern | Meaning |\n",
    "|---|---|\n",
    "| `^` | Start of string (or line with `re.M`) |\n",
    "| `$` | End of string (or line with `re.M`) |\n",
    "| `\\b` | Word boundary |\n",
    "\n",
    "**Groups**\n",
    "\n",
    "| Syntax | Meaning |\n",
    "|---|---|\n",
    "| `(...)` | Capturing group |\n",
    "| `(?:...)` | Non-capturing group |\n",
    "| `(?P<name>...)` | Named group |\n",
    "| `(?=...)` | Positive lookahead |\n",
    "| `(?!...)` | Negative lookahead |\n",
    "| `(?<=...)` | Positive lookbehind |\n",
    "| `(?<!...)` | Negative lookbehind |\n",
    "\n",
    "**Flags**\n",
    "\n",
    "| Flag | Shorthand | Meaning |\n",
    "|---|---|---|\n",
    "| `re.IGNORECASE` | `re.I` | Case-insensitive matching |\n",
    "| `re.MULTILINE` | `re.M` | `^`/`$` match line start/end |\n",
    "| `re.DOTALL` | `re.S` | `.` matches newline too |\n",
    "| `re.VERBOSE` | `re.X` | Allow comments/whitespace in pattern |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
