8.3. Text Analysis#
At this point we have covered Python’s core data structures – lists, dictionaries, and tuples – and some algorithms that use them. In this chapter, we’ll use them to explore text analysis and Markov generation:
Text analysis is a way to describe the statistical relationships between the words in a document, like the probability that one word is followed by another, and
Markov generation is a way to generate new text with words and phrases similar to the original text.
These algorithms are similar to parts of a Large Language Model (LLM), which is the key component of a chatbot.
We’ll start by counting the number of times each word appears in a book. Then we’ll look at pairs of words, and make a list of the words that can follow each word. We’ll make a simple version of a Markov generator, and as an exercise, you’ll have a chance to make a more general version.
8.3.1. Unique words#
As a first step toward text analysis, let’s read a book – The Strange Case Of Dr. Jekyll And Mr. Hyde by Robert Louis Stevenson – and count the number of unique words. Instructions for downloading the book are in the notebook for this chapter.
The following cell downloads the book from Project Gutenberg.
data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)  # create the data directory if it doesn't exist

raw_path = data_dir / 'pg43.txt'  ### the raw text file downloaded from Project Gutenberg
clean_path = data_dir / 'dr_jekyll.txt'  ### this will be the cleaned text file that we will use for analysis

if not raw_path.exists():
    download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))
    print('Downloaded to', raw_path)
else:
    print('Already downloaded:', raw_path)
Already downloaded: /Users/tychen/workspace/py/data/pg43.txt
The version available from Project Gutenberg includes information about the book at the beginning and license information at the end.
We’ll use clean_file from Chapter 8 to remove this material and write a “clean” file that contains only the text of the book.
def is_special_line(line):
    return line.strip().startswith('*** ')  ### this marker appears at the start and end of the actual text

def clean_file(input_file, output_file):
    reader = open(input_file, encoding='utf-8')  # open the input file for reading with UTF-8 encoding
    writer = open(output_file, 'w')              # open the output file for writing
    # reader and writer are file objects we can use to read from and write to
    # the files; reading happens line by line, so we can loop over the lines

    # skip the header: stop when we reach the marker for the start of the actual text
    for line in reader:
        if is_special_line(line):
            break

    # copy the text: stop when we reach the marker for the end of the actual text
    for line in reader:
        if is_special_line(line):
            break
        writer.write(line)

    reader.close()  # close the input file
    writer.close()  # close the output file

### the same function using a with statement, which closes the files automatically
# with open(input_file, encoding='utf-8') as reader, open(output_file, 'w') as writer:
#     for line in reader:
#         if is_special_line(line):
#             break
#     for line in reader:
#         if is_special_line(line):
#             break
#         writer.write(line)
filename = clean_path  ### dr_jekyll.txt will hold the cleaned text
clean_file(raw_path, filename)  ### read from pg43.txt, write to dr_jekyll.txt

count = 0  ### print a few lines rather than the entire file
for line in open(filename):
    print(line, end='')
    count += 1
    if count > 20:  ### stop after about 20 lines
        break
The Strange Case Of Dr. Jekyll And Mr. Hyde
by Robert Louis Stevenson
Contents
STORY OF THE DOOR
SEARCH FOR MR. HYDE
DR. JEKYLL WAS QUITE AT EASE
THE CAREW MURDER CASE
INCIDENT OF THE LETTER
INCIDENT OF DR. LANYON
We’ll use a for loop to read lines from the file and split to divide the lines into words.
Then, to keep track of unique words, we’ll store each word as a key in a dictionary.
unique_words = {}
for line in open(filename):  ### filename is dr_jekyll.txt, the cleaned text
    seq = line.split()
    for word in seq:
        unique_words[word] = 1

len(unique_words)
# unique_words
6042
The length of the dictionary is the number of unique words – about 6000 by this way of counting.
But if we inspect them, we’ll see that some are not valid words.
For example, let’s look at the longest words in unique_words.
We can use sorted to sort the words, passing the len function as a keyword argument so the words are sorted by length.
sorted(unique_words, key=len)[-5:]
['chocolate-coloured',
'superiors—behold!”',
'coolness—frightened',
'gentleman—something',
'pocket-handkerchief.']
The slice index, [-5:], selects the last 5 elements of the sorted list, which are the longest words.
The list includes some legitimately long words, like “circumscription”, and some hyphenated words, like “chocolate-coloured”. But some of the longest “words” are actually two words separated by a dash. And other words include punctuation like periods, exclamation points, and quotation marks.
So, before we move on, let’s deal with dashes and other punctuation.
### EXERCISE: Counting Unique Words
text = "to be or not to be that is the question to be"
# 1. Build a dictionary where each key is a unique word from 'text'
# 2. Print the total number of unique words
### Your code starts here:
### Your code ends here.
8
8.3.2. Punctuation#
To identify the words in the text, we need to deal with two issues:
When a dash appears in a line, we should replace it with a space, so that when we use split, the words will be separated.

After splitting the words, we can use strip to remove punctuation.
To handle the first issue, we can use the following function, which takes a string, replaces dashes with spaces, splits the string, and returns the resulting list.
def split_line(line):
    return line.replace('—', ' ').split()
Notice that split_line only replaces dashes, not hyphens.
Here’s an example.
split_line('coolness—frightened')
['coolness', 'frightened']
Now, to remove punctuation from the beginning and end of each word, we can use strip, but we need a list of characters that are considered punctuation.
Characters in Python strings are in Unicode, which is an international standard used to represent letters in nearly every alphabet, numbers, symbols, punctuation marks, and more.
The unicodedata module provides a category function we can use to tell which characters are punctuation.
Given a letter, it returns a string with information about what category the letter is in.
import unicodedata
unicodedata.category('A')
'Lu'
The category string of 'A' is 'Lu' – the 'L' means it is a letter and the 'u' means it is uppercase.
The category string of '.' is 'Po' – the 'P' means it is punctuation and the 'o' means its subcategory is “other”.
unicodedata.category('.')
'Po'
We can find the punctuation marks in the book by checking for characters with categories that begin with 'P'.
The following loop stores the unique punctuation marks in a dictionary.
punc_marks = {}
for line in open(filename):  ### filename is dr_jekyll.txt, the cleaned text
    for char in line:
        category = unicodedata.category(char)
        if category.startswith('P'):
            punc_marks[char] = 1
To make a list of punctuation marks, we can join the keys of the dictionary into a string.
punctuation = ''.join(punc_marks)
print(punctuation)
.’;,-“”:?—‘!()_
Now that we know which characters in the book are punctuation, we can write a function that takes a word, strips punctuation from the beginning and end, and converts it to lower case.
def clean_word(word):
    return word.strip(punctuation).lower()
Here’s an example.
clean_word('“Behold!”')
'behold'
Because strip removes characters from the beginning and end, it leaves hyphenated words alone.
clean_word('pocket-handkerchief')
'pocket-handkerchief'
Now here’s a loop that uses split_line and clean_word to identify the unique words in the book.
unique_words2 = {}
for line in open(filename):
    for word in split_line(line):  ### split_line handles the em dash
        word = clean_word(word)    ### strips punctuation and lowercases
        unique_words2[word] = 1

len(unique_words2)
4005
With this stricter definition of what a word is, there are about 4000 unique words. And we can confirm that the list of longest words has been cleaned up.
key=len tells sorted to sort by the length of each word: it calls len on each word and sorts by that number, from shortest to longest by default.
sorted(unique_words2, key=len)[-5:]
['circumscription',
'unimpressionable',
'fellow-creatures',
'chocolate-coloured',
'pocket-handkerchief']
### EXERCISE: Cleaning Words
words = ['"Hello,"', 'world!', 'chocolate-coloured', '"Behold!"', 'pocket-handkerchief']
# Use clean_word() to process each word and print the cleaned version.
# Note: clean_word() strips punctuation from both ends and lowercases.
### Your code starts here:
### Your code ends here.
"hello,"
world
chocolate-coloured
"behold!"
pocket-handkerchief
Now let’s see how many times each word is used.
8.3.3. Word Frequencies#
The following loop computes the frequency of each unique word. It is very similar to the value_counts function from the dictionary chapter, which we used to count the letter frequencies of “brontosaurus” and “mississippi”.
word_counter = {}
for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        if word not in word_counter:
            word_counter[word] = 1
        else:
            word_counter[word] += 1
The first time we see a word, we initialize its frequency to 1. If we see the same word again later, we increment its frequency.
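As an aside, the standard library's collections.Counter implements this same initialize-or-increment pattern. This sketch, using a short sample string in place of the book file, shows that the two approaches produce the same result.

```python
from collections import Counter

# Counter does the "initialize to 1 or increment" logic for us
sample = 'half a bee philosophically must ipso facto half not be'.split()

counter_version = Counter(sample)

manual_version = {}
for word in sample:
    if word not in manual_version:
        manual_version[word] = 1
    else:
        manual_version[word] += 1

print(counter_version == manual_version)  # True: the two mappings agree
```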
To see which words appear most often, we can use items to get the key-value pairs from word_counter, and sort them by the second element of the pair, which is the frequency.
First we’ll define a function that selects the second element of a pair, just like second_element from the tuples chapter.

def second_element(t):
    return t[1]
Now we can use sorted with two keyword arguments:
key=second_element means the items will be sorted according to the frequencies of the words.

reverse=True means the items will be sorted in reverse order, with the most frequent words first.
items = sorted(word_counter.items(), key=second_element, reverse=True)
Here are the five most frequent words.
for word, freq in items[:5]:
    print(freq, word, sep='\t')
1614 the
972 and
941 of
640 to
640 i
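The same sort can also be written without a named helper function, using a lambda as the key. A small sketch with a toy counter (not the book's word_counter):

```python
toy_counter = {'the': 3, 'cat': 2, 'sat': 2, 'on': 1}

# key=lambda t: t[1] plays the same role as second_element
items = sorted(toy_counter.items(), key=lambda t: t[1], reverse=True)
print(items[0])  # ('the', 3)
```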
### EXERCISE: Word Frequency Counter
text = "the cat sat on the mat and the cat sat"
# 1. Build a word frequency dictionary (word → count) for 'text'
# 2. Print the top 3 most frequent words with their counts (tab-separated)
### Your code starts here:
### Your code ends here.
3 the
2 cat
2 sat
In the next section, we’ll encapsulate this loop in a function. And we’ll use it to demonstrate a new feature – optional parameters.
8.3.4. Optional Parameters#
We’ve used built-in functions that take optional parameters.
For example, round takes an optional parameter called ndigits that indicates how many decimal places to keep.
round(3.141592653589793, ndigits=3)
3.142
But it’s not just built-in functions – we can write functions with optional parameters, too.
For example, the following function takes two parameters, word_counter and num.
def print_most_common(word_counter, num=5):
    items = sorted(word_counter.items(), key=second_element, reverse=True)
    for word, freq in items[:num]:
        print(freq, word, sep='\t')
The second parameter looks like an assignment statement, but it’s not – it’s an optional parameter.
If you call this function with one argument, num gets the default value, which is 5.
print_most_common(word_counter)
1614 the
972 and
941 of
640 to
640 i
If you call this function with two arguments, the second argument gets assigned to num instead of the default value.
print_most_common(word_counter, 3)
1614 the
972 and
941 of
In that case, we would say the optional argument overrides the default value.
If a function has both required and optional parameters, all of the required parameters have to come first, followed by the optional ones.
%%expect SyntaxError

def bad_function(n=5, word_counter):
    return None
Cell In[34], line 1
def bad_function(n=5, word_counter):
^
SyntaxError: parameter without a default follows parameter with a default
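Putting the required parameter first fixes the problem. Here is a hypothetical variant (good_function is not from the book) that compiles and runs:

```python
def good_function(word_counter, n=5):
    # required parameter first, optional parameter (with a default) last
    return list(word_counter)[:n]

print(good_function({'a': 1, 'b': 2, 'c': 3}, 2))  # ['a', 'b']
```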
### EXERCISE: Function with Optional Parameter
word_counter = {'the': 3, 'cat': 2, 'sat': 2, 'on': 1, 'mat': 1, 'and': 1}
# write a function called print_least_common that takes a word_counter dictionary
# and an optional parameter num (default 3)
# the function should print the num least frequent words in the word_counter,
# one per line, with the frequency and word separated by a tab
### Your code starts here:
### Your code ends here.
1 on
1 mat
1 and
1 on
1 mat
1 and
2 cat
2 sat
8.3.5. Dictionary Subtraction#
Suppose we want to spell-check a book – that is, find a list of words that might be misspelled. One way to do that is to find words in the book that don’t appear in a list of valid words, and that’s how we’ll spell-check Robert Louis Stevenson. We can think of this problem as set subtraction – that is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the valid word list).
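Python's built-in set type supports this operation directly. A quick illustration with toy data (the chapter uses dictionaries instead, so the word frequencies are preserved):

```python
book_words = {'the', 'cat', 'reindue'}
valid = {'the', 'cat', 'dog'}

# set difference keeps elements of the first set that are absent from the second
maybe_misspelled = book_words - valid
print(maybe_misspelled)  # {'reindue'}
```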
The following cell downloads the word list.
(*I am using ../../data as the data folder; your data folder may be different from mine if you download the notebook and place it in your own project folder. In that case you do not have to specify the data folder if you have words.txt downloaded into the same folder as the notebook.)
if __name__ == '__main__':  ### a common Python idiom: this block runs only when the script is executed directly, not when it is imported as a module
    print_most_common(word_counter)

download('https://raw.githubusercontent.com/AllenDowney/ThinkPython/v3/words.txt', '../../data');
3 the
2 cat
2 sat
1 on
1 mat
We can read the contents of words.txt and split it into a list of strings.
### This is a common way to read a file and split it into a list of words.
### However, it does not properly close the file after reading, which can
### lead to resource leaks. Using a with statement is a better practice as
### it ensures that the file is properly closed after its suite finishes,
### even if an error occurs.
# word_list = open('../../data/words.txt').read().split()

with open('../../data/words.txt') as f:
    word_list = f.read().split()
Then we’ll store the words as keys in a dictionary so we can use the in operator to check quickly whether a word is valid.
valid_words = {}  ### another dictionary, storing the valid words from the word list
for word in word_list:
    valid_words[word] = 1
Now, to identify words that appear in the book but not in the word list, we’ll use subtract, which takes two dictionaries as parameters and returns a new dictionary that contains all the keys from one that are not in the other.
def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = d1[key]
    return res
Here’s how we use it.
diff = subtract(word_counter, valid_words)
To get a sample of words that might be misspelled, we can print the most common words in diff.
print_most_common(diff)
The most common “misspelled” words are mostly names and a few single-letter words (Mr. Utterson is Dr. Jekyll’s friend and lawyer).
If we select words that only appear once, they are more likely to be actual misspellings.
We can do that by looping through the items and making a list of words with frequency 1.
singletons = []
for word, freq in diff.items():
    if freq == 1:
        singletons.append(word)
Here are the last few elements of the list.
singletons[-5:]
[]
Most of them are valid words that are not in the word list.
But 'reindue' appears to be a misspelling of 'reinduce', so at least we found one legitimate error.
### EXERCISE: Dictionary Subtraction
d1 = {'apple': 3, 'banana': 1, 'cherry': 4, 'date': 2}
d2 = {'banana': 1, 'cherry': 4}
# Use subtract(d1, d2) to find keys in d1 that are not in d2.
# Print the resulting dictionary.
### Your code starts here:
### Your code ends here.
{'apple': 3, 'date': 2}
8.3.6. Random numbers#
As a step toward Markov text generation, next we’ll choose a random sequence of words from word_counter.
But first let’s talk about randomness.
Given the same inputs, most computer programs are deterministic, which means they generate the same outputs every time. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are one example, but there are more.
Making a program truly nondeterministic turns out to be difficult, but there are ways to fake it. One is to use algorithms that generate pseudorandom numbers. Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.
The random module provides functions that generate pseudorandom numbers – which I will simply call “random” from here on.
We can import it like this.
import random
# this cell initializes the random number generator so it
# generates the same sequence each time the notebook runs.
random.seed(4)
The random module provides a function called choice that chooses an element from a list at random, with every element having the same probability of being chosen.
t = [1, 2, 3]
random.choice(t)
1
If you call the function again, you might get the same element again, or a different one.
for i in range(10):
    print(random.choice(t), end=' ')
2 1 3 2 2 1 1 1 1 2
In the long run, we expect to get every element about the same number of times.
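We can check this empirically by tallying many calls to choice. A small sketch (the sample size and seed are arbitrary):

```python
import random

random.seed(1)  # make the tally reproducible
tally = {}
for i in range(3000):
    x = random.choice([1, 2, 3])
    tally[x] = tally.get(x, 0) + 1

print(tally)  # each element is chosen roughly 1000 times
```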
If you use choice with a dictionary, you get a KeyError.
%%expect KeyError
random.choice(word_counter)
KeyError: 4
To choose a random key, you have to put the keys in a list and then call choice.
words = list(word_counter)
random.choice(words)
'sat'
If we generate a random sequence of words, it doesn’t make much sense.
for i in range(6):
    word = random.choice(words)
    print(word, end=' ')
the cat mat mat sat sat
Part of the problem is that we are not taking into account that some words are more common than others. The results will be better if we choose words with different “weights”, so that some are chosen more often than others.
If we use the values from word_counter as weights, each word is chosen with a probability that depends on its frequency.
weights = word_counter.values()
The random module provides another function called choices that takes weights as an optional argument.
random.choices(words, weights=weights)
['on']
And it takes another optional argument, k, that specifies the number of words to select.
random_words = random.choices(words, weights=weights, k=6)
random_words
['mat', 'the', 'and', 'the', 'sat', 'the']
The result is a list of strings that we can join into something that looks more like a sentence.
' '.join(random_words)
'mat the and the sat the'
### EXERCISE: Weighted Random Selection
import random
fruits = ['apple', 'banana', 'cherry']
weights = [10, 3, 1] # apple is most common
# 1. Use random.choices() to select 6 fruits with the given weights
# 2. Print the result
### Your code starts here:
### Your code ends here.
['apple', 'apple', 'apple', 'banana', 'cherry', 'banana']
If you choose words from the book at random, you get a sense of the vocabulary, but a series of random words seldom makes sense because there is no relationship between successive words. For example, in a real sentence you expect an article like “the” to be followed by an adjective or a noun, and probably not a verb or adverb. So the next step is to look at these relationships between words.
8.3.7. Bigrams#
Instead of looking at one word at a time, now we’ll look at sequences of two words, which are called bigrams. A sequence of three words is called a trigram, and a sequence with some unspecified number of words is called an n-gram.
Let’s write a program that finds all of the bigrams in the book and the number of times each one appears. To store the results, we’ll use a dictionary where
The keys are tuples of strings that represent bigrams, and
The values are integers that represent frequencies.
Let’s call it bigram_counter.
bigram_counter = {}
The following function takes a list of two strings as a parameter.
First it makes a tuple of the two strings, which can be used as a key in a dictionary.
Then it adds the key to bigram_counter, if it doesn’t exist, or increments the frequency if it does.
def count_bigram(bigram):
    key = tuple(bigram)
    if key not in bigram_counter:
        bigram_counter[key] = 1
    else:
        bigram_counter[key] += 1
As we go through the book, we have to keep track of each pair of consecutive words. So if we see the sequence “man is not truly one”, we would add the bigrams “man is”, “is not”, “not truly”, and so on.
To keep track of these bigrams, we’ll use a list called window, because it is like a window that slides over the pages of the book, showing only two words at a time.
Initially, window is empty.
window = []
We’ll use the following function to process the words one at a time.
def process_word(word):
    window.append(word)
    if len(window) == 2:
        count_bigram(window)
        window.pop(0)
The first time this function is called, it appends the given word to window.
Since there is only one word in the window, we don’t have a bigram yet, so the function ends.
The second time it’s called – and every time thereafter – it appends a second word to window.
Since there are two words in the window, it calls count_bigram to keep track of how many times each bigram appears.
Then it uses pop to remove the first word from the window.
The following program loops through the words in the book and processes them one at a time.
for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        process_word(word)
The result is a dictionary that maps from each bigram to the number of times it appears.
We can use print_most_common to see the most common bigrams.
print_most_common(bigram_counter)
178 ('of', 'the')
139 ('in', 'the')
94 ('it', 'was')
80 ('and', 'the')
73 ('to', 'the')
Looking at these results, we can get a sense of which pairs of words are most likely to appear together. We can also use the results to generate random text, like this.
random.seed(42) ### just to make the random choices reproducible
bigrams = list(bigram_counter) ### list of bigrams (tuples of two words)
weights = bigram_counter.values()
random_bigrams = random.choices(bigrams, weights=weights, k=10)
bigrams is a list of the bigrams that appear in the book.
weights is a list of their frequencies, so random_bigrams is a sample where the probability a bigram is selected is proportional to its frequency.
Here are the results.
for pair in random_bigrams:
print(' '.join(pair), end=' ')
seriously amiss a man as the but i the cup but here stain of that the silence after by a
This way of generating text is better than choosing random words, but still doesn’t make a lot of sense.
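As an aside, when all the words are already in one list, the sliding window can be expressed with zip, which pairs each word with its neighbor. A sketch on a toy sentence:

```python
from collections import Counter

words = 'man is not truly one but truly two'.split()

# zip(words, words[1:]) yields each pair of consecutive words
pairs = list(zip(words, words[1:]))
bigram_counts = Counter(pairs)

print(bigram_counts[('man', 'is')])  # 1
```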
### EXERCISE: Counting Bigrams
sentence = "the cat sat on the mat the cat"
# 1. Split the sentence into words
# 2. Build a bigram_counts dictionary mapping (word1, word2) tuples to their frequency
# 3. Print each bigram and its count, sorted by frequency (highest first)
### Your code starts here:
### Your code ends here.
2 ('the', 'cat')
1 ('cat', 'sat')
1 ('sat', 'on')
1 ('on', 'the')
1 ('the', 'mat')
1 ('mat', 'the')
8.3.8. Markov analysis#
We can do better with Markov chain text analysis, which computes, for each word in a text, the list of words that come next. As an example, we’ll analyze these lyrics from the Monty Python song Eric, the Half a Bee:
song = """
Half a bee, philosophically,
Must, ipso facto, half not be.
But half the bee has got to be
Vis a vis, its entity. D'you see?
"""
To store the results, we’ll use a dictionary that maps from each word to the list of words that follow it.
successor_map = {}
As an example, let’s start with the first two words of the song.
first = 'half'
second = 'a'
If the first word is not in successor_map, we have to add a new item that maps from the first word to a list containing the second word.
successor_map[first] = [second]
successor_map
{'half': ['a']}
If the first word is already in the dictionary, we can look it up to get the list of successors we’ve seen so far, and append the new one.
first = 'half'
second = 'not'
successor_map[first].append(second)
successor_map
{'half': ['a', 'not']}
The following function encapsulates these steps.
def add_bigram(bigram):
    first, second = bigram
    if first not in successor_map:
        successor_map[first] = [second]
    else:
        successor_map[first].append(second)
If the same bigram appears more than once, the second word is added to the list more than once.
In this way, successor_map keeps track of how many times each successor appears.
As we did in the previous section, we’ll use a list called window to store pairs of consecutive words.
And we’ll use the following function to process the words one at a time.
def process_word_bigram(word):
    window.append(word)
    if len(window) == 2:
        add_bigram(window)
        window.pop(0)
Here’s how we use it to process the words in the song.
successor_map = {}
window = []

for word in song.split():
    word = clean_word(word)
    process_word_bigram(word)
And here are the results.
successor_map
{'half': ['a', 'not', 'the'],
'a': ['bee', 'vis'],
'bee': ['philosophically', 'has'],
'philosophically': ['must'],
'must': ['ipso'],
'ipso': ['facto'],
'facto': ['half'],
'not': ['be'],
'be': ['but', 'vis'],
'but': ['half'],
'the': ['bee'],
'has': ['got'],
'got': ['to'],
'to': ['be'],
'vis': ['a', 'its'],
'its': ['entity'],
'entity': ["d'you"],
"d'you": ['see']}
The word 'half' can be followed by 'a', 'not', or 'the'.
The word 'a' can be followed by 'bee' or 'vis'.
Most of the other words appear only once, so they are followed by only a single word.
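The same window logic can be condensed with zip and dict.setdefault. This sketch rebuilds a successor map for a cleaned list of the song's words (the variable names here are illustrative):

```python
lyrics = 'half a bee philosophically must ipso facto half not be'.split()

succ = {}
for first, second in zip(lyrics, lyrics[1:]):
    # setdefault returns the existing list, or installs an empty one
    succ.setdefault(first, []).append(second)

print(succ['half'])  # ['a', 'not']
```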
Now let’s analyze the book.
successor_map = {}
window = []

for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        process_word_bigram(word)

# from collections import defaultdict
# model = defaultdict(list)
# for i in range(len(words) - 1):
#     current_word = words[i]
#     next_word = words[i + 1]
#     model[current_word].append(next_word)
We can look up any word and find the words that can follow it.
# I used this cell to find a predecessor with a good number of
# possible successors and at least one repeated word.

def has_duplicates(t):
    # a set contains only unique elements, so if the set is smaller
    # than the original list, the list must contain duplicates
    return len(set(t)) < len(t)

for key, value in successor_map.items():
    if len(value) == 7 and has_duplicates(value):
        print(key, value)
story ['of', 'of', 'indeed', 'but', 'for', 'of', 'that']
incident ['of', 'of', 'at', 'of', 'of', 'at', 'this']
lanyon’s ['narrative', 'there', 'face', 'manner', 'narrative', 'the', 'condemnation']
common ['it', 'interest', 'friends', 'friends', 'observers', 'but', 'quarry']
relief ['the', 'to', 'when', 'that', 'that', 'of', 'it']
appearance ['of', 'well', 'something', 'he', 'amply', 'of', 'which']
going ['east', 'in', 'to', 'to', 'up', 'to', 'of']
till ['at', 'the', 'i', 'yesterday', 'the', 'that', 'weariness']
walk ['and', 'into', 'with', 'all', 'steadfastly', 'attired', 'with']
sounds ['nothing', 'carried', 'out', 'the', 'of', 'of', 'with']
really ['like', 'damnable', 'can', 'a', 'a', 'not', 'be']
does ['not', 'not', 'indeed', 'not', 'the', 'not', 'not']
reply ['i', 'but', 'whose', 'i', 'some', 'that’s', 'i']
continued ['mr', 'the', 'the', 'the', 'the', 'poole', 'utterson']
seems ['scarcely', 'hardly', 'to', 'she', 'much', 'he', 'to']
walked ['on', 'over', 'some', 'was', 'on', 'with', 'fast']
that’s ['a', 'it', 'talking', 'very', 'not', 'such', 'not']
although ['i', 'a', 'it', 'the', 'we', 'they', 'i']
until ['the', 'the', 'they', 'they', 'the', 'to-morrow', 'i']
disappearance ['or', 'the', 'of', 'of', 'here', 'and', 'but']
step ['into', 'or', 'back', 'natural', 'into', 'of', 'leaping']
wish ['the', 'you', 'to', 'you', 'to', 'i', 'to']
aware ['of', 'of', 'of', 'that', 'jekyll’s', 'that', 'of']
thank ['you', 'you', 'you', 'you', 'you', 'god', 'you']
maid ['servant', 'described', 'fainted', 'had', 'calls', 'had', 'lifted']
besides ['were', 'was', 'for', 'a', 'with', 'with', 'which']
observed ['the', 'utterson', 'with', 'the', 'that', 'that', 'that']
among ['other', 'the', 'the', 'the', 'my', 'my', 'temptations']
successor_map['going']
['east', 'in', 'to', 'to', 'up', 'to', 'of']
In this list of successors, notice that the word 'to' appears three times – the other successors only appear once.
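Because 'to' appears three times out of seven, random.choice would pick it with probability 3/7. We can make these implicit weights explicit with Counter. A quick sketch:

```python
from collections import Counter

successors = ['east', 'in', 'to', 'to', 'up', 'to', 'of']
counts = Counter(successors)

# probability of each distinct successor
probs = {word: n / len(successors) for word, n in counts.items()}
print(probs['to'])  # 3/7, about 0.43
```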
### EXERCISE: Building a Successor Map
text = "alice was nice alice was clever alice was kind"
# 1. Split the text into words
# 2. Build a successor_map where each word maps to the list of words that follow it
# 3. Print the map and verify that 'alice' maps to ['was', 'was', 'was']
### Your code starts here:
### Your code ends here.
{'alice': ['was', 'was', 'was'], 'was': ['nice', 'clever', 'kind'], 'nice': ['alice'], 'clever': ['alice']}
8.3.9. Generating text#
We can use the results from the previous section to generate new text with the same relationships between consecutive words as in the original. Here’s how it works:
Starting with any word that appears in the text, we look up its possible successors and choose one at random.
Then, using the chosen word, we look up its possible successors, and choose one at random.
We can repeat this process to generate as many words as we want.
As an example, let’s start with the word 'although'.
Here are the words that can follow it.
word = 'although'
successors = successor_map[word]
successors
['i', 'a', 'it', 'the', 'we', 'they', 'i']
# this cell initializes the random number generator so it
# starts at the same point in the sequence each time this
# notebook runs.
random.seed(2)
We can use choice to choose from the list with equal probability.
word = random.choice(successors)
word
'i'
If the same word appears more than once in the list, it is more likely to be selected.
Repeating these steps, we can use the following loop to generate a longer series.
for i in range(10):
    successors = successor_map[word]
    word = random.choice(successors)
    print(word, end=' ')
continue to hesitate and swallowed the smile withered from that
The result sounds more like a real sentence, but it still doesn’t make much sense.
We can do better using more than one word as a key in successor_map.
For example, we can make a dictionary that maps from each bigram – or trigram – to the list of words that come next.
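Here is a minimal sketch of that idea on the song lyrics: a map from each bigram to the words that follow it, then generation by sliding the pair forward. The variable names and starting pair are illustrative, not the book's implementation.

```python
import random

random.seed(1)
words = 'half a bee philosophically must ipso facto half not be'.split()

# map each bigram (pair of consecutive words) to the list of third words
bigram_successors = {}
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    bigram_successors.setdefault((w1, w2), []).append(w3)

# generate: start from a bigram, choose a successor, shift the window forward
pair = ('half', 'a')
output = list(pair)
for i in range(3):
    nxt = random.choice(bigram_successors[pair])
    output.append(nxt)
    pair = (pair[1], nxt)

print(' '.join(output))  # 'half a bee philosophically must'
```

In this tiny example every bigram has exactly one successor, so the output is deterministic; with a whole book, most bigrams have several successors and the generated text varies.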
### EXERCISE: Generate Text from Successor Map
import random
random.seed(7)
# Using the successor_map built from Dr. Jekyll and Mr. Hyde:
# 1. Start with the word 'jekyll'
# 2. Generate a sequence of 8 more words by repeatedly looking up successors
# and choosing one at random with random.choice()
# 3. Print the full generated sequence as a sentence
### Your code starts here:
### Your code ends here.
jekyll god cried the situation tell you say is