Markov Chain Inauguration Address Generator

To play around with text synthesis, I used the Natural Language Toolkit (nltk) library to generate an Inauguration Address. It works just about as well as you might imagine: it's terrible. But it was pretty fun seeing what it came up with.

The general method used here is something along the lines of a Markov chain. From each state, we transition to the next with a probability learned from the training corpus. The random path we take is used to form a sentence.
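
Concretely, the learned transition probability is just a normalised count from the corpus. The notebook never builds an explicit probability table (frequency ends up encoded by repetition in a list, as we'll see), but a quick sketch with made-up counts shows the idea:

In [ ]:
# Hypothetical counts: suppose 'fellow' was followed by 'citizens' 30 times
# and by 'Americans' 10 times across the addresses.
counts = {'citizens': 30, 'Americans': 10}
total = float(sum(counts.values()))
probabilities = {word: n / total for word, n in counts.items()}
print(probabilities)  # -> {'citizens': 0.75, 'Americans': 0.25}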

The project is broken into two experiments. In the first, the states are represented as words; in the second, the states are represented by each word's part-of-speech tag (e.g. noun, verb, etc.).

Part One - Words as states

This way of representing the states is the most common in text synthesis. It's also pretty simple. The idea is that we build an adjacency structure for each word (or group of words) so we know what can follow it.

In [1]:
import numpy as np
from textblob import TextBlob
from nltk.corpus import inaugural
from collections import defaultdict
from IPython.core.display import HTML

We'll use start/stop tags to simplify the code. By sticking these onto the start and end of each sentence, we don't have to worry so much about edge cases.

In [ ]:
start_tag = 'START'
end_tag   = 'STOP'
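
Just to illustrate the buffering (this little print is my own aside, not from the original notebook), here is what an order-2 padded sentence looks like:

In [ ]:
# -> ['START', 'START', 'My', 'fellow', 'citizens', '.', 'STOP', 'STOP']
print(([start_tag] * 2) + ['My', 'fellow', 'citizens', '.'] + ([end_tag] * 2))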

The following function generates the word-adjacency structure for a given corpus and order. Order here translates to the order of our Markov chain: if the order is 1, we look at the previous word to determine the next; if the order is 2, we look at the previous 2 words, and so on.

There is also a struct parameter, to which we can pass either list or set. The importance of this choice will be explained later.

In [ ]:
def generate_adjacency_map(corpus, order, struct=list):
    adjacency_map = defaultdict(struct)    
    for sentence in corpus.sents():        
        # generate a list where the start and end are buffered by start and end tags.
        words = ([start_tag] * order) + sentence + ([end_tag] * order)
        
        # slide a window over the list, adding each new word to the
        # adjacency_map indexed by the preceding `order` words
        for i in range(0, len(sentence) + 1):
            preceding_words = tuple([words[j] for j in range(i, i + order)])
            word = words[i + order]
            if struct is list:
                adjacency_map[preceding_words].append(word)   
            elif struct is set:
                adjacency_map[preceding_words].add(word)  
                
    return adjacency_map         
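
As a quick sanity check (the toy corpus class below is purely illustrative; any object with a .sents() method will do), we can run the function on a couple of hand-made sentences and inspect the result:

In [ ]:
class ToyCorpus(object):
    def sents(self):
        return [['we', 'the', 'people'], ['we', 'the', 'government']]

# With order=1 the keys are 1-tuples of the preceding word, e.g.
# ('START',) -> ['we', 'we'] and ('the',) -> ['people', 'government'].
# With order=2 the keys would be 2-tuples instead.
print(dict(generate_adjacency_map(ToyCorpus(), order=1, struct=list)))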
In [ ]:
word_adjacency_list = generate_adjacency_map(corpus = inaugural, order = 2, struct=list)
word_adjacency_set = generate_adjacency_map(corpus = inaugural, order = 2, struct=set)

Of course, both structures contain the same unique elements, but in a set every unique word has equal probability of being picked at random. In a list, the more frequent the word, the more often it is expected to be picked.

In [ ]:
print 'set : mean word adjacencies -', np.mean(map(len, word_adjacency_set.values()))
print 'list: mean word adjacencies -', np.mean(map(len, word_adjacency_list.values()))
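
To see the sampling effect directly, here is a small illustration with a made-up adjacency list; collapsing the list into a set flattens the frequencies:

In [ ]:
observed = ['the', 'the', 'the', 'people']    # made-up adjacency list
print(np.random.choice(observed))             # 'the' is drawn ~75% of the time
print(np.random.choice(list(set(observed))))  # 'the' is drawn ~50% of the time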

The following function takes the adjacency structure and builds a random sentence out of it.

In [ ]:
def build_sentence(adjacency_map, order, max_len=50):
    sentence = []
    state = tuple([start_tag] * order)
    while len(sentence) < max_len:
        # pick a random successor of the current state
        next_word = np.random.choice(list(adjacency_map[state]))
        if next_word == end_tag:
            break
        sentence.append(next_word)
        # slide the state window forward by one word
        state = state[1:] + (next_word,)
    return ' '.join(sentence)
In [ ]:
sentence = build_sentence(word_adjacency_list, order=2)
sentence

Part Two - Word tags as states

At this point, I had an idea: I could use the TextBlob library to tag the words and use the tags as the Markov states instead.

For each part-of-speech tag, I saved the associated bag of words, and built an adjacency structure similarly to what I did for words.
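
To make the tagging step concrete: TextBlob's .tags property returns (word, part-of-speech tag) pairs, so for a short phrase the output looks roughly like this (exact tags depend on the tagger TextBlob is configured with):

In [ ]:
print(TextBlob('We the people of the United States').tags)
# roughly: [('We', 'PRP'), ('the', 'DT'), ('people', 'NNS'), ('of', 'IN'),
#           ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS')]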

In [ ]:
def generate_tag_adjacencies(corpus, order, struct=list):
        
    adjacency_map = defaultdict(struct)
    tag_word_bags = defaultdict(struct, {end_tag : ['.']})

    for sentence in corpus.sents(): 
        tagged_sent = TextBlob(' '.join(sentence)).tags    
        
        # add each tagged word to the relevant tag's word bag
        for word, tag in tagged_sent:
            if struct is list:                
                tag_word_bags[tag].append(word)
            elif struct is set:
                tag_word_bags[tag].add(word)

        # build the tag adjacency map, similarly to the word-based one
        tags = ([start_tag] * order) + map(lambda tup: tup[1], tagged_sent) + ([end_tag] * order)

        for i in range(0, len(tagged_sent) + 1):
            index = tuple([tags[j] for j in range(i, i + order)])
            tag = tags[i + order]
            if struct is list:                
                adjacency_map[index].append(tag)
            elif struct is set:
                adjacency_map[index].add(tag)
            
    return adjacency_map, tag_word_bags
In [ ]:
tag_adjacency_list, tag_bags_list = generate_tag_adjacencies(corpus = inaugural, order = 2, struct=list)
tag_adjacency_set, tag_bags_set = generate_tag_adjacencies(corpus = inaugural, order = 2, struct=set)

In this case, the choice between list and set makes a significant difference.

In [ ]:
print 'list: mean tag adjacencies -', np.mean(map(len, tag_adjacency_list.values())), ', mean word bag size -', np.mean(map(len, tag_bags_list.values()))
print 'set : mean tag adjacencies -', np.mean(map(len, tag_adjacency_set.values())), ', mean word bag size -', np.mean(map(len, tag_bags_set.values()))

This means that if we choose to use a set as the data structure, our resulting sentences will feel much more random.

Next, we build a sentence using the above structures, similarly to before. But now the word we add to the sentence and the state we transition into are separate concepts: the chain walks over tags, and each tag emits a word drawn from its bag.

In [ ]:
def build_tag_sentence(adjacency_map, tag_word_bags, order, max_len=50):
    sentence = []
    next_tag = None
    state = tuple([start_tag] * order)
    while next_tag != end_tag and len(sentence) < max_len:
        # walk the chain over tags, then emit a random word from that tag's bag
        next_tag = np.random.choice(list(adjacency_map[state]))
        next_word = np.random.choice(list(tag_word_bags[next_tag]))
        # slide the tag-state window forward; the STOP tag's bag holds only '.'
        state = state[1:] + (next_tag,)
        sentence.append(next_word)
    return ' '.join(sentence)
In [ ]:
sentence = build_tag_sentence(tag_adjacency_list, tag_bags_list, order = 2)
sentence

Results

Finally, we put together a bunch of sentences to make our Inaugural Address.

In [ ]:
def output(sents):
    return HTML('<p>' + ('</p><p>'.join(sents)) + '</p>')
In [ ]:
output([build_sentence(word_adjacency_list, order=2) for i in range(10)])
In [ ]:
output([build_tag_sentence(tag_adjacency_list, tag_bags_list, order = 2) for i in range(10)])

Conclusions

As you might have expected, it's just a nonsense generator. Increasing the order can reduce the nonsense rating, but on a corpus as small as this, it ends up regurgitating entire sentences.

The tag-based method seems to be even worse than the word-based method, but maybe you could combine the two somehow to improve the result.

Download IPython Notebook