Markov Chains for Recognition

Introduction

In this assignment you’ll practice text processing, writing a Python class that works with file streams, and building a simple statistical model (a first-order Markov chain) that predicts the likely source of a text.

Problem Description

You’re interested in natural language processing, and in particular in the problem of identifying the source of a text: given some text, which of several known sources (for example, particular authors or languages) most likely produced it?

Solution Description

Write a module named source.py with a class named SourceModel whose constructor takes a name for the source and a corpus object of type TextIOWrapper (such as a file object; see the io module), and builds a first-order Markov model of the transitions between letters in the source. Only alphabetic characters in the source corpus should be considered, and they should be normalized to a single case (all upper or all lower). For simplicity (see the Background section), only consider the 26 letters of the English alphabet; for languages whose alphabets have diacritics, simply drop them, e.g., use e for é, u for ü, etc.
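As a rough illustration of the filtering and normalization described above, here is one possible helper. The name normalized_letters and the use of unicodedata are just one approach, not part of the required interface:

import io
import unicodedata


def normalized_letters(text_stream):
    """Yield the letters of a text stream one at a time, lowercased,
    with non-alphabetic characters skipped and diacritics dropped
    (e.g., 'é' becomes 'e')."""
    for line in text_stream:
        for char in line:
            # NFD decomposition separates 'é' into 'e' plus a combining accent,
            # so the accent can be discarded and the base letter kept.
            for piece in unicodedata.normalize("NFD", char):
                if piece.isascii() and piece.isalpha():
                    yield piece.lower()


print(list(normalized_letters(io.StringIO("Café B!"))))   # ['c', 'a', 'f', 'e', 'b']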

Here are some example corpus files and test files:

You can assume corpus files are of the form <source-name>.corpus.

Background

In machine learning we train a model on some data and then use that model to make predictions about unseen instances of the same kind of data. For example, we can train a machine learning model on a data set consisting of several labeled images, some of which depict a dog and some of which don’t. We can then use the trained model to predict whether some unseen image (an image not in the training set) contains a dog. The better the model, the better its accuracy (percentage of correct predictions) on unseen data.

We can create a model of a language and use that model to estimate the likelihood that some unseen text was generated by that model, in other words, how likely it is that the unseen text is an example of the language the model describes. The model could represent a particular author, or a language such as French or English. One simple kind of model is well suited to this problem: the Markov model.

Markov models are useful for time-series or other sequential data. A Markov model is a finite-state model in which the current state is dependent only on a bounded history of previous states. In a first-order Markov model the current state is dependent on only one previous state.

One can construct a simple first-order Markov model of a language as the transition probabilities between letters in the language’s alphabet. For example, given this training corpus of a language:

BIG A, little a, what begins with A?
Aunt Annie’s alligator.
A...a...A

BIG B, little b, what begins with B?
Barber, baby, bubbles and a bumblebee.

We would have the model:

[Figure: Dr. Seuss Language Model, a state diagram of the letter-to-letter transitions]

We’ve only shown the letter-to-letter transitions that occur in the training corpus. Notice:

This model indicates that whenever the letter a occurs in our training corpus, the next letter is a, b, l, n, r, t, u, or w. The arrow from a to b is labeled .19 because b appears after a 3 out of 16 times, or approximately 19 percent of the time. A first-order Markov chain is a kind of bigram model. Here are all the bigrams in the training text that begin with a, that is, all the state transitions from a:

(a, l), (a, w), (a, t), (a, a), (a, u), (a, n), (a, l), (a, t),
(a, a), (a, a), (a, b), (a, t), (a, r), (a, b), (a, n), (a, b)

A Markov chain represents all the bigrams and their probabilities of occurrence in the training corpus.
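To make the bigram idea concrete, here is a throwaway sketch (not part of the required SourceModel interface) that extracts the bigrams from the corpus above and tallies the transitions out of a:

from collections import Counter

corpus = ("BIG A, little a, what begins with A? "
          "Aunt Annie's alligator. A...a...A "
          "BIG B, little b, what begins with B? "
          "Barber, baby, bubbles and a bumblebee.")

# Keep only letters, lowercased; bigrams span spaces, punctuation, and line breaks.
letters = [c.lower() for c in corpus if c.isalpha()]
bigrams = list(zip(letters, letters[1:]))

a_counts = Counter(nxt for prev, nxt in bigrams if prev == "a")
# a_counts tallies the 16 transitions out of 'a' listed above:
#   a: 3, b: 3, t: 3, l: 2, n: 2, r: 1, u: 1, w: 1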

Representation as a Matrix

A Markov chain can be represented as a transition matrix in which the probability of state j following state i is found in element (i, j) of the matrix. In the example below we have labeled the rows and columns with letters for readability. The probability of seeing the letter n after the letter a in the training corpus is found by going to row a and scanning across to column n, where we find the probability .12.

     a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    u    v    w    x    y    z
a 0.19 0.19 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.12 0.01 0.12 0.01 0.01 0.01 0.06 0.01 0.19 0.06 0.01 0.06 0.01 0.01 0.01
b 0.12 0.12 0.01 0.01 0.24 0.01 0.01 0.01 0.12 0.01 0.01 0.18 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.12 0.01 0.06 0.01 0.06 0.01
c 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
d 1.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
e 0.11 0.22 0.01 0.01 0.11 0.01 0.22 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.11 0.22 0.01 0.01 0.01 0.01 0.01 0.01 0.01
f 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
g 0.40 0.20 0.01 0.01 0.01 0.01 0.01 0.01 0.40 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
h 0.75 0.25 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
i 0.01 0.01 0.01 0.01 0.10 0.01 0.30 0.01 0.01 0.01 0.01 0.01 0.01 0.20 0.01 0.01 0.01 0.01 0.01 0.40 0.01 0.01 0.01 0.01 0.01 0.01
j 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
k 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
l 0.01 0.01 0.01 0.01 0.50 0.01 0.01 0.01 0.38 0.01 0.01 0.12 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
m 0.01 1.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
n 0.01 0.01 0.01 0.17 0.01 0.01 0.01 0.01 0.17 0.01 0.01 0.01 0.01 0.17 0.01 0.01 0.01 0.01 0.33 0.17 0.01 0.01 0.01 0.01 0.01 0.01
o 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 1.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
p 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
q 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
r 0.33 0.67 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
s 0.50 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.50 0.01 0.01 0.01
t 0.10 0.20 0.01 0.01 0.01 0.01 0.01 0.20 0.01 0.01 0.01 0.20 0.01 0.01 0.10 0.01 0.01 0.01 0.01 0.20 0.01 0.01 0.01 0.01 0.01 0.01
u 0.01 0.33 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.33 0.33 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
v 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
w 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.50 0.50 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
x 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
y 0.01 1.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
z 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
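Continuing the bigram sketch above, here is one way such a matrix could be built and queried. The function name build_matrix, the list-of-lists representation, and the 0.01 floor for unseen bigrams are illustrative choices consistent with the table, not requirements:

def build_matrix(letters, floor=0.01):
    """Return a 26x26 transition-probability matrix from a letter sequence.
    Unseen transitions get the small floor probability instead of zero."""
    counts = [[0] * 26 for _ in range(26)]
    for prev, curr in zip(letters, letters[1:]):
        counts[ord(prev) - ord("a")][ord(curr) - ord("a")] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Each row approximates a probability distribution over the next letter.
        matrix.append([n / total if n > 0 else floor for n in row]
                      if total > 0 else [floor] * 26)
    return matrix


matrix = build_matrix(letters)   # `letters` from the bigram sketch above
# Probability of 'n' after 'a': row a, column n.
print(round(matrix[ord("a") - ord("a")][ord("n") - ord("a")], 2))   # 0.12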

Prediction Using a Markov Chain Model

Given a Markov chain model of a source, we can compute the probability that the model would produce a given string of letters by applying the chain rule. Simply stated, we walk the transitions in the Markov chain and multiply the transition probabilities. For example, to compute the probability that “Big C, Little C” would be produced by our model, we would get the following probabilities from the transition matrix and compute their product:

p(b, i) = .12
p(i, g) = .30
p(g, c) = .01
p(c, l) = .01
p(l, i) = .38
p(i, t) = .40
p(t, t) = .20
p(t, l) = .20
p(l, e) = .50
p(e, c) = .01

Multiplying them gives approximately 1.06e-10 (the exact figure, 1.0588235294117648e-10, is what you get when you multiply the unrounded probabilities rather than the two-decimal values displayed above). Notice that, in order to avoid getting zero-probability predictions using our simplified technique, we store .01 in our transition matrix for any bigram we don’t see in the training corpus.

Note: for longer texts that require long chains of multiplication, it is possible to get floating-point underflow, since each probability is less than 1. There are ways to avoid this (the usual one is to add log-probabilities instead of multiplying raw probabilities), but numerical computation is beyond the scope of this course. For this exercise, stick to shorter query sentences.
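For reference, the prediction step might look roughly like the sketch below, reusing the illustrative matrix layout from the earlier sketch. The log-probability variant is included only to show what the usual underflow remedy looks like; it is not required for this assignment:

import math


def chain_probability(matrix, text):
    """Product of transition probabilities along the letter bigrams of text."""
    letters = [c.lower() for c in text if c.isalpha()]
    prob = 1.0
    for prev, curr in zip(letters, letters[1:]):
        prob *= matrix[ord(prev) - ord("a")][ord(curr) - ord("a")]
    return prob


def chain_log_probability(matrix, text):
    """Same computation in log space: summing logs cannot underflow the way
    a long product of values less than 1 can."""
    letters = [c.lower() for c in text if c.isalpha()]
    return sum(math.log(matrix[ord(prev) - ord("a")][ord(curr) - ord("a")])
               for prev, curr in zip(letters, letters[1:]))


# chain_probability(matrix, "Big C, Little C") walks exactly the ten
# transitions listed in the worked example above.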

Additional Information

We’ve greatly simplified the presentation here to focus on the programming. For more information, consult a reference on Markov chains and n-gram language models.

Requirements

Here’s a skeleton source.py to get you started:

class SourceModel:

    def __init__(self, name, text_stream):
        # Recommended algorithm for building your transition matrix:
        # Initialize a 26x26 matrix (e.g., a 26-element list of 26-element lists).
        # Print "Training {name} model ... "
        # Read the text_stream one character at a time.
        # For each character, increment the corresponding (row, col) entry in your
        # matrix. The row is for the previous character, the col is for the
        # current character. (You could also think of this in terms of bigrams.)
        # After you read the entire text_stream, you'll have a matrix of counts.
        # From the matrix of counts, create a matrix of probabilities --
        # each row of the transition matrix is a probability distribution.
        # Print "done."
        pass

    def __repr__(self):
        pass

    def probability(self, text_stream):
        pass


if __name__ == "__main__":
    # The first n-1 arguments to the script are corpus files used to train models.
    # Corpus files are of the form <source-name>.corpus.

    # The last argument to the script is either a file to analyze or a sentence.
    # Sentences should be in quotes. (The same goes for file names containing
    # special characters or spaces, but decent people don't put special characters
    # or spaces in file names.)

    # Create a SourceModel object for each corpus.

    # Use the models to compute the probability that the test text was produced
    # by each model.

    # Probabilities will be very small. Normalize the probabilities of all the
    # model predictions to a probability distribution (so they sum to 1) --
    # a closed-world assumption: we only state probabilities relative to the
    # models we have.

    # Print the results of the analysis, sorted from most likely to least likely.
    pass
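The normalization and sorting at the end of the main block might look roughly like the sketch below. It assumes models is the list of trained SourceModel objects, that each model exposes its name as model.name, and that a query string is wrapped in io.StringIO before being passed to probability; all of these are illustrative choices, not requirements:

import io

text = "Ou va le monde"                     # the query string from the command line
raw = {m.name: m.probability(io.StringIO(text)) for m in models}
total = sum(raw.values())
# Closed-world normalization: probabilities are stated relative to the models we have.
for name, p in sorted(raw.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Relative probability that test string is {name:<8}: {p / total:.9f}.")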

Running Your Script

Here’s an analysis of a line from a famous rap song:

$ python source.py *.corpus "If you got a gun up in your waist please don't shoot up the place (why?)"
Training english model ... done.
Training french model ... done.
Training hiphop model ... done.
Training lisp model ... done.
Training spanish model ... done.
Analyzing string If you got a gun up in your waist please don't shoot up the place (why?).
Relative probability that test string is hiphop  : 0.998103661.
Relative probability that test string is english : 0.001896339.
Relative probability that test string is french  : 0.000000000.
Relative probability that test string is spanish : 0.000000000.
Relative probability that test string is lisp    : 0.000000000.
$ python source.py *.corpus "Ou va le monde"
Training english model ... done.
Training french model ... done.
Training hiphop model ... done.
Training lisp model ... done.
Training spanish model ... done.
Analyzing string Ou va le monde.
Relative probability that test string is french  : 0.904369735.
Relative probability that test string is lisp    : 0.064067191.
Relative probability that test string is english : 0.014523700.
Relative probability that test string is hiphop  : 0.009526881.
Relative probability that test string is spanish : 0.007512493.