Assignment 10
Due date: Thursday, March 31st at midnight.
Be sure to use version control ("git") as you develop your code. Run "git add ..." and "git commit" repeatedly as you add and edit your code. You will hand in the output of "git log" for your assignment repository as part of the assignment.
A fun application of Markov Chain Monte Carlo (MCMC) methods is decoding messages that have been encoded with a simple substitution cipher, meaning that each encoded character maps directly to a single decoded character. If f is our decoding dictionary, then f["a"] = "b", for example.
The goal of this assignment is to implement an MCMC that will determine f, our decoding dictionary, which will be used to decode the text:
epd:!-pbdxod!ydyepdp kmzkxudr!,pldrkmycdcp!mudjyde!bdy!hp dejzdykdgp!m dae!ydhj bdkrduzjgpda!udejbbp d.p p!yedyepdb!mhdzkxuy!,epldkd,mxpgnd ppbgpuudzjux bpmuy! bj :fdkduyx..km ndupgrsajggpbdpqjgpdrmkzdyepdgkwj :d.mp!uyfdyakd:j su,p ypbdyp!mudymj,hgpbdbka dyepdujbpudkrdejud kupld.xydjyda!ud!ggdmj:eyndpwpmcyej :da!ud!ggdmj:eyndyepduymx::gpda!udrj juepbldepde!bdak dyepdwj,ykmcdkwpmdejzupgrldepdgkwpbd.j:d.mkyepml
Note that this string has 412 characters in it. Make sure that's how many you have when you put it into your code.
How is this accomplished? Because the goal is to determine f, our decoding dictionary, and because the dictionary is discrete, not continuous, the approach we take will be slightly different from that discussed in lecture. We need a discrete analogue of the "posterior" that was developed for the continuous case in class. For this problem we make the observation that, in English text, the probability of transitioning from the letter "a" to the letter "b" is not the same as the probability of transitioning from "a" to "c", for example. By determining these probabilities we can create a "posterior" function that can be applied to a block of text. Once we have such a function we can perform our MCMC.
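To make the transition-probability idea concrete, here is a small illustrative sketch (not one of the assignment's required functions) of how such probabilities could be estimated from a block of text, assuming a fixed alphabet:

```python
def transition_matrix(text, alphabet):
    """Count letter-to-letter transitions in text, then normalize each row
    so the outgoing probabilities for every letter sum to 1."""
    index = {c: k for k, c in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[0.0] * n for _ in range(n)]
    # Walk over consecutive character pairs and tally transitions.
    for a, b in zip(text, text[1:]):
        counts[index[a]][index[b]] += 1
    # Normalize each row to a probability distribution (rows with no
    # observations are left as zeros here).
    for row in counts:
        total = sum(row)
        if total > 0:
            for k in range(n):
                row[k] /= total
    return counts
```

For the assignment itself these probabilities are already computed for you, from "War and Peace", as described in 1a below.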
1) Create a file, called mcmc_utilities.py, which contains the following functions.
1a) I have taken the liberty of downloading the text of "War and Peace", a text sufficiently long that it can be used to determine the probabilities of transitioning between any two letters in the English language. I lower-cased the text, filtered out some punctuation, numbers, and new lines, and then calculated the probabilities of transitioning from one letter to another. Probabilities that were zero were given small values, to help keep the algorithm from getting stuck, and each row was normalized so that the total transition probability out of a given letter is 1. The file containing this data can be found here. Note that there are 32 columns and rows, as some punctuation, and spaces, are included in the alphabet.
Write a function that will read the data in the aforementioned CSV file. Use the pandas package's read_csv function to read the file. It should return the matrix of transition probabilities, as a numpy array, and the alphabet, which are the column names, as a list. The column names can be accessed using df.columns, if you have named the data variable df.
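A minimal sketch of this reader, assuming the first row of the CSV holds the 32 alphabet characters as column names and the remaining rows hold the probabilities (check the actual file's layout, e.g. whether it has an index column, and adjust the read_csv arguments accordingly):

```python
import pandas as pd

def read_transition_matrix(filename):
    """Read the transition-probability CSV and return the probabilities
    as a numpy array along with the alphabet (the column names)."""
    df = pd.read_csv(filename)
    return df.to_numpy(), list(df.columns)
```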
1b) Write a function that will return the above encoded text.
1c) Write a function that will initialize a decoding dictionary, f. This dictionary should consist of 32 entries, one for each letter of the alphabet, and gives the mapping from the encoded letter to the decoded letter. The decoded letter may be represented by a string or an index, whichever you prefer. You may initialize the dictionary however you like. The function should return the dictionary.
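One possible initialization (a random permutation of the alphabet, so every decoded letter is used exactly once) could look like this; here the decoded letters are represented as strings:

```python
import random

def initialize_decoder(alphabet):
    """Return a decoding dictionary mapping each encoded letter to a
    decoded letter, initialized as a random permutation of the alphabet."""
    decoded = list(alphabet)
    random.shuffle(decoded)          # random one-to-one assignment
    return dict(zip(alphabet, decoded))
```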
1d) Write a function that will calculate and return the "posterior" for the given value of the decoding dictionary f. The function should take 3 arguments: the array of transition probabilities, the decoding dictionary f, and the encoded text. The posterior should be calculated as
$$P = \sum^{n-1}_{i=1} \log\left(M[j_i, j_{i+1}]\right)$$
where M is the array of transition probabilities, i indexes the letters of the encoded text, $j_i$ is the alphabet index of the ith letter of the decoded text, and n is the number of letters in the encoded text. Note that the log of the probabilities should be used when calculating the "posterior". The log function is part of the numpy package.
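A sketch of this calculation, assuming f maps each encoded character to an alphabet index (one of the two representations allowed in 1c); if your f maps to strings instead, you would first translate each decoded letter to its index:

```python
import numpy as np

def log_posterior(M, f, text):
    """Sum of log transition probabilities between consecutive decoded
    letters. Assumes f maps each encoded character to an alphabet index."""
    js = [f[c] for c in text]        # alphabet indices of the decoded text
    return sum(np.log(M[a, b]) for a, b in zip(js, js[1:]))
```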
1e) Write a function that takes a single argument, f, the decoding dictionary. The function will randomly permute 2 of the mappings of f and return the resulting dictionary. This means that if, for example, the incoming f has values f["a"] = "b" and f["c"] = "d", the function would return the decoding dictionary with values f["a"] = "d" and f["c"] = "b". Again, the two entries to permute must be chosen randomly.
The purpose of this function is to propose new values of our decoding dictionary f, so that we can form a Markov chain. Functions in the random package will be useful here.
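The swap described above can be sketched as follows; random.sample draws the two keys to permute. Returning a copy (rather than mutating f in place) makes the Metropolis accept/reject step in part 2 simpler:

```python
import random

def propose(f):
    """Return a copy of f with the decoded values of two randomly
    chosen keys swapped."""
    g = dict(f)                      # copy so the input is untouched
    a, b = random.sample(list(g), 2) # pick two keys at random
    g[a], g[b] = g[b], g[a]          # swap their decoded values
    return g
```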
1f) Write a function that decodes the encoded text, using the current value of f, and prints out the decoded text.
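For example, assuming f maps characters to characters (with a string representation from 1c); returning the decoded string as well is optional but convenient for testing:

```python
def decode(f, text):
    """Translate the encoded text through the decoding dictionary f,
    print the result, and return it."""
    decoded = "".join(f[c] for c in text)
    print(decoded)
    return decoded
</antml>```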
2) Create a Python script, named run_mcmc.py, which:
- reads in the transition probability array,
- reads in the encoded text,
- initializes f,
- loops 10000 times, during which the code performs the Metropolis algorithm, updating the value of f appropriately. Note that in this case we are not interested in the chain itself, just the decoding dictionary.
- every 100 iterations, prints out the decoded text, using the current value of the decoding dictionary.
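The Metropolis update at the heart of the loop can be sketched as follows. Because we work with log posteriors, the acceptance probability min(1, P_new/P_old) becomes exp(log P_new − log P_old). The names propose and log_posterior_of are placeholders for whichever functions you wrote in mcmc_utilities.py:

```python
import math
import random

def metropolis_step(f, current_lp, propose, log_posterior_of):
    """One Metropolis update for the decoding dictionary: propose a swap,
    then accept it with probability min(1, exp(new_lp - current_lp)),
    where the lp values are log posteriors."""
    g = propose(f)
    new_lp = log_posterior_of(g)
    if new_lp >= current_lp or random.random() < math.exp(new_lp - current_lp):
        return g, new_lp        # accept the proposed dictionary
    return f, current_lp        # reject: keep the current dictionary
```

The driver would then call a step like this 10000 times, printing the decoded text every 100 iterations.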
Note that the decoding of the text may not be successful every time you run the script. The algorithm can get stuck at a local maximum of the posterior and fail to continue improving, in which case the decoding will fail. Also note that the decoding may not be perfect: a few letters may be incorrect. This is okay. If most of the text looks like English then you have been successful. Run your driver script several times before concluding that it's not working.
Submit your mcmc_utilities.py, run_mcmc.py, and the output of git log from your assignment repository.
To capture the output of "git log", use redirection, git log > git.log, and hand in the "git.log" file.
Assignments will be graded on a 10 point basis.
Due date is March 31st, 2022 at midnight, with 0.5 point penalty per day for late submission until the cut-off date of April 7th, 2022 at 11:00am.