We have made a revision to the late policy for problem sets: we will
give students 5 "free" days that can be used as they wish across the 4
problem sets. Specifically, we will not penalise the first 5 late days
that a student incurs on problem sets. After that, the penalties
posted on the problem sets will apply (e.g., 5 points per day late on
the first problem set). The final (0-point) deadline will still apply;
for example, for pset 1, any solutions handed in after October 1st will
get 0 points.
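As a rough illustration of the arithmetic, here is a minimal sketch of how the free late days could be applied. It assumes the free days are used up in the order the late days are incurred, and it does not model the final (0-point) deadline. The function name, the input format, and the per-day rate for any problem set other than the first are assumptions made for the example, not part of the policy.

```python
def late_penalties(submissions, free_days=5):
    """Return the points deducted for each problem set.

    submissions: list of (days_late, points_per_day) tuples, in the order
    the problem sets were handed in (a hypothetical input format).
    """
    remaining = free_days
    penalties = []
    for days_late, points_per_day in submissions:
        # Late days absorbed by whatever free days are still left.
        covered = min(days_late, remaining)
        remaining -= covered
        # Only the uncovered late days are penalised.
        penalties.append((days_late - covered) * points_per_day)
    return penalties


# Hypothetical student: 2 days late on pset 1, 4 days late on pset 2.
# The 5 points/day rate for pset 1 is from the policy; the same rate
# for pset 2 is assumed here purely for illustration.
print(late_penalties([(2, 5), (4, 5)]))  # → [0, 5]
```

In this example the first five late days (two on pset 1, three on pset 2) are free, so only the sixth late day is penalised.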
6.864 homework #1
6.864 homework #2
The homework is due on 18th October, 2007, at 5pm (an earlier
version of the homework said 18th October 2006; this typo has now
been corrected).
You can download poscounts.gz from here, wsj.19-21.test from here, and
6.864 homework #3
- ft.tar.gz: A package containing the scripts that were used to generate
the poscounts.gz corpus. We are providing this code in case you are
curious about the data generation. For the purposes of the problem
set, however, please use the poscounts.gz training corpus to ensure
that your results comply with the reference implementation.
- tritest, tritest.probs: Development data for testing your tag-trigram
probabilities; tritest contains tag trigrams, while tritest.probs
contains the corresponding
- simplesents, simplesents.bf_tagged: Development data for testing your
Viterbi tag assignments. The simplesents file contains about 530
simple sentences that admit relatively few possible tag assignments.
The simplesents.bf_tagged file contains optimal tag assignments and
log-probabilities as discovered by brute-force enumeration. The first
element in every line of simplesents.bf_tagged gives the
log-probability of the best tagging, and the rest of the line gives
the tag assignment itself.
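The line format described above for simplesents.bf_tagged can be read with a few lines of Python. This is only a sketch: it assumes the fields are separated by whitespace, and the helper names, the tolerance, and the sample line are invented for illustration rather than taken from the data.

```python
import math


def parse_bf_tagged_line(line):
    """Split a line into (log-probability, list of tags)."""
    fields = line.split()
    return float(fields[0]), fields[1:]


def compare_with_reference(ref_line, my_logprob, my_tags, tol=1e-6):
    """Return (logprob_matches, tags_match) for one sentence.

    tol is an assumed absolute tolerance for floating-point comparison.
    """
    ref_logprob, ref_tags = parse_bf_tagged_line(ref_line)
    logprob_ok = math.isclose(ref_logprob, my_logprob, abs_tol=tol)
    return logprob_ok, ref_tags == my_tags


# Made-up reference line: log-probability first, then one tag per word.
sample = "-12.5 DT NN VBZ"
print(parse_bf_tagged_line(sample))
print(compare_with_reference(sample, -12.5, ["DT", "NN", "VBZ"]))
```

The two checks are reported separately because two different taggings can share the same best log-probability; in that case a mismatch in tags alone does not necessarily mean the Viterbi implementation is wrong.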
You can download corpus.de.gz from here and corpus.en.gz from here.
A set of words and their associated translation probabilities.
The output file is formatted as a series of lines, where each
line contains a number of (German word, translation probability)
pairs, all tokens separated by spaces.
A set of words for which you must provide output probabilities.
Please provide a file testwords.out
with the same format as devwords.out.
- Here's a note that hopefully clarifies how n(e) is defined in the
programming assignment (n(e) values are used to initialize the
T(f | e) parameters).
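The output format described above for devwords.out (and required for testwords.out) can be read and written roughly as follows. This sketch assumes each line consists only of alternating word and probability tokens; the function names, the sample words, and the way probabilities are printed are assumptions, so check them against the provided devwords.out before relying on them.

```python
def parse_translation_line(line):
    """Parse 'word prob word prob ...' into a list of (word, prob) pairs."""
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens: " + line)
    return [(tokens[i], float(tokens[i + 1])) for i in range(0, len(tokens), 2)]


def format_translation_line(pairs):
    """Write (word, prob) pairs back as space-separated tokens."""
    return " ".join(f"{word} {prob}" for word, prob in pairs)


# Made-up example line with two (German word, probability) pairs.
pairs = parse_translation_line("haus 0.5 heim 0.25")
print(pairs)
print(format_translation_line(pairs))
```

Parsing your own output with the same function you use on devwords.out is a cheap way to confirm that testwords.out follows the same layout.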
6.864 homework #4
Note: material for question 2 will be covered in the lecture
on November 20th; material for question 4 will be covered on