---
layout: ling506_frame
img: rosetta
img_url:
caption: "Rosetta stone (credit: calotype46)"
title: "Homework | Alignment"
active_tab: homework
---
Aligning words is a key task in machine translation. We start with a large parallel corpus of aligned sentences. For example, we might have the following sentence pair from the proceedings of the bilingual Canadian parliament:
le droit de permis passe donc de $ 25 à $ 500.
we see the licence fee going up from $ 25 to $ 500.
Getting documents aligned at the sentence level like this is relatively easy: we can use paragraph boundaries and cues like the length and order of each sentence. But to learn a translation model we need alignments at the word level. That's where you come in. Your challenge is to write a program that aligns words automatically. For example, given the sentence above, your program would ideally output these pairs:
le -- the, droit -- fee, permis -- license, passe -- going, passe -- up, donc -- from, $ -- $, 25 -- 25, à -- to, $ -- $, 500 -- 500
Your program can leave words unaligned (e.g. we and see) or multiply aligned (e.g. passe aligned to going up). It will be faced with difficult choices. Suppose it sees this sentence pair:
I want to make it clear that we have to let this issue come to a vote today.
il est donc essentiel que cette question fasse le objet de un vote aujourd' hui .
Your program must make a choice about how to align the words of the non-literal translations I want to make it clear and il est donc essentiel. Even experienced bilinguals will disagree on examples like this. So word alignment does not capture every nuance, but it is still very useful.
First, ssh into cl.linguistics.illinois.edu.
Once you've logged in, run this command:
git clone https://github.com/2016-Spring-UIUC-LING506/YOUR_GITHUB_USERNAME_GOES_HERE-alignment.git
You will find a python program called align, which contains a complete but very simple alignment algorithm. For every word, it computes the set of sentences that the word appears in. Intuitively, word pairs that appear in similar sets of sentences are likely to be translations. Our aligner first computes the similarity of these sets with Dice's coefficient: for any two sets $X$ and $Y$,

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

Taking $X$ and $Y$ to be the sets of sentences containing a French word $f$ and an English word $e$, respectively, the aligner then aligns every word pair whose coefficient exceeds a fixed threshold.
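In code, this heuristic might look like the following sketch. The function and variable names here are illustrative, not the ones used in align:

```python
from collections import defaultdict

def dice_alignments(bitext, threshold=0.5):
    """Align word pairs whose Dice coefficient exceeds a threshold.

    bitext is a list of (french_words, english_words) sentence pairs.
    Returns, for each pair, a list of (french_index, english_index) links.
    """
    f_sents = defaultdict(set)  # S(f): sentence ids containing French word f
    e_sents = defaultdict(set)  # S(e): sentence ids containing English word e
    for n, (f_sent, e_sent) in enumerate(bitext):
        for f in set(f_sent):
            f_sents[f].add(n)
        for e in set(e_sent):
            e_sents[e].add(n)

    alignments = []
    for f_sent, e_sent in bitext:
        pairs = []
        for i, f in enumerate(f_sent):
            for j, e in enumerate(e_sent):
                # Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
                dice = (2.0 * len(f_sents[f] & e_sents[e])
                        / (len(f_sents[f]) + len(e_sents[e])))
                if dice >= threshold:
                    pairs.append((i, j))
        alignments.append(pairs)
    return alignments
```

Note that every pair over the threshold is aligned independently: nothing stops one French word from linking to many English words, which is exactly the weakness discussed below.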
Run the baseline aligner on 1,000 sentences:
python align -n 1000 > dice.a
This command stores the output in dice.a. To compute accuracy, run:
python tools/score-alignments < dice.a
This compares the alignments against human-produced alignments, computing alignment error rate, which balances precision and recall. It will also show you the comparison in a grid. Look at the terrible output of this heuristic method -- it's better than chance, but not any good. Try training on 10,000 sentences:
python align -n 10000 | python tools/score-alignments
Performance should improve, but only slightly! Try changing the threshold for alignment. How does this affect alignment error rate?
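For reference, the alignment error rate the scorer reports is presumably the standard AER of Och and Ney, defined for a set of sure gold links $S$, a set of possible gold links $P$ (with $S \subseteq P$), and your predicted links $A$ as

$$\mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$

Since it is an error rate, lower is better, and $0$ is a perfect score.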
Your task is to improve the alignment error rate as much as possible. It shouldn't be hard: you've probably noticed that thresholding a Dice coefficient is a bad idea because alignments don't compete against one another. A good way to correct this is with a probabilistic model like IBM Model 1. It forces all of the English words in a sentence to compete as the explanation for each foreign word.
Formally, IBM Model 1 is a probabilistic model that generates each word of the foreign sentence $\mathbf{f}$ independently, conditioned on a single word of the English sentence $\mathbf{e}$ chosen uniformly at random (including a special NULL word $e_0$). Marginalizing over these latent alignments gives the likelihood

$$P(\mathbf{f} \mid \mathbf{e}) = \prod_{i=1}^{|\mathbf{f}|} \frac{1}{|\mathbf{e}| + 1} \sum_{j=0}^{|\mathbf{e}|} t(f_i \mid e_j)$$

which we want to maximize over the whole corpus by choosing the translation probabilities $t(f \mid e)$. This problem can't be solved in closed form, but we can iteratively hill-climb on the likelihood by first fixing some parameters, computing expectations under those parameters, and then maximizing the likelihood treating expected counts as observed. To compute the iterative update, for every pair of an English word type $e$ and a French word type $f$, accumulate the expected count

$$c(f, e) = \sum_{(\mathbf{f}, \mathbf{e})} \; \sum_{i : f_i = f} \; \sum_{j : e_j = e} \frac{t(f \mid e)}{\sum_{j'=0}^{|\mathbf{e}|} t(f_i \mid e_{j'})}$$

then re-estimate $t(f \mid e) = c(f, e) \,/\, \sum_{f'} c(f', e)$. Each such update is guaranteed not to decrease the likelihood.
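This expectation-maximization loop can be sketched in a few lines of Python. The sketch below is a minimal illustration under simplifying assumptions (no NULL word, no smoothing, illustrative names), not a complete solution:

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Estimate t(f|e) for IBM Model 1 by EM.

    bitext: list of (french_words, english_words) sentence pairs.
    Returns a dict mapping (f, e) to t(f|e).
    """
    # Initialize t(f|e) uniformly over the French vocabulary.
    f_vocab = set(f for f_sent, _ in bitext for f in f_sent)
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # sum over f of c(f, e)
        for f_sent, e_sent in bitext:
            for f in f_sent:
                # E-step: posterior over which English word generated f.
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / z
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from expected counts.
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

def model1_align(bitext, t):
    """Align each French word to its most probable English word."""
    alignments = []
    for f_sent, e_sent in bitext:
        pairs = [(i, max(range(len(e_sent)), key=lambda j: t[(f, e_sent[j])]))
                 for i, f in enumerate(f_sent)]
        alignments.append(pairs)
    return alignments
```

Note how, in the E-step, the English words in each sentence compete for the probability mass of each French word: the posterior delta is normalized by z, so raising one word's share lowers every other's. That competition is what the Dice heuristic lacks.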
Developing a Model 1 aligner should be enough to beat our baseline system and earn a passing grade. But alignment isn't a solved problem, and the goal of this assignment isn't for you to just implement a well-known algorithm. To get full credit you must experiment with at least one additional model of your choice and document your work. Here are some ideas:
- Implement a model that prefers to align words close to the diagonal.
- Implement an HMM alignment model.
- Implement a morphologically-aware alignment model.
- Use maximum a posteriori inference under a Bayesian prior.
- Train a French-English model and an English-French model and combine their predictions.
- Train a supervised discriminative alignment model on the annotated development set.
- Train an unsupervised discriminative alignment model.
- Seek out additional inspiration.
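As one illustration of the first idea, a Model-1-style aligner can be biased toward the diagonal by reweighting each translation probability with a position-dependent factor before choosing an alignment. The weight function and its parameter lam below are assumptions for illustration, not a prescribed design:

```python
import math

def diagonal_weight(i, j, m, n, lam=4.0):
    """Exponential penalty on distance from the diagonal.

    i, j: positions of the French and English words (0-indexed);
    m, n: lengths of the French and English sentences;
    lam: how sharply off-diagonal links are penalized (assumed value).
    """
    return math.exp(-lam * abs(i / m - j / n))
```

You would then pick, for each French word at position i, the English position j maximizing t(f_i | e_j) * diagonal_weight(i, j, m, n) rather than t(f_i | e_j) alone.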
But the sky's the limit! You are welcome to design your own model, as long as you follow the ground rules:
- For this assignment you must work independently. You may not work in groups.
- You must turn in the following:
  - All of your work, including your code and LaTeX writeup, checked in and pushed to the appropriate github repository.
  - A clear, mathematical description of your algorithm and its motivation, written in scientific style. This needn't be long, but it should be clear enough that one of your fellow students could re-implement it exactly.
- You may use data or code resources other than the ones we provide only with advance permission, and we will ask you to make those resources available to everyone. If you have a cool idea that uses the Berkeley parser or a French-English dictionary, that's great. But we want everyone to have access to the same resources, so we'll ask you to share the parses. This kind of constrained data condition is common in real-world evaluations of AI systems, because it keeps the evaluation fair. A few things are off-limits: Giza++, the Berkeley Aligner, FastAlign, or anything else that already does the alignment for you. You must write your own code.
If you have any questions or you're confused about anything, just ask.
Credits: This assignment is adapted from one originally developed by Philipp Koehn and later modified by John DeNero. It incorporates some ideas from Chris Dyer.