---
layout: ling506_frame
img: rosetta
img_url:
caption: "Rosetta stone (credit: calotype46)"
title: "Homework | Alignment"
active_tab: homework
---
Aligning words is a key task in machine translation. We start with a large parallel corpus of aligned sentences. For example, we might have the following sentence pair from the proceedings of the bilingual Canadian parliament:
le droit de permis passe donc de $ 25 à $ 500.
we see the licence fee going up from $ 25 to $ 500.
Getting documents aligned at the sentence level like this is relatively easy: we can use paragraph boundaries and cues like the length and order of each sentence. But to learn a translation model we need alignments at the word level. That's where you come in. Your challenge is to write a program that aligns words automatically. For example, given the sentence above, your program would ideally output these pairs:
le -- the, droit -- fee, permis -- license, passe -- going, passe -- up, donc -- from, $ -- $, 25 -- 25, à -- to, $ -- $, 500 -- 500
Your program can leave words unaligned (e.g. we and see) or multiply aligned (e.g. passe aligned to going up). It will be faced with difficult choices. Suppose it sees this sentence pair:
I want to make it clear that we have to let this issue come to a vote today.
il est donc essentiel que cette question fasse le objet de un vote aujourd' hui .
Your program must make a choice about how to align the words of the non-literal translations I want to make it clear and il est donc essentiel. Even experienced bilinguals will disagree on examples like this. So word alignment does not capture every nuance, but it is still very useful.
First, ssh into cl.linguistics.illinois.edu.
Once you've logged in, run this command:
git clone https://github.com/2016-Spring-UIUC-LING506/YOUR_GITHUB_USERNAME_GOES_HERE-alignment.git
You will find a python program called align, which contains a complete but very simple alignment algorithm. For every word, it computes the set of sentences that the word appears in. Intuitively, word pairs that appear in similar sets of sentences are likely to be translations. Our aligner first computes the similarity of these sets with Dice's coefficient: for any two sets $X$ and $Y$,

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

Taking $X$ and $Y$ to be the sets of sentences containing a French word $f$ and an English word $e$, respectively, the aligner then aligns every word pair whose coefficient exceeds a fixed threshold.
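In code, this heuristic might look like the following sketch. The function and variable names here are illustrative, not the ones used in align:

```python
from collections import defaultdict

def dice_alignments(bitext, threshold=0.5):
    """Align word pairs whose Dice coefficient exceeds a threshold.

    bitext is a list of (french_words, english_words) sentence pairs.
    Returns, for each pair, a list of (french_index, english_index) links.
    """
    f_sents = defaultdict(set)  # S(f): sentence ids containing French word f
    e_sents = defaultdict(set)  # S(e): sentence ids containing English word e
    for n, (f_sent, e_sent) in enumerate(bitext):
        for f in set(f_sent):
            f_sents[f].add(n)
        for e in set(e_sent):
            e_sents[e].add(n)

    alignments = []
    for f_sent, e_sent in bitext:
        pairs = []
        for i, f in enumerate(f_sent):
            for j, e in enumerate(e_sent):
                # Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
                dice = (2.0 * len(f_sents[f] & e_sents[e])
                        / (len(f_sents[f]) + len(e_sents[e])))
                if dice >= threshold:
                    pairs.append((i, j))
        alignments.append(pairs)
    return alignments
```

Note that every pair over the threshold is aligned independently: nothing stops one French word from linking to many English words, which is exactly the weakness discussed below.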
Run the baseline aligner on 1,000 sentences:
python align -n 1000 > dice.a
This command stores the output in dice.a. To compute accuracy, run:
python tools/score-alignments < dice.a
This compares the alignments against human-produced alignments, computing alignment error rate, which balances precision and recall. It will also show you the comparison in a grid. Look at the terrible output of this heuristic method -- it's better than chance, but not any good. Try training on 10,000 sentences:
python align -n 10000 | python tools/score-alignments
Performance should improve, but only slightly! Try changing the threshold for alignment. How does this affect alignment error rate?
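For reference, the alignment error rate the scorer reports is presumably the standard AER of Och and Ney, defined for a set of sure gold links $S$, a set of possible gold links $P$ (with $S \subseteq P$), and your predicted links $A$ as

$$\mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$

Since it is an error rate, lower is better, and $0$ is a perfect score.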
Your task is to improve the alignment error rate as much as possible. It shouldn't be hard: you've probably noticed that thresholding a Dice coefficient is a bad idea because alignments don't compete against one another. A good way to correct this is with a probabilistic model like IBM Model 1. It forces all of the English words in a sentence to compete as the explanation for each foreign word.
Formally, IBM Model 1 is a probabilistic model that generates each word of the foreign sentence $\mathbf{f}$ independently, conditioned on a single word of the English sentence $\mathbf{e}$ chosen uniformly at random (including a special NULL word $e_0$). Marginalizing over these latent alignments gives the likelihood

$$P(\mathbf{f} \mid \mathbf{e}) = \prod_{i=1}^{|\mathbf{f}|} \frac{1}{|\mathbf{e}| + 1} \sum_{j=0}^{|\mathbf{e}|} t(f_i \mid e_j)$$

which we want to maximize over the whole corpus by choosing the translation probabilities $t(f \mid e)$. This problem can't be solved in closed form, but we can iteratively hill-climb on the likelihood by first fixing some parameters, computing expectations under those parameters, and then maximizing the likelihood treating expected counts as observed. To compute the iterative update, for every pair of an English word type $e$ and a French word type $f$, accumulate the expected count

$$c(f, e) = \sum_{(\mathbf{f}, \mathbf{e})} \; \sum_{i : f_i = f} \; \sum_{j : e_j = e} \frac{t(f \mid e)}{\sum_{j'=0}^{|\mathbf{e}|} t(f_i \mid e_{j'})}$$

then re-estimate $t(f \mid e) = c(f, e) \,/\, \sum_{f'} c(f', e)$. Each such update is guaranteed not to decrease the likelihood.
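This expectation-maximization loop can be sketched in a few lines of Python. The sketch below is a minimal illustration under simplifying assumptions (no NULL word, no smoothing, illustrative names), not a complete solution:

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Estimate t(f|e) for IBM Model 1 by EM.

    bitext: list of (french_words, english_words) sentence pairs.
    Returns a dict mapping (f, e) to t(f|e).
    """
    # Initialize t(f|e) uniformly over the French vocabulary.
    f_vocab = set(f for f_sent, _ in bitext for f in f_sent)
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # sum over f of c(f, e)
        for f_sent, e_sent in bitext:
            for f in f_sent:
                # E-step: posterior over which English word generated f.
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / z
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from expected counts.
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

def model1_align(bitext, t):
    """Align each French word to its most probable English word."""
    alignments = []
    for f_sent, e_sent in bitext:
        pairs = [(i, max(range(len(e_sent)), key=lambda j: t[(f, e_sent[j])]))
                 for i, f in enumerate(f_sent)]
        alignments.append(pairs)
    return alignments
```

Note how, in the E-step, the English words in each sentence compete for the probability mass of each French word: the posterior delta is normalized by z, so raising one word's share lowers every other's. That competition is what the Dice heuristic lacks.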
Developing a Model 1 aligner should be enough to beat our baseline system and earn a passing grade. But alignment isn't a solved problem, and the goal of this assignment isn't for you to just implement a well-known algorithm. To get full credit you must experiment with at least one additional model of your choice and document your work. Here are some ideas:
- Implement a model that prefers to align words close to the diagonal.
- Implement an HMM alignment model.
- Implement a morphologically-aware alignment model.
- Use maximum a posteriori inference under a Bayesian prior.
- Train a French-English model and an English-French model and combine their predictions.
- Train a supervised discriminative alignment model on the annotated development set.
- Train an unsupervised discriminative alignment model.
- Seek out additional inspiration.
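As one illustration of the first idea, a Model-1-style aligner can be biased toward the diagonal by reweighting each translation probability with a position-dependent factor before choosing an alignment. The weight function and its parameter lam below are assumptions for illustration, not a prescribed design:

```python
import math

def diagonal_weight(i, j, m, n, lam=4.0):
    """Exponential penalty on distance from the diagonal.

    i, j: positions of the French and English words (0-indexed);
    m, n: lengths of the French and English sentences;
    lam: how sharply off-diagonal links are penalized (assumed value).
    """
    return math.exp(-lam * abs(i / m - j / n))
```

You would then pick, for each French word at position i, the English position j maximizing t(f_i | e_j) * diagonal_weight(i, j, m, n) rather than t(f_i | e_j) alone.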
But the sky's the limit! You are welcome to design your own model, as long as you follow the ground rules:
- For this assignment you must work independently. You may not work in groups.
- You must turn in the following:
  - All of your work, including your code and LaTeX writeup, checked in and pushed to the appropriate github repository.
  - A clear, mathematical description of your algorithm and its motivation, written in scientific style. This needn't be long, but it should be clear enough that one of your fellow students could re-implement it exactly.
- You may use data or code resources other than the ones we provide only with advance permission, and we will ask you to make those resources available to everyone. If you have a cool idea that uses the Berkeley parser or a French-English dictionary, that's great. But we want everyone to have access to the same resources, so we'll ask you to share the parses. This kind of constrained data condition is common in real-world evaluations of AI systems, because it keeps the evaluation fair. A few things are off-limits: Giza++, the Berkeley Aligner, FastAlign, or anything else that already does the alignment for you. You must write your own code.
If you have any questions or you're confused about anything, just ask.
Credits: This assignment is adapted from one originally developed by Philipp Koehn and later modified by John DeNero. It incorporates some ideas from Chris Dyer.