We’re building open machine translation with the Marian machine translation toolkit, fast enough to run locally. Model quality varies depending on the data available and the capabilities of the models; we can’t fix individual errors made by a model, but systematic errors are interesting to us. We aim to improve the models with better data, better training recipes, and speed optimizations.
We have a Discourse forum for general discussion and questions. Feel free to join :)
This repo currently stores the non-reproducible training pipeline we use today (data fetching, data preparation (filtering, augmentation), training and evaluation scripts, etc.). For now, it is what we can only call Bash Hell: a pipeline of bash scripts calling other bash scripts or Perl/Python/C scripts. The end goal is to make this pipeline reproducible, minimalist, and performant, all in one library.
Although the OpusFilter repo has some really handy tools, it has three major drawbacks that prevent us from using it in this project:
- it is too slow (it uses only a single thread)
- its architecture is too complex and it pulls in too many dependencies
- adding a new dataset is not as lightweight and fast as with MTData
For these reasons, we would prefer an end-to-end pipeline (fetching, filtering, and generating the Marian training scripts) in Python (>= 3.7).
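As a rough illustration of the direction we have in mind, here is a minimal Python sketch of such a pipeline: a multiprocessing-based filter stage (addressing the single-thread problem above) followed by emitting a Marian training command. This is not an existing API; the function names, filter rule, and file paths are hypothetical, and the Marian flags are only indicative.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the end-to-end pipeline (fetch -> filter -> Marian
training command), assuming Python >= 3.7. Names and paths are illustrative."""
import multiprocessing as mp
import shlex
from pathlib import Path
from typing import Tuple


def keep_pair(pair: Tuple[str, str]) -> bool:
    """Toy length-ratio filter; real filtering rules would be pluggable."""
    src, tgt = pair
    if not src.strip() or not tgt.strip():
        return False
    ratio = len(src) / max(len(tgt), 1)
    return 0.5 <= ratio <= 2.0


def filter_corpus(src_path: Path, tgt_path: Path, out_prefix: Path) -> None:
    """Filter a parallel corpus using all CPU cores instead of a single thread."""
    with src_path.open(encoding="utf-8") as fs, tgt_path.open(encoding="utf-8") as ft:
        pairs = list(zip(fs, ft))
    with mp.Pool() as pool:
        mask = pool.map(keep_pair, pairs, chunksize=10_000)
    with open(f"{out_prefix}.src", "w", encoding="utf-8") as out_src, \
         open(f"{out_prefix}.tgt", "w", encoding="utf-8") as out_tgt:
        for (src, tgt), keep in zip(pairs, mask):
            if keep:
                out_src.write(src)
                out_tgt.write(tgt)


def marian_command(out_prefix: Path, model_dir: Path) -> str:
    """Emit a Marian training command; flags are indicative, see `marian --help`."""
    args = [
        "marian",
        "--train-sets", f"{out_prefix}.src", f"{out_prefix}.tgt",
        "--model", str(model_dir / "model.npz"),
        "--vocabs", str(model_dir / "vocab.spm"), str(model_dir / "vocab.spm"),
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Fetching (e.g. with MTData or from OPUS) would happen before this step.
    filter_corpus(Path("raw.src"), Path("raw.tgt"), Path("clean"))
    print(marian_command(Path("clean"), Path("models/en-fr")))
```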
- Marian machine translation toolkit
- Bergamot Project
- Helsinki-NLP
- Opus Corpus
- Paracrawl
- Mozilla
- MTData
- Language Reactor (Previously Learning Languages with Netflix)