Train both directions at once #46

XapaJIaMnu · 2021-12-28T21:50:15Z

Currently, it's difficult to reuse data between two translation directions as majority of the files are placed in different directories https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/configs/config.prod.yml#L18
eg: exp-name/src-trg, meaning that all datasets will be redownloaded.

Furthermore data cleaning is done via concating src and target files and is asymetrical at places: #41 (comment)

In practise preprocessing can be symmetrical and once a good student model is trained in one direction, it may even be used for producing backtranslations in the other automatically (prior to quantising). By training src-trg and trg-src at the same time, we can avoid data duplication, lengthy and space consuming preprocessing, training vocabulary and training one separate model

The text was updated successfully, but these errors were encountered:

eu9ene · 2022-01-04T01:02:38Z

It is supposed to be partly fixed by:

Snakemake caching - it is implemented but I couldn't make it work, snakemake somehow doesn't recognize symlinks to the cached files. I disabled it because of that. See cache: False directive for data downloading and monolingual data cleaning in the pipeline. You can try to enable it and check if it works. I assumed that parallel data cleaning is asymmetrical.
You can specify a path to backward model in the config - for example to a student for the opposite direction

experiment:
  ...
  backward-model: "..."

If we make caching work it will cover most of the cases. It will allow reusing of downloaded and cleaned mono data for English between all language pairs, not only the ones of opposite directions. Parallel cleaning is relatively fast compared to training and decoding, so I think it's fine to do it in both directions. If we still want to reuse data, we can either copy it manually or try to normalize lang pair path for this step (sort languages), so that it always point to the same directory, which will require rethinking of the directory structure.

XapaJIaMnu · 2022-01-04T12:11:42Z

I think that we shouldn't rely on snakemake caching to get it to work but it should be part of the pipeline with something like "train-reverse-model: true" appended at the end. I do recognised that it's a lot of work and this is mostly just an enhancement.

Copying the files over doesn't work because they get concatenated as src-trg which means that one needs to go and manually rename everything. Not something you want to do in general.

Cheers,

Nick

eu9ene · 2022-01-05T20:46:40Z

Actually, the naming convention everywhere is <corpus_name>.<lang>.gz, so you can copy directories original, clean and biclean between language pairs assuming that cleaning is symmetrical and you use the same monolingual datasets for back and forward translation.

Where are they get concatenated as src-trg?

XapaJIaMnu · 2022-01-05T20:55:45Z

data/data/bg-en/snakemake-bg-en/original/eval$ ls
custom-corpus_  devset.bg.gz  devset.en.gz  merge.bgen.gz

The merge is direction dependent.

eu9ene · 2022-01-05T21:21:17Z

merge.bgen.gz this is an intermediate file for deduplication that doesn't affect pipeline execution, it should probably be deleted after the job is completed. The final results are devset.bg.gz devset.en.gz, so it's safe to copy.

XapaJIaMnu · 2022-01-06T07:52:23Z

I see, so i could in theory do a blanket copy of all the clean biclean etc directories and the only thing that would be rebuilt is the vocabulary (since it's named vocab.$SRC$TRG.spm)?

eu9ene · 2022-01-06T23:06:11Z

Vocabulary is stored in a directory like models/en-ru/test/vocab and named vocab.spm, so you can copy this directory too. You shouldn't copy any other directories except those I mentioned, the results are not usable for training in the opposite direction.

gregtatum · 2024-04-09T21:05:38Z

Taskcluster caching is pretty robust these days for this type of issue.

@abhi-agg

* Draft adjustments to API * Adjustments to docs * Let's call the word + sentence ranges annotations * Editing confusing comment on size() * Fixing compilation for template adjustments for SentenceRanges * string_view template hacks This commit shifts AnnotatedBlob into a templated type and gets the troubled part to compile. All to manage absl::string_view and std::string_view. Objective: marian::bergamot stays C++ 11 to pluck and put in marian code, bergamot-translator somehow flexes C++17. Simplify development in one place. * Fixing the wiring: Gets source to build Runtime errors exist, but AnnotatedBlobs are consistent. * Bugfix: Matching old-state after factoring AnnotatedBlob in * Removing vocabs_ from Response. (For the umpteenth time). * Alignment API ready in marian::bergamot::Response * Wiring alignments upto TranslationResult * Adjustment to get alignments; bergamot-translator-app has alignments available * Accessing words instead of Ids This code sets up access of word string_views from annotations instead of printing Ids. However, we have segfault. This is likely due to targetRanges not being set, pending from browsermt/bergamot-translator#25. Could also be a rogue EOS token which we're filtering for in string_view annotations, but not so in alignments. * Switching to browsermt/marian-dev@jp/decode-string-view for targetTokenRanges * Target word byte range annotations available Issues corresponding to #25 should be resolved. There is still a segfault. Could be due to EOS. Pending investigation. * Bugfix: Tokens for alignments are now through. Was not EOS. * browsermt/marian-dev@master ByteRange changes work downstream and has been merged to master. Updating submodule to point to master. * Style and documentation enhancements: response.cpp * Style and documentation enhancements: TranslationResult.h * Descriptions for SentenceRanges templating * Switching to marian-dev@wasm-sync * AnnotatedBlob can be copy-ctord/copy-assigned * TranslationResult: Empty ctor + WASM Bindings Allows empty construction of TranslationResult. Using this empty constructor, WASM bindings are adjusted. Unsure of the results, maybe @abhi-agg can test. * Cosmetic: SentenceRangesT -> Annotation - SentenceRangesT is renamed to AnnotationT; - Further comments to explain heavily templated files. * Response: Cleaning up unused members and adding docs * Adding quality scores - attempt * Stub QualityScores This adjustment adds capability to get "scores", which should potentially indicate how confident (at least relative in a target-sentence) should be. This enables writing the code forward for TranslationResult, and an example quality-score people can be pointed at. - These are not between [0,1] yet. - In addition, guards to check out-of-bounds access have been placed so illegal accesses are caught early on during development. * Removing token debug statements * Reworking Annotation without templates mozilla/bergamot-translator#8 provides ByteRanges. - This ByteRange data-type is used in Annotation and converted to marian::string_view(=absl::string-view) on demand. - Since Annotation[using ByteRange] is not bound to anything else, it can be unit tested. A unit test is added (originally to test independently for integration after). - Annotation with ByteRange is now propogated across marian::bergamot and functionality matched to how it was previously working. This eliminates the string-view conversion and template code. * Nit: Removing std::endl flushes * Bring TranslationResult and Response closer Helps browsermt/bergamot-translator#53. In preparation , the data-export types for Quality and Alignment are pushed down to Response from TranslationResult and computed during construction. This brings TranslationResult closer to Response, paving way to avoid having two TranslationResults. histories_ only remain for marian-decoder replacement usage, which can be removed in a separate PR. * Clean up hacks originally added for a unit-test to compile * Moving Annotation functions to cpp and documenting header file * Shifting alignments, qualityScore testing capability into main-mts * Restore Unified API files to previous state * Adaptations to fix Response with Quality, Alignments to connect to old Unified API * Missing reset on TranslationResultBindings * Cleaning up Response documentation to reflect newer code * Minor adjustments to get build back after main sync * Marian seems to make available Catch somehow * Disable COMPILE_BERGAMOT_TESTS for WASM * Add COMPILE_BERGAMOT_TESTS as a CMakeDependent option * Use the COMPILE_TESTS flag instead to skip macos.yml * Trigger unit-tests on GitHub runners for Annotation * Reordering enable_testing() to before inclusion of test directory * doc constructs required to operate with alignments Documents with doxygen compatible documentation for Response, AnnotatedBlob, Annotation, ByteRange. Incorporates doxygen compatible documentation for * Updates ByteRange consistent with general C++ Also little documentation enhancements in the process. * Updating marian-dev@9337105 * Copy-paste documentation because lazy * Turn off autoformat and manually edit to fix style changes * AnnotatedBlob -> AnnotatedText; blob -> text * text.text in test app renamed * text of text -> blob of text in places of documentation

eu9ene added the enhancement New feature or request label Jan 4, 2022

kpu mentioned this issue Mar 15, 2022

Start en-pl and en-fr training browsermt/students#62

Open

gregtatum closed this as completed Apr 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Train both directions at once #46

Train both directions at once #46

XapaJIaMnu commented Dec 28, 2021

eu9ene commented Jan 4, 2022

XapaJIaMnu commented Jan 4, 2022

eu9ene commented Jan 5, 2022

XapaJIaMnu commented Jan 5, 2022

eu9ene commented Jan 5, 2022

XapaJIaMnu commented Jan 6, 2022

eu9ene commented Jan 6, 2022

gregtatum commented Apr 9, 2024

Train both directions at once #46

Train both directions at once #46

Comments

XapaJIaMnu commented Dec 28, 2021

eu9ene commented Jan 4, 2022

XapaJIaMnu commented Jan 4, 2022

eu9ene commented Jan 5, 2022

XapaJIaMnu commented Jan 5, 2022

eu9ene commented Jan 5, 2022

XapaJIaMnu commented Jan 6, 2022

eu9ene commented Jan 6, 2022

gregtatum commented Apr 9, 2024