
Adjust default values for batching #934

Merged: 1 commit into mozilla:main on Nov 20, 2024

Conversation

gregtatum (Member)

This results in a 2.2x speed-up for train-mono tasks according to my testing in #931. There's also a bug with loading the Marian args (#933), so I'm putting the values in the decoder.yml file as well.

@gregtatum gregtatum requested review from a team as code owners November 18, 2024 17:45
@gregtatum gregtatum requested a review from ahal November 18, 2024 17:45
@bhearsum bhearsum removed request for a team and ahal November 18, 2024 18:07
ZJaume (Collaborator) commented Nov 18, 2024

The fp16: true option may be safer than just precision: float16, as it is an alias for float16 precision that also adjusts other precision scaling parameters.

maxi-batch-sort: src
precision: float16
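
For illustration only, ZJaume's suggestion would amount to swapping the precision line for the fp16 alias in decoder.yml. This is a sketch of the suggested change, not the diff from this PR:

maxi-batch-sort: src
# fp16 is an alias for precision: float16 that also adjusts related precision-scaling settings
fp16: true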
Collaborator

One downside of moving it to this config is that some GPUs on the Berlin cluster, which we plan to include soon, don't support half-precision decoding. If I remember correctly, it fails silently and produces bad results when decoding, so it's quite dangerous. This setting should likely be propagated through the Marian args, or there should be a detection mechanism in the script.

In fact, all of those values were more or less tuned for those smaller GPUs, so they'll likely OOM. We should either leave the safe values and override them in the production training config template, or add a TODO and an issue to refactor this so that a Python script adjusts those values based on the GPU model.
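
To make the detection idea concrete, here is a minimal, hypothetical Python sketch (not part of this PR) that queries the GPU with nvidia-smi and falls back to conservative settings when the card is small or may not support float16 decoding. The memory threshold and the override values are placeholders, not tuned numbers.

import subprocess

# Hypothetical sketch (not in this PR): choose decoder.yml overrides based on the GPU.
# `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` prints one
# line per GPU, e.g. "NVIDIA A100-SXM4-40GB, 40536" (memory in MiB).
def detect_gpu():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    name, mem_mib = [part.strip() for part in out.split(",")]
    return name, int(mem_mib)

def decoder_overrides(mem_mib: int) -> dict:
    # Assumption: only larger, newer cards get float16 decoding and big batches;
    # smaller GPUs keep conservative values so they neither OOM nor decode badly.
    large = mem_mib >= 40_000  # placeholder threshold in MiB
    return {
        "precision": "float16" if large else "float32",
        "mini-batch": 96 if large else 16,      # placeholder values
        "maxi-batch": 1000 if large else 100,   # placeholder values
        "workspace": 12000 if large else 8000,  # MiB, placeholder values
    }

name, mem = detect_gpu()
print(name, decoder_overrides(mem))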

gregtatum (Member Author)

I added a TODO for checking that the values are safe: #936. The issue is that decoder.yml is the only place to modify these values, as the config values aren't forwarded; see #933. My CTranslate2 patch stack fixes that piece of it, but I don't want to put it up for review until the experiment results come back.

At the very least in the short term, we should have the config working for our current execution environment.

@gregtatum gregtatum force-pushed the teacher-decoder-config branch from 7ee578d to b007fca Compare November 20, 2024 19:15
@gregtatum gregtatum requested a review from eu9ene November 20, 2024 19:17
gregtatum (Member Author)

> The fp16: true option may be safer than just precision: float16, as it is an alias for float16 precision that also adjusts other precision scaling parameters.

I checked the Marian code: it matters for training, but for translation only the two behave the same. I'm happy to switch to it, though.

@gregtatum gregtatum merged commit 47efa45 into mozilla:main Nov 20, 2024
8 checks passed