
Adjust default values for batching #934

Merged: 1 commit into mozilla:main on Nov 20, 2024

Conversation

gregtatum (Member)

This results in a 2.2x speed-up for train-mono tasks according to my testing in #931. There's also a bug with loading the Marian args (#933), so I'm putting the values in the decoder.yml file as well.

@gregtatum gregtatum requested review from a team as code owners November 18, 2024 17:45
@gregtatum gregtatum requested a review from ahal November 18, 2024 17:45
@bhearsum bhearsum removed request for a team and ahal November 18, 2024 18:07
ZJaume (Collaborator) commented Nov 18, 2024

The fp16: true option may be safer than just precision: float16, as it is an alias for float16 precision that also adjusts other precision scaling parameters.

maxi-batch-sort: src
precision: float16
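
For illustration only, ZJaume's suggestion would amount to swapping the precision line for the fp16 alias in decoder.yml. This is a sketch of the suggested change, not the diff from this PR:

maxi-batch-sort: src
# fp16 is an alias for precision: float16 that also adjusts related precision-scaling settings
fp16: true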
Collaborator

One downside of moving it to this config is that some GPUs on the Berlin cluster, which we plan to include soon, don't support half-precision decoding. If I remember correctly, it fails silently and produces bad results when decoding, so it's quite dangerous. This setting should likely be propagated through the Marian args, or there should be a detection mechanism in the script.

In fact, all of those values were more or less tuned for those smaller GPUs, so they'll likely OOM. We should either leave the safe values and override them in the production training config template, or add a TODO and an issue to refactor this so that a Python script adjusts those values based on the GPU model.
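
To make the detection idea concrete, here is a minimal, hypothetical Python sketch (not part of this PR) that queries the GPU with nvidia-smi and falls back to conservative settings when the card is small or may not support float16 decoding. The memory threshold and the override values are placeholders, not tuned numbers.

import subprocess

# Hypothetical sketch (not in this PR): choose decoder.yml overrides based on the GPU.
# `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` prints one
# line per GPU, e.g. "NVIDIA A100-SXM4-40GB, 40536" (memory in MiB).
def detect_gpu():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    name, mem_mib = [part.strip() for part in out.split(",")]
    return name, int(mem_mib)

def decoder_overrides(mem_mib: int) -> dict:
    # Assumption: only larger, newer cards get float16 decoding and big batches;
    # smaller GPUs keep conservative values so they neither OOM nor decode badly.
    large = mem_mib >= 40_000  # placeholder threshold in MiB
    return {
        "precision": "float16" if large else "float32",
        "mini-batch": 96 if large else 16,      # placeholder values
        "maxi-batch": 1000 if large else 100,   # placeholder values
        "workspace": 12000 if large else 8000,  # MiB, placeholder values
    }

name, mem = detect_gpu()
print(name, decoder_overrides(mem))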

gregtatum (Member Author)

I added a TODO for checking that the values are safe: #936. The issue is that decoder.yml is the only place to modify these values, as the config values aren't forwarded; see #933. My CTranslate2 patch stack fixes that piece of it, but I don't want to put it up for review until the experiment results come back.

At the very least in the short term, we should have the config working for our current execution environment.

@gregtatum gregtatum force-pushed the teacher-decoder-config branch from 7ee578d to b007fca Compare November 20, 2024 19:15
@gregtatum gregtatum requested a review from eu9ene November 20, 2024 19:17
gregtatum (Member Author)

> The fp16: true option may be safer than just precision: float16, as it is an alias for float16 precision that also adjusts other precision scaling parameters.

I checked the Marian code: it matters for training, but for translation only the two behave the same. I'm happy to switch to it, though.

@gregtatum gregtatum merged commit 47efa45 into mozilla:main Nov 20, 2024
8 checks passed