Ensure training pipeline is deterministic #253

Open
jackapbutler opened this issue May 22, 2023 · 2 comments
Assignees
Labels
good first issue Good for newcomers priority: nice-to-have / low work package: model training Relates to the model training pipeline

Comments

jackapbutler commented May 22, 2023

To reduce sources of error, we want the training pipeline to produce a deterministic ordering of the dataset batches and to make other elements, such as initialisation, deterministic as well. This will help track down the causes of loss spikes, e.g. malformed samples.
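A minimal sketch of what deterministic batch ordering and initialisation could look like in PyTorch. This is not the project's pipeline: `seed_everything`, `make_loader`, `batch_order` and `init_weights` are hypothetical helpers, and the toy dataset and layer sizes are placeholders chosen only for illustration.

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_everything(seed: int) -> None:
    """Seed the Python and PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the default generators on all devices


def make_loader(dataset, batch_size: int, seed: int) -> DataLoader:
    """Build a shuffling DataLoader whose batch order depends only on `seed`."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=generator)


def batch_order(seed: int) -> list:
    """Return the sample indices of each batch for a toy 16-sample dataset."""
    dataset = TensorDataset(torch.arange(16))
    loader = make_loader(dataset, batch_size=4, seed=seed)
    return [batch[0].tolist() for batch in loader]


def init_weights(seed: int) -> torch.Tensor:
    """Initialise a small layer after seeding and return a copy of its weights."""
    seed_everything(seed)
    return torch.nn.Linear(4, 2).weight.detach().clone()
```

Calling `batch_order` twice with the same seed should give the same sequence of batches, and `init_weights` should return identical tensors for the same seed. Passing a dedicated `torch.Generator` to the loader keeps the shuffle order independent of any other random draws made during training.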

See this experiment for more information. We might want to check out this.

jackapbutler added the work package: model training and priority: medium labels May 22, 2023
jackapbutler self-assigned this May 22, 2023
jackapbutler commented

In the current non-nightly releases of PyTorch, the setting above causes training to hang; this is a known issue (huggingface/transformers#22363). I am going to try torch.backends.cudnn.deterministic = True instead.
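For reference, the cuDNN flag mentioned here could be set as in the sketch below. Disabling `torch.backends.cudnn.benchmark` alongside it is an assumption on top of what the comment describes (benchmark mode can select different convolution algorithms between runs), and `enable_cudnn_determinism` is a hypothetical helper name.

```python
import torch


def enable_cudnn_determinism() -> None:
    """Ask cuDNN to use deterministic algorithms (hypothetical helper)."""
    # The setting named in the comment above.
    torch.backends.cudnn.deterministic = True
    # Assumption: also turn off the autotuner, which may choose a
    # different algorithm from one run to the next.
    torch.backends.cudnn.benchmark = False
```

This only affects cuDNN-backed operations; other CUDA kernels may still be nondeterministic.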

jackapbutler commented May 24, 2023

With torch.backends.cudnn.deterministic = True, and model.eval() to disable the dropout layers, the loss for each of the first N batches became more similar across runs but was still not completely consistent.
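One way to quantify "more similar but not completely consistent" is to record the per-batch losses of two runs and find the first batch where they disagree. The sketch below assumes the losses have already been collected as plain lists of floats; `first_divergence` is a hypothetical helper, not part of the pipeline.

```python
import math


def first_divergence(losses_a, losses_b, rel_tol=0.0, abs_tol=0.0):
    """Return the index of the first batch whose losses differ, or None.

    With both tolerances at 0.0 the comparison is exact; a small abs_tol
    can be passed to ignore floating-point noise.
    """
    for index, (a, b) in enumerate(zip(losses_a, losses_b)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return index
    return None
```

For example, `first_divergence(run_1_losses, run_2_losses)` on the first N batches of two runs would point at the earliest step where nondeterminism shows up, which narrows down where to look.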
