DeepSpeed with trl #6852
Can you try different offload settings? Offload only the optimizer or only the parameters, not both. How much CPU memory does the system have?
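For reference, the two single-offload variants would look roughly like the sketch below (a minimal sketch, not the reporter's actual config; values are illustrative, and the dicts assume the config is passed through transformers' DeepSpeed integration, e.g. `TrainingArguments(deepspeed=...)`):

```python
# Variant 1: ZeRO-3 with only the optimizer state offloaded to CPU.
zero3_offload_optimizer = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # no "offload_param" key -> parameters stay on GPU
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Variant 2: ZeRO-3 with only the parameters offloaded to CPU.
zero3_offload_param = {
    "zero_optimization": {
        "stage": 3,
        # no "offload_optimizer" key -> optimizer state stays on GPU
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```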
Hi @jomayeri, thanks for answering. I already tried all the offload permutations and got the same error. I don't know the exact CPU memory, but I have 32 CPUs, and during the run I got the following info:
so I don't think CPU memory is the problem.
This indicates that the parameter was not fetched or all-gathered as required before use. This is a very strange failure for ZeRO stage 3. Are you able to share full repro steps? By this I mean including the command line and datasets.
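To illustrate the failure class (a sketch, not the reporter's code; `model_engine` is a placeholder for the engine returned by `deepspeed.initialize()`): under ZeRO stage 3 each rank holds only a shard of every parameter, and the engine all-gathers the full tensor just in time inside its own forward/backward. Touching a parameter outside that path sees only the local shard unless it is gathered explicitly:

```python
import deepspeed

# Outside the engine's forward/backward, direct parameter access must be
# wrapped so DeepSpeed materializes the full tensor on this rank:
with deepspeed.zero.GatheredParameters(model_engine.module.lm_head.weight):
    # Inside the context the weight is fully all-gathered and safe to read.
    print(model_engine.module.lm_head.weight.shape)

# Reading the same weight without the context sees an empty/partitioned
# tensor, which is the kind of state this error message points at.
```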
Dataset: perfernces_dataset_from_ranker_train_queries_and_baseline_doc.csv
Command line:
The dataset is a small CSV file with prompts and an accepted + rejected sample for each prompt.
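That shape matches the preference format trl's DPOTrainer documents (`prompt`, `chosen`, `rejected` columns). A minimal loading sketch, assuming those column names (the CSV path is the reporter's; the loading code is an assumption):

```python
from datasets import load_dataset

train_dataset = load_dataset(
    "csv",
    data_files="perfernces_dataset_from_ranker_train_queries_and_baseline_doc.csv",
    split="train",
)
# Each row: {"prompt": ..., "chosen": <accepted sample>, "rejected": <rejected sample>}
```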
What does the program file consist of? And what's in param_file.json?
Hi again @jomayeri, what do you mean by "consist of"? It's a regular Python file executing a DPO pipeline.
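A minimal sketch of what such a pipeline typically looks like (an assumption about the reporter's script, based on trl's documented DPOTrainer usage; exact argument names vary across trl versions, and hyperparameters are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = DPOConfig(
    output_dir="dpo-llama-3.1-8b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # the prompt/chosen/rejected dataset loaded above
    processing_class=tokenizer,   # "tokenizer=" in older trl versions
)
trainer.train()
```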
Describe the bug
I am trying to train meta-llama/Llama-3.1-8B-Instruct with trl's DPOTrainer.
After creating the trainer and starting the training loop, I get the following error in the forward pass:
I tried downgrading transformers, with no success.
System info (please complete the following information):
My accelerate config:
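The config itself is not reproduced here; as an assumption, a ZeRO-3 accelerate setup matching this report would look roughly like the programmatic equivalent below (values illustrative, not the reporter's actual settings):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,
    offload_optimizer_device="cpu",  # or "none"
    offload_param_device="cpu",      # or "none"
    gradient_accumulation_steps=8,
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
```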