Llama3-70B LoRA multi GPU #802
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/802. Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 652cd3d with merge base a9180b5. This comment was automatically generated by Dr. CI and updates every 15 minutes.
README.md (Outdated)
Before: In our initial experiments, QLoRA has a peak allocated memory of ``~9GB`` while LoRA on a single GPU has a peak allocated memory of ``~19GB``. To get started, you can use our default configs to kick off training.
After: In our initial experiments for Llama3 8B, QLoRA has a peak allocated memory of ``~9GB`` while LoRA on a single GPU has a peak allocated memory of ``~19GB``. To get started, you can use our default configs to kick off training.
nit: suggested change is "for Llama3 8B" to "for Llama3-8B" (hyphenate the model name).
```bash
tune run --nproc_per_node 2 full_finetune_distributed --config llama3/8B_full
```
- 70B LoRA finetune on 8 GPUs
Not sure what the right place to do this is, but do we wanna mention memory requirements to run 70B somewhere?
Yeah, maybe we can add it to the table?
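As an aside (not something this PR adds): peak allocated memory figures like the ~9GB / ~19GB numbers quoted above can be measured directly with PyTorch's CUDA memory stats, which would be one way to fill in a memory column for the 70B row. A minimal sketch:

```python
import torch

# Minimal sketch (not torchtune's actual logging code): record peak allocated
# GPU memory around a training step to produce numbers like "~19GB peak allocated".
torch.cuda.reset_peak_memory_stats()

# ... run one forward/backward/optimizer step of the finetune here ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated memory: {peak_gib:.1f} GiB")
```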
recipes/configs/llama3/70B_lora.yaml (Outdated)
# Model Arguments
model:
  _component_: torchtune.models.llama3.lora_llama3_70b
  lora_attn_modules: ['q_proj', 'v_proj', 'k_proj']
nit: I feel like this is clearer
Suggested: lora_attn_modules: ['q_proj', 'k_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 16
  lora_alpha: 32
How are these defaults set? Are rank and alpha higher than our 7/8B defaults because the embedding dim is larger?
I mostly copied these from https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama2/70B_lora.yaml.
I'm not sure whether rank and alpha being higher is because of the embedding dim being larger - what do those have to do with each other?
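For context on the rank/alpha question: in standard LoRA the adapter update is scaled by `alpha / rank`, so alpha is typically moved in lockstep with rank to keep the effective scale constant, and a higher rank mainly buys adapter capacity rather than anything tied to the embedding dim. A toy sketch (illustration only, not torchtune's LoRA implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA-adapted linear layer to illustrate how rank and alpha interact."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)  # frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)   # trainable low-rank factors
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)                   # adapter starts as a no-op
        self.scaling = alpha / rank                          # alpha only enters via this ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# rank=16, alpha=32 and rank=8, alpha=16 both give scaling=2.0; the larger rank
# just gives the adapter more capacity, independent of the model's embedding dim.
```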
checkpoint_files: [
  model-00001-of-00030.safetensors,
  model-00002-of-00030.safetensors,
  model-00003-of-00030.safetensors,
  model-00004-of-00030.safetensors,
  model-00005-of-00030.safetensors,
  model-00006-of-00030.safetensors,
  model-00007-of-00030.safetensors,
  model-00008-of-00030.safetensors,
  model-00009-of-00030.safetensors,
  model-00010-of-00030.safetensors,
  model-00011-of-00030.safetensors,
  model-00012-of-00030.safetensors,
  model-00013-of-00030.safetensors,
  model-00014-of-00030.safetensors,
  model-00015-of-00030.safetensors,
  model-00016-of-00030.safetensors,
  model-00017-of-00030.safetensors,
  model-00018-of-00030.safetensors,
  model-00019-of-00030.safetensors,
  model-00020-of-00030.safetensors,
  model-00021-of-00030.safetensors,
  model-00022-of-00030.safetensors,
  model-00023-of-00030.safetensors,
  model-00024-of-00030.safetensors,
  model-00025-of-00030.safetensors,
  model-00026-of-00030.safetensors,
  model-00027-of-00030.safetensors,
  model-00028-of-00030.safetensors,
  model-00029-of-00030.safetensors,
  model-00030-of-00030.safetensors,
] |
Lol we really need to have a way to generate these programmatically
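For what it's worth, a throwaway sketch (a hypothetical helper, not an existing torchtune utility) of generating the shard list instead of typing all 30 entries by hand:

```python
# Hypothetical helper, not part of torchtune: emit the sharded safetensors
# filenames so they can be pasted (or templated) into the YAML config.
def safetensors_shards(num_shards: int) -> list[str]:
    return [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]

print("\n".join(safetensors_shards(30)))
# model-00001-of-00030.safetensors ... model-00030-of-00030.safetensors
```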
@rohan-varma Have you tested if full-weight training works with 8x80GB? Maybe it works if we use 8-bit AdamW?
@musabgultekin I've dug into this a bit, and haven't gotten it into a full working state yet. This is something on our immediate list of priorities, so stay tuned!
Thank you! Will be following you!
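For reference on the 8-bit AdamW idea, an untested sketch (assuming `bitsandbytes` is installed; not something this PR ships): 8-bit optimizer states cut AdamW's per-parameter optimizer memory from roughly 8 bytes to roughly 2 bytes, which is the main lever being suggested above.

```python
import bitsandbytes as bnb
import torch.nn as nn

# Untested sketch: swap in bitsandbytes' 8-bit AdamW to shrink optimizer-state
# memory. Whether this is enough for full-weight 70B training on 8x80GB is
# exactly the open question in the thread above.
model = nn.Linear(4096, 4096)  # stand-in for the real model
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)
```

In a torchtune config this would presumably be wired up through the optimizer's `_component_` field rather than constructed by hand, but that path is untested here.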
Context
- Checkpoints are downloaded with `tune download` and we use the meta checkpointer.

Changelog
- Pass `--ignore-patterns "original/consolidated*"` to `tune download` to be able to get the safetensors.

Test plan
- `tune download meta-llama/Meta-Llama-3-70b --hf-token <> --output-dir /tmp/Meta-Llama-3-70b --ignore-patterns "original/consolidated*"`
- `tune run --nproc_per_node 8 lora_finetune_distributed --config recipes/configs/llama3/70B_lora.yaml`