Switch to fsdp training #81
Conversation
Signed-off-by: Michael Clifford <[email protected]>
just a small question, otherwise LGTM
@@ -191,7 +191,7 @@ def list_phase1_final_model():
export XDG_CACHE_HOME=/tmp
export HF_HOME=/tmp
export TRANSFORMERS_CACHE=/tmp
- torchrun --nnodes {nnodes} --nproc_per_node {nproc_per_node} --node_rank \$(RANK) --rdzv_endpoint \$(MASTER_ADDR):\$(MASTER_PORT) -m instructlab.training.main_ds --model_name_or_path={path_to_model} --data_path=/input_data/processed_data/data.jsonl --output_dir=/tmp/model --num_epochs=2 --effective_batch_size=3840 --learning_rate=2e-6 --num_warmup_steps=800 --save_samples=0 --log_level=INFO --max_batch_len=20000 --seed=42 --cpu_offload_optimizer --sharding_strategy=FULL_SHARD --is_granite --checkpoint_at_epoch
+ torchrun --nnodes {nnodes} --nproc_per_node {nproc_per_node} --node_rank \$(RANK) --rdzv_endpoint \$(MASTER_ADDR):\$(MASTER_PORT) -m instructlab.training.main_ds --model_name_or_path={path_to_model} --data_path=/input_data/processed_data/data.jsonl --output_dir=/tmp/model --num_epochs=2 --effective_batch_size=3840 --learning_rate=1e-4 --num_warmup_steps=800 --save_samples=0 --log_level=INFO --max_batch_len=20000 --seed=42 --cpu_offload_optimizer --distributed_training_framework fsdp --is_granite --checkpoint_at_epoch
What is the --learning_rate fix about?
This was just fixing a typo where I forgot to update the learning_rate; it should be the same value in both the master and worker nodes' calls to torchrun.
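For illustration, here is a minimal sketch of defining the shared hyperparameters once and rendering the same torchrun command for both the master and worker pod templates, so the values cannot drift apart. The COMMON_TRAINING_ARGS constant and the build_torchrun_cmd helper are hypothetical names, not code from this repository; the flag values are taken from the diff above.

# Hypothetical sketch: define the hyperparameters once and interpolate them into
# both the master and the worker torchrun commands so the values cannot diverge.
COMMON_TRAINING_ARGS = (
    "--num_epochs=2 "
    "--effective_batch_size=3840 "
    "--learning_rate=1e-4 "
    "--num_warmup_steps=800 "
    "--save_samples=0 "
    "--log_level=INFO "
    "--max_batch_len=20000 "
    "--seed=42 "
    "--cpu_offload_optimizer "
    "--distributed_training_framework fsdp "
    "--is_granite "
    "--checkpoint_at_epoch"
)


def build_torchrun_cmd(nnodes: int, nproc_per_node: int, path_to_model: str) -> str:
    """Render the torchrun command shared by the master and worker pod specs."""
    # The backslash-escaping of $(RANK), $(MASTER_ADDR) and $(MASTER_PORT) is
    # kept exactly as in the original command in the diff.
    return (
        f"torchrun --nnodes {nnodes} --nproc_per_node {nproc_per_node} "
        "--node_rank \\$(RANK) --rdzv_endpoint \\$(MASTER_ADDR):\\$(MASTER_PORT) "
        "-m instructlab.training.main_ds "
        f"--model_name_or_path={path_to_model} "
        "--data_path=/input_data/processed_data/data.jsonl "
        "--output_dir=/tmp/model "
        f"{COMMON_TRAINING_ARGS}"
    )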
LGTM 🚀
Related to #51
This PR simply adds the distributed_training_framework parameter to the torchrun call of our PyTorchJob and sets it to fsdp. This requires the latest RHEL AI image (1.2), since it has the recent updates needed for FSDP. This has been tested on the MOC and seems to work as expected.
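As a rough sketch of where the new flag fits: the helper below is hypothetical and not part of the pipeline (this PR simply hard-codes fsdp in the torchrun call), and the assumption that deepspeed is the other accepted value for this option is mine, not stated in the PR.

# Hypothetical sketch: build the framework selector instead of hard-coding it.
def distributed_framework_flag(framework: str = "fsdp") -> str:
    # Assumed set of accepted values; only fsdp is exercised in this PR.
    supported = {"fsdp", "deepspeed"}
    if framework not in supported:
        raise ValueError(f"unsupported distributed training framework: {framework!r}")
    return f"--distributed_training_framework {framework}"


print(distributed_framework_flag())  # --distributed_training_framework fsdp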