Metric logger improvements #831
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/831
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d66919e with merge base a46560e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hey, can I help you here? Looks similar to what I was working on: #730
@tcapelle thanks, yeah actually this started from trying to get the gradient accumulation test on #730 to pass and kinda expanded from there. If it's easiest for you, I am happy to just let you commandeer this PR so you don't have to go adding the changes to all the other recipes. Let me know what you'd prefer.
recipes/full_finetune_distributed.py (outdated diff)
@@ -107,7 +107,11 @@ def __init__(self, cfg: DictConfig) -> None:
        # logging attributes
        self._output_dir = cfg.output_dir
        self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 1
        self._log_peak_memory_every_n_steps = 100
        self._log_peak_memory_stats = (
Suggested change:
-        self._log_peak_memory_stats = (
+        self._log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)
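For illustration only, a minimal sketch of the pattern this suggestion enables: read the flag with a backward-compatible default, then gate an optional memory metric on it. The config key comes from this PR; the metric name and memory call below are assumptions for the sketch, not torchtune's actual helpers.

```python
import torch
from omegaconf import OmegaConf

cfg = OmegaConf.create({"output_dir": "/tmp/out"})  # an older config without the new key

# Backward-compatible read: an absent key falls back to False.
log_peak_memory_stats = cfg.get("log_peak_memory_stats", False)

log_dict = {"loss": 1.23}
if log_peak_memory_stats and torch.cuda.is_available():
    # Illustrative metric name; torchtune's own memory-stat utilities may differ.
    log_dict["peak_memory_alloc_gib"] = torch.cuda.max_memory_allocated() / 1024**3
```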
recipes/full_finetune_distributed.py (outdated diff)
@@ -107,7 +107,11 @@ def __init__(self, cfg: DictConfig) -> None:
        # logging attributes
        self._output_dir = cfg.output_dir
        self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 1
while you're here
Suggested change:
-        self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 1
+        self._log_every_n_steps = cfg.get("log_every_n_steps", 1)
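To spell out why this works as a drop-in: OmegaConf's `DictConfig.get` behaves like `dict.get`, so configs that omit the key simply fall back to the default. A quick self-contained check with illustrative values:

```python
from omegaconf import OmegaConf

old_cfg = OmegaConf.create({"output_dir": "/tmp/out"})                          # key absent
new_cfg = OmegaConf.create({"output_dir": "/tmp/out", "log_every_n_steps": 5})  # key set

assert old_cfg.get("log_every_n_steps", 1) == 1  # falls back to the default
assert new_cfg.get("log_every_n_steps", 1) == 5  # uses the configured value
```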
log_dict = {
    "loss": loss_to_log,
    "lr": self._optimizer.param_groups[0]["lr"],
    "tokens_per_second": num_tokens / time_per_step,
you need to multiply num_tokens by the number of ranks, no?
Yeah this is a good point. Actually I'm not sure what the right thing to do here is for two reasons. (1) We only log rank-zero loss, and (2) I think a naive multiplication with num_devices will not work. If I understand correctly DistributedSampler will apply collate_fn to each worker's batch separately, in which case we actually have different numbers of tokens on each rank, right?
Yeah, the more I think about it the less I want to change this: (1) it will introduce unneeded syncs to calculate this value properly, and (2) we could just log on every rank if we really want to and aggregate post hoc.
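For completeness, a hedged sketch of what the "proper" global value would take: summing each rank's token count with an all_reduce every time we log, which is exactly the extra sync being avoided. It assumes an initialized process group and a CUDA tensor for the NCCL backend; the names are illustrative, not the recipe's.

```python
import torch
import torch.distributed as dist

def global_tokens_per_second(local_num_tokens: int, time_per_step: float) -> float:
    # Per-rank batches are collated independently, so ranks can see different
    # token counts; a plain multiply by world_size would be wrong.
    tokens = torch.tensor(float(local_num_tokens), device="cuda")
    dist.all_reduce(tokens, op=dist.ReduceOp.SUM)  # extra sync on every logged step
    return tokens.item() / time_per_step
```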
@@ -126,7 +127,8 @@ def test_log_dict(self) -> None:
        assert_expected(tensor_tag.step, 1)


class WandBLoggerTest:
    @pytest.mark.skip(reason="This was never running and needs to be fixed")
lol
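Side note on the "never running" part: by default pytest only collects test classes whose names match `Test*`, so a class named `WandBLoggerTest` is likely skipped silently unless the repo overrides `python_classes`; the explicit `pytest.mark.skip` at least surfaces a reason. A tiny illustration with assumed names:

```python
import pytest

class WandBLoggerTest:  # not collected under pytest's default Test* naming convention
    def test_log(self) -> None:
        assert True

@pytest.mark.skip(reason="This was never running and needs to be fixed")
class TestWandBLogger:  # collected, but explicitly skipped with a visible reason
    def test_log(self) -> None:
        assert True
```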
great work!
This is inspired by @tcapelle's changes in #730, so props to him for getting the ball rolling here.

Quick primer/definition to make what follows clearer: by iteration I mean a single batch from our dataloader; by step I mean a single step with our optimizer. So without gradient accumulation, `iteration == step`; with gradient accumulation there are `gradient_accumulation_steps` iterations per step (maybe the naming of this field is itself confusing tbh).

The main set of changes here is:
- Log once per step rather than once per iteration (the two only differ when using `gradient_accumulation_steps`). A rough sketch of this pattern is included after this list.
- Add a `tokens_per_second` metric. Similarly, this is accumulated over iterations to log the per-step average.
- Remove the `log_peak_memory_every_n_steps=100` field, which (a) was completely arbitrary and (b) bloated up our training loop with multiple `log_dict` calls and multiple different logging frequencies. For BC with existing configs I do a default check in the recipe and set it to False.
Test plan

- Full finetune single device: metric logger outputs
- Full finetune distributed: metric logger outputs
- LoRA finetune single device: metric logger outputs
- LoRA finetune distributed: metric logger outputs
- LoRA DPO single device: metric logger outputs
- LoRA DPO distributed: metric logger outputs