-
Notifications
You must be signed in to change notification settings - Fork 457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More consistent trace names. #1825
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1825
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit bf7d1ac with merge base 7744608 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@RdoubleA Would love to get your comment about this |
Such format is inspired by tensorboard. Probably the best we can do without touching public API |
Could you share an example profile with the updated name? |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1825 +/- ##
==========================================
+ Coverage 67.05% 67.30% +0.24%
==========================================
Files 305 304 -1
Lines 15937 16001 +64
==========================================
+ Hits 10687 10769 +82
+ Misses 5250 5232 -18 ☔ View full report in Codecov by Sentry. |
We can't use |
Hmm, maybe add named argument trace_name: str = "" in trace_handler? But this will require extra attention to user. So I assume current approach might be first step to fix this design issue. |
torchtune/training/_profiler.py
Outdated
@@ -98,7 +99,9 @@ def trace_handler( | |||
# Use tensorboard trace handler rather than directly exporting chrome traces since | |||
# tensorboard doesn't seem to be able to parse traces with prof.export_chrome_trace | |||
exporter = tensorboard_trace_handler( | |||
curr_trace_dir, worker_name=f"rank{rank}", use_gzip=True | |||
curr_trace_dir, | |||
worker_name=f"rank{rank}_" + f"{socket.gethostname()}_{os.getpid()}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry noob question on this choice of worker_name: if I am launching a bunch of runs with profiling on the same host and not keeping track of the pid when I launch, does this actually solve the problem? Like why not instead allow the manual specification of an output filename or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ebsmothers We can do a mamed argument probably. But I was speaking about solution which comes "out of the box". If we will do something like expirement_name: str = "", probably it wan't be usually defined if we don't actually require to define it. Let me update the PR and see if we can do better
Thanks for the PR! having a user experiment name would be neat. I think that having the date is a big deal too. Something like:
what do you think? Maybe if no exp_name is given, it could just default to the model name: 2024_09_14_10_09_32_rank0_llama3_8b |
Looks great! But we cant remove this big time number actually(( it is tensorboard internal |
So for example name in issue case will look like: 2024_09_14_10_09_32_rank0.{Here some big number}_llama3_8b |
since "some big number" doesnt seem to be much informative, maybe this can go last? I also think that model could be always there, and just keep the exp name optional: I just dont know if there is a character limit :/ 2024_09_14_10_09_rank0_llama3_8b_{optional_exp_name}.{Here some big number} and if the trace is being saved anyway inside of the model folder, then we probably dont need the model name: |
Your current idea is fine, but isn't filename too long? Even though, lets probably add exp_name + dates as you proposed |
I agree its long. Maybe rank0 -> r0? model_path/2024_09_14_10_09_r0_{optional_exp_name}.{big number}.trace.json.gzip But i think that it may raise errors because of the length. Some maybe truncate --> optional_exp_name[:max_num_characters]. I think that this works for me, if no one has issues with it. n00b question: is this "big number" already in our current tracing fname? |
Yes, see examples in issue. It is already included in
|
|
No, we can't to do like this because in internal of tensorboard it is done like: |
I don't want to take |
@felipemello1 Maybe do something like:
With this we can remove useless number and do not touch internals |
@RdoubleA @joecummings Can we consider such trick as fine? |
i dont think so. We should be using trace_handler. Let me get a reference for you edit: never mind. I had something like this in mind:
but this is exactly what the tensorboard trace handler is already doing. |
Yeah, I'm using trace_handler, but I'm renaming newest file that came from it with more consistent name that we have chosen. We can't remove "big number" that is pretty useless without touching internals, we also are not really flexible in name choice. And if we use dates as proposed earlier the name will be to big. Now it is pretty strict: |
Without this trick it will be something like this in finetuning cases:
|
Not a big fan. I think it works, but it can be dangerous. Do we know for a fact that if we dont rename it, the name will be too big? |
Yeah, I don't like it much either, It is better though then example that I've shown previously. Another way to get such format - do not use tensorboard_trace_handler or modify it's internals(we can't do it) |
@felipemello1 Easiest solution probably: re-implement this tensorboard.trace_handler and add normal trace_name: str argument. |
Maybe PR to tensorboard? |
The PR or reimplementation are valid options, but i think for the sake of this PR, just having a better date in the name is already much better than what we have. I would be satisfied with it, and we can follow up in another PR, if thats the case. What do you think? |
So we do not touch useless big number right now and just save the date then? |
yes, i am fine with it :) |
done |
awesome, thank you! do you mind running it once with profiler.enabled=True and confirming it works? |
Reproducible code:
|
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses.
#1647
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
pre-commit install
)pytest tests
pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example