More consistent trace names. #1825

Merged
merged 3 commits into from
Oct 15, 2024

Conversation

krammnic
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.
#1647

Changelog

What are the changes made in this PR?

  • PID + socket hostname in trace files names

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Oct 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1825

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bf7d1ac with merge base 7744608:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 13, 2024
@krammnic
Contributor Author

@RdoubleA Would love to get your comment about this

@krammnic
Contributor Author

Such format is inspired by tensorboard. Probably the best we can do without touching public API

@RdoubleA
Contributor

Could you share an example profile with the updated name?

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.30%. Comparing base (54673b7) to head (3479e91).
Report is 15 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1825      +/-   ##
==========================================
+ Coverage   67.05%   67.30%   +0.24%     
==========================================
  Files         305      304       -1     
  Lines       15937    16001      +64     
==========================================
+ Hits        10687    10769      +82     
+ Misses       5250     5232      -18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@krammnic
Contributor Author

rank0-mipt-1465.17883.pt.trace.json.gz

We can't use __name__. We also can't get an experiment name without adding a new argument or relying on something like wandb. We can't remove the time number either, since it is internal to tensorboard. So with this format you can identify the newest run, or a particular experiment, by PID and host. Maybe we can find something better?
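
To make the resulting name concrete, here is a minimal sketch of how the pieces combine. It assumes the `f"{worker_name}.{time.time_ns()}.pt.trace.json"` pattern from tensorboard_trace_handler that is quoted later in this thread. `example_trace_filename` is a hypothetical helper for illustration only; it is not part of torchtune or PyTorch.

```python
import os
import socket
import time


def example_trace_filename(rank: int, use_gzip: bool = True) -> str:
    """Hypothetical helper: mimic the trace file name produced for a rank."""
    # worker_name as set in this PR: rank, host name, and process ID.
    worker_name = f"rank{rank}_" + f"{socket.gethostname()}_{os.getpid()}"
    # The nanosecond timestamp is appended by tensorboard_trace_handler itself.
    file_name = f"{worker_name}.{time.time_ns()}.pt.trace.json"
    return file_name + ".gz" if use_gzip else file_name


print(example_trace_filename(0))
```

The host name and PID vary per machine and per launch, so two runs started on the same host still get different names, but the name alone does not say which experiment a trace belongs to.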

@krammnic
Contributor Author

Hmm, maybe add a named argument trace_name: str = "" to trace_handler? But this would require extra attention from the user. So I assume the current approach might be a first step toward fixing this design issue.

@@ -98,7 +99,9 @@ def trace_handler(
     # Use tensorboard trace handler rather than directly exporting chrome traces since
     # tensorboard doesn't seem to be able to parse traces with prof.export_chrome_trace
     exporter = tensorboard_trace_handler(
-        curr_trace_dir, worker_name=f"rank{rank}", use_gzip=True
+        curr_trace_dir,
+        worker_name=f"rank{rank}_" + f"{socket.gethostname()}_{os.getpid()}",
Contributor

Sorry noob question on this choice of worker_name: if I am launching a bunch of runs with profiling on the same host and not keeping track of the pid when I launch, does this actually solve the problem? Like why not instead allow the manual specification of an output filename or something?

Contributor Author

@ebsmothers We can probably add a named argument. But I was talking about a solution that works "out of the box". If we add something like experiment_name: str = "", it probably won't usually be set unless we actually require it. Let me update the PR and see if we can do better

@felipemello1
Contributor

felipemello1 commented Oct 14, 2024

Thanks for the PR! having a user experiment name would be neat. I think that having the date is a big deal too. Something like:

f"{year}_{month}_{day}_{hour}_{min}_{sec}_rank{rank}_{optional_exp_name}"

what do you think? Maybe if no exp_name is given, it could just default to the model name:

2024_09_14_10_09_32_rank0_llama3_8b
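
A minimal sketch of the proposed format, assuming zero-padded fields as in the example above. `dated_worker_name` is a hypothetical helper, and defaulting to the model name is left out because this thread does not say where that name would come from.

```python
from datetime import datetime


def dated_worker_name(now: datetime, rank: int, exp_name: str = "") -> str:
    """Hypothetical helper: timestamp, rank, and an optional experiment name."""
    # strftime zero-pads every field, so names sort chronologically as text.
    stamp = now.strftime("%Y_%m_%d_%H_%M_%S")
    parts = [stamp, f"rank{rank}"]
    if exp_name:
        parts.append(exp_name)
    return "_".join(parts)


print(dated_worker_name(datetime(2024, 9, 14, 10, 9, 32), 0, "llama3_8b"))
# 2024_09_14_10_09_32_rank0_llama3_8b
```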

@krammnic
Contributor Author

Thanks for the PR! having a user experiment name would be neat. I think that having the date is a big deal too. Something like:

f"{year}_{month}_{day}_{hour}_{min}_{sec}_rank{rank}_{optional_exp_name}"

what do you think? Maybe if no exp_name is given, it could just default to the model name:

2024_09_14_10_09_32_rank0_llama3_8b

Looks great! But we actually can't remove this big time number (( it is internal to tensorboard

@krammnic
Contributor Author

So for example name in issue case will look like:

2024_09_14_10_09_32_rank0.{Here some big number}_llama3_8b

@felipemello1
Contributor

since "some big number" doesnt seem to be much informative, maybe this can go last?

I also think that model could be always there, and just keep the exp name optional:
2024_09_14_10_09_32_rank0_llama3_8b_{optional_exp_name}.{Here some big number}

I just dont know if there is a character limit :/
If "some big number" is usually unique, maybe we can remove the "seconds"

2024_09_14_10_09_rank0_llama3_8b_{optional_exp_name}.{Here some big number}

and if the trace is being saved anyway inside of the model folder, then we probably dont need the model name:
2024_09_14_10_09_rank0_{optional_exp_name}.{Here some big number}

@krammnic
Contributor Author

since "some big number" doesnt seem to be much informative, maybe this can go last?

I also think that model could be always there, and just keep the exp name optional: 2024_09_14_10_09_32_rank0_llama3_8b_{optional_exp_name}.{Here some big number}

I just dont know if there is a character limit :/ If "some big number" is usually unique, maybe we can remove the "seconds"

2024_09_14_10_09_rank0_llama3_8b_{optional_exp_name}.{Here some big number}

and if the trace is being saved anyway inside of the model folder, then we probably dont need the model name: 2024_09_14_10_09_rank0_{optional_exp_name}.{Here some big number}

Your current idea is fine, but isn't the filename too long? Even so, let's probably add exp_name + dates as you proposed

@felipemello1
Contributor

felipemello1 commented Oct 14, 2024

I agree its long. Maybe rank0 -> r0?

model_path/2024_09_14_10_09_r0_{optional_exp_name}.{big number}.trace.json.gzip

But I think that it may raise errors because of the length. So maybe truncate --> optional_exp_name[:max_num_characters]. I think that this works for me, if no one has issues with it.
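
The truncation idea could be sketched as follows. Everything here is an assumption for illustration: `short_worker_name` and `fits_name_limit` are hypothetical helpers, the default of 32 characters is arbitrary, and 255 bytes is the usual file-name limit on Linux filesystems such as ext4, which may differ on your platform.

```python
NAME_MAX_BYTES = 255  # assumption: typical per-file-name limit (e.g. ext4)
SUFFIX = ".pt.trace.json.gz"
TIME_NS_DIGITS = 19  # digits in time.time_ns() for present-day dates


def short_worker_name(
    stamp: str, rank: int, exp_name: str = "", max_num_characters: int = 32
) -> str:
    """Hypothetical helper: '<stamp>_r<rank>' plus a truncated experiment name."""
    base = f"{stamp}_r{rank}"
    trimmed = exp_name[:max_num_characters]
    return f"{base}_{trimmed}" if trimmed else base


def fits_name_limit(worker_name: str) -> bool:
    """Check '<worker_name>.<time_ns>.pt.trace.json.gz' against the limit."""
    total = len(worker_name.encode("utf-8")) + 1 + TIME_NS_DIGITS + len(SUFFIX)
    return total <= NAME_MAX_BYTES


name = short_worker_name("2024_09_14_10_09", 0, "my_very_long_experiment_name" * 4)
print(name, fits_name_limit(name))
```

With these numbers the fixed parts take well under 100 bytes, so a modest cap on the experiment name leaves plenty of headroom.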

n00b question: is this "big number" already in our current tracing fname?

@krammnic
Contributor Author

I agree its long. Maybe rank0 -> r0?

model_path/2024_09_14_10_09_r0_{optional_exp_name}.{big number}.trace.json.gzip

But I think that it may raise errors because of the length. So maybe truncate --> optional_exp_name[:max_num_characters]. I think that this works for me, if no one has issues with it.

n00b question: is this "big number" already in our current tracing fname?

Yes, see examples in issue. It is already included in tensorboard_trace_handler.

file_name = f"{worker_name}.{time.time_ns()}.pt.trace.json"

@krammnic
Contributor Author

krammnic commented Oct 14, 2024

Actually we can do some python magic and try to redefine name as we want. Oh, I have a better idea. let's just save with basic name and rename!

@krammnic
Contributor Author

Actually we can do some python magic and try to redefine name as we want. Oh, I have a better idea. let's just save with basic name and rename!

No, we can't do it like this, because internally tensorboard does:
time.time_ns()

@krammnic
Contributor Author

I don't want to take tensorboard_trace_handler and implement it in our code either

@krammnic
Contributor Author

@felipemello1 Maybe do something like:

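import datetime
import glob
import os

from torch.profiler import tensorboard_trace_handler

# `prof` and `curr_trace_dir` come from the enclosing trace_handler.
exporter = tensorboard_trace_handler(
    curr_trace_dir,
    worker_name="rank0",
    use_gzip=True,
)
exporter(prof)

# Rename the newest trace file, dropping the time.time_ns() component.
latest_trace = max(glob.glob(curr_trace_dir + "/*.pt.trace.json.gz"), key=os.path.getctime)
now = datetime.datetime.now()
os.rename(latest_trace, f"{curr_trace_dir}/r0-{now.year}-{now.month}-{now.day}-{now.hour}-{now.minute}.pt.trace.json.gz")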

With this we can remove the useless number without touching internals
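
For comparison, here is a hedged variant of the rename trick that does not rely on ctime across the whole directory, which could pick up another rank's file if several ranks write to the same place. It is a sketch, not what this PR does: `rename_latest_trace` is a hypothetical helper that only considers files starting with the given worker name and picks the newest one by the time.time_ns() component embedded in the file name.

```python
import os
import tempfile
from datetime import datetime

SUFFIX = ".pt.trace.json.gz"


def rename_latest_trace(trace_dir: str, worker_name: str, now: datetime) -> str:
    """Hypothetical helper: rename the newest trace written by `worker_name`."""
    prefix = worker_name + "."
    stamped = {}
    for file_name in os.listdir(trace_dir):
        if file_name.startswith(prefix) and file_name.endswith(SUFFIX):
            # The middle component is the time.time_ns() value added by the handler.
            middle = file_name[len(prefix):-len(SUFFIX)]
            if middle.isdigit():
                stamped[int(middle)] = file_name
    if not stamped:
        raise FileNotFoundError(f"no trace for {worker_name!r} in {trace_dir!r}")
    latest = stamped[max(stamped)]
    # Zero-padded date fields keep the new names sortable.
    new_name = f"{worker_name}-{now.strftime('%Y-%m-%d-%H-%M')}{SUFFIX}"
    os.rename(os.path.join(trace_dir, latest), os.path.join(trace_dir, new_name))
    return new_name


# Demo with empty placeholder files instead of real traces.
with tempfile.TemporaryDirectory() as trace_dir:
    for ns in (100, 200):
        open(os.path.join(trace_dir, f"rank0.{ns}{SUFFIX}"), "wb").close()
    print(rename_latest_trace(trace_dir, "rank0", datetime(2024, 10, 14, 16, 8)))
    # rank0-2024-10-14-16-08.pt.trace.json.gz
```

Note that minute resolution means two traces renamed within the same minute would collide, so the second os.rename would replace the first file on POSIX systems.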

@krammnic
Contributor Author

@RdoubleA @joecummings Can we consider such trick as fine?

@felipemello1
Contributor

felipemello1 commented Oct 14, 2024

i dont think so. We should be using trace_handler. Let me get a reference for you

edit: never mind. I had something like this in mind:

def trace_handler(prof: torch.profiler.profile):
    prof.export_memory_timeline(f"{name}.json.gz", device="cuda:0")

with torch.profiler.profile(
    ...
    on_trace_ready=trace_handler,
) as prof:

but this is exactly what the tensorboard trace handler is already doing.

@krammnic
Contributor Author

i dont think so. We should be using trace_handler. Let me get a reference for you

Yeah, I'm using trace_handler, but I'm renaming the newest file it produces to the more consistent name we have chosen. Without touching internals we can't remove the "big number", which is pretty useless, and we are also not really flexible in the choice of name. And if we use dates as proposed earlier, the name will be too big. Now it is pretty compact: r0-2024-10-14-16-8.pt.trace.json.gz

@krammnic
Contributor Author

i dont think so. We should be using trace_handler. Let me get a reference for you

Yeah, I'm using trace_handler, but I'm renaming the newest file it produces to the more consistent name we have chosen. Without touching internals we can't remove the "big number", which is pretty useless, and we are also not really flexible in the choice of name. And if we use dates as proposed earlier, the name will be too big. Now it is pretty compact: r0-2024-10-14-16-8.pt.trace.json.gz

Without this trick it will be something like this in finetuning cases:

r0.12345678901123456789.2024-10-14-16-8.pt.trace.json.gz

@felipemello1
Contributor

Not a big fan. I think it works, but it can be dangerous. Do we know for a fact that if we dont rename it, the name will be too big?

@krammnic
Contributor Author

Not a big fan. I think it works, but it can be dangerous. Do we know for a fact that if we dont rename it, the name will be too big?

Yeah, I don't like it much either. It is better, though, than the example I showed previously. The other way to get such a format is to not use tensorboard_trace_handler, or to modify its internals (which we can't do)

@krammnic
Contributor Author

@felipemello1 Probably the easiest solution: re-implement tensorboard_trace_handler and add a normal trace_name: str argument.
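
A rough sketch of what such a re-implementation might look like. It is an illustration under stated assumptions, not code from this PR or from PyTorch: `named_trace_handler` and its `trace_name` parameter are hypothetical, the profiler is only assumed to expose export_chrome_trace(path), and whether TensorBoard accepts a file without the timestamp component has not been verified here.

```python
import os
from typing import Callable, Protocol


class SupportsChromeTrace(Protocol):
    """Anything with export_chrome_trace, e.g. a torch.profiler.profile object."""

    def export_chrome_trace(self, path: str) -> None: ...


def named_trace_handler(
    dir_name: str, trace_name: str, use_gzip: bool = False
) -> Callable[[SupportsChromeTrace], None]:
    """Hypothetical handler factory that takes the full trace name from the caller."""

    def handler_fn(prof: SupportsChromeTrace) -> None:
        os.makedirs(dir_name, exist_ok=True)
        # Keep the .pt.trace.json suffix used by tensorboard_trace_handler,
        # but drop the time.time_ns() component from the name.
        file_name = f"{trace_name}.pt.trace.json"
        if use_gzip:
            file_name += ".gz"
        prof.export_chrome_trace(os.path.join(dir_name, file_name))

    return handler_fn
```

The returned function would be passed as on_trace_ready. One trade-off: with a fixed trace_name, every call writes to the same path, so a profiler schedule that fires the handler more than once would overwrite earlier traces. That clash is what the nanosecond timestamp avoids.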

@krammnic
Contributor Author

Maybe PR to tensorboard?

@felipemello1
Contributor

The PR or reimplementation are valid options, but i think for the sake of this PR, just having a better date in the name is already much better than what we have. I would be satisfied with it, and we can follow up in another PR, if thats the case. What do you think?

@krammnic
Contributor Author

The PR or reimplementation are valid options, but i think for the sake of this PR, just having a better date in the name is already much better than what we have. I would be satisfied with it, and we can follow up in another PR, if thats the case. What do you think?

So we leave the useless big number alone for now and just add the date then?

@felipemello1
Contributor

yes, i am fine with it :)

@krammnic
Contributor Author

yes, i am fine with it :)

done

@felipemello1
Contributor

awesome, thank you! do you mind running it once with profiler.enabled=True and confirming it works?

@krammnic
Contributor Author

krammnic commented Oct 14, 2024

awesome, thank you! do you mind running it once with profiler.enabled=True and confirming it works?

Already have done this:
Screenshot from 2024-10-14 19-12-00

@krammnic
Contributor Author

Reproducible code:

from torch.profiler import profile, record_function, ProfilerActivity
import time
from torchtune.training._profiler import trace_handler


with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    time.sleep(2)

trace_handler(prof, output_dir=".")

@felipemello1 felipemello1 merged commit 4bbed4d into pytorch:main Oct 15, 2024
17 checks passed
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
6 participants