2 allreduce and a allgather hang in multi-node #899
Here is what seems to be the scenario.

Rank A:
Rank B:

So the allgather is launched first, and then Rank A launches the allreduce operations in the reverse order compared to Rank B. There could be many reasons for that to hang:
@sjeaugey Thanks for the quick response!
I use cuDevicePrimaryCtxRetain to create a GPU context, and I create three non-blocking streams (in the same context), one for each operation. Could this cause the hang?
The log from node1 is a gdb thread listing (Threads 1–21, LWPs 8212–8248); the per-thread backtraces did not survive the paste.
The log from node0 is a gdb thread listing (Threads 1–18, with Thread 16 lost to the gdb pager prompt); the per-thread backtraces did not survive the paste.
Hello, I have 2 nodes with 1 GPU in each node.
The NCCL version is 2.14.3, with NCCL_LAUNCH_MODE=GROUP and NCCL_DEBUG_SUBSYS=COLL,NET,P2P.
I have 2 allreduces and 1 allgather on each GPU, but the 3 communication ops may execute out of order, so I create a uniqueId 3 times and call ncclCommInitRank() 6 times (one communicator per op, across the 2 ranks) to make sure each communication gets the right data.
But it sometimes hangs; the hang log is below:
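For clarity, here is a sketch of the communicator setup described above: one communicator per collective, so that operations on different streams cannot cross-match. This is only an illustration of the described setup, not the poster's actual code; `exchange_id()` is a hypothetical out-of-band broadcast (e.g. over MPI or sockets) of the uniqueId created by rank 0.

```c
#include <nccl.h>

/* Hypothetical helper: broadcast the i-th uniqueId from rank 0 to all ranks
 * over some out-of-band channel before communicator creation. */
void exchange_id(ncclUniqueId *id, int i);

ncclComm_t comms[3];            /* one communicator per operation */

void init_comms(int my_rank)    /* my_rank is 0 or 1; 2 ranks in total */
{
    for (int i = 0; i < 3; i++) {
        ncclUniqueId id;
        if (my_rank == 0)
            ncclGetUniqueId(&id);   /* 3 uniqueIds in total */
        exchange_id(&id, i);
        /* 2 ranks x 3 communicators = 6 ncclCommInitRank() calls overall.
         * ncclCommInitRank is itself collective, so both ranks must create
         * the communicators in the same order. */
        ncclCommInitRank(&comms[i], /*nranks=*/2, id, my_rank);
    }
}
```

Note that even with separate communicators, NCCL requires the creation calls (and, on a GPU, the kernels they eventually launch) to be ordered consistently across ranks to avoid deadlock.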