Hello, I am one of the SpECTRE developers. One of our executables built with Charm++ hangs without explanation after running for a long time. After a lot of debugging, we believe we have traced the issue to a message that is definitely sent from one chare but never received by another chare. Since the test case where this happened was fairly complicated, I've developed a minimal example that doesn't depend on SpECTRE and that I believe reproduces the same issue. Attached is a tarball containing the source code needed to run the minimal example.
Some things to note about the issue:

- In the SpECTRE test case, this happened after a single-element array chare had received a large number of messages: roughly 2^32 / 10 ($\approx$ 4.3e8).
- The issue persisted even after checkpoint/restart, so it is likely not caused by running out of memory.
- Even though a message seems to have been dropped and no more messages are being sent, quiescence detection is never triggered.
- In the minimal example, I construct two single-element array chares: the Sender and the Receiver. The Sender sends a total of 2^32 messages ($\approx$ 4.3e9) in batches of 1e8 messages at a time (to avoid running out of memory). The Receiver receives the 1e8 messages, increments a counter, then tells the Sender to send another batch. After around 2.2e9 messages have been sent, the Receiver no longer prints that it is receiving messages, and the executable hangs.
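For readers who don't want to unpack the tarball, the batching handshake described above can be sketched in plain C++ roughly as follows. This is not the code from the attachment and it does not use Charm++ at all: it is a synchronous stand-in, and the names `RunStats`, `run_batches`, and `first_batch_exceeding` are made up for illustration. It also assumes the Receiver increments its counter once per message and that the last batch may be shorter than the others.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical bookkeeping type (not from the attached tarball).
struct RunStats {
  std::uint64_t sent = 0;      // messages the Sender has sent so far
  std::uint64_t received = 0;  // messages the Receiver has counted
  std::uint64_t batches = 0;   // completed batches
};

// Synchronous stand-in for the Sender/Receiver handshake: the Sender sends
// one batch, the Receiver counts every message (assumed: one increment per
// message), then the Receiver asks for the next batch.
inline RunStats run_batches(std::uint64_t total, std::uint64_t batch_size) {
  assert(batch_size > 0);
  RunStats stats;
  while (stats.sent < total) {
    // The final batch may be shorter than batch_size.
    const std::uint64_t n = std::min(batch_size, total - stats.sent);
    for (std::uint64_t i = 0; i < n; ++i) {
      ++stats.received;  // Receiver side: count the message
    }
    stats.sent += n;
    ++stats.batches;  // Receiver side: request another batch
  }
  return stats;
}

// 1-based index of the first batch whose cumulative message count
// (index * batch_size) exceeds `limit`.
inline std::uint64_t first_batch_exceeding(std::uint64_t limit,
                                           std::uint64_t batch_size) {
  assert(batch_size > 0);
  return limit / batch_size + 1;
}
```

With a batch size of 1e8, `first_batch_exceeding(std::numeric_limits<std::int32_t>::max(), 100000000)` returns 22, i.e. a cumulative count of 2.2e9, which is where the minimal example stops making progress. Whether a signed 32-bit quantity is actually involved is not established here; the sketch only makes the arithmetic easy to check.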
Also, this was built/configured for `mpi-linux-x86_64-smp` with Intel MPI as the backend.
Any help in understanding what is happening would be much appreciated.
MessageDrop.tar.gz