Possible message drop when receiving large number of messages #3860

Open
knelli2 opened this issue Dec 10, 2024 · 0 comments

knelli2 commented Dec 10, 2024

Hello, I am one of the SpECTRE developers. One of our executables built with Charm++ would hang, without explanation, after running for a long time. After a lot of debugging, we believe we traced the issue to a message that was definitely sent from one chare but never received by a different chare. Since the test case where this happened was fairly complicated, I've developed a minimal example that doesn't depend on SpECTRE and that I believe also reproduces the issue. Attached is a tarball containing the source code needed to run the minimal example.

MessageDrop.tar.gz

Some things to note about the issue:

  • In the SpECTRE test case, this happened after an array chare with a single element had received a large number of messages, roughly 2^32 / 10 ($\approx$ 4.3e8).
  • The issue persisted even after checkpoint/restart (meaning this likely isn't caused by running out of memory).
  • Even though a message seems to have been dropped and no more messages are being sent, quiescence detection does not trigger.
  • In the minimal example, I construct two single-element array chares: the Sender and the Receiver. The Sender sends a total of 2^32 messages ($\approx$ 4.3e9) in batches of 1e8 messages at a time (to avoid running out of memory). The Receiver receives the 1e8 messages, incrementing a counter for each, then tells the Sender to send another batch. After around 2.2e9 messages have been sent, the Receiver no longer prints that it is receiving messages, and the executable hangs.

Also, this was built and configured for mpi-linux-x86_64-smp, with Intel MPI as the backend.
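
For reference, the batch arithmetic of the minimal example can be sketched in plain C++ with no Charm++ dependency. The helper names below are hypothetical and are not taken from the attached tarball. The sketch only restates the numbers given above and notes that the signed 32-bit limit, 2^31 - 1 ($\approx$ 2.147e9), is crossed during batch 22, which is close to where the hang was observed. This is an observation about the arithmetic, not a confirmed diagnosis.

```cpp
#include <cstdint>
#include <limits>

// Numbers taken from the issue text.
constexpr std::uint64_t total_messages = std::uint64_t{1} << 32;  // 2^32
constexpr std::uint64_t batch_size = 100000000ULL;                // 1e8

// Largest value representable in a signed 32-bit integer (2^31 - 1).
constexpr std::uint64_t int32_max = static_cast<std::uint64_t>(
    std::numeric_limits<std::int32_t>::max());

// Hypothetical helper: number of batches needed to send `total` messages
// in groups of `batch` (ceiling division).
constexpr std::uint64_t num_batches(std::uint64_t total, std::uint64_t batch) {
  return (total + batch - 1) / batch;
}

// Hypothetical helper: 1-based index of the batch during which a running
// message count first exceeds `limit`.
constexpr std::uint64_t batch_exceeding(std::uint64_t limit,
                                        std::uint64_t batch) {
  return limit / batch + 1;
}

// 2^32 messages in batches of 1e8 means 42 full batches plus one partial one.
static_assert(num_batches(total_messages, batch_size) == 43,
              "expected 43 batches");

// After 21 batches the count is 2.1e9, still below 2^31 - 1; the limit is
// crossed while batch 22 is in flight, i.e. before the count reaches 2.2e9.
static_assert(batch_exceeding(int32_max, batch_size) == 22,
              "expected the signed 32-bit limit to be crossed in batch 22");
```

Since the checks are `static_assert`s, compiling the file is enough to verify the numbers; nothing needs to be run.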

Any help you can provide in understanding what is happening would be much appreciated.
