Occasional panic error during state-sync snapshot generation #8047
Comments
Ran into this myself on my mainnet follower. Core dump and logs attached (4GB compressed!). Stack trace:
Additionally, the swing-store export subprocess subsequently exited with the following error:
At this point I believe this is triggered by a node making state-sync snapshots. However, I do not see any of the relevant agoric-sdk code on the stack, so I suspect that we're somehow triggering a bug in the underlying cosmos-sdk or tendermint. I'm also suspicious that we always seem to see GC on the stack; this smells like a use-after-free type bug. Edit: I just realized my node left a partial snapshot behind. It seems to have generated 22 chunks, which would place it very early, around when it would start adding swing_store payloads (if that had even begun by then).
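To make the suspicion concrete, here is a minimal, self-contained Go/cgo sketch (purely hypothetical, not code from agoric-sdk, cosmos-sdk, or tendermint) of the class of bug being described: C code keeping a pointer to Go-managed memory past the call that handed it over, so a later GC pass can reclaim the buffer while the C side still holds the stale pointer.

```go
// Hypothetical illustration (not agoric-sdk code) of the class of bug being
// suspected: C code retains a pointer to Go-managed memory beyond the call,
// so a later GC pass can reclaim the buffer while C still holds the pointer.
package main

/*
#include <stddef.h>
static void *saved = NULL;
static void stash(void *p) { saved = p; }  // keeps the pointer after the call returns
static void *peek(void)    { return saved; }
*/
import "C"

import (
	"fmt"
	"runtime"
	"unsafe"
)

func main() {
	buf := make([]byte, 64)
	// Passing &buf[0] to C is legal for the duration of the call, but stash()
	// retaining it afterwards violates the cgo pointer-passing rules.
	C.stash(unsafe.Pointer(&buf[0]))
	buf = nil
	runtime.GC() // the backing array may now be reclaimed or reused
	// Any dereference of this pointer on the C side from here on is a
	// use-after-free; crashes of that kind often surface later, inside the GC.
	fmt.Printf("stale pointer still held by C: %p\n", C.peek())
}
```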
IIUC, state sync snapshot generation may complete during an inter-block boundary (e.g. between
Where would this zero
It would only happen if there is IPC from JS to Golang outside the context of evaluating a block.
Yeah, that doesn't happen today. And for cosmos snapshots, it's only ever the golang side making a call to JS, which ultimately resolves. That resolution can come in at any time, but I believe it wouldn't trigger this issue, right?
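For what it's worth, here is a minimal sketch of the scenario under discussion, using hypothetical names rather than the real bridge code: a guard like this would make any JS-to-Go reply that arrives outside block evaluation fail loudly instead of silently touching state that may be mid-teardown.

```go
// Minimal sketch, with hypothetical names (not the actual agoric-sdk bridge
// code), of what "IPC from JS to Golang outside the context of evaluating a
// block" would look like, and a guard that surfaces it explicitly.
package bridge

import (
	"fmt"
	"sync"
)

type Dispatcher struct {
	mu      sync.Mutex
	inBlock bool // set by the caller around BeginBlock..Commit
}

func (d *Dispatcher) SetInBlock(v bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.inBlock = v
}

// HandleReplyFromJS is where an asynchronous resolution from the JS side
// would funnel through before being applied to Go-side state.
func (d *Dispatcher) HandleReplyFromJS(msg []byte) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if !d.inBlock {
		return fmt.Errorf("reply of %d bytes received outside block evaluation", len(msg))
	}
	// ... forward to the in-block consumer here ...
	return nil
}
```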
Need to confirm that there are no calls to unsafe.
@JimLarson I updated the post above with an artifact of the partial state-sync snapshot as found on disk.
Now I'm suspicious of IPC returning from JS to Golang outside the context of evaluating a block. I believe that could be related to the #8001 failures we've been seeing, but I'm still investigating.
I just hit this issue again on my follower, and gathered some other observations: It crashed after committing block 8120. According to the new slogging in upgrade-11, the JS side had NOT received the retrieve message yet. Even more interesting, for the snapshot just preceding, JS received the retrieve message right after committing block 6120, and immediately responded (i.e. the swing-store snapshot was ready). This would indicate that golang is crashing right after finishing snapshotting the IAVL tree, likely while cleaning up memory allocated for the process. In the previous crash, the trace indicates the crash occurred after sending the retrieve message but before hearing back from it. In the new trace it hadn't sent the retrieve yet, but the new logging indicates it was about the time the retrieve is sent, after the IAVL part of the snapshot completes. My node is configured with pruning
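To make the timeline above concrete, here is a rough Go sketch, using hypothetical helper names rather than the actual agoric-sdk snapshot code, of the ordering these observations suggest:

```go
// Rough sketch with hypothetical helpers (not the actual agoric-sdk code) of
// the ordering these observations suggest; comments mark where the two
// observed crashes appear to land.
package snapshots

import "fmt"

// controller stands in for the Go-to-JS bridge.
type controller interface {
	// Retrieve asks the JS side for the swing-store export for this height
	// and blocks until it reports that the export is ready.
	Retrieve(height uint64) (exportDir string, err error)
}

func takeSnapshot(height uint64, snapshotIAVL func(uint64) error, vm controller) error {
	// 1. IAVL/multistore portion of the snapshot. Both crashes land right
	//    around the end of this step.
	if err := snapshotIAVL(height); err != nil {
		return fmt.Errorf("iavl snapshot: %w", err)
	}

	// 2. Send "retrieve" to JS. In the earlier crash this had been sent but
	//    not yet answered; in the newer crash it had not been sent yet.
	exportDir, err := vm.Retrieve(height)
	if err != nil {
		return fmt.Errorf("retrieve swing-store export: %w", err)
	}

	// 3. Append the swing-store payloads to the snapshot, then clean up.
	//    Cleanup racing with steps 1-2 would match the theory that golang
	//    crashes while tearing down memory allocated for the snapshot.
	fmt.Println("would append swing-store payloads from", exportDir)
	return nil
}
```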
While I am not sure if the original report had anything to do with state-sync, all occurrences of similar panics on my follower have been during state-sync snapshots. They occur fairly frequently, something like once a week or more often.
Well, apparently pruning wasn't it. I had another panic during state sync, and according to the config, I was well outside the pruning window.
I experienced a crash with similar symptoms, on my follower "agreeable", while executing block 15807086. The swingset code had begun executing that block, and was about 67 deliveries in when the process exited (the slog cut off just after a syscall-result was sent back to the v99 worker). The delivery in flight was a BOYD, triggered by the

In case it helps, my follower is configured with:
The crashdump starts with:
and the main stack appears to be:
The full output is attached below.
Describe the bug
The node process panics and terminates; the logs shown below are generated when the issue occurs.
To Reproduce
I can't reproduce it; the issue occurs randomly while running the validator.
Expected behavior
Platform Environment
git describe --tags --always
Additional context
This has happened a couple of times since the recent upgrade. In case it helps, the block heights at which it occurred are 10698768 and 10715954, respectively.

Screenshots