non-deterministic crash of single node under load #5031
Comments
@Tartuffo I'm still experiencing this on the benchmark machine. I have been unable to gather any kind of daily loadgen number for weeks now. My main concern, however, is that this is a seemingly non-deterministic crash of the kernel process.
Alright @JimLarson I got a core dump. This one terminated with a |
Here is another one. This time for a |
can you run |
Ok I need to grab the binary before it goes poof |
Here is a new archive from a new run on revision a7fb401, including the For context, this is running inside a docker environment, created using https://github.com/Agoric/testnet-load-generator/blob/main/Dockerfile. The |
This is surprising. Do we know which binary the frames with missing info belong to? I know another native piece is the LMDB addon. Can |
Got a
Edit: I managed to load all binaries from the test, and indeed, the culprit seems to be |
I similarly reran
I've found a couple issues of interest on the
I've modified the benchmark machine's Docker image to compile a debug version of node addons, including node-lmdb: Agoric/testnet-load-generator@main...mhofman/core-dump
I've also created a Docker image to more easily start
# Run in testnet-load-generator:
docker build --target dev-env -t loadgen-runner:dev-env .
FROM loadgen-runner:dev-env
RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
&& apt-get -y install --no-install-recommends lldb \
&& apt-get clean -y && rm -rf /var/lib/apt/lists/*
RUN mkdir /out && ln -s /out/src /src
ENTRYPOINT ["bash", "-c", "lldb -c /out/tmp/core-node.*"]
From which we can then do:
docker run --rm -it -v /path/to/loadgen/output:/out loadgen-runner:debug-core-dump
So far I've only encountered back traces with Node's garbage collector on the stack, and haven't seen the |
After a painstaking bisection, it looks like this started happening when #4527 landed. Looking at the description of that PR, it would seem we started exercising the durable and collections logic more extensively, which would stress LMDB. Having seen node-lmdb on the stack in some of the core dumps ( cc @FUDCo, any idea how we could confirm or disprove this hypothesis? How hard would it be to route all swingStore operations to a sub-process? Maybe an ugly hack with synchronous pipes between the kernel and a dedicated sub-process? |
Describe the bug
The benchmark machine has not been able to run a single full loadgen test, failing at varying times while under load. It appears the kernel process gets killed with either SIGSEGV or SIGABRT, most times printing nothing, other times printing a message like free(): invalid next size (fast) or double free or corruption (out). These actually happened on the same revision (6fcb5462e508) when attempting to run the loadgen again.
To Reproduce
Steps to reproduce the behavior:
daily-perf runs, including revision above
Expected behavior
No crash
Platform Environment
6fcb5462e508 and others
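Since the kernel process often dies without printing anything, the exit status may be the only hint as to which signal killed it. The following is a minimal sketch, not part of the loadgen tooling: `describe_exit` is a hypothetical helper name, and it assumes a shell such as bash or dash, which report death by signal as 128 plus the signal number.

```shell
#!/bin/sh
# Hypothetical helper (not from the loadgen repo): describe how a process
# ended, given its exit status as seen by the parent shell.
# Assumption: the shell reports signal deaths as 128 + signal number,
# as bash and dash do.
describe_exit() {
  status=$1
  if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # `kill -l N` prints the signal name without the SIG prefix.
    echo "killed by SIG$(kill -l "$sig")"
  elif [ "$status" -eq 0 ]; then
    echo "exited cleanly"
  else
    echo "exited with status $status"
  fi
}
```

A wrapper script could call `describe_exit $?` right after the command under test returns. On Linux, a status of 139 would map to SIGSEGV and 134 to SIGABRT, the two signals reported above.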