Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-deterministic crash of single node under load #5031

Closed
mhofman opened this issue Apr 6, 2022 · 12 comments · Fixed by #5488
Closed

non-deterministic crash of single node under load #5031

mhofman opened this issue Apr 6, 2022 · 12 comments · Fixed by #5488
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet
Milestone

Comments

@mhofman
Copy link
Member

mhofman commented Apr 6, 2022

Describe the bug

The benchmark machine has not been able to run a single full loadgen test, failing at varying times while under load. It appears the kernel process gets killed with either SIGSEGV or SIGABRT, most times printing nothing, other printing a message like free(): invalid next size (fast) or double free or corruption (out). These actually happened on the same revision (6fcb5462e508) when attempting to run the loadgen again.

To Reproduce

Steps to reproduce the behavior:

  1. Login to benchmark machine
  2. Check logs of recent daily-perf runs, including revision above

Expected behavior

No crash

Platform Environment

  • what OS are you using? what version of Node.js? Node 16.14.0 (same behavior on Node 16.13.1). go 1.17.7.
  • is there anything special/unusual about your platform? Loadgen docker container
  • what version of the Agoric-SDK are you using? 6fcb5462e508 and others
@mhofman mhofman added bug Something isn't working cosmic-swingset package: cosmic-swingset labels Apr 6, 2022
@Tartuffo Tartuffo added this to the Mainnet 1 milestone Apr 8, 2022
@mhofman
Copy link
Member Author

mhofman commented Apr 21, 2022

@Tartuffo I'm still experiencing this on the benchmark machine. I have been unable to gather any kind of daily loadgen number for weeks now, however my concern is that this is a kernel process seemingly non deterministic crash.

@mhofman
Copy link
Member Author

mhofman commented May 6, 2022

Alright @JimLarson I got a core dump. This one terminated with a SIGABRT and printed the message corrupted size vs. prev_size in fastbins before exiting. Revision a1d8251

@mhofman
Copy link
Member Author

mhofman commented May 6, 2022

Here is another one. This time for a SIGSEGV and nothing printed out before kill. Revision a7fb401

@arirubinstein
Copy link
Contributor

can you run gdb agd corefile and post the output? the core is bound to the binary so if it's being rebuilt it needs to be analyzed with the same binary

@mhofman
Copy link
Member Author

mhofman commented May 6, 2022

Ok I need to grab the binary before it goes poof

@mhofman
Copy link
Member Author

mhofman commented May 6, 2022

Here is a new archive from a new run on revision a7fb401, including the agd and node binaries.

For context, this is running inside a docker environment, created using https://github.com/Agoric/testnet-load-generator/blob/main/Dockerfile. The mcr.microsoft.com/vscode/devcontainers/typescript-node base image tag is 0.204.0-16-bullseye

@arirubinstein
Copy link
Contributor

% lldb ./node -c core-node.16038.92a442bab0d5.1651857390                                                                                                                        130 ↵
(lldb) target create "./node" --core "core-node.16038.92a442bab0d5.1651857390"
Core file '/Users/arubinstein/Downloads/core-a7fb4017895f-sigsegv/tmp/core-node.16038.92a442bab0d5.1651857390' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'node', stop reason = signal SIGSEGV
  * frame #0: 0x0000000001578991 node`Builtins_RecordWriteEmitRememberedSetSaveFP + 113
    frame #1: 0x000000000158b0b2 node`Builtins_KeyedStoreIC_Megamorphic + 14130
    frame #2: 0x00007f6ede967a98
    frame #3: 0x00007f6edeb2b941
    frame #4: 0x00007f6edeb3ab71
    frame #5: 0x00007f6ede9d169f
    frame #6: 0x00007f6ede9e7f5e
    frame #7: 0x00007f6ede979f66
    frame #8: 0x0000000001637734 node`Builtins_PromiseConstructor + 2164
    frame #9: 0x0000000001580d92 node`Builtins_JSBuiltinsConstructStub + 242
    frame #10: 0x00007f6edeb893f8
    frame #11: 0x00007f6ede97be5a
    frame #12: 0x0000000001638f8e node`Builtins_PromiseFulfillReactionJob + 46
    frame #13: 0x00000000015a5146 node`Builtins_RunMicrotasks + 614
    frame #14: 0x0000000001581723 node`Builtins_JSRunMicrotasksEntry + 131
    frame #15: 0x0000000000e2334e node`v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) + 1518
    frame #16: 0x0000000000e23be0 node`v8::internal::(anonymous namespace)::InvokeWithTryCatch(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) (.constprop.58) + 80
    frame #17: 0x0000000000e240fa node`v8::internal::Execution::TryRunMicrotasks(v8::internal::Isolate*, v8::internal::MicrotaskQueue*, v8::internal::MaybeHandle<v8::internal::Object>*) + 90
    frame #18: 0x0000000000e4c6ba node`v8::internal::MicrotaskQueue::RunMicrotasks(v8::internal::Isolate*) (.part.50) + 138
    frame #19: 0x0000000000e4ca82 node`v8::internal::MicrotaskQueue::PerformCheckpoint(v8::Isolate*) + 66
    frame #20: 0x00000000015856cc node`Builtins_CallApiCallback + 172
    frame #21: 0x00007f6ede9be687
    frame #22: 0x00000000015817d8 node`Builtins_JSEntryTrampoline + 88
    frame #23: 0x0000000001581563 node`Builtins_JSEntry + 131
    frame #24: 0x0000000000e22ea8 node`v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) + 328
    frame #25: 0x0000000000e23cff node`v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) + 95
    frame #26: 0x0000000000d13ba2 node`v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) + 258
    frame #27: 0x0000000000a42d6f node`node::InternalCallbackScope::Close() + 1167
    frame #28: 0x0000000000a432a5 node`node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) + 677
    frame #29: 0x0000000000a54019 node`node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) + 121
    frame #30: 0x0000000000be6378 node`node::StreamBase::CallJSOnreadMethod(long, v8::Local<v8::ArrayBuffer>, unsigned long, node::StreamBase::StreamBaseJSChecks) (.constprop.131) + 168
    frame #31: 0x0000000000be8dca node`node::EmitToJSStreamListener::OnStreamRead(long, uv_buf_t const&) + 730
    frame #32: 0x0000000000bf3598 node`node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) + 120
    frame #33: 0x000000000156f0e7 node`uv__read(stream=0x0000000006410b50) at stream.c:1247:7
    frame #34: 0x000000000156f910 node`uv__stream_io(loop=<unavailable>, w=0x0000000006410bd8, events=1) at stream.c:1315:5
    frame #35: 0x0000000001575af4 node`uv__io_poll(loop=0x0000000004973c60, timeout=<unavailable>) at epoll.c:374:11
    frame #36: 0x0000000001563d18 node`uv_run(loop=0x0000000004973c60, mode=UV_RUN_DEFAULT) at core.c:389:5
    frame #37: 0x0000000000a43dd5 node`node::SpinEventLoop(node::Environment*) + 309
    frame #38: 0x0000000000b4bab6 node`node::NodeMainInstance::Run(node::EnvSerializeInfo const*) + 406
    frame #39: 0x0000000000acd3f2 node`node::Start(int, char**) + 546
    frame #40: 0x00007f6eed2ccd0a

@mhofman
Copy link
Member Author

mhofman commented May 6, 2022

This is surprising. Do we know the which binary the frames with missing info belong to? I know another native piece is the LMDB addon. Can lldb load multiple binaries?

@mhofman
Copy link
Member Author

mhofman commented May 7, 2022

Got a SIGABRT on revision e80fba2 (coredump). corrupted size vs. prev_size while consolidating was printed out before exiting. Seems to be rooted in Node.js as well

$ lldb ./node -c core-node.9625.ad98cf15592b.1651878983
(lldb) target create "./node" --core "core-node.9625.ad98cf15592b.1651878983"
Core file '/out/tmp/core-node.9625.ad98cf15592b.1651878983' (x86_64) was loaded.

(lldb) bt
* thread #1, name = 'node', stop reason = signal SIGABRT
  * frame #0: 0x00007fa298a5cce1 libc.so.6`raise + 321
    frame #1: 0x00007fa298a46537 libc.so.6`abort + 291
    frame #2: 0x00007fa298a9f768 libc.so.6`___lldb_unnamed_symbol343$$libc.so.6 + 600
    frame #3: 0x00007fa298aa6a5a libc.so.6`___lldb_unnamed_symbol415$$libc.so.6 + 26
    frame #4: 0x00007fa298aa812e libc.so.6`___lldb_unnamed_symbol428$$libc.so.6 + 2062
    frame #5: 0x00007fa2982013bb node-lmdb.node`CursorWrap::~CursorWrap() + 187
    frame #6: 0x0000000000e623df node`v8::internal::GlobalHandles::InvokeFirstPassWeakCallbacks() + 223
    frame #7: 0x0000000000ebe42d node`v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) (.constprop.1307) + 781
    frame #8: 0x0000000000ebeeb0 node`v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) + 864
    frame #9: 0x0000000000f222f4 node`v8::internal::ScavengeJob::Task::RunInternal() + 244
    frame #10: 0x0000000000db59db node`non-virtual thunk to v8::internal::CancelableTask::Run() + 59
    frame #11: 0x0000000000b77524 node`node::PerIsolatePlatformData::RunForegroundTask(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) + 196
    frame #12: 0x0000000000b79389 node`node::PerIsolatePlatformData::FlushForegroundTasksInternal() + 585
    frame #13: 0x00000000015633c6 node`uv__async_io.part.1 at async.c:163:5
    frame #14: 0x0000000001575af4 node`uv__io_poll at epoll.c:374:11
    frame #15: 0x0000000001563d18 node`uv_run(loop=0x0000000004973c60, mode=UV_RUN_DEFAULT) at core.c:389:5
    frame #16: 0x0000000000a43dd5 node`node::SpinEventLoop(node::Environment*) + 309
    frame #17: 0x0000000000b4bab6 node`node::NodeMainInstance::Run(node::EnvSerializeInfo const*) + 406
    frame #18: 0x0000000000acd3f2 node`node::Start(int, char**) + 546
    frame #19: 0x00007fa298a47d0a libc.so.6`__libc_start_main + 234
    frame #20: 0x0000000000a4076c node`_start + 41

Edit: I managed to load all binaries from the test, and indeed, the culprit seem to be node-lmbd

@mhofman
Copy link
Member Author

mhofman commented May 7, 2022

I similarly reran lldb on a7fb401 and I'm not getting much more in the stack, with the missing frames seemingly in linux's vdso

$ lldb -c core-node.16038.92a442bab0d5.1651857390
(lldb) target create --core "core-node.16038.92a442bab0d5.1651857390"
Core file '/out/tmp/core-node.16038.92a442bab0d5.1651857390' (x86_64) was loaded.

(lldb) bt
* thread #1, name = 'node', stop reason = signal SIGSEGV
  * frame #0: 0x0000000001578991 node`Builtins_RecordWriteEmitRememberedSetSaveFP + 113
    frame #1: 0x000000000158b0b2 node`Builtins_KeyedStoreIC_Megamorphic + 14130
    frame #2: 0x00007f6ede967a98
    frame #3: 0x00007f6edeb2b941
    frame #4: 0x00007f6edeb3ab71
    frame #5: 0x00007f6ede9d169f
    frame #6: 0x00007f6ede9e7f5e
    frame #7: 0x00007f6ede979f66
    frame #8: 0x0000000001637734 node`Builtins_PromiseConstructor + 2164
    frame #9: 0x0000000001580d92 node`Builtins_JSBuiltinsConstructStub + 242
    frame #10: 0x00007f6edeb893f8
    frame #11: 0x00007f6ede97be5a
    frame #12: 0x0000000001638f8e node`Builtins_PromiseFulfillReactionJob + 46
    frame #13: 0x00000000015a5146 node`Builtins_RunMicrotasks + 614
    frame #14: 0x0000000001581723 node`Builtins_JSRunMicrotasksEntry + 131
    frame #15: 0x0000000000e2334e node`v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) + 1518
    frame #16: 0x0000000000e23be0 node`v8::internal::(anonymous namespace)::InvokeWithTryCatch(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) (.constprop.58) + 80
    frame #17: 0x0000000000e240fa node`v8::internal::Execution::TryRunMicrotasks(v8::internal::Isolate*, v8::internal::MicrotaskQueue*, v8::internal::MaybeHandle<v8::internal::Object>*) + 90
    frame #18: 0x0000000000e4c6ba node`v8::internal::MicrotaskQueue::RunMicrotasks(v8::internal::Isolate*) (.part.50) + 138
    frame #19: 0x0000000000e4ca82 node`v8::internal::MicrotaskQueue::PerformCheckpoint(v8::Isolate*) + 66
    frame #20: 0x00000000015856cc node`Builtins_CallApiCallback + 172
    frame #21: 0x00007f6ede9be687
    frame #22: 0x00000000015817d8 node`Builtins_JSEntryTrampoline + 88
    frame #23: 0x0000000001581563 node`Builtins_JSEntry + 131
    frame #24: 0x0000000000e22ea8 node`v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) + 328
    frame #25: 0x0000000000e23cff node`v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) + 95
    frame #26: 0x0000000000d13ba2 node`v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) + 258
    frame #27: 0x0000000000a42d6f node`node::InternalCallbackScope::Close() + 1167
    frame #28: 0x0000000000a432a5 node`node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) + 677
    frame #29: 0x0000000000a54019 node`node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) + 121
    frame #30: 0x0000000000be6378 node`node::StreamBase::CallJSOnreadMethod(long, v8::Local<v8::ArrayBuffer>, unsigned long, node::StreamBase::StreamBaseJSChecks) (.constprop.131) + 168
    frame #31: 0x0000000000be8dca node`node::EmitToJSStreamListener::OnStreamRead(long, uv_buf_t const&) + 730
    frame #32: 0x0000000000bf3598 node`node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) + 120
    frame #33: 0x000000000156f0e7 node`uv__read(stream=0x0000000006410b50) at stream.c:1247:7
    frame #34: 0x000000000156f910 node`uv__stream_io(loop=<unavailable>, w=0x0000000006410bd8, events=1) at stream.c:1315:5
    frame #35: 0x0000000001575af4 node`uv__io_poll at epoll.c:374:11
    frame #36: 0x0000000001563d18 node`uv_run(loop=0x0000000004973c60, mode=UV_RUN_DEFAULT) at core.c:389:5
    frame #37: 0x0000000000a43dd5 node`node::SpinEventLoop(node::Environment*) + 309
    frame #38: 0x0000000000b4bab6 node`node::NodeMainInstance::Run(node::EnvSerializeInfo const*) + 406
    frame #39: 0x0000000000acd3f2 node`node::Start(int, char**) + 546
    frame #40: 0x00007f6eed2ccd0a libc.so.6`__libc_start_main + 234
    frame #41: 0x0000000000a4076c node`_start + 41
(lldb) image list
[  0] 1D7C138E-4867-722C-82DE-22BAB51EEF32-32BEB1B0 0x0000000000400000 /usr/local/bin/node
[  1] D71E62A6-0AA1-6E19-5A07-A2D1FF448CB1-860921F3 0x00007ffddc1ef000 [vdso] (0x00007ffddc1ef000)
[  2] 1D921E2B 0x00007ffddc1ef000 linux-vdso.so.1 (0x00007ffddc1ef000)
[  3] 118B9016-1526-D181-8078-18C459BAEE84-1993795B 0x00007f6eed7b8000 /lib/x86_64-linux-gnu/libdl.so.2
[  4] 5CFE96FC-398B-43AC-08C6-6E1CB91D9535-73D3B6F8 0x00007f6eed5eb000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
[  5] E9D2C064-79B1-3DD3-CFA7-8D714D11DCCF-6FCBEE51 0x00007f6eed4a7000 /lib/x86_64-linux-gnu/libm.so.6
[  6] 596409BC-4E94-583E-F18F-141C9B941A46-540868EE 0x00007f6eed48d000 /lib/x86_64-linux-gnu/libgcc_s.so.1
[  7] 5018237B-BF01-2B40-9402-7FD0B96FC22A-24496EA4 0x00007f6eed46b000 /lib/x86_64-linux-gnu/libpthread.so.0
[  8] 2E5ABCEE-94F3-BCBE-D7BB-A094F341070A-2585A2BA 0x00007f6eed2a6000 /lib/x86_64-linux-gnu/libc.so.6
[  9] 32438EB3-B034-DA54-CAF5-8C7A65446639-F7CFE274 0x00007f6eed7c7000 /lib64/ld-linux-x86-64.so.2
[ 10] 5F5028BD-E168-2766-4099-9899C92FC4C0-2A7E76A2 0x00007f6eec263000 /src/node_modules/node-lmdb/build/Release/node-lmdb.node
[ 11] F6DA7FAE-4596-3133-5D96-A0398E0BA3DA-804C87ED 0x00007f6eec252000 /src/golang/cosmos/build/Release/agcosmosdaemon.node
[ 12] E5315A12-20EF-0F30-0308-1EA62744E829-CBCAD18F 0x00007f6ecca8f000 /src/golang/cosmos/build/libagcosmosdaemon.so
[ 13] 8057E07B-12FC-6118-D1C7-23397A05AC90-2A1908A1 0x00007f6edc1e4000 /lib/x86_64-linux-gnu/libnss_files.so.2
[ 14] 5274601B-5E40-F395-CE90-667CC001E3CB-0BA24C9B 0x00007f6edc0d3000 /src/node_modules/bufferfromfile/binding/bufferfromfile.node
[ 15] 68A30F85-23E7-5C78-994A-5F219381B860-1930BF28 0x00007f6c1c47f000 /src/node_modules/better-sqlite3/build/Release/better_sqlite3.node

@mhofman
Copy link
Member Author

mhofman commented May 9, 2022

I've found a couple issues of interest on the node-lmdb repo around cursors and double free issues:

I've modified the benchmark machine's Docker image to compile a debug version of node addons, including node-lmdb: Agoric/testnet-load-generator@main...mhofman/core-dump

I've also created a Docker image to more easily start lldb from the core dump on the same base image:

# Run in testnet-load-generator: docker build --target dev-env -t loadgen-runner:dev-env .
FROM loadgen-runner:dev-env

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends lldb \
    && apt-get clean -y && rm -rf /var/lib/apt/lists/*

RUN mkdir /out && ln -s /out/src /src

ENTRYPOINT ["bash", "-c", "lldb -c /out/tmp/core-node.*"]

From which we can then do:

docker run --rm -it -v /path/to/loadgen/output:/out loadgen-runner:debug-core-dump

So far I've only encountered back traces with Node's garbage collector on the stack, and haven't seen the node-lmdb stacks.

@mhofman mhofman added SwingSet package: SwingSet and removed cosmic-swingset package: cosmic-swingset labels May 9, 2022
@mhofman
Copy link
Member Author

mhofman commented May 30, 2022

After a painstaking bisection, it looks like this started happening when #4527 landed. Looking at the description of that PR, it would seem we started exercising the durable and collections logic more extensively, which would stress LMDB. Having seen node-lmdb on the stack in some of the core dumps (~CursorWrap in particular), I am now back at suspecting node-lmdb is the culprit.

cc @FUDCo, any idea how we could confirm or disprove this hypothesis? How hard would it be to route all swingStore operations to a sub-process? Maybe an ugly hack with synchronous pipes between the kernel and a dedicated sub-process?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants