exit race between main and worker threads #25007
/cc @nodejs/workers @nodejs/process
On a side note: why would the worker module be loaded even in the absence of …? Can we change that? At least in this case, it would save a lot of resources (threads, memory) for use cases that do not require workers.
@bengl was looking at this earlier, but I don't know if he has anything to add (and if he did, he'd probably put it in the other issue). Pinging him here anyway just in case...
As far as I can tell, it’s only the native binding that’s loaded unconditionally (to figure out whether we’re in a worker or not during bootstrap).
That’s not the case;
The memory overhead is probably not huge, and just loading the worker module does not spawn any threads on its own.
```console
$ gdb ./node_g
(gdb) b pthread_create
Breakpoint 1 at 0xd2fe40
(gdb) r
Starting program: ./node_g
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Breakpoint 1, 0x0000000000d2fe40 in pthread_create@plt ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libstdc++-4.8.5-28.el7.x86_64
(gdb) bt
#0 0x0000000000d2fe40 in pthread_create@plt ()
#1 0x0000000000fd8acf in uv_thread_create (tid=0x3396790,
entry=0xe9be78 <node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start()::{lambda(void*)#1}::_FUN(void*)>, arg=0x3395f60) at ../deps/uv/src/unix/thread.c:213
#2 0x0000000000e9bf1a in node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start (
this=0x3395f60) at ../src/node_platform.cc:63
#3 0x0000000000e998e8 in node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner (
this=0x3395bf0, thread_pool_size=4) at ../src/node_platform.cc:178
#4 0x0000000000ea795f in __gnu_cxx::new_allocator<node::WorkerThreadsTaskRunner>::construct<node::WorkerThreadsTaskRunner<int&> > (this=0x7fffffffdcfd, __p=0x3395bf0)
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/ext/new_allocator.h:120
#5 0x0000000000ea68c4 in std::allocator_traits<std::allocator<node::WorkerThreadsTaskRunner> >::_S_construct<node::WorkerThreadsTaskRunner<int&> >(std::allocator<node::WorkerThreadsTaskRunner>&, std::allocator_traits<std::allocator<node::WorkerThreadsTaskRunner> >::__construct_helper*, (node::WorkerThreadsTaskRunner<int&>&&)...) (__a=..., __p=0x3395bf0)
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/alloc_traits.h:253
#6 0x0000000000ea54d5 in std::allocator_traits<std::allocator<node::WorkerThreadsTaskRunner> >::construct<node::WorkerThreadsTaskRunner<int&> >(std::allocator<node::WorkerThreadsTaskRunner>&, node::WorkerThreadsTaskRunner<int&>*, (node::WorkerThreadsTaskRunner<int&>&&)...) (__a=...,
__p=0x3395bf0) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/alloc_traits.h:399
#7 0x0000000000ea37aa in std::__shared_ptr<node::WorkerThreadsTaskRunner, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<node::WorkerThreadsTaskRunner>, int&> (
this=0x7fffffffde30, __tag=..., __a=...)
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:1124
#8 0x0000000000ea1692 in std::shared_ptr<node::WorkerThreadsTaskRunner>::shared_ptr<std::allocator<node::WorkerThreadsTaskRunner>, int&> (this=0x7fffffffde30, __tag=..., __a=...)
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:316
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:316
#9 0x0000000000e9fe4e in std::allocate_shared<node::WorkerThreadsTaskRunner, std::allocator<node::WorkerThreadsTaskRunner>, int&> (__a=...)
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:588
#10 0x0000000000e9e7fb in std::make_shared<node::WorkerThreadsTaskRunner, int&> ()
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:604
#11 0x0000000000e9a18f in node::NodePlatform::NodePlatform (this=0x3395b00,
thread_pool_size=4, tracing_controller=0x3395920) at ../src/node_platform.cc:293
#12 0x0000000000dbbdcc in Initialize (this=0x3359680 <node::v8_platform>, thread_pool_size=4)
at ../src/node.cc:239
#13 0x0000000000dc26d4 in node::InitializeV8Platform (thread_pool_size=4)
at ../src/node.cc:1893
#14 0x0000000000dc2cd7 in node::Start (argc=1, argv=0x338b4d0) at ../src/node.cc:2122
#15 0x0000000001e13199 in main (argc=1, argv=0x7fffffffe048) at ../src/node_main.cc:126
(gdb)
```

@addaleax - this is what I see, am I missing something?
OK, apologies; those are the platform's normal worker-pool threads, created at every bootup. I got confused with
@gireeshpunathil Okay, that clears it up :) I think
OK, I edited the description to take the worker module out of the limelight. I was misled by the word …. So just to clarify (for myself and others): this test case now only serves the purpose of easily reproducing / amplifying the issue through many worker threads; workers are not the real culprits.
I believe that should be unnecessary - opened #25017
As part of debugging #24921, I happened to run the entire CI with the underscored exit (`_exit()`) replacing the normal `exit()` in
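For context, a minimal standalone sketch (not Node.js code) of the difference that swap exploits: `exit()` runs `atexit` handlers and static C++ destructors, while `_exit()` terminates the process immediately, so nothing can race with exit-time destruction.

```cpp
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// Illustrative only: a static object whose destructor runs at exit() time.
struct Global {
  ~Global() { std::puts("static destructor ran"); }
};
static Global g;

int main(int argc, char**) {
  if (argc > 1)
    _exit(0);    // terminates immediately; the destructor above never runs
  std::exit(0);  // runs static destructors: prints "static destructor ran"
}
```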
On another note: if we want to restrict global destructors from being invoked from helper threads (avoiding the word …), each such destructor could begin with `if (!env->is_main_thread()) return;`, leaving that action to the main thread. However, how do we identify such destructors? Do we have a static list somewhere? /cc @addaleax @joyeecheung
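A minimal sketch of the guard idea being proposed here, with hypothetical names (`t_is_main_thread` is an assumption of this sketch; as the next comment points out, there is no safe, general way to make this determination from inside a static destructor):

```cpp
#include <cstdio>

// Hypothetical: mark the thread that runs main() so that exit-time
// destructors of process-global objects can bail out on helper threads.
static thread_local bool t_is_main_thread = false;

struct GlobalState {
  ~GlobalState() {
    if (!t_is_main_thread) return;  // leave destruction to the main thread
    std::puts("tearing down global state on the main thread");
  }
};

static GlobalState g_state;  // destructor runs when exit() is called

int main() {
  t_is_main_thread = true;
  return 0;  // exit() runs ~GlobalState() on this (the main) thread
}
```

Note that if a helper thread is the one calling `exit()`, this guard merely skips the cleanup rather than deferring it, which hints at why the idea does not hold up.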
@gireeshpunathil I don’t think that’s a feasible solution… the problem occurs mostly when the main thread exits, as far as I have seen so far. Also, I don’t think there’s a safe way for us to tell from a destructor whether we are on the main thread or not (and in the context of embedders there might be multiple main threads). As for a list of object destructions that are potentially problematic, here’s the best thing I’ve come up with so far based on looking at the symbol tables:

*Potentially* problematic symbols:

```console
$ nm node | grep '[0-9a-f]* [dDbB]' | egrep -v 'args(_[0-9]*_*)*$' | egrep -v '_ZZ?N2v8' | grep -v '_ZZN6icu_63' | grep N4node | grep -v '_ZTV' | grep -v trace_event_unique_atomic | grep -v available_category | awk '{print $3}' | c++filt | sort
guard variable for node::crypto::NewRootCertStore()::root_certs_vector
guard variable for node::crypto::NewRootCertStore()::root_certs_vector_mutex
node::(anonymous namespace)::Parser::settings
node::(anonymous namespace)::Parser::settings
node::cares_wrap::(anonymous namespace)::ares_library_mutex
node::crypto::extra_root_certs_loaded
node::crypto::Initialize(v8::Local<v8::Object>, v8::Local<v8::Value>, v8::Local<v8::Context>, void*)::init_once
node::crypto::NewRootCertStore()::root_certs_vector
node::crypto::NewRootCertStore()::root_certs_vector_mutex
node::crypto::NodeBIO::GetMethod()::method
node::crypto::root_certs
node::crypto::root_cert_store
node::debug_symbols_generated
node::Environment::kNodeContextTagPtr
node::Environment::Start(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool)::init_once
node::Environment::thread_local_env
node::environ_mutex
node::fs::kPathSeparator
node::http2::Headers::Headers(v8::Isolate*, v8::Local<v8::Context>, v8::Local<v8::Array>)::zero
node::http2::Http2Session::callback_struct_saved
node::http2::Origins::Origins(v8::Isolate*, v8::Local<v8::Context>, v8::Local<v8::String>, unsigned long)::zero
node::http_parser_version
node::init_modpending_once
node::inspector::(anonymous namespace)::start_io_thread_async
node::inspector::(anonymous namespace)::start_io_thread_semaphore
node::inspector::protocol::NodeTracing::TraceConfig::RecordModeEnum::RecordAsMuchAsPossible
node::inspector::protocol::NodeTracing::TraceConfig::RecordModeEnum::RecordContinuously
node::inspector::protocol::NodeTracing::TraceConfig::RecordModeEnum::RecordUntilFull
node::inspector::protocol::StringUtil::kNotFound
node::linux_at_secure
node::llhttp_version
node::loader::EXTENSIONS
node::modlist_addon
node::modlist_builtin
node::modlist_internal
node::modlist_linked
node::node_is_initialized
node::node_isolate
node::node_isolate_mutex
node::options_parser::DebugOptionsParser::instance
node::options_parser::EnvironmentOptionsParser::instance
node::options_parser::PerIsolateOptionsParser::instance
node::options_parser::PerProcessOptionsParser::instance
node::performance::performance_node_start
node::performance::performance_v8_start
node::performance::timeOrigin
node::performance::timeOriginTimestamp
node::per_process_loader
node::per_process::metadata
node::per_process_opts
node::per_process_opts_mutex
node::process_mutex
node::prog_start_time
node::PromiseRejectCallback(v8::PromiseRejectMessage)::rejectionsHandledAfter
node::PromiseRejectCallback(v8::PromiseRejectMessage)::unhandledRejections
node::provider_names
node::reverted
node::SigintWatchdogHelper::instance
node::thread_local_modpending
node::tracing::g_agent
node::url::(anonymous namespace)::hex
node::v8_initialized
node::v8_is_profiling
node::v8_platform
node::worker::(anonymous namespace)::next_thread_id
node::worker::(anonymous namespace)::next_thread_id_mutex
```
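To make the danger concrete, here is a small standalone sketch (hypothetical names, not the actual Node.js code) of the pattern behind entries like `root_certs_vector` / `root_certs_vector_mutex` above: a process-global container and mutex whose exit-time destruction can race with a thread that still uses them.

```cpp
#include <chrono>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

// Process-global state with non-trivial destructors (stand-ins for
// root_certs_vector / root_certs_vector_mutex).
std::mutex g_certs_mutex;
std::vector<int> g_certs;

void worker() {
  for (;;) {
    std::lock_guard<std::mutex> lock(g_certs_mutex);  // may lock a destroyed mutex
    g_certs.push_back(1);                             // may touch a destroyed vector
  }
}

int main() {
  std::thread t(worker);
  t.detach();
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  // exit() destroys g_certs and g_certs_mutex while the detached worker may
  // still be inside the loop: undefined behavior that can surface as
  // SIGSEGV, SIGABRT, or SIGILL depending on timing, just as in this issue.
  std::exit(0);
}
```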
Thanks @addaleax. This is a pretty huge list! So what viable options exist for us, in your opinion? At the moment it looks like many tests are affected, and though it started with harmless-looking tests on AIX, there is evidence that it has spread to some Linux variants. Also, just wondering what recent changes could have triggered this. The AIX CI I work with seems to have the best recreate frequency, so do you advise a bisect route? It is a laborious thing to do and can be inconclusive, but we might get some vital hints.
@gireeshpunathil Most of the things in this list are not even C++ classes but rather primitive types, so those could be omitted… I don’t have a clue as to what might have caused this, or how best to bisect it. And in the worst case, it might just be a timing-based race condition that we exposed through some performance changes. :/
Okay, this one looks like a real bug that could be part of this:
We probably should at least keep
There’s also at least one other possible race condition in the platform code, which I think has been caused by e273abc, but which I don’t quite understand:
Calling `process.exit()` calls the C `exit()` function, which in turn calls the destructors of static C++ objects. This can lead to race conditions with other concurrently executing threads; disposing of all Worker threads and then the V8 platform instance helps with this (although it might not be a full solution for all problems of this kind).

Refs: nodejs#24403
Refs: nodejs#25007
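The ordering the commit message describes can be sketched roughly as follows (hypothetical helper names; the real implementation lives in src/node.cc and src/node_platform.cc): stop everything that might touch process-global state before `exit()` triggers the static destructors.

```cpp
#include <cstdlib>

// Hypothetical stand-ins for the real teardown steps.
static void StopAllWorkerThreads() { /* join or terminate Worker threads */ }
static void DisposeV8Platform()    { /* drain and join the platform thread pool */ }

[[noreturn]] static void ExitProcess(int code) {
  StopAllWorkerThreads();  // no Worker can race with destruction anymore
  DisposeV8Platform();     // platform pool threads are gone as well
  std::exit(code);         // static destructors now run unopposed
}

int main() { ExitProcess(0); }
```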
@gireeshpunathil I just opened #25061 to take up the
I’ve stared at this code for an hour now, becoming somewhat convinced that it’s correct… this seems like bad news? Does “does not fail in 100+ runs” mean that this commit is likely the cause of the issues here? :/ /cc @ofrobots
@addaleax - yes, 500+ runs now, and no failure. So the said commit is very likely to have caused the issues. |
@gireeshpunathil 9f7e3a4 also changed a lot of the code introduced in e273abc … do you think you can tell if the former might be the cause rather than the latter? That would be nice, because it's just cleanup that we could revert without issues (although it would of course be nice to understand this first) |
A number of tests that were marked `flaky` recently have been shown to fail for the reason identified in #25007, with the resolution identified in #25061. Revoke the flaky designation of all these tests, as the said PR has landed.

PR-URL: #25611
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Beth Griggs <[email protected]>
almost all of the tests that had manifested |
Execute many module loads in a worker in a loop while exiting from the main thread at arbitrary execution points, and make sure that the workers quiesce without crashing. `worker_threads` is not necessarily the subject of the testing; it is used for easy simulation of multi-threaded scenarios.

Refs: #25007
PR-URL: #25083
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
This pattern is not observed anymore in CI, thanks to #25061, so this is good to close.
Execute JS code in a worker through the same vm context while exiting from the main thread at arbitrary execution points, and make sure that the workers quiesce without crashing. `worker_threads` is not necessarily the subject of the testing; it is used for easy simulation of multi-threaded scenarios.

Refs: #25007
PR-URL: #25085
Reviewed-By: Anna Henningsen <[email protected]>
Calling `process.exit()` calls the C `exit()` function, which in turn calls the destructors of static C++ objects. This can lead to race conditions with other concurrently executing threads; disposing of all Worker threads and then the V8 platform instance helps with this (although it might not be a full solution for all problems of this kind).

Refs: #24403
Refs: #25007
Backport-PR-URL: #26048
PR-URL: #25061
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
sequential/test-inspector-debug-end and parallel/test-child-process-execfile have been failing on AIX of late. Debugging the core dump suggested that this is a side effect of the exit race that is described in #25007. Mark these as flaky on AIX until that is resolved.

Refs: #25047
Refs: #25029
PR-URL: #25126
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Labels: worker, process, src

Sample test case to reproduce the issue:
The flakiness of the test is largely influenced by the thread scheduling order / number of CPUs / load on the system.
First reported on AIX and Linux through sequential/test-cli-syntax.js. The more you run it, the more variety of failure modes you get: SIGSEGV, SIGABRT, SIGILL... depending on where the main and the worker threads are when the process exits. The root cause is that there is no specified order or identified ownership for the C++ global objects that are destructed across threads.
Refs: #24403