-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct rcl entity lifecycles and fix spurious test failures #386
Conversation
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
…ID set to 99 Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An extensive and thorough refactor. Thank you for this!
I find it unfortunate that we have to resort to a global mutex, as well as adding in a dependency on another crate, but I'm also not really seeing an alternative to the situation at the present. Perhaps in future we'll determine a more elegant solution, but for now this should allow ROS 2 Rust programs to be memory-safe again.
Signed-off-by: Michael X. Grey <[email protected]>
Fair point that we should try to minimize our dependencies. I decided to look into this more, and I found out that the guidance I originally followed for using lazy_static was actually out of date. Recent changes to the stable Rust have made it obsolete, and we can actually take advantage of new language features ( I've update the PR to remove the use of |
As for the use of a global mutex, I agree that it's icky, but it should be extremely low-impact since it's only used very briefly when initializing or dropping a middleware entity, which in almost all applications is extremely seldom, only happens during wind-up/wind-down, and will almost never happen more than once at a time. There's definitely no realistic application where simultaneous locks of that mutex will happen at a higher frequency than our rclrs test suite, and I've seen no qualitative difference in how fast the tests run after adding the global mutex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To quote your comment from the matrix chat,
A major selling point of Rust is fearless parallelism, so I don't want us to be in a position where we put implicit restrictions on users about what they can do across threads
I completely agree with this. We do have a higher bar of correctness simply because of the language. With that said, this problem can be mitigated, but ultimately I don't believe it is solvable unless our dependencies protect the mutation of these globals.
Long term, it is not unreasonable to think that consumers of rclrs
may have multiple packages in a workspace. Rust packages will still be built as independent crates with cargo under the hood, which means we'll have multiple downloads and potentially multiple versions of rclrs
installed, but still linking to the single version of rcl
or rmw
found in the (ROS 2) workspace. A classic example of a configuration management problem.
This "global" lock may end up being a package level lock, which is unfortunate for some features, like enabling colcon test
in colcon-ros-cargo.
I think this is about as good as we can do for now. Long term, I think it is in our best interest to raise issues (and potentially contribute) upstream.
(A little late for me, apologies for any incoherence)
rclrs/src/context.rs
Outdated
/// This is locked whenever initializing or dropping any middleware entity | ||
/// because we have found issues in RCL and some RMW implementations that | ||
/// make it unsafe to simultaneously initialize and/or drop various types of | ||
/// entities. It seems these C and C++ based libraries will regularly use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What types of entites do we need to be careful around besides logging (rcl
) and middleware (rmw
)? If it truly is all entities we can initialize/drop from our FFI deps, then perhaps we're can be more direct and say something to the effect of ... and/or drop entites from these libraries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as tests indicate, it's only what is mentioned here: middleware entities from RCL and RWM implementation (not rmw
itself) libraries. There are some RCL data structures that we do not need to worry about, like rcl_init_options_t
because its initialization and cleanup do not touch any global variables (..as far as I can tell).
I will probably participate in a discussion with rcl maintainers to improve the documentation around the use of global variables within rcl functions, and if that effort is fruitful then we'll be able to say with more confidence which exact functions to be concerned about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the documentation to clarify the full situation a bit: 0efa831
What you're describing here wouldn't happen with this implementation. The "global" lock is only global within a single executable because it lives in static program memory, which is unique per executable. What you're describing would only be an issue if the lock were in shared memory, which is not the case. The global lock only needs to protect against unprotected use of global variables in static program memory which are being performed by rcutils and potentially by any RMW implementation that happens to get used. If there were any unprotected use of shared memory then that would be a critical bug in whatever shared memory library is being used, and that would need to be handled upstream, not by a ROS client library. That being said, even if it were shared between all the programs on a system that are based on rclrs, it would only have a very minor impact on performance as evidenced by its negligible effect on the rclrs test suite, which you could think of as 48 rclrs programs running simultaneously and sharing this global lock. Regarding your feedback on the documentation, I'll try to make the changes you've asked for later today. |
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
…mpty Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
Signed-off-by: Michael X. Grey <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM Grey. Thank you for all your hard work and purist of excellence in rclrs :)
// * This function does not store the ephemeral init_options and c_args pointers. | ||
// * Passing in a zero-initialized rcl_context is mandatory. | ||
// * The entity lifecycle mutex is locked to protect against the risk of global variables | ||
// in the rmw implementation being unsafely modified during initialization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This would actually be in rcl
, not an rmw
implementation right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No the comment is correct as-is. To be more concrete about it: Most of the segfaults we were seeing are coming from FastDDS
via rmw_fastrtps_cpp
, which is an RMW implementation.
The segfaults happen because that particular RMW implementation makes unsafe use of global variables. That problem would exist whether or not RCL is thread-safe.
Note that "RMW implementation" =/= "the implementation of the rmw
library", although the human language aspect of all this is admittedly very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in the unsafe block below, we are calling rcl_init()
, not an RMW implementation function. I'm not as familiar with the internals of rcl
as I should be, but perhaps rcl_init()
is then calling an RMW implementation function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but perhaps rcl_init() is then calling an RMW implementation function?
Correct, client libraries like rclrs
do not ever call any rmw
functions directly. Instead rcl
is the abstraction layer that all client libraries are supposed to interface with. Many (but not all) rcl functions will then call rmw functions as needed, and those rmw functions will hook the calls into whatever RMW implementation has been loaded.
As a rule of thumb, any rcl function that involves sending information out over a middleware (including to form connections and perform discovery) will call into the RMW implementation. But this is something that I would recommend rcl to officially document.
// * The rcl_context is kept alive by the ContextHandle because it is a dependency of the node. | ||
// * The entity lifecycle mutex is locked to protect against the risk of | ||
// global variables in the rmw implementation being unsafely modified during cleanup. | ||
let _lifecycle_lock = ENTITY_LIFECYCLE_MUTEX.lock().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to by consistent with when we acquire this lock. Sometimes if we have what would be a one-liner unsafe function, we acquire the lock inside the unsafe scope. Functionally this doesn't really do anything, but it does increase the amount of unsafe
code we have.
But maybe this is just a broader refactor we can do after this PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that here we just need to put a fresh new scope around the entity lock and unsafe block to avoid holding the entity lock longer than what's actually needed.
This implementation is at a good point right now. I think any future work (i.e. moving the locks out of the unsafe blocks in a few cases) should be done in a follow up PR. Thanks again for the great contributions |
For some time now we've noticed spurious segmentation faults and other test failures in CI. This PR aims to eliminate those entirely. I'll break down the problems we observed, followed by how they are being solved by this PR.
Segmentation Faults
Our initial suspicion was that the segmentation faults were being caused by flaws in the drop order of various rcl entities (e.g.
rcl_context_t
,rcl_node_t
,rcl_publisher_t
, etc). This suspicion had a lot of merit since rcl specifies that all nodes and middleware primitives must be cleaned up (fini
ed) before thercl_context_t
that they came from getsfini
ed. However, the way the rcl objects' lifecycles were being managed, if aNode
struct gets dropped before its associated primitives (Subscription
,Publisher
,Service
,Client
) are dropped, then theirrcl_context_t
will get dropped while thercl_node_t
and various primitive rcl objects are still alive. This violates the contract that rcl stipulates and seemed like a very reasonable explanation for spurious segfaults.Initial solution
To address this, I introduced a
[_]Handle
struct specifically to manage the lifecycle of each rcl binding object that gets used by rclrs. For example we now haveContextHandle
,NodeHandle,
PublisherHandle
, etc. We already had aSubscriptionHandle
which served a similar purpose, and now all the other middleware entity types have a similar handle for themselves. The handle has two roles:Drop
trait such that it cleans up the rcl object correctly.[_]Handle
struct via RAII.Applying this pattern consistently to all types in rclrs provides us with a sound and easily maintained way to ensure that we're always respecting the documented lifecycle contracts of rcl. If our initial suspicion about drop order flaws were correct, this would completely resolve the segmentation faults that we were experiencing. So with bated breath I ran the tests through an aggressive loop and ...it still segfaulted....
The actual cause
While the
Initial Solution
was a worthwhile and legitimate thing to fix, it ultimately turned out to not be the cause of the segfaults we were observing (although it does fix other segfaults that could have happened in other scenarios). Ultimately the conclusive cause of the segfaults was this:rcutils and (much more so) FastDDS perform unprotected mutations to global variables inside of a variety of entity initialization and cleanup functions.
rcutils has some ongoing efforts to weed out its use of unprotected mutations to global variables, but (1) the direction of those efforts is controversial, so there's no telling when they'll be merged, if ever, and (2) the vast vast majority of segmentation faults are originating in FastDDS, which suggests they are making far more extensive use of global variables and might not be likely to change that aspect of their implementation any time soon. Taking all of that into account, I am left to conclude that we must accommodate the assumption that all calls to rcl entity initialization or cleanup functions may be unsafe to run concurrently no matter their arguments nor their relationship to each other.
To deal with this, I've introduced a global (but thread-safe) static
ENTITY_LIFECYCLE_MUTEX
which must be locked while any entity initialization or cleanup function is running. This ensures that no matter what unprotected global variables might be getting modified by rcutils or the rmw implementation, there will not be any data races or race conditions between those calls.Since initialization and cleanup are both extremely low-frequency occurrences and also do not block for any substantial amount of time, I think it is reasonable to have a single global lock for them. The alternative, which is to risk segmentation faults depending on the choice of rmw implementation and the mood that rcutils happens to be in, is something I would consider unacceptable.
Empty Graph Tests
Besides segmentation faults, we also observed sporadic failures in the
test_graph_empty
test case. This is easily explained by two facts:cargo test
by default will run as many tests at once as possible, according to the number of threads available on the current systemtest_graph_empty
assumes that the system's node graph will be empty because there are no nodes, topics, etc being constructed within the scope of the testThe assumption made by (2) is trivially violated by (1) if
test_graph_empty
ever happens to be run at the same time as any other test that does happen to create nodes or middleware primitives.Prior to this PR, the only way to change the domain ID of a context programmatically within an
rclrs
application was to set the environment variableROS_DOMAIN_ID
usingstd::env::set_var
. That would not help in this case because changing the environment variable anywhere within the application would modify it across all new contexts within the application, and all the tests being run in parallel are all being run within the same application.To solve this, I've introduced an API for providing an
InitOptions
struct while creating a newContext
. TheInitOptions
struct is the rclrs equivalent torcl_init_options_t
, which allows users to specify a domain ID for the context that may be different than what is specified by theROS_DOMAIN_ID
environment variable. Taking advantage of this, we can set the domain ID of the context used bytest_graph_empty
to be something different from the rest of the tests.I've arbitrarily chosenEdit: If the99
, but that means tests may fail if the system running the tests happens to have itsROS_DOMAIN_ID
set to 99. We could consider putting in a check and remedy for this if anyone is concerned about making sure our tests have absolutely no possible edge cases.ROS_DOMAIN_ID
environment variable is unset or set to anything besides99
, thentest_graph_empty
will use 99 as its domain ID. But ifROS_DOMAIN_ID
happens to be set to99
then we will use 98 for the domain ID oftest_graph_empty
to avoid conflicts with any other tests that may run simultaneously.