Change default RMW vendor to CycloneDDS. #293
Conversation
Signed-off-by: Chris Lalancette <[email protected]>
There's an interesting number of new failures, which might be expected considering that a lot of tests are only run with the default implementation. It's strange that there aren't similar failures in the CycloneDDS-only Rolling CI job.
Maybe we should investigate the failures before switching, but the code of this PR LGTM!
Yeah, I think we really need to figure out what is going on with the test failures before merging. Otherwise all builds will be yellow going forward. @eboasson and I are planning to take a look at this, though not until early January. Any help in that direction is of course welcome.
I think it is all related to this: ros2/rmw_cyclonedds#271 (comment).
New CI incorporating ros2/rmw_cyclonedds#274:
@clalancette, ros2/rmw_cyclonedds#274 certainly did what it was expected to do 😄 I had a look through the errors and, after filtering out uncrustify and cppcheck issues on Windows and CentOS, a handful of things warranting further investigation remain. I'll try to get my Windows box back up and see if I can easily replicate that problem.

- **Linux**: looks to me like it is unrelated (Connext-specific).
- **Linux-aarch64**: one failure that I do find suspicious: …
- **macOS**: looks to me like it is unrelated (Connext-specific).
- **CentOS**: tons of cppcheck errors, plus: … and a number of them that I suspect may be real (the Mimick warnings are also present in the nightly build, but the test failures are not): …
- **Linux-clang**: compiler warnings only, in urdfdom and tlsf (seems like it must be unrelated).
- **Windows / Windows Debug**: tons of uncrustify errors, plus a substantial number of other failed tests. I looked at a couple of them, and those at least all have the test print "PASSED" but the framework declare a failure with "return code 3221225477". That is 0xC0000005, which is also the exception code used for an access violation, so that looks like a crash on termination: e.g. https://ci.ros2.org/job/ci_windows/13394/testReport/junit/projectroot.test/rclcpp/test_wait_set/
Yep! I've now merged it, so at least that piece of the puzzle is in.
Yeah, the uncrustify errors popped up during break. Our buildfarmer is looking into it, but they have nothing to do with this PR.
While I agree that this looks Connext specific, the one thing that worries me is that this problem doesn't seem to pop up in the Linux release nightlies or repeated nightlies. I'll leave this at the bottom of the pile to worry about.
This one shows up pretty consistently in the nightlies, so it is not the fault of this PR.
Agreed that this one looks suspicious. I don't see any evidence of it in the nightly aarch64 release or aarch64 repeated one. This needs to be looked into.
This one shows up in the nightlies, so we can ignore it.
Yes, we can ignore that one; ros2/sros2#248 is an attempt to fix it.
Yeah, those bear looking into. CentOS isn't currently officially supported, but we do try to keep it working.
Yeah, these are present in the nightlies, so we can ignore them.
OK, yeah, that looks like the biggest problem. @eboasson I'm going to leave the Windows problems to you for now, and take a look at the aarch64 failure.
I think I've caught the Windows crash in the debugger (this is a debug build, BTW); saving it here just to make sure it isn't lost, one never knows with Windows ...

Edit: it's at the line marked with * (in the current state):

```c++
static void clean_waitset_caches()
{
  /* Called whenever a subscriber, guard condition, service or client is deleted (as these may
     have been cached in a waitset), and drops all cached entities from all waitsets (just to keep
     life simple). I'm assuming one is not allowed to delete an entity while it is still being
     used ... */
  std::lock_guard<std::mutex> lock(gcdds.lock);
  for (auto && ws : gcdds.waitsets) {
*   std::lock_guard<std::mutex> wslock(ws->lock);
    if (!ws->inuse) {
      waitset_detach(ws);
    }
  }
}
```
There's some issue with destructor ordering, it seems.
From my side, I looked a bit into the aarch64 failure. On an unloaded machine, I got it to fail only once in about 2 hours of testing. After that, I added some load to the machine, and now I'm able to make it happen fairly regularly (though not every time). It looks like under load, calling … It looks like there can be a delay between …

I also had some time to look into the CentOS failures. They generally fall into two categories that I'll discuss separately below.

**Injection failures**

For example, https://ci.ros2.org/job/ci_linux-centos/94/testReport/junit/(root)/projectroot/test_lifecycle_node/ . The mimick library does not support … This has been a source of irritation in the past, though we have mostly worked around it at this point. It seems that a few more workarounds would be needed to make it fully work with CycloneDDS as the default. It is typically, though not always, possible to rewrite these tests to use …

**Detecting type support failures**

Even with the fix in ros2/rmw_cyclonedds#274 in place, we are still hitting areas of the code on CentOS where an error is set that is just not expected. Because of this, I'm going to look into solving ros2/rosidl_typesupport#100 now. Hopefully that will solve it for good, though we'll need a bunch of testing with different RMW vendors to make sure there are no side effects.
I think this is actually a bug in the …

The global state that is being destroyed is really not very interesting, just a few mutexes, an empty map and an empty set. Perhaps it would be an option to simply do nothing in …

I figured I could avoid the problem by converting the global variable into …

So, if anyone has a good suggestion ...
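For what it's worth, one common C++ workaround for this kind of static destruction order problem is the construct-on-first-use, intentionally leaked singleton. The sketch below is only an illustration of that general idiom, not necessarily any of the suggestions discussed here; `GlobalState`, `global_state()` and `example_use()` are made-up stand-ins for the real bookkeeping in rmw_cyclonedds:

```c++
#include <mutex>
#include <set>

// Stand-in for the real global bookkeeping structure; names are illustrative only.
struct GlobalState
{
  std::mutex lock;
  std::set<void *> waitsets;
};

// Construct-on-first-use, intentionally leaked: the object is never destroyed,
// so there is no global destructor to race with thread teardown at process exit.
// The OS reclaims the memory when the process goes away.
static GlobalState & global_state()
{
  static GlobalState * instance = new GlobalState();
  return *instance;
}

void example_use()
{
  // Callers use the accessor instead of a plain global variable.
  std::lock_guard<std::mutex> guard(global_state().lock);
  global_state().waitsets.clear();  // stand-in for real bookkeeping
}

int main()
{
  example_use();
  return 0;
}
```

The trade-off is that leak checkers will flag the allocation, and any cleanup that genuinely must run at exit still needs another home.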
Arg, yes, global destructor ordering. This has bitten us before.
Yeah, that seems too hacky.
I also couldn't make it work here with gcc, even with …
I don't think that we should rely on that.
Two suggestions: …
Another option: …
OK, ros2/rosidl_typesupport#102 should solve most of the remaining issues on CentOS. I'm still not certain that is the way we want to go, but I'll gather additional opinions on that PR. Between that and ros2/rmw_cyclonedds#275, I think we may be pretty close to getting this in.
All right. With ros2/rmw_cyclonedds#275 merged, we can look at the CI run in ros2/rmw_cyclonedds#275 (comment) again.
So I think the one remaining issue we have to deal with before merging this is the Linux …
A complication for me is that I don't have Connext and probably never will (RTI is quite sensitive about such things), so I can't try anything involving Connext myself. Something that comes to mind that might play a role is the ROS 2 daemon. If that one is perchance started prior to these tests and then remains running, it would be running in a slightly different environment with this PR in. I don't know when exactly that thing gets started, but you could try running the CI limiting testing to …
Well, that is interesting. I ran a job just running the …

So at the very least, it looks like there is some state being kept around and potentially causing problems. I'll add some code to kill off the daemon before running these tests, and see if that is any better when running the full suite.
It does look like killing the daemon before starting the tests improved things. Here is another CI with this PR and the contents of …
Looking through the test failures in the last CI, most of them are known. However, after looking at it more carefully, there is one test failure on Windows which seems unique to this PR. That test failure is in rclcpp's …

Even so, I'm going to go ahead and merge this so that we can start doing testing beyond just the CI. Thanks for all of the work so far!
The … Or more precisely, it's caused by Windows, and those are the two lines that trigger it: … In these two locations, the graph_listener thread defined at https://github.com/ros2/rclcpp/blob/56a037a3da15a7b3a5902a6cce2ef6b7c4aa8599/rclcpp/src/rclcpp/graph_listener.cpp#L109 is started, and it is then still blocked on a waitset by the time Windows decides that thread really shouldn't exist. Indiscriminately killing that thread from a global destructor then corrupts the internal state of Cyclone and leaves it blocking on trying to delete the entity.

I'd say Windows needs to be eliminated for the good of humankind, or, if that's impossible, fixed, or, assuming that too is an impossibility, those …

See below for a recording of a stripped-down version of …
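To make that failure mode concrete, here is a minimal, self-contained sketch (an illustration only, not code from rclcpp or rmw_cyclonedds) of a static-storage destructor that depends on a worker thread at process exit. On any platform this program hangs after `main` returns, because the destructor blocks on a mutex the worker never releases; the Windows case described above is worse still, because `ExitProcess` terminates the worker outright before DLL-level static destructors run, so the lock and whatever it protects can be left in an inconsistent state rather than merely held.

```c++
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

struct Global
{
  std::mutex lock;
  std::atomic<bool> worker_has_lock{false};

  Global()
  {
    // Worker grabs the lock and keeps it for the lifetime of the process,
    // standing in for a thread blocked inside a waitset.
    std::thread([this]() {
        std::lock_guard<std::mutex> guard(lock);
        worker_has_lock = true;
        for (;;) {
          std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
      }).detach();
  }

  ~Global()
  {
    // Runs during exit-time teardown. The worker still owns (or, on Windows,
    // was killed while owning) the mutex, so this blocks forever: the program
    // deliberately never exits, to demonstrate the hazard.
    std::lock_guard<std::mutex> guard(lock);
  }
};

static Global g_global;

int main()
{
  // Wait until the worker really holds the lock, then return from main and
  // let static destruction demonstrate the hang.
  while (!g_global.worker_has_lock) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return 0;
}
```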
IIUC …
@ivanpauno Thanks for the feedback. @eboasson and I discussed this, and he is going to give that a shot to see if it fixes the issue.
I'd like to understand this more. Can someone point me to some documentation or discussion about this buggy (insane?) behavior of Windows, where it can just eliminate a thread before static objects which may or may not interact with it can be destroyed? That sounds broken to the point of seeming impossible to me...
The best reference I've found is this and this. See particularly steps …
I'll read those tomorrow, but it seems really odd to me that we don't see problems elsewhere. Also, it's not obvious to me when …
My understanding is that …

I think that the main wrapper may call destructors from static objects of the executable before calling … To actually check what happens, we would need to check the generated object code or debug process termination step by step, but honestly I prefer not knowing the details and just assuming that I don't know when destructors of static-lifetime objects will be run 😂.
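One part of this that is well defined by the C++ standard, independent of the Windows-specific thread termination above, is the relative order of `std::atexit` handlers and destructors of objects with static storage duration: on the thread performing normal termination, they run in reverse order of registration / completion of construction. A small standalone sketch (illustration only) of that rule:

```c++
#include <cstdio>
#include <cstdlib>

struct Tracer
{
  const char * name;
  explicit Tracer(const char * n)
  : name(n) {std::printf("construct %s\n", name);}
  ~Tracer() {std::printf("destroy %s\n", name);}
};

// Constructed first, therefore destroyed last.
static Tracer early("early");

static void atexit_handler() {std::printf("atexit handler\n");}

// Registered after `early` finishes construction, so it runs before ~early.
[[maybe_unused]] static const bool registered = (std::atexit(atexit_handler) == 0);

// Constructed after the atexit registration, therefore destroyed before it runs.
static Tracer late("late");

int main()
{
  // Expected exit-time output after main returns:
  //   destroy late
  //   atexit handler
  //   destroy early
  return 0;
}
```

What the standard does not pin down is what other threads are doing at that point, which is exactly the platform-specific gap being discussed above.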