insufficient performance in the QoS demo using default parameters #202

Closed

dirk-thomas opened this issue May 24, 2018 · 26 comments
Labels
bug (Something isn't working) · more-information-needed (Further information is required)

Comments

@dirk-thomas
Member

dirk-thomas commented May 24, 2018

The scope of this ticket is focused on the performance of the QoS demo with default values, between a publisher and a subscriber using the same RMW implementation. The cross-vendor results are only mentioned for completeness / context.

To reproduce, run the easiest example from the QoS demo:

  • ros2 run image_tools cam2image -b
  • ros2 run image_tools showimage

Which means:

  • reliable (default)
  • queue depth: 10 (default)
  • publish frequency in Hz: 30 (default)
  • history QoS setting: keep all the samples (default)
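
For reference, here is a minimal sketch of what these settings look like when creating a publisher with the present-day rclcpp QoS API. It is only an illustration: the topic name "image" and the message type are assumptions about the demo, not taken from this ticket.

```cpp
// Minimal sketch only, not the demo source: a publisher created with the
// QoS settings listed above (reliable, keep-all history).
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/image.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("qos_demo_sketch");

  // Reliable + keep-all history: with keep-all the depth is not the limiting
  // factor, which is why a backlog can build up when the reader falls behind.
  rclcpp::QoS qos(rclcpp::KeepAll());
  qos.reliable();

  auto pub = node->create_publisher<sensor_msgs::msg::Image>("image", qos);
  (void)pub;  // cam2image would publish a frame every 1/30 s here

  rclcpp::shutdown();
  return 0;
}
```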

The following results show significant differences in performance depending on which RMW implementation is chosen on the publisher and subscriber side (collected with the default branches on Ubuntu 16.04 on a Lenovo P50). Only the diagonal, marked with "quality" colors, is of interest for now:

| Sub \ Pub | FastRTPS | Connext | OpenSplice |
| --- | --- | --- | --- |
| FastRTPS | (amber) Little stuttering, not smooth | Flawless | Severe increasing lag, multi-seconds within seconds |
| Connext | One second burst, short hang, repeat | (green) Flawless | One second burst, short hang, repeat |
| OpenSplice | Flawless | Stuttering | (green) Flawless |

When increasing the image size to -x 640 -y 480 the problems become even more apparent:

| Sub \ Pub | FastRTPS | Connext | OpenSplice |
| --- | --- | --- | --- |
| FastRTPS | (red) Much more stuttering | Hangs for several seconds between bursts | Severe increasing lag, multi-seconds within seconds |
| Connext | One second burst, short hang, repeat | (green) Flawless | One second burst, short hang, repeat |
| OpenSplice | Flawless | Smooth but significantly reduced framerate (even on publisher side) | (green) Flawless |

The acceptance criteria to resolve this ticket are:

  • without changing the demo itself or the QoS settings
  • the FastRTPS publisher and subscriber should demonstrate "good" performance comparable with the other vendors.

PS: please don't suggest variations of the QoS parameters in this thread but keep this ticket focused on this very specific case. Other cases can be considered separately.

@dirk-thomas dirk-thomas added the bug Something isn't working label May 24, 2018
@codebot
Member

codebot commented May 24, 2018

Another data point: a new ThinkPad X1 (i7-8550U) running Ubuntu 18.04. Qualitative performance is the same as reported by @dirk-thomas :

FastRTPS-FastRTPS

  • 320x240 images @ 30 Hz: the "stuttering" is noticeable. CPU load ~7% for both sender and receiver processes.
  • 640x480 images @ 30 Hz: the "stuttering" is quite severe. In this case, CPU load is ~6% for the sender, and ~100% for the receiver.

Connext-Connext

  • 320x240 images @ 30 Hz: flawless. CPU load ~7% for both sender and receiver processes
  • 640x480 images @ 30 Hz: flawless. CPU load ~50% for both sender and receiver processes

@mikaelarguedas
Member

For people testing this today: if you can afford to rebuild a workspace before testing, please consider using the timestamp_console_logging branch from rcutils (and rclcpp if you build the tests). This will allow us to have timestamps reported in the console output and give more quantitative feedback.

To use it:

  • compile a workspace using the `timestamp_console_logging` branch
  • set the RCUTILS_CONSOLE_OUTPUT_FORMAT variable before running your executables:
    export RCUTILS_CONSOLE_OUTPUT_FORMAT="[{severity}] [{name}] [{time}]: {message}"

Example of resulting output:
Publisher:

[INFO] [cam2image] [1527186403.826753588]: Publishing image #400
[INFO] [cam2image] [1527186403.860162956]: Publishing image #401
[INFO] [cam2image] [1527186403.893428871]: Publishing image #402
[INFO] [cam2image] [1527186403.926734432]: Publishing image #403
[INFO] [cam2image] [1527186403.960066041]: Publishing image #404
[INFO] [cam2image] [1527186403.993477350]: Publishing image #405

Subscriber:

[INFO] [showimage] [1527186403.829860703]: Received image #400
[INFO] [showimage] [1527186403.862650268]: Received image #401
[INFO] [showimage] [1527186403.895934081]: Received image #402
[INFO] [showimage] [1527186403.929688603]: Received image #403
[INFO] [showimage] [1527186403.963225085]: Received image #404
[INFO] [showimage] [1527186403.996462511]: Received image #405

So I can confidently say that for me Fast-RTPS is smooth and doesn't show latency at 30 Hz (3 ms of delay after 400 frames).

@codebot
Member

codebot commented May 24, 2018

Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something. We could compute first-order statistics (median, std. dev., etc.) but those tend not to fully capture the things we worry about, like the worst case or the "fraction of outliers", etc.

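As a strawman, here is a small hypothetical helper (not part of any existing tool) that reports that kind of summary for a set of inter-arrival deltas: median, standard deviation, worst case, and a crude "fraction of outliers". The 50% outlier threshold is an arbitrary choice.

```cpp
// Hypothetical sketch: summarize inter-arrival deltas (in seconds) with the
// median, standard deviation, worst case, and the fraction of samples more
// than 50% away from the nominal 1/30 s period.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

void summarize(std::vector<double> deltas, double nominal_period = 1.0 / 30.0)
{
  if (deltas.empty()) {
    return;
  }
  std::sort(deltas.begin(), deltas.end());
  const double worst = deltas.back();
  const double median = deltas[deltas.size() / 2];

  double mean = 0.0;
  for (double d : deltas) {mean += d;}
  mean /= static_cast<double>(deltas.size());

  double var = 0.0;
  for (double d : deltas) {var += (d - mean) * (d - mean);}
  const double stddev = std::sqrt(var / static_cast<double>(deltas.size()));

  const auto outliers = std::count_if(
    deltas.begin(), deltas.end(),
    [&](double d) {return std::fabs(d - nominal_period) > 0.5 * nominal_period;});

  std::printf(
    "median %.4f s, stddev %.4f s, worst %.4f s, outliers %.1f%%\n",
    median, stddev, worst,
    100.0 * static_cast<double>(outliers) / static_cast<double>(deltas.size()));
}
```
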
@clalancette
Contributor

I tested Fast-RTPS on my Fedora machine as well (all ros2 repos on latest master). The Fast-RTPS -> Fast-RTPS case also stuttered a lot, just like what @dirk-thomas mentioned in the initial comment.

I'm having some trouble rebuilding on the timestamp_console_logging branch that @mikaelarguedas mentioned, but I'll try a bit more and report back here once I figure it out.

@dhood
Member

dhood commented May 24, 2018

Seeing the same stuttering with fastrtps on my 16.04 desktop.

These are some screen captures of the performance:

@clalancette
Contributor

I finally got the timestamps compiling, and took some data on my machine. First, this is with the default of 320x240; the cam2image process took ~12% of a core and showimage took ~10% of a core (so I was not CPU bound). Second, once I took the data I post-processed it with a small python script to calculate the delta between successive publish/receives (I can share that script later if needed).

The cam2image publishing data is here: https://gist.github.com/clalancette/def97b55da9c6c2af52f9a6195b4845d , and it shows about what we expect; we publish an image every ~33ms, for a frame rate of around 30Hz.

The showimage subscribe data is here: https://gist.github.com/clalancette/1d6259b52e6679a96c8f3eafb2ffd9de . It shows that the data is being received in a bursty manner. Looking at lines 57-82, for instance, we see that it is receiving data every ~33 ms, and then at line 83 there is an 'event' that causes the next frame to take ~5 times longer. That causes a backlog of 5 items, which are then processed quickly (lines 84-88), at which point things get back to normal.
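
This is not that script, just a sketch of the same post-processing idea in C++ for anyone who wants to reproduce it: read the timestamped console lines and print the delta between successive frames. The log format is the one shown in the earlier comment; everything else here is an assumption.

```cpp
// Sketch: compute deltas between successive timestamped console lines such as
//   [INFO] [showimage] [1527186403.829860703]: Received image #400
// Reads from stdin and prints the inter-arrival delta in seconds.
#include <iostream>
#include <regex>
#include <string>

int main()
{
  // The first bracketed decimal number on each line is the epoch timestamp.
  const std::regex ts_re{R"(\[(\d+\.\d+)\])"};
  double previous = -1.0;
  std::string line;
  while (std::getline(std::cin, line)) {
    std::smatch m;
    if (!std::regex_search(line, m, ts_re)) {
      continue;
    }
    // double precision loses the last few nanosecond digits, which is fine
    // for millisecond-level deltas.
    const double current = std::stod(m[1]);
    if (previous >= 0.0) {
      std::cout << "delta: " << (current - previous) << " s\n";
    }
    previous = current;
  }
  return 0;
}
```

Piping the publisher or subscriber console output into this gives per-frame deltas similar to the gists above.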

@JaimeMartin
Contributor

We discussed this issue in our daily meeting. We are now working on some performance improvements, including the RMW layer, so it is perfect timing to analyze this case. We will update you guys soon.

@mikaelarguedas
Member

Result of the 640x480 test on the 3 rmw implementations with the following setup:
Dell XPS15, Ubuntu 16.04, send/receive 400 images, image rendering enabled

ros2 run image_tools cam2image -b -x 640 -y 480

640x480:

| | Min | Max | Mean | Median | Stddev | fps | sender CPU | receiver CPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| connext | 0.0300738811 | 0.041087389 | 0.0318081552 | 0.0316724777 | 0.0010073057 | 30.2290740409 | 50.00% | 50.00% |
| opensplice | 0.010027647 | 0.051132679 | 0.022188526 | 0.019367218 | 0.0093266182 | 30.2616755364 | 36.00% | 31.00% |
| fastrtps | 0.0020124912 | 0.1774580479 | 0.0352770768 | 0.0078625679 | 0.0464236557 | 30.6302253416 | 22.00% | 33.00% |

Serious stuttering with FastRTPS, very smooth for the other ones.
Below is a graph of the latency for each rmw implementation.

image

@wjwwood
Member

wjwwood commented May 25, 2018

I experienced the same behavior as @mikaelarguedas on my 2017 macbook pro (macOS 10.13.3).

I noticed a few behaviors of interest to me:

  • @ 320x240 I experience smoothness interrupted (seemingly randomly) with stuttering, basically what @mikaelarguedas's plot shows
  • @ 640x480 I experience uninterrupted smoothness, but a growing latency (sort of what I would expect from reliable + keep all), and notably less or no stuttering
  • @ 640x480 when I ctrl-c the publisher, the subscriber stops ~100 sequence numbers short, which seems to match the latency mentioned in the previous point, and it supports the idea that it is just the history building up because the pub/sub cannot keep up (it should be able to, though, in my opinion)
  • @ 160x120 I see none of the previous behaviors, it is smooth, low latency, and if I ctrl-c the publisher the subscription receives all sequence numbers including the one last printed by the publisher (basically all the behavior I expect if the system is keeping up)

I experienced similar behaviors on my Windows desktop, not exactly the same, but the same trend (enough to convince me that it's not OS dependent at this point). I can provide timestamped data if needed on any platform.

@dejanpan

@andreaspasternak can you weigh in here please?

@richiware
Contributor

In this branch I've decreased the default values for some timers and it improves the responsiveness at the RTPS level. I've noticed it using cam2image and showimage. Can anyone test with this branch?

@mikaelarguedas
Member

mikaelarguedas commented May 29, 2018

Thanks @richiware for looking into it. The behavior on the hotfix branch is much better indeed!

Could you briefly describe the tradeoffs being made with the new timer values?

Result on my XPS15 using the new branch:

| | Min | Max | Mean | Median | Stddev | fps |
| --- | --- | --- | --- | --- | --- | --- |
| connext | 0.0300700665 | 0.0410900116 | 0.0318081648 | 0.0316700935 | 0.0010073107 | 30.2290620882 |
| opensplice | 0.0100297928 | 0.0511300564 | 0.0221883707 | 0.019359827 | 0.0093266318 | 30.261684248 |
| fastrtps | 0.002010107 | 0.1774599552 | 0.0352770948 | 0.0078701973 | 0.0464236037 | 30.6302342667 |
| fastrtps_hotfix | 0.001853466 | 0.0287704468 | 0.0074550388 | 0.0068049431 | 0.0458932319 | 30.2791350882 |

image_demo_latency_hotfix_comparison

With the hotfix and for this test, Fast-RTPS has less latency than the other rmw implementations using default values:
image_demo_latency_hotfix

Note: I haven't tested other frequencies or message sizes

@dirk-thomas
Member Author

The branch definitely shows better performance with the burger demo. Thank you for the quick update 👍

Could you briefly describe the tradeoffs being made with the new timer values?

Yeah, any information on what you expect as "side effects" for other use cases would be good to hear.

@dirk-thomas dirk-thomas added in progress Actively being worked on (Kanban column) more-information-needed Further information is required labels May 31, 2018
@deeplearningrobotics

Hi all,

Regarding this:

Second, once I took the data I post-processed it with a small python script to calculate the delta between successive publish/receives (I can share that script later if needed).

Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something.

Next week we will release as open source a performance measurement tool for various communication systems. Its core features are:

  • Logging of performance metrics such as latency, throughput, lost samples and system utilization to a file.
  • Creating graphical diagrams from the log files using an extensible python script.
  • Easy extensibility to any desired publish/subscribe communication framework.
  • Extensive command line interface which allows configuration of the number of publishers and subscribers, quality of service and real-time settings, message type, publishing frequency and more.
  • Support for analyzing inter-process, intra-process, and inter-machine communication.

We have been working together with eProsima (FastRTPS) to improve performance under real-time conditions for autonomous driving, with great success.
This will also improve the performance of the upcoming new RMW implementation for FastRTPS with static type support (https://github.com/eProsima/rmw_fastrtps/tree/static_typesupport).

In the following graphs, note that the y-axis is latency in ms and the x-axis is the duration of the experiment in seconds.

Using FastRTPS directly on a real-time system with real-time settings shows good results for medium-sized data at 1000 Hz:
log_array32k

Large data also works well at 30 Hz after applying the settings in http://docs.eprosima.com/en/latest/advanced.html#finding-out-system-maximum-values:
log_array2m

We found, for example, that busy waiting (meaning not using a waitset) leads to a very significant performance degradation due to expensive mutex locking/unlocking.
log_array32k_no_waitset
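
A generic illustration of that point (nothing to do with the tool's actual code): a consumer blocking on a condition variable only touches the mutex when there is data, while a busy-waiting consumer locks and unlocks it in a tight loop even when there is nothing to read.

```cpp
// Generic illustration only: the same bounded consumer written once with a
// blocking wait and once busy-waiting on the queue's mutex.
#include <condition_variable>
#include <deque>
#include <mutex>

struct Queue
{
  std::mutex m;
  std::condition_variable cv;
  std::deque<int> items;
};

// Blocking consumer: sleeps inside wait() until a producer notifies, so the
// mutex is only taken when there is actually something to do.
int consume_blocking(Queue & q)
{
  std::unique_lock<std::mutex> lock(q.m);
  q.cv.wait(lock, [&q] {return !q.items.empty();});
  int value = q.items.front();
  q.items.pop_front();
  return value;
}

// Busy-waiting consumer: locks and unlocks the mutex in a tight loop even
// while the queue is empty, burning CPU and creating lock contention.
int consume_busy(Queue & q)
{
  for (;;) {
    std::lock_guard<std::mutex> lock(q.m);
    if (!q.items.empty()) {
      int value = q.items.front();
      q.items.pop_front();
      return value;
    }
  }
}
```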

Using the current dynamic type support of FastRTPS together with ROS 2 does not provide optimal performance, but this will very likely be resolved shortly by the upcoming new RMW implementation for FastRTPS with static type support.
log_array32k_ros2

If you'd like to test other RMW implementations, you can do so using the RMW_IMPLEMENTATION environment variable.

We took care to make the performance test tool real-time capable by:

  • Testing it extensively under RT-Linux with proper CPU isolation and memory locking
  • Preallocating all memory.
  • Carefully using dedicated RT threads which do not use any blocking calls.

Here is the latency at 100 Hz with real-time functionality enabled:
log_array32k_rt

And here it is without real-time functionality:
log_array32k_non_rt

@dirk-thomas
Member Author

dirk-thomas commented Jun 4, 2018

@richiprosima Can you please respond to the above questions? We would like to get this resolved (merged into master) sooner rather than later. Thanks.

@richiware
Contributor

richiware commented Jun 4, 2018

@mikaelarguedas @dirk-thomas I've decreased two QoS values, one for publishers and one for subscribers.

  • For publishers: I've decreased publisher_attr.times.nackResponseDelay. It specifies the time a publisher waits before responding to a subscriber's acknack. Its previous value was 45 ms, and in the patch it is 5 ms. Its purpose is to reduce the number of retries when sending the same requested samples to multiple subscribers.
  • For subscribers: I've decreased subscriber_attr.times.heartbeatResponseDelay. It specifies the time a subscriber waits before responding to a publisher's heartbeat. Its previous value was 116 ms, and in the patch it is 5 ms. Its purpose is to reduce the burst of acknacks.

Decreasing these values improves the response to lost data: a lost sample takes less time to be re-sent. This is most noticeable in the video scenario because you can see the stuttering. Why is data lost? A video frame is too large for the default Linux socket buffer size, so some RTPS fragments are lost.

This week I want to play more with these values and it will be merged at the end of the week.
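
For anyone who wants to experiment with these values by hand in the meantime, here is a rough sketch of where they live when building Fast-RTPS endpoint attributes directly. The header paths and the seconds/fraction layout of Duration_t are assumptions about the Fast-RTPS 1.x API, not something taken from the patch itself.

```cpp
// Rough sketch, not the upstream patch: lowering the two response delays on
// hand-built Fast-RTPS endpoint attributes. Assumes the Fast-RTPS 1.x
// Duration_t layout (seconds plus a fraction of a second in 1/2^32 units).
#include <cstdint>

#include <fastrtps/attributes/PublisherAttributes.h>
#include <fastrtps/attributes/SubscriberAttributes.h>

void lower_response_delays(
  eprosima::fastrtps::PublisherAttributes & publisher_attr,
  eprosima::fastrtps::SubscriberAttributes & subscriber_attr)
{
  // ~5 ms expressed in 1/2^32 fraction-of-a-second units.
  const uint32_t five_ms_fraction =
    static_cast<uint32_t>(0.005 * 4294967296.0);

  // Publisher side: answer a subscriber's acknack after ~5 ms instead of ~45 ms.
  publisher_attr.times.nackResponseDelay.seconds = 0;
  publisher_attr.times.nackResponseDelay.fraction = five_ms_fraction;

  // Subscriber side: answer a publisher's heartbeat after ~5 ms instead of ~116 ms.
  subscriber_attr.times.heartbeatResponseDelay.seconds = 0;
  subscriber_attr.times.heartbeatResponseDelay.fraction = five_ms_fraction;
}
```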

@dirk-thomas
Member Author

@richiware

Could you briefly describe the tradeoffs being made with the new timer values?

@richiware
Contributor

I've updated my comment ↑

@mikaelarguedas
Member

Thanks @richiware for the updated comment!

(sorry if it's obvious) Can you give examples of scenarios where a shorter nackResponseDelay (or heartbeatResponseDelay) would result in worse performance?

As a made-up example of the kind of "scenarios" I'm thinking of: is it possible that in network environments with significant latency, fragments will be considered lost while they may still be on the way, and with a shorter nackResponseDelay we would nack them too early, resulting in an increasing number of resends and collisions?

@richiware
Contributor

nackResponseDelay is also used to reduce the number of heartbeat messages. If several acknack messages, from different subscribers, are received in a short time, they will be answered with one heartbeat instead of one for each acknack.

@matt-attack

It's concerning to me that these parameters are fixed times. It would probably help if they were related to network latency somehow so you don't have to tune them manually for the type of network you are using.

@dirk-thomas
Member Author

This week I want to play more with these values and it will be merged at the end of the week.

@richiware As far as I can see the patch hasn't been merged yet. Can you clarify the timeline when that is planned to happen?

@richiware
Contributor

@dirk-thomas, @MiguelCompany and I are working on a new release. We are integrating all the new features, and also this patch, into our internal develop branch. I want to release it to master tomorrow, Wednesday. Sorry for the inconvenience.

@dirk-thomas
Member Author

With the hotfix merged upstream I will go ahead and close this.

@mikaelarguedas mikaelarguedas removed the in progress Actively being worked on (Kanban column) label Jun 14, 2018
@dejanpan

@dirk-thomas for traceability it would be nice to link to the upstream commit where the fix happened.

@mikaelarguedas
Member

Commit that changed the timer values: eProsima/Fast-DDS@201a393
