RTT calculation of connection is pretty unreliable when _just_ using the library as is #1902

Closed
StarStarJ opened this issue Jun 20, 2024 · 8 comments



StarStarJ commented Jun 20, 2024

Hello everyone,

I am upgrading my app to use QUIC, and I wanted to use the connection's network stats to replace the custom ping-pong calculation I currently use in my app.
This wonderful library already has such network stats implemented, which is pretty neat, since it saves me quite some work.

However, the RTT calculation seems pretty unreliable:
To make sure it's not a problem I caused, I changed this crate's example code and saw similar problems; it seems to be related to the number of packets sent/received:

Here is my small change; I basically just put a loop around the client code and printed the RTT information:
StarStarJ@18c16b6?diff=unified&w=1

The important change is:
StarStarJ@18c16b6?diff=unified&w=1#diff-c1480185b8a920b105ec923f9a63194786ac1688fe25fb07611c9a640ba9a194R153

For example, changing the sleep value from 500 ms to 20 ms significantly decreases the reported RTT value.
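The loop boils down to roughly this (a sketch rather than the exact commit, assuming quinn 0.11's API and the example's request/response code):

```rust
// `connection` is an established quinn::Connection from the example client.
loop {
    // One request/response round trip, as in the original example code.
    let (mut send, mut recv) = connection.open_bi().await?;
    send.write_all(b"GET /index.html\r\n").await?;
    send.finish()?; // signal end of the request (sync in quinn 0.11)
    let _response = recv.read_to_end(64 * 1024).await?;

    // Print quinn's current RTT estimate for this connection.
    println!("rtt: {:?}", connection.rtt());

    // Varying this value (500 -> 20) changes the reported RTT.
    std::thread::sleep(std::time::Duration::from_millis(500));
}
```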

Now the question is: do I have to call some function on the connection to keep this value realistic, or is this a bug?
Since the value only decreases the lower the timeout is, it feels like I have to poll the connection as soon as possible; maybe quinn only sends ACKs while the connection is actively being polled?

In any case, it would be nice to have an example of how to get the best possible RTT estimate with this library, since from the docs and examples alone I couldn't figure out where my reasoning goes wrong.

Thanks in advance


Ralith commented Jun 21, 2024

You are getting bad behavior because you are calling a blocking function, std::thread::sleep, in async code. Never do this. See https://ryhl.io/blog/async-what-is-blocking/.
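In sketch form (assuming a Tokio runtime), the difference is:

```rust
use std::time::Duration;

// BAD: parks the whole runtime worker thread. While it sleeps, tasks
// scheduled on that thread make no progress, so ACK processing and
// RTT sampling stall.
std::thread::sleep(Duration::from_millis(500));

// OK: suspends only this task; the runtime keeps driving other tasks,
// including quinn's internal ones.
tokio::time::sleep(Duration::from_millis(500)).await;
```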

Ralith closed this as not planned Jun 21, 2024

StarStarJ commented Jun 21, 2024

No, that is not the reason, and your conclusion does not hold:

  1. If I have heavy calculations that could also take 500 ms of CPU time (on this single thread, at least), that is why tokio has the rt-multi-thread feature.
  2. Replacing std::thread::sleep(Duration::from_millis(500)); with tokio::time::sleep(Duration::from_millis(500)).await; still shows the exact same problem.

The question still stands: what is required to get the right behavior?
I also tried moving the sending part into a tokio task (delaying the sending process instead, waiting for the previous sending task after opening a new bidirectional stream, etc.) so that open_bi is called as soon as possible, which also didn't fix it.


Ralith commented Jun 21, 2024

> If I have heavy calculations that could also take 500 ms of CPU time (on this single thread, at least), that is why tokio has the rt-multi-thread feature.

No, it isn't. Please read the article I linked, which directly addresses this misconception.
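The pattern the article recommends for long CPU-bound work is to move it off the async worker threads entirely, e.g. (a minimal sketch; expensive_computation is a hypothetical stand-in):

```rust
// Run heavy CPU work on Tokio's dedicated blocking thread pool so the
// async worker threads stay free to drive timers and network I/O.
let result = tokio::task::spawn_blocking(|| {
    expensive_computation() // hypothetical CPU-bound function
})
.await?; // the JoinHandle resolves once the blocking closure finishes
```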

> Replacing std::thread::sleep(Duration::from_millis(500)); with tokio::time::sleep(Duration::from_millis(500)).await; still shows the exact same problem.

It seems to work okay to me. What about the behavior seems not "right" to you?


StarStarJ commented Jun 21, 2024

> For example, changing the sleep value from 500 ms to 20 ms significantly decreases the reported RTT value.

It's about 1 ms off just from changing the sleep value.
My real app seems to suffer even more; it gets very inconsistent, jumping from a few microseconds up to a few milliseconds. (The client sends packets irregularly, sometimes more, sometimes fewer; that is why I simulated the same pattern with the sleep call, which at least showed similar problems.)


Ralith commented Jun 21, 2024

tokio's timer precision is 1 ms, so variation on that scale is expected at a bare minimum, to say nothing of variation in the actual network path latency and peer load. OS timer precision is even worse (~15 ms?) if you're on Windows and you don't make some special winapi calls.
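You can observe the granularity directly by timing a short sleep (a minimal sketch, assuming the default Tokio timer):

```rust
use std::time::{Duration, Instant};

// A nominal 1 ms tokio sleep routinely overshoots by up to ~1 ms,
// because timers fire on the runtime's coarse tick.
for _ in 0..5 {
    let start = Instant::now();
    tokio::time::sleep(Duration::from_millis(1)).await;
    println!("requested 1 ms, slept {:?}", start.elapsed());
}
```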

StarStarJ commented

Mh ok, I am on Linux.

If I run ping -i 1 localhost or ping -i 0.1 localhost, I get very stable ping times, so I was surprised the RTT differed so much depending on packet load.

I guess I'll simply not treat the RTT values as very reliable (at least not to that extent), even though I have to say 15 ms sounds like a huge bug to me. In my app I already see differences/jitter of up to ~8 ms (on average 2-3 ms) using the quinn RTT value; that is longer than my screen's refresh interval.

Anyway, thanks a lot for your time.


Ralith commented Jun 22, 2024

> I get very stable ping times, so I was surprised the RTT differed so much depending on packet load.

ping has different design priorities than tokio. For example, limited timer precision allows scheduling and resetting large numbers of concurrent timers to be very efficient, which can drastically improve server performance. If you really want, you could drop in your own high-precision timer (and pay a higher CPU cost to do so), but for most applications this is not worth the effort: network applications should be designed to tolerate quite a bit more latency jitter than you have reported; e.g., Wi-Fi alone can routinely delay packets by 100+ ms.

> I have to say 15 ms sounds like a huge bug to me

Yes, Windows is very idiosyncratic in this respect. See tokio-rs/tokio#5021.

> In my app I already see differences/jitter of up to ~8 ms

Make sure that you are using a release build and not doing computationally demanding or otherwise blocking work on networking threads.


StarStarJ commented Jun 30, 2024

So I did further tests on top of #1910.

(StarStarJ@2db746a)

Additionally I used
sudo tc qdisc add dev lo root netem delay 300ms 300ms

This should add a delay of 300 ms with a jitter of ±300 ms (0-600 ms per direction) to outgoing packets; in other words, client and server together produce a round-trip jitter range of 0-1200 ms.
Running ping -i 0.02 localhost confirms this.

In the above commit, when I add a delay of 1 second between requests, the RTT estimator mostly agrees with this (the average is around 600 ms, and RTTs above 1 second can be observed).

However, changing the sleep value to 100 ms drastically changes this: the RTT estimate is suddenly much lower, no matter whether I use the latest value or the smoothed value. An average of 200-300 ms is observed, and the maximum RTT is rarely above 800 ms.
The "response received" output, which also measures how long the whole request took, matches the actual jitter range much more closely.

Sadly, since QUIC acknowledges all packets (even DATAGRAMs), running my own ping-pong logic would massively increase the packets per second (up to 4 packets: 2 for the ping-pong plus up to 2 for the ACKs, sent multiple times per second to get accurate insight).
So a stable RTT estimation would be really nice.
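For illustration, such a ping-pong over QUIC DATAGRAMs would look roughly like this (a sketch; it assumes the datagram extension is negotiated and the peer echoes every datagram back):

```rust
use bytes::Bytes;
use std::time::Instant;

// Client side: one application-level RTT sample per round. On the wire this
// costs the ping, the echoed pong, and up to two ACK-only packets on top.
let started = Instant::now();
connection.send_datagram(Bytes::from_static(b"ping"))?;
let _pong = connection.read_datagram().await?;
println!("app-level rtt: {:?}", started.elapsed());
```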

Now, I don't know the QUIC spec in detail; maybe the RTT estimation really is just a best-effort value, but that is not obvious to me.
