RTT calculation of connection is pretty unreliable when _just_ using the library as is #1902

Closed
StarStarJ opened this issue Jun 20, 2024 · 8 comments



StarStarJ commented Jun 20, 2024

Hello everyone,

I am upgrading my app to use QUIC, and I wanted to use the connection's network stats to replace the custom ping-pong calculation I currently use in my app.
This wonderful library already has such network stats implemented, which is pretty neat, since it saves me quite some work.

However, the RTT calculation seems pretty unreliable:
To make sure it's not a problem I caused, I changed this crate's example code and saw similar problems; it seems to be related to the number of packets sent/received:

Here is my small change; I basically just put a loop around the client code and printed the RTT information:
StarStarJ@18c16b6?diff=unified&w=1

The important change is:
StarStarJ@18c16b6?diff=unified&w=1#diff-c1480185b8a920b105ec923f9a63194786ac1688fe25fb07611c9a640ba9a194R153

For example, changing the sleep value from 500 ms to 20 ms significantly decreases the reported RTT value.
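The loop boils down to roughly this (a sketch rather than the exact commit, assuming quinn 0.11's API and the example's request/response code):

```rust
// `connection` is an established quinn::Connection from the example client.
loop {
    // One request/response round trip, as in the original example code.
    let (mut send, mut recv) = connection.open_bi().await?;
    send.write_all(b"GET /index.html\r\n").await?;
    send.finish()?; // signal end of the request (sync in quinn 0.11)
    let _response = recv.read_to_end(64 * 1024).await?;

    // Print quinn's current RTT estimate for this connection.
    println!("rtt: {:?}", connection.rtt());

    // Varying this value (500 -> 20) changes the reported RTT.
    std::thread::sleep(std::time::Duration::from_millis(500));
}
```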

Now the question is: do I have to call some function on the connection to keep this value realistic, or is this a bug?
Since the value only decreases the lower the timeout is, it feels like I have to poll the connection as soon as possible; maybe quinn only sends ACKs while the connection is actively being polled?

In any case, it would be nice to have an example of how to get the best possible RTT estimate with this library, since from the docs and examples alone I couldn't figure out where my reasoning goes wrong.

Thanks in advance


Ralith commented Jun 21, 2024

You are getting bad behavior because you are calling a blocking function, std::thread::sleep, in async code. Never do this. See https://ryhl.io/blog/async-what-is-blocking/.
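In sketch form (assuming a Tokio runtime), the difference is:

```rust
use std::time::Duration;

// BAD: parks the whole runtime worker thread. While it sleeps, tasks
// scheduled on that thread make no progress, so ACK processing and
// RTT sampling stall.
std::thread::sleep(Duration::from_millis(500));

// OK: suspends only this task; the runtime keeps driving other tasks,
// including quinn's internal ones.
tokio::time::sleep(Duration::from_millis(500)).await;
```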

Ralith closed this as not planned Jun 21, 2024

StarStarJ commented Jun 21, 2024

No, that is not the reason, and your conclusion does not hold:

  1. If I have heavy calculations that could also take 500 ms of CPU time (on this single thread, at least), that is why tokio has the rt-multi-thread feature.
  2. Replacing std::thread::sleep(Duration::from_millis(500)); with tokio::time::sleep(Duration::from_millis(500)).await; still shows the exact same problem.

The question still stands: what is required to get the right behavior?
I also tried moving the sending part into a tokio task (delaying the sending process instead, waiting for the previous sending task after opening a new bidirectional stream, etc.) so that open_bi is called as soon as possible, which also didn't fix it.


Ralith commented Jun 21, 2024

> If I have heavy calculations that could also take 500 ms of CPU time (on this single thread, at least), that is why tokio has the rt-multi-thread feature.

No, it isn't. Please read the article I linked, which directly addresses this misconception.
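The pattern the article recommends for long CPU-bound work is to move it off the async worker threads entirely, e.g. (a minimal sketch; expensive_computation is a hypothetical stand-in):

```rust
// Run heavy CPU work on Tokio's dedicated blocking thread pool so the
// async worker threads stay free to drive timers and network I/O.
let result = tokio::task::spawn_blocking(|| {
    expensive_computation() // hypothetical CPU-bound function
})
.await?; // the JoinHandle resolves once the blocking closure finishes
```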

> Replacing std::thread::sleep(Duration::from_millis(500)); with tokio::time::sleep(Duration::from_millis(500)).await; still shows the exact same problem.

It seems to work okay to me. What about the behavior seems not "right" to you?


StarStarJ commented Jun 21, 2024

> For example, changing the sleep value from 500 ms to 20 ms significantly decreases the reported RTT value.

It's about 1 ms off just from changing the sleep value.
My real app seems to suffer even more; it gets very inconsistent, jumping from a few microseconds up to a few milliseconds. (The client sends packets irregularly, sometimes more, sometimes fewer; that is why I simulated the same pattern with the sleep call, which at least showed similar problems.)


Ralith commented Jun 21, 2024

tokio's timer precision is 1 ms, so variation on that scale is expected at a bare minimum, to say nothing of variation in the actual network path latency and peer load. OS timer precision is even worse (~15 ms?) if you're on Windows and you don't make some special winapi calls.
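You can observe the granularity directly by timing a short sleep (a minimal sketch, assuming the default Tokio timer):

```rust
use std::time::{Duration, Instant};

// A nominal 1 ms tokio sleep routinely overshoots by up to ~1 ms,
// because timers fire on the runtime's coarse tick.
for _ in 0..5 {
    let start = Instant::now();
    tokio::time::sleep(Duration::from_millis(1)).await;
    println!("requested 1 ms, slept {:?}", start.elapsed());
}
```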

StarStarJ commented

Mh ok, I am on Linux.

If I run ping -i 1 localhost or ping -i 0.1 localhost, I get very stable ping times, so I was surprised the RTT differed so much depending on packet load.

I guess I'll simply not treat the RTT values as very reliable (at least not to that extent), even though I have to say 15 ms sounds like a huge bug to me. In my app I already see differences/jitter of up to ~8 ms (on average 2-3 ms) using the quinn RTT value; that is longer than my screen's refresh interval.

Anyway, thanks a lot for your time.


Ralith commented Jun 22, 2024

> I get very stable ping times, so I was surprised the RTT differed so much depending on packet load.

ping has different design priorities than tokio. For example, limited timer precision allows scheduling and resetting large numbers of concurrent timers to be very efficient, which can drastically improve server performance. If you really want, you could drop in your own high-precision timer (and pay a higher CPU cost to do so), but for most applications this is not worth the effort: network applications should be designed to tolerate quite a bit more latency jitter than you have reported; e.g., Wi-Fi alone can routinely delay packets by 100+ ms.

> I have to say 15 ms sounds like a huge bug to me

Yes, Windows is very idiosyncratic in this respect. See tokio-rs/tokio#5021.

> In my app I already see differences/jitter of up to ~8 ms

Make sure that you are using a release build and not doing computationally demanding or otherwise blocking work on networking threads.


StarStarJ commented Jun 30, 2024

So I did further tests on top of #1910.

(StarStarJ@2db746a)

Additionally I used
sudo tc qdisc add dev lo root netem delay 300ms 300ms

This should add a delay of 300 ms with a jitter of ±300 ms (0-600 ms per direction) to outgoing packets; in other words, client and server together produce a round-trip jitter range of 0-1200 ms.
Running ping -i 0.02 localhost confirms this.

In the above commit, when I add a delay of 1 second between requests, the RTT estimator mostly agrees with this (the average is around 600 ms, and RTTs above 1 second can be observed).

However, changing the sleep value to 100 ms drastically changes this: the RTT estimate is suddenly much lower, no matter whether I use the latest value or the smoothed value. An average of 200-300 ms is observed, and the maximum RTT is rarely above 800 ms.
The "response received" output, which also measures how long the whole request took, matches the actual jitter range much more closely.

Sadly, since QUIC acknowledges all packets (even DATAGRAMs), running my own ping-pong logic would massively increase the packets per second (up to 4 packets: 2 for the ping-pong plus up to 2 for the ACKs, sent multiple times per second to get accurate insight).
So a stable RTT estimation would be really nice.
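For illustration, such a ping-pong over QUIC DATAGRAMs would look roughly like this (a sketch; it assumes the datagram extension is negotiated and the peer echoes every datagram back):

```rust
use bytes::Bytes;
use std::time::Instant;

// Client side: one application-level RTT sample per round. On the wire this
// costs the ping, the echoed pong, and up to two ACK-only packets on top.
let started = Instant::now();
connection.send_datagram(Bytes::from_static(b"ping"))?;
let _pong = connection.read_datagram().await?;
println!("app-level rtt: {:?}", started.elapsed());
```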

Now, I don't know the QUIC spec in detail; maybe the RTT estimation really is just a best-effort value, but that is not obvious to me.
