Benchmark is mostly idle at 10 connections #29

Closed
uNetworkingAB opened this issue May 7, 2023 · 19 comments
@uNetworkingAB commented May 7, 2023

The current run of only 10 connections is not enough to stress the servers, at least not uWS. Here is a list of differences between uWS and fastwebsockets at different numbers of connections, assuming 1 kB messages. 10 connections has the smallest difference, so it's a natural pick if one wants to convey a minimal diff:

  • at 10 connections the diff is 7%
  • at 100 connections the diff is 9%
  • at 200 connections the diff is 16%
  • at 500 connections the diff is 34%

So it's pretty easy to tell there are scaling issues that aren't being conveyed with the low count of only 10 connections. This can be improved by using more connections.

Edit: oh wow for 16 kB messages the diff is 56% at 200 connections

@uNetworkingAB (Author) commented May 7, 2023

For the next rerun, I have a few relevant changes in v20.41.0.

On master, load_test now takes a byte length, so you can specify any length (it swaps from short to medium to long messages as needed).

@uNetworkingAB (Author)

For 16 kB messages at 500 connections, the diff is more than 100%:

Using message size of 16000 bytes
Running benchmark now...
Msg/sec: 60466.250000
Msg/sec: 60521.250000
Msg/sec: 61029.250000

Using message size of 16000 bytes
Running benchmark now...
Msg/sec: 124614.000000
Msg/sec: 122536.500000

So those graphs are quite misleading as of now

@littledivy (Member) commented May 7, 2023

Can reproduce this 👍

Areas to improve:

  1. Payloads are always copied; they should be a clone-on-write view into a shared recv buffer. I wanted to do this earlier, but Rust lifetimes won't let us do it with the current API.

We also can't use the normal std::borrow::Cow here because masking happens in place and we need a mutable borrow of the recv buffer. Instead, something like this:

// A Cow-like wrapper that can hand out mutable access to either a borrowed
// slice of the shared recv buffer or an owned copy.
pub enum MutCow<'a, B>
where
    B: 'a + ToOwned + ?Sized,
    <B as ToOwned>::Owned: AsRef<B> + AsMut<B>,
{
    Borrowed(&'a mut B),
    Owned(<B as ToOwned>::Owned),
}

  2. Be smart about using vectored writes. I think we should just enable writev when the frame size is large enough (see the sketch after this list). Alternatively, we could improve the write buffer logic with sendto.

  3. Excessive yields back to the Tokio scheduler. Under heavy load (~500 conns), I/O resources are almost always ready and quickly fill up the coop budget in Tokio - this forces Tokio to yield back to the scheduler so that "other tasks" can get a chance to be polled.

     However, in this particular echo_server benchmark there are no "other tasks" we care about, so we essentially end up wasting time.

@littledivy (Member)

Meh, I just realised MutCow is overkill and Frame payloads can just be a &'f mut [u8] :)
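
Something like this shape, roughly (illustrative only; the field names are not the exact fastwebsockets definition):

// The frame borrows a mutable window into the shared recv buffer for its
// lifetime, so unmasking can happen in place without a copy.
pub struct Frame<'f> {
    pub fin: bool,
    pub opcode: u8,
    pub payload: &'f mut [u8],
}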

@bartlomieju (Member)

> Excessive yields back to the Tokio scheduler. Under heavy load (~500 conns), I/O resources are almost always ready and quickly fill up the coop budget in Tokio - this forces Tokio to yield back to the scheduler so that "other tasks" can get a chance to be polled.

Wrap the relevant task in https://docs.rs/tokio/latest/tokio/task/fn.unconstrained.html to avoid forced yields.
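
For example, a minimal plain-TCP echo sketch (a stand-in for the benchmark's echo loop, not the actual fastwebsockets echo_server) where each connection task opts out of the coop budget:

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;
use tokio::task;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:9001").await?;
    loop {
        let (mut stream, _) = listener.accept().await?;
        // unconstrained() disables Tokio's cooperative budget for this task,
        // so a constantly-ready socket is never forced to yield.
        tokio::spawn(task::unconstrained(async move {
            let mut buf = vec![0u8; 64 * 1024];
            loop {
                match stream.read(&mut buf).await {
                    Ok(0) | Err(_) => break,
                    Ok(n) => {
                        if stream.write_all(&buf[..n]).await.is_err() {
                            break;
                        }
                    }
                }
            }
        }));
    }
}

Note that unconstrained tasks can starve the rest of the runtime, which is exactly why this only makes sense in a benchmark like this one where there are no other tasks to care about.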

@uNetworkingAB (Author)

I've added initial io_uring in v21:

[screenshot: benchmark results]

@littledivy (Member)

Cool, I was playing with tokio-uring the other day and it seems doable to add feature-gated code to support tokio-uring TCP streams. https://docs.rs/tokio-uring/latest/tokio_uring/net/struct.TcpStream.html#method.read
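
For reference, a rough sketch of what a feature-gated echo loop could look like on tokio-uring's owned-buffer API (driven under tokio_uring::start; shape only, not fastwebsockets code, and the "uring" feature name is made up):

#[cfg(feature = "uring")]
mod uring_echo {
    use std::io;
    use tokio_uring::net::TcpStream;

    // Completion-based I/O takes ownership of the buffer and hands it back with
    // the result, so the loop threads `buf` through each call instead of
    // borrowing it like the readiness-based epoll path does.
    pub async fn echo(stream: TcpStream) -> io::Result<()> {
        let mut buf = vec![0u8; 64 * 1024];
        loop {
            let (res, b) = stream.read(buf).await;
            buf = b;
            let n = res?;
            if n == 0 {
                return Ok(());
            }
            // Echo back the bytes we just read; a real implementation would
            // retry on short writes.
            buf.truncate(n);
            let (res, b) = stream.write(buf).await;
            buf = b;
            res?;
            buf.resize(64 * 1024, 0);
        }
    }
}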

@littledivy (Member)

Published fastwebsockets 0.4.2

@uNetworkingAB you might be interested in these charts:

[benchmark charts]

@littledivy (Member)

Current analysis:

fastwebsockets (msg/sec)   uWS (msg/sec)   conns   size (bytes)   % (+/-)
197921                     203761          10      20             -3%
211226                     214914          200     20             -2%
213680                     227030          500     20             -5%
101496                     86058           10      16386          18%
122088                     97946           200     16386          25%
106938                     80347           500     16386          33%

@uNetworkingAB (Author)

Ah, yes, writev with 2 chunks beats write for long messages; not something I've bothered with (yet?). The short-message bars make no sense though; they definitely do not match what I see here. I see at least 40% better short-message perf (1 kB and less) with uWS. You never tried v21, right? Even v20 beats fastwebsockets 0.4.2 on small messages by at least 15%, but the diff is extremely apparent in v21.

@littledivy (Member) commented May 11, 2023

Does v21 use epoll/kqueue by default for EchoServer?

@uNetworkingAB (Author)

Don't get me wrong, this competition is good. I'm already looking at adding no-copy writev sends for anything above a threshold. This is good, and I can confirm those numbers, but the current short-message numbers are way off.

v21 defaults to epoll; there is a release post on how to compile with io_uring, but you need Linux 6.0 or later.

@littledivy (Member)

Small msgs with uWS v21 EchoServer

fastwebsockets (msg/sec)   uWS (msg/sec)   conns   size (bytes)   % (+/-)
191362                     208341          10      20             -8%
211942                     216165          200     20             -1.9%
200574                     224980          500     20             -10%

Linux divy 5.19.0-1022-gcp #24~22.04.1-Ubuntu SMP x86_64 GNU/Linux

32GiB System memory
Intel(R) Xeon(R) CPU @ 3.10GHz

It does degrade by up to 10%, but I cannot reproduce the drastic ~40% here.

@uNetworkingAB (Author) commented May 11, 2023

It needs Linux 6.0; you are on 5.19. You also need to recompile load_test so that it uses io_uring - otherwise you just have epoll trying to stress io_uring. You know it's set up right if strace only lists io_uring_enter, for both EchoServer and load_test.

@littledivy (Member) commented May 11, 2023

I want to compare epoll-based implementations for now, to find out where the 40% degradation you see comes from.

The uWS EchoServer I compiled uses epoll, and the above results are for that. Is the 40% diff you see because of io_uring? (If so, that explains the diff.)

@uNetworkingAB (Author)

Yes, the 40% is from io_uring on Linux 6.0. There are features of 6.0 that are very central to that bigger diff, and that's why I target this kernel version as the minimum. This backend will be the default as soon as it is stable, so it would be very strange to exclude it.

Anyway, the first thing is probably adding this writev send path so we don't have gigantic diffs on bigger messages. I did remember why I never added it, though: it's not applicable for compressed messages or SSL, so it's a very specific bypass for only non-SSL, non-compressed, big messages.
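
In other words, the check boils down to something like this (names and the threshold are made up for illustration, not uWS internals):

// The two-chunk writev bypass only applies when nothing has to transform the
// payload (no TLS record framing, no permessage-deflate) and the payload is
// large enough to make the copy worth avoiding.
fn can_use_writev_bypass(is_ssl: bool, is_compressed: bool, payload_len: usize) -> bool {
    const BYPASS_THRESHOLD: usize = 16 * 1024; // assumed cut-off
    !is_ssl && !is_compressed && payload_len >= BYPASS_THRESHOLD
}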

@littledivy (Member)

Cool, the 40% diff will be relevant once fastwebsockets has an io_uring backend. Opened #31 for tracking io_uring support.

Self note: Add SSL benchmarks sometime in the future.

Anyway, I believe most of these things have been fixed, and I'll continue to improve perf on small msgs (a max 10% diff is fine for now). Feel free to open more related issues - this has been constructive 👍

@uNetworkingAB (Author)

Yes, competition creates an incentive to improve, which is good. I will have the writev fix done any time now.

@uNetworkingAB (Author)

Oh wow, uWS is 10% faster on 16 kB echoes with writev now :D
