message: support bytes::Bytes as a payload type #462

Closed · 6 commits

Conversation

@daniel-abramov (Member) commented Dec 11, 2024

Preface

Recently, there has been another PR trying to solve the long-standing #96.

The issue had already been addressed in another PR that has been open for a while. I noticed that I even approved that PR some time ago. Yet it was never merged, for reasons I can't remember (AFAIR, I was waiting for another approval from other maintainers/active users, but somehow it stalled). Now the discussions on the PR and the issue are so long that it's hard to get a brief overview of the current state and blockers, if any.

Yesterday, I quickly went through #175 and the discussions on the issue to see if I could come up with something that would give it at least some resolution (#175 has conflicts with the master branch) and create a small, easy-to-review PR that we could merge rather quickly.

TL;DR

Essentially, the only thing this one changes is that all binary messages are backed by Bytes instead of a Vec<u8>. The distinction between Owned and Shared is internal and invisible to the end user (see #462 (comment)). I have also not touched text messages for now (there have been several proposals, and I'm not sure there is consensus on which one is best).

So the question is whether we are good with having Bytes instead of a Vec<u8>, or whether we want to expose that Payload type and make Message rely on it (like in #175). If I'm not mistaken, the cost of Bytes is negligible (see #462 (comment)); the main concern with using Bytes as the type for Message::Binary was the additional allocation/copy of the buffer before applying a mask (since Bytes is immutable), and #175 handled this by always creating a copy before masking. I changed it so that we skip this copy when the Bytes has unique access.
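
Roughly, a minimal sketch of that masking path (mask_payload and apply_mask are illustrative names, not tungstenite's actual internals; the key point is that Vec::from(Bytes) reclaims the underlying buffer without copying when the handle is the unique owner):

use bytes::Bytes;

fn mask_payload(payload: Bytes, mask: [u8; 4]) -> Bytes {
    // Vec::from(Bytes) hands back the underlying buffer without a copy
    // when this handle is the unique owner; only shared buffers get copied.
    let mut buf: Vec<u8> = Vec::from(payload);
    apply_mask(&mut buf, mask);
    Bytes::from(buf)
}

fn apply_mask(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        *byte ^= mask[i & 3]; // i & 3 == i % 4
    }
}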

Does it make any sense, or did I miss something?


@Dushistov, @sdroege, @francois-random: I decided to mention you folks since you participated in the discussion the most over the past year and seem interested in the outcome of this one. I'll also mention @alexheretic (you seem to be interested in performance-related improvements 🙂).

--

Fixes #96
Supersedes #175 if we agree that it's a sensible solution.

@Zarathustra2

I am not sure if this is just on my end, but when I run cargo bench on master I get:

     Running benches/write.rs (target/release/deps/write-9e4d5d42fd6d53b9)
Gnuplot not found, using plotters backend
write 100k small texts then flush
                        time:   [5.3589 ms 5.3623 ms 5.3665 ms]
                        change: [-0.1032% +0.0279% +0.1456%] (p = 0.67 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

and when running for this PR/branch:

     Running benches/write.rs (target/release/deps/write-9e4d5d42fd6d53b9)
Gnuplot not found, using plotters backend
write 100k small texts then flush
                        time:   [7.2634 ms 7.2756 ms 7.2928 ms]
                        change: [+35.427% +35.679% +36.006%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

@daniel-abramov (Member, Author) commented Dec 11, 2024

Thanks, benchmarking is a good thing! I'll investigate what's going on.

UPD: OK, when I run the write benchmarks, I often get a ±5% difference even without changes, but a 35% difference is quite stark. Apparently, using Bytes in this context is not as "negligible" a cost as I thought. I'm not sure what this could be attributed to; my only guess is the vtable indirection that Bytes carries.

Anyway, if the performance cost is so high, then perhaps Bytes is not suitable as the underlying type for Message. So I updated the PR to use Payload across all of the places (similarly to #175). Now the benchmarks are identical (no changes beyond the noise threshold, with only an occasional 2-3% difference).

@alexheretic (Contributor)

I think this is a step in the right direction.

Nice to see the write bench was useful :) I think we would benefit from benchmarking some more scenarios in this way: binary writes & reads, etc. If I can find some time, I'll raise a PR proposing some more.

@agalakhov (Member)

@alexheretic IIRC, I looked at the bytes crate while writing the very first version of Tungstenite, and the only reason I did not adopt it was its API instability at that time. As bytes is a solid part of the ecosystem now, it's a good idea to use it.

If implemented correctly, converting to Bytes should not affect the benchmarks at all, being truly zero-cost. It is only important to make the correct choices between Bytes and Buf/BufMut in all places. I believe it is even possible to improve performance compared to the old approach, as scatter-gather I/O is now stabilized in std. Using read_vectored()/write_vectored() together with a collection of BufMut/Bytes could make Tungstenite truly zero-copy.
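
For illustration, a minimal sketch of the scatter-gather idea (write_frame is a hypothetical helper, not tungstenite's actual I/O path): the frame header and the payload go out in one vectored call without being concatenated into a single buffer first.

use std::io::{IoSlice, Write};

fn write_frame<W: Write>(out: &mut W, header: &[u8], payload: &[u8]) -> std::io::Result<()> {
    let bufs = [IoSlice::new(header), IoSlice::new(payload)];
    // A real implementation would loop here: write_vectored may perform
    // a short write and report fewer bytes than the buffers contain.
    out.write_vectored(&bufs)?;
    Ok(())
}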

@alexheretic (Contributor)

I noticed you can use this branch for writing text too with Frame::message(str_bytes, OpCode::Data(Data::Text), true)
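
For reference, the call looks roughly like this (module paths and the Bytes-accepting signature are assumed from this branch; a sketch, not necessarily the exact public API):

use bytes::Bytes;
use tungstenite::protocol::frame::coding::{Data, OpCode};
use tungstenite::protocol::frame::Frame;

fn main() {
    // Build a text frame directly from an already-encoded UTF-8 buffer,
    // skipping the intermediate String allocation.
    let str_bytes = Bytes::from_static(b"hello");
    let frame = Frame::message(str_bytes, OpCode::Data(Data::Text), true);
    let _ = frame; // would then be written out via the usual frame path
}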

I struggled to produce a benchmark win using that, though, even though it should really eliminate an allocation. Maybe allocation just isn't significant relative to the other write work currently?

I raised a PR with the extra benches: #463

daniel-abramov changed the title from "message: switch to bytes::Bytes" to "message: support bytes::Bytes as a payload type" on Dec 13, 2024
@daniel-abramov (Member, Author) commented Dec 13, 2024

> I struggled to produce a benchmark win using that though, even though it should really eliminate allocation. Maybe allocation just isn't significant enough to the other write work currently?

I don't expect any improvements in the benchmarks from this PR, at least for the benches in question. The reason is that this change does not switch to Bytes completely (I've just updated the PR title, because it referred to the first attempt to switch to Bytes outright). The TL;DR is that it switches the payload and message type from Vec<u8> to a Payload type that is defined as:

pub enum Payload {
    /// Owned data with unique ownership.
    Owned(Vec<u8>),
    /// Shared data with shared ownership.
    Shared(Bytes),
}

In other words, folks who used tungstenite with owned data (String or Vec<u8>) should not notice any difference at all, because their messages are internally converted to Payload::Owned(Vec<u8>). But those who wanted to use Bytes are now able to do so as well. This means that the only performance improvement I expect from this change applies to users who write servers (specifically those who broadcast messages to many recipients or send the same message several times for whatever reason).
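
The conversions implied here would look something like this (hypothetical impls illustrating the idea, not necessarily the PR's exact code):

use bytes::Bytes;

impl From<Vec<u8>> for Payload {
    fn from(data: Vec<u8>) -> Self {
        Payload::Owned(data) // owned input keeps its unique ownership
    }
}

impl From<Bytes> for Payload {
    fn from(data: Bytes) -> Self {
        Payload::Shared(data) // cloning Bytes later is a cheap refcount bump
    }
}

impl From<String> for Payload {
    fn from(data: String) -> Self {
        Payload::Owned(data.into_bytes()) // a String is reused as its byte buffer
    }
}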

That's why there was no difference in performance when I compared it to the master branch (the benchmarks in master use owned data).

However, there is a difference when testing it against your new benchmarks. This PR shows, on average, a 4-5% decrease in read and write performance. I'm still unsure what this could be attributed to, given that the underlying type in all benchmarks remains the same even for this PR (no Bytes involved, just Vec<u8>). My only [non-scientific] guess is that the machine representation of a Payload is a couple of bytes larger. This is probably invisible in regular use, but for a benchmark that sends 100k messages, this tiny cost might accumulate into a visible minor performance decrease (after all, all messages in the benchmark are pretty small).

I then modified your write benchmark for the use cases explicitly mentioned in #96. Instead of creating small messages in a loop, I made a moderately sized message (1KB) and sent it repeatedly in that loop. I ran it on this branch and then compared it to the version that uses Vec<u8> in master, and the performance increase is quite massive:

(master)
write 100k small messages then flush (server)
                        time:   [8.3416 ms 8.3952 ms 8.4669 ms]
                        change: [+123.71% +126.18% +128.52%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  7 (7.00%) low mild
  1 (1.00%) high severe

So the master version is ~126% slower than the version that uses Bytes (due to the massive number of allocations caused by constant cloning).

The net result of the changes is that performance for general use cases drops by about 5% (and I'm not sure why, apart from the guess mentioned above). However, for server users, the performance wins may be above 125% if they specifically choose to use Bytes for cases when it's appropriate (broadcasting the same message, etc) since they avoid all of the costs associated with expensive cloning.

Tbh, if that small performance drop for the general use case could somehow be eliminated, then switching to the new Payload type would be a no-brainer.

@agalakhov (Member) commented Dec 13, 2024

In fact, Vec is not needed here at all, as it could be replaced as well. Bytes is, by definition, just an Arc-based Vec, which probably makes it more efficient to pass around. It is, however, less efficient in a single-producer, single-consumer model unless the compiler manages to optimize away the Arc, since atomic reference counting isn't cheap.

@alexheretic (Contributor) commented Dec 13, 2024

FWIW, I didn't see much difference between the branches in the new benches. Though I did port them to this branch slightly differently, and I think I stumbled upon slice::to_vec being faster, which skewed the comparison in this branch's favour.

Now that I run using to_vec in both branches, I see the following, which I interpret as:

  • Read speed is about the same
  • Write speed is slightly impacted
$ cargo bench --bench \* -- --quick --noplot --baseline a
read+unmask 100k small messages (server)
                        time:   [13.010 ms 13.028 ms 13.101 ms]
                        change: [+2.1102% +2.8253% +3.5454%] (p = 0.10 > 0.05)

read 100k small messages (client)
                        time:   [11.052 ms 11.053 ms 11.053 ms]
                        change: [+2.0613% +2.2854% +2.5105%] (p = 0.08 > 0.05)

write 100k small messages then flush (server)
                        time:   [4.6399 ms 4.6452 ms 4.6661 ms]
                        change: [+11.700% +12.027% +12.354%] (p = 0.09 > 0.05)

write+mask 100k small messages then flush (client)
                        time:   [5.6127 ms 5.6169 ms 5.6338 ms]
                        change: [+3.1398% +3.6457% +4.1546%] (p = 0.10 > 0.05)

@daniel-abramov (Member, Author)

Thanks! Yep, this roughly aligns with the results that I'm getting.

> I then modified your writing benchmark for the use cases explicitly mentioned in #96. Instead of creating small messages in a loop, I made a moderately sized message (1KB) and sent it in that loop in the writing benchmark. I ran it on this branch and then compared it to the version that uses Vec in master, and the performance increase is quite massive:

I'm thinking about adding this benchmark. The version in master would then create a single message and send it multiple times (cloning it each time before sending), while the version from this branch would wrap it in Bytes and pass it to tungstenite (eliminating the expensive cloning).
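
Something along these lines (a sketch with a hypothetical Client abstraction standing in for a per-connection WebSocket wrapper):

use bytes::Bytes;

trait Client {
    fn send_binary(&mut self, data: Bytes);
}

fn broadcast<C: Client>(clients: &mut [C], data: &Bytes) {
    for client in clients {
        // Each send clones a handle (a refcount bump on the shared buffer)
        // instead of copying the 1KB message payload itself.
        client.send_binary(data.clone());
    }
}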

@alexheretic (Contributor)

TBH, I don't see the minor perf regression in the benches as blocking this PR.

The scenario where a server broadcasts a message to N clients clearly shows a massive advantage of supporting Bytes. The regression reported by the benches is not large.

@alexheretic (Contributor)

I got a bit carried away trying to add text support 😅 #465

@daniel-abramov (Member, Author)

Awesome! Since yours is rebased on top of this branch, I'll merge it instead of integrating the two separately! So far it seems no one has objected, so I think we're good to go!
