
Move ReadBuffer chunk to heap #241

Merged: 2 commits into snapview:master on Oct 27, 2021

Conversation

PhotonQuantum (Contributor) commented Oct 14, 2021

#214 improved the performance of the read buffer a lot. However, that commit moved the chunk buffer to the stack.

Adding 4 KB of stack overhead sounds OK. But when it comes to async implementations, it eats up a great amount of stack space due to nested async stack frames. tokio-tungstenite needs about 670 KB of stack to perform a handshake. Similarly, async-tungstenite needs about 720 KB using the tokio runtime. (Test snippet: https://gist.github.com/PhotonQuantum/f250bb8a2fd0fbba071a65fd7563fd44)

This PR moves the buffer back to the heap. After the change, tokio-tungstenite only needs about 110 KB of stack memory, and async-tungstenite needs about 130 KB.
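The gist of the change, as a minimal sketch (the field and type names here are illustrative, not the actual tungstenite definitions): the fixed-size chunk moves from an inline array to a Box, so the ReadBuffer itself becomes small and cheap to move.

```rust
const CHUNK_SIZE: usize = 4096;

// Before: the chunk is stored inline, so every move of the buffer drags 4 KiB along.
struct ReadBufferStack {
    storage: Vec<u8>,
    chunk: [u8; CHUNK_SIZE],
}

// After: the chunk lives on the heap; the struct only holds a pointer to it.
struct ReadBufferHeap {
    storage: Vec<u8>,
    chunk: Box<[u8; CHUNK_SIZE]>,
}
```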

Benchmarks

Initial benchmark runs of this implementation show that it's actually faster than the stack version.

Gnuplot not found, using plotters backend
buffers/InputBuffer     time:   [2.6389 ms 2.6585 ms 2.6798 ms]                                 
                        thrpt:  [1.4576 GiB/s 1.4694 GiB/s 1.4803 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
buffers/ReadBuffer (stack)                                                                            
                        time:   [565.86 us 572.47 us 579.57 us]
                        thrpt:  [6.7399 GiB/s 6.8235 GiB/s 6.9033 GiB/s]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
buffers/ReadBuffer (heap)                                                                            
                        time:   [562.43 us 568.48 us 574.66 us]
                        thrpt:  [6.7975 GiB/s 6.8714 GiB/s 6.9453 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

However, the benchmark may not be reliable: when I moved the InputBuffer bench to last, I got the following output:

buffers/ReadBuffer (stack)                                                                             
                        time:   [2.4245 ms 2.4468 ms 2.4703 ms]
                        thrpt:  [1.5813 GiB/s 1.5965 GiB/s 1.6112 GiB/s]
                 change:
                        time:   [+315.88% +323.37% +330.52%] (p = 0.00 < 0.05)
                        thrpt:  [-76.772% -76.380% -75.955%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
buffers/ReadBuffer (heap)                                                                             
                        time:   [2.5765 ms 2.5979 ms 2.6207 ms]
                        thrpt:  [1.4905 GiB/s 1.5036 GiB/s 1.5161 GiB/s]
                 change:
                        time:   [+344.39% +352.75% +360.15%] (p = 0.00 < 0.05)
                        thrpt:  [-78.268% -77.913% -77.497%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
buffers/InputBuffer     time:   [2.9552 ms 2.9808 ms 3.0083 ms]                                 
                        thrpt:  [1.2985 GiB/s 1.3104 GiB/s 1.3218 GiB/s]
                 change:
                        time:   [+10.790% +12.127% +13.394%] (p = 0.00 < 0.05)
                        thrpt:  [-11.812% -10.816% -9.7389%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

And when I swapped ReadBuffer (stack) and ReadBuffer (heap), I got:

buffers/ReadBuffer (heap)                                                                             
                        time:   [2.6488 ms 2.6748 ms 2.7040 ms]
                        thrpt:  [1.4446 GiB/s 1.4604 GiB/s 1.4747 GiB/s]
                 change:
                        time:   [+2.1459% +3.2363% +4.7057%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4942% -3.1349% -2.1008%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
buffers/ReadBuffer (stack)                                                                             
                        time:   [2.7011 ms 2.7204 ms 2.7413 ms]
                        thrpt:  [1.4250 GiB/s 1.4359 GiB/s 1.4462 GiB/s]
                 change:
                        time:   [+11.712% +12.790% +13.907%] (p = 0.00 < 0.05)
                        thrpt:  [-12.209% -11.340% -10.484%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
buffers/InputBuffer     time:   [3.0394 ms 3.0581 ms 3.0782 ms]                                 
                        thrpt:  [1.2690 GiB/s 1.2773 GiB/s 1.2852 GiB/s]
                 change:
                        time:   [+6.1798% +7.0645% +7.9865%] (p = 0.00 < 0.05)
                        thrpt:  [-7.3958% -6.5984% -5.8201%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

daniel-abramov (Member)

Hm, this is quite an interesting observation, thanks for the PR!

I'm going to check the result on the machines that I used to benchmark the previous implementation.

> But when it comes to async implementations, it eats up a great amount of stack space due to nested async stack frames. tokio-tungstenite needs about 670 KB of stack to perform a handshake. Similarly, async-tungstenite needs about 720 KB using the tokio runtime. (Test snippet: https://gist.github.com/PhotonQuantum/f250bb8a2fd0fbba071a65fd7563fd44)

Perhaps a stupid question, but how does it correlate with async? I could imagine the stack growing if we were to pass the same structure around (as in, say, C++), but in the case of Rust the ReadBuffer is not even Clone (not to mention that it's not Copy). Sorry if my question does not make any sense :)

PhotonQuantum (Contributor, Author) commented Oct 19, 2021

I'm not so sure about my understanding of the async part 🙈. This is my best guess.

IMO in Rust we are doing the same thing as in C++: we grow the stack to receive the return value. Even though we are moving ownership, it's still a memcpy.
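A rough illustration of that point (hypothetical type, not the actual ReadBuffer): moving a struct with a large inline array transfers ownership, but the bytes may still be copied into the caller's frame unless the copy is optimized away.

```rust
// Hypothetical type for illustration; not the actual tungstenite ReadBuffer.
struct BigChunk {
    data: [u8; 4096],
}

fn make() -> BigChunk {
    // Returning by value moves ownership, but the 4 KiB of bytes may still be
    // memcpy'd into the caller's stack frame (especially in debug builds).
    BigChunk { data: [0; 4096] }
}

fn main() {
    let chunk = make(); // a "move", yet potentially a 4 KiB copy on the stack
    println!("first byte: {}", chunk.data[0]);
}
```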

When it comes to async, things get worse. Since the grown stack is now part of a future's stack frame (context), this overhead is present in each of them. The handshake future calls are quite deep (most of them contain an instance/return value of the buffer), eating up the stack.
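A sketch of why this hits futures so hard (hypothetical functions, not tungstenite's handshake code): any local that lives across an .await is stored inside the future itself, so an inline 4 KiB chunk inflates every nested future that holds one, while a boxed chunk adds only a pointer.

```rust
// Hypothetical async functions for illustration only.
async fn with_stack_chunk() {
    let chunk = [0u8; 4096];           // stored inside the future's state
    async {}.await;                    // chunk is live across the .await
    let _first = chunk[0];
}

async fn with_heap_chunk() {
    let chunk = Box::new([0u8; 4096]); // only the pointer lives in the future
    async {}.await;
    let _first = chunk[0];
}

fn main() {
    // The stack-chunk future is at least 4096 bytes; the heap-chunk one is roughly pointer-sized.
    println!("stack chunk future: {} bytes", std::mem::size_of_val(&with_stack_chunk()));
    println!("heap chunk future:  {} bytes", std::mem::size_of_val(&with_heap_chunk()));
}
```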

On release builds LLVM is smart enough to remove most of the unnecessary memcpys & stack growth, but we are not so lucky on debug builds.

daniel-abramov (Member)

Ah, ok, I think I got what you mean. So from what I understand, the idea is that since we allocate the buffer on the stack, it essentially follows the "regular C calling convention" (if I can put it like this): whenever the final compiled code contains an actual call of a function, it is "obligated" to create a new stack frame, copying the stack arguments, return address, etc. before calling it. This means that whenever we pass a ReadBuffer to a function and the call compiles to an actual call (i.e. it is not inlined), those 4K bytes that we allocate for the reading buffer are copied as well, which creates an issue in debug builds with Tokio because of the nature of tasks and the quite deep call "history" that may deplete the available stack in a given thread?

> On release builds LLVM is smart enough to remove most of the unnecessary memcpys & stack growth, but we are not so lucky on debug builds.

Ok, got it, so the problem is primarily with the release builds, right?

Thanks for the updated benchmarks BTW, I've run them on my M1 Mac and it seems like the difference is indeed not that significant.

PhotonQuantum (Contributor, Author)

> Ok, got it, so the problem is primarily with the release builds, right?

The original issue affects debug builds only. On release builds, most function calls are inlined.

Or do you mean potential performance regressions introduced by this PR on release builds? According to the benchmarks, the effect on performance seems to be negligible.

daniel-abramov merged commit 54cca09 into snapview:master on Oct 27, 2021
najamelan (Contributor)

This fixes a stack overflow for me on async-std. The same code with the tokio runtime did not suffer from a stack overflow. A release with this fix would be appreciated.

daniel-abramov (Member) commented Nov 3, 2021

Makes sense, thanks for the reminder! 0.16.0 has just been published (changelog).
