
Move ReadBuffer chunk to heap #241

Merged: 2 commits into snapview:master on Oct 27, 2021

Conversation

PhotonQuantum (Contributor) commented Oct 14, 2021

#214 improved the performance of the read buffer a lot. However, that commit moved the chunk buffer to the stack.

Adding 4 KB of stack overhead sounds OK. But when it comes to async implementations, it eats up a great amount of stack space due to nested async stack frames. tokio-tungstenite needs about 670 KB of stack to perform a handshake. Similarly, async-tungstenite needs about 720 KB using the tokio runtime. (Test snippet: https://gist.github.com/PhotonQuantum/f250bb8a2fd0fbba071a65fd7563fd44)

This PR moves the buffer back to the heap. After the change, tokio-tungstenite only needs about 110 KB of stack memory, and async-tungstenite needs about 130 KB.
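The gist of the change, as a minimal sketch (the field and type names here are illustrative, not the actual tungstenite definitions): the fixed-size chunk moves from an inline array to a Box, so the ReadBuffer itself becomes small and cheap to move.

```rust
const CHUNK_SIZE: usize = 4096;

// Before: the chunk is stored inline, so every move of the buffer drags 4 KiB along.
struct ReadBufferStack {
    storage: Vec<u8>,
    chunk: [u8; CHUNK_SIZE],
}

// After: the chunk lives on the heap; the struct only holds a pointer to it.
struct ReadBufferHeap {
    storage: Vec<u8>,
    chunk: Box<[u8; CHUNK_SIZE]>,
}
```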

Benchmarks

Initial benchmark runs of this implementation show that it's actually faster than the stack version.

Gnuplot not found, using plotters backend
buffers/InputBuffer     time:   [2.6389 ms 2.6585 ms 2.6798 ms]                                 
                        thrpt:  [1.4576 GiB/s 1.4694 GiB/s 1.4803 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
buffers/ReadBuffer (stack)                                                                            
                        time:   [565.86 us 572.47 us 579.57 us]
                        thrpt:  [6.7399 GiB/s 6.8235 GiB/s 6.9033 GiB/s]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
buffers/ReadBuffer (heap)                                                                            
                        time:   [562.43 us 568.48 us 574.66 us]
                        thrpt:  [6.7975 GiB/s 6.8714 GiB/s 6.9453 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

However, the benchmark may not be reliable: when I moved the InputBuffer bench to last, I got the following output:

buffers/ReadBuffer (stack)                                                                             
                        time:   [2.4245 ms 2.4468 ms 2.4703 ms]
                        thrpt:  [1.5813 GiB/s 1.5965 GiB/s 1.6112 GiB/s]
                 change:
                        time:   [+315.88% +323.37% +330.52%] (p = 0.00 < 0.05)
                        thrpt:  [-76.772% -76.380% -75.955%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
buffers/ReadBuffer (heap)                                                                             
                        time:   [2.5765 ms 2.5979 ms 2.6207 ms]
                        thrpt:  [1.4905 GiB/s 1.5036 GiB/s 1.5161 GiB/s]
                 change:
                        time:   [+344.39% +352.75% +360.15%] (p = 0.00 < 0.05)
                        thrpt:  [-78.268% -77.913% -77.497%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
buffers/InputBuffer     time:   [2.9552 ms 2.9808 ms 3.0083 ms]                                 
                        thrpt:  [1.2985 GiB/s 1.3104 GiB/s 1.3218 GiB/s]
                 change:
                        time:   [+10.790% +12.127% +13.394%] (p = 0.00 < 0.05)
                        thrpt:  [-11.812% -10.816% -9.7389%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

And when I swapped ReadBuffer (stack) and ReadBuffer (heap), I got:

buffers/ReadBuffer (heap)                                                                             
                        time:   [2.6488 ms 2.6748 ms 2.7040 ms]
                        thrpt:  [1.4446 GiB/s 1.4604 GiB/s 1.4747 GiB/s]
                 change:
                        time:   [+2.1459% +3.2363% +4.7057%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4942% -3.1349% -2.1008%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
buffers/ReadBuffer (stack)                                                                             
                        time:   [2.7011 ms 2.7204 ms 2.7413 ms]
                        thrpt:  [1.4250 GiB/s 1.4359 GiB/s 1.4462 GiB/s]
                 change:
                        time:   [+11.712% +12.790% +13.907%] (p = 0.00 < 0.05)
                        thrpt:  [-12.209% -11.340% -10.484%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
buffers/InputBuffer     time:   [3.0394 ms 3.0581 ms 3.0782 ms]                                 
                        thrpt:  [1.2690 GiB/s 1.2773 GiB/s 1.2852 GiB/s]
                 change:
                        time:   [+6.1798% +7.0645% +7.9865%] (p = 0.00 < 0.05)
                        thrpt:  [-7.3958% -6.5984% -5.8201%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

daniel-abramov (Member)

Hm, this is quite an interesting observation, thanks for the PR!

I'm going to check the result on the machines that I used to benchmark the previous implementation.

> But when it comes to async implementations, it eats up a great amount of stack space due to nested async stack frames. tokio-tungstenite needs about 670 KB of stack to perform a handshake. Similarly, async-tungstenite needs about 720 KB using the tokio runtime. (Test snippet: https://gist.github.com/PhotonQuantum/f250bb8a2fd0fbba071a65fd7563fd44)

Perhaps a stupid question, but how does it correlate with async? I could imagine the stack growing if we were to pass the same structure around (as in, say, C++), but in the case of Rust the ReadBuffer is not even Clone (not to mention that it's not Copy). Sorry if my question does not make any sense :)

PhotonQuantum (Contributor, Author) commented Oct 19, 2021

I'm not so sure about my understanding of the async part 🙈. This is my best guess.

IMO in Rust we are doing the same thing as in C++: we grow the stack to receive the return value. Even though we are moving ownership, it's still a memcpy.
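A rough illustration of that point (hypothetical type, not the actual ReadBuffer): moving a struct with a large inline array transfers ownership, but the bytes may still be copied into the caller's frame unless the copy is optimized away.

```rust
// Hypothetical type for illustration; not the actual tungstenite ReadBuffer.
struct BigChunk {
    data: [u8; 4096],
}

fn make() -> BigChunk {
    // Returning by value moves ownership, but the 4 KiB of bytes may still be
    // memcpy'd into the caller's stack frame (especially in debug builds).
    BigChunk { data: [0; 4096] }
}

fn main() {
    let chunk = make(); // a "move", yet potentially a 4 KiB copy on the stack
    println!("first byte: {}", chunk.data[0]);
}
```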

When it comes to async, things get worse. Since the grown stack is now part of a future's stack frame (context), this overhead is present in each of them. The handshake future calls are quite deep (most of them contain an instance/return value of the buffer), eating up the stack.
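A sketch of why this hits futures so hard (hypothetical functions, not tungstenite's handshake code): any local that lives across an .await is stored inside the future itself, so an inline 4 KiB chunk inflates every nested future that holds one, while a boxed chunk adds only a pointer.

```rust
// Hypothetical async functions for illustration only.
async fn with_stack_chunk() {
    let chunk = [0u8; 4096];           // stored inside the future's state
    async {}.await;                    // chunk is live across the .await
    let _first = chunk[0];
}

async fn with_heap_chunk() {
    let chunk = Box::new([0u8; 4096]); // only the pointer lives in the future
    async {}.await;
    let _first = chunk[0];
}

fn main() {
    // The stack-chunk future is at least 4096 bytes; the heap-chunk one is roughly pointer-sized.
    println!("stack chunk future: {} bytes", std::mem::size_of_val(&with_stack_chunk()));
    println!("heap chunk future:  {} bytes", std::mem::size_of_val(&with_heap_chunk()));
}
```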

On release builds LLVM is smart enough to remove most of the unnecessary memcpys & stack growth, but we are not so lucky on debug builds.

daniel-abramov (Member)

Ah, ok, I think I got what you mean. So from what I understand, the idea is that since we allocate the buffer on the stack, it essentially follows the "regular C calling convention" (if I can put it like this): whenever the final compiled code contains an actual call of a function, it is "obligated" to create a new stack frame, copying the stack arguments, return address, etc. before calling it. This means that whenever we pass a ReadBuffer to a function and the call compiles to an actual call (i.e. it is not inlined), those 4K bytes that we allocate for the reading buffer are copied as well, which creates an issue in debug builds with Tokio because of the nature of tasks and the quite deep call "history" that may deplete the available stack in a given thread?

> On release builds LLVM is smart enough to remove most of the unnecessary memcpys & stack growth, but we are not so lucky on debug builds.

Ok, got it, so the problem is primarily with the release builds, right?

Thanks for the updated benchmarks BTW, I've run them on my M1 Mac and it seems like the difference is indeed not that significant.

PhotonQuantum (Contributor, Author)

> Ok, got it, so the problem is primarily with the release builds, right?

The original issue affects debug builds only. On release builds, most function calls are inlined.

Or do you mean potential performance regressions introduced by this PR on release builds? According to the benchmarks, the effect on performance seems to be negligible.

daniel-abramov merged commit 54cca09 into snapview:master on Oct 27, 2021
najamelan (Contributor)

This fixes a stack overflow for me on async-std. The same code with the tokio runtime did not suffer from a stack overflow. A release with this fix would be appreciated.

daniel-abramov (Member) commented Nov 3, 2021

Makes sense, thanks for the reminder! 0.16.0 has just been published (changelog).
