Move ReadBuffer chunk to heap #241
Conversation
Hm, this is quite an interesting observation, thanks for the PR! I'm going to check the result on the machines that I used to benchmark the previous implementation.
Perhaps a stupid question, but how does it correlate with async? I could imagine that if we were to pass the same structure around (as in, let's say, C++), growing the stack; however, in the case of Rust the
I'm not so sure about my understanding of the async part 🙈, so this is my best guess. IMO in Rust we are doing the same thing as in C++: we grow the stack to receive the return value. Even though we are moving ownership, it's still a memcpy. When it comes to async, things get worse: since the grown stack is now part of a future's stack frame (context), this overhead is present in each of them. The handshake future calls are quite deep (most of them contain an instance/return value of the buffer), eating up the stack. On release builds LLVM is smart enough to remove most of the unnecessary memcpys & stack growth, but we are not so lucky on debug builds.
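To make the async part concrete, here is a minimal sketch (my own illustration, not tungstenite code) showing that a chunk-sized local held across an `.await` point ends up inside the future's state, while a heap-allocated chunk only contributes a pointer-sized handle:

```rust
use std::mem::size_of_val;

async fn noop() {}

// A 4 KB chunk kept alive across an `.await` becomes part of the future's
// state machine, so every nested future holding such a buffer adds ~4 KB.
async fn read_with_stack_chunk() -> usize {
    let chunk = [0u8; 4096]; // stored inside the future's state
    noop().await;            // chunk is live across this await point
    chunk.len()
}

// The heap-allocated variant only keeps a pointer/len/capacity in the state.
async fn read_with_heap_chunk() -> usize {
    let chunk = vec![0u8; 4096];
    noop().await;
    chunk.len()
}

fn main() {
    // The futures are plain values; their sizes show where the buffer lives
    // (exact numbers depend on the compiler version).
    println!("stack-chunk future: {} bytes", size_of_val(&read_with_stack_chunk()));
    println!("heap-chunk future:  {} bytes", size_of_val(&read_with_heap_chunk()));
}
```

Nesting such futures inside each other multiplies the effect, which is why a deep handshake call chain ends up needing hundreds of kilobytes of stack in debug builds.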
Ah, ok, I think I got what you mean. So from what I understand, the idea is that since we allocate the buffer on the stack, it essentially follows the "regular C calling convention" (if I can put it like this), meaning that when the final compiled code contains actual calls of the function, it's "obligated" to create a new stack frame by copying the stack arguments, return address etc. before calling the function. Which means that whenever we pass a
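A rough illustration of that point, using hypothetical types (the names below don't exist in tungstenite): moving or returning a struct that embeds the 4 KB array copies the whole array into the caller's frame, whereas a boxed chunk only moves a pointer, and in debug builds those copies are generally not elided:

```rust
// Hypothetical types for illustration; these names don't exist in tungstenite.
struct StackChunkBuffer {
    chunk: [u8; 4096], // the array is embedded, so the whole struct is ~4 KB
}

struct HeapChunkBuffer {
    chunk: Box<[u8; 4096]>, // only a pointer is embedded
}

// Returning by value means the caller reserves space for the return value and
// the callee memcpys into it; in debug builds such copies are rarely elided,
// so every call that moves a StackChunkBuffer shuffles ~4 KB around.
fn make_stack_buffer() -> StackChunkBuffer {
    StackChunkBuffer { chunk: [0; 4096] }
}

fn make_heap_buffer() -> HeapChunkBuffer {
    HeapChunkBuffer { chunk: Box::new([0; 4096]) }
}

fn main() {
    println!("{}", std::mem::size_of::<StackChunkBuffer>()); // 4096
    println!("{}", std::mem::size_of::<HeapChunkBuffer>());  // 8 on 64-bit targets
    let a = make_stack_buffer(); // moving `a` copies the whole array in debug builds
    let b = make_heap_buffer();  // moving `b` copies a single pointer
    println!("{} {}", a.chunk.len(), b.chunk.len());
}
```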
Ok, got it, so the problem is primarily with the release builds, right? Thanks for the updated benchmarks BTW, I've run them on my M1 Mac and it seems like the difference is not that significant indeed.
The original issue affects debug builds only; on release builds most function calls are inlined. Or do you mean potential performance regressions introduced by this PR on release builds? According to the benchmark, the effect on performance seems to be negligible.
This fixes a stack overflow for me on async-std. The same code with the tokio runtime did not suffer from a stack overflow. A release with this fix would be appreciated.
Makes sense, thanks for the reminder! 0.16.0 has just been published (changelog).
#214 improved the performance of the read buffer a lot. However, that commit moved the chunk buffer to the stack.
Adding a 4 KB stack overhead sounds ok on its own, but when it comes to async implementations it eats up a great amount of stack space due to nested async stack frames: `tokio-tungstenite` needs about 670 KB of stack to perform a handshake, and similarly `async-tungstenite` needs about 720 KB using the tokio runtime (test snippet: https://gist.github.com/PhotonQuantum/f250bb8a2fd0fbba071a65fd7563fd44).

This PR tries to move the buffer to the heap again. After the change, `tokio-tungstenite` only needs about 110 KB of stack memory, and `async-tungstenite` about 130 KB.
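For context, here is a minimal sketch of the heap-chunk idea; the names and structure are simplified and do not match the actual tungstenite `ReadBuffer`:

```rust
use std::io::Read;

const CHUNK_SIZE: usize = 4096;

// A simplified read buffer with the chunk boxed onto the heap. This is a
// hypothetical sketch, not the actual tungstenite ReadBuffer: the point is
// that the struct itself stays small, so stack frames and futures that own
// it no longer carry the 4 KB chunk inline.
struct ReadBuffer {
    storage: Vec<u8>,             // accumulated input bytes
    chunk: Box<[u8; CHUNK_SIZE]>, // heap-allocated scratch space for reads
}

impl ReadBuffer {
    fn new() -> Self {
        Self { storage: Vec::new(), chunk: Box::new([0; CHUNK_SIZE]) }
    }

    // Pull at most one chunk from the reader into the internal storage.
    fn read_from<R: Read>(&mut self, reader: &mut R) -> std::io::Result<usize> {
        let n = reader.read(&mut self.chunk[..])?;
        self.storage.extend_from_slice(&self.chunk[..n]);
        Ok(n)
    }
}

fn main() -> std::io::Result<()> {
    let mut buf = ReadBuffer::new();
    let mut input: &[u8] = b"hello websocket";
    let n = buf.read_from(&mut input)?;
    assert_eq!(n, 15);
    assert_eq!(buf.storage, b"hello websocket");
    Ok(())
}
```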
Benchmarks

Initial benchmark attempts on this implementation show that it's actually faster than the stack version.
However, the benchmark may not be reliable, because when I moved the `InputBuffer` bench to last, I got the following output:

And when I swapped `ReadBuffer (stack)` and `ReadBuffer (heap)`, I got