Increase in memory usage when compression is on #203
First step would probably be to get a heap and allocation profile of your application using pprof.
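For anyone unfamiliar with the workflow, here is a minimal sketch of exposing pprof and pulling the profiles; the side port and handlers are the standard net/http/pprof ones, but treat the wiring as illustrative rather than the application's actual setup.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a side port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the websocket application as usual ...
	select {}
}
```

The profiles can then be fetched with `go tool pprof http://localhost:6060/debug/pprof/heap` (add `-alloc_space` to focus on total allocations) and `go tool pprof http://localhost:6060/debug/pprof/profile` for CPU.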
Please generate the profiles with the latest version of the package. The recent change to pool flate readers and writers should help.
Thanks guys, I'm waiting for the next version of Centrifugo which includes your latest version, and then I will profile the application and upload the results here. You can close this issue until then, but I hope to have the new version today or tomorrow at the latest, so it's up to you.
Here is the pprof dump. This is the heap profile. @kisielk @garyburd I've updated the gist, please check now: https://gist.github.com/joshdvir/091229e3d3e4ade8d73b8cffe86c602b
I asked @joshdvir to send me CPU and memory profiles from a production node, here is what we have:

CPU:
Most of the CPU time is spent in WriteMessage:
NextWriter:
compressNoContextTakeover:

And now the heap profile:
NextWriter:
compressNoContextTakeover:
Possibly related: golang/go#18625
@y3llowcake thanks for pointing to this issue. I've written a test case for Gorilla WebSocket:

```go
type testConn struct {
	conn     *Conn
	messages chan []byte
}

func newTestConn(c *Conn, bufferSize int) *testConn {
	return &testConn{
		conn:     c,
		messages: make(chan []byte, bufferSize),
	}
}

func printss() {
	m := runtime.MemStats{}
	runtime.ReadMemStats(&m)
	fmt.Printf("inuse: %d sys: %d\n", m.StackInuse, m.StackSys)
}

func TestWriteWithCompression(t *testing.T) {
	w := ioutil.Discard
	done := make(chan struct{})
	numConns := 1000
	numMessages := 1000
	conns := make([]*testConn, numConns)
	var wg sync.WaitGroup
	for i := 0; i < numConns; i++ {
		c := newConn(fakeNetConn{Reader: nil, Writer: w}, false, 1024, 1024)
		c.enableWriteCompression = true
		c.newCompressionWriter = compressNoContextTakeover
		conns[i] = newTestConn(c, 256)
		wg.Add(1)
		go func(c *testConn) {
			defer wg.Done()
			i := 0
			for i < numMessages {
				select {
				case <-done:
					return
				case msg := <-c.messages:
					c.conn.WriteMessage(TextMessage, msg)
					i++
				}
			}
		}(conns[i])
	}
	messages := textMessages(100)
	for i := 0; i < numMessages; i++ {
		if i%100 == 0 {
			printss()
		}
		msg := messages[i%len(messages)]
		for _, c := range conns {
			c.messages <- msg
		}
	}
	wg.Wait()
}

func textMessages(num int) [][]byte {
	messages := make([][]byte, num)
	for i := 0; i < num; i++ {
		msg := fmt.Sprintf("planet: %d, country: %d, city: %d, street: %d", i, i, i, i)
		messages[i] = []byte(msg)
	}
	return messages
}
```

It creates 1000 connections with compression enabled, each with a buffered message channel. Then in a loop we write a message into each connection. Here is how it behaves with
Using Go with commit golang/go@9c3630f
Though it's hard to say at the moment whether this fix will solve the original problem in this issue or not. I also tried the same with

But actually even without that fix the https://github.com/klauspost/compress library shows no memory growth... I can't explain this. Also, here is a benchmark result using the https://github.com/klauspost/compress library:

It's a 4x speedup compared to the standard library.

@garyburd I understand that having a non-standard-library package in core could be a wrong step, but maybe we can consider a mechanism to let it be plugged in somehow by user code?
AFAICT, this package uses "level 3" compression (which is a good choice). In my package levels 1-4 are specialized and do not use the "generic" code, which has the issue. In Go 1.7, level 1 (best speed) has a similar specialized function. I would think that if you use that, you will not experience the issue. That might be a solution you can use, so you do not have to import a specialized package (even if I wouldn't mind giving users the option). Performance for level 1 should be very close to my package.
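If it helps, switching to the specialized level with the standard library is a one-line choice; a minimal sketch, with the sample payload being an assumption:

```go
package main

import (
	"bytes"
	"compress/flate"
	"log"
)

func main() {
	var buf bytes.Buffer
	// flate.BestSpeed (level 1) uses the specialized encoder in Go 1.7,
	// avoiding the generic code path discussed above.
	fw, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := fw.Write([]byte("planet: 1, country: 1, city: 1, street: 1")); err != nil {
		log.Fatal(err)
	}
	if err := fw.Flush(); err != nil {
		log.Fatal(err)
	}
}
```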
@klauspost thanks for explaining, just tried what you said - yes, with compression level 1, performance is comparable to your library and there are no memory problems in go1.7 (in the test case above).

@garyburd what do you think about this? I see two solutions that can help us - make the compression level an exported variable or allow plugging in a custom flate implementation. Of course we can also wait for go1.8, but a way to improve compression performance is still very important. Do you want us to try creating a custom build with the advice you gave and the fixed array copy bug and see how it behaves in production?
@FZambia How about setting the compression level to one for now? It's probably the best option for most applications at this time and avoids exposing more API surface area.
@garyburd I agree with @FZambia, having the option to set the compression level would help greatly. For my specific use case I have a fanout of 80K messages per sec to 200K users, which causes a lot of outbound traffic; if I were able to set the compression level I could find the sweet spot between server count and outbound traffic.
OK, let's add the following:
Modify Upgrader to match Dialer.
@garyburd I agree that level 1 is a better default value at the moment because it fixes the memory growth on Go 1.7 and compression is so costly. But it looks like at the fanout level, in such big apps as @joshdvir has, saving bandwidth saves a lot of money, so having the compression level configurable makes sense. We made a custom build with compression level 1 and the counters you suggested and put it into production. Counter values are:
It's an aggregation over 1 minute. So the pool looks pretty effective, but... compression is still the leader in the alloc and CPU profiles, and getting from sync.Pool is the most allocation-expensive operation for some reason. Here is the CPU profile now:
list WriteMessage
list Close
list Flush
Memory usage is much better now, but it is still very high; compression allocates a lot:
list NextWriter
list compressNoContextTakeover
It's hard to say what we can do with such a big compression overhead... @garyburd you suggest adding
@FZambia Thank you for testing the pool effectiveness and reporting the profiles. I don't want to add an API now that might be replaced with another API later. As I think about it more, it's better to add one method:

This is more flexible than other options. The implementation will replace flateWriterPool with flateWriterPools[12]. The flateWriterWrapper will need to store the level so the writer can be returned to the correct pool.
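A rough sketch of the pooling scheme described above; the index mapping and helper names are assumptions, not the actual implementation:

```go
package websocketpool // illustrative package name

import (
	"compress/flate"
	"io"
	"sync"
)

// One pool per compression level: levels -2 (Huffman only) through 9 map to
// indices 0..11, hence the [12] in flateWriterPools.
var flateWriterPools [12]sync.Pool

type flateWriterWrapper struct {
	fw    *flate.Writer
	level int // remembered so the writer is returned to the correct pool
}

func getFlateWriter(w io.Writer, level int) *flateWriterWrapper {
	p := &flateWriterPools[level+2]
	if v := p.Get(); v != nil {
		fww := v.(*flateWriterWrapper)
		fww.fw.Reset(w)
		return fww
	}
	fw, _ := flate.NewWriter(w, level) // level assumed validated by the caller
	return &flateWriterWrapper{fw: fw, level: level}
}

func putFlateWriter(fww *flateWriterWrapper) {
	flateWriterPools[fww.level+2].Put(fww)
}
```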
I think that having one method that changes the default compression level still makes sense:

```go
var defaultCompressionLevel int = 1

// SetDefaultCompressionLevel sets the flate compression level which will be used by
// default to compress messages when compression is negotiated. This function must be
// called once before the application starts.
//
// Valid levels range from -2 to 9. Level -2 will use Huffman compression only.
// Level -1 uses the default compression level. Level 0 does not attempt any
// compression. Levels 1 through 9 range from best speed to best compression.
func (c *Conn) SetDefaultCompressionLevel(n int) error {
	defaultCompressionLevel = n
	return nil
}
```

In most situations I suppose users that need custom compression need to set this default value once. Then if someone needs a compression level per connection/message we can add:

```go
// SetCompressionLevel sets the flate compression level for the next message.
// Valid levels range from -2 to 9. Level -2 will use Huffman compression only.
// Level -1 uses the default compression level. Level 0 does not attempt any
// compression. Levels 1 through 9 range from best speed to best compression.
// If not set, the default compression level will be used.
func (c *Conn) SetCompressionLevel(n int) error {
	// ...
}
```

If it's not called but compression is negotiated
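For what it's worth, a per-connection setter along these lines is what eventually landed (see the commit referenced below); a hedged usage sketch, with the helper name and payload being illustrative:

```go
package example

import (
	"compress/flate"
	"log"

	"github.com/gorilla/websocket"
)

// writeCheap picks the cheapest real compression level for a single
// large-fanout message, assuming compression was negotiated on c.
func writeCheap(c *websocket.Conn, payload []byte) {
	if err := c.SetCompressionLevel(flate.BestSpeed); err != nil {
		log.Println("set compression level:", err)
	}
	if err := c.WriteMessage(websocket.TextMessage, payload); err != nil {
		log.Println("write:", err)
	}
}
```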
Just looked at the size of each flate.Writer:

```go
package main

import (
	"compress/flate"
	"unsafe"
)

func main() {
	var w flate.Writer
	println(unsafe.Sizeof(w))
}
```

600Kb! This size is surprising to me - I had not even assumed that it's such a big thing :)
Yeah - there are quite a number of tables needed to maintain state, produce Huffman tables and buffer output. There isn't much that can be done about it, except for level -2 (HuffmanOnly), 0 (No Compression) and 1 (Best Speed) in the standard library. For my own library I have been looking at reducing the memory requirements for encoding options that do not require as many buffers, see klauspost/compress#70 (it still has issues, as can be seen by the crash I logged in the issue).
@FZambia I do not want to use package-level variables for settings. The setting should be included with the other settings in Dialer and Upgrader, or it should be part of the connection so it can be changed from message to message.

See b0dc455.

@garyburd many thanks for your time helping us with this. Looking at all these profiles, do you see any way to reduce compression memory usage and allocations?

If the application sends the same message to multiple clients, then #182 is the next thing to try.
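#182 is the prepared message proposal; a sketch of how a fanout might use it, with the bookkeeping around the connection list being illustrative (names follow the API the proposal eventually added):

```go
package example

import (
	"log"

	"github.com/gorilla/websocket"
)

// broadcast frames (and, when negotiated, compresses) the payload once and
// reuses the prepared result for every connection, instead of re-compressing
// the same bytes per connection.
func broadcast(conns []*websocket.Conn, payload []byte) {
	pm, err := websocket.NewPreparedMessage(websocket.TextMessage, payload)
	if err != nil {
		log.Println("prepare:", err)
		return
	}
	for _, c := range conns {
		if err := c.WritePreparedMessage(pm); err != nil {
			log.Println("write:", err)
		}
	}
}
```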
@garyburd heap before changes: The same picture for CPU. So we don't see compression in first place in the heap and CPU profiles anymore. Though memory usage is reduced, it's still pretty high when using compression - but this can be an application problem, we will try to investigate.

@garyburd just analyzed The leaders are There are some other things in the profiles that might interest you - I'll try to upload SVG graph visualizations from the build with the fix.

Here are some graphs using Centrifugo with the latest Gorilla WebSocket. This is from a Centrifugo node run by @joshdvir with many connections (about 100k) and websocket compression enabled (level 1), go1.7.5.

CPU: Here we see that most of the CPU time is spent on syscalls - read, write - I don't think we can do a lot here, because we send a lot of fanout messages to different sockets. So this looks normal.

Alloc space: Here we see a big impact of

Inuse space: @garyburd do you see a way to improve things further? If not then we can live with this I suppose.

P.S. I am going to create many ws connections on a local machine with compression enabled/disabled and compare profiles.
Maybe if you wrote a fairly standalone benchmark, meaning one only using the stdlib and deflate, which represented real-world scenarios, I could use that as a benchmark for testing whether a leaner encoder would actually help. The PR itself would have to be rewritten, but having a good benchmark would be a starting point. We could also add a
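Something along these lines could be a starting point: a standalone benchmark that uses only compress/flate and mimics the ~600 byte fanout messages from this thread; the message size and pooling are assumptions taken from the discussion above.

```go
package example

import (
	"compress/flate"
	"fmt"
	"io/ioutil"
	"sync"
	"testing"
)

var flatePool = sync.Pool{
	New: func() interface{} {
		fw, _ := flate.NewWriter(ioutil.Discard, flate.BestSpeed)
		return fw
	},
}

// BenchmarkSmallMessageDeflate compresses one small message per iteration,
// taking a pooled writer each time, roughly as the websocket write path does.
func BenchmarkSmallMessageDeflate(b *testing.B) {
	msg := []byte(fmt.Sprintf("%0600d", 42)) // ~600 byte payload
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		fw := flatePool.Get().(*flate.Writer)
		fw.Reset(ioutil.Discard)
		if _, err := fw.Write(msg); err != nil {
			b.Fatal(err)
		}
		if err := fw.Flush(); err != nil {
			b.Fatal(err)
		}
		flatePool.Put(fw)
	}
}
```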
Given these performance issues, is compression worth enabling?
How would this API work? Would we allow writing directly to the net.Conn?

On the server without compression, the WriteMessage API mostly copies the data directly from the application-supplied []byte to the underlying network connection. The prepared message API also copies a []byte directly to the underlying network connection.

I believe this issue is due to compress/flate not exposing a way to adjust the sliding window. See https://golang.org/issue/3155. Thus, every flate writer allocates a lot of memory and, even though they are reused via a pool, this ends up causing the large memory usage spike with compression on.

Doesn't seem like there is anything this library can do further until that issue is closed.
In fact, according to this benchmark:

```go
func BenchmarkFlate(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		flate.NewWriter(nil, flate.BestSpeed)
	}
}
```

flate.NewWriter allocates 1.2 MB! That's a lot of memory to allocate for compression when you're writing 600 byte WebSocket messages.
With BestCompression or DefaultCompression, the allocation drops to 800 KB, but that's still massive.

```
# DefaultCompression
$ go test -bench=BenchmarkFlate -run=^$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkFlate-8    20000    93332 ns/op    806784 B/op    13 allocs/op
PASS
ok  nhooyr.io/websocket  2.848s

# BestCompression
$ go test -bench=BenchmarkFlate -run=^$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkFlate-8    20000    95197 ns/op    806784 B/op    13 allocs/op
PASS
ok  nhooyr.io/websocket  2.879s
```
I've filed golang/go#32371

Have a look at klauspost/compress#176

@klauspost that's really cool. What sort of memory overhead would be involved for each Write?

It will require a few hundred kilobytes, but contrary to the other alternatives, once it is compressed the memory will be freed. For web sockets, that is what you need. Meaning no live allocations for inactive sockets. Just be sure everything is written in a single write.
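A minimal sketch of that single-write pattern, assuming the stateless writer from github.com/klauspost/compress/flate:

```go
package example

import (
	"bytes"

	"github.com/klauspost/compress/flate"
)

// compressStateless compresses one websocket message without keeping any
// encoder state alive afterwards, so an idle connection holds no buffers.
func compressStateless(msg []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := flate.NewStatelessWriter(&buf)
	// Write everything in one call: the stateless writer emits a block per
	// Write, so splitting the message would hurt the compression ratio.
	if _, err := w.Write(msg); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```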
@klauspost thanks, I'll try to check it out as soon as I have time. I have a gist that allows using a custom flate with the Gorilla WebSocket library - I need to revisit it and experiment with your stateless compression branch.

klauspost/compress#185 brings down heap allocation to a few hundred bytes during a Write by having an internal

My testing of @klauspost's library with stateless compression has shown great results. See klauspost/compress#216 (comment). About 9.5 KB allocated per message written for the simple 512 byte message. Edit: it's now available for testing in master on my library.

To clarify, that's with a dictionary in context takeover mode; it's 50 B per message allocated if there is no dictionary.
Compressor and decompressor instances are very memory intensive, and gorilla uses sync.Pool to reuse them. For best reuse, messages should be processed in a non-blocking manner, with a goroutine started immediately after decoding.
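A hedged sketch of that pattern with this package; the handler callback is illustrative:

```go
package example

import (
	"log"

	"github.com/gorilla/websocket"
)

// readLoop reads each message and hands it off to a goroutine immediately,
// so application processing never blocks the read path (and the pooled
// decompressor behind it) for the next message.
func readLoop(c *websocket.Conn, handle func([]byte)) {
	for {
		_, msg, err := c.ReadMessage()
		if err != nil {
			log.Println("read:", err)
			return
		}
		go handle(msg) // process without blocking the next read
	}
}
```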
If it is acceptable to add a dependency, fix by changing the package to use https://github.com/klauspost/compress instead of the standard library.

It's acceptable to add a dependency, but it's not clear to me how using this would improve the situation?
@jaitaiwan gzip (deflate) compression mostly relies on referencing previous data. Besides the window (which is typically extended, so it doesn't have to copy too much), it also keeps temporary buffers, so it doesn't have to allocate for every block, as well as an index of the window (typically the most memory intensive). For reference, here is the approximate "idle" usage of each encoder level, meaning the allocs made to create an instance and the memory used after a Write call before the Writer is closed:
So there are 3 variations (sketched after this list):

- A) Use a Stateless Writer. This works as a normal compressor but de-allocates everything between each Write call. This means there are 0 "idle" allocations, but there is a significant compression loss, and each Write will cause temporary allocations.
- B) Do your own buffering of the last data written, and use StatelessDeflate to provide the history. This will result in less compression loss, but still allocs on each write.
- C) Do your own buffering of the last data written, and use ResetDict before each write and flush+stash the encoders afterwards.
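For variation B, a rough sketch using github.com/klauspost/compress/flate, keeping a small per-connection history as the dictionary; the window size and types are assumptions:

```go
package example

import (
	"bytes"

	"github.com/klauspost/compress/flate"
)

const maxHistory = 32 << 10 // assumed 32 KB sliding window

// historyCompressor buffers the last data written so stateless compression
// can still reference recent output (variation B above).
type historyCompressor struct {
	hist []byte
}

func (h *historyCompressor) compress(msg []byte) ([]byte, error) {
	var buf bytes.Buffer
	// Compress msg, using the buffered history as the dictionary.
	if err := flate.StatelessDeflate(&buf, msg, false, h.hist); err != nil {
		return nil, err
	}
	// Append msg to the history and trim it to the window size.
	h.hist = append(h.hist, msg...)
	if len(h.hist) > maxHistory {
		h.hist = h.hist[len(h.hist)-maxHistory:]
	}
	return buf.Bytes(), nil
}
```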
Hi,
I started using Centrifugo in the past week.
I'm using the raw WebSocket endpoint, which uses this library under the hood.
I'm experiencing a situation where, when per-message-deflate is enabled, there is massive memory growth, up to the point where the Docker container crashes for using too much memory.
I'm running inside a Docker container, with on average 150K-200K concurrent users; my average message rate is between 30K-50K messages per sec, with an average message size of 600 bytes.
Without per-message-deflate there is no memory growth at all and performance is awesome, but the data transfer is very high.
Can anyone help me investigate it?
Thank you.
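For context, this is roughly how permessage-deflate is switched on with this package on the server side; the handler wiring is illustrative and the option shown is only the negotiation toggle discussed in this thread:

```go
package example

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
	// Allow the permessage-deflate extension to be negotiated with clients
	// that offer it.
	EnableCompression: true,
}

func wsHandler(w http.ResponseWriter, r *http.Request) {
	c, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	defer c.Close()
	// ... read/write loop ...
}
```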