sync: RWMutex scales poorly with CPU count #17973
Possibly of interest for this: http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

cc @dvyukov
It may be difficult to apply the algorithm described in that paper to our existing RWMutex. It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions.
Locking a per-P slot may be enough and is much simpler.
What happens when a goroutine moves to a different P between read-lock and read-unlock?

RLock must return a proxy object to unlock; that object must hold the locked P index.
The existing RWMutex API doesn't necessarily require a proxy object, though. (For example, you could envision an algorithm that attempts to unlock the slot for the current P, then falls back to a linear scan if the current P's slot wasn't already locked.)
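To make the shape of this proposal concrete, here is a minimal sketch of a sharded reader/writer lock whose read-lock returns a proxy. The name shardedRW and the caller-supplied shard hint are assumptions for illustration (Go doesn't expose the current P to user code, so a worker ID or similar has to stand in), not anything from this thread:

```go
package shardedrw

import (
	"runtime"
	"sync"
)

// shardedRW (hypothetical) distributes read locks across shards so
// concurrent readers don't contend on a single readerCount word.
type shardedRW struct {
	shards []paddedRWMutex
}

type paddedRWMutex struct {
	sync.RWMutex
	_ [40]byte // pad toward a cache line so shards don't false-share
}

func New() *shardedRW {
	return &shardedRW{shards: make([]paddedRWMutex, runtime.GOMAXPROCS(0))}
}

// RLock locks one shard, chosen by a non-negative caller-supplied
// hint, and returns that shard as the proxy object for unlocking.
// Because the proxy identifies the locked shard, it doesn't matter
// if the goroutine migrates to another P in between.
func (m *shardedRW) RLock(hint int) *sync.RWMutex {
	s := &m.shards[hint%len(m.shards)].RWMutex
	s.RLock()
	return s
}

// Lock acquires every shard's write lock, excluding all readers.
func (m *shardedRW) Lock() {
	for i := range m.shards {
		m.shards[i].Lock()
	}
}

// Unlock releases all shards.
func (m *shardedRW) Unlock() {
	for i := range m.shards {
		m.shards[i].Unlock()
	}
}
```

A worker would do s := m.RLock(workerID), then s.RUnlock() when done; the unlock is correct even if the goroutine has since moved to another P.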
At any rate: general application code can work around the problem (in part) by using per-goroutine or per-goroutine-pool caches rather than global caches shared throughout the process. The bigger issue is that RWMutex guards caches inside the standard library itself (such as the reflect caches mentioned below), where application code can't apply that workaround.
For these cases in the std lib, atomic.Value is the way to go.
I agree in general, but it's not obvious to me how one could use atomic.Value for that kind of cache.
What kind of cache do you mean?

E.g., the one in

See e.g.
Hmm... that trades a higher allocation rate (and O(N) insertion cost) in exchange for getting the lock out of the reader path. (It basically pushes the "read lock" out to the garbage collector.) And since most of these maps are insert-only (never deleted from), you can at least suspect that the O(N) insert won't be a huge drag: if there were many inserts, the maps would end up enormously large. It would be interesting to see whether the latency tradeoff favors the RWMutex or the atomic.Value approach in practice.
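For readers unfamiliar with the pattern being weighed here, this is a sketch of the copy-on-write map it describes; the names and the map's element type are illustrative, not taken from any particular std-lib cache:

```go
package cache

import (
	"sync"
	"sync/atomic"
)

// cowMap is a read-mostly cache: reads are lock-free loads of an
// immutable map; inserts copy the whole map (O(N)) under a mutex
// and publish the new version atomically. Old versions are
// reclaimed by the GC, which is what "pushes the read lock out to
// the garbage collector".
type cowMap struct {
	mu sync.Mutex   // serializes writers only
	v  atomic.Value // holds a map[string]int, never mutated after publish
}

func newCowMap() *cowMap {
	c := &cowMap{}
	c.v.Store(map[string]int{})
	return c
}

// Load never takes a lock: it reads the current published map.
func (c *cowMap) Load(k string) (int, bool) {
	m := c.v.Load().(map[string]int)
	x, ok := m[k]
	return x, ok
}

// Store copies the entire map and publishes the new version.
func (c *cowMap) Store(k string, x int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := c.v.Load().(map[string]int)
	m := make(map[string]int, len(old)+1)
	for k2, x2 := range old { // the O(N) copy: fine if inserts are rare
		m[k2] = x2
	}
	m[k] = x
	c.v.Store(m)
}
```

Every Store copies the whole map, so the pattern only pays off when reads vastly outnumber inserts, which is exactly the situation described above.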
@dvyukov Thanks. For the
Indeed. Seems like it might be worth an experiment. If nothing else, might end up being useful at go4.org.
Not always. atomic.Value can end up being a lot more code and complication. See CL 2641 for a worked example. For low-level performance-critical things like reflect, I'm all for atomic.Value, but for much of the rest of the standard library, it'd be nice to fix the scalability of RWMutex (or have a comparably easy-to-use alternative).
Note that in most of these cases insertions happen very, very infrequently: only during server warmup, when it receives a first request of a new type or something. Reads, on the other hand, happen all the time. Also, no matter how scalable RWMutex is, it still blocks all readers during updates, increasing latency and causing large overheads for blocking/unblocking.
Just benchmarked it on realistic benchmarks in my head. It is good :)

I would not say that it is radically more code and complication. Provided that one does it right the first time, rather than doing it non-scalably first and then refactoring everything.
CL https://golang.org/cl/33411 mentions this issue.
https://go-review.googlesource.com/#/c/33852/ has a draft for a more general API for maps of the sort used in the standard library; should I send that for review? (I'd put it in the
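As historical context: that draft API is the one that eventually shipped as sync.Map in Go 1.9. A typical read-mostly cache on top of it might look like the following; the regexp cache is an illustrative example, not one from this thread:

```go
package cache

import (
	"regexp"
	"sync"
)

// compiledPatterns caches compiled regexps. In the read-mostly
// steady state, Load takes no lock, so readers no longer serialize
// on a single RWMutex word.
var compiledPatterns sync.Map // conceptually map[string]*regexp.Regexp

func compiled(pat string) (*regexp.Regexp, error) {
	if v, ok := compiledPatterns.Load(pat); ok {
		return v.(*regexp.Regexp), nil
	}
	re, err := regexp.Compile(pat)
	if err != nil {
		return nil, err
	}
	// Two goroutines may compile concurrently; keep the first winner.
	v, _ := compiledPatterns.LoadOrStore(pat, re)
	return v.(*regexp.Regexp), nil
}
```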
@jonhoo built a more scalable RWMutex here: https://github.com/jonhoo/drwmutex/
One problem with https://github.com/jonhoo/drwmutex/ is that it doesn't handle the case where the user increases GOMAXPROCS at runtime (because New() just allocates a static slice of Mutexes).
@minux That would be easy enough to fix by using
Sure, but it wastes a significant amount of memory when the system has a lot of processors. A real per-CPU mutex should be tied to a P.
@minux I did at some point have a benchmark running on a modified version of Go that used the P index instead of CPUID. Unfortunately, I can't find that code any more, but from memory it got strictly worse performance than the CPUID-based solution. The situation could of course be different now, though.
Update: the two problems with RWMutex are that it has poor multi-core scalability (this issue) and that it's fairly big (it takes up 40% of a cache line on its own). I originally decided to look at multi-core scalability first, but upon reflection it makes more sense to tackle these problems in the other order. I intend to return to this issue once #37142 is resolved.
There is one more possible approach to making RWMutex more scalable for readers: the BRAVO (Biased Locking for Reader-Writer Locks) algorithm. It may be seen as a variation of D. Vyukov's DistributedRWMutex, yet the implementation is different, since it wraps a single RWMutex instance and uses an array of reader slots to distribute RLock attempts. It also returns a proxy object to readers and internally uses a sync.Pool to piggyback on its thread-local behavior (obviously, that's a user-land workaround, not something mandatory). As you'd expect, reader acquisitions scale better at the cost of more expensive writer locks. I'm posting this for the sake of listing all possible approaches.
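A condensed sketch of the BRAVO fast path, under several simplifying assumptions not in the paper or the thread: a fixed slot count, a caller-supplied slot hint instead of thread-local hashing via sync.Pool, and no adaptive re-biasing timeout:

```go
package bravo

import (
	"sync"
	"sync/atomic"
	"time"
)

// bravoRW wraps one sync.RWMutex with BRAVO-style reader slots.
// Readers try to claim a slot (fast path, no shared writes to the
// RWMutex); writers revoke the bias and wait for slots to drain.
type bravoRW struct {
	rbias atomic.Bool
	slots [128]atomic.Pointer[bravoRW] // nil = free; real code would pad slots to cache lines
	under sync.RWMutex
}

func newBravoRW() *bravoRW {
	b := &bravoRW{}
	b.rbias.Store(true)
	return b
}

// RLock acquires a read lock. hint (>= 0) picks a slot. The return
// value is the proxy to pass to RUnlock: a slot index on the fast
// path, or -1 if the underlying RWMutex was used.
func (b *bravoRW) RLock(hint int) int {
	if b.rbias.Load() {
		i := hint % len(b.slots)
		if b.slots[i].CompareAndSwap(nil, b) {
			if b.rbias.Load() {
				return i // fast path: shared RWMutex untouched
			}
			b.slots[i].Store(nil) // raced with a writer: undo, go slow
		}
	}
	b.under.RLock()
	return -1
}

func (b *bravoRW) RUnlock(proxy int) {
	if proxy >= 0 {
		b.slots[proxy].Store(nil)
		return
	}
	b.under.RUnlock()
}

func (b *bravoRW) Lock() {
	b.under.Lock()       // excludes slow-path readers and other writers
	b.rbias.Store(false) // close the fast path for new readers
	for i := range b.slots {
		for b.slots[i].Load() != nil { // wait out fast-path readers
			time.Sleep(time.Microsecond)
		}
	}
}

func (b *bravoRW) Unlock() {
	b.rbias.Store(true) // BRAVO proper re-enables the bias lazily
	b.under.Unlock()
}
```

Readers that win a slot never touch the shared RWMutex word; the writer pays for that by clearing the bias and scanning every slot, which is the "more expensive writer locks" tradeoff described above.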
An experiment using drwmutex [1] to speed up read-lock contention on 96 vCPUs, as observed in [2]. The final run of `kv95/enc=false/nodes=3/cpu=96` exhibited average throughput of 173413 ops/sec. That's worse than the implementation without RWMutex. It appears that the read lock, as implemented by Go's runtime, scales poorly to a high number of vCPUs [3]. On the other hand, the write lock under drwmutex requires acquiring 96 locks in this case, which appears to be the only bottleneck; the sharded read lock is efficient enough that it doesn't show up on the CPU profile. The only slowdown appears to be the write lock inside getStatsForStmtWithKeySlow, which is unavoidable. Although inconclusive, it appears that drwmutex doesn't scale well above a certain number of vCPUs when the write mutex is on a critical path.

[1] https://github.com/jonhoo/drwmutex
[2] cockroachdb#109443
[3] golang/go#17973

Epic: none
Release note: None
I made a benchmark, but I'm not sure the code is correct. Hope this helps. With 10 concurrent goroutines at 100k iterations each, RWMutex is fine for writes and slightly better for reads. https://github.com/ntsd/go-mutex-comparison?tab=readme-ov-file#test-scenarios
On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases. The issue's test program degrades by a factor of 8x as it saturates threads and cores, presumably due to cache contention on &rw.readerCount. A "control" test, calling a no-op function instead of the RWMutex methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.
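The test program itself didn't survive the copy above; a benchmark in the same spirit, assuming only the standard testing package, would be:

```go
package rwbench

import (
	"sync"
	"testing"
)

var rw sync.RWMutex

// BenchmarkRLock measures RLock/RUnlock throughput under parallel
// load; running with -cpu=1,4,16,64 varies GOMAXPROCS so the
// degradation from contention on the shared reader count is visible.
func BenchmarkRLock(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			rw.RLock()
			rw.RUnlock()
		}
	})
}

//go:noinline
func noop() {}

// BenchmarkNoop is the "control": the same parallel driver around a
// no-op call, separating scheduler overhead from mutex contention.
func BenchmarkNoop(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			noop()
		}
	})
}
```

Run with something like `go test -bench . -cpu 1,4,16,64`; per the numbers reported in this issue, ns/op should climb with the CPU count in the RWMutex case but stay flat for the control.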