sync: RWMutex scales poorly with CPU count #17973

Open
bcmills opened this issue Nov 18, 2016 · 52 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@bcmills
Contributor

bcmills commented Nov 18, 2016

On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases.

This test program:

package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

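func BenchmarkRWMutex(b *testing.B) {
	// ng is the number of reader goroutines: 1, 4, 16, 64, 256.
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			var mu sync.RWMutex
			// Hold the write lock so that every reader blocks until
			// all goroutines have been started.
			mu.Lock()

			var wg sync.WaitGroup
			wg.Add(ng)

			// Split the b.N iterations across the goroutines; the
			// last goroutine takes whatever remains.
			n := b.N
			quota := n / ng

			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}

				go func(quota int) {
					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}
					wg.Done()
				}(quota)

				n -= quota
			}

			if n != 0 {
				b.Fatalf("Incorrect quota assignments: %v remaining", n)
			}

			b.StartTimer()
			// Release all readers at once and wait for them to finish.
			mu.Unlock()
			wg.Wait()
			b.StopTimer()
		})
	}
}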

degrades by a factor of 8x as it saturates threads and cores, presumably due to cache contention on &rw.readerCount:

# ./benchmarks.test -test.bench . -test.cpu 1,4,16,64
testing: warning: no tests to run
BenchmarkRWMutex/1      20000000                72.6 ns/op
BenchmarkRWMutex/1-4    20000000                72.4 ns/op
BenchmarkRWMutex/1-16   20000000                72.8 ns/op
BenchmarkRWMutex/1-64   20000000                72.5 ns/op
BenchmarkRWMutex/4      20000000                72.6 ns/op
BenchmarkRWMutex/4-4    20000000               105 ns/op
BenchmarkRWMutex/4-16   10000000               130 ns/op
BenchmarkRWMutex/4-64   20000000               160 ns/op
BenchmarkRWMutex/16     20000000                72.4 ns/op
BenchmarkRWMutex/16-4   10000000               125 ns/op
BenchmarkRWMutex/16-16  10000000               263 ns/op
BenchmarkRWMutex/16-64   5000000               287 ns/op
BenchmarkRWMutex/64     20000000                72.6 ns/op
BenchmarkRWMutex/64-4   10000000               137 ns/op
BenchmarkRWMutex/64-16   5000000               306 ns/op
BenchmarkRWMutex/64-64   3000000               517 ns/op
BenchmarkRWMutex/256    20000000                72.4 ns/op
BenchmarkRWMutex/256-4  20000000               137 ns/op
BenchmarkRWMutex/256-16  5000000               280 ns/op
BenchmarkRWMutex/256-64  3000000               602 ns/op
PASS

A "control" test, calling a no-op function instead of RWMutex methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.

@josharian
Contributor

Possibly of interest for this: http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

@josharian
Contributor

cc @dvyukov

@ianlancetaylor
Member

It may be difficult to apply the algorithm described in that paper to our existing sync.RWMutex type. The algorithm requires an association between the read-lock operation and the read-unlock operation. It can be implemented by having read-lock/read-unlock always occur on the same thread or goroutine, or by having the read-lock operation return a pointer that is passed to the read-unlock operation. Basically the algorithm builds a tree to avoid contention, and requires each read-lock/read-unlock pair to operate on the same node of the tree.

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond. That is, it doesn't seem to require any special relationship with the runtime package.

@dvyukov
Member

dvyukov commented Nov 18, 2016

Locking a per-P slot may be enough and is much simpler:
https://codereview.appspot.com/4850045/diff2/1:3001/src/pkg/co/distributedrwmutex.go

@ianlancetaylor
Member

What happens when a goroutine moves to a different P between read-lock and read-unlock?

@dvyukov
Member

dvyukov commented Nov 18, 2016

RLock must return a proxy object that is used to unlock; that object must hold the index of the locked P.
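
To make the proxy-object idea concrete, here is a minimal sketch of a sharded reader/writer lock whose RLock returns a token remembering which shard was locked. Everything in it is an assumption for illustration: `ShardedRWMutex`, `RToken`, and the caller-supplied `hint` are hypothetical names, user code cannot read the current P's index (so the hint stands in for it), and the 64-byte padding is a guess at the cache-line size. It is not the code from the linked change.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// shard pads each RWMutex so that neighbouring shards are less likely to
// share a cache line. The 64-byte figure is an assumption.
type shard struct {
	mu sync.RWMutex
	_  [64]byte
}

// ShardedRWMutex is a hypothetical type, not part of package sync.
type ShardedRWMutex struct {
	shards []shard
}

// RToken is the proxy object: it remembers which shard the reader locked,
// so RUnlock works even if the goroutine has moved to another P.
type RToken struct {
	mu *sync.RWMutex
}

func NewShardedRWMutex(n int) *ShardedRWMutex {
	if n < 1 {
		n = 1
	}
	return &ShardedRWMutex{shards: make([]shard, n)}
}

// RLock read-locks the shard selected by hint. The hint stands in for the
// P index, which is not available outside the runtime.
func (m *ShardedRWMutex) RLock(hint uint) RToken {
	mu := &m.shards[hint%uint(len(m.shards))].mu
	mu.RLock()
	return RToken{mu: mu}
}

// RUnlock releases the shard recorded in the token.
func (t RToken) RUnlock() { t.mu.RUnlock() }

// Lock takes every shard in a fixed order, so writers pay O(shards).
func (m *ShardedRWMutex) Lock() {
	for i := range m.shards {
		m.shards[i].mu.Lock()
	}
}

// Unlock releases every shard.
func (m *ShardedRWMutex) Unlock() {
	for i := range m.shards {
		m.shards[i].mu.Unlock()
	}
}

func main() {
	m := NewShardedRWMutex(runtime.GOMAXPROCS(0))
	cache := map[string]int{"a": 1}

	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(w uint) {
			defer wg.Done()
			tok := m.RLock(w) // each worker uses its own id as the hint
			_ = cache["a"]
			tok.RUnlock()
		}(uint(w))
	}
	wg.Wait()

	m.Lock()
	cache["b"] = 2
	m.Unlock()
	fmt.Println(len(cache)) // prints 2
}
```

The trade-off is visible in Lock: readers touch one shard, but a writer has to acquire all of them.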

@bcmills
Contributor Author

bcmills commented Nov 18, 2016

The existing RWMutex API allows the RLock call to occur on a different goroutine from the RUnlock call. We can certainly assume that most RLock / RUnlock pairs occur on the same goroutine (and optimize for that case), but I think there needs to be a slow-path fallback for the general case.

(For example, you could envision an algorithm that attempts to unlock the slot for the current P, then falls back to a linear scan if the current P's slot wasn't already locked.)
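
A rough sketch of that fallback, covering only the reader-count bookkeeping: it assumes one counter per slot, tries the caller's current slot first, and scans the rest if that slot holds no readers. `readerSlots` and its methods are hypothetical names, writer handling is left out entirely, and the counters are not padded against false sharing.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// readerSlots is a hypothetical helper that tracks per-slot reader counts.
// It shows only the unlock fallback; it is not a complete lock.
type readerSlots struct {
	counts []int32
}

func newReaderSlots(n int) *readerSlots {
	return &readerSlots{counts: make([]int32, n)}
}

// acquire records one reader on the given slot.
func (s *readerSlots) acquire(slot int) {
	atomic.AddInt32(&s.counts[slot], 1)
}

// tryRelease decrements the slot's count if it is positive.
func (s *readerSlots) tryRelease(slot int) bool {
	for {
		n := atomic.LoadInt32(&s.counts[slot])
		if n <= 0 {
			return false
		}
		if atomic.CompareAndSwapInt32(&s.counts[slot], n, n-1) {
			return true
		}
	}
}

// release tries the current slot first (fast path) and falls back to a
// linear scan. It returns the slot that was decremented, or -1 if no
// slot held a reader.
func (s *readerSlots) release(current int) int {
	if s.tryRelease(current) {
		return current
	}
	for i := range s.counts {
		if s.tryRelease(i) {
			return i
		}
	}
	return -1
}

func main() {
	s := newReaderSlots(4)
	s.acquire(2)              // RLock ran while on slot 2
	fmt.Println(s.release(0)) // RUnlock runs on slot 0 and scans: prints 2
	fmt.Println(s.release(0)) // nothing left to release: prints -1
}
```

Note that the fallback may decrement a slot other than the one the reader originally incremented; only the total across slots is preserved, which is what a writer would need to observe.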

@bcmills
Contributor Author

bcmills commented Nov 18, 2016

At any rate: general application code can work around the problem (in part) by using per-goroutine or per-goroutine-pool caches rather than global caches shared throughout the process.
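
As an illustration of that workaround, the sketch below puts an unlocked per-goroutine cache in front of a shared RWMutex-guarded cache, so the shared lock is only touched on a local miss. The type names are hypothetical, and it assumes cached values are immutable and never invalidated.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedCache is the process-wide cache guarded by an RWMutex.
type sharedCache struct {
	mu sync.RWMutex
	m  map[string]int
}

func (c *sharedCache) get(key string, compute func(string) int) int {
	c.mu.RLock()
	v, ok := c.m[key]
	c.mu.RUnlock()
	if ok {
		return v
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have filled the entry meanwhile.
	if v, ok := c.m[key]; ok {
		return v
	}
	v = compute(key)
	c.m[key] = v
	return v
}

// localCache is owned by a single goroutine, so it needs no locking.
type localCache struct {
	shared *sharedCache
	m      map[string]int
}

func (l *localCache) get(key string, compute func(string) int) int {
	if v, ok := l.m[key]; ok {
		return v // fast path: no shared memory touched
	}
	v := l.shared.get(key, compute)
	l.m[key] = v
	return v
}

func main() {
	shared := &sharedCache{m: make(map[string]int)}
	results := make([]int, 4)

	var wg sync.WaitGroup
	for w := range results {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			local := &localCache{shared: shared, m: make(map[string]int)}
			for i := 0; i < 1000; i++ {
				results[w] = local.get("hello", func(s string) int { return len(s) })
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(results) // prints [5 5 5 5]
}
```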

The bigger issue is that sync.RWMutex is used fairly extensively within the standard library for package-level locks (the various caches in reflect, http.statusMu, json.encoderCache, mime.mimeLock, etc.), so it's easy for programs to fall into contention traps and hard to apply workarounds without avoiding large portions of the standard library. For those use-cases, it might actually be feasible to switch to something with a different API (such as having RLock return an unlocker).

@dvyukov
Member

dvyukov commented Nov 18, 2016

For these cases in std lib atomic.Value is a much better fit. It is already used in json, gob and http. atomic.Value is perfectly scalable and has virtually zero overhead for readers.

@bcmills
Contributor Author

bcmills commented Nov 18, 2016

I agree in general, but it's not obvious to me how one could use atomic.Value to guard lookups in a map acting as a cache. (It's perfect for maps which do not change, but how would you add new entries to the caches with that approach?)

@dvyukov
Member

dvyukov commented Nov 18, 2016

What kind of cache do you mean?

@ianlancetaylor
Member

E.g., the one in reflect.ptrTo.

@dvyukov
Member

dvyukov commented Nov 18, 2016

See e.g. encoding/json/encode.go:cachedTypeFields

@bcmills
Contributor Author

bcmills commented Nov 18, 2016

Hmm... that trades a higher allocation rate (and O(N) insertion cost) in exchange for getting the lock out of the reader path. (It basically pushes the "read lock" out to the garbage collector.)
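
For readers who have not seen the pattern, a minimal copy-on-write cache built on atomic.Value might look like the sketch below. It is written in the spirit of the approach under discussion rather than copied from encoding/json; `cowCache` and its methods are hypothetical names. Readers do a single atomic load, while each insert copies the whole map under a mutex, which is where the O(N) insertion cost and the extra allocation come from.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowCache is a hypothetical copy-on-write cache.
type cowCache struct {
	v  atomic.Value // holds a map[string]int that is never mutated after Store
	mu sync.Mutex   // serializes writers only; readers never take it
}

func newCowCache() *cowCache {
	c := &cowCache{}
	c.v.Store(map[string]int{})
	return c
}

// Load reads from the current snapshot without taking any lock.
func (c *cowCache) Load(key string) (int, bool) {
	m := c.v.Load().(map[string]int)
	v, ok := m[key]
	return v, ok
}

// Store copies the current map, adds the entry, and publishes the copy.
// The old snapshot is left for the garbage collector once readers drop it.
func (c *cowCache) Store(key string, val int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := c.v.Load().(map[string]int)
	next := make(map[string]int, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[key] = val
	c.v.Store(next)
}

func main() {
	c := newCowCache()
	c.Store("a", 1)
	c.Store("b", 2)

	v, ok := c.Load("b")
	fmt.Println(v, ok) // prints 2 true

	_, ok = c.Load("missing")
	fmt.Println(ok) // prints false
}
```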

And since most of these maps are insert-only (never deleted from), you can at least suspect that the O(N) insert won't be a huge drag: if there were many inserts, the maps would end up enormously large.

It would be interesting to see whether the latency tradeoff favors the RWMutex overhead or the O(N) insert overhead for more of the standard library.

@ianlancetaylor
Member

@dvyukov Thanks. For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 18, 2016
@quentinmit quentinmit added this to the Go1.8Maybe milestone Nov 18, 2016
@josharian
Contributor

> It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond.

Indeed. Seems like it might be worth an experiment. If nothing else, might end up being useful at go4.org.

> For these cases in std lib atomic.Value is a much better fit.

Not always. atomic.Value can end up being a lot more code and complication. See CL 2641 for a worked example. For low-level, performance-critical things like reflect, I'm all for atomic.Value, but for much of the rest of the standard library, it'd be nice to fix the scalability of RWMutex (or have a comparably easy-to-use alternative).

@dvyukov
Member

dvyukov commented Nov 21, 2016

Note that in most of these cases insertions happen very, very infrequently: only during server warmup, when it receives a first request of a new type or something, while reads happen all the time. Also, no matter how scalable RWMutex is, it still blocks all readers during updates, increasing latency and causing large overheads for blocking/unblocking.

> For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

Just benchmarked it on realistic benchmarks in my head. It is good :)

@dvyukov
Member

dvyukov commented Nov 21, 2016

> See CL 2641 for a worked example.

I would not say that it is radically more code and complication, provided that one does it right the first time rather than doing it non-scalably first and then refactoring everything.

@rsc rsc modified the milestones: Go1.9Early, Go1.8Maybe Nov 21, 2016
@gopherbot
Contributor

CL https://golang.org/cl/33411 mentions this issue.

@bcmills
Contributor Author

bcmills commented Dec 1, 2016

https://go-review.googlesource.com/#/c/33852/ has a draft for a more general API for maps of the sort used in the standard library; should I send that for review? (I'd put it in the x/sync repo for now so we can do some practical experiments.)

@davidlazar
Member

@jonhoo built a more scalable RWMutex here: https://github.com/jonhoo/drwmutex/

@minux
Member

minux commented Dec 3, 2016 via email

@bcmills
Contributor Author

bcmills commented Dec 3, 2016

@minux That would be easy enough to fix by using runtime.NumCPU() instead.

@minux
Member

minux commented Dec 3, 2016 via email

@jonhoo

jonhoo commented Dec 3, 2016

@minux I did at some point have a benchmark running on a modified version of Go that used P instead of CPUID. Unfortunately, I can't find that code any more, but from memory it got strictly worse performance than the CPUID-based solution. The situation could of course be different now though.

@balasanjay
Contributor

Update: the two problems with RWMutex are that it has poor multi-core scalability (this issue) and that it's fairly big (it takes up 40% of a cache line on its own). I originally decided to look at multi-core scalability first, but upon reflection, it makes more sense to tackle these problems in the other order. I intend to return to this issue once #37142 is resolved.

@puzpuzpuz

There is one more possible approach to making RWMutex more scalable for readers - BRAVO (Biased Locking for Reader-Writer Locks) algorithm:
https://github.com/puzpuzpuz/xsync#rbmutex

It may be seen as a variation of D.Vyukov's DistributedRWMutex. Yet, the implementation is different since it wraps a single RWMutex instance and uses an array of reader slots to distribute the RLock attempts. It also returns a proxy object to the readers and internally uses a sync.Pool to piggyback on its thread-local behavior (obviously, that's a user-land workaround, not something mandatory).

As you'd expect, reader acquires scale better at the cost of more expensive writer locks.

I'm posting this for the sake of listing all possible approaches.

ngergs added a commit to ngergs/websrv that referenced this issue Mar 3, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
srosenberg added a commit to srosenberg/cockroach that referenced this issue Aug 27, 2023
An experiment using drwmutex [1] to speed up read lock contention
on 96 vCPUs, as observed in [2]. The final run of
`kv95/enc=false/nodes=3/cpu=96` exhibited average
throughput of 173413 ops/sec. That's worse than the implementation
without RWMutex. It appears that the read lock, as implemented by
Go's runtime, scales poorly to a high number of vCPUs [3].
On the other hand, the write lock under drwmutex requires
acquiring 96 locks in this case, which appears to be the only
bottleneck; the sharded read lock is optimal enough that it
doesn't show up on the cpu profile. The only slowdown
appears to be the write lock inside getStatsForStmtWithKeySlow
which is unavoidable. Although inconclusive, it appears that
drwmutex doesn't scale well above a certain number of vCPUs,
when the write mutex is on a critical path.

[1] https://github.com/jonhoo/drwmutex
[2] cockroachdb#109443
[3] golang/go#17973

Epic: none

Release note: None
@ntsd

ntsd commented Dec 22, 2023

I made a benchmark, but I'm not sure the code is correct. Hope this helps.

At a concurrency of 10 with 100k iterations each, RWMutex is fine on writes and slightly better on reads.
At a concurrency of 100 with 10k iterations each, RWMutex is slower on writes and impressive on reads.
What surprised me is that sync.Map did better at higher concurrency.

https://github.com/ntsd/go-mutex-comparison?tab=readme-ov-file#test-scenarios
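
For context on the sync.Map observation: its documentation describes it as optimized for entries that are written once and read many times, which matches the insert-rarely caches discussed earlier in this thread. The sketch below shows that usage with a hypothetical `cachedLen` helper; it is not taken from the linked benchmark and makes no claim about its results.

```go
package main

import (
	"fmt"
	"sync"
)

// lengths maps string -> int. Entries are written once and then only read.
var lengths sync.Map

// cachedLen is a hypothetical helper that caches len(s) per key.
func cachedLen(s string) int {
	if v, ok := lengths.Load(s); ok {
		return v.(int) // common path: a read with no explicit lock
	}
	// LoadOrStore keeps whichever value was stored first if several
	// goroutines miss at the same time.
	v, _ := lengths.LoadOrStore(s, len(s))
	return v.(int)
}

func main() {
	var wg sync.WaitGroup
	for g := 0; g < 100; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				cachedLen("hello")
			}
		}()
	}
	wg.Wait()
	fmt.Println(cachedLen("hello")) // prints 5
}
```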
