
iterator: optimize zeroing of iterAlloc in Iterator.Close #4187

Merged

Conversation


@nvanbenschoten nvanbenschoten commented Dec 2, 2024

This commit optimizes the process of zeroing the iterAlloc struct in the Iterator.Close method.

The function resets the alloc struct, re-assigns the fields that are being recycled, and then returns it to the pool. We now split the first two steps instead of doing them in a single step (e.g. `*alloc = iterAlloc{...}`). With the single-step form, the compiler has to use a stack-allocated autotmp iterAlloc variable (~12KB, as of Dec 2024), which must first be zeroed out, then assigned into, and then copied over into the heap-allocated alloc. The two-step process instead allows the compiler to quickly zero out the heap-allocated object and then assign the few fields we want to preserve.
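
A minimal sketch of the before/after shapes (the field names and padding below are illustrative only, not the actual iterAlloc definition):

```
package sketch

// iterAlloc here is a stand-in for Pebble's pooled allocation struct; the
// field names and sizes are hypothetical.
type iterAlloc struct {
	keyBuf   []byte
	boundBuf []byte
	scratch  [1500]uint64 // padding so the struct is large, like the real ~12KB alloc
}

// closeOneStep is the old pattern: a single composite-literal assignment.
// Because the literal reads fields of *alloc, the compiler materializes a
// zeroed autotmp copy on the stack, writes the recycled fields into it, and
// then copies the whole thing into *alloc.
func closeOneStep(alloc *iterAlloc) {
	*alloc = iterAlloc{
		keyBuf:   alloc.keyBuf,
		boundBuf: alloc.boundBuf,
	}
}

// closeTwoStep is the pattern adopted here: save the recycled fields, zero
// the heap-allocated struct in place, then re-assign the saved fields.
func closeTwoStep(alloc *iterAlloc) {
	keyBuf, boundBuf := alloc.keyBuf, alloc.boundBuf
	*alloc = iterAlloc{}
	alloc.keyBuf = keyBuf
	alloc.boundBuf = boundBuf
}
```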

```
name                                           old time/op    new time/op    delta
Sysbench/KV/1node_local/oltp_read_only-10         322µs ± 7%     310µs ± 3%  -3.63%  (p=0.001 n=10+9)
Sysbench/KV/1node_local/oltp_point_select-10     15.1µs ± 5%    14.6µs ± 6%  -3.04%  (p=0.043 n=10+10)
Sysbench/KV/1node_local/oltp_read_write-10        823µs ± 2%     808µs ± 1%  -1.82%  (p=0.006 n=9+9)
Sysbench/KV/1node_local/oltp_write_only-10        424µs ± 3%     426µs ± 6%    ~     (p=0.796 n=10+10)
Sysbench/SQL/1node_local/oltp_read_only-10       1.69ms ± 3%    1.70ms ± 9%    ~     (p=0.720 n=9+10)
Sysbench/SQL/1node_local/oltp_point_select-10     109µs ± 6%     107µs ± 2%    ~     (p=0.133 n=10+9)
Sysbench/SQL/1node_local/oltp_read_write-10      4.04ms ± 3%    4.05ms ± 5%    ~     (p=0.971 n=10+10)
Sysbench/SQL/1node_local/oltp_write_only-10      1.54ms ± 4%    1.55ms ± 6%    ~     (p=0.853 n=10+10)
```

I've left a TODO which describes a potential further optimization, whereby we can avoid zeroing the iterAlloc struct entirely in Iterator.Close.


Interestingly, as part of understanding why this was faster, I also found that on arm64, a zeroing loop has one fewer instruction per 128-bit chunk than a memcpy loop. Both loops use a post-indexed addressing mode for their loads and stores, which avoids the need for separate increment instructions.

```
; zeroing loop
STP.P	(ZR, ZR), 16(R16)  # address 1848
CMP	R14, R16
BLE	1848

; memcpy loop
LDP.P	16(R16), (R25, R27)  # address 1880
STP.P	(R25, R27), 16(R17)
CMP	R3, R16
BLE	1880
```
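
For anyone who wants to look at the generated code themselves, dumping the compiler output is one way to do it (this uses the same `-gcflags -S` flag as the frame-size audit further down; the exact invocation here is just a suggestion):

```
GOARCH=arm64 go build -gcflags=-S . 2>&1 | less
```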

@nvanbenschoten added the o-perf-efficiency (Related to performance efficiency) label Dec 2, 2024
@nvanbenschoten requested a review from a team as a code owner December 2, 2024 23:00

@RaduBerinde (Member) left a comment

:lgtm: Nice find!

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @jbowens)

@jbowens (Collaborator) left a comment

:lgtm_strong:

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed (commit messages unreviewed), all discussions resolved (waiting on @nvanbenschoten)

@nvanbenschoten nvanbenschoten merged commit 535eb0f into cockroachdb:master Dec 3, 2024
23 checks passed
@petermattis (Collaborator) left a comment

Interesting. We have 54 other instances of this zeroing pattern in Pebble, and 198 instances in CRDB. Is it worth a quick audit to see if any of the other cases are on hot paths?

```
~/go/src/github.com/cockroachdb/pebble master git grep -E '\*[a-zA-Z0-9]+ = [a-zA-Z0-9]+{' | grep -v -E '[a-zA-Z0-9]+{}' | grep -v _test.go
batch.go:       *iter = batchIter{
db.go:  *get = getIter{
db.go:  *i = Iterator{
db.go:  *dbi = Iterator{
...
```

Reviewable status: all files reviewed (commit messages unreviewed), all discussions resolved

@nvanbenschoten (Member, Author)

Here's a listing of the top 10 largest stack frames (before this PR), to give an indication of whether this is a problem elsewhere:

```
➜ go build -gcflags -S . 2>&1 | rg 'ABIInternal' | awk '{print $7 " " $5}' | sed 's/-/ /g' | sed 's/\$//g' | sort -nr | head
17008 8 github.com/cockroachdb/pebble.(*Iterator).Close(SB),
16800 8 github.com/cockroachdb/pebble.(*scanInternalIterator).close(SB),
8656 80 github.com/cockroachdb/pebble.(*compactionPickerByScore).pickAuto(SB),
6288 32 github.com/cockroachdb/pebble.(*pickedCompaction).setupInputs(SB),
4336 88 github.com/cockroachdb/pebble.(*compactionPickerByScore).getScores(SB),
4080 8 github.com/cockroachdb/pebble.(*DB).collectTableStats(SB),
3616 24 github.com/cockroachdb/pebble.(*compaction).newInputIters(SB),
3360 24 github.com/cockroachdb/pebble.Open(SB),
3344 152 github.com/cockroachdb/pebble.ingestLoad1(SB),
3296 48 github.com/cockroachdb/pebble.(*scanInternalIterator).constructPointIter(SB),
```

Notice that `(*Iterator).Close` (fixed here) was the largest offender. `(*scanInternalIterator).close` also contains a `*alloc = iterAlloc{...}`, so we might want to make the same change there, though I don't think that's on a hot path.

There isn't much below that which seems to be on a hot path either. Most of the functions are related to compaction.

I'll run a similar audit in crdb.
