
iterator: optimize zeroing of iterAlloc in Iterator.Close #4187

Merged

Conversation


@nvanbenschoten nvanbenschoten commented Dec 2, 2024

This commit optimizes the process of zeroing the iterAlloc struct in the Iterator.Close method.

The function resets the alloc struct, re-assigns the fields that are being recycled, and then returns it to the pool. We now split the first two steps instead of doing them in a single step (e.g. `*alloc = iterAlloc{...}`). With the single-step form, the compiler has to use a stack-allocated autotmp iterAlloc variable (~12KB, as of Dec 2024), which must first be zeroed out, then assigned into, and then copied over into the heap-allocated alloc. The two-step process instead allows the compiler to quickly zero out the heap-allocated object and then assign the few fields we want to preserve.
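
A minimal sketch of the before/after shapes (the field names and padding below are illustrative only, not the actual iterAlloc definition):

```
package sketch

// iterAlloc here is a stand-in for Pebble's pooled allocation struct; the
// field names and sizes are hypothetical.
type iterAlloc struct {
	keyBuf   []byte
	boundBuf []byte
	scratch  [1500]uint64 // padding so the struct is large, like the real ~12KB alloc
}

// closeOneStep is the old pattern: a single composite-literal assignment.
// Because the literal reads fields of *alloc, the compiler materializes a
// zeroed autotmp copy on the stack, writes the recycled fields into it, and
// then copies the whole thing into *alloc.
func closeOneStep(alloc *iterAlloc) {
	*alloc = iterAlloc{
		keyBuf:   alloc.keyBuf,
		boundBuf: alloc.boundBuf,
	}
}

// closeTwoStep is the pattern adopted here: save the recycled fields, zero
// the heap-allocated struct in place, then re-assign the saved fields.
func closeTwoStep(alloc *iterAlloc) {
	keyBuf, boundBuf := alloc.keyBuf, alloc.boundBuf
	*alloc = iterAlloc{}
	alloc.keyBuf = keyBuf
	alloc.boundBuf = boundBuf
}
```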

```
name                                           old time/op    new time/op    delta
Sysbench/KV/1node_local/oltp_read_only-10         322µs ± 7%     310µs ± 3%  -3.63%  (p=0.001 n=10+9)
Sysbench/KV/1node_local/oltp_point_select-10     15.1µs ± 5%    14.6µs ± 6%  -3.04%  (p=0.043 n=10+10)
Sysbench/KV/1node_local/oltp_read_write-10        823µs ± 2%     808µs ± 1%  -1.82%  (p=0.006 n=9+9)
Sysbench/KV/1node_local/oltp_write_only-10        424µs ± 3%     426µs ± 6%    ~     (p=0.796 n=10+10)
Sysbench/SQL/1node_local/oltp_read_only-10       1.69ms ± 3%    1.70ms ± 9%    ~     (p=0.720 n=9+10)
Sysbench/SQL/1node_local/oltp_point_select-10     109µs ± 6%     107µs ± 2%    ~     (p=0.133 n=10+9)
Sysbench/SQL/1node_local/oltp_read_write-10      4.04ms ± 3%    4.05ms ± 5%    ~     (p=0.971 n=10+10)
Sysbench/SQL/1node_local/oltp_write_only-10      1.54ms ± 4%    1.55ms ± 6%    ~     (p=0.853 n=10+10)
```

I've left a TODO which describes a potential further optimization, whereby we can avoid zeroing the iterAlloc struct entirely in Iterator.Close.


Interestingly, as part of understanding why this was faster, I also found that on arm64, a zeroing loop has one fewer instruction per 128-bit chunk than a memcpy loop. Both loops use a post-indexed addressing mode for their loads and stores, which avoids the need for separate increment instructions.

```
; zeroing loop
STP.P	(ZR, ZR), 16(R16)  # address 1848
CMP	R14, R16
BLE	1848

; memcpy loop
LDP.P	16(R16), (R25, R27)  # address 1880
STP.P	(R25, R27), 16(R17)
CMP	R3, R16
BLE	1880
```
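
For anyone who wants to look at the generated code themselves, dumping the compiler output is one way to do it (this uses the same `-gcflags -S` flag as the frame-size audit further down; the exact invocation here is just a suggestion):

```
GOARCH=arm64 go build -gcflags=-S . 2>&1 | less
```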

@nvanbenschoten added the o-perf-efficiency (Related to performance efficiency) label Dec 2, 2024
@nvanbenschoten requested a review from a team as a code owner December 2, 2024 23:00

@RaduBerinde (Member) left a comment

:lgtm: Nice find!

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @jbowens)

@jbowens (Collaborator) left a comment

:lgtm_strong:

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed (commit messages unreviewed), all discussions resolved (waiting on @nvanbenschoten)

@nvanbenschoten nvanbenschoten merged commit 535eb0f into cockroachdb:master Dec 3, 2024
23 checks passed
@petermattis (Collaborator) left a comment

Interesting. We have 54 other instances of this zeroing pattern in Pebble, and 198 instances in CRDB. Is it worth a quick audit to see if any of the other cases are on hot paths?

```
~/go/src/github.com/cockroachdb/pebble master git grep -E '\*[a-zA-Z0-9]+ = [a-zA-Z0-9]+{' | grep -v -E '[a-zA-Z0-9]+{}' | grep -v _test.go
batch.go:       *iter = batchIter{
db.go:  *get = getIter{
db.go:  *i = Iterator{
db.go:  *dbi = Iterator{
...
```

Reviewable status: all files reviewed (commit messages unreviewed), all discussions resolved

@nvanbenschoten (Member, Author)

Here's a listing of the top 10 largest stack frames (before this PR), to give an indication of whether this is a problem elsewhere:

```
➜ go build -gcflags -S . 2>&1 | rg 'ABIInternal' | awk '{print $7 " " $5}' | sed 's/-/ /g' | sed 's/\$//g' | sort -nr | head
17008 8 github.com/cockroachdb/pebble.(*Iterator).Close(SB),
16800 8 github.com/cockroachdb/pebble.(*scanInternalIterator).close(SB),
8656 80 github.com/cockroachdb/pebble.(*compactionPickerByScore).pickAuto(SB),
6288 32 github.com/cockroachdb/pebble.(*pickedCompaction).setupInputs(SB),
4336 88 github.com/cockroachdb/pebble.(*compactionPickerByScore).getScores(SB),
4080 8 github.com/cockroachdb/pebble.(*DB).collectTableStats(SB),
3616 24 github.com/cockroachdb/pebble.(*compaction).newInputIters(SB),
3360 24 github.com/cockroachdb/pebble.Open(SB),
3344 152 github.com/cockroachdb/pebble.ingestLoad1(SB),
3296 48 github.com/cockroachdb/pebble.(*scanInternalIterator).constructPointIter(SB),
```

Notice that `(*Iterator).Close` (fixed here) was the largest offender. `(*scanInternalIterator).close` also contains a `*alloc = iterAlloc{...}`, so we might want to make the same change there, though I don't think that's on a hot path.

There isn't much below that which seems to be on a hot path either. Most of the functions are related to compaction.

I'll run a similar audit in crdb.
