iterator: optimize zeroing of iterAlloc in Iterator.Close
#4187
Conversation
This commit optimizes the zeroing of the iterAlloc struct in the Iterator.Close method. The function resets the alloc struct, re-assigns the fields that are being recycled, and then returns it to the pool. We now split the first two steps instead of performing them in a single step (e.g. `*alloc = iterAlloc{...}`). This is because the compiler can then avoid the use of a stack-allocated autotmp iterAlloc variable (~12KB, as of Dec 2024), which must first be zeroed out, then assigned into, then copied over into the heap-allocated alloc. The two-step process instead allows the compiler to quickly zero out the heap-allocated object and then assign the few fields we want to preserve.

```
name                                           old time/op  new time/op  delta
Sysbench/KV/1node_local/oltp_read_only-10      322µs ± 7%   310µs ± 3%   -3.63%  (p=0.001 n=10+9)
Sysbench/KV/1node_local/oltp_point_select-10   15.1µs ± 5%  14.6µs ± 6%  -3.04%  (p=0.043 n=10+10)
Sysbench/KV/1node_local/oltp_read_write-10     823µs ± 2%   808µs ± 1%   -1.82%  (p=0.006 n=9+9)
Sysbench/KV/1node_local/oltp_write_only-10     424µs ± 3%   426µs ± 6%     ~     (p=0.796 n=10+10)
Sysbench/SQL/1node_local/oltp_read_only-10     1.69ms ± 3%  1.70ms ± 9%     ~     (p=0.720 n=9+10)
Sysbench/SQL/1node_local/oltp_point_select-10  109µs ± 6%   107µs ± 2%     ~     (p=0.133 n=10+9)
Sysbench/SQL/1node_local/oltp_read_write-10    4.04ms ± 3%  4.05ms ± 5%     ~     (p=0.971 n=10+10)
Sysbench/SQL/1node_local/oltp_write_only-10    1.54ms ± 4%  1.55ms ± 6%     ~     (p=0.853 n=10+10)
```

I've left a TODO which describes a potential further optimization, whereby we can avoid zeroing the iterAlloc struct entirely in Iterator.Close.

----

Interestingly, as part of understanding why this was faster, I also found that on arm64, a zeroing loop has one fewer instruction per 128-bit chunk than a memcpy loop: the zeroing loop can store the zero register (ZR) pair directly, while the memcpy loop needs an extra load into scratch registers. Both loops use a post-indexed addressing mode for their loads and stores, which avoids the need for separate increment instructions.

```
// zeroing loop (3 instructions per 16-byte chunk)
STP.P   (ZR, ZR), 16(R16)    # address 1848
CMP     R14, R16
BLE     1848

// memcpy loop (4 instructions per 16-byte chunk)
LDP.P   16(R16), (R25, R27)  # address 1880
STP.P   (R25, R27), 16(R17)
CMP     R3, R16
BLE     1880
```
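To make the two forms concrete, here is a minimal, self-contained sketch of the pattern. Note that `iterAlloc` below is a hypothetical stand-in: the field names are invented, and the padding array only mimics the real struct's ~12KB footprint in pebble.

```go
package main

// iterAlloc is a hypothetical stand-in for pebble's iterAlloc: the field
// names are invented, and the padding array mimics the real struct's
// ~12KB footprint.
type iterAlloc struct {
	keyBuf    []byte
	boundsBuf []byte
	scratch   [12 << 10]byte
}

// resetSingleStep shows the old form: a single composite-literal
// assignment. The compiler materializes a ~12KB autotmp on the stack,
// zeroes it, fills in the preserved fields, and copies it into *alloc.
func resetSingleStep(alloc *iterAlloc) {
	*alloc = iterAlloc{
		keyBuf:    alloc.keyBuf,
		boundsBuf: alloc.boundsBuf,
	}
}

// resetTwoStep shows the new form: zero the heap-allocated struct in
// place, then re-assign the few fields being recycled. There is no
// autotmp and no extra 12KB copy.
func resetTwoStep(alloc *iterAlloc) {
	keyBuf, boundsBuf := alloc.keyBuf, alloc.boundsBuf
	*alloc = iterAlloc{}
	alloc.keyBuf = keyBuf
	alloc.boundsBuf = boundsBuf
}

func main() {
	alloc := new(iterAlloc)
	resetSingleStep(alloc)
	resetTwoStep(alloc)
}
```

Per the description above, the two-step form lets `*alloc = iterAlloc{}` compile down to a direct zeroing of the heap object (the STP-of-ZR loop shown above on arm64), while the single-step form routes through a stack temporary; the generated code can be compared with `go build -gcflags=-S`.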
Interesting. We have 54 other instances of this zeroing pattern in Pebble, and 198 instances in CRDB. Is it worth a quick audit to see if any of the other cases are on hot paths?

```
~/go/src/github.com/cockroachdb/pebble (master) $ git grep -E '\*[a-zA-Z0-9]+ = [a-zA-Z0-9]+{' | grep -v -E '[a-zA-Z0-9]+{}' | grep -v _test.go
batch.go: *iter = batchIter{
db.go: *get = getIter{
db.go: *i = Iterator{
db.go: *dbi = Iterator{
...
```
Here's a listing of the top 10 largest stack frames (before this PR), to give an indication of whether this is a problem elsewhere.
Notice that there isn't much below that which seems to be on a hot path either. Most of the functions are related to compaction. I'll run a similar audit in crdb.
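For reference, a sketch of one way to produce a listing like this with stock tooling (assumed here, not necessarily how the listing above was generated): the Go compiler's `-S` dump writes a header line per function in which the `locals=` field is the frame size in hex, so grepping those headers gives per-function frame sizes to sort through.

```
go build -gcflags=-S ./... 2>&1 | grep STEXT
```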