-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize write with as_const_str for shorter code #122059
Conversation
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
Optimize write with as_const_str for shorter code Following up on rust-lang#121001 Apparently this code generates significant code block for each call to `write()` with non-simple formatting string - approx 100 lines of assembly code, possibly due to `dyn` (?). See generated assembly code [here](https://github.com/nyurik/rust-optimize-format-str/compare/before-changes..with-my-change#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477): <details><summary>Details</summary> <p> This is the inlining of `write!(buffer, "Iteration {value} was written")` ```asm core::fmt::Write::write_fmt: // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 194 fn write_fmt(&mut self, args: Arguments<'_>) -> Result { push r15 push r14 push r13 push r12 push rbx mov rdx, rsi // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 427 match (self.pieces, self.args) { mov rcx, qword ptr [rsi + 8] mov rax, qword ptr [rsi + 24] // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 428 ([], []) => Some(""), cmp rcx, 1 je .LBB0_8 test rcx, rcx jne .LBB0_9 test rax, rax jne .LBB0_9 // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911 self.buf.reserve(self.len, additional); lea r12, [rdi + 16] lea rsi, [rip + .L__unnamed_2] xor ebx, ebx .LBB0_6: mov r14, qword ptr [r12] jmp .LBB0_7 .LBB0_8: // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429 ([s], []) => Some(s), test rax, rax je .LBB0_4 .LBB0_9: // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 1108 if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) } lea rsi, [rip + .L__unnamed_1] pop rbx pop r12 pop r13 pop r14 pop r15 jmp qword ptr [rip + core::fmt::write_internal@GOTPCREL] .LBB0_4: mov rax, qword ptr [rdx] // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429 ([s], []) => Some(s), mov rsi, qword ptr [rax] mov rbx, qword ptr [rax + 8] // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 248 if T::IS_ZST { usize::MAX } else { self.cap.0 } mov rax, qword ptr [rdi] // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911 self.buf.reserve(self.len, additional); mov r14, qword ptr [rdi + 16] // /home/nyurik/dev/rust/rust/library/core/src/num/mod.rs : 1281 uint_impl! { sub rax, r14 // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 392 additional > self.capacity().wrapping_sub(len) cmp rax, rbx // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 309 if self.needs_to_grow(len, additional) { jb .LBB0_5 .LBB0_7: mov rax, qword ptr [rdi + 8] // /home/nyurik/dev/rust/rust/library/core/src/ptr/mut_ptr.rs : 1046 unsafe { intrinsics::offset(self, count) } add rax, r14 mov r15, rdi // /home/nyurik/dev/rust/rust/library/core/src/intrinsics.rs : 2922 copy_nonoverlapping(src, dst, count) mov rdi, rax mov rdx, rbx call qword ptr [rip + memcpy@GOTPCREL] // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 2040 self.len += count; add r14, rbx mov qword ptr [r15 + 16], r14 // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 216 } xor eax, eax pop rbx pop r12 pop r13 pop r14 pop r15 ret .LBB0_5: // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911 self.buf.reserve(self.len, additional); lea r12, [rdi + 16] mov r15, rdi mov r13, rsi // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 310 do_reserve_and_handle(self, len, additional); mov rsi, r14 mov rdx, rbx call alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle mov rsi, r13 mov rdi, r15 jmp .LBB0_6 ``` </p> </details> ```rust #[inline] pub fn write(output: &mut dyn Write, args: Arguments<'_>) -> Result { if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) } } ``` So, this brings back the older experiment - where I used `if core::intrinsics::is_val_statically_known(s.is_some()) { s } else { None }` helper function, and called it in multiple places that used `write`. This is not as optimal because now every user of `write` must do this logic, but at least it results in significantly smaller assembly code for the formatting case, and results in identical code as now for the "simple" (no formatting) case. See [assembly comparison](https://github.com/nyurik/rust-optimize-format-str/compare/with-my-change..with-as-const-str#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477) of what is now with what this change brings (focus only on `fmt/intel-lib.txt` and `str/intel-lib.txt` files). ```rust if let Some(s) = args.as_const_str() { self.write_str(s) } else { write(self, args) } ```
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (61bc12c): comparison URL. Overall result: ✅ improvements - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 647.585s -> 647.242s (-0.05%) |
@nnethercote are you happier with this than you were in #121001 (comment)? It looks to me like binary size is swinging both good and bad, both here and in the previous PR. |
I don't have strong opinions, but these new results seem good. |
@bors r+ |
☀️ Test successful - checks-actions |
Finished benchmarking commit (9fb91aa): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDNext Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 647.024s -> 646.506s (-0.08%) |
@rustbot label: +perf-regression-triaged |
Relevant upstream changes: rust-lang/rust#120675: An intrinsic `Symbol` is now wrapped in a `IntrinsicDef` struct, so the relevant part of the code needed to be updated. rust-lang/rust#121464: The second argument of the `create_wrapper_file` function changed from a vector to a string. rust-lang/rust#121662: `NullOp::DebugAssertions` was renamed to `NullOp::UbCheck` and it now has data (currently unused by Kani) rust-lang/rust#121728: Introduces `F16` and `F128`, so needed to add stubs for them rust-lang/rust#121969: `parse_sess` was renamed to `psess`, so updated the relevant code. rust-lang/rust#122059: The `is_val_statically_known` intrinsic is now used in some `core::fmt` code, so had to handle it in (codegen'ed to false). rust-lang/rust#122233: This added a new `retag_box_to_raw` intrinsic. This is an operation that is primarily relevant for stacked borrows. For Kani, we just return the pointer. Resolves #3057
/// Same as [`Arguments::as_str`], but will only return `Some(s)` if it can be determined at compile time. | ||
#[must_use] | ||
#[inline] | ||
fn as_const_str(&self) -> Option<&'static str> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"const_str" is a very confusing name since const
has a meaning in Rust and it's not this. We have very deliberately called the intrinsic is_val_statically_known
, not is_val_const
or anything like that. Would be nice if the functions using that intrinsic would follow the same pattern. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... as_statically_known_str
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be better IMO. (I'm also open to suggestions for a shorter term that captures this, but "constant" is too misleading IMO. In Rust, whether something is a "constant" is a language guarantee, and we have things like const fn
that relate to that. In contrast, "statically known" is a very unstable notion depending on tons of factors that impact how good the compiler is able to analyze some piece of code. Also, const fn
is not the same meaning of "const" as in "constant propagation" -- a common misconception that the standard library should not perpetuate. :)
Following up on #121001
Apparently this code generates significant code block for each call to
write()
with non-simple formatting string - approx 100 lines of assembly code, possibly due todyn
(?). See generated assembly code here:Details
This is the inlining of
write!(buffer, "Iteration {value} was written")
So, this brings back the older experiment - where I used
if core::intrinsics::is_val_statically_known(s.is_some()) { s } else { None }
helper function, and called it in multiple places that usedwrite
. This is not as optimal because now every user ofwrite
must do this logic, but at least it results in significantly smaller assembly code for the formatting case, and results in identical code as now for the "simple" (no formatting) case. See assembly comparison of what is now with what this change brings (focus only onfmt/intel-lib.txt
andstr/intel-lib.txt
files).