Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize write with as_const_str for shorter code #122059

Merged
merged 1 commit into from
Mar 8, 2024

Conversation

nyurik
Copy link
Contributor

@nyurik nyurik commented Mar 6, 2024

Following up on #121001

Apparently this code generates significant code block for each call to write() with non-simple formatting string - approx 100 lines of assembly code, possibly due to dyn (?). See generated assembly code here:

Details

This is the inlining of write!(buffer, "Iteration {value} was written")

core::fmt::Write::write_fmt:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 194
		fn write_fmt(&mut self, args: Arguments<'_>) -> Result {
	push r15
	push r14
	push r13
	push r12
	push rbx
	mov rdx, rsi
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 427
		match (self.pieces, self.args) {
	mov rcx, qword ptr [rsi + 8]
	mov rax, qword ptr [rsi + 24]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 428
		([], []) => Some(""),
	cmp rcx, 1
	je .LBB0_8
	test rcx, rcx
	jne .LBB0_9
	test rax, rax
	jne .LBB0_9
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	lea rsi, [rip + .L__unnamed_2]
	xor ebx, ebx
.LBB0_6:
	mov r14, qword ptr [r12]
	jmp .LBB0_7
.LBB0_8:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	test rax, rax
	je .LBB0_4
.LBB0_9:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 1108
		if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
	lea rsi, [rip + .L__unnamed_1]
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	jmp qword ptr [rip + core::fmt::write_internal@GOTPCREL]
.LBB0_4:
	mov rax, qword ptr [rdx]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	mov rsi, qword ptr [rax]
	mov rbx, qword ptr [rax + 8]
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 248
		if T::IS_ZST { usize::MAX } else { self.cap.0 }
	mov rax, qword ptr [rdi]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	mov r14, qword ptr [rdi + 16]
		// /home/nyurik/dev/rust/rust/library/core/src/num/mod.rs : 1281
		uint_impl! {
	sub rax, r14
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 392
		additional > self.capacity().wrapping_sub(len)
	cmp rax, rbx
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 309
		if self.needs_to_grow(len, additional) {
	jb .LBB0_5
.LBB0_7:
	mov rax, qword ptr [rdi + 8]
		// /home/nyurik/dev/rust/rust/library/core/src/ptr/mut_ptr.rs : 1046
		unsafe { intrinsics::offset(self, count) }
	add rax, r14
	mov r15, rdi
		// /home/nyurik/dev/rust/rust/library/core/src/intrinsics.rs : 2922
		copy_nonoverlapping(src, dst, count)
	mov rdi, rax
	mov rdx, rbx
	call qword ptr [rip + memcpy@GOTPCREL]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 2040
		self.len += count;
	add r14, rbx
	mov qword ptr [r15 + 16], r14
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 216
		}
	xor eax, eax
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	ret
.LBB0_5:
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	mov r15, rdi
	mov r13, rsi
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 310
		do_reserve_and_handle(self, len, additional);
	mov rsi, r14
	mov rdx, rbx
	call alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
	mov rsi, r13
	mov rdi, r15
	jmp .LBB0_6

#[inline]
pub fn write(output: &mut dyn Write, args: Arguments<'_>) -> Result {
    if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
}

So, this brings back the older experiment - where I used if core::intrinsics::is_val_statically_known(s.is_some()) { s } else { None } helper function, and called it in multiple places that used write. This is not as optimal because now every user of write must do this logic, but at least it results in significantly smaller assembly code for the formatting case, and results in identical code as now for the "simple" (no formatting) case. See assembly comparison of what is now with what this change brings (focus only on fmt/intel-lib.txt and str/intel-lib.txt files).

if let Some(s) = args.as_const_str() {
    self.write_str(s)
} else {
    write(self, args)
}

@rustbot
Copy link
Collaborator

rustbot commented Mar 6, 2024

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 6, 2024
@nyurik
Copy link
Contributor Author

nyurik commented Mar 6, 2024

@bors r? @cuviper

@clubby789
Copy link
Contributor

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 6, 2024
@bors
Copy link
Contributor

bors commented Mar 6, 2024

⌛ Trying commit 3d0d0ce with merge 61bc12c...

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 6, 2024
Optimize write with as_const_str for shorter code

Following up on rust-lang#121001

Apparently this code generates significant code block for each call to `write()` with non-simple formatting string - approx 100 lines of assembly code, possibly due to `dyn` (?).  See generated assembly code [here](https://github.com/nyurik/rust-optimize-format-str/compare/before-changes..with-my-change#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477):

<details><summary>Details</summary>
<p>

This is the inlining of `write!(buffer, "Iteration {value} was written")`

```asm
core::fmt::Write::write_fmt:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 194
		fn write_fmt(&mut self, args: Arguments<'_>) -> Result {
	push r15
	push r14
	push r13
	push r12
	push rbx
	mov rdx, rsi
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 427
		match (self.pieces, self.args) {
	mov rcx, qword ptr [rsi + 8]
	mov rax, qword ptr [rsi + 24]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 428
		([], []) => Some(""),
	cmp rcx, 1
	je .LBB0_8
	test rcx, rcx
	jne .LBB0_9
	test rax, rax
	jne .LBB0_9
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	lea rsi, [rip + .L__unnamed_2]
	xor ebx, ebx
.LBB0_6:
	mov r14, qword ptr [r12]
	jmp .LBB0_7
.LBB0_8:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	test rax, rax
	je .LBB0_4
.LBB0_9:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 1108
		if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
	lea rsi, [rip + .L__unnamed_1]
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	jmp qword ptr [rip + core::fmt::write_internal@GOTPCREL]
.LBB0_4:
	mov rax, qword ptr [rdx]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	mov rsi, qword ptr [rax]
	mov rbx, qword ptr [rax + 8]
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 248
		if T::IS_ZST { usize::MAX } else { self.cap.0 }
	mov rax, qword ptr [rdi]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	mov r14, qword ptr [rdi + 16]
		// /home/nyurik/dev/rust/rust/library/core/src/num/mod.rs : 1281
		uint_impl! {
	sub rax, r14
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 392
		additional > self.capacity().wrapping_sub(len)
	cmp rax, rbx
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 309
		if self.needs_to_grow(len, additional) {
	jb .LBB0_5
.LBB0_7:
	mov rax, qword ptr [rdi + 8]
		// /home/nyurik/dev/rust/rust/library/core/src/ptr/mut_ptr.rs : 1046
		unsafe { intrinsics::offset(self, count) }
	add rax, r14
	mov r15, rdi
		// /home/nyurik/dev/rust/rust/library/core/src/intrinsics.rs : 2922
		copy_nonoverlapping(src, dst, count)
	mov rdi, rax
	mov rdx, rbx
	call qword ptr [rip + memcpy@GOTPCREL]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 2040
		self.len += count;
	add r14, rbx
	mov qword ptr [r15 + 16], r14
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 216
		}
	xor eax, eax
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	ret
.LBB0_5:
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	mov r15, rdi
	mov r13, rsi
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 310
		do_reserve_and_handle(self, len, additional);
	mov rsi, r14
	mov rdx, rbx
	call alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
	mov rsi, r13
	mov rdi, r15
	jmp .LBB0_6
```

</p>
</details>

```rust
#[inline]
pub fn write(output: &mut dyn Write, args: Arguments<'_>) -> Result {
    if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
}
```

So, this brings back the older experiment - where I used `if core::intrinsics::is_val_statically_known(s.is_some()) { s } else { None }` helper function, and called it in multiple places that used `write`.  This is not as optimal because now every user of `write` must do this logic, but at least it results in significantly smaller assembly code for the formatting case, and results in identical code as now for the "simple" (no formatting) case. See [assembly comparison](https://github.com/nyurik/rust-optimize-format-str/compare/with-my-change..with-as-const-str#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477) of what is now with what this change brings (focus only on `fmt/intel-lib.txt` and `str/intel-lib.txt` files).

```rust
               if let Some(s) = args.as_const_str() {
                    self.write_str(s)
                } else {
                    write(self, args)
                }
```
@bors
Copy link
Contributor

bors commented Mar 6, 2024

☀️ Try build successful - checks-actions
Build commit: 61bc12c (61bc12ca0210786789b3e8f6265106ba9701acde)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (61bc12c): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-0.5% [-1.2%, -0.2%] 4
Improvements ✅
(secondary)
-0.4% [-1.9%, -0.2%] 9
All ❌✅ (primary) -0.5% [-1.2%, -0.2%] 4

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.8% [0.2%, 3.4%] 3
Regressions ❌
(secondary)
3.1% [2.9%, 3.2%] 2
Improvements ✅
(primary)
-3.9% [-4.8%, -2.3%] 3
Improvements ✅
(secondary)
-2.3% [-2.3%, -2.3%] 1
All ❌✅ (primary) -1.0% [-4.8%, 3.4%] 6

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-1.0% [-1.0%, -1.0%] 1
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -1.0% [-1.0%, -1.0%] 1

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.1% [0.0%, 1.2%] 33
Regressions ❌
(secondary)
0.6% [0.6%, 0.6%] 3
Improvements ✅
(primary)
-0.1% [-0.4%, -0.0%] 20
Improvements ✅
(secondary)
-0.9% [-1.3%, -0.1%] 36
All ❌✅ (primary) 0.0% [-0.4%, 1.2%] 53

Bootstrap: 647.585s -> 647.242s (-0.05%)
Artifact size: 175.07 MiB -> 175.02 MiB (-0.03%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 6, 2024
@cuviper
Copy link
Member

cuviper commented Mar 6, 2024

@nnethercote are you happier with this than you were in #121001 (comment)?

It looks to me like binary size is swinging both good and bad, both here and in the previous PR.

@nnethercote
Copy link
Contributor

I don't have strong opinions, but these new results seem good.

@cuviper
Copy link
Member

cuviper commented Mar 7, 2024

@bors r+

@bors
Copy link
Contributor

bors commented Mar 7, 2024

📌 Commit 3d0d0ce has been approved by cuviper

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 7, 2024
@bors
Copy link
Contributor

bors commented Mar 8, 2024

⌛ Testing commit 3d0d0ce with merge 9fb91aa...

@bors
Copy link
Contributor

bors commented Mar 8, 2024

☀️ Test successful - checks-actions
Approved by: cuviper
Pushing 9fb91aa to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 8, 2024
@bors bors merged commit 9fb91aa into rust-lang:master Mar 8, 2024
12 checks passed
@rustbot rustbot added this to the 1.78.0 milestone Mar 8, 2024
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (9fb91aa): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
1.1% [0.3%, 1.9%] 2
Improvements ✅
(primary)
-0.8% [-1.2%, -0.4%] 2
Improvements ✅
(secondary)
-0.3% [-0.3%, -0.3%] 1
All ❌✅ (primary) -0.8% [-1.2%, -0.4%] 2

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
3.5% [0.1%, 5.9%] 4
Regressions ❌
(secondary)
3.2% [3.2%, 3.2%] 1
Improvements ✅
(primary)
-4.3% [-4.3%, -4.3%] 1
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 2.0% [-4.3%, 5.9%] 5

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
2.2% [2.0%, 2.5%] 3
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) - - 0

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.1% [0.0%, 1.0%] 34
Regressions ❌
(secondary)
0.6% [0.6%, 0.6%] 3
Improvements ✅
(primary)
-0.1% [-0.4%, -0.1%] 20
Improvements ✅
(secondary)
-0.8% [-1.3%, -0.1%] 42
All ❌✅ (primary) 0.0% [-0.4%, 1.0%] 54

Bootstrap: 647.024s -> 646.506s (-0.08%)
Artifact size: 172.57 MiB -> 172.55 MiB (-0.01%)

@rustbot rustbot added the perf-regression Performance regression. label Mar 8, 2024
@nyurik nyurik deleted the with-as-const-str branch March 8, 2024 17:05
@pnkfelix
Copy link
Member

  • (secondary) regressions are to deep-vector debug-full (which may be spurious based on the graph) and wg-grammar debug-incr-unchanged
  • since @nnethercote was already involved in related efforts here (e.g. PR #121001) and this resulting refinement, I'm going to mark this as triaged.

@rustbot label: +perf-regression-triaged

@rustbot rustbot added the perf-regression-triaged The performance regression has been triaged. label Mar 12, 2024
celinval pushed a commit to model-checking/kani that referenced this pull request Mar 12, 2024
Relevant upstream changes:

rust-lang/rust#120675: An intrinsic `Symbol` is
now wrapped in a `IntrinsicDef` struct, so the relevant part of the code
needed to be updated.
rust-lang/rust#121464: The second argument of
the `create_wrapper_file` function changed from a vector to a string.
rust-lang/rust#121662: `NullOp::DebugAssertions`
was renamed to `NullOp::UbCheck` and it now has data (currently unused
by Kani)
rust-lang/rust#121728: Introduces `F16` and
`F128`, so needed to add stubs for them
rust-lang/rust#121969: `parse_sess` was renamed
to `psess`, so updated the relevant code.
rust-lang/rust#122059: The
`is_val_statically_known` intrinsic is now used in some `core::fmt`
code, so had to handle it in (codegen'ed to false).
rust-lang/rust#122233: This added a new
`retag_box_to_raw` intrinsic. This is an operation that is primarily
relevant for stacked borrows. For Kani, we just return the pointer.

Resolves #3057
/// Same as [`Arguments::as_str`], but will only return `Some(s)` if it can be determined at compile time.
#[must_use]
#[inline]
fn as_const_str(&self) -> Option<&'static str> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"const_str" is a very confusing name since const has a meaning in Rust and it's not this. We have very deliberately called the intrinsic is_val_statically_known, not is_val_const or anything like that. Would be nice if the functions using that intrinsic would follow the same pattern. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... as_statically_known_str?

Copy link
Member

@RalfJung RalfJung Mar 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be better IMO. (I'm also open to suggestions for a shorter term that captures this, but "constant" is too misleading IMO. In Rust, whether something is a "constant" is a language guarantee, and we have things like const fn that relate to that. In contrast, "statically known" is a very unstable notion depending on tons of factors that impact how good the compiler is able to analyze some piece of code. Also, const fn is not the same meaning of "const" as in "constant propagation" -- a common misconception that the standard library should not perpetuate. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants