Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve write_fmt to handle simple strings #121001

Merged
merged 7 commits into from
Mar 5, 2024

Conversation

nyurik
Copy link
Contributor

@nyurik nyurik commented Feb 13, 2024

In case format string has no arguments, simplify its implementation with a direct call to output.write_str(value). This builds on @dtolnay original suggestion. This does not change any expectations because the original fn write() implementation calls write_str for parts of the format string.

write!(f, "text")  ->  f.write_str("text")
 /// [`write!`]: crate::write!
+#[inline]
 #[stable(feature = "rust1", since = "1.0.0")]
 pub fn write(output: &mut dyn Write, args: Arguments<'_>) -> Result {
+    if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
+}
+
+/// Actual implementation of the [`write`], but without the simple string optimization.
+fn write_internal(output: &mut dyn Write, args: Arguments<'_>) -> Result {
     let mut formatter = Formatter::new(output);
     let mut idx = 0;

CC: @m-ou-se as probably the biggest expert in everything format!

@rustbot
Copy link
Collaborator

rustbot commented Feb 13, 2024

r? @cuviper

rustbot has assigned @cuviper.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 13, 2024
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@Noratrieb
Copy link
Member

Noratrieb commented Feb 13, 2024

Can you run some benchmarks for our original motivating program to see whether this helps?
To ensure a good comparison target I'm gonna do a try build
@bors try
I think rustdoc still uses write in many places, so a perf run might reveal something as well.
@rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 13, 2024
@Noratrieb
Copy link
Member

@bors try

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 13, 2024
perf: improve write_fmt to handle simple strings

Per `@dtolnay` suggestion in serde-rs/serde#2697 (comment) - attempt to speed up performance in the cases of a simple string format without arguments:

```rust
write!(f, "text")  ->  f.write_str("text")
```

```diff
+ #[inline]
  pub fn write_fmt(&mut self, f: fmt::Arguments) -> fmt::Result {
+     if let Some(s) = f.as_str() {
+         self.buf.write_str(s)
+     } else {
          write(self.buf, f)
+     }
  }
```

Hopefully it will improve the simple case for the rust-lang#99012

CC: `@m-ou-se` as probably the biggest expert in everything `format!`
@bors
Copy link
Contributor

bors commented Feb 13, 2024

⌛ Trying commit 75808d1 with merge 09bfbad...

@nyurik
Copy link
Contributor Author

nyurik commented Feb 13, 2024

I tried this code, running it with the latest main branch and compiling using RUSTFLAGS="--emit asm" cargo +stage2 build --release . Note that in the second case it seems to have done the tail call elimination!

#[no_mangle]
#[inline(never)]
pub fn fmt_abcd(f: &mut Formatter<'_>) -> fmt::Result {
    write!(f, "ABCXYZ")
}

Before the change:

fmt_abcd:
	.cfi_startproc
	subq	$56, %rsp
	.cfi_def_cfa_offset 64
	leaq	.L__unnamed_2(%rip), %rax
	movq	%rax, 8(%rsp)
	movq	$1, 16(%rsp)
	leaq	.L__unnamed_3(%rip), %rax
	movq	%rax, 24(%rsp)
	xorps	%xmm0, %xmm0
	movups	%xmm0, 32(%rsp)
	leaq	8(%rsp), %rsi
	callq	*_ZN4core3fmt9Formatter9write_fmt17hf6272f62fa2a64dfE@GOTPCREL(%rip)
	addq	$56, %rsp
	.cfi_def_cfa_offset 8
	retq
.Lfunc_end5:

After the change:

fmt_abcd:
	.cfi_startproc
	movq	32(%rdi), %rax
	movq	40(%rdi), %rcx
	movq	24(%rcx), %rcx
	leaq	.L__unnamed_2(%rip), %rsi
	movl	$6, %edx
	movq	%rax, %rdi
	jmpq	*%rcx
.Lfunc_end5:

@Kobzol
Copy link
Contributor

Kobzol commented Feb 13, 2024

In the past, this didn't really help (#100700), but we didn't have runtime benchmarks then.

@bors
Copy link
Contributor

bors commented Feb 13, 2024

☀️ Try build successful - checks-actions
Build commit: 09bfbad (09bfbad0db2dd7aa4e4c6c7e5dbb8a410a72fa4f)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (09bfbad): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-2.5% [-3.3%, -1.6%] 2
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -2.5% [-3.3%, -1.6%] 2

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.1% [0.2%, 2.1%] 2
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-7.9% [-13.4%, -2.4%] 2
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -3.4% [-13.4%, 2.1%] 4

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-2.1% [-2.7%, -1.6%] 2
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -2.1% [-2.7%, -1.6%] 2

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.1% [0.0%, 0.2%] 40
Regressions ❌
(secondary)
0.7% [0.4%, 1.0%] 6
Improvements ✅
(primary)
-0.3% [-1.1%, -0.0%] 18
Improvements ✅
(secondary)
-0.0% [-0.0%, -0.0%] 4
All ❌✅ (primary) -0.0% [-1.1%, 0.2%] 58

Bootstrap: 662.665s -> 665.299s (0.40%)
Artifact size: 308.47 MiB -> 308.48 MiB (0.00%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 13, 2024
@Kobzol
Copy link
Contributor

Kobzol commented Feb 13, 2024

image

This looks a bit worrisome, but it could also be noise, we don't yet have history charts for runtime benchmarks.

@nyurik
Copy link
Contributor Author

nyurik commented Feb 13, 2024

@Kobzol thx, I tried to dig through the graphs - but there is so much data, I am not certain what to look at. The worst test I see is branch misses - it went up by 23773.11% (!!) for fmt-write-str. The rest of the results seem relatively consistent and benign - but again, I might be looking at the wrong thing.

What are the next steps for this? Are there any reliable way to figure out what's going on with a certain test? Note that micro benchmarks show nearly 80% perf gain when using write_str, and my assembly experiment above shows a significant drop in the instruction count - which looks identical to what a write_str would produce.

@Kobzol
Copy link
Contributor

Kobzol commented Feb 13, 2024

I would take the runtime benchmarks results with a grain of salt. That being said, a lot of branch misses makes sense. The call is resolved with dynamic dispatch, through a v-table, and this condition might be very badly predicted, because it will be true for some values and false for others, quite randomly.

@nyurik
Copy link
Contributor Author

nyurik commented Feb 13, 2024

my understanding was that this new if/else gets compiled away because the .as_str() is a const fn that gets inlined, plus the value itself is known at the compile time - so the compiler should fully remove one of the branches - ending up with either the original code (for complex formats), or the new shorter branch for the simple str ... or am i reasoning incorrectly?

@Kobzol
Copy link
Contributor

Kobzol commented Feb 13, 2024

Sorry, my point about the v-table was wrong, there's no dynamic dispatch, the function is implemented on Arguments directly, I didn't notice that. The fmt calls on Arguments::args are resolved via dynamic dispatch.

On the other hand, the as_str call might not be fully resolved in compile time, for that the compiler would have to inline everything, and I'm not sure if that happens currently. It would be interesting to check if that inlining happens in --release.

@cuviper
Copy link
Member

cuviper commented Feb 13, 2024

I wonder if we can use is_val_statically_known for this?

jonasbb added a commit to jonasbb/serde_with that referenced this pull request Feb 13, 2024
write! with a single string argument is not properly optimized and using
write_str generates better code:
serde-rs/serde#2697
rust-lang/rust#121001
jonasbb added a commit to jonasbb/serde_with that referenced this pull request Feb 13, 2024
write! with a single string argument is not properly optimized and using
write_str generates better code:
serde-rs/serde#2697
rust-lang/rust#121001
@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (dc99c35): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.5% [0.5%, 0.5%] 1
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.5% [0.5%, 0.5%] 1

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
5.4% [1.0%, 8.7%] 5
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-2.2% [-2.2%, -2.2%] 1
All ❌✅ (primary) 5.4% [1.0%, 8.7%] 5

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.1% [0.0%, 0.9%] 27
Regressions ❌
(secondary)
0.1% [0.0%, 0.6%] 17
Improvements ✅
(primary)
-0.1% [-0.2%, -0.0%] 7
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.1% [-0.2%, 0.9%] 34

Bootstrap: 640.589s -> 639.335s (-0.20%)
Artifact size: 308.63 MiB -> 308.68 MiB (0.02%)

@rustbot rustbot removed S-waiting-on-perf Status: Waiting on a perf run to be completed. perf-regression Performance regression. labels Feb 21, 2024
@nyurik
Copy link
Contributor Author

nyurik commented Feb 21, 2024

Are these good results or not? Or are we not even doing relevant perf testing for the simple write! implementation? What are the next steps?

@Kobzol
Copy link
Contributor

Kobzol commented Feb 21, 2024

I don't think that our current benchmarks in rustc-perf are a good fit for a change like this. Perhaps a micro-benchmark in stdlib would be better. That being said, even a microbenchmark won't tell us the effects on non-trivial programs. That's why it's challenging to do changes like this.

@nyurik
Copy link
Contributor Author

nyurik commented Feb 21, 2024

@Kobzol are the assembly output changes a good indicator? See my comment above #121001 (comment)

@Kobzol
Copy link
Contributor

Kobzol commented Feb 23, 2024

It's definitely a way of evaluating the potential perf. effects of this change :) I'm not a libs reviewer, so I'll leave the evaluation of this change to others.

@cuviper
Copy link
Member

cuviper commented Mar 4, 2024

@bors r+

@bors
Copy link
Contributor

bors commented Mar 4, 2024

📌 Commit c85a9df has been approved by cuviper

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 4, 2024
@bors
Copy link
Contributor

bors commented Mar 5, 2024

⌛ Testing commit c85a9df with merge 5a1e544...

@bors
Copy link
Contributor

bors commented Mar 5, 2024

☀️ Test successful - checks-actions
Approved by: cuviper
Pushing 5a1e544 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 5, 2024
@bors bors merged commit 5a1e544 into rust-lang:master Mar 5, 2024
12 checks passed
@rustbot rustbot added this to the 1.78.0 milestone Mar 5, 2024
@nyurik nyurik deleted the optimize-core-fmt branch March 5, 2024 08:19
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (5a1e544): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.7% [0.5%, 3.6%] 4
Regressions ❌
(secondary)
0.3% [0.3%, 0.3%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-1.3% [-1.3%, -1.3%] 1
All ❌✅ (primary) 1.7% [0.5%, 3.6%] 4

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.3% [2.3%, 2.3%] 1
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-5.4% [-6.5%, -4.2%] 2
Improvements ✅
(secondary)
-3.2% [-6.2%, -2.1%] 32
All ❌✅ (primary) -2.8% [-6.5%, 2.3%] 3

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.2% [1.1%, 3.7%] 3
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 2.2% [1.1%, 3.7%] 3

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.1% [0.0%, 1.8%] 34
Regressions ❌
(secondary)
0.2% [0.0%, 1.0%] 19
Improvements ✅
(primary)
-0.2% [-0.2%, -0.1%] 5
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.1% [-0.2%, 1.8%] 39

Bootstrap: 645.309s -> 645.372s (0.01%)
Artifact size: 174.97 MiB -> 175.01 MiB (0.02%)

@rustbot rustbot added the perf-regression Performance regression. label Mar 5, 2024
@Mark-Simulacrum
Copy link
Member

Regressions look real, but are perhaps expected -- LLVM can optimize more with this change, so it'll spend more time doing that. I'm going to mark as triaged.

@Mark-Simulacrum Mark-Simulacrum added the perf-regression-triaged The performance regression has been triaged. label Mar 5, 2024
@nnethercote
Copy link
Contributor

Are the binary size regressions expected?

@nyurik
Copy link
Contributor Author

nyurik commented Mar 6, 2024

@nnethercote thx for the ping, I had another go at in-depth assembly digging - it seems the inlined fn with dyn keyword generates considerable extra code for each call when using for formatting (vs non-formatting) case. So I created #122059 that goes back to the as_const_str approach - not as clean code-wise, but generates better results (and faster when used with formatted string)

@nnethercote
Copy link
Contributor

@nyurik: thanks for looking into it!

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 6, 2024
Optimize write with as_const_str for shorter code

Following up on rust-lang#121001

Apparently this code generates significant code block for each call to `write()` with non-simple formatting string - approx 100 lines of assembly code, possibly due to `dyn` (?).  See generated assembly code [here](https://github.com/nyurik/rust-optimize-format-str/compare/before-changes..with-my-change#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477):

<details><summary>Details</summary>
<p>

This is the inlining of `write!(buffer, "Iteration {value} was written")`

```asm
core::fmt::Write::write_fmt:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 194
		fn write_fmt(&mut self, args: Arguments<'_>) -> Result {
	push r15
	push r14
	push r13
	push r12
	push rbx
	mov rdx, rsi
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 427
		match (self.pieces, self.args) {
	mov rcx, qword ptr [rsi + 8]
	mov rax, qword ptr [rsi + 24]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 428
		([], []) => Some(""),
	cmp rcx, 1
	je .LBB0_8
	test rcx, rcx
	jne .LBB0_9
	test rax, rax
	jne .LBB0_9
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	lea rsi, [rip + .L__unnamed_2]
	xor ebx, ebx
.LBB0_6:
	mov r14, qword ptr [r12]
	jmp .LBB0_7
.LBB0_8:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	test rax, rax
	je .LBB0_4
.LBB0_9:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 1108
		if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
	lea rsi, [rip + .L__unnamed_1]
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	jmp qword ptr [rip + core::fmt::write_internal@GOTPCREL]
.LBB0_4:
	mov rax, qword ptr [rdx]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	mov rsi, qword ptr [rax]
	mov rbx, qword ptr [rax + 8]
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 248
		if T::IS_ZST { usize::MAX } else { self.cap.0 }
	mov rax, qword ptr [rdi]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	mov r14, qword ptr [rdi + 16]
		// /home/nyurik/dev/rust/rust/library/core/src/num/mod.rs : 1281
		uint_impl! {
	sub rax, r14
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 392
		additional > self.capacity().wrapping_sub(len)
	cmp rax, rbx
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 309
		if self.needs_to_grow(len, additional) {
	jb .LBB0_5
.LBB0_7:
	mov rax, qword ptr [rdi + 8]
		// /home/nyurik/dev/rust/rust/library/core/src/ptr/mut_ptr.rs : 1046
		unsafe { intrinsics::offset(self, count) }
	add rax, r14
	mov r15, rdi
		// /home/nyurik/dev/rust/rust/library/core/src/intrinsics.rs : 2922
		copy_nonoverlapping(src, dst, count)
	mov rdi, rax
	mov rdx, rbx
	call qword ptr [rip + memcpy@GOTPCREL]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 2040
		self.len += count;
	add r14, rbx
	mov qword ptr [r15 + 16], r14
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 216
		}
	xor eax, eax
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	ret
.LBB0_5:
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	mov r15, rdi
	mov r13, rsi
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 310
		do_reserve_and_handle(self, len, additional);
	mov rsi, r14
	mov rdx, rbx
	call alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
	mov rsi, r13
	mov rdi, r15
	jmp .LBB0_6
```

</p>
</details>

```rust
#[inline]
pub fn write(output: &mut dyn Write, args: Arguments<'_>) -> Result {
    if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
}
```

So, this brings back the older experiment - where I used `if core::intrinsics::is_val_statically_known(s.is_some()) { s } else { None }` helper function, and called it in multiple places that used `write`.  This is not as optimal because now every user of `write` must do this logic, but at least it results in significantly smaller assembly code for the formatting case, and results in identical code as now for the "simple" (no formatting) case. See [assembly comparison](https://github.com/nyurik/rust-optimize-format-str/compare/with-my-change..with-as-const-str#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477) of what is now with what this change brings (focus only on `fmt/intel-lib.txt` and `str/intel-lib.txt` files).

```rust
               if let Some(s) = args.as_const_str() {
                    self.write_str(s)
                } else {
                    write(self, args)
                }
```
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 8, 2024
Optimize write with as_const_str for shorter code

Following up on rust-lang#121001

Apparently this code generates significant code block for each call to `write()` with non-simple formatting string - approx 100 lines of assembly code, possibly due to `dyn` (?).  See generated assembly code [here](https://github.com/nyurik/rust-optimize-format-str/compare/before-changes..with-my-change#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477):

<details><summary>Details</summary>
<p>

This is the inlining of `write!(buffer, "Iteration {value} was written")`

```asm
core::fmt::Write::write_fmt:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 194
		fn write_fmt(&mut self, args: Arguments<'_>) -> Result {
	push r15
	push r14
	push r13
	push r12
	push rbx
	mov rdx, rsi
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 427
		match (self.pieces, self.args) {
	mov rcx, qword ptr [rsi + 8]
	mov rax, qword ptr [rsi + 24]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 428
		([], []) => Some(""),
	cmp rcx, 1
	je .LBB0_8
	test rcx, rcx
	jne .LBB0_9
	test rax, rax
	jne .LBB0_9
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	lea rsi, [rip + .L__unnamed_2]
	xor ebx, ebx
.LBB0_6:
	mov r14, qword ptr [r12]
	jmp .LBB0_7
.LBB0_8:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	test rax, rax
	je .LBB0_4
.LBB0_9:
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 1108
		if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
	lea rsi, [rip + .L__unnamed_1]
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	jmp qword ptr [rip + core::fmt::write_internal@GOTPCREL]
.LBB0_4:
	mov rax, qword ptr [rdx]
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
		([s], []) => Some(s),
	mov rsi, qword ptr [rax]
	mov rbx, qword ptr [rax + 8]
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 248
		if T::IS_ZST { usize::MAX } else { self.cap.0 }
	mov rax, qword ptr [rdi]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	mov r14, qword ptr [rdi + 16]
		// /home/nyurik/dev/rust/rust/library/core/src/num/mod.rs : 1281
		uint_impl! {
	sub rax, r14
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 392
		additional > self.capacity().wrapping_sub(len)
	cmp rax, rbx
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 309
		if self.needs_to_grow(len, additional) {
	jb .LBB0_5
.LBB0_7:
	mov rax, qword ptr [rdi + 8]
		// /home/nyurik/dev/rust/rust/library/core/src/ptr/mut_ptr.rs : 1046
		unsafe { intrinsics::offset(self, count) }
	add rax, r14
	mov r15, rdi
		// /home/nyurik/dev/rust/rust/library/core/src/intrinsics.rs : 2922
		copy_nonoverlapping(src, dst, count)
	mov rdi, rax
	mov rdx, rbx
	call qword ptr [rip + memcpy@GOTPCREL]
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 2040
		self.len += count;
	add r14, rbx
	mov qword ptr [r15 + 16], r14
		// /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 216
		}
	xor eax, eax
	pop rbx
	pop r12
	pop r13
	pop r14
	pop r15
	ret
.LBB0_5:
		// /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
		self.buf.reserve(self.len, additional);
	lea r12, [rdi + 16]
	mov r15, rdi
	mov r13, rsi
		// /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 310
		do_reserve_and_handle(self, len, additional);
	mov rsi, r14
	mov rdx, rbx
	call alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
	mov rsi, r13
	mov rdi, r15
	jmp .LBB0_6
```

</p>
</details>

```rust
#[inline]
pub fn write(output: &mut dyn Write, args: Arguments<'_>) -> Result {
    if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
}
```

So, this brings back the older experiment - where I used `if core::intrinsics::is_val_statically_known(s.is_some()) { s } else { None }` helper function, and called it in multiple places that used `write`.  This is not as optimal because now every user of `write` must do this logic, but at least it results in significantly smaller assembly code for the formatting case, and results in identical code as now for the "simple" (no formatting) case. See [assembly comparison](https://github.com/nyurik/rust-optimize-format-str/compare/with-my-change..with-as-const-str#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477) of what is now with what this change brings (focus only on `fmt/intel-lib.txt` and `str/intel-lib.txt` files).

```rust
               if let Some(s) = args.as_const_str() {
                    self.write_str(s)
                } else {
                    write(self, args)
                }
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.