Use `br` instead of `switch` in more cases. #103331

nnethercote · 2022-10-21T01:44:21Z

codegen_switchint_terminator already uses br instead of switch when there is one normal target plus the otherwise target. But there's another common case with two normal targets and an otherwise target that points to an empty unreachable BB. This comes up a lot when switching on the tags of enums that use niches.

The pattern looks like this:

bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  switch i64 %_6, label %bb3 [
    i64 0, label %bb4
    i64 1, label %bb2
  ]

bb3:                                              ; preds = %bb1
  unreachable

This commit adds code to convert the switch to a br:

bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  %6 = icmp eq i64 %_6, 0
  br i1 %6, label %bb4, label %bb2

bb3:                                              ; No predecessors!
  unreachable

This has a surprisingly large effect on compile times, with reductions of 5% on debug builds of some crates. The reduction is all due to LLVM taking less time. Maybe LLVM is just much better at handling br than switch.

The resulting code is still suboptimal.

The icmp, select, icmp sequence is silly, converting an i1 to an i64 and back to an i1. But with the current code structure it's hard to avoid, and LLVM will easily clean it up, in opt builds at least.
bb3 is usually now truly dead code (though not always, so it can't be removed universally).

r? @scottmcm

nnethercote · 2022-10-21T01:44:33Z

@bors try @rust-timer queue

rust-timer · 2022-10-21T01:44:35Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2022-10-21T01:44:42Z

⌛ Trying commit 0aef066dc5da906131faeda7338cae344e7043b0 with merge 0f1697fa69ded7c5a968467067a9153a1d5e3668...

nnethercote · 2022-10-21T01:46:35Z

Unfortunately this change breaks the src/test/codegen/try_question_mark_nop.rs test, because the functions in that test end up like this:

define i64 @result_nop_match_32(i64 %0) unnamed_addr #0 {
start:
  %_2 = and i64 %0, 4294967295
  %1 = icmp ne i64 %_2, 0
  %. = zext i1 %1 to i64
  %.sroa.4.0.extract.shift = and i64 %0, -4294967296
  %.sroa.04.0.insert.insert = or i64 %.sroa.4.0.extract.shift, %.
  ret i64 %.sroa.04.0.insert.insert
}

instead of just being ret i64 0. Hence the changes from CHECK-NEXT to NOCHECK-NEXT in that file :(

scottmcm · 2022-10-21T01:54:01Z

The downside of this is that the otherwise was like an assume(%desc ULE 1), making it clearer that the discriminant can only be zero or one. So maybe this should only happen for opt-level=0?

cc @nikic, who did some LLVM work to take advantage of that pattern in #85133 (comment) to get those ? cases to be NOPs.

As for the icmp-select pattern, I wish there was trunc nuw i8 %discr to i1, so LLVM would know that any other value is UB even without the otherwise: unreachable.

nnethercote · 2022-10-21T02:51:38Z

@scottmcm: for the example in the PR description, I can't see how the switch-to-br transformation can change what LLVM knows.

However, the pattern in try_question_mark_nop.rs is a little different, involving non-niche tag operations. Old code:

  %3 = load i32, ptr %x, align 4, !range !2, !noundef !3
  %_2 = zext i32 %3 to i64
  switch i64 %_2, label %bb2 [
    i64 0, label %bb3
    i64 1, label %bb1
  ]

bb2:                                              ; preds = %start
  unreachable

New code:

  %3 = load i32, ptr %x, align 4, !range !2, !noundef !3
  %_2 = zext i32 %3 to i64
  %4 = icmp eq i64 %_2, 0
  br i1 %4, label %bb3, label %bb1

I can see for this one that the switch makes it clear that %3 must be i32 0 or i32 1, while the br does not. Except, the !range !2 metadata (where !2 = !{i32 0, i32 2}) should make that fact clear. Maybe LLVM doesn't pay attention to that metadata?

bors · 2022-10-21T04:04:58Z

☀️ Try build successful - checks-actions
Build commit: 0f1697fa69ded7c5a968467067a9153a1d5e3668 (0f1697fa69ded7c5a968467067a9153a1d5e3668)

bors · 2022-10-21T04:04:58Z

☀️ Try build successful - checks-actions
Build commit: 0f1697fa69ded7c5a968467067a9153a1d5e3668 (0f1697fa69ded7c5a968467067a9153a1d5e3668)

rust-timer · 2022-10-21T04:04:59Z

Queued 0f1697fa69ded7c5a968467067a9153a1d5e3668 with parent dcb3761, future comparison URL.

rust-timer · 2022-10-21T05:22:59Z

Finished benchmarking commit (0f1697fa69ded7c5a968467067a9153a1d5e3668): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean¹	range	count²
Regressions ❌ (primary)	0.5%	[0.5%, 0.5%]	1
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-1.9%	[-4.4%, -0.3%]	33
Improvements ✅ (secondary)	-2.1%	[-3.1%, -0.5%]	12
All ❌✅ (primary)	-1.9%	[-4.4%, 0.5%]	34

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean¹	range	count²
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-3.2%	[-7.3%, -2.0%]	7
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-3.2%	[-7.3%, -2.0%]	7

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean¹	range	count²
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-3.3%	[-4.9%, -2.2%]	17
Improvements ✅ (secondary)	-4.2%	[-4.2%, -4.2%]	1
All ❌✅ (primary)	-3.3%	[-4.9%, -2.2%]	17

the arithmetic mean of the percent change ↩ ↩² ↩³
number of relevant changes ↩ ↩² ↩³

nnethercote · 2022-10-21T06:02:24Z

The perf benefits are entirely in debug builds, so restricting this to opt-level=0 only will get us the compile time improvements without affecting codegen quality, yay.

nikic · 2022-10-21T09:31:04Z

The reduction is all due to LLVM taking less time. Maybe LLVM is just much better at handling br than switch.

This is because FastISel does not support switches, so those cases would always fall back to SelectionDAG isel. (This is only relevant for -C opt-level=0 builds.)

I can see for this one that the switch makes it clear that %3 must be i32 0 or i32 1, while the br does not. Except, the !range !2 metadata (where !2 = !{i32 0, i32 2}) should make that fact clear. Maybe LLVM doesn't pay attention to that metadata?

This range metadata generally gets lost during SROA. We could enable knowledge retention to preserve it, but I suspect that will preserve more than we bargained for, and would need some optimization for production use first.

Limiting this to -C opt-level=0 for now sounds fine.

nnethercote · 2022-10-21T10:49:46Z

Thanks for the explanation, that's very helpful! I will make this -C opt-level=0 only on Monday.

nnethercote · 2022-10-21T11:39:30Z

Today I was reading LLVM's Performance Tips for Frontend Authors, which is "a collection of tips on how to generate IR that optimizes well". I would be interested in a similar document containing tips on how to generate IR that can be compiled quickly.

This might be an example tip: "Unoptimized builds use FastISel, but FastISel does not support switches, therefore branches are much faster to compile than switches in unoptimized builds." (Assuming I've understood the description above correctly.)

If you can think of any other such tips, for unoptimized or optimized builds, I'd love to hear about them. Things like "avoid this code pattern", or "try to use this code pattern".

scottmcm · 2022-10-21T18:26:02Z

for the example in the PR description […]

Oh, right, I commented based on the codegen test, and hadn't looked at the example in the OP in detail.

Do you happen to know the MIR that produced the

  %_6 = select i1 %5, i64 0, i64 1
  switch i64 %_6, label %bb3 [
    i64 0, label %bb4
    i64 1, label %bb2
  ]

pattern? I wonder if we couldn't optimize that down simpler even before codegen. (Not in this PR, of course.)

I guess if it's a switch-on-discriminant then MIR can't know in general, since it tries not to know how the discriminant is actually encoded.

nnethercote · 2022-10-21T20:24:11Z

Yes, this code is from a switch-on-discriminant.

When the tag is in a niche the discriminant is extracted here, which gives the sub+icmp+select sequence in the example at the top of this PR.
The switch is then handled here.

I experimented with changing the type used for discriminant extraction away from i64 but many tests failed. Alternatively, it's conceivable that the latter function could be modified to special-case switch-on-discriminant, but I decided that would be too big a change. And judging from what @nikic said, the compile-time improvement here is all about avoiding the switch; avoiding the bool-to-i64-to-bool sequence wouldn't gain much.

nikic · 2022-10-21T20:29:14Z

@nnethercote For this particular class of problem, you can use -Cllvm-args=-fast-isel-report-on-fallback -C remark=sdagisel to print FastISel aborts. Looks like this for one of the functions from try_question_mark_nop.rs:

note: <source>:9:5: sdagisel: FastISel missed terminator:   switch i64 %6, label %7 [
          i64 0, label %8
          i64 1, label %12
        ], !dbg !13

note: <source>:9:11: sdagisel: FastISel missed terminator:   unreachable, !dbg !11

Presumably you avoid the first one here. The second one is something we should probably address on the LLVM side. FastISel does support unreachable, but because we use TrapUnreachable by default it tries to emit an ISD::TRAP -- and as far as I can tell, the necessary support function for that (fastEmit_) never gets generated by FastISelGen...

nnethercote · 2022-10-23T23:54:09Z

I have updated so that the switch-to-br change only happens in unoptimized builds. Let's do another perf run just to make sure things are working as expected.

@bors try @rust-timer queue

rust-timer · 2022-10-23T23:54:10Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2022-10-23T23:54:17Z

⌛ Trying commit ef98c61ecd25db2e31106fec7478faa1f0418584 with merge 97d79c13f38e165929b28d96b2444ac6110d7fc6...

nnethercote · 2022-10-24T04:20:05Z

New perf results look good, this is ready to go.

src/test/codegen/match.rs

scottmcm · 2022-10-24T05:32:35Z

Thanks, those results look amazing! I'm really glad to not lose the ? optimizations, too.

One request: Seems to me that there should be a test for the new 2-switch case?

Maybe add a match-optimized.rs with a test for the current behaviour, something like

// compile-flags: -O -C no-prepopulate-passes

#[repr(u16)]
pub enum E2 {
    A = 13,
    B = 42,
}

// For optimized code we keep an unreachable target so LLVM knows the possible values

// CHECK-LABEL: @exhaustive_match_2
#[no_mangle]
pub fn exhaustive_match_2(e: E2) -> u8 {
    // CHECK: switch i16 %{{.+}}, label %[[UNREACH:.+]] [
    // CHECK-NEXT: i16 13,
    // CHECK-NEXT: i16 42,
    // CHECK-NEXT: ]
    // CHECK: [[UNREACH]]:
    // CHECK-NEXT: unreachable
    match e {
        E2::A => 0,
        E2::B => 1,
    }
}

And a similar test in the existing non-optimized file, like

#[repr(u16)]
pub enum E2 {
    A = 13,
    B = 42,
}

// This is the debug codegen; compare the same code in `match-optimized.rs`

// CHECK-LABEL: @exhaustive_match_2
#[no_mangle]
pub fn exhaustive_match_2(e: E2) -> u8 {
    // CHECK: %[[CMP:.+]] = icmp eq i16 %{{.+}}, 13
    // CHECK-NEXT: br i1 %[[CMP:.+]],
    match e {
        E2::A => 0,
        E2::B => 1,
    }
}

r=me with tests, or if you feel very strongly that there shouldn't be a test for this.

nnethercote · 2022-10-24T21:18:16Z

For the br case I had to use -Copt-level=0. (I hadn't realized that the default rustc invocation applied optimizations, interesting.) So I ended up renaming match.rs as match-optimized.rs, added the new optimized (switch) case to it, and then created a new match-unoptimized.rs for the br case.

nnethercote · 2022-10-24T21:35:23Z

Tests added as requested.

@bors r=scottmcm

bors · 2022-10-24T21:35:25Z

📌 Commit 08d8944fdc4b0599ca2a3581737f78347258ca16 has been approved by scottmcm

It is now in the queue for this repository.

bors · 2022-10-30T20:26:00Z

⌛ Testing commit 08d8944fdc4b0599ca2a3581737f78347258ca16 with merge 67688d4ddd938775f544499a7ca671a19615462d...

bors · 2022-10-30T20:31:41Z

💔 Test failed - checks-actions

`codegen_switchint_terminator` already uses `br` instead of `switch` when there is one normal target plus the `otherwise` target. But there's another common case with two normal targets and an `otherwise` target that points to an empty unreachable BB. This comes up a lot when switching on the tags of enums that use niches. The pattern looks like this: ``` bb1: ; preds = %bb6 %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4 %4 = sub i8 %3, 2 %5 = icmp eq i8 %4, 0 %_6 = select i1 %5, i64 0, i64 1 switch i64 %_6, label %bb3 [ i64 0, label %bb4 i64 1, label %bb2 ] bb3: ; preds = %bb1 unreachable ``` This commit adds code to convert the `switch` to a `br`: ``` bb1: ; preds = %bb6 %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4 %4 = sub i8 %3, 2 %5 = icmp eq i8 %4, 0 %_6 = select i1 %5, i64 0, i64 1 %6 = icmp eq i64 %_6, 0 br i1 %6, label %bb4, label %bb2 bb3: ; No predecessors! unreachable ``` This has a surprisingly large effect on compile times, with reductions of 5% on debug builds of some crates. The reduction is all due to LLVM taking less time. Maybe LLVM is just much better at handling `br` than `switch`. The resulting code is still suboptimal. - The `icmp`, `select`, `icmp` sequence is silly, converting an `i1` to an `i64` and back to an `i1`. But with the current code structure it's hard to avoid, and LLVM will easily clean it up, in opt builds at least. - `bb3` is usually now truly dead code (though not always, so it can't be removed universally).

nnethercote · 2022-10-30T23:17:18Z

I rebased.

@bors r=scottmcm

bors · 2022-10-30T23:17:19Z

📌 Commit 003a3f8 has been approved by scottmcm

It is now in the queue for this repository.

bors · 2022-10-31T00:40:35Z

⌛ Testing commit 003a3f8 with merge d726c84...

bors · 2022-10-31T03:40:16Z

☀️ Test successful - checks-actions
Approved by: scottmcm
Pushing d726c84 to master...

rust-timer · 2022-10-31T04:59:34Z

Finished benchmarking commit (d726c84): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-2.0%	[-4.6%, -0.4%]	34
Improvements ✅ (secondary)	-2.2%	[-3.3%, -1.0%]	12
All ❌✅ (primary)	-2.0%	[-4.6%, -0.4%]	34

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-3.6%	[-5.1%, -2.0%]	5
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-3.6%	[-5.1%, -2.0%]	5

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-3.2%	[-5.4%, -1.4%]	23
Improvements ✅ (secondary)	-3.1%	[-4.0%, -1.8%]	9
All ❌✅ (primary)	-3.2%	[-5.4%, -1.4%]	23

rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 21, 2022

rust-highfive assigned scottmcm Oct 21, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 21, 2022

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 21, 2022

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 21, 2022

scottmcm added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 22, 2022

nnethercote force-pushed the convert-switch-to-br branch from 0aef066 to ef98c61 Compare October 23, 2022 23:53

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 23, 2022

nnethercote force-pushed the convert-switch-to-br branch from ef98c61 to 47d9ddb Compare October 24, 2022 04:19

scottmcm reviewed Oct 24, 2022

View reviewed changes

src/test/codegen/match.rs Outdated Show resolved Hide resolved

scottmcm removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 24, 2022

nnethercote force-pushed the convert-switch-to-br branch from 47d9ddb to 08d8944 Compare October 24, 2022 21:27

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 24, 2022

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Oct 30, 2022

This comment has been minimized.

Sign in to view

nnethercote force-pushed the convert-switch-to-br branch from 08d8944 to 003a3f8 Compare October 30, 2022 23:16

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 30, 2022

bors added the merged-by-bors This PR was explicitly merged by bors. label Oct 31, 2022

bors merged commit d726c84 into rust-lang:master Oct 31, 2022

rustbot added this to the 1.67.0 milestone Oct 31, 2022

nnethercote deleted the convert-switch-to-br branch October 31, 2022 03:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `br` instead of `switch` in more cases. #103331

Use `br` instead of `switch` in more cases. #103331

nnethercote commented Oct 21, 2022

nnethercote commented Oct 21, 2022

rust-timer commented Oct 21, 2022

bors commented Oct 21, 2022

nnethercote commented Oct 21, 2022

scottmcm commented Oct 21, 2022 •

edited

Loading

nnethercote commented Oct 21, 2022

bors commented Oct 21, 2022

bors commented Oct 21, 2022

rust-timer commented Oct 21, 2022

rust-timer commented Oct 21, 2022

nnethercote commented Oct 21, 2022

nikic commented Oct 21, 2022

nnethercote commented Oct 21, 2022

nnethercote commented Oct 21, 2022

scottmcm commented Oct 21, 2022

nnethercote commented Oct 21, 2022

nikic commented Oct 21, 2022

nnethercote commented Oct 23, 2022

rust-timer commented Oct 23, 2022

bors commented Oct 23, 2022

nnethercote commented Oct 24, 2022

scottmcm commented Oct 24, 2022

nnethercote commented Oct 24, 2022

nnethercote commented Oct 24, 2022

bors commented Oct 24, 2022

bors commented Oct 30, 2022

bors commented Oct 30, 2022

This comment has been minimized.

nnethercote commented Oct 30, 2022

bors commented Oct 30, 2022

bors commented Oct 31, 2022

bors commented Oct 31, 2022

rust-timer commented Oct 31, 2022

Use br instead of switch in more cases. #103331

Use br instead of switch in more cases. #103331

Conversation

nnethercote commented Oct 21, 2022

nnethercote commented Oct 21, 2022

rust-timer commented Oct 21, 2022

bors commented Oct 21, 2022

nnethercote commented Oct 21, 2022

scottmcm commented Oct 21, 2022 • edited Loading

nnethercote commented Oct 21, 2022

bors commented Oct 21, 2022

bors commented Oct 21, 2022

rust-timer commented Oct 21, 2022

rust-timer commented Oct 21, 2022

Overall result: ✅ improvements - no action needed

Instruction count

Max RSS (memory usage)

Cycles

Footnotes

nnethercote commented Oct 21, 2022

nikic commented Oct 21, 2022

nnethercote commented Oct 21, 2022

nnethercote commented Oct 21, 2022

scottmcm commented Oct 21, 2022

nnethercote commented Oct 21, 2022

nikic commented Oct 21, 2022

nnethercote commented Oct 23, 2022

rust-timer commented Oct 23, 2022

bors commented Oct 23, 2022

nnethercote commented Oct 24, 2022

scottmcm commented Oct 24, 2022

nnethercote commented Oct 24, 2022

nnethercote commented Oct 24, 2022

bors commented Oct 24, 2022

bors commented Oct 30, 2022

bors commented Oct 30, 2022

This comment has been minimized.

nnethercote commented Oct 30, 2022

bors commented Oct 30, 2022

bors commented Oct 31, 2022

bors commented Oct 31, 2022

rust-timer commented Oct 31, 2022

Overall result: ✅ improvements - no action needed

Instruction count

Max RSS (memory usage)

Cycles

Use `br` instead of `switch` in more cases. #103331

Use `br` instead of `switch` in more cases. #103331

scottmcm commented Oct 21, 2022 •

edited

Loading