Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use br instead of switch in more cases. #103331

Merged
merged 1 commit into from
Oct 31, 2022

Conversation

nnethercote
Copy link
Contributor

codegen_switchint_terminator already uses br instead of switch when there is one normal target plus the otherwise target. But there's another common case with two normal targets and an otherwise target that points to an empty unreachable BB. This comes up a lot when switching on the tags of enums that use niches.

The pattern looks like this:

bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  switch i64 %_6, label %bb3 [
    i64 0, label %bb4
    i64 1, label %bb2
  ]

bb3:                                              ; preds = %bb1
  unreachable

This commit adds code to convert the switch to a br:

bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  %6 = icmp eq i64 %_6, 0
  br i1 %6, label %bb4, label %bb2

bb3:                                              ; No predecessors!
  unreachable

This has a surprisingly large effect on compile times, with reductions of 5% on debug builds of some crates. The reduction is all due to LLVM taking less time. Maybe LLVM is just much better at handling br than switch.

The resulting code is still suboptimal.

  • The icmp, select, icmp sequence is silly, converting an i1 to an i64 and back to an i1. But with the current code structure it's hard to avoid, and LLVM will easily clean it up, in opt builds at least.
  • bb3 is usually now truly dead code (though not always, so it can't be removed universally).

r? @scottmcm

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 21, 2022
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 21, 2022
@nnethercote
Copy link
Contributor Author

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 21, 2022
@bors
Copy link
Contributor

bors commented Oct 21, 2022

⌛ Trying commit 0aef066dc5da906131faeda7338cae344e7043b0 with merge 0f1697fa69ded7c5a968467067a9153a1d5e3668...

@nnethercote
Copy link
Contributor Author

Unfortunately this change breaks the src/test/codegen/try_question_mark_nop.rs test, because the functions in that test end up like this:

define i64 @result_nop_match_32(i64 %0) unnamed_addr #0 {
start:
  %_2 = and i64 %0, 4294967295
  %1 = icmp ne i64 %_2, 0
  %. = zext i1 %1 to i64
  %.sroa.4.0.extract.shift = and i64 %0, -4294967296
  %.sroa.04.0.insert.insert = or i64 %.sroa.4.0.extract.shift, %.
  ret i64 %.sroa.04.0.insert.insert
} 

instead of just being ret i64 0. Hence the changes from CHECK-NEXT to NOCHECK-NEXT in that file :(

@scottmcm
Copy link
Member

scottmcm commented Oct 21, 2022

The downside of this is that the otherwise was like an assume(%desc ULE 1), making it clearer that the discriminant can only be zero or one. So maybe this should only happen for opt-level=0?

cc @nikic, who did some LLVM work to take advantage of that pattern in #85133 (comment) to get those ? cases to be NOPs.


As for the icmp-select pattern, I wish there was trunc nuw i8 %discr to i1, so LLVM would know that any other value is UB even without the otherwise: unreachable.

@nnethercote
Copy link
Contributor Author

@scottmcm: for the example in the PR description, I can't see how the switch-to-br transformation can change what LLVM knows.

However, the pattern in try_question_mark_nop.rs is a little different, involving non-niche tag operations. Old code:

  %3 = load i32, ptr %x, align 4, !range !2, !noundef !3
  %_2 = zext i32 %3 to i64
  switch i64 %_2, label %bb2 [
    i64 0, label %bb3
    i64 1, label %bb1
  ]

bb2:                                              ; preds = %start
  unreachable

New code:

  %3 = load i32, ptr %x, align 4, !range !2, !noundef !3
  %_2 = zext i32 %3 to i64
  %4 = icmp eq i64 %_2, 0
  br i1 %4, label %bb3, label %bb1

I can see for this one that the switch makes it clear that %3 must be i32 0 or i32 1, while the br does not. Except, the !range !2 metadata (where !2 = !{i32 0, i32 2}) should make that fact clear. Maybe LLVM doesn't pay attention to that metadata?

@bors
Copy link
Contributor

bors commented Oct 21, 2022

☀️ Try build successful - checks-actions
Build commit: 0f1697fa69ded7c5a968467067a9153a1d5e3668 (0f1697fa69ded7c5a968467067a9153a1d5e3668)

1 similar comment
@bors
Copy link
Contributor

bors commented Oct 21, 2022

☀️ Try build successful - checks-actions
Build commit: 0f1697fa69ded7c5a968467067a9153a1d5e3668 (0f1697fa69ded7c5a968467067a9153a1d5e3668)

@rust-timer
Copy link
Collaborator

Queued 0f1697fa69ded7c5a968467067a9153a1d5e3668 with parent dcb3761, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (0f1697fa69ded7c5a968467067a9153a1d5e3668): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean1 range count2
Regressions ❌
(primary)
0.5% [0.5%, 0.5%] 1
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-1.9% [-4.4%, -0.3%] 33
Improvements ✅
(secondary)
-2.1% [-3.1%, -0.5%] 12
All ❌✅ (primary) -1.9% [-4.4%, 0.5%] 34

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean1 range count2
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.2% [-7.3%, -2.0%] 7
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -3.2% [-7.3%, -2.0%] 7

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean1 range count2
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.3% [-4.9%, -2.2%] 17
Improvements ✅
(secondary)
-4.2% [-4.2%, -4.2%] 1
All ❌✅ (primary) -3.3% [-4.9%, -2.2%] 17

Footnotes

  1. the arithmetic mean of the percent change 2 3

  2. number of relevant changes 2 3

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 21, 2022
@nnethercote
Copy link
Contributor Author

The perf benefits are entirely in debug builds, so restricting this to opt-level=0 only will get us the compile time improvements without affecting codegen quality, yay.

@nikic
Copy link
Contributor

nikic commented Oct 21, 2022

The reduction is all due to LLVM taking less time. Maybe LLVM is just much better at handling br than switch.

This is because FastISel does not support switches, so those cases would always fall back to SelectionDAG isel. (This is only relevant for -C opt-level=0 builds.)

I can see for this one that the switch makes it clear that %3 must be i32 0 or i32 1, while the br does not. Except, the !range !2 metadata (where !2 = !{i32 0, i32 2}) should make that fact clear. Maybe LLVM doesn't pay attention to that metadata?

This range metadata generally gets lost during SROA. We could enable knowledge retention to preserve it, but I suspect that will preserve more than we bargained for, and would need some optimization for production use first.

Limiting this to -C opt-level=0 for now sounds fine.

@nnethercote
Copy link
Contributor Author

Thanks for the explanation, that's very helpful! I will make this -C opt-level=0 only on Monday.

@nnethercote
Copy link
Contributor Author

Today I was reading LLVM's Performance Tips for Frontend Authors, which is "a collection of tips on how to generate IR that optimizes well". I would be interested in a similar document containing tips on how to generate IR that can be compiled quickly.

This might be an example tip: "Unoptimized builds use FastISel, but FastISel does not support switches, therefore branches are much faster to compile than switches in unoptimized builds." (Assuming I've understood the description above correctly.)

If you can think of any other such tips, for unoptimized or optimized builds, I'd love to hear about them. Things like "avoid this code pattern", or "try to use this code pattern".

@scottmcm
Copy link
Member

for the example in the PR description […]

Oh, right, I commented based on the codegen test, and hadn't looked at the example in the OP in detail.

Do you happen to know the MIR that produced the

  %_6 = select i1 %5, i64 0, i64 1
  switch i64 %_6, label %bb3 [
    i64 0, label %bb4
    i64 1, label %bb2
  ]

pattern? I wonder if we couldn't optimize that down simpler even before codegen. (Not in this PR, of course.)

I guess if it's a switch-on-discriminant then MIR can't know in general, since it tries not to know how the discriminant is actually encoded.

@nnethercote
Copy link
Contributor Author

Yes, this code is from a switch-on-discriminant.

  • When the tag is in a niche the discriminant is extracted here, which gives the sub+icmp+select sequence in the example at the top of this PR.
  • The switch is then handled here.

I experimented with changing the type used for discriminant extraction away from i64 but many tests failed. Alternatively, it's conceivable that the latter function could be modified to special-case switch-on-discriminant, but I decided that would be too big a change. And judging from what @nikic said, the compile-time improvement here is all about avoiding the switch; avoiding the bool-to-i64-to-bool sequence wouldn't gain much.

@nikic
Copy link
Contributor

nikic commented Oct 21, 2022

@nnethercote For this particular class of problem, you can use -Cllvm-args=-fast-isel-report-on-fallback -C remark=sdagisel to print FastISel aborts. Looks like this for one of the functions from try_question_mark_nop.rs:

note: <source>:9:5: sdagisel: FastISel missed terminator:   switch i64 %6, label %7 [
          i64 0, label %8
          i64 1, label %12
        ], !dbg !13

note: <source>:9:11: sdagisel: FastISel missed terminator:   unreachable, !dbg !11

Presumably you avoid the first one here. The second one is something we should probably address on the LLVM side. FastISel does support unreachable, but because we use TrapUnreachable by default it tries to emit an ISD::TRAP -- and as far as I can tell, the necessary support function for that (fastEmit_) never gets generated by FastISelGen...

@scottmcm scottmcm added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 22, 2022
@nnethercote
Copy link
Contributor Author

I have updated so that the switch-to-br change only happens in unoptimized builds. Let's do another perf run just to make sure things are working as expected.

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 23, 2022
@bors
Copy link
Contributor

bors commented Oct 23, 2022

⌛ Trying commit ef98c61ecd25db2e31106fec7478faa1f0418584 with merge 97d79c13f38e165929b28d96b2444ac6110d7fc6...

@nnethercote
Copy link
Contributor Author

New perf results look good, this is ready to go.

src/test/codegen/match.rs Outdated Show resolved Hide resolved
@scottmcm
Copy link
Member

Thanks, those results look amazing! I'm really glad to not lose the ? optimizations, too.

One request: Seems to me that there should be a test for the new 2-switch case?

Maybe add a match-optimized.rs with a test for the current behaviour, something like

// compile-flags: -O -C no-prepopulate-passes

#[repr(u16)]
pub enum E2 {
    A = 13,
    B = 42,
}

// For optimized code we keep an unreachable target so LLVM knows the possible values

// CHECK-LABEL: @exhaustive_match_2
#[no_mangle]
pub fn exhaustive_match_2(e: E2) -> u8 {
    // CHECK: switch i16 %{{.+}}, label %[[UNREACH:.+]] [
    // CHECK-NEXT: i16 13,
    // CHECK-NEXT: i16 42,
    // CHECK-NEXT: ]
    // CHECK: [[UNREACH]]:
    // CHECK-NEXT: unreachable
    match e {
        E2::A => 0,
        E2::B => 1,
    }
}

And a similar test in the existing non-optimized file, like

#[repr(u16)]
pub enum E2 {
    A = 13,
    B = 42,
}

// This is the debug codegen; compare the same code in `match-optimized.rs`

// CHECK-LABEL: @exhaustive_match_2
#[no_mangle]
pub fn exhaustive_match_2(e: E2) -> u8 {
    // CHECK: %[[CMP:.+]] = icmp eq i16 %{{.+}}, 13
    // CHECK-NEXT: br i1 %[[CMP:.+]],
    match e {
        E2::A => 0,
        E2::B => 1,
    }
}

r=me with tests, or if you feel very strongly that there shouldn't be a test for this.

@scottmcm scottmcm removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 24, 2022
@nnethercote
Copy link
Contributor Author

For the br case I had to use -Copt-level=0. (I hadn't realized that the default rustc invocation applied optimizations, interesting.) So I ended up renaming match.rs as match-optimized.rs, added the new optimized (switch) case to it, and then created a new match-unoptimized.rs for the br case.

@nnethercote
Copy link
Contributor Author

Tests added as requested.

@bors r=scottmcm

@bors
Copy link
Contributor

bors commented Oct 24, 2022

📌 Commit 08d8944fdc4b0599ca2a3581737f78347258ca16 has been approved by scottmcm

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 24, 2022
@bors
Copy link
Contributor

bors commented Oct 30, 2022

⌛ Testing commit 08d8944fdc4b0599ca2a3581737f78347258ca16 with merge 67688d4ddd938775f544499a7ca671a19615462d...

@bors
Copy link
Contributor

bors commented Oct 30, 2022

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Oct 30, 2022
@rust-log-analyzer

This comment has been minimized.

`codegen_switchint_terminator` already uses `br` instead of `switch`
when there is one normal target plus the `otherwise` target. But there's
another common case with two normal targets and an `otherwise` target
that points to an empty unreachable BB. This comes up a lot when
switching on the tags of enums that use niches.

The pattern looks like this:
```
bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  switch i64 %_6, label %bb3 [
    i64 0, label %bb4
    i64 1, label %bb2
  ]

bb3:                                              ; preds = %bb1
  unreachable
```
This commit adds code to convert the `switch` to a `br`:
```
bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  %6 = icmp eq i64 %_6, 0
  br i1 %6, label %bb4, label %bb2

bb3:                                              ; No predecessors!
  unreachable
```
This has a surprisingly large effect on compile times, with reductions
of 5% on debug builds of some crates. The reduction is all due to LLVM
taking less time. Maybe LLVM is just much better at handling `br` than
`switch`.

The resulting code is still suboptimal.
- The `icmp`, `select`, `icmp` sequence is silly, converting an `i1` to an `i64`
  and back to an `i1`. But with the current code structure it's hard to avoid,
  and LLVM will easily clean it up, in opt builds at least.
- `bb3` is usually now truly dead code (though not always, so it can't
  be removed universally).
@nnethercote
Copy link
Contributor Author

I rebased.

@bors r=scottmcm

@bors
Copy link
Contributor

bors commented Oct 30, 2022

📌 Commit 003a3f8 has been approved by scottmcm

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 30, 2022
@bors
Copy link
Contributor

bors commented Oct 31, 2022

⌛ Testing commit 003a3f8 with merge d726c84...

@bors
Copy link
Contributor

bors commented Oct 31, 2022

☀️ Test successful - checks-actions
Approved by: scottmcm
Pushing d726c84 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Oct 31, 2022
@bors bors merged commit d726c84 into rust-lang:master Oct 31, 2022
@rustbot rustbot added this to the 1.67.0 milestone Oct 31, 2022
@nnethercote nnethercote deleted the convert-switch-to-br branch October 31, 2022 03:56
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (d726c84): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-2.0% [-4.6%, -0.4%] 34
Improvements ✅
(secondary)
-2.2% [-3.3%, -1.0%] 12
All ❌✅ (primary) -2.0% [-4.6%, -0.4%] 34

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.6% [-5.1%, -2.0%] 5
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -3.6% [-5.1%, -2.0%] 5

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.2% [-5.4%, -1.4%] 23
Improvements ✅
(secondary)
-3.1% [-4.0%, -1.8%] 9
All ❌✅ (primary) -3.2% [-5.4%, -1.4%] 23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants