Use `br` instead of `switch` in more cases. #103331
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 0aef066dc5da906131faeda7338cae344e7043b0 with merge 0f1697fa69ded7c5a968467067a9153a1d5e3668...
Unfortunately this change breaks the
instead of just being |
The downside of this is that the cc @nikic, who did some LLVM work to take advantage of that pattern in #85133 (comment) to get those As for the |
@scottmcm: for the example in the PR description, I can't see how the However, the pattern in
New code:
I can see for this one that the |
☀️ Try build successful - checks-actions
Queued 0f1697fa69ded7c5a968467067a9153a1d5e3668 with parent dcb3761, future comparison URL.
Finished benchmarking commit (0f1697fa69ded7c5a968467067a9153a1d5e3668): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
The perf benefits are entirely in debug builds, so restricting this to |
This is because FastISel does not support switches, so those cases would always fall back to SelectionDAG isel. (This is only relevant for
This range metadata generally gets lost during SROA. We could enable knowledge retention to preserve it, but I suspect that will preserve more than we bargained for, and would need some optimization for production use first. Limiting this to |
Thanks for the explanation, that's very helpful! I will make this |
Today I was reading LLVM's Performance Tips for Frontend Authors, which is "a collection of tips on how to generate IR that optimizes well". I would be interested in a similar document containing tips on how to generate IR that can be compiled quickly. This might be an example tip: "Unoptimized builds use FastISel, but FastISel does not support switches, therefore branches are much faster to compile than switches in unoptimized builds." (Assuming I've understood the description above correctly.) If you can think of any other such tips, for unoptimized or optimized builds, I'd love to hear about them. Things like "avoid this code pattern", or "try to use this code pattern".
Oh, right, I commented based on the codegen test, and hadn't looked at the example in the OP in detail. Do you happen to know the MIR that produced the
```
%_6 = select i1 %5, i64 0, i64 1
switch i64 %_6, label %bb3 [
  i64 0, label %bb4
  i64 1, label %bb2
]
```
pattern? I wonder if we couldn't optimize that down simpler even before codegen. (Not in this PR, of course.) I guess if it's a switch-on-discriminant then MIR can't know in general, since it tries not to know how the discriminant is actually encoded.
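For readers following along, here is a minimal sketch of what the `sub` / `icmp` / `select` sequence in that IR computes. It assumes a two-variant enum whose dataless variant is stored as the niche value 2 in a one-byte tag (a layout such as `Option<bool>` typically gets, though that layout is not guaranteed). The function name `discriminant_from_tag` is hypothetical and is not taken from rustc.

```rust
/// Hypothetical model of the discriminant extraction in the IR above.
/// Assumption: the tag byte holds 0 or 1 for the data-carrying variant
/// and the niche value 2 for the dataless variant.
fn discriminant_from_tag(tag: u8) -> u64 {
    // %4 = sub i8 %3, 2
    // %5 = icmp eq i8 %4, 0
    let is_niche = tag.wrapping_sub(2) == 0;
    // %_6 = select i1 %5, i64 0, i64 1
    if is_niche { 0 } else { 1 }
}

fn main() {
    // Walk the tag values permitted under the assumed layout.
    for tag in 0u8..3 {
        println!("tag {tag} -> discriminant {}", discriminant_from_tag(tag));
    }
}
```

Under this assumption the discriminant can only ever be 0 or 1, which is why the `otherwise` arm of the subsequent `switch` is unreachable.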
Yes, this code is from a switch-on-discriminant.
I experimented with changing the type used for discriminant extraction away from |
@nnethercote For this particular class of problem, you can use
Presumably you avoid the first one here. The second one is something we should probably address on the LLVM side. FastISel does support |
Force-pushed from 0aef066 to ef98c61.
I have updated so that the switch-to-br change only happens in unoptimized builds. Let's do another perf run just to make sure things are working as expected. @bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit ef98c61ecd25db2e31106fec7478faa1f0418584 with merge 97d79c13f38e165929b28d96b2444ac6110d7fc6...
Force-pushed from ef98c61 to 47d9ddb.
New perf results look good, this is ready to go.
Thanks, those results look amazing! I'm really glad to not lose the One request: Seems to me that there should be a test for the new 2-switch case? Maybe add a
And a similar test in the existing non-optimized file, like
r=me with tests, or if you feel very strongly that there shouldn't be a test for this. |
For the |
Force-pushed from 47d9ddb to 08d8944.
Tests added as requested. @bors r=scottmcm
📌 Commit 08d8944fdc4b0599ca2a3581737f78347258ca16 has been approved by It is now in the queue for this repository. |
⌛ Testing commit 08d8944fdc4b0599ca2a3581737f78347258ca16 with merge 67688d4ddd938775f544499a7ca671a19615462d...
💔 Test failed - checks-actions
`codegen_switchint_terminator` already uses `br` instead of `switch` when there is one normal target plus the `otherwise` target. But there's another common case with two normal targets and an `otherwise` target that points to an empty unreachable BB. This comes up a lot when switching on the tags of enums that use niches.

The pattern looks like this:
```
bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  switch i64 %_6, label %bb3 [
    i64 0, label %bb4
    i64 1, label %bb2
  ]

bb3:                                              ; preds = %bb1
  unreachable
```
This commit adds code to convert the `switch` to a `br`:
```
bb1:                                              ; preds = %bb6
  %3 = load i8, ptr %_2, align 1, !range !9, !noundef !4
  %4 = sub i8 %3, 2
  %5 = icmp eq i8 %4, 0
  %_6 = select i1 %5, i64 0, i64 1
  %6 = icmp eq i64 %_6, 0
  br i1 %6, label %bb4, label %bb2

bb3:                                              ; No predecessors!
  unreachable
```
This has a surprisingly large effect on compile times, with reductions of 5% on debug builds of some crates. The reduction is all due to LLVM taking less time. Maybe LLVM is just much better at handling `br` than `switch`.

The resulting code is still suboptimal.
- The `icmp`, `select`, `icmp` sequence is silly, converting an `i1` to an `i64` and back to an `i1`. But with the current code structure it's hard to avoid, and LLVM will easily clean it up, in opt builds at least.
- `bb3` is usually now truly dead code (though not always, so it can't be removed universally).
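The decision described in the commit message can be sketched in standalone Rust. This is an illustrative model only, not the rustc implementation: `Bb`, `Terminator`, `choose_terminator`, and its parameters are hypothetical names. The `optimized` flag reflects the restriction to unoptimized builds that was agreed on earlier in this thread.

```rust
/// Hypothetical stand-in for a basic block handle.
type Bb = u32;

/// Hypothetical model of the terminator that gets emitted.
#[derive(Debug, PartialEq)]
enum Terminator {
    /// Models `br i1 (discr == value), then_bb, else_bb`.
    CondBr { value: u128, then_bb: Bb, else_bb: Bb },
    /// Models a full `switch` with explicit cases and a default target.
    Switch { cases: Vec<(u128, Bb)>, otherwise: Bb },
}

fn choose_terminator(
    cases: &[(u128, Bb)],
    otherwise: Bb,
    otherwise_is_empty_unreachable: bool,
    optimized: bool,
) -> Terminator {
    match cases {
        // Pre-existing case: one normal target plus `otherwise`.
        [(value, target)] => Terminator::CondBr {
            value: *value,
            then_bb: *target,
            else_bb: otherwise,
        },
        // New case: two normal targets and an unreachable `otherwise`.
        // Compare against the first value and fall through to the second
        // target. Modelled here as applying to unoptimized builds only.
        [(v0, t0), (_, t1)] if otherwise_is_empty_unreachable && !optimized => {
            Terminator::CondBr { value: *v0, then_bb: *t0, else_bb: *t1 }
        }
        // Everything else keeps the `switch`.
        _ => Terminator::Switch { cases: cases.to_vec(), otherwise },
    }
}

fn main() {
    // Mirrors the IR above: 0 -> bb4, 1 -> bb2, otherwise -> bb3 (unreachable).
    let t = choose_terminator(&[(0, 4), (1, 2)], 3, true, false);
    println!("{t:?}");
}
```

With the values from the IR example, the model picks a conditional branch on `discr == 0` to `bb4`, with `bb2` as the else target, which matches the `icmp eq` / `br` pair in the converted code.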
Force-pushed from 08d8944 to 003a3f8.
I rebased. @bors r=scottmcm
☀️ Test successful - checks-actions
Finished benchmarking commit (d726c84): comparison URL.
Overall result: ✅ improvements - no action needed
@rustbot label: -perf-regression
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
r? @scottmcm