-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hoisting the invariant out of multi-level nested loops #61420
Comments
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsIf there are multi-level nested loops and there is an invariant calculation inside the inner most loop, we only hoist it to one level outer loop instead of hoisting it to multiple level outer loop (as appropriate). Below is an example of real world code: Here is a simple repro where all the calculation of I tried to hand-code this in SixLabors/ImageSharp#1818 and see some good improvements. Ideally this should be done by JIT.
|
@dotnet/jit-contrib , @BruceForstall |
This is exactly why I see a small regression in #61408, e.g.: void foo(int i)
{
for (int j = 0; j < 100; j++)
{
for (int k = 0; k < 100; k++)
{
Console.WriteLine(Vector128.Create(42, 42));
}
}
} Current codegen in Main: ; Method Program:foo(int):this
G_M10044_IG01:
push rdi
push rsi
sub rsp, 56
vzeroupper
vmovaps qword ptr [rsp+20H], xmm6
;; bbWeight=1 PerfScore 5.25
G_M10044_IG02:
xor esi, esi
;; bbWeight=1 PerfScore 0.25
G_M10044_IG03:
xor edi, edi
vmovupd xmm6, xmmword ptr [reloc @RWD00]
;; bbWeight=4 PerfScore 13.00
G_M10044_IG04:
mov rcx, 0xD1FFAB1E ; System.Runtime.Intrinsics.Vector128`1[Int64]
call CORINFO_HELP_NEWSFAST
vmovupd xmmword ptr [rax+8], xmm6
mov rcx, rax
call System.Console:WriteLine(System.Object)
inc edi
cmp edi, 100
jl SHORT G_M10044_IG04
;; bbWeight=16 PerfScore 96.00
G_M10044_IG05:
inc esi
cmp esi, 100
jl SHORT G_M10044_IG03
;; bbWeight=4 PerfScore 6.00
G_M10044_IG06:
vmovaps xmm6, qword ptr [rsp+20H]
add rsp, 56
pop rsi
pop rdi
ret
;; bbWeight=1 PerfScore 6.25
RWD00 dq 000000000000002Ah, 000000000000002Ah
; Total bytes of code: 82 As you can see, |
We've got multiple issues open on this already -- linking them up for context |
@kunalspathak Wrote a simplified example from the imagesharp case: using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
public class C {
private const int IndexBits = 5;
private const int IndexAlphaBits = 5;
private const int IndexCount = (1 << IndexBits) + 1;
private const int IndexAlphaCount = (1 << IndexAlphaBits) + 1;
public int M() {
int sum = 0;
for (int r = 0; r < IndexCount; r++)
{
for (int g = 0; g < IndexCount; g++)
{
for (int b = 0; b < IndexCount; b++)
{
for (int a = 0; a < IndexAlphaCount; a++)
{
int ind1 = (r << ((IndexBits * 2) + IndexAlphaBits))
+ (r << (IndexBits + IndexAlphaBits + 1))
+ (g << (IndexBits + IndexAlphaBits))
+ (r << (IndexBits * 2))
+ (r << (IndexBits + 1))
+ (g << IndexBits)
+ ((r + g + b) << IndexAlphaBits)
+ r + g + b + a;
sum += ind1;
}
}
}
}
return sum;
}
} |
Another impact of this is when we refer a static variable inside a loop. This is bad in for Arm64 because we need 3 instructions to construct the address of static variable. private static int COUNT = 5;
[MethodImpl(MethodImplOptions.NoInlining)]
public static double issue3(int x, int y, int z)
{
double result = 0;
for (int i = 0; i < x; i++)
{
for (int j = 0; j < y; i++)
{
result += COUNT;
}
}
return result;
} G_M22420_IG03: ;; offset=0020H
7100003F cmp w1, #0
5400018D ble G_M22420_IG05
; Below construction of 0x7ffb568dda90 can be hoisted outside the outer most loop IG03.
D29B5214 movz x20, #0xda90
F2AAD1B4 movk x20, #0x568d LSL #16
F2CFFF74 movk x20, #0x7ffb LSL #32
AA1403E0 mov x0, x20
528001A1 mov w1, #13
94000000 bl CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
B9405E80 ldr w0, [x20,#92]
1E620010 scvtf d16, w0
;; bbWeight=4 PerfScore 48.00
G_M22420_IG04: ;; offset=0048H
1E702908 fadd d8, d8, d16
11000673 add w19, w19, #1
17FFFFFE b G_M22420_IG04
;; bbWeight=16 PerfScore 72.00
G_M22420_IG05: ;; offset=0054H
11000673 add w19, w19, #1
6B00027F cmp w19, w0
54FFFE2B blt G_M22420_IG03 |
@BruceForstall - do you think you will have time to look into this in .NET 7? |
It feels unlikely. |
I was trying to understand why hoisting of static variable address doesn't get hoisted at 2-levels with the following example: private static int SOME_TEMP= 4;
int Test() {
int result = 0;
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
result += SOME_TEMP;
}
}
} The problem is because we hoist the code from outer to inner loop and not vice-versa as pointed out here. runtime/src/coreclr/jit/optimizer.cpp Lines 6202 to 6208 in 1a296c0
I tried switching the ordering with the following change: @@ -6049,7 +6063,6 @@ void Compiler::optHoistLoopNest(unsigned lnum, LoopHoistContext* hoistCtxt)
m_loopsConsidered++;
#endif // LOOP_HOIST_STATS
- optHoistThisLoop(lnum, hoistCtxt);
VNSet* hoistedInCurLoop = hoistCtxt->ExtractHoistedInCurLoop();
@@ -6086,6 +6099,8 @@ void Compiler::optHoistLoopNest(unsigned lnum, LoopHoistContext* hoistCtxt)
}
}
}
+
+ optHoistThisLoop(lnum, hoistCtxt); However, when we visit the blocks to hoist, we only visit blocks that dominate exit blocks of the loop. Here is the original block ordering:
There are two loops -
But when we try to hoist code for loop |
I believe we should have a way to look at the edge weights and consider all the hot blocks that participate in the loop for hoisting. @AndyAyersMS ? |
@kunalspathak Since you're working on this, I'm assigning it to you. |
Here is the diff from #68061 . Full diffs: https://www.diffchecker.com/DKGAZKq0 |
The example mentioned in #61420 (comment) is addressed by #68061 |
If there are multi-level nested loops and there is an invariant calculation inside the inner most loop, we only hoist it to one level outer loop instead of hoisting it to multiple level outer loop (as appropriate). Below is an example of real world code:
https://github.com/SixLabors/ImageSharp/blob/255226b03c8350f88e641bdb58c9450b68729ef7/src/ImageSharp/Processing/Processors/Quantization/WuQuantizer%7BTPixel%7D.cs#L418-L444
Here is a simple repro where all the calculation of
ind1
is done inside thea-loop
instead of spreading it tob-loop
,g-loop
andr-loop
.I tried to hand-code this in SixLabors/ImageSharp#1818 and see some good improvements. Ideally this should be done by JIT.
The text was updated successfully, but these errors were encountered: