Big performance problem with closed intervals looping #45222
Picking the low-hanging fruit for minimization:

```rust
#![feature(inclusive_range_syntax)]
#![allow(private_no_mangle_fns)]

#[inline(never)]
#[no_mangle]
fn triangle_exc(n: u64) -> u64 {
    let mut count = 0;
    for j in (0 .. n + 1) {
        count += j;
    }
    count
}

#[inline(never)]
#[no_mangle]
fn triangle_inc(n: u64) -> u64 {
    let mut count = 0;
    for j in 0 ..= n {
        count += j;
    }
    count
}

fn main() {
    let n: u64 = std::env::args().nth(1).unwrap().parse().unwrap();
    println!("{}", triangle_exc(n));
    println!("{}", triangle_inc(n));
}
```

Good:
Bad:
Lost some kind of fast lane? |
Apparently LLVM is way happier optimizing the open interval with SIMD. |
Referring to the example in #45222 (comment), the first piece of code is recognized by LLVM's loop-vectorize pass, while the second isn't.
After vectorization the If we black-box the
We may see if upgrading to LLVM 6 can help this case. Also note that the two pieces of code are not equivalent: the |
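One concrete way the two snippets differ (an illustration of my own; the comment above is truncated): `0 .. n + 1` evaluates `n + 1` eagerly, while `0 ..= n` is defined for every `n`.

```rust
// Illustration (my assumption about the elided point): with n == u64::MAX,
// `n + 1` panics in debug builds and wraps to 0 in release builds, making
// `0..n + 1` an empty range; `0..=n` still represents the full range.
fn main() {
    let n: u64 = u64::MAX;
    let wrapped_end = n.wrapping_add(1); // 0, so `0..wrapped_end` is empty
    assert_eq!(wrapped_end, 0);
    assert!((0..=n).contains(&n)); // the inclusive range is well-defined
}
```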
I've checked again using the LLVM 6 build. The performance is improved, but there is still a large gap (2.6× → 2.0×). Vectorization is still not recognized for …

Timings:

```console
$ rustc +nightly -C codegen-units=1 -C opt-level=3 -C target-cpu=native 1.rs
$ time ./1 30000 2
13500450000000

real    0m5.389s
user    0m5.136s
sys     0m0.012s

$ time ./1 30000 1
13500450000000

real    0m2.195s
user    0m2.192s
sys     0m0.000s

$ rustc +38bd38147d2fa21f8a684b019fc0763adf8fd436 -C codegen-units=1 -C opt-level=3 -C target-cpu=native 1.rs
$ time ./1 30000 2
13500450000000

real    0m4.445s
user    0m4.332s
sys     0m0.000s

$ time ./1 30000 1
13500450000000

real    0m2.330s
user    0m2.184s
sys     0m0.016s
```
|
This might just be the classic external-iteration-is-slower-sometimes problem. Note that even … |
Fix for internal iteration is up at #48012 |
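For comparison, here is what internal iteration over the same range looks like (a sketch of my own, reusing the `triangle_inc` example from above). This is the shape that #48012's `try_fold` override targets: the exhaustion check stays inside the iterator instead of being re-tested by the caller on every `next`.

```rust
// Internal iteration: the loop body is handed to the iterator via `fold`,
// so the start == end exit condition is handled in one place.
fn triangle_inc_internal(n: u64) -> u64 {
    (0..=n).fold(0u64, |acc, j| acc + j)
    // equivalently: (0..=n).sum::<u64>()
}
```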
I believe this can be fixed by adding an extra field to …:

```rust
struct FixedRangeInclusive {
    start: u64,
    end: u64,
    done: bool,
}

fn fixed_range_inclusive(start: u64, end: u64) -> FixedRangeInclusive {
    FixedRangeInclusive {
        start,
        end,
        done: false,
    }
}

impl Iterator for FixedRangeInclusive {
    type Item = u64;
    fn next(&mut self) -> Option<Self::Item> {
        if !self.done {
            if self.start == self.end {
                self.done = true;
            }
            let new = self.start.wrapping_add(1);
            Some(std::mem::replace(&mut self.start, new))
        } else {
            None
        }
    }
}
```

Check out the assembly on the playground. |
The current two-field RangeInclusive follows from rust-lang/rfcs#1980. The |
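For context, a minimal sketch (my own, under the two-field design that RFC 1980 describes) of why `next` gets awkward without a separate exhaustion flag:

```rust
// Two fields only: after yielding `end` there is no spare state to record
// "done", so the iterator has to mutate the bounds into an empty range
// (or match on an Option<Ordering>, as the std implementation did at the
// time), and that extra branch is what LLVM struggles to optimize.
struct TwoFieldRangeInclusive {
    start: u64,
    end: u64,
}

impl Iterator for TwoFieldRangeInclusive {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        if self.start <= self.end {
            let v = self.start;
            if v == self.end {
                // Can't just do `start += 1`: that would overflow when
                // end == u64::MAX. Force the range empty instead.
                self.start = 1;
                self.end = 0;
            } else {
                self.start += 1;
            }
            Some(v)
        } else {
            None
        }
    }
}
```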
I'm aware that |
This would be jarring, in consideration of the fact that … It is unfortunate. Were this not the case, I could almost picture something like this:

```rust
pub struct RangeInclusive<T> {
    // NOTE: not pub
    start: T, // actually, these should probably be ManuallyDrop<T>
    end: T,   // or union MaybeUninit<T> { value: T, empty: () }
    done: bool,
}

impl<T> RangeInclusive<T> {
    // Expose an API that matches the functionality of the enum type
    #[inline] pub fn new(start: T, end: T) -> Self { ... }
    #[inline] pub fn new_done() -> Self { ... }
    #[inline] pub fn endpoints(&self) -> Option<(&T, &T)> { ... }
    #[inline] pub fn endpoints_mut(&mut self) -> Option<(&mut T, &mut T)> { ... }
    #[inline] pub fn into_endpoints(self) -> Option<(T, T)> { ... }
}
```

and ISTM (note: haven't tested) that this should optimize just as well as the three-field struct, since it IS the three-field struct (just with statically enforced usage patterns). But I suppose that, even then, it would seem questionable to have a standard library type that simulates an enum (rather than being one) solely for performance concerns. Edit: I misread somewhat and thought that the enum was the current proposal. |
@leonardo-m Can you share some non-simple examples where the current form is a problem? #48012 will make it so that the example in here is fine if written the easier way. For simple things like sums of iota, it's easy to get suboptimal codegen from all kinds of different iterators, like using … Edit: #48057 has also improved things in recent nightlies, though it's still not perfect. |
…, r=alexcrichton Override try_[r]fold for RangeInclusive

Because the last item needs special handling, it seems that LLVM has trouble canonicalizing the loops in external iteration. With the override, it becomes obvious that the start==end case exits the loop (as opposed to the one *after* that exiting the loop in external iteration). Demo adapted from rust-lang#45222:

```rust
#[no_mangle]
pub fn foo3r(n: u64) -> u64 {
    let mut count = 0;
    (0..n).for_each(|_| {
        (0 ..= n).rev().for_each(|j| {
            count += j;
        })
    });
    count
}
```

Current nightly ASM: 100 lines (https://play.rust-lang.org/?gist=f5674c702c6e2045c3aab5d03763e5f6&version=nightly&mode=release).

With this PR:

```asm
foo3r:
        test    rcx, rcx
        je      .LBB3_1
        lea     r8, [rcx - 1]
        lea     rdx, [rcx - 2]
        mov     rax, r8
        mul     rdx
        shld    rdx, rax, 63
        imul    r8, r8
        add     r8, rcx
        sub     r8, rdx
        imul    r8, rcx
        mov     rax, r8
        ret
.LBB3_1:
        xor     r8d, r8d
        mov     rax, r8
        ret
```
Simplify RangeInclusive::next[_back]

`match`ing on an `Option<Ordering>` seems to cause some confusion for LLVM; switching to just using comparison operators removes a few jumps from the simple `for` loops I was trying. cc #45222 #28237 (comment)

Example:

```rust
#[no_mangle]
pub fn coresum(x: std::ops::RangeInclusive<u64>) -> u64 {
    let mut sum = 0;
    for i in x {
        sum += i ^ (i - 1);
    }
    sum
}
```

Today:

```asm
coresum:
        xor     r8d, r8d
        mov     r9, -1
        xor     eax, eax
        jmp     .LBB0_1
.LBB0_4:
        lea     rcx, [rdi - 1]
        xor     rcx, rdi
        add     rax, rcx
        mov     rsi, rdx
        mov     rdi, r10
.LBB0_1:
        cmp     rdi, rsi
        mov     ecx, 1
        cmovb   rcx, r9
        cmove   rcx, r8
        test    rcx, rcx
        mov     edx, 0
        mov     r10d, 1
        je      .LBB0_4    // 1
        cmp     rcx, -1
        jne     .LBB0_5    // 2
        lea     r10, [rdi + 1]
        mov     rdx, rsi
        jmp     .LBB0_4    // 3
.LBB0_5:
        ret
```

With this PR:

```asm
coresum:
        cmp     rcx, rdx
        jbe     .LBB0_2
        xor     eax, eax
        ret
.LBB0_2:
        xor     r8d, r8d
        mov     r9d, 1
        xor     eax, eax
        .p2align        4, 0x90
.LBB0_3:
        lea     r10, [rcx + 1]
        cmp     rcx, rdx
        cmovae  rdx, r8
        cmovae  r10, r9
        lea     r11, [rcx - 1]
        xor     r11, rcx
        add     rax, r11
        mov     rcx, r10
        cmp     r10, rdx
        jbe     .LBB0_3    // Just this
        ret
```

Though using internal iteration (`.map(|i| i ^ (i-1)).sum()`) is still shorter to type, and lets the compiler unroll it.
For the record: enabling Polly (#50044) doesn't fix the issue. |
@ollie27 Just a note, if you are going to use a …:

```rust
impl Iterator for FixedRangeInclusive {
    type Item = u64;
    fn next(&mut self) -> Option<Self::Item> {
        if self.start < self.end {
            let new = self.start.wrapping_add(1);
            Some(std::mem::replace(&mut self.start, new))
        } else if !self.done {
            self.done = true;
            Some(self.start)
        } else {
            None
        }
    }
}
```
|
To answer issue #56516, this is a first example of the performance problem; better examples could follow. Code example from:
If I replace the open intervals with closed ones:
For the second version I am seeing a run-time of about 6 seconds. |
Lots of discussion here: https://old.reddit.com/r/rust/comments/ab7hsi/comparing_pythagorean_triples_in_c_d_and_rust/ |
#58122 has significantly improved the situation! My benchmark, based on the @leonardo-m snippet:

```rust
#![feature(test)]
extern crate test;
use test::Bencher;

fn triples_exclusive() -> impl Iterator<Item = (u32, u32, u32)> {
    (1u32..).flat_map(|z| {
        (1..z + 1).flat_map(move |x| {
            (x..z + 1)
                .filter(move |&y| x.pow(2) + y.pow(2) == z.pow(2))
                .map(move |y| (x, y, z))
        })
    })
}

fn triples_inclusive() -> impl Iterator<Item = (u32, u32, u32)> {
    (1u32..).flat_map(|z| {
        (1..=z).flat_map(move |x| {
            (x..=z)
                .filter(move |&y| x.pow(2) + y.pow(2) == z.pow(2))
                .map(move |y| (x, y, z))
        })
    })
}

#[bench]
fn range_exclusive(b: &mut Bencher) {
    b.iter(|| triples_exclusive().take(1_000).map(|(x, y, z)| x + y + z).sum::<u32>());
}

#[bench]
fn range_inclusive(b: &mut Bencher) {
    b.iter(|| triples_inclusive().take(1_000).map(|(x, y, z)| x + y + z).sum::<u32>());
}
```

Here are the results on my laptop:

Run 1:
Run 2:
Run 3:
The inclusive range is still consistently slower than the exclusive variant. Some might say that's not very significant, but it may end up becoming a "clever" optimization trick like so many of them in the C++ world. Also, in my opinion, the existence of this performance difference goes against the "zero-cost" claims, so I think this issue should be kept open. /cc @matthieu-m |
The percentage performance difference I see in my code is considerably larger than the 1–2% difference shown above, so better benchmarks are necessary. |
I agree that the performance drop between exclusive and inclusive ranges is annoying and should ideally be eliminated if possible; however, I don't see it as a failure of zero-overhead abstraction. If you were coding the inclusive range with a for loop, you would also need to handle either the starting or the ending bound specially, leading to different codegen than for exclusive ranges. For me there are two issues:

The API is water under the bridge for stability reasons, so we need to concentrate on the codegen issues. I think the next step should be involving folks knowledgeable about LLVM, who could understand why the optimizer fails so hard and either point to a specific pattern to avoid or improve LLVM so it doesn't. I won't be following this up, as I have other projects, so I invite anybody interested to pick up the flag and carry on.

For these benchmarks specifically? In my benchmarks I was seeing less than 1% between exclusive and inclusive, which is consistent with what is reported here. Note that using external iteration (a … |
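To illustrate the point above about hand-written loops (a sketch of my own, not code from the thread): even without iterators, an overflow-safe inclusive loop needs an extra check on the end bound that an exclusive loop does not.

```rust
// Hand-rolled inclusive loop: the end bound must be tested *before* the
// increment, otherwise `i += 1` overflows when n == u64::MAX.
fn manual_triangle_inc(n: u64) -> u64 {
    let mut count = 0;
    let mut i = 0u64;
    loop {
        count += i;
        if i == n {
            break;
        }
        i += 1;
    }
    count
}

// The exclusive loop needs no such special case.
fn manual_triangle_exc(end: u64) -> u64 {
    let mut count = 0;
    let mut i = 0u64;
    while i < end {
        count += i;
        i += 1;
    }
    count
}
```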
I disagree, |
👋 It's been more than a year since the last comment on this issue. The original example has been well optimized since inclusive ranges were stabilized. The two functions now look like this: https://godbolt.org/z/oh9WbbacE

```asm
example::triangle_exc:
        cmp     rdi, -1
        je      .LBB0_1
        lea     rcx, [rdi - 1]
        mov     rax, rdi
        mul     rcx
        shld    rdx, rax, 63
        add     rdx, rdi
        mov     rax, rdx
        ret
.LBB0_1:
        xor     edx, edx
        mov     rax, rdx
        ret

example::triangle_inc:
        xor     eax, eax
        xor     ecx, ecx
.LBB1_1:
        mov     rdx, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        add     rax, rdx
        cmp     rdx, rdi
        jae     .LBB1_3
        cmp     rcx, rdi
        jbe     .LBB1_1
.LBB1_3:
        ret
```

Eric Niebler's Pythagorean triples benchmark also behaves differently than reported here. Run with the current nightly on a recent x86_64 CPU with the default release profile, there is a consistent (small) preference for inclusive ranges:
But the situation is muddied by adding any kind of optimization flag. Here's
And here's
Timings are similarly unstable with the latest stable, 1.56. The basic result holds up across architectures too; here's the default release profile on an aarch64 laptop:
Overall I think different benchmarks are needed here. If anyone has complaints about inclusive range performance, I would like to see a benchmark that demonstrates a consistent regression on a recent compiler. All these were run on the current nightly (which has NewPM enabled by default):
|
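For anyone picking this up, here is a minimal harness of the requested shape (my own sketch; the function names and workload are illustrative, not from the thread), comparing the two range forms on the same summation:

```rust
#![feature(test)]
extern crate test;
use test::Bencher;
use test::bench::black_box;

// Exclusive upper bound: `n + 1` is assumed not to overflow here.
fn sum_exclusive(n: u64) -> u64 {
    let mut acc = 0;
    for i in 0..n + 1 {
        acc += black_box(i);
    }
    acc
}

// Inclusive upper bound: the form under discussion in this issue.
fn sum_inclusive(n: u64) -> u64 {
    let mut acc = 0;
    for i in 0..=n {
        acc += black_box(i);
    }
    acc
}

#[bench]
fn bench_exclusive(b: &mut Bencher) {
    b.iter(|| sum_exclusive(black_box(100_000)));
}

#[bench]
fn bench_inclusive(b: &mut Bencher) {
    b.iter(|| sum_inclusive(black_box(100_000)));
}
```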
The following question on Stack Overflow shows a big performance hit using … From:

```rust
use num_integer::Roots;

fn main() {
    let v = 984067;
    for i in 1..=v {
        ramanujan(i)
    }
}

fn ramanujan(m: i32) {
    let maxcube = m.cbrt();
    let mut res1 = 0;
    let mut res2 = 0;
    let mut _res3 = 0;
    let mut _res4 = 0;
    for i in 1..=maxcube {
        for j in 1..=maxcube {
            if i * i * i + j * j * j == m {
                res1 = i;
                res2 = j;
                break;
            }
        }
    }
    for k in 1..=maxcube {
        for l in 1..=maxcube {
            if k == res1 || k == res2 || l == res1 || l == res2 {
                continue;
            }
            if k * k * k + l * l * l == m {
                _res3 = k;
                _res4 = l;
                break;
            }
        }
    }
}
```

To:

```rust
use num_integer::Roots;

fn main() {
    let v = 984067;
    for i in 1..=v {
        ramanujan(i)
    }
}

fn ramanujan(m: i32) {
    let maxcube = m.cbrt() + 1;
    let mut res1 = 0;
    let mut res2 = 0;
    let mut res3 = 0;
    let mut res4 = 0;
    for i in 1..maxcube {
        for j in 1..maxcube {
            if i * i * i + j * j * j == m {
                res1 = i;
                res2 = j;
                break;
            }
        }
    }
    for k in 1..maxcube {
        for l in 1..maxcube {
            if k == res1 || k == res2 || l == res1 || l == res2 {
                continue;
            }
            if k * k * k + l * l * l == m {
                res3 = k;
                res4 = l;
                break;
            }
        }
    }
}
```

Result:
Thus I don't know exactly whether this is related to this issue, but since this issue is more or less about "problems with RangeInclusive perf", tell me if I should open a separate issue for it instead. |
I minimized the above example into:

```rust
pub fn example(m: i32, max: i32) -> i32 {
    for i in 1..(max + 1) {
        if i * i == m {
            return i;
        }
    }
    0
}

pub fn example_inc(m: i32, max: i32) -> i32 {
    for i in 1..=max {
        if i * i == m {
            return i;
        }
    }
    0
}
```

Godbolt link: https://godbolt.org/z/nqTveYvzx

I see a 2x runtime difference between the two, using this program to execute them, commenting out the one I don't want:

```rust
#![feature(test)]
extern crate test;
use test::bench::black_box;
use num_integer::Roots;

fn main() {
    let v = 984067;
    for i in 1..=v {
        black_box(example(i, i.cbrt()));
    }
    for i in 1..=v {
        black_box(example_inc(i, i.cbrt()));
    }
}
```
|
```rust
use std::ops::ControlFlow;

pub fn example_try_fold(m: i32, max: i32) -> i32 {
    match (1..=max).try_fold(0, |acc, i: i32| {
        if i * i == m {
            ControlFlow::Break(i)
        } else {
            ControlFlow::Continue(acc)
        }
    }) {
        ControlFlow::Break(i) | ControlFlow::Continue(i) => i,
    }
}
```

produces more or less the same asm as your … |
LLVM is sometimes able to optimize well enough if you have a single inclusive loop. The real problem appears when you have two or three nested inclusive loops; in that case LLVM produces inefficient loops. So I've mostly banned inclusive ranges from my Rust code. |
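For reference, a minimal sketch (my own, not code from the thread) of the nested-inclusive-loop shape being described; this is the pattern reported to optimize poorly:

```rust
// Two nested inclusive loops over the same bound (illustrative only).
pub fn nested_inclusive(n: u32) -> u64 {
    let mut acc = 0u64;
    for i in 1..=n {
        for j in i..=n {
            acc += u64::from(i) * u64::from(j);
        }
    }
    acc
}
```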
I have experienced this first hand while writing a benchmark (see my Reddit post). May I suggest having an unsafe inclusive range for this kind of thing? I could use exclusive ranges, of course, but for the case where uint_MAX is reached I'd want the choice to manually disable these checks (as with most of Rust's safety abstractions). |
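One workaround sketch (my own, not from the thread) for the case where the upper bound may be the type's maximum: iterate an exclusive range and chain the final element explicitly, which avoids both the overflow of `n + 1` and RangeInclusive's extra exhaustion state.

```rust
// Sums 0..=n without using RangeInclusive and without computing n + 1.
fn sum_to_inclusive(n: u64) -> u64 {
    (0..n).chain(std::iter::once(n)).sum()
}

fn main() {
    assert_eq!(sum_to_inclusive(10), 55);
}
```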
I think #123741 can fix this once and for all (by making |
In my code I've essentially stopped using loops with `...` (intervals closed on the right, recently written with the syntax `..=`) because they cause performance problems. This is a simple example that shows the problem:
Compiled with the latest Nightly:
Compiled with:
rustc -O test.rs
Running it calling foo1 takes 0.02 seconds:
Running it calling foo2 takes about 13.65 seconds:
The asm I am seeing using "--emit asm"