Performance regression of `array::IntoIter` vs `slice::Iter` #115339

DaniPopes · 2023-08-29T11:40:26Z

Code

I tried this code:

use core::iter;

pub fn array(mut a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    iter::zip(&mut a, b).for_each(|(a, b)| *a |= b);
    a
}

pub fn slice(mut a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    iter::zip(&mut a, &b).for_each(|(a, b)| *a |= *b);
    a
}

I expected to see this happen: same codegen or similar in performance

Instead, this happened: array has way worse codegen

godbolt

Assembly

example::array:
        vmovdqu xmm1, xmmword ptr [rdx]
        vmovdqu xmm0, xmmword ptr [rdx + 16]
        mov     rax, rdi
        vmovd   ecx, xmm1
        or      byte ptr [rsi], cl
        vpextrb ecx, xmm1, 1
        or      byte ptr [rsi + 1], cl
        vpextrb ecx, xmm1, 2
        or      byte ptr [rsi + 2], cl
        vpextrb ecx, xmm1, 3
        or      byte ptr [rsi + 3], cl
        vpextrb ecx, xmm1, 4
        or      byte ptr [rsi + 4], cl
        vpextrb ecx, xmm1, 5
        or      byte ptr [rsi + 5], cl
        vpextrb ecx, xmm1, 6
        or      byte ptr [rsi + 6], cl
        vpextrb ecx, xmm1, 7
        or      byte ptr [rsi + 7], cl
        vpextrb ecx, xmm1, 8
        or      byte ptr [rsi + 8], cl
        vpextrb ecx, xmm1, 9
        or      byte ptr [rsi + 9], cl
        vpextrb ecx, xmm1, 10
        or      byte ptr [rsi + 10], cl
        vpextrb ecx, xmm1, 11
        or      byte ptr [rsi + 11], cl
        vpextrb ecx, xmm1, 12
        or      byte ptr [rsi + 12], cl
        vpextrb ecx, xmm1, 13
        or      byte ptr [rsi + 13], cl
        vpextrb ecx, xmm1, 14
        or      byte ptr [rsi + 14], cl
        vpextrb ecx, xmm1, 15
        or      byte ptr [rsi + 15], cl
        vmovd   ecx, xmm0
        or      byte ptr [rsi + 16], cl
        vpextrb ecx, xmm0, 1
        or      byte ptr [rsi + 17], cl
        vpextrb ecx, xmm0, 2
        or      byte ptr [rsi + 18], cl
        vpextrb ecx, xmm0, 3
        or      byte ptr [rsi + 19], cl
        vpextrb ecx, xmm0, 4
        or      byte ptr [rsi + 20], cl
        vpextrb ecx, xmm0, 5
        or      byte ptr [rsi + 21], cl
        vpextrb ecx, xmm0, 6
        or      byte ptr [rsi + 22], cl
        vpextrb ecx, xmm0, 7
        or      byte ptr [rsi + 23], cl
        vpextrb ecx, xmm0, 8
        or      byte ptr [rsi + 24], cl
        vpextrb ecx, xmm0, 9
        or      byte ptr [rsi + 25], cl
        vpextrb ecx, xmm0, 10
        or      byte ptr [rsi + 26], cl
        vpextrb ecx, xmm0, 11
        or      byte ptr [rsi + 27], cl
        vpextrb ecx, xmm0, 12
        or      byte ptr [rsi + 28], cl
        vpextrb ecx, xmm0, 13
        or      byte ptr [rsi + 29], cl
        vpextrb ecx, xmm0, 14
        or      byte ptr [rsi + 30], cl
        vpextrb ecx, xmm0, 15
        or      byte ptr [rsi + 31], cl
        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymmword ptr [rdi], ymm0
        vzeroupper
        ret

example::slice:
        vmovups ymm0, ymmword ptr [rsi]
        mov     rax, rdi
        vorps   ymm0, ymm0, ymmword ptr [rdx]
        vmovups ymmword ptr [rsi], ymm0
        vmovups ymmword ptr [rdi], ymm0
        vzeroupper
        ret

Version it worked on

It most recently worked on: 1.64

Version with regression

1.65 till now.

Regressed in nightly-2022-08-13, maybe LLVM 15 #99464?:

fetching (via local git) commits from 20ffea6938b5839c390252e07940b99e3b6a889a to f22819bcce4abaff7d1246a56eec493418f9f4ee

found 8 bors merge commits in the specified range
  commit[0] 2022-08-11: Auto merge of #100416 - Dylan-DPC:rollup-m344lh1, r=Dylan-DPC
  commit[1] 2022-08-11: Auto merge of #100426 - matthiaskrgr:rollup-0ks4dou, r=matthiaskrgr
  commit[2] 2022-08-12: Auto merge of #100419 - flip1995:clippyup, r=Manishearth
  commit[3] 2022-08-12: Auto merge of #99464 - nikic:llvm-15, r=cuviper
  commit[4] 2022-08-12: Auto merge of #100435 - ehuss:update-cargo, r=ehuss
  commit[5] 2022-08-12: Auto merge of #99624 - vincenzopalazzo:macros/unix_error, r=Amanieu
  commit[6] 2022-08-12: Auto merge of #100328 - davidtwco:perf-implications, r=nnethercote
  commit[7] 2022-08-12: Auto merge of #100456 - Dylan-DPC:rollup-fn17z9f, r=Dylan-DPC

rustc --version --verbose:

rustc 1.74.0-nightly (8550f15e1 2023-08-27)
binary: rustc
commit-hash: 8550f15e148407159af401e02b1d9259762b3496
commit-date: 2023-08-27
host: x86_64-unknown-linux-gnu
release: 1.74.0-nightly
LLVM version: 17.0.0

Backtrace

N/A

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged

The text was updated successfully, but these errors were encountered:

apiraino · 2023-08-29T12:05:59Z

WG-prioritization assigning priority (Zulip discussion).

cc @nikic since this regressed in the LLVM 15 update

@rustbot label -I-prioritize +P-medium +T-compiler

the8472 · 2023-08-29T12:09:20Z

I've seen llvm doing these strange 1-byte-at-a-time "vectorizations" in other places too. This might be a more general problem.

In this case we can probably paper over it by implementing TrustedRandomAccess for the array iter.

scottmcm · 2023-08-30T07:38:24Z

Unfortunately, the array iterator is fundamentally worse right now, because indexing into it -- a fundamental part of being a by-value array iterator -- keeps it from SRoAing and thus doesn't optimize well.

This is why when I was making array::map better I ended up making https://github.com/rust-lang/rust/pull/107634/files#diff-d391813c2568a5afd0555e6224d5168993b6ada448c4f1397799cb98b5b8a18b to not have to use array::IntoIter.

Sadly, for the foreseeable future you're better off using the "weaker" slice iterator when you can.

DianQK · 2023-09-05T05:22:09Z

My guess is that the SROA is missing something in the opaque pointer mode.

@rustbot claim

DianQK · 2023-09-08T14:27:28Z

It's not about vectorization. This optimization can be restored by using -Cllvm-args=-opaque-pointers=0. But highly unrecommended. https://godbolt.org/z/GqdPGarsn

Upstream issue: llvm/llvm-project#65763.

DianQK · 2023-09-20T03:38:33Z

The upstream issue is closed.
@rustbot claim

optimize zipping over array iterators Fixes rust-lang#115339 (somewhat) the new assembly: ```asm zip_arrays: .cfi_startproc vmovups (%rdx), %ymm0 leaq 32(%rsi), %rcx vxorps %xmm1, %xmm1, %xmm1 vmovups %xmm1, -24(%rsp) movq $0, -8(%rsp) movq %rsi, -88(%rsp) movq %rdi, %rax movq %rcx, -80(%rsp) vmovups %ymm0, -72(%rsp) movq $0, -40(%rsp) movq $32, -32(%rsp) movq -24(%rsp), %rcx vmovups (%rsi,%rcx), %ymm0 vorps -72(%rsp,%rcx), %ymm0, %ymm0 vmovups %ymm0, (%rsi,%rcx) vmovups (%rsi), %ymm0 vmovups %ymm0, (%rdi) vzeroupper retq ``` This is still longer than the slice version given in the issue but at least it eliminates the terrible `vpextrb`/`orb` chain. I guess this is due to excessive memcpys again (haven't looked at the llvmir)? The `TrustedLen` specialization is a drive-by change since I had to do something for the default impl anyway to be able to specialize the `TrustedRandomAccessNoCoerce` impl.

DaniPopes added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Aug 29, 2023

rustbot added P-medium Medium priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Aug 29, 2023

nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Aug 29, 2023

saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Aug 29, 2023

the8472 mentioned this issue Sep 3, 2023

optimize zipping over array iterators #115515

Merged

rustbot assigned DianQK Sep 5, 2023

DianQK mentioned this issue Sep 8, 2023

LoopFullUnrollPass failed due to constant inference in opaque pointer mode. llvm/llvm-project#65763

Closed

DianQK removed their assignment Sep 11, 2023

rustbot assigned DianQK Sep 20, 2023

Noratrieb added the llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes label Sep 20, 2023

workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023

bors closed this as completed in 0d410be Oct 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression of `array::IntoIter` vs `slice::Iter` #115339

Performance regression of `array::IntoIter` vs `slice::Iter` #115339

DaniPopes commented Aug 29, 2023 •

edited by rustbot

Loading

apiraino commented Aug 29, 2023

the8472 commented Aug 29, 2023

scottmcm commented Aug 30, 2023

DianQK commented Sep 5, 2023 •

edited

Loading

DianQK commented Sep 8, 2023

DianQK commented Sep 20, 2023

Performance regression of array::IntoIter vs slice::Iter #115339

Performance regression of array::IntoIter vs slice::Iter #115339

Comments

DaniPopes commented Aug 29, 2023 • edited by rustbot Loading

Code

Version it worked on

Version with regression

Backtrace

apiraino commented Aug 29, 2023

the8472 commented Aug 29, 2023

scottmcm commented Aug 30, 2023

DianQK commented Sep 5, 2023 • edited Loading

DianQK commented Sep 8, 2023

DianQK commented Sep 20, 2023

Performance regression of `array::IntoIter` vs `slice::Iter` #115339

Performance regression of `array::IntoIter` vs `slice::Iter` #115339

DaniPopes commented Aug 29, 2023 •

edited by rustbot

Loading

DianQK commented Sep 5, 2023 •

edited

Loading