Make `unary` and `binary` faster #6365
Conversation
Thank you @AdamGS -- I plan to review this carefully over the next few days

The numbers look a bit too good to be correct. If I try to calculate the throughput in terms of gigabytes/s for addition or multiplication, based on 400 ns for 2 × 64×1024×4 bytes, that comes out at around 600 GiB/s. The M3 Max is fast, but is it that fast using a single core?
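As a rough sanity check (my arithmetic, not figures from the PR itself), the per-input bandwidth implied by those benchmark numbers can be worked out like this:

```rust
fn main() {
    // 64 * 1024 elements of 4 bytes each per input array
    // (assumed from the benchmark sizes quoted above)
    let bytes_per_input = 64.0 * 1024.0 * 4.0; // 262_144 bytes
    let seconds = 400e-9; // ~400 ns per benchmark iteration
    let gib_per_s = bytes_per_input / seconds / (1024f64 * 1024.0 * 1024.0);
    // works out to roughly 610 GiB/s per input stream, consistent with
    // the "around 600 GiB/s" estimate above
    println!("{gib_per_s:.0} GiB/s");
}
```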
```rust
for (output, (a_chunk, b_chunk)) in output_chunks.zip(a_chunks.zip(b_chunks)) {
    let a_values: [A::Native; CHUNK_SIZE] = a_chunk.try_into().unwrap();
    let b_values: [B::Native; CHUNK_SIZE] = b_chunk.try_into().unwrap();
    let mut output: [MaybeUninit<O::Native>; CHUNK_SIZE] = output.try_into().unwrap();
```
Hm, shouldn't this be of type `&mut [MaybeUninit<O::Native>; CHUNK_SIZE]`? Otherwise this would be writing to a temporary and the compiler would probably optimize it away.
There was a reddit thread about the two different `try_into` implementations just a few days ago: https://old.reddit.com/r/rust/comments/1f9iyfz/question_if_i_have_a_vect_how_do_i_pass_a_part_of/lllxbo0/
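For context, the distinction that thread points at can be sketched with plain `i32` slices (simplified from the `MaybeUninit` case in the PR):

```rust
fn main() {
    let mut buf = [0i32; 4];

    // `impl TryFrom<&[T]> for [T; N]` (T: Copy) *copies* the slice into a
    // new array, so the write below only touches the temporary
    let mut copied: [i32; 4] = (&buf[..]).try_into().unwrap();
    copied[0] = 42;
    assert_eq!(buf[0], 0); // original buffer untouched

    // `impl TryFrom<&mut [T]> for &mut [T; N]` *borrows* the slice as a
    // fixed-size array, so writes go through to the original buffer
    let borrowed: &mut [i32; 4] = (&mut buf[..]).try_into().unwrap();
    borrowed[0] = 42;
    assert_eq!(buf[0], 42); // original buffer updated

    println!("ok");
}
```

The PR snippet above uses the first (copying) form for the output chunk, which is why the writes land in a temporary.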
Ok, that seems to be a key point here. Changing that (and `let output_chunks = output.as_mut_slice().chunks_exact_mut(CHUNK_SIZE);`) seems to make performance worse than the current implementation on `master`. The thing I don't understand here is: if we write into a temporary value, how do the tests still pass?
Have you tried running the tests in release mode? It may be that the optimiser doesn't "exploit" any UB when in debug mode.

Edit: we may also not have any tests of arrays with more than 1024 elements
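A test along those lines might look like the following sketch (a hypothetical `unary_chunked` helper, not the PR's actual code), where the input is deliberately longer than `CHUNK_SIZE` so the fixed-size fast path is actually exercised:

```rust
const CHUNK_SIZE: usize = 1024;

// Hypothetical chunked unary kernel: process full CHUNK_SIZE blocks as
// fixed-size arrays, then handle the tail with a scalar loop.
fn unary_chunked(input: &[f32], f: impl Fn(f32) -> f32) -> Vec<f32> {
    let mut out = vec![0.0f32; input.len()];
    let mut in_chunks = input.chunks_exact(CHUNK_SIZE);
    let mut out_chunks = out.chunks_exact_mut(CHUNK_SIZE);
    for (o, i) in (&mut out_chunks).zip(&mut in_chunks) {
        // borrow each chunk as a fixed-size array so writes hit `out`
        let i: &[f32; CHUNK_SIZE] = i.try_into().unwrap();
        let o: &mut [f32; CHUNK_SIZE] = o.try_into().unwrap();
        for k in 0..CHUNK_SIZE {
            o[k] = f(i[k]);
        }
    }
    // scalar tail for the remainder
    for (o, i) in out_chunks.into_remainder().iter_mut().zip(in_chunks.remainder()) {
        *o = f(*i);
    }
    out
}

fn main() {
    // 2500 elements: two full 1024-element chunks plus a 452-element tail,
    // so a bug in the fixed-size path can't hide behind the scalar fallback
    let input: Vec<f32> = (0..2500).map(|v| v as f32).collect();
    let out = unary_chunked(&input, |v| v * 2.0);
    assert!(out.iter().zip(&input).all(|(o, i)| *o == i * 2.0));
    println!("ok");
}
```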
Yeah, that's the issue: the tests just never hit that code path. I guess it was just too good to be true.
@jhorstmann I tend to agree, but I couldn't find the issue and all tests seem to pass. I didn't have the time this weekend, but I plan to spin up an x86 machine sometime this week and see if the effect reproduces.
Which issue does this PR close?
Closes #6364.
Seems like both `PrimitiveArray::unary` and `binary` can be much faster, which seems valuable IMO even if the code is somewhat more complicated. Benchmark results on my machine (M3 Max MacBook) are below, and can be reproduced with `cargo bench --bench arithmetic_kernels --features test_utils`.

What changes are included in this PR?
Making both `unary` and `binary` operate on fixed-size chunks. I didn't touch the `mut` variants, but if the code here is acceptable I'm sure I can add that quickly.

Are there any user-facing changes?
No