Make unary and binary faster #6365

Closed

@AdamGS AdamGS commented Sep 6, 2024

Which issue does this PR close?

Closes #6364.

Seems like both PrimitiveArray::unary and binary can be much faster, which seems valuable IMO even if the code is somewhat more complicated.

Benchmark results on my machine (M3 Max MacBook) are below. They can be reproduced with cargo bench --bench arithmetic_kernels --features test_utils.

group                    after                                  before
-----                    -----                                  ------
add(0)                   1.00     84.6±2.77ns        ? ?/sec    82.86     7.0±0.47µs        ? ?/sec
add(0.1)                 1.00    405.5±4.43ns        ? ?/sec    16.19     6.6±0.98µs        ? ?/sec
add(0.5)                 1.00    410.6±7.88ns        ? ?/sec    14.54     6.0±0.67µs        ? ?/sec
add(0.9)                 1.00    411.8±7.31ns        ? ?/sec    18.08     7.4±0.34µs        ? ?/sec
add(1)                   1.00   416.3±16.35ns        ? ?/sec    17.42     7.3±0.54µs        ? ?/sec
add_checked(0)           1.00     83.9±3.12ns        ? ?/sec    85.34     7.2±0.12µs        ? ?/sec
add_checked(0.1)         1.00    408.6±5.53ns        ? ?/sec    17.31     7.1±0.75µs        ? ?/sec
add_checked(0.5)         1.00    412.3±6.15ns        ? ?/sec    16.70     6.9±0.77µs        ? ?/sec
add_checked(0.9)         1.00    416.9±9.60ns        ? ?/sec    17.03     7.1±0.83µs        ? ?/sec
add_checked(1)           1.00    420.8±6.78ns        ? ?/sec    16.79     7.1±0.82µs        ? ?/sec
add_scalar(0)            1.00     81.9±2.69ns        ? ?/sec    58.61     4.8±0.10µs        ? ?/sec
add_scalar(0.1)          1.00     81.6±4.16ns        ? ?/sec    58.94     4.8±0.10µs        ? ?/sec
add_scalar(0.5)          1.00     79.2±2.33ns        ? ?/sec    60.24     4.8±0.14µs        ? ?/sec
add_scalar(0.9)          1.00     85.2±3.82ns        ? ?/sec    55.92     4.8±0.17µs        ? ?/sec
add_scalar(1)            1.00     85.7±3.65ns        ? ?/sec    55.84     4.8±0.11µs        ? ?/sec
divide(0)                1.00     85.0±3.24ns        ? ?/sec    84.35     7.2±0.08µs        ? ?/sec
divide(0.1)              1.00    408.8±5.45ns        ? ?/sec    17.94     7.3±0.45µs        ? ?/sec
divide(0.5)              1.00    411.6±7.04ns        ? ?/sec    17.83     7.3±0.55µs        ? ?/sec
divide(0.9)              1.00    417.6±7.79ns        ? ?/sec    18.00     7.5±0.21µs        ? ?/sec
divide(1)                1.00    409.2±4.90ns        ? ?/sec    18.04     7.4±0.46µs        ? ?/sec
divide_scalar(0)         1.00     81.4±4.33ns        ? ?/sec    58.70     4.8±0.13µs        ? ?/sec
divide_scalar(0.1)       1.00     85.3±3.54ns        ? ?/sec    56.00     4.8±0.07µs        ? ?/sec
divide_scalar(0.5)       1.00     80.0±2.08ns        ? ?/sec    60.18     4.8±0.13µs        ? ?/sec
divide_scalar(0.9)       1.00     84.4±3.03ns        ? ?/sec    56.97     4.8±0.13µs        ? ?/sec
divide_scalar(1)         1.00     80.8±2.91ns        ? ?/sec    59.09     4.8±0.05µs        ? ?/sec
modulo(0)                1.00     84.0±3.20ns        ? ?/sec    1298.71   109.0±0.72µs        ? ?/sec
modulo(0.1)              1.00    407.7±5.99ns        ? ?/sec    375.97   153.3±1.83µs        ? ?/sec
modulo(0.5)              1.00    409.5±7.24ns        ? ?/sec    674.76   276.3±8.53µs        ? ?/sec
modulo(0.9)              1.00   422.8±39.26ns        ? ?/sec    322.92   136.5±1.31µs        ? ?/sec
modulo(1)                1.00    404.8±6.94ns        ? ?/sec    227.30    92.0±0.57µs        ? ?/sec
modulo_scalar(0)         1.00     78.5±1.65ns        ? ?/sec    3579.75   281.0±4.67µs        ? ?/sec
modulo_scalar(0.1)       1.00     85.7±3.68ns        ? ?/sec    2852.57   244.5±5.46µs        ? ?/sec
modulo_scalar(0.5)       1.00     81.4±3.19ns        ? ?/sec    3623.04   294.8±8.65µs        ? ?/sec
modulo_scalar(0.9)       1.00     84.3±4.73ns        ? ?/sec    1797.17   151.5±1.68µs        ? ?/sec
modulo_scalar(1)         1.00     79.8±3.12ns        ? ?/sec    1338.79   106.8±0.66µs        ? ?/sec
multiply(0)              1.00     90.0±2.24ns        ? ?/sec    78.80     7.1±0.39µs        ? ?/sec
multiply(0.1)            1.00    411.4±4.21ns        ? ?/sec    17.70     7.3±0.65µs        ? ?/sec
multiply(0.5)            1.00    410.6±4.85ns        ? ?/sec    16.72     6.9±0.70µs        ? ?/sec
multiply(0.9)            1.00    414.7±5.33ns        ? ?/sec    18.16     7.5±0.21µs        ? ?/sec
multiply(1)              1.00    408.9±6.23ns        ? ?/sec    17.83     7.3±0.70µs        ? ?/sec
multiply_checked(0)      1.00     89.0±3.29ns        ? ?/sec    79.38     7.1±0.40µs        ? ?/sec
multiply_checked(0.1)    1.00    405.2±3.89ns        ? ?/sec    17.57     7.1±0.70µs        ? ?/sec
multiply_checked(0.5)    1.00    412.5±4.71ns        ? ?/sec    17.44     7.2±0.52µs        ? ?/sec
multiply_checked(0.9)    1.00    416.3±9.08ns        ? ?/sec    18.07     7.5±0.34µs        ? ?/sec
multiply_checked(1)      1.00    407.0±5.49ns        ? ?/sec    18.45     7.5±0.14µs        ? ?/sec
multiply_scalar(0)       1.00     82.0±4.11ns        ? ?/sec    58.59     4.8±0.05µs        ? ?/sec
multiply_scalar(0.1)     1.00     80.0±2.29ns        ? ?/sec    59.87     4.8±0.06µs        ? ?/sec
multiply_scalar(0.5)     1.00     80.3±3.58ns        ? ?/sec    59.82     4.8±0.09µs        ? ?/sec
multiply_scalar(0.9)     1.00     81.1±3.47ns        ? ?/sec    58.84     4.8±0.22µs        ? ?/sec
multiply_scalar(1)       1.00     81.3±2.91ns        ? ?/sec    58.26     4.7±0.27µs        ? ?/sec
subtract(0)              1.00     85.1±2.00ns        ? ?/sec    83.17     7.1±0.40µs        ? ?/sec
subtract(0.1)            1.00    411.3±5.10ns        ? ?/sec    18.03     7.4±0.48µs        ? ?/sec
subtract(0.5)            1.00    410.3±4.17ns        ? ?/sec    17.42     7.1±0.76µs        ? ?/sec
subtract(0.9)            1.00    413.6±6.85ns        ? ?/sec    17.70     7.3±0.37µs        ? ?/sec
subtract(1)              1.00    419.8±6.35ns        ? ?/sec    17.19     7.2±0.60µs        ? ?/sec
subtract_checked(0)      1.00     85.3±2.44ns        ? ?/sec    84.14     7.2±0.07µs        ? ?/sec
subtract_checked(0.1)    1.00    410.0±8.48ns        ? ?/sec    15.48     6.3±0.80µs        ? ?/sec
subtract_checked(0.5)    1.00    406.8±4.91ns        ? ?/sec    17.89     7.3±0.64µs        ? ?/sec
subtract_checked(0.9)    1.00    413.8±5.42ns        ? ?/sec    17.82     7.4±0.41µs        ? ?/sec
subtract_checked(1)      1.00    421.1±6.14ns        ? ?/sec    16.67     7.0±0.86µs        ? ?/sec
subtract_scalar(0)       1.00     81.6±3.26ns        ? ?/sec    58.96     4.8±0.10µs        ? ?/sec
subtract_scalar(0.1)     1.00     85.6±2.74ns        ? ?/sec    56.00     4.8±0.11µs        ? ?/sec
subtract_scalar(0.5)     1.00     79.4±1.07ns        ? ?/sec    60.51     4.8±0.06µs        ? ?/sec
subtract_scalar(0.9)     1.00     81.4±3.41ns        ? ?/sec    59.13     4.8±0.10µs        ? ?/sec
subtract_scalar(1)       1.00     83.3±2.77ns        ? ?/sec    57.72     4.8±0.12µs        ? ?/sec

What changes are included in this PR?

Making both unary and binary operate on fixed-size chunks. I didn't touch the mut variants, but if the code here is acceptable I'm sure I can add that quickly.
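As a rough illustration of the chunking idea (not the PR's actual code), here is a minimal sketch of a unary kernel over plain slices. The name `chunked_unary`, the `i32` element type, and the value of `CHUNK_SIZE` are assumptions made for the example. Full chunks are viewed as fixed-size array references so the inner loop has a compile-time length, and the leftover tail is handled element by element.

```rust
use std::convert::TryInto;

// Assumed chunk length, chosen only for this sketch.
const CHUNK_SIZE: usize = 1024;

/// Hypothetical helper: applies `op` to every element of `input`,
/// processing full chunks of CHUNK_SIZE first and the tail afterwards.
fn chunked_unary<F: Fn(i32) -> i32>(input: &[i32], op: F) -> Vec<i32> {
    let mut output = vec![0i32; input.len()];
    let mut in_chunks = input.chunks_exact(CHUNK_SIZE);
    let mut out_chunks = output.chunks_exact_mut(CHUNK_SIZE);

    for (out, inp) in out_chunks.by_ref().zip(in_chunks.by_ref()) {
        // Borrow each chunk as a fixed-size array; nothing is copied here.
        let inp: &[i32; CHUNK_SIZE] = inp.try_into().unwrap();
        let out: &mut [i32; CHUNK_SIZE] = out.try_into().unwrap();
        for i in 0..CHUNK_SIZE {
            out[i] = op(inp[i]);
        }
    }

    // Elements that did not fill a whole chunk.
    let out_tail = out_chunks.into_remainder();
    for (o, v) in out_tail.iter_mut().zip(in_chunks.remainder()) {
        *o = op(*v);
    }
    output
}
```

Calling `chunked_unary(&values, |v| v + 1)` should give the same result as `values.iter().map(|v| v + 1).collect()` for any input length.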

Are there any user-facing changes?

No

@AdamGS AdamGS changed the title Adamg/chunked unary Make unary and binary faster Sep 6, 2024
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 6, 2024

alamb commented Sep 9, 2024

Thank you @AdamGS -- I plan to review this carefully over the next few days

cc @tustvold @jhorstmann @viirya

@jhorstmann

The numbers look a bit too good to be correct. If I try to calculate the throughput in terms of gigabytes/s for addition or multiplication, based on 400 ns for 2 × 64×1024×4 bytes, that comes out at around 600 GiB/s. The M3 Max is fast, but is it that fast using a single core?
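The back-of-the-envelope check can be written out as a small sketch. The element count (64×1024), element width (4 bytes), and 400 ns timing are taken from the comment above; the function and constant names are made up for the example. Counting one input array gives roughly 610 GiB/s, which is close to the quoted figure, and counting both inputs gives roughly 1220 GiB/s. Either way the result supports the suspicion that the benchmark is not doing the full work.

```rust
// Figures from the comment: 64 * 1024 elements of 4 bytes each per array.
const ELEMENTS: u64 = 64 * 1024;
const BYTES_PER_ELEMENT: u64 = 4;

/// Total bytes read when `arrays` input arrays are processed.
fn input_bytes(arrays: u64) -> u64 {
    arrays * ELEMENTS * BYTES_PER_ELEMENT
}

/// Throughput in GiB/s for `bytes` processed in `nanos` nanoseconds.
fn gib_per_sec(bytes: u64, nanos: f64) -> f64 {
    let bytes_per_sec = bytes as f64 / (nanos * 1e-9);
    bytes_per_sec / (1u64 << 30) as f64
}
```

For example, `gib_per_sec(input_bytes(2), 400.0)` evaluates to about 1220.7, and `gib_per_sec(input_bytes(1), 400.0)` to about 610.4.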

for (output, (a_chunk, b_chunk)) in output_chunks.zip(a_chunks.zip(b_chunks)) {
    let a_values: [A::Native; CHUNK_SIZE] = a_chunk.try_into().unwrap();
    let b_values: [B::Native; CHUNK_SIZE] = b_chunk.try_into().unwrap();
    let mut output: [MaybeUninit<O::Native>; CHUNK_SIZE] = output.try_into().unwrap();

Hm, shouldn't this be of type &mut [MaybeUninit<O::Native>; CHUNK_SIZE]? Otherwise this would be writing to a temporary and the compiler would probably optimize it away.
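To make the distinction concrete, here is a minimal standalone sketch. It uses `u32` and a small `CHUNK_SIZE` instead of the PR's `MaybeUninit<O::Native>`, and the function names are invented for the example. The standard library provides both conversions from `&mut [T]`: into an owned `[T; N]` (for `T: Copy`), which copies the elements, and into `&mut [T; N]`, which only reborrows them. Which one runs depends solely on the type annotation.

```rust
use std::convert::TryInto;

// Small chunk length, chosen only to keep the example readable.
const CHUNK_SIZE: usize = 4;

/// Converting to an owned array copies the chunk, so the writes land
/// in a temporary and `out` is left unchanged.
fn write_to_copy(out: &mut [u32]) {
    for chunk in out.chunks_exact_mut(CHUNK_SIZE) {
        let mut tmp: [u32; CHUNK_SIZE] = chunk.try_into().unwrap();
        for v in tmp.iter_mut() {
            *v = 1;
        }
    }
}

/// Converting to a mutable array reference borrows the chunk, so the
/// writes go through to `out`.
fn write_in_place(out: &mut [u32]) {
    for chunk in out.chunks_exact_mut(CHUNK_SIZE) {
        let arr: &mut [u32; CHUNK_SIZE] = chunk.try_into().unwrap();
        for v in arr.iter_mut() {
            *v = 1;
        }
    }
}
```

Running both on a zeroed buffer of eight elements leaves the first buffer all zeros and sets the second to all ones.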


There was a reddit thread about the two different try_into implementations just a few days ago: https://old.reddit.com/r/rust/comments/1f9iyfz/question_if_i_have_a_vect_how_do_i_pass_a_part_of/lllxbo0/

Contributor Author

OK, that seems to be the key point here. Changing that (along with let output_chunks = output.as_mut_slice().chunks_exact_mut(CHUNK_SIZE);) seems to make performance worse than the current implementation on master. What I don't understand is: if we were writing into a temporary value, how do the tests still pass?

@tustvold tustvold Sep 9, 2024

Have you tried running the tests in release mode? It may be that the optimiser doesn't "exploit" any UB in debug mode.

Edit: we may also not have any tests of arrays with more than 1024 elements

Contributor Author

Yeah that's the issue, it just never hits that code path, I guess it was just too good to be true.
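The coverage gap can be reproduced with a small sketch. Everything here is an assumption made for illustration: the kernel name `buggy_add_one`, the `u32` element type, a zero-initialised output (to avoid the uninitialised memory the real code would have), and `CHUNK_SIZE = 1024`, which is only suggested by the "more than 1024 elements" remark above. The kernel deliberately copies each output chunk into a temporary, while the tail is written correctly, so inputs shorter than one chunk never touch the broken path.

```rust
use std::convert::TryInto;

// Assumed chunk length, inferred from the thread rather than confirmed.
const CHUNK_SIZE: usize = 1024;

/// Deliberately buggy kernel: full chunks are written to a copy,
/// only the tail is written to `output`.
fn buggy_add_one(input: &[u32]) -> Vec<u32> {
    let mut output = vec![0u32; input.len()];
    let mut in_chunks = input.chunks_exact(CHUNK_SIZE);
    let mut out_chunks = output.chunks_exact_mut(CHUNK_SIZE);

    for (out, inp) in out_chunks.by_ref().zip(in_chunks.by_ref()) {
        // BUG: an owned array is a copy of the chunk, not a view into it.
        let mut tmp: [u32; CHUNK_SIZE] = out.try_into().unwrap();
        for (o, v) in tmp.iter_mut().zip(inp) {
            *o = v + 1;
        }
    }

    // The tail is handled correctly.
    let out_tail = out_chunks.into_remainder();
    for (o, v) in out_tail.iter_mut().zip(in_chunks.remainder()) {
        *o = v + 1;
    }
    output
}

/// Compares the kernel against a plain element-wise reference.
fn matches_reference(len: usize) -> bool {
    let input: Vec<u32> = (0..len as u32).collect();
    let expected: Vec<u32> = input.iter().map(|v| v + 1).collect();
    buggy_add_one(&input) == expected
}
```

With this setup `matches_reference` holds for any length below `CHUNK_SIZE` and fails from `CHUNK_SIZE` upwards, so a test suite would need at least one input spanning a full chunk plus a tail to catch the problem.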

AdamGS commented Sep 9, 2024

@jhorstmann I tend to agree, but I couldn't find the issue and all tests seem to pass. I didn't have time this weekend, but I plan to spin up an x86 machine sometime this week and see if the effect reproduces.
