Make `unary` and `binary` faster #6365

AdamGS · 2024-09-06T12:09:38Z

Which issue does this PR close?

Closes #6364.

It seems both `PrimitiveArray::unary` and `binary` can be much faster, which seems valuable IMO even if the code becomes somewhat more complicated.

Benchmark results on my machine (M3 Max MacBook) are below; they can be reproduced with cargo bench --bench arithmetic_kernels --features test_utils.

group                    after                                  before
-----                    -----                                  ------
add(0)                   1.00     84.6±2.77ns        ? ?/sec    82.86     7.0±0.47µs        ? ?/sec
add(0.1)                 1.00    405.5±4.43ns        ? ?/sec    16.19     6.6±0.98µs        ? ?/sec
add(0.5)                 1.00    410.6±7.88ns        ? ?/sec    14.54     6.0±0.67µs        ? ?/sec
add(0.9)                 1.00    411.8±7.31ns        ? ?/sec    18.08     7.4±0.34µs        ? ?/sec
add(1)                   1.00   416.3±16.35ns        ? ?/sec    17.42     7.3±0.54µs        ? ?/sec
add_checked(0)           1.00     83.9±3.12ns        ? ?/sec    85.34     7.2±0.12µs        ? ?/sec
add_checked(0.1)         1.00    408.6±5.53ns        ? ?/sec    17.31     7.1±0.75µs        ? ?/sec
add_checked(0.5)         1.00    412.3±6.15ns        ? ?/sec    16.70     6.9±0.77µs        ? ?/sec
add_checked(0.9)         1.00    416.9±9.60ns        ? ?/sec    17.03     7.1±0.83µs        ? ?/sec
add_checked(1)           1.00    420.8±6.78ns        ? ?/sec    16.79     7.1±0.82µs        ? ?/sec
add_scalar(0)            1.00     81.9±2.69ns        ? ?/sec    58.61     4.8±0.10µs        ? ?/sec
add_scalar(0.1)          1.00     81.6±4.16ns        ? ?/sec    58.94     4.8±0.10µs        ? ?/sec
add_scalar(0.5)          1.00     79.2±2.33ns        ? ?/sec    60.24     4.8±0.14µs        ? ?/sec
add_scalar(0.9)          1.00     85.2±3.82ns        ? ?/sec    55.92     4.8±0.17µs        ? ?/sec
add_scalar(1)            1.00     85.7±3.65ns        ? ?/sec    55.84     4.8±0.11µs        ? ?/sec
divide(0)                1.00     85.0±3.24ns        ? ?/sec    84.35     7.2±0.08µs        ? ?/sec
divide(0.1)              1.00    408.8±5.45ns        ? ?/sec    17.94     7.3±0.45µs        ? ?/sec
divide(0.5)              1.00    411.6±7.04ns        ? ?/sec    17.83     7.3±0.55µs        ? ?/sec
divide(0.9)              1.00    417.6±7.79ns        ? ?/sec    18.00     7.5±0.21µs        ? ?/sec
divide(1)                1.00    409.2±4.90ns        ? ?/sec    18.04     7.4±0.46µs        ? ?/sec
divide_scalar(0)         1.00     81.4±4.33ns        ? ?/sec    58.70     4.8±0.13µs        ? ?/sec
divide_scalar(0.1)       1.00     85.3±3.54ns        ? ?/sec    56.00     4.8±0.07µs        ? ?/sec
divide_scalar(0.5)       1.00     80.0±2.08ns        ? ?/sec    60.18     4.8±0.13µs        ? ?/sec
divide_scalar(0.9)       1.00     84.4±3.03ns        ? ?/sec    56.97     4.8±0.13µs        ? ?/sec
divide_scalar(1)         1.00     80.8±2.91ns        ? ?/sec    59.09     4.8±0.05µs        ? ?/sec
modulo(0)                1.00     84.0±3.20ns        ? ?/sec    1298.71   109.0±0.72µs        ? ?/sec
modulo(0.1)              1.00    407.7±5.99ns        ? ?/sec    375.97   153.3±1.83µs        ? ?/sec
modulo(0.5)              1.00    409.5±7.24ns        ? ?/sec    674.76   276.3±8.53µs        ? ?/sec
modulo(0.9)              1.00   422.8±39.26ns        ? ?/sec    322.92   136.5±1.31µs        ? ?/sec
modulo(1)                1.00    404.8±6.94ns        ? ?/sec    227.30    92.0±0.57µs        ? ?/sec
modulo_scalar(0)         1.00     78.5±1.65ns        ? ?/sec    3579.75   281.0±4.67µs        ? ?/sec
modulo_scalar(0.1)       1.00     85.7±3.68ns        ? ?/sec    2852.57   244.5±5.46µs        ? ?/sec
modulo_scalar(0.5)       1.00     81.4±3.19ns        ? ?/sec    3623.04   294.8±8.65µs        ? ?/sec
modulo_scalar(0.9)       1.00     84.3±4.73ns        ? ?/sec    1797.17   151.5±1.68µs        ? ?/sec
modulo_scalar(1)         1.00     79.8±3.12ns        ? ?/sec    1338.79   106.8±0.66µs        ? ?/sec
multiply(0)              1.00     90.0±2.24ns        ? ?/sec    78.80     7.1±0.39µs        ? ?/sec
multiply(0.1)            1.00    411.4±4.21ns        ? ?/sec    17.70     7.3±0.65µs        ? ?/sec
multiply(0.5)            1.00    410.6±4.85ns        ? ?/sec    16.72     6.9±0.70µs        ? ?/sec
multiply(0.9)            1.00    414.7±5.33ns        ? ?/sec    18.16     7.5±0.21µs        ? ?/sec
multiply(1)              1.00    408.9±6.23ns        ? ?/sec    17.83     7.3±0.70µs        ? ?/sec
multiply_checked(0)      1.00     89.0±3.29ns        ? ?/sec    79.38     7.1±0.40µs        ? ?/sec
multiply_checked(0.1)    1.00    405.2±3.89ns        ? ?/sec    17.57     7.1±0.70µs        ? ?/sec
multiply_checked(0.5)    1.00    412.5±4.71ns        ? ?/sec    17.44     7.2±0.52µs        ? ?/sec
multiply_checked(0.9)    1.00    416.3±9.08ns        ? ?/sec    18.07     7.5±0.34µs        ? ?/sec
multiply_checked(1)      1.00    407.0±5.49ns        ? ?/sec    18.45     7.5±0.14µs        ? ?/sec
multiply_scalar(0)       1.00     82.0±4.11ns        ? ?/sec    58.59     4.8±0.05µs        ? ?/sec
multiply_scalar(0.1)     1.00     80.0±2.29ns        ? ?/sec    59.87     4.8±0.06µs        ? ?/sec
multiply_scalar(0.5)     1.00     80.3±3.58ns        ? ?/sec    59.82     4.8±0.09µs        ? ?/sec
multiply_scalar(0.9)     1.00     81.1±3.47ns        ? ?/sec    58.84     4.8±0.22µs        ? ?/sec
multiply_scalar(1)       1.00     81.3±2.91ns        ? ?/sec    58.26     4.7±0.27µs        ? ?/sec
subtract(0)              1.00     85.1±2.00ns        ? ?/sec    83.17     7.1±0.40µs        ? ?/sec
subtract(0.1)            1.00    411.3±5.10ns        ? ?/sec    18.03     7.4±0.48µs        ? ?/sec
subtract(0.5)            1.00    410.3±4.17ns        ? ?/sec    17.42     7.1±0.76µs        ? ?/sec
subtract(0.9)            1.00    413.6±6.85ns        ? ?/sec    17.70     7.3±0.37µs        ? ?/sec
subtract(1)              1.00    419.8±6.35ns        ? ?/sec    17.19     7.2±0.60µs        ? ?/sec
subtract_checked(0)      1.00     85.3±2.44ns        ? ?/sec    84.14     7.2±0.07µs        ? ?/sec
subtract_checked(0.1)    1.00    410.0±8.48ns        ? ?/sec    15.48     6.3±0.80µs        ? ?/sec
subtract_checked(0.5)    1.00    406.8±4.91ns        ? ?/sec    17.89     7.3±0.64µs        ? ?/sec
subtract_checked(0.9)    1.00    413.8±5.42ns        ? ?/sec    17.82     7.4±0.41µs        ? ?/sec
subtract_checked(1)      1.00    421.1±6.14ns        ? ?/sec    16.67     7.0±0.86µs        ? ?/sec
subtract_scalar(0)       1.00     81.6±3.26ns        ? ?/sec    58.96     4.8±0.10µs        ? ?/sec
subtract_scalar(0.1)     1.00     85.6±2.74ns        ? ?/sec    56.00     4.8±0.11µs        ? ?/sec
subtract_scalar(0.5)     1.00     79.4±1.07ns        ? ?/sec    60.51     4.8±0.06µs        ? ?/sec
subtract_scalar(0.9)     1.00     81.4±3.41ns        ? ?/sec    59.13     4.8±0.10µs        ? ?/sec
subtract_scalar(1)       1.00     83.3±2.77ns        ? ?/sec    57.72     4.8±0.12µs        ? ?/sec

What changes are included in this PR?

Making both unary and binary operate on fixed-size chunks. I didn't touch the mut variants, but if the code here is acceptable I can add them quickly.

Are there any user-facing changes?

No

alamb · 2024-09-09T12:45:06Z

Thank you @AdamGS -- I plan to review this carefully over the next few days

cc @tustvold @jhorstmann @viirya

jhorstmann · 2024-09-09T14:04:07Z

The numbers look a bit too good to be correct. If I try to calculate the throughput in gigabytes/s for addition or multiplication, based on 400ns for 2 x 64x1024x4 bytes, that comes out at around 600GiB/s. The M3 Max is fast, but is it that fast using a single core?
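The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the array length (64 x 1024 elements), the 4-byte element width, and the 400ns timing are taken from the comment, and the helper names are made up for illustration. Counting a single input array gives roughly 610 GiB/s, which matches the figure quoted; counting both inputs doubles that, and the output buffer is not counted at all.

```rust
/// Converts a byte count and a duration in nanoseconds into GiB/s.
fn gib_per_sec(bytes: u64, nanos: f64) -> f64 {
    let bytes_per_sec = bytes as f64 / (nanos * 1e-9);
    bytes_per_sec / (1u64 << 30) as f64
}

// Assumed benchmark shape, taken from the comment: 64 * 1024 four-byte values.
const ELEMENTS: u64 = 64 * 1024;
const BYTES_PER_ARRAY: u64 = ELEMENTS * 4;

/// Throughput if only one input array is counted.
fn one_input_gib_per_sec() -> f64 {
    gib_per_sec(BYTES_PER_ARRAY, 400.0)
}

/// Throughput if both input arrays are counted.
fn two_inputs_gib_per_sec() -> f64 {
    gib_per_sec(2 * BYTES_PER_ARRAY, 400.0)
}
```

Either way the result is far beyond what a single core is likely to sustain from memory, which is the point of the comment.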

jhorstmann · 2024-09-09T14:06:28Z

arrow-arith/src/arity.rs

+    for (output, (a_chunk, b_chunk)) in output_chunks.zip(a_chunks.zip(b_chunks)) {
+        let a_values: [A::Native; CHUNK_SIZE] = a_chunk.try_into().unwrap();
+        let b_values: [B::Native; CHUNK_SIZE] = b_chunk.try_into().unwrap();
+        let mut output: [MaybeUninit<O::Native>; CHUNK_SIZE] = output.try_into().unwrap();


jhorstmann · 2024-09-09

Hm, shouldn't this be of type &mut [MaybeUninit<O::Native>; CHUNK_SIZE]? Otherwise this would be writing to a temporary and the compiler would probably optimize it away.

There was a reddit thread about the two different try_into implementations just a few days ago: https://old.reddit.com/r/rust/comments/1f9iyfz/question_if_i_have_a_vect_how_do_i_pass_a_part_of/lllxbo0/
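The difference between the two `try_into` conversions can be shown with a small safe-Rust sketch. It uses `u32` instead of `MaybeUninit<O::Native>` and a tiny `CHUNK_SIZE` purely for illustration, and the function names are hypothetical. Converting a `&mut [T]` into an owned `[T; N]` (available when `T: Copy`) copies the elements, so writes go to a temporary. Converting into `&mut [T; N]` borrows the original memory.

```rust
use std::convert::TryInto;

const CHUNK_SIZE: usize = 4;

/// Converts each chunk into an owned array, so the writes land in a copy.
fn fill_via_owned_array(buf: &mut [u32]) {
    for chunk in buf.chunks_exact_mut(CHUNK_SIZE) {
        // `TryFrom<&mut [T]> for [T; N]` copies the elements out of the slice.
        let mut tmp: [u32; CHUNK_SIZE] = chunk.try_into().unwrap();
        for v in tmp.iter_mut() {
            *v = 7;
        }
        // `tmp` goes out of scope here; `buf` was never modified.
    }
}

/// Converts each chunk into an array reference, so the writes reach `buf`.
fn fill_via_array_ref(buf: &mut [u32]) {
    for chunk in buf.chunks_exact_mut(CHUNK_SIZE) {
        // `TryFrom<&mut [T]> for &mut [T; N]` reborrows the same memory.
        let out: &mut [u32; CHUNK_SIZE] = chunk.try_into().unwrap();
        for v in out.iter_mut() {
            *v = 7;
        }
    }
}
```

Calling `fill_via_owned_array` on a zeroed buffer leaves it zeroed, while `fill_via_array_ref` sets every element in the full chunks to 7.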

AdamGS · 2024-09-09

Ok, that seems to be the key point here: changing that (and using let output_chunks = output.as_mut_slice().chunks_exact_mut(CHUNK_SIZE);) seems to make performance worse than the current implementation on master. What I don't understand is, if we write into a temporary value, how do the tests still pass?

tustvold · 2024-09-09

Have you tried running the tests in release mode? It may be that the optimiser doesn't "exploit" any UB in debug mode.

Edit: we may also not have any tests of arrays with more than 1024 elements

AdamGS · 2024-09-09

Yeah, that's the issue: the tests just never hit that code path. I guess it was just too good to be true.
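One way to guard against this kind of gap is to compare a chunked kernel with a plain scalar loop at lengths that straddle the chunk boundary. The sketch below is standalone safe Rust, not the arrow-rs implementation: `chunked_unary`, `boundary_lengths`, and `matches_scalar` are hypothetical names, and `CHUNK_SIZE` of 1024 is assumed from the discussion above.

```rust
const CHUNK_SIZE: usize = 1024;

/// Applies `op` element-wise: full chunks first, then the remaining tail.
fn chunked_unary<F: Fn(i32) -> i32>(input: &[i32], op: F) -> Vec<i32> {
    let mut output = vec![0i32; input.len()];
    let in_chunks = input.chunks_exact(CHUNK_SIZE);
    let in_tail = in_chunks.remainder();
    let mut out_chunks = output.chunks_exact_mut(CHUNK_SIZE);
    // Full chunks: only reached when the input has at least CHUNK_SIZE elements.
    for (out, inp) in out_chunks.by_ref().zip(in_chunks) {
        for (o, i) in out.iter_mut().zip(inp) {
            *o = op(*i);
        }
    }
    // Tail: whatever is left after the last full chunk.
    for (o, i) in out_chunks.into_remainder().iter_mut().zip(in_tail) {
        *o = op(*i);
    }
    output
}

/// Lengths chosen to sit below, on, and above the chunk boundary.
fn boundary_lengths() -> Vec<usize> {
    vec![0, 1, CHUNK_SIZE - 1, CHUNK_SIZE, CHUNK_SIZE + 1, 3 * CHUNK_SIZE + 17]
}

/// Returns true when the chunked result equals a straightforward scalar map.
fn matches_scalar(len: usize) -> bool {
    let input: Vec<i32> = (0..len as i32).collect();
    let expected: Vec<i32> = input.iter().map(|v| v.wrapping_mul(3)).collect();
    chunked_unary(&input, |v| v.wrapping_mul(3)) == expected
}
```

With inputs shorter than `CHUNK_SIZE` only the tail loop runs, so a bug in the chunked loop would go unnoticed; the longer lengths are the ones that exercise it.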

AdamGS · 2024-09-09T14:22:28Z

@jhorstmann I tend to agree, but I couldn't find the issue and all tests seem to pass. I didn't have time this weekend, but I plan to spin up an x86 machine sometime this week and see if the effect reproduces.

AdamGS added 5 commits September 5, 2024 18:41

initial work

03b58c5

.

96d1d19

less unsafe

eefe2c5

.

b018c0f

Add implementation

8a04c18

AdamGS changed the title ~~Adamg/chunked unary~~ Make unary and binary faster Sep 6, 2024

github-actions bot added the arrow Changes to the arrow crate label Sep 6, 2024

jhorstmann reviewed Sep 9, 2024

AdamGS closed this Sep 9, 2024

This was referenced Sep 11, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 9, 2024 apache/datafusion#12391

Closed

DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 apache/datafusion#12494

Closed

alamb mentioned this pull request Oct 2, 2024

Primitive binary/unary are not as fast as they could be #6364

Closed
