Improve performance of set_bits by avoiding to set individual bits #6288

Merged 48 commits into apache:master on Sep 15, 2024

Conversation

@kazuyukitanimura kazuyukitanimura commented Aug 22, 2024

Which issue does this PR close?

Rationale for this change

This PR improves the performance of set_bits, which is often the bottleneck for filter execs in DataFusion. In particular, calls with len < 64 make up the majority for TPCDS, and this PR helps in that scenario.

Results for the existing benchmark:

boolean_append_packed   time:   [5.3420 µs 5.3538 µs 5.3664 µs]
                        change: [-20.648% -19.343% -18.564%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

Consolidating BitChunks::new and individual bit manipulations into one method.
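
As a rough illustration of the idea (not the code merged in this PR), the sketch below contrasts copying bits one at a time with moving up to 64 bits per step using shifts and masks. The names set_bits_naive, load_u64, and set_bits_chunked are hypothetical, and the sketch assumes LSB-first bit order within each byte and a destination whose target bits start out unset.

```rust
/// Reference approach: copy `len` bits one at a time (LSB-first within each byte).
fn set_bits_naive(write: &mut [u8], data: &[u8], offset_write: usize, offset_read: usize, len: usize) {
    for i in 0..len {
        let r = offset_read + i;
        if data[r / 8] & (1 << (r % 8)) != 0 {
            let w = offset_write + i;
            write[w / 8] |= 1 << (w % 8);
        }
    }
}

/// Load up to 8 bytes starting at `byte` as a little-endian u64,
/// zero-padding when fewer than 8 bytes remain.
fn load_u64(data: &[u8], byte: usize) -> u64 {
    if byte >= data.len() {
        return 0;
    }
    let n = (data.len() - byte).min(8);
    let mut buf = [0u8; 8];
    buf[..n].copy_from_slice(&data[byte..byte + n]);
    u64::from_le_bytes(buf)
}

/// Chunked approach: move up to 64 bits per iteration.
fn set_bits_chunked(write: &mut [u8], data: &[u8], offset_write: usize, offset_read: usize, len: usize) {
    let mut done = 0;
    while done < len {
        let r = offset_read + done;
        let w = offset_write + done;
        let (read_shift, write_shift) = (r % 8, w % 8);
        // Number of bits that still fit in one u64 after either shift.
        let n = (len - done).min(64 - read_shift.max(write_shift));
        let mask = if n == 64 { u64::MAX } else { (1u64 << n) - 1 };
        let chunk = (load_u64(data, r / 8) >> read_shift) & mask;
        // OR the realigned chunk into the destination bytes it touches.
        let shifted = (chunk << write_shift).to_le_bytes();
        let nbytes = (n + write_shift + 7) / 8;
        for (dst, src) in write[w / 8..w / 8 + nbytes].iter_mut().zip(shifted.iter()) {
            *dst |= *src;
        }
        done += n;
    }
}
```

Both functions should produce identical output for any in-bounds offsets and lengths; the chunked version simply does the work in far fewer loop iterations.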

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow label (Changes to the arrow crate) Aug 22, 2024

kazuyukitanimura commented Aug 22, 2024

bit_mask/set_bits/offset_write_0_offset_read_0_len_1_datum_0
                        time:   [3.8759 ns 3.9730 ns 4.0840 ns]
                        change: [-21.945% -20.320% -18.263%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) high mild
  20 (20.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_0_len_1_datum_173
                        time:   [3.9758 ns 4.1393 ns 4.3336 ns]
                        change: [-25.893% -23.843% -21.956%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_0_len_17_datum_0
                        time:   [6.8333 ns 6.9420 ns 7.1782 ns]
                        change: [-58.235% -57.940% -57.435%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_0_len_17_datum_173
                        time:   [6.8351 ns 6.8465 ns 6.8667 ns]
                        change: [-76.137% -75.801% -75.416%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_0_len_65_datum_0
                        time:   [6.8426 ns 7.3042 ns 7.7755 ns]
                        change: [-30.350% -26.771% -23.218%] (p = 0.00 < 0.05)
                        Performance has improved.
bit_mask/set_bits/offset_write_0_offset_read_0_len_65_datum_173
                        time:   [7.1185 ns 7.4987 ns 7.8494 ns]
                        change: [-10.140% -5.4273% -0.7529%] (p = 0.02 < 0.05)
                        Change within noise threshold.
bit_mask/set_bits/offset_write_0_offset_read_5_len_1_datum_0
                        time:   [3.9477 ns 4.0703 ns 4.2105 ns]
                        change: [-22.813% -20.845% -18.410%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  18 (18.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_5_len_1_datum_173
                        time:   [3.8890 ns 3.9610 ns 4.0449 ns]
                        change: [-26.637% -25.411% -23.953%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  3 (3.00%) high mild
  17 (17.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_5_len_17_datum_0
                        time:   [6.8372 ns 6.8422 ns 6.8484 ns]
                        change: [-58.155% -58.028% -57.863%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_5_len_17_datum_173
                        time:   [6.8366 ns 6.8466 ns 6.8626 ns]
                        change: [-76.140% -75.799% -75.416%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) high mild
  11 (11.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_5_len_65_datum_0
                        time:   [7.0263 ns 7.0297 ns 7.0341 ns]
                        change: [-40.713% -39.692% -38.570%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  6 (6.00%) high severe
bit_mask/set_bits/offset_write_0_offset_read_5_len_65_datum_173
                        time:   [7.0466 ns 7.0613 ns 7.0769 ns]
                        change: [-26.789% -24.940% -22.963%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  10 (10.00%) high mild
  4 (4.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_0_len_1_datum_0
                        time:   [3.9634 ns 4.0719 ns 4.1875 ns]
                        change: [-22.678% -21.434% -19.766%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  2 (2.00%) high mild
  21 (21.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_0_len_1_datum_173
                        time:   [4.0279 ns 4.1669 ns 4.3204 ns]
                        change: [-24.572% -22.810% -20.793%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  13 (13.00%) high mild
  1 (1.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_0_len_17_datum_0
                        time:   [6.8528 ns 6.8644 ns 6.8769 ns]
                        change: [-58.249% -58.097% -57.938%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_0_len_17_datum_173
                        time:   [6.8432 ns 6.8517 ns 6.8622 ns]
                        change: [-74.387% -74.241% -74.063%] (p = 0.00 < 0.05)
                        Performance has improved.
bit_mask/set_bits/offset_write_5_offset_read_0_len_65_datum_0
                        time:   [7.2842 ns 7.3491 ns 7.4800 ns]
                        change: [-84.073% -83.866% -83.528%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  10 (10.00%) high mild
  7 (7.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_0_len_65_datum_173
                        time:   [7.2896 ns 7.3805 ns 7.5650 ns]
                        change: [-90.865% -90.774% -90.618%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_5_len_1_datum_0
                        time:   [3.9567 ns 4.0505 ns 4.1594 ns]
                        change: [-19.319% -17.125% -14.856%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
bit_mask/set_bits/offset_write_5_offset_read_5_len_1_datum_173
                        time:   [3.8761 ns 3.9783 ns 4.0938 ns]
                        change: [-27.058% -25.845% -24.387%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  15 (15.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_5_len_17_datum_0
                        time:   [6.8488 ns 6.8609 ns 6.8742 ns]
                        change: [-58.721% -58.440% -58.162%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  4 (4.00%) high mild
  15 (15.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_5_len_17_datum_173
                        time:   [6.8905 ns 6.9986 ns 7.1764 ns]
                        change: [-73.753% -73.509% -73.176%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
bit_mask/set_bits/offset_write_5_offset_read_5_len_65_datum_0
                        time:   [6.8726 ns 6.8881 ns 6.9052 ns]
                        change: [-84.871% -84.829% -84.784%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
bit_mask/set_bits/offset_write_5_offset_read_5_len_65_datum_173
                        time:   [6.8930 ns 6.9067 ns 6.9202 ns]
                        change: [-91.308% -91.286% -91.267%] (p = 0.00 < 0.05)
                        Performance has improved.

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review August 23, 2024 21:40

kazuyukitanimura commented Aug 23, 2024

@alamb @andygrove @viirya
I don't know how to make Miri happy, but this change will improve DataFusion filter exec performance (e.g. TPCDS).
I am traveling next week, so I may be unable to respond in time, but I will address reviews the week after next.
Thank you in advance.

alamb commented Sep 9, 2024

Here are the benchmark results I got from this branch:

++ critcmp master fix-set-bits
group                                                              fix-set-bits                           master
-----                                                              ------------                           ------
bit_mask/set_bits/offset_write_0_offset_read_0_len_17_datum_0      1.00     19.2±0.01ns        ? ?/sec    1.71     32.7±0.03ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_0_len_17_datum_173    1.00     19.2±0.02ns        ? ?/sec    2.25     43.2±0.05ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_0_len_1_datum_0       1.00      6.8±0.01ns        ? ?/sec    1.77     12.1±0.02ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_0_len_1_datum_173     1.00      6.8±0.01ns        ? ?/sec    1.76     12.0±0.01ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_0_len_65_datum_0      1.00     10.3±0.01ns        ? ?/sec    1.53     15.8±0.18ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_0_len_65_datum_173    1.00     10.3±0.01ns        ? ?/sec    1.69     17.4±0.04ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_5_len_17_datum_0      1.00     19.2±0.02ns        ? ?/sec    1.71     32.7±0.03ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_5_len_17_datum_173    1.00     19.2±0.02ns        ? ?/sec    2.26     43.4±0.13ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_5_len_1_datum_0       1.00      6.8±0.01ns        ? ?/sec    1.77     12.1±0.07ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_5_len_1_datum_173     1.00      6.8±0.01ns        ? ?/sec    1.76     12.0±0.01ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_5_len_65_datum_0      1.27     21.2±0.01ns        ? ?/sec    1.00     16.7±0.07ns        ? ?/sec
bit_mask/set_bits/offset_write_0_offset_read_5_len_65_datum_173    1.22     21.2±0.03ns        ? ?/sec    1.00     17.4±0.04ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_0_len_17_datum_0      1.00     19.2±0.02ns        ? ?/sec    1.70     32.5±0.04ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_0_len_17_datum_173    1.00     19.2±0.02ns        ? ?/sec    2.19     42.0±0.10ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_0_len_1_datum_0       1.00      6.8±0.01ns        ? ?/sec    1.78     12.1±0.01ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_0_len_1_datum_173     1.00      6.8±0.00ns        ? ?/sec    1.76     12.0±0.01ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_0_len_65_datum_0      1.00     21.0±0.31ns        ? ?/sec    4.45     93.2±0.10ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_0_len_65_datum_173    1.00     20.9±0.04ns        ? ?/sec    6.62    138.6±0.22ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_5_len_17_datum_0      1.00     19.2±0.02ns        ? ?/sec    1.70     32.5±0.03ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_5_len_17_datum_173    1.00     19.2±0.02ns        ? ?/sec    2.21     42.5±0.32ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_5_len_1_datum_0       1.00      6.8±0.01ns        ? ?/sec    1.78     12.1±0.01ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_5_len_1_datum_173     1.00      6.8±0.01ns        ? ?/sec    1.76     12.0±0.01ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_5_len_65_datum_0      1.00     21.3±0.03ns        ? ?/sec    4.37     93.2±0.10ns        ? ?/sec
bit_mask/set_bits/offset_write_5_offset_read_5_len_65_datum_173    1.00     21.3±0.02ns        ? ?/sec    6.40    136.6±0.57ns        ? ?/sec

🚀

@crepererum crepererum left a comment

I know it's like the 4th round of comments, but I feel we're getting super close to merging. I think this might be the final round. The code now also feels more compact and easier to follow. Thanks for being so patient.

arrow-buffer/src/util/bit_mask.rs
    } else {
        let len = std::cmp::min(len, 64 - std::cmp::max(read_shift, write_shift));
        let bytes = ceil(len + read_shift, 8);
        let chunk = unsafe { read_bytes_to_u64(data, read_byte, bytes) };

Contributor

Could you add some // SAFETY: explanations to unsafe usages?

Contributor Author

Updated
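
For illustration only, here is one way a documented unsafe read could look. read_le_u64 and read_chunk are hypothetical names, not the helpers in bit_mask.rs, and the write_shift term is left out for brevity. The byte-count arithmetic mirrors the snippet under review: with len = 17 and read_shift = 5, bytes = ceil(22 / 8) = 3.

```rust
/// Hypothetical helper: read `count` bytes starting at `data[offset]` into
/// the low bytes of a little-endian u64.
///
/// # Safety
/// The caller must guarantee `count <= 8` and `offset + count <= data.len()`.
unsafe fn read_le_u64(data: &[u8], offset: usize, count: usize) -> u64 {
    debug_assert!(count <= 8 && offset + count <= data.len());
    let mut tmp = 0u64;
    // SAFETY: the caller guarantees that `count` bytes are readable starting
    // at `data[offset]` and that `count <= 8`, so the copy stays inside both
    // `data` and `tmp`. The two regions cannot overlap because `tmp` is a
    // fresh local variable.
    unsafe {
        std::ptr::copy_nonoverlapping(
            data.as_ptr().add(offset),
            &mut tmp as *mut u64 as *mut u8,
            count,
        );
    }
    // The bytes were stored in memory order, so interpret them as little-endian.
    u64::from_le(tmp)
}

/// Safe wrapper: extract up to 64 bits starting at bit `offset_read`.
fn read_chunk(data: &[u8], offset_read: usize, len: usize) -> u64 {
    let read_byte = offset_read / 8;
    let read_shift = offset_read % 8;
    // Keep len + read_shift within 64 bits so one u64 load is enough.
    let len = len.min(64 - read_shift);
    let bytes = (len + read_shift + 7) / 8;
    assert!(read_byte + bytes <= data.len(), "read past end of data");
    // SAFETY: `bytes <= 8` because `len + read_shift <= 64`, and the assert
    // above checked that `read_byte + bytes` is within `data`.
    let chunk = unsafe { read_le_u64(data, read_byte, bytes) };
    let mask = if len == 64 { u64::MAX } else { (1u64 << len) - 1 };
    (chunk >> read_shift) & mask
}
```

Each SAFETY comment states which invariant makes the operation sound and where that invariant is established, which is the information a reviewer needs in order to audit the unsafe block without rereading the whole function.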

@crepererum crepererum left a comment

Looks good to me now, thanks. Given the weight of unsafe code, I would like to see a 2nd approval though.

@kazuyukitanimura

Thank you @crepererum

@alamb @andygrove @viirya @Dandandan Any other comments?

@alamb alamb left a comment

Thank you @kazuyukitanimura -- I will review this PR later today

@alamb alamb left a comment

Thank you @kazuyukitanimura -- I believe this code is correct, but I think it needs some more testing for me to be super confident in it, especially given its profuse use of unsafe as mentioned by @crepererum

I have an idea for fuzz testing -- I am going to try and code something up and report back here

    fn test_set_upto_64bits() {
        // len >= 64
        let write_data: &mut [u8] = &mut [0; 9];
        let data: &[u8] = &[

Contributor

Could you please also add a test that is greater than 64 bits (not just = 64 bits)?

@kazuyukitanimura kazuyukitanimura Sep 13, 2024

Contributor

I am working on some more tests too. Stay tuned...

@alamb alamb mentioned this pull request Sep 13, 2024
@kazuyukitanimura

> Thank you @kazuyukitanimura -- I believe this code is correct, but I think it needs some more testing for me to be super confident in it, especially given its profuse use of unsafe as mentioned by @crepererum
>
> I have an idea for fuzz testing -- I am going to try and code something up and report back here

Thanks @alamb. FYI, it looks like there is an existing fuzz test, although it is limited to len < 32:
https://github.com/apache/arrow-rs/blob/master/arrow-buffer/src/builder/boolean.rs#L387

@alamb alamb left a comment

I wrote a fuzz tester for set_bits here: #6394

I ran the fuzz tester against this branch and it passed

I also ran it under MIRI (which took quite a while) and it found no errors

andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs$ cargo +nightly miri test -p arrow-buffer --profile test --lib util::bit_mask::tests
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running unittests src/lib.rs (target/miri/aarch64-apple-darwin/debug/deps/arrow_buffer-82bdebfeced20ddb)

running 6 tests
test util::bit_mask::tests::set_bits_fuz ... ok
test util::bit_mask::tests::test_set_bits_aligned ... ok
test util::bit_mask::tests::test_set_bits_unaligned ... ok
test util::bit_mask::tests::test_set_bits_unaligned_destination_end ... ok
test util::bit_mask::tests::test_set_bits_unaligned_destination_start ... ok
test util::bit_mask::tests::test_set_upto_64bits ... ok

test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 112 filtered out; finished in 38.62s

Therefore, I think this is very well done @kazuyukitanimura 👏 🏆 🚀
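
The fuzz tester from #6394 is not reproduced here. As a hedged sketch of the general technique (differential fuzzing against a bit-by-bit reference), something along these lines could be used. XorShift, set_bits_reference, and fuzz_set_bits are hypothetical names, the signature of the candidate function is an assumption, and lengths go up to 200 so that inputs longer than 64 bits are exercised.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Pseudo-random value in 0..=max.
    fn upto(&mut self, max: usize) -> usize {
        (self.next_u64() % (max as u64 + 1)) as usize
    }
}

/// Bit-by-bit reference: slow, but easy to verify by inspection.
fn set_bits_reference(write: &mut [u8], data: &[u8], offset_write: usize, offset_read: usize, len: usize) {
    for i in 0..len {
        let r = offset_read + i;
        if data[r / 8] & (1 << (r % 8)) != 0 {
            let w = offset_write + i;
            write[w / 8] |= 1 << (w % 8);
        }
    }
}

/// Run `candidate` on random inputs and compare its output with the reference.
fn fuzz_set_bits<F>(candidate: F, iterations: usize, seed: u64) -> Result<(), String>
where
    F: Fn(&mut [u8], &[u8], usize, usize, usize),
{
    // The xorshift state must be non-zero.
    let mut rng = XorShift(seed.max(1));
    for _ in 0..iterations {
        let len = rng.upto(200);
        let offset_read = rng.upto(16);
        let offset_write = rng.upto(16);
        let data: Vec<u8> = (0..(offset_read + len + 7) / 8)
            .map(|_| rng.next_u64() as u8)
            .collect();
        let mut expected = vec![0u8; (offset_write + len + 7) / 8];
        let mut actual = expected.clone();
        set_bits_reference(&mut expected, &data, offset_write, offset_read, len);
        candidate(&mut actual, &data, offset_write, offset_read, len);
        if expected != actual {
            return Err(format!(
                "mismatch: offset_write={offset_write} offset_read={offset_read} len={len}"
            ));
        }
    }
    Ok(())
}
```

Because the seed is fixed, any failure is reproducible, and the error string reports the offsets and length that triggered it.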

alamb commented Sep 15, 2024

Let's get it merged

@alamb alamb merged commit b4de692 into apache:master Sep 15, 2024
26 checks passed

alamb commented Sep 15, 2024

Thanks again everyone -- this is quite cool

@kazuyukitanimura

Thank you @alamb @crepererum @andygrove @viirya @Dandandan

Labels: arrow, enhancement, performance
6 participants