Use `Arc<[Buffer]>` instead of raw `Vec<Buffer>` in `GenericByteViewArray` for faster `slice` #6427

ShiKaiWi · 2024-09-20T15:56:38Z

Which issue does this PR close?

Rationale for this change

In GenericByteViewArray, the buffers field is a raw Vec<Buffer>, so methods that copy the array handle, e.g. clone and slice, must allocate a new vector on the heap. Using Arc<[Buffer]> instead of the raw Vec<Buffer> avoids this allocation, because copying the field only increments a reference count.
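To make the allocation difference concrete, here is a minimal sketch that uses only the standard library. `Buffer` is a hypothetical stand-in for arrow's type, and the helper names `clone_vec`, `clone_shared` and `shares_allocation` are made up for illustration; none of this is code from the PR.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `Buffer` (assumption: a cheap handle to shared bytes).
#[derive(Clone, Debug, PartialEq)]
pub struct Buffer(pub Arc<[u8]>);

/// Copying a `Vec<Buffer>` allocates a new backing array and clones every element.
pub fn clone_vec(buffers: &[Buffer]) -> Vec<Buffer> {
    buffers.to_vec()
}

/// Copying an `Arc<[Buffer]>` only bumps a reference count; both handles share one allocation.
pub fn clone_shared(buffers: &Arc<[Buffer]>) -> Arc<[Buffer]> {
    Arc::clone(buffers)
}

/// Returns true when both handles point at the same slice allocation.
pub fn shares_allocation(a: &Arc<[Buffer]>, b: &Arc<[Buffer]>) -> bool {
    Arc::ptr_eq(a, b)
}
```

With a `Vec`, the copy has its own backing storage; with `Arc<[Buffer]>`, `shares_allocation` reports that the original and the copy are the same allocation.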

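The newly added slice benchmark shows the improvement: view types slice is about 15% faster, gc view types slice half is unchanged, and gc view types all is about 4% slower in this run: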

gc view types all       time:   [687.92 µs 694.54 µs 706.74 µs]
                        change: [+2.9773% +4.3614% +6.3877%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

gc view types slice half
                        time:   [322.61 µs 325.05 µs 327.29 µs]
                        change: [-0.7292% +0.5853% +1.7277%] (p = 0.37 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  14 (14.00%) high mild
  2 (2.00%) high severe

view types slice        time:   [146.78 ns 147.03 ns 147.24 ns]
                        change: [-15.424% -15.106% -14.772%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  12 (12.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

What changes are included in this PR?

Use Arc<[Buffer]> instead of the raw Vec<Buffer> as the type of the buffers field of GenericByteViewArray.

Are there any user-facing changes?

The signature of the method GenericByteViewArray::new_unchecked is changed from:

pub unsafe fn new_unchecked(
    views: ScalarBuffer<u128>,
    buffers: Vec<Buffer>,
    nulls: Option<NullBuffer>,
) -> Self;

to

pub unsafe fn new_unchecked(
    views: ScalarBuffer<u128>,
    buffers: impl Into<Arc<[Buffer]>>,
    nulls: Option<NullBuffer>,
) -> Self;

However, any usage of this method before this PR should still work without any modification.
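As a rough illustration of why most existing call sites keep compiling, the sketch below models only the buffers parameter. `ViewArray` and `Buffer` are hypothetical stand-ins, not the arrow types; the point is that both a `Vec<Buffer>` and an existing `Arc<[Buffer]>` satisfy `impl Into<Arc<[Buffer]>>`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `Buffer`.
#[derive(Clone, Debug, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// Hypothetical stand-in for the array type; only the `buffers` field is modelled.
pub struct ViewArray {
    buffers: Arc<[Buffer]>,
}

impl ViewArray {
    /// Accepts anything convertible into a shared slice of buffers.
    pub fn new(buffers: impl Into<Arc<[Buffer]>>) -> Self {
        Self {
            buffers: buffers.into(),
        }
    }

    /// Borrow the buffers as a plain slice.
    pub fn buffers(&self) -> &[Buffer] {
        &self.buffers
    }
}
```

Passing a `Vec<Buffer>` converts it once into a shared slice, while passing an `Arc<[Buffer]>` reuses the existing allocation.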

tustvold · 2024-09-20T21:25:22Z

Unfortunately the use of impl is still a breaking change as it could impact type inference, e.g. if collecting an iterator into the argument
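A small standard-library sketch of the inference problem described here. The functions `takes_vec` and `takes_into` are hypothetical and use `u32` elements instead of `Buffer` to keep the example short.

```rust
use std::sync::Arc;

/// Old-style signature: the concrete parameter type drives inference at the call site.
pub fn takes_vec(buffers: Vec<u32>) -> usize {
    buffers.len()
}

/// New-style signature: many collection types implement `Into<Arc<[u32]>>`.
pub fn takes_into(buffers: impl Into<Arc<[u32]>>) -> usize {
    let shared: Arc<[u32]> = buffers.into();
    shared.len()
}

pub fn old_call_site() -> usize {
    // Compiles: `collect()` infers `Vec<u32>` from the parameter type.
    takes_vec((0..3).collect())
}

pub fn new_call_site() -> usize {
    // `takes_into((0..3).collect())` is rejected with a "type annotations needed"
    // error, because the compiler can no longer pick a single collection type.
    // The call site has to name the type explicitly:
    takes_into((0..3).collect::<Vec<u32>>())
}
```

Call sites that pass an already typed `Vec` are unaffected; only those that relied on the parameter type to drive inference need an annotation.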

tustvold · 2024-09-20T21:26:34Z

arrow-array/src/array/byte_view_array.rs

@@ -234,7 +234,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
    }

    /// Deconstruct this array into its constituent parts
-    pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+    pub fn into_parts(self) -> (ScalarBuffer<u128>, Arc<[Buffer]>, Option<NullBuffer>) {


This is also a breaking change
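For downstream code, the change in return type mostly matters where the returned buffers were treated as an owned, growable Vec. A hedged sketch of one possible migration, with a hypothetical `Buffer` stand-in and a made-up helper `append_buffer`:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `Buffer`.
#[derive(Clone, Debug, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// A shared slice cannot be grown in place, so a caller that used to push onto
/// the returned `Vec<Buffer>` now has to copy into an owned Vec first.
pub fn append_buffer(buffers: Arc<[Buffer]>, extra: Buffer) -> Arc<[Buffer]> {
    let mut owned: Vec<Buffer> = buffers.to_vec();
    owned.push(extra);
    owned.into()
}
```

The original shared slice is left untouched; the result is a new allocation holding one more element.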

alamb · 2024-09-20T22:33:47Z

FWIW the reason "breaking change" is important is that it restricts when we can merge this PR:

https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes

findepi · 2024-10-12T19:59:37Z

arrow-array/src/array/byte_view_array.rs

@@ -114,7 +114,7 @@ use super::ByteArrayType;
 pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
    data_type: DataType,
    views: ScalarBuffer<u128>,
-    buffers: Vec<Buffer>,
+    buffers: Arc<[Buffer]>,


What's the rationale for Arc<[Buffer]> vs Vec<Arc<Buffer>>?

Dandandan · 2024-10-13

Cloning an Arc is relatively cheap (no allocation), cloning a Vec isn't.

findepi · 2024-10-13

i get it. However, if i understand correctly, Arc<[Buffer]> means the buffers can be passed around and shared only when they are within a single slice, which can be limiting. For example, can i merge two arrays, combining their Arc<Buffer>s without moving or cloning the buffers?

alamb · 2024-10-15

> Can i merge two arrays, combining their Arc<Buffer>s without moving or cloning the buffers?

No -- you would have to create a new Vec<Buffer> (or some other way to get Arc<[Buffer]>)

So while there are some cases where new allocations are required, slicing / cloning is faster

findepi · 2024-10-15

> you would have to create a new Vec<Buffer>

but that would prevent buffer sharing between two arrays, right?

> slicing / cloning is faster

cloning yes

slicing -- i didn't see it

alamb · 2024-10-15

I think the observation is that during StringViewArray::slice, the slice itself only happens on the views -- the list of buffers (that the views can point at) must still be copied

Here is the clone of buffers: https://docs.rs/arrow-array/53.1.0/src/arrow_array/array/byte_view_array.rs.html#385
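On the question of merging, a minimal sketch may help separate the two levels of sharing. It assumes, as a simplification, that `Buffer` is a cheaply clonable, reference-counted handle (the stand-in type below models that with an inner `Arc`); `merge_buffers` is a hypothetical helper, not an arrow API. The outer list is newly allocated, while the bytes behind each buffer remain shared.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `Buffer`: a cheap handle to shared bytes.
#[derive(Clone, Debug)]
pub struct Buffer(pub Arc<[u8]>);

impl Buffer {
    /// True when both handles refer to the same underlying allocation.
    pub fn ptr_eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

/// Concatenate two buffer lists into a newly allocated `Arc<[Buffer]>`.
/// Each element is cloned, which only copies the handle, not the bytes.
pub fn merge_buffers(a: &Arc<[Buffer]>, b: &Arc<[Buffer]>) -> Arc<[Buffer]> {
    a.iter().chain(b.iter()).cloned().collect()
}
```

Under that assumption, merging costs one allocation for the new list plus one reference-count increment per buffer.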

tustvold · 2024-11-22T14:38:45Z

Coming back to this now, I think we could merge this, although I am curious whether there is something funny going on here that explains why this is showing up in benchmarks. Someone popped up on Discord the other day reporting a StringViewArray with ~10k buffers; this would suggest something is likely off somewhere.

I wonder if kernels are blindly concatenating identical buffers together, instead of using something like Buffer::ptr_eq to avoid a new entry for the exact same buffer allocation?
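One way the pointer-equality idea could look, as a sketch only. `Buffer` is a hypothetical stand-in whose `ptr_eq` mirrors the intent of the method named above, and `dedup_buffers` is an invented helper; a real kernel would also have to rewrite the buffer index stored in every view using the returned mapping.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `Buffer`.
#[derive(Clone, Debug)]
pub struct Buffer(pub Arc<[u8]>);

impl Buffer {
    /// True when both handles refer to the same underlying allocation.
    pub fn ptr_eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

/// Returns the deduplicated buffers plus, for each input position,
/// the index of that buffer in the deduplicated list.
pub fn dedup_buffers(inputs: &[Buffer]) -> (Vec<Buffer>, Vec<u32>) {
    let mut unique: Vec<Buffer> = Vec::new();
    let mut remap = Vec::with_capacity(inputs.len());
    for buffer in inputs {
        // Linear scan keeps the sketch short; hashing the pointer would scale better.
        let index = match unique.iter().position(|seen| seen.ptr_eq(buffer)) {
            Some(existing) => existing,
            None => {
                unique.push(buffer.clone());
                unique.len() - 1
            }
        };
        remap.push(index as u32);
    }
    (unique, remap)
}
```

Two buffers with equal contents but separate allocations stay distinct here; only handles to the very same allocation are collapsed.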

alamb · 2024-11-22T15:58:11Z

> I wonder if kernels are blindly concatenating identical buffers together, instead of using something like Buffer::ptr_eq to avoid a new entry for the exact same buffer allocation?

What was happening in DataFusion was that we had a Filter --> Coalesce chain and were thus basically calling concat on a few thousand different input RecordBatches, each with only a few rows.

However, to your point, it may well be the case that the input RecordBatches shared the same underlying buffer, so maybe the same buffer was being appended multiple times.

@XiangpengHao do you remember if you looked for this? (related to "Section 3.5: Buffer size tuning" in https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/)

XiangpengHao · 2024-11-22T16:32:02Z

> I wonder if kernels are blindly concatenating identical buffers together, instead of using something like Buffer::ptr_eq to avoid a new entry for the exact same buffer allocation?

I think so:

arrow-rs/arrow-data/src/transform/mod.rs

Lines 630 to 637 in def94a8

    
    let variadic_data_buffers = match &data_type {
        DataType::BinaryView | DataType::Utf8View => arrays
            .iter()
            .flat_map(|x| x.buffers().iter().skip(1))
            .map(Buffer::clone)
            .collect(),
        _ => vec![],
    };

> Someone popped up on discord the other day reporting a StringViewArray with ~10k buffers, this would suggest something is likely off somewhere.

I have experienced this when loading string view from Parquet. If the Parquet data has 10k buffers of string data, the string view will just hold them. Typically we should run a filter and then gc it. This is handled in DF, but users outside DF might need to do something like this: https://github.com/apache/datafusion/blob/c0ca4b4e449e07c3bcd6f3593fa31dd31ed5e0c5/datafusion/physical-plan/src/coalesce/mod.rs#L201-L221

In other words, if a StringViewArray is constructed by us, it's very unlikely to have 10k buffers, as we exponentially grow the buffers until 2MB; so 10k buffers means ~20GB of data
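The ~20GB figure can be checked with a short calculation. The sketch below assumes blocks double in size up to the 2MB cap mentioned above; the 8KB starting size is an assumption chosen for illustration, and `block_sizes` and `total_bytes` are hypothetical helpers rather than arrow functions.

```rust
/// Upper bound on block size mentioned in the discussion (2 MB).
pub const MAX_BLOCK_SIZE: u64 = 2 * 1024 * 1024;
/// Assumed starting block size, for illustration only.
pub const STARTING_BLOCK_SIZE: u64 = 8 * 1024;

/// Sizes of the first `n` blocks when each block doubles up to the cap.
pub fn block_sizes(n: usize) -> Vec<u64> {
    let mut size = STARTING_BLOCK_SIZE;
    (0..n)
        .map(|_| {
            let current = size;
            size = (size * 2).min(MAX_BLOCK_SIZE);
            current
        })
        .collect()
}

/// Total capacity in bytes of the first `n` blocks.
pub fn total_bytes(n: usize) -> u64 {
    block_sizes(n).iter().sum()
}
```

Under these assumptions only the first eight blocks are smaller than 2MB, so 10,000 blocks add up to 20,956,831,744 bytes, roughly 19.5 GiB, which matches the estimate of about 20GB.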

alamb · 2024-11-23T13:08:25Z

> In other words, if StringViewArray is constructed by us, it's very unlikely to have 10k buffers as we will exponentially grow the buffers until 2MB; so 10k buffers means ~20GB data

Good call -- @onursatici has tracked down a related (perhaps the same) issue:

StringView: Using the Interleave kernel (and potentially others) results in many repeated buffers in variadic_buffers #6780

alamb · 2024-11-23T13:09:12Z

All the discussion above notwithstanding, I think the fact that this PR improves slice speed (by avoiding allocations) for GenericByteViewArray means it is still worth merging

tustvold · 2024-11-26T10:31:32Z

arrow-array/src/array/byte_view_array.rs

@@ -178,7 +178,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
        Ok(Self {
            data_type: T::DATA_TYPE,
            views,
-            buffers,
+            buffers: buffers.into(),


I think we should take impl Into<Arc<[Buffer]>> in this method as well

tustvold · 2024-11-26T10:34:02Z

> means it is still worth merging

Agreed

What do people think about extracting a newtype ViewBuffers or similar, much like we have Fields. This would allow us to change the internal representation down the line, whilst also potentially supporting member functions, etc...
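A rough sketch of what such a newtype could look like, modelled loosely on how Fields wraps a shared slice. Everything here is hypothetical: `ViewBuffers` does not exist yet, `Buffer` is a stand-in type, and `total_len` is just an example of a member function the wrapper would make possible.

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `Buffer`.
#[derive(Clone, Debug, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// Hypothetical newtype that hides the internal representation of the buffer list.
#[derive(Clone, Debug)]
pub struct ViewBuffers(Arc<[Buffer]>);

impl From<Vec<Buffer>> for ViewBuffers {
    fn from(buffers: Vec<Buffer>) -> Self {
        Self(buffers.into())
    }
}

impl FromIterator<Buffer> for ViewBuffers {
    fn from_iter<I: IntoIterator<Item = Buffer>>(iter: I) -> Self {
        Self(iter.into_iter().collect())
    }
}

impl Deref for ViewBuffers {
    type Target = [Buffer];

    fn deref(&self) -> &[Buffer] {
        &self.0
    }
}

impl ViewBuffers {
    /// Example member function: total number of bytes across all buffers.
    pub fn total_len(&self) -> usize {
        self.iter().map(|b| b.0.len()).sum()
    }
}
```

Because callers only see `ViewBuffers` and its slice view, the inner `Arc<[Buffer]>` could later be swapped for another representation without another signature change.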

alamb · 2024-11-26T11:18:21Z

> > means it is still worth merging
>
> Agreed
>
> What do people think about extracting a newtype ViewBuffers or similar, much like we have Fields. This would allow us to change the internal representation down the line, whilst also potentially supporting member functions, etc...

I think it would be a great idea. @ShiKaiWi is that something you would be willing to try?

ShiKaiWi · 2024-12-10T08:19:53Z

> > > means it is still worth merging
> >
> > Agreed
> >
> > What do people think about extracting a newtype ViewBuffers or similar, much like we have Fields. This would allow us to change the internal representation down the line, whilst also potentially supporting member functions, etc...
>
> I think it would be a great idea. @ShiKaiWi is that something you would be willing to try?

Sorry for the late reply (I have been a bit busy with work over the last few months). I'd love to try it this weekend.

alamb · 2024-12-10T19:04:11Z

> Sorry for the late reply (I was a bit busy with work in the last few months). I'd love to try it on this weekend.

Thanks @ShiKaiWi -- I don't think there is any huge rush to get this feature in, so it would be great if you could do so ❤️

alamb · 2024-12-16T18:33:43Z

Converting to draft as I think there is some new planned work here.

Use Arc<[Buffer]> instead of raw Vec<Buffer> in `GenericByteViewArray` for faster `slice`

18e3113

github-actions bot added the arrow Changes to the arrow crate label Sep 20, 2024

add benchmark case about view array slice

33eae4b

ShiKaiWi marked this pull request as ready for review September 20, 2024 16:38

ShiKaiWi changed the title ~~Use Arc<[Buffer]> instead of raw Vec<Buffer> in `GenericByteViewA…~~ Use Arc<[Buffer]> instead of raw Vec<Buffer> in GenericByteViewArray for faster slice Sep 20, 2024

ShiKaiWi changed the title ~~Use Arc<[Buffer]> instead of raw Vec<Buffer> in GenericByteViewArray for faster slice~~ Use Arc<[Buffer]> instead of raw Vec<Buffer> in GenericByteViewArray for faster slice Sep 20, 2024

tustvold added the api-change Changes to the arrow API label Sep 20, 2024

tustvold added the next-major-release the PR has API changes and it waiting on the next major version label Sep 20, 2024

tustvold reviewed Sep 20, 2024

findepi reviewed Oct 12, 2024

alamb mentioned this pull request Nov 23, 2024

StringView: Using the Interleave kernel (and potentially others) results in many repeated buffers in variadic_buffers #6780

Open

tustvold reviewed Nov 26, 2024

alamb marked this pull request as draft December 16, 2024 18:33
