Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make StringViewArray::slice() and BinaryViewArray::slice() faster / non allocating #6408

Open
Tracked by #11752
alamb opened this issue Sep 17, 2024 · 12 comments · May be fixed by #6427
Open
Tracked by #11752

Make StringViewArray::slice() and BinaryViewArray::slice() faster / non allocating #6408

alamb opened this issue Sep 17, 2024 · 12 comments · May be fixed by #6427
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog help wanted

Comments

@alamb
Copy link
Contributor

alamb commented Sep 17, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While working on an upstream project apache/datafusion#12092 which switched DataFusion to use StringViewArray rather than StringArray

When I did, one of the queries got much slower.

Profiling revealed that the time difference was almost entirely explained by the time spent in StringViewArray::slice()

Screenshot 2024-09-16 at 4 44 50 PM

Here is the flamegraph with StringArray:
flamegraph-main

Here is the same query with StringViewArray:
flamegraph-string-view

I am pretty sure the additional time is due to the time spent allocating / copying / deallocating the Vecs of buffers here:

Where calling slice on a StringArray can be done with a few Arc increments.

data_type: DataType,
value_offsets: OffsetBuffer<T::Offset>,
value_data: Buffer,
nulls: Option<NullBuffer>,

Describe the solution you'd like
I would like StringViewArray::slice to be faster (aka don't allocate)

Describe alternatives you've considered

We can (and probably should) change DataFusion not to use slice in this case (see apache/datafusion#6906, apache/datafusion#6906 (comment) specifically) but I think making slice faster / non allocating for StringViewArray will be useful in general

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Sep 17, 2024
@alamb alamb changed the title Make StringViewArray::slice() and BinaryViewArray::slice() faster Make StringViewArray::slice() and BinaryViewArray::slice() faster / non allocating Sep 17, 2024
@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

fyi @a10y and @XiangpengHao

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

On solution here would be to replace Vec<Buffer> with Arc<[Buffer]> after construction -- that would make managing the buffers on slice extremely cheap

@XiangpengHao
Copy link
Contributor

On solution here would be to replace Vec with Arc<[Buffer]> after construction

I agree, this should fix the problems. In fact, many other operations (e.g., take) also need to clone the Vec, changing to Arc<[Buffer]> will benefit them as well.

However, I would be a bit more careful about why cloning the buffer taking so long -- often indicating the Vec is large, which often means gc is not being called timely. So in addition to changing to Arc<[Buffer]>, I would also examine the plan to make sure gc (in CoalesceBatchesExec) is being invoked properly. cc @WetABQ who might be interested in this discussion.

I plan to work on this later this week but anyone else feel free to run faster than me!

@XiangpengHao
Copy link
Contributor

replace Vec with Arc<[Buffer]> after construction -- that would make managing the buffers on slice extremely cheap

But that also means an extra indirection to read buffers, not sure if it matters

@Rachelint
Copy link
Contributor

Rachelint commented Sep 17, 2024

The slice maybe can't be eliminated until the epic apache/datafusion#7065 is finished...
Because it will still be called here https://github.com/apache/datafusion/blob/a08f923c2acb1a46614970231d9a672c36ce3ad2/datafusion/physical-plan/src/aggregates/row_hash.rs#L713

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

But that also means an extra indirection to read buffers, not sure if it matters

I think an Arc<Vec<Buffer>> would add another indirection. I was thinking/hoping that Arc<[Buffer]> would only still have one indirection

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

The slice maybe can't be eliminated until the epic apache/datafusion#7065 is finished...

FWIW the slice that I looked at in apache/datafusion#12092 (comment) is a different one:

https://github.com/apache/datafusion/blob/a08f923c2acb1a46614970231d9a672c36ce3ad2/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs#L435-L438

(This is called once for each distinct group in each batch being aggregates, which is quite bad -- the better way to solve this is to implement a Min/Max accumulator for strings that avoids slicing at all, which we are tracking in apache/datafusion#6906)

I think the fact that slice is used many different places makes it all the more important to optimize in arrow-rs

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

However, I would be a bit more careful about why cloning the buffer taking so long -- often indicating the Vec is large, which often means gc is not being called timely

This is a good point

@Rachelint
Copy link
Contributor

The slice maybe can't be eliminated until the epic apache/datafusion#7065 is finished...

FWIW the slice that I looked at in apache/datafusion#12092 (comment) is a different one:

https://github.com/apache/datafusion/blob/a08f923c2acb1a46614970231d9a672c36ce3ad2/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs#L435-L438

(This is called once for each distinct group in each batch being aggregates, which is quite bad -- the better way to solve this is to implement a Min/Max accumulator for strings that avoids slicing at all, which we are tracking in apache/datafusion#6906)

I think the fact that slice is used many different places makes it all the more important to optimize in arrow-rs

Yes, maybe it should be ensured to be a cheap operations, it is used in many many places...

@ShiKaiWi
Copy link
Member

ShiKaiWi commented Sep 20, 2024

@alamb @XiangpengHao Could assign this ticket to me? I guess I can make it after understanding the proposal in the discussion.

BTW, I find there is no benchmark case for the slice method, and I guess the benchmark case is necessary to prove the proposal implementation.

@XiangpengHao
Copy link
Contributor

You can self assign by commenting "take"

I think adding benchmark is the right first step! Looking forward to it!

@alamb
Copy link
Contributor Author

alamb commented Sep 20, 2024

Tahank you @ShiKaiWi -- can't wait to check it out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog help wanted
Projects
None yet
4 participants