Improve performance of SUBSTR for StringViewArray #12031

Closed
Tracked by #11752
alamb opened this issue Aug 16, 2024 · 3 comments · Fixed by #12044

alamb (Contributor) commented on Aug 16, 2024

Is your feature request related to a problem or challenge?

In https://github.com/apache/datafusion/pull/12019/files @dmitrybugakov added support for StringViewArray in the substr function ❤️

However, the initial implementation returns a StringArray when the input is a StringViewArray, which means every string is copied.

In some functions, such as substr, this extra copy is unnecessary: only the views (the 128-bit values that act as pointers to the string data) need to change. See GenericByteViewArray for more details.
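To make the "views" idea concrete, here is a minimal std-only sketch of the 16-byte view layout described in the Arrow columnar format. It does not use arrow-rs; the function names (`inline_view`, `buffer_view`, `view_len`, and so on) are hypothetical and only illustrate how a view encodes either a short inline string or a pointer into a data buffer.

```rust
/// Strings of at most 12 bytes are stored directly inside the view.
const MAX_INLINE_LEN: usize = 12;

/// View for a short string: 4-byte length, then the bytes, zero padded.
fn inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= MAX_INLINE_LEN);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// View for a longer string: 4-byte length, 4-byte prefix,
/// 4-byte data buffer index, 4-byte offset into that buffer.
fn buffer_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    assert!(data.len() > MAX_INLINE_LEN);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..8].copy_from_slice(&data[..4]);
    bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
    bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    u128::from_le_bytes(bytes)
}

/// The length lives in the low 32 bits of the view.
fn view_len(view: u128) -> u32 {
    view as u32
}

/// True when the string bytes are stored in the view itself.
fn is_inline(view: u128) -> bool {
    view_len(view) as usize <= MAX_INLINE_LEN
}

/// Only meaningful when `is_inline` is false.
fn buffer_index(view: u128) -> u32 {
    (view >> 64) as u32
}

/// Only meaningful when `is_inline` is false.
fn offset(view: u128) -> u32 {
    (view >> 96) as u32
}
```

Because a long string's view only records a length, a prefix, a buffer index, and an offset, taking a substring of it can in principle be done by writing a new view that points into the same buffer.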

Describe the solution you'd like

I think we can avoid the copy when the input is a StringViewArray and thus make substr faster.

Describe alternatives you've considered

The idea would be to

  1. Create a benchmark for the substring function for StringArray, LargeStringArray and StringViewArray
  2. Optimize the implementation of substr

The optimization would likely look like:

  1. Change the signature of substr so it produces a StringViewArray when its first argument is a StringViewArray (at the moment it produces a StringArray in that case)
  2. Write a function that takes a StringViewArray as input and produces another StringViewArray as output

Additional context

Here is an example benchmark: #12015

Here is the builder used to create StringViews: StringViewBuilder https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html

alamb added the enhancement (New feature or request) label on Aug 16, 2024

alamb (Contributor, Author) commented on Aug 16, 2024

@XiangpengHao can you remember any example functions / kernels in arrow-rs that only manipulate the views in this way, which we could point to as an example for this function?

XiangpengHao (Contributor) commented on Aug 16, 2024

can you remember any example functions / kernels in arrow-rs that only manipulate the views in this way

I don't have an exact match, but the take kernel is a good candidate: https://github.com/apache/arrow-rs/blob/042d725888358c73cd2a0d58868ea5c4bad778f7/arrow-select/src/take.rs#L481-L491
It basically breaks the string view array into smaller pieces and then reassembles them.

Alternatively, we can create a builder, use append_block to add every block from the old array (array.data_buffers()) to the builder, and then iterate over the views: if the new length is at most 12 bytes, we can call append_value directly; if it is larger than 12 bytes, we call append_view_unchecked. (It is simpler than it looks.)
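The builder approach above can be sketched without arrow-rs as a std-only model. Everything here is hypothetical (`ViewArray`, `substr_views`, `inline_view`, `block_view`, `value`) and only mirrors the shape of the idea: reuse the data blocks, inline results of at most 12 bytes, and otherwise emit a new view into the existing block. It works on byte offsets, so it assumes ASCII input; a real substr has to count UTF-8 characters.

```rust
use std::sync::Arc;

const MAX_INLINE_LEN: usize = 12;

/// Hypothetical stand-in for a string view array:
/// shared data blocks plus one 16-byte view per string.
struct ViewArray {
    blocks: Vec<Arc<[u8]>>,
    views: Vec<u128>,
}

fn read_u32(bytes: &[u8; 16], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Short result: copy the bytes into the view itself.
fn inline_view(data: &[u8]) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Long result: point at `len` bytes starting at `offset` in an existing block.
fn block_view(block: &[u8], buffer_index: u32, offset: u32, len: u32) -> u128 {
    let start = offset as usize;
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&len.to_le_bytes());
    bytes[4..8].copy_from_slice(&block[start..start + 4]); // 4-byte prefix
    bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
    bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    u128::from_le_bytes(bytes)
}

/// Byte-based substring (0-based `start`) that rewrites views only.
fn substr_views(input: &ViewArray, start: usize, len: usize) -> ViewArray {
    // Cloning the Arc handles shares the blocks; no string bytes are copied.
    let blocks = input.blocks.clone();
    let views = input
        .views
        .iter()
        .map(|&view| {
            let bytes = view.to_le_bytes();
            let total = read_u32(&bytes, 0) as usize;
            let s = start.min(total);
            let e = s.saturating_add(len).min(total);
            if total <= MAX_INLINE_LEN {
                // Input was inline, so the result is inline too.
                inline_view(&bytes[4 + s..4 + e])
            } else {
                let buffer_index = read_u32(&bytes, 8);
                let offset = read_u32(&bytes, 12) as usize;
                let block = &blocks[buffer_index as usize];
                if e - s <= MAX_INLINE_LEN {
                    inline_view(&block[offset + s..offset + e])
                } else {
                    block_view(block, buffer_index, (offset + s) as u32, (e - s) as u32)
                }
            }
        })
        .collect();
    ViewArray { blocks, views }
}

/// Read back the bytes a view refers to.
fn value(array: &ViewArray, i: usize) -> Vec<u8> {
    let bytes = array.views[i].to_le_bytes();
    let len = read_u32(&bytes, 0) as usize;
    if len <= MAX_INLINE_LEN {
        bytes[4..4 + len].to_vec()
    } else {
        let block = &array.blocks[read_u32(&bytes, 8) as usize];
        let offset = read_u32(&bytes, 12) as usize;
        block[offset..offset + len].to_vec()
    }
}
```

In this model the only per-row work is computing a new 16-byte view; the output keeps the input blocks alive, which is the trade-off of avoiding the copy.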

Kev1n8 (Contributor) commented on Aug 16, 2024

take
