The initial implementation of StringViewArray support in substr returns a StringArray when the input is a StringViewArray, which means all the strings are copied.
In some functions, such as substr, this extra copy is unnecessary: only the views (aka the u128s that make up the pointers) need to be adjusted. See GenericByteViewArray for more details.
Describe the solution you'd like
I think we can avoid the copy when the input uses StringViewArray and thus make substr faster
Describe alternatives you've considered
The idea would be to:
1. Create a benchmark for the substring function for StringArray, LargeStringArray and StringViewArray
2. Optimize the implementation of substr
The optimization would likely look like:
1. Change the signature of substr so it produces a StringViewArray when its first argument is a StringViewArray (at the moment it produces a StringArray in that case)
2. Add a function that takes a StringViewArray as input and produces another StringViewArray as output
@XiangpengHao can you remember any example functions / kernels in arrow-rs that only manipulate the views in this way that we can point to as an example for this function?
Alternatively, we can create a builder, use append_block to add all blocks from the old array (array.data_buffers()) to the builder, and then iterate over the views. If the length is at most 12 bytes, we can call append_value directly; if it is longer than 12 bytes, we call append_view_unchecked. (It is simpler than it looks.)
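To make the view-only idea concrete, here is a minimal sketch that uses only the Rust standard library. It is a simplified model of the Arrow view layout, not the arrow-rs API. It assumes a 4-byte length followed by either up to 12 inline bytes, or a 4-byte prefix, a 4-byte buffer index and a 4-byte offset. The names make_view, resolve, substr_view and view_to_string are hypothetical helpers invented for this illustration, and start is a 0-based character position rather than SQL's 1-based one.

```rust
/// Strings of at most this many bytes are stored inline in the view.
const MAX_INLINE: usize = 12;

/// Read a little-endian u32 starting at byte `at` of a view.
fn read_u32(bytes: &[u8; 16], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Build a view for `data`. For long strings, `data` is assumed to live at
/// `offset` inside buffer `buffer_index`; short strings are inlined instead.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]); // 4-byte prefix
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Find the bytes a view refers to, plus its buffer index and offset
/// (both reported as 0 for inline views).
fn resolve<'a>(bytes: &'a [u8; 16], buffers: &'a [Vec<u8>]) -> (&'a [u8], u32, u32) {
    let len = read_u32(bytes, 0) as usize;
    if len <= MAX_INLINE {
        (&bytes[4..4 + len], 0, 0)
    } else {
        let (idx, off) = (read_u32(bytes, 8), read_u32(bytes, 12));
        let start = off as usize;
        (&buffers[idx as usize][start..start + len], idx, off)
    }
}

/// Produce a view for `count` characters starting at character `start`.
/// The data buffers are only read, never copied or modified.
fn substr_view(view: u128, buffers: &[Vec<u8>], start: usize, count: usize) -> u128 {
    let bytes = view.to_le_bytes();
    let (src, idx, off) = resolve(&bytes, buffers);
    let text = std::str::from_utf8(src).expect("views must hold valid UTF-8");
    // Translate character positions into a byte range.
    let byte_start = text.char_indices().nth(start).map_or(text.len(), |(i, _)| i);
    let byte_end = text[byte_start..]
        .char_indices()
        .nth(count)
        .map_or(text.len(), |(i, _)| byte_start + i);
    // Results of at most 12 bytes are inlined; longer ones reuse the old buffer.
    make_view(&src[byte_start..byte_end], idx, off + byte_start as u32)
}

/// Materialize a view as a String (used here only to inspect results).
fn view_to_string(view: u128, buffers: &[Vec<u8>]) -> String {
    let bytes = view.to_le_bytes();
    let (data, _, _) = resolve(&bytes, buffers);
    String::from_utf8_lossy(data).into_owned()
}

fn main() {
    let buffers = vec![b"hello, string view world".to_vec()];
    let view = make_view(&buffers[0], 0, 0);
    let long = substr_view(view, &buffers, 7, 17);
    let short = substr_view(view, &buffers, 7, 6);
    println!("{}", view_to_string(long, &buffers));
    println!("{}", view_to_string(short, &buffers));
}
```

In this model the long result still points into the original buffer at a shifted offset, while the short result is stored entirely inside the view. This is the same split the builder approach makes between append_view_unchecked and append_value.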
Is your feature request related to a problem or challenge?
In https://github.com/apache/datafusion/pull/12019/files @dmitrybugakov added support for StringViewArray in the substr function ❤️
Additional context
Here is an example benchmark: #12015
Here is the code used to create StringViews: StringViewBuilder https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html