
ASCII fast path for some String scalar functions #12306

Open
2 of 6 tasks
2010YOUY01 opened this issue Sep 3, 2024 · 2 comments
Labels
enhancement New feature or request

Comments


2010YOUY01 commented Sep 3, 2024

Is your feature request related to a problem or challenge?

String operations on UTF-8 data are relatively expensive because UTF-8 is a variable-length encoding: each character can be encoded with 1–4 bytes.

For example, the in-memory representation of the UTF-8 string "Hello🌏世界" is (each x is one byte):

[x][x][x][x][x][xxxx][xxx][xxx]

Some seemingly cheap operations like substr(utf8_col, i, j) and character_length(utf8_col) must actually decode the whole string instead of doing an O(1) operation. If we can assume a string column batch is ASCII-only, then those operations are indeed cheap.
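To make the cost concrete, here is a small standalone Rust snippet (not DataFusion code) showing why character counts and character-based substrings require decoding under UTF-8, while ASCII-only strings allow direct byte indexing:

```rust
fn main() {
    let s = "Hello🌏世界";
    // Variable-length UTF-8: byte length and character count differ.
    assert_eq!(s.len(), 15); // 5 one-byte + 1 four-byte + 2 three-byte chars
    assert_eq!(s.chars().count(), 8);
    // A character-based substr must walk the string from the start: O(n).
    let sub: String = s.chars().skip(5).take(2).collect();
    assert_eq!(sub, "🌏世");
    // For ASCII-only data, bytes == characters, so slicing is O(1).
    let a = "Hello";
    assert!(a.is_ascii());
    assert_eq!(&a[1..4], "ell");
}
```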

However, in practice:

  • Much data is ASCII-encoded (the 1-byte subset of UTF-8), which covers the most common English characters, digits, etc.
  • Validating that a string array is ASCII-only is fast:
    • The validation is compiler/CPU friendly and can run at roughly memory bandwidth
    • It can be done in batch, once per string array

So those functions can first do the check: if the string array is ASCII-only, run the specialized path. The ASCII validation overhead should be well worth the performance gain in the common case.
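The validate-then-branch idea can be sketched as follows. This is a minimal illustration over `&[&str]` rather than the actual Arrow array types, with a hypothetical `char_lengths` helper:

```rust
// Minimal sketch (not the DataFusion implementation): validate once per
// batch, then pick the fast or general path for character_length.
fn char_lengths(batch: &[&str]) -> Vec<usize> {
    // One linear pass over the bytes; str::is_ascii is very cheap.
    let all_ascii = batch.iter().all(|s| s.is_ascii());
    if all_ascii {
        // Fast path: for ASCII, byte length == character count.
        batch.iter().map(|s| s.len()).collect()
    } else {
        // General path: decode UTF-8 to count characters.
        batch.iter().map(|s| s.chars().count()).collect()
    }
}
```

Note the check is amortized over the whole batch, so the per-row cost of validation is small compared to the savings on the fast path.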

This is a common trick that has been implemented in Velox and Photon, as their papers mention.
Below are the numbers from Velox:
[figure: Velox benchmark numbers for the ASCII fast path]

I did a quick experiment on the character_length() / substr() scalar functions and got a speedup for ASCII cases; the validation overhead is very small.
substr() can get another 80% faster on top of #12044, for some microbenchmarks with 128-byte strings.

Update

This has been done for the char_length() string function, with some performance improvement: #12356

The remaining tasks:

Describe the solution you'd like

For scalar functions amenable to ASCII specialization, first validate within the function implementation whether the String array is ASCII-only; if so, enable the fast path.
Functions that could be sped up: character_length(), substr(), lower(), upper()
(and maybe more, such as regex functions; this needs further investigation)
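For lower() and upper(), the fast path works because ASCII case mapping is a byte-wise operation, whereas full Unicode case mapping may even change the string's length. A hypothetical sketch (again over plain slices, not the Arrow types):

```rust
// Hypothetical sketch of an ASCII fast path for lower().
fn lower_batch(batch: &[&str]) -> Vec<String> {
    if batch.iter().all(|s| s.is_ascii()) {
        // Fast path: ASCII case mapping is a simple per-byte transform.
        batch.iter().map(|s| s.to_ascii_lowercase()).collect()
    } else {
        // General path: full Unicode case mapping (locale-independent).
        batch.iter().map(|s| s.to_lowercase()).collect()
    }
}
```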

Describe alternatives you've considered

Add an option letting users specify that a column is fully ASCII.
Since the always-validate approach is easier to use and not that expensive, we can leave this for the future.

Additional context

No response

@2010YOUY01 2010YOUY01 added the enhancement New feature or request label Sep 3, 2024
2010YOUY01 (author):

I plan to first do a PR with character_length() function


alamb commented Sep 6, 2024

#12356 is quite cool 🚀

Add an option to let users to specify whether a column is fully ASCII
Since the always-validate approach is easier to use, and not so expensive, we can leave this to the future

I strongly suspect that this is what Velox / Photon do (i.e., mark which batches are ASCII-only on creation and then propagate that flag through execution).

The current implementation of arrow is_ascii examines all the bytes in the array:
https://docs.rs/arrow-array/53.0.0/src/arrow_array/array/byte_array.rs.html#262

One thing we might consider doing to improve performance is to add an is_ascii flag to StringArray and StringViewArray when the data is entirely ASCII, and then propagate that flag through common kernels (e.g. take / filter).

For example, once is_ascii is called on a StringArray and returns true, there is no reason to check the same data again (even after it is filtered, or substr is computed, etc.)
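The flag-propagation idea could look something like the following. This is a hypothetical sketch (the names `AsciiTaggedArray` and `ascii_flag` are invented for illustration, not the arrow-rs API): cache the ASCII check next to the data, and let kernels that preserve ASCII-ness carry the flag forward without rescanning.

```rust
use std::cell::OnceCell;

// Hypothetical sketch: cache the ASCII check alongside the values so later
// calls, and derived arrays, can skip the byte scan.
struct AsciiTaggedArray {
    values: Vec<String>,
    ascii_flag: OnceCell<bool>,
}

impl AsciiTaggedArray {
    fn new(values: Vec<String>) -> Self {
        Self { values, ascii_flag: OnceCell::new() }
    }

    // Scans the bytes at most once; subsequent calls are O(1).
    fn is_ascii(&self) -> bool {
        *self
            .ascii_flag
            .get_or_init(|| self.values.iter().all(|s| s.is_ascii()))
    }

    // filter preserves ASCII-ness: any subset of ASCII strings is still
    // ASCII, so a known `true` flag is propagated without re-checking.
    fn filter(&self, keep: &[bool]) -> AsciiTaggedArray {
        let values = self
            .values
            .iter()
            .zip(keep)
            .filter_map(|(v, &k)| k.then(|| v.clone()))
            .collect();
        let out = AsciiTaggedArray::new(values);
        if self.ascii_flag.get() == Some(&true) {
            let _ = out.ascii_flag.set(true); // only a positive result is safe to propagate
        }
        out
    }
}
```

Only a `true` flag is propagated here: a non-ASCII batch may become ASCII-only after filtering, so a cached `false` cannot be carried over.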
