Is your feature request related to a problem or challenge?
String operations on UTF-8 encoded data are relatively expensive because UTF-8 is a variable-length encoding: each character is encoded with 1~4 bytes.
For example, the in-memory representation of the UTF-8 string "Hello🌏世界" is (each x is one byte):
[x][x][x][x][x][xxxx][xxx][xxx]
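The byte layout above can be checked directly with the Rust standard library (the string literal is the example from this issue):

```rust
fn main() {
    let s = "Hello🌏世界";
    // 5 one-byte ASCII chars + one 4-byte char + two 3-byte chars = 15 bytes
    assert_eq!(s.len(), 15); // byte length, O(1)
    assert_eq!(s.chars().count(), 8); // character count, requires decoding
    // Encoded width of each character: matches the diagram above
    let widths: Vec<usize> = s.chars().map(|c| c.len_utf8()).collect();
    assert_eq!(widths, vec![1, 1, 1, 1, 1, 4, 3, 3]);
    println!("bytes={}, chars={}", s.len(), s.chars().count());
}
```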
Some seemingly cheap operations like substr(utf8_col, i, j) and character_length(utf8_col) actually have to decode the whole string instead of doing an O(1) operation. If we can assume a string column batch is ASCII-only, then those operations are indeed cheap.
However:
Much data is ASCII-encoded (the 1-byte subset of UTF-8), which covers the most common English characters, digits, etc.
Validating if a string array is ASCII-encoded is fast
The validation loop is compiler/CPU friendly and can run at close to memory bandwidth
It's possible to check in batch, for each string array
So these functions can run the check first: if the string array is ASCII-only, take the specialized path. The ASCII validation overhead should be worth the performance gain in the common case.
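The check-then-specialize pattern could be sketched as follows (a minimal standalone sketch, not DataFusion's actual implementation; the batch is modeled as a plain slice rather than an Arrow array):

```rust
// Validate the whole batch once, then dispatch to the fast or general path.
fn character_length(values: &[&str]) -> Vec<usize> {
    // str::is_ascii is a simple byte scan; it vectorizes well and can run
    // close to memory bandwidth.
    let all_ascii = values.iter().all(|s| s.is_ascii());
    values
        .iter()
        .map(|s| {
            if all_ascii {
                s.len() // ASCII fast path: 1 byte per character
            } else {
                s.chars().count() // general path: full UTF-8 decode
            }
        })
        .collect()
}

fn main() {
    assert_eq!(character_length(&["Hello", "abc"]), vec![5, 3]);
    assert_eq!(character_length(&["Hello🌏世界"]), vec![8]);
    println!("ok");
}
```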
This is a common trick that has been implemented in Velox and Photon, as their papers mention.
Below are the numbers from Velox:
I did a quick experiment on the character_length()/substr() scalar functions and got some speedup for ASCII cases; the validation overhead is very small. substr() can get another 80% faster on top of #12044, for some microbenchmarks with string length 128B.
Update
This has been done for the char_length() string function with some performance improvement (#12356).
Investigate whether other string operations can also be optimized with an ASCII fast path.
Describe the solution you'd like
For scalar functions amenable to ASCII specialization, the function implementation first validates whether the string array is ASCII-only; if so, it enables the fast path.
Functions possible to speed up: character_length(), substr(), lower(), upper()
(And maybe some more like regex functions, need some further investigation)
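As an illustration of why substr() benefits: on ASCII data, character positions equal byte positions, so the slice can be taken in O(1). A hypothetical sketch (0-indexed for simplicity, unlike the SQL substr(), which is 1-based; these helper names are invented here):

```rust
// ASCII fast path: character index == byte index, so slicing is O(1).
fn substr_ascii(s: &str, start: usize, len: usize) -> &str {
    debug_assert!(s.is_ascii());
    let begin = start.min(s.len());
    let end = (start + len).min(s.len());
    &s[begin..end]
}

// General path: must decode to find character boundaries, O(n).
fn substr_utf8(s: &str, start: usize, len: usize) -> String {
    s.chars().skip(start).take(len).collect()
}

fn main() {
    assert_eq!(substr_ascii("Hello world", 6, 5), "world");
    assert_eq!(substr_utf8("Hello🌏世界", 5, 3), "🌏世界");
    println!("ok");
}
```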
Describe alternatives you've considered
Add an option letting users specify that a column is fully ASCII.
Since the always-validate approach is easier to use and not very expensive, we can leave this for the future.
Additional context
No response
Add an option letting users specify that a column is fully ASCII.
Since the always-validate approach is easier to use and not very expensive, we can leave this for the future.
I strongly suspect this is what Velox / Photon do (i.e. mark which batches are ASCII-only on creation and then propagate that flag through execution).
One thing we might consider doing to improve performance is to add an is_ascii flag to StringArray and StringViewArray when the data is entirely ASCII, and then propagate that flag through common kernels (e.g. take / filter).
For example, once is_ascii is called on a StringArray and returns true, there is no reason to check the same data again (even if it is filtered or substr is calculated, etc.)
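The flag-propagation idea could be sketched like this (a hypothetical standalone sketch, not the actual arrow-rs StringArray API; StringBatch, its fields, and the filter kernel are invented for illustration). The key property is that a subset of all-ASCII data is still all-ASCII, so row-removing kernels can forward the flag instead of re-scanning:

```rust
use std::sync::OnceLock;

// Cache the result of the ASCII scan alongside the data.
struct StringBatch {
    values: Vec<String>,
    is_ascii: OnceLock<bool>,
}

impl StringBatch {
    fn new(values: Vec<String>) -> Self {
        Self { values, is_ascii: OnceLock::new() }
    }

    // Scans at most once; later calls are a cheap load.
    fn is_ascii(&self) -> bool {
        *self.is_ascii.get_or_init(|| self.values.iter().all(|s| s.is_ascii()))
    }

    // A filter kernel that propagates the cached flag when it is known true.
    fn filter(&self, keep: &[bool]) -> Self {
        let mut values = Vec::new();
        for (v, &k) in self.values.iter().zip(keep.iter()) {
            if k {
                values.push(v.clone());
            }
        }
        let out = Self::new(values);
        if self.is_ascii.get() == Some(&true) {
            let _ = out.is_ascii.set(true); // no re-scan needed downstream
        }
        out
    }
}

fn main() {
    let batch = StringBatch::new(vec!["foo".into(), "bar".into()]);
    assert!(batch.is_ascii()); // performs the scan once
    let filtered = batch.filter(&[true, false]);
    // The flag was carried forward without scanning again.
    assert_eq!(filtered.is_ascii.get(), Some(&true));
    println!("ok");
}
```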
The remaining tasks:
Make is_ascii() faster, as suggested in ASCII fast path for some String scalar functions #12306 (comment)
lower()/upper() string functions with ASCII fast path #12365
strpos() string function with ASCII fast path #12366
substr() string function with ASCII fast path #12367
reverse() string function with ASCII fast path #12445