Sorry for the drive-by question; I tried searching the mailing list and the existing issues but could not find any previous discussion of this. If it has already been answered, any pointers to the relevant resource(s) would be greatly appreciated.
Was it ever considered/discussed to add, likely in a new dedicated subset of this extension, instructions for encoding/decoding UTF-8 (and ideally also UTF-16 and UTF-32)? Most text processed today is in one of those encodings[1], and there is little on the horizon suggesting this status quo will change. Decoding/encoding UTF is not especially complicated without dedicated instructions (and the existing bitmanip instructions already help), but given the ubiquity of these encodings and the relative simplicity of the underlying coding process (at heart, UTF-8 and UTF-16 are simple-to-decode variable-length encodings), there may be efficiency benefits[2] to be had from dedicated support.
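To illustrate the "bitmanip already helps" point: with Zbb, the length of a UTF-8 sequence falls out of a single `clz`, since the length is the number of leading one-bits in the lead byte (with zero meaning a 1-byte ASCII sequence). A minimal sketch in C; the helper name and the use of the GCC/Clang `__builtin_clz` intrinsic as a stand-in for the `clz` instruction are mine, not from any spec:

```c
#include <stdint.h>

// Hypothetical helper: sequence length from the lead byte, via clz.
static inline int utf8_seq_len(uint8_t lead) {
    // Invert the byte, move it to the top of a 32-bit word, and count
    // leading zeros: that counts the leading ones of `lead`. The argument
    // is never 0 (the low 24 bits are all ones), so clz is well defined.
    unsigned ones = (unsigned)__builtin_clz(~((uint32_t)lead << 24));
    if (ones == 0) return 1;              // 0xxxxxxx: ASCII
    if (ones == 1 || ones > 4) return 0;  // continuation byte / invalid lead
    return (int)ones;                     // 2, 3 or 4 bytes
}
```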
Just for the sake of clarity, in its simplest form (covering only UTF-8 → codepoint decoding) this would require a single instruction that takes a 4-byte input (the maximum length of a UTF-8-encoded codepoint, likely obtained via an unaligned read from memory) and returns the decoded Unicode codepoint (21 bits, so up to 3 bytes), how many bytes of the input were consumed (between 1 and 4, inclusive), and whether the decoding encountered an error. (The need to return multiple values is probably the biggest roadblock to inclusion in the ISA, though I suspect there may be workarounds.)
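For concreteness, here is a software model of what such an instruction might compute. This is a sketch under my own assumptions: the 4 input bytes are packed little-endian into a register (first in-memory byte in the low byte), the struct stands in for the multiple-result problem mentioned above, errors consume 1 byte so a decode loop always makes progress, and all names are illustrative:

```c
#include <stdint.h>

typedef struct {
    uint32_t codepoint;  // decoded scalar value (<= 0x10FFFF)
    int      consumed;   // bytes consumed, 1..4
    int      error;      // nonzero if the input was not well-formed UTF-8
} utf8_decode_result;

utf8_decode_result utf8_decode(uint32_t word) {
    uint8_t b0 = word, b1 = word >> 8, b2 = word >> 16, b3 = word >> 24;
    // Default: error, emit U+FFFD, consume 1 byte (a common convention).
    utf8_decode_result r = { .codepoint = 0xFFFD, .consumed = 1, .error = 1 };

    if (b0 < 0x80) {                          // 1 byte: ASCII
        r.codepoint = b0; r.error = 0;
    } else if ((b0 & 0xE0) == 0xC0) {         // 2 bytes: 110xxxxx 10xxxxxx
        if ((b1 & 0xC0) == 0x80) {
            uint32_t cp = ((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
            if (cp >= 0x80) {                 // reject overlong forms
                r.codepoint = cp; r.consumed = 2; r.error = 0;
            }
        }
    } else if ((b0 & 0xF0) == 0xE0) {         // 3 bytes
        if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80) {
            uint32_t cp = ((uint32_t)(b0 & 0x0F) << 12)
                        | ((uint32_t)(b1 & 0x3F) << 6) | (b2 & 0x3F);
            // reject overlong forms and UTF-16 surrogates
            if (cp >= 0x800 && (cp < 0xD800 || cp > 0xDFFF)) {
                r.codepoint = cp; r.consumed = 3; r.error = 0;
            }
        }
    } else if ((b0 & 0xF8) == 0xF0) {         // 4 bytes
        if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80 && (b3 & 0xC0) == 0x80) {
            uint32_t cp = ((uint32_t)(b0 & 0x07) << 18)
                        | ((uint32_t)(b1 & 0x3F) << 12)
                        | ((uint32_t)(b2 & 0x3F) << 6) | (b3 & 0x3F);
            if (cp >= 0x10000 && cp <= 0x10FFFF) {  // reject overlong / > U+10FFFF
                r.codepoint = cp; r.consumed = 4; r.error = 0;
            }
        }
    }
    return r;
}
```

A scalar decode loop over a buffer then becomes: unaligned 4-byte load, one decode, advance the pointer by `consumed`, branch on `error`.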
Extensions to the simplest form could include, as hinted at above:

- the reverse direction, i.e. encoding a codepoint back to UTF-8;
- equivalent instructions for UTF-16 (and, trivially, UTF-32);
- backwards decoding[3] (see the sketch after this list).
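A minimal sketch of the backwards-decoding variant from footnote 3, under the same assumptions as above; `utf8_prev` is a hypothetical helper name. Given a pointer to the last byte of a sequence, it steps back over continuation bytes (10xxxxxx) to find the lead byte, after which the forward decode applies:

```c
#include <stdint.h>

// Hypothetical helper: given `p` pointing at the LAST byte of a UTF-8
// sequence, walk back over at most 3 continuation bytes to find the lead
// byte. Validation of the resulting sequence is left to the decoder.
static const uint8_t *utf8_prev(const uint8_t *p, const uint8_t *start) {
    const uint8_t *q = p;
    while (q > start && (p - q) < 3 && (*q & 0xC0) == 0x80)
        q--;
    return q;  // candidate lead byte
}
```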
Going further, one could even imagine an expansion (outside of this extension) to a packed-SIMD version[4] of the same operations, able to {de|en}code multiple codepoints at a time.
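As a hint at why the packed formulation is attractive, many per-buffer queries reduce to byte-wise classification that vectorizes trivially. A small scalar sketch (my example, not part of the proposal): counting codepoints by counting non-continuation bytes, since every well-formed codepoint contributes exactly one byte that is not 10xxxxxx. Each iteration is an independent compare-and-accumulate, exactly the shape that packed compares plus a popcount handle well:

```c
#include <stddef.h>
#include <stdint.h>

// Count the codepoints in a well-formed UTF-8 buffer by counting bytes
// that are not continuation bytes (i.e. not of the form 10xxxxxx).
size_t utf8_count_codepoints(const uint8_t *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += ((s[i] & 0xC0) != 0x80);
    return count;
}
```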
Footnotes
1. This includes resources with a text representation even if not exclusively meant for direct human consumption, like JSON, CSV, HTML, and other source code.
2. While the English-speaking world may historically have been fine assuming that most text would be quickly parseable in the ASCII subset of UTF-8, so the need for efficient handling of non-ASCII codepoints was lesser, this has never been true in the rest of the world.
3. I.e. the ability to decode a codepoint knowing where the last byte of the encoded representation is (instead of knowing where the first byte is); this is useful when iterating backwards over text.
4. Or even a vector version, though this would possibly require a prohibitively high gate count for any reasonable VLEN.