Improve UTF-8 decoding and encoding functions #410

Merged (1 commit) on May 21, 2024

@chqrlie (Collaborator) commented May 19, 2024

Ensure proper UTF-8 encoding (1 to 4 bytes).
Handle invalid encodings (return 0xFFFD and consume a single byte).
Individually encoded surrogate code points are accepted.
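
The decoding rule above can be illustrated with a small sketch. This is not the code from this PR: the names `sketch_utf8_encode_len` and `sketch_utf8_decode`, the `consumed` out-parameter, and the choice to return 0 for values above U+10FFFF are all assumptions made for the example. It shows one way to accept 1 to 4 byte sequences, return 0xFFFD while consuming a single byte on any invalid input, and let individually encoded surrogates through.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the PR's code): bytes needed to encode c.
   Returning 0 for c > 0x10FFFF is an assumption of this sketch. */
static size_t sketch_utf8_encode_len(uint32_t c)
{
    if (c < 0x80) return 1;
    if (c < 0x800) return 2;
    if (c < 0x10000) return 3;
    if (c < 0x110000) return 4;
    return 0;
}

/* Hypothetical decoder (not the PR's code): decodes one code point from
   p[0..len), with len >= 1. On any invalid sequence it returns 0xFFFD
   and reports a single consumed byte. Surrogates are not rejected. */
static uint32_t sketch_utf8_decode(const uint8_t *p, size_t len,
                                   size_t *consumed)
{
    uint32_t c = p[0];
    uint32_t min;   /* smallest code point allowed for this length */
    size_t n;       /* number of continuation bytes expected */

    *consumed = 1;
    if (c < 0x80)
        return c;                       /* plain ASCII */
    if (c >= 0xC2 && c <= 0xDF) {
        n = 1; c &= 0x1F; min = 0x80;
    } else if (c >= 0xE0 && c <= 0xEF) {
        n = 2; c &= 0x0F; min = 0x800;
    } else if (c >= 0xF0 && c <= 0xF4) {
        n = 3; c &= 0x07; min = 0x10000;
    } else {
        return 0xFFFD;                  /* stray continuation or bad lead byte */
    }
    if (len < n + 1)
        return 0xFFFD;                  /* truncated sequence */
    for (size_t i = 1; i <= n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0xFFFD;              /* missing continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF)
        return 0xFFFD;                  /* overlong or out of range */
    *consumed = n + 1;
    return c;
}
```

Because an error always advances by exactly one byte, a caller looping over a buffer resynchronizes on the next byte and cannot get stuck on malformed input.
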

- add `utf8_scan()` to analyze a byte array for UTF-8 contents: it detects invalid encodings, counts the code points, and reports the content kind (plain ASCII, 8-bit, 16-bit, or larger code points)
- add `utf8_encode_len(c)` to compute the number of bytes needed to encode `c`
- rename `unicode_to_utf8` to `utf8_encode`
- rename `unicode_from_utf8` to `utf8_decode`
- add `utf8_decode_buf8(dest, size, src, len)` to decode a UTF-8 encoded byte array known to contain only ASCII and 8-bit code points
- add `utf8_decode_buf16(dest, size, src, len)` to decode a UTF-8 encoded byte array into an array of 16-bit code points, using UTF-16 surrogate pairs for non-BMP code points
- add `utf8_encode_buf8(dest, size, src, len)` to encode an array of 8-bit code points as a null-terminated UTF-8 string
- add `utf16_encode_buf8(dest, size, src, len)` to encode an array of 16-bit code points (including surrogate pairs) as a null-terminated UTF-8 string
- detect invalid UTF-8 encoding in the RegExp parser
- simplify `JS_AtomGetStrRT` and `JS_NewStringLen` using the above functions
- simplify UTF-8 decoding and error testing
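
To make the scan and buffer-decoding items above more concrete, here is a rough sketch of the two ideas. It is not the implementation from this PR: every name (`sketch_next`, `sketch_utf8_scan`, `sketch_decode_buf16`, the `sketch_kind` enum, the `sketch_scan` struct) is hypothetical, and the return convention of the buffer function (number of 16-bit units the full output needs, writing at most `dest_len` of them) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Content kinds and result struct, named for illustration only. */
enum sketch_kind { SKETCH_ASCII, SKETCH_8BIT, SKETCH_16BIT, SKETCH_WIDE };

struct sketch_scan {
    enum sketch_kind kind;  /* widest kind of code point seen */
    size_t codepoints;      /* number of decoded code points */
    int has_errors;         /* nonzero if any invalid byte was found */
};

/* Minimal decoder: returns the code point and sets *adv to the bytes
   consumed; invalid input yields 0xFFFD with *adv == 1. */
static uint32_t sketch_next(const uint8_t *p, size_t len, size_t *adv)
{
    uint32_t c = p[0], min;
    size_t n;

    *adv = 1;
    if (c < 0x80) return c;
    if (c >= 0xC2 && c <= 0xDF) { n = 1; c &= 0x1F; min = 0x80; }
    else if (c >= 0xE0 && c <= 0xEF) { n = 2; c &= 0x0F; min = 0x800; }
    else if (c >= 0xF0 && c <= 0xF4) { n = 3; c &= 0x07; min = 0x10000; }
    else return 0xFFFD;
    if (len < n + 1) return 0xFFFD;
    for (size_t i = 1; i <= n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0xFFFD;
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF) return 0xFFFD;
    *adv = n + 1;
    return c;
}

/* Classify a byte array: count code points, note errors, track kind. */
static struct sketch_scan sketch_utf8_scan(const uint8_t *src, size_t len)
{
    struct sketch_scan r = { SKETCH_ASCII, 0, 0 };
    size_t i = 0;

    while (i < len) {
        size_t adv;
        uint32_t c = sketch_next(src + i, len - i, &adv);
        enum sketch_kind k;

        /* A valid U+FFFD takes 3 bytes, so 0xFFFD with adv == 1 is an error. */
        if (c == 0xFFFD && adv == 1) r.has_errors = 1;
        if (c < 0x80) k = SKETCH_ASCII;
        else if (c < 0x100) k = SKETCH_8BIT;
        else if (c < 0x10000) k = SKETCH_16BIT;
        else k = SKETCH_WIDE;
        if (k > r.kind) r.kind = k;
        r.codepoints++;
        i += adv;
    }
    return r;
}

/* Decode UTF-8 into 16-bit units, splitting non-BMP code points into
   surrogate pairs. Writes at most dest_len units and returns the number
   of units the full output needs (an assumption of this sketch). */
static size_t sketch_decode_buf16(uint16_t *dest, size_t dest_len,
                                  const uint8_t *src, size_t len)
{
    size_t i = 0, j = 0;

    while (i < len) {
        size_t adv;
        uint32_t c = sketch_next(src + i, len - i, &adv);

        i += adv;
        if (c >= 0x10000) {
            c -= 0x10000;
            if (j < dest_len) dest[j] = (uint16_t)(0xD800 | (c >> 10));
            j++;
            c = 0xDC00 | (c & 0x3FF);   /* low surrogate */
        }
        if (j < dest_len) dest[j] = (uint16_t)c;
        j++;
    }
    return j;
}
```

A scan pass like this lets a caller pick the narrowest storage before decoding: an 8-bit buffer when the kind is ASCII or 8-bit, and a 16-bit buffer otherwise.
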

This commit prepares for another PR that fixes some JSAtom creation inconsistencies and inefficiencies.

@chqrlie force-pushed the improve-utf8-functions branch from 50da583 to 1c6a98a on May 19, 2024 12:50
@saghul (Contributor) left a comment

I only did a shallow review, but I trust you and the tests are happy :-)

@chqrlie merged commit 1baa676 into quickjs-ng:master on May 21, 2024
47 checks passed