Stricter utf8 library validation (#1)

luau-lang · Nov 8, 2023 · c12c6cd · c12c6cd
1 parent e14a058
commit c12c6cd
Showing 1 changed file with 46 additions and 0 deletions.
diff --git a/docs/behavior-stricter-utf8-library.md b/docs/behavior-stricter-utf8-library.md
@@ -0,0 +1,46 @@
+# Stricter utf8 library validation
+
+## Summary
+
+`utf8.len` and other functions in UTF-8 library will do more rigorous validation of UTF-8 input, rejecting UTF-16 surrogates encoded into UTF-8.
+
+## Motivation
+
+We use the UTF-8 library `utf8` from Lua 5.3. The implementation of this library mostly correctly validates UTF-8 input and either throws errors (e.g. from `utf8.codes`) or returns `nil` (for `utf8.len`) for invalid input.
+
+However, certain invalid UTF-8 sequences are treated as valid. This specifically includes surrogate characters encoded in UTF-8; for example, the string `"\237\160\128"` represents an attempt to encode the surrogate `0xD800` into UTF-8.
+This is impossible to do in UTF-8; Lua 5.3 accepts this (`utf8.len` returns 1), and so does Luau.
+
+This creates issues for any other API that correctly validates UTF-8. For example, Roblox extends Luau with a few extra functions, like `utf8.nfcnormalize`, that perform UTF-8 validation correctly and reject the aforementioned string.
+As a result, due to Roblox extensions, the `utf8` library exposed in Roblox is inconsistent (this is also a problem in other Roblox-specific APIs like DataStores that reject invalid UTF-8 inputs that `utf8.len` accepts).
+We would also expect that in other environments, extra functions that handle UTF-8 inputs and validate them do the validation properly.
+
+Lua 5.4 fixes this by changing the default behavior of `utf8.len` and other UTF-8 functions to error when surrogates are present. It also fixes an issue in `utf8.codes` that failed to properly reject overlong UTF-8 sequences in certain cases; Luau already contains a fix to this issue.
+
+## Design
+
+All functions from Luau `utf8` library will correctly validate UTF-8 and reject inputs that have surrogates. The functions will use their existing error handling strategy: `utf8.len` will return `nil` followed by the byte offset, `utf8.codepoint` and `utf8.codes` will error.
+
+Any code using these functions should already be prepared for handling errors: `utf8` functions will already do a validation of UTF-8 input, the validation was just not fully comprehensive. It is expected that this will not cause any significant compatibility risk, because many other functions already reject surrogates and the use of `utf8.` function family in general is rare.
+
+Notably, this is a *partial* implementation of Lua 5.4 behavior. In addition to making the default behavior stricter, Lua 5.4 *relaxes* `utf8.char` (allowing it to produce "codepoints" that exceed 0x10FFFF and are invalid per Unicode standard) unconditionally, and also provides an optional argument `lax` to a few functions like `utf8.len` that allows to accept both surrogates and out-of-range (larger than `0x10FFFF`) codepoints.
+
+We do not propose to adopt the full Lua 5.4 behavior or introduce the new `lax` argument for the following reasons:
+
+- Based on the analysis of all packages published to `luarocks`, we conclude that `lax` argument is never used outside of Lua's own tests. As such, it's questionable whether we *need* to give a way to relax the checks.
+- The behavior changes regardless due to the parameter defaults (`lax` defaults to `false`). We agree with this choice but conclude that the compatibility risk, however small, exists regardless of the argument's presence.
+- Because full 5.4 semantics include both making behavior stricter by default, and also significantly relaxing behavior by allowing out-of-range characters (unconditionally for `utf8.char`, and conditionally on `lax` for other `utf8` functions), they allow easier production of invalid UTF-8 sequences.
+
+It is likely that 5.4 semantics change is motivated by only having two types of behavior: "strict" (proposed here) and "lax" (supporting a variant of UTF-8 that allows encoding any codepoint value). Because it is always possible to add the "lax" behavior later, and we do not want to establish the precedent and require this addition to all other functions (for example, what should `nfcnormalize` do for surrogates?), we only specify the "strict" behavior here. The "lax" behavior could be added by a separate RFC if there's significant demand for it; that would be fully backwards compatible.
+
+> After this change, in Roblox strings that are correctly validated by `utf8.len` should also work properly with `utf8.nfcnormalize`, `utf8.graphemes`, DataStore APIs and other functions that validate UTF-8 input.
+
+## Drawbacks
+
+Because this technically changes behavior, due to Hyrum's law we might break some code that *really* wants to encode surrogates in UTF-8 despite the fact that only a couple functions support this.
+
+Because this doesn't fully align behavior with Lua 5.4, it is possible that we will get programs in the future that want the full Lua 5.4 treatment (allowing `lax`). It is likely that this will happen because of larger codepoints for some odd custom encoding reasons though, not for surrogates specifically, and if it does we can add it later.
+
+## Alternatives
+
+Other than doing nothing, we could implement full Lua 5.4 behavior (with `lax` argument and relaxed range checks in some existing functions). This leaves the question as to what the behavior of Roblox-specific utf8 extensions like `nfcnormalize` or `graphemes` should be, as they would become inconsistent not just in implementation but also in interface without further changes.