-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: utf8 -> utf16 decoding bug on surrogate pairs #1486
Conversation
This fixes protobufjs#1473 The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk. It looks like the chunking mechanism was added to prevent stack overflows when calling `formCharCode` with too many args. From some benchmarking it appears putting utf16 code units in an array and spreading that into `fromCharCode` wasn't helping performance much anyway. I simplified it significantly. Here's a repro of the existing encoding bug in a fuzzing suite https://repl.it/@turbio/oh-no-our-strings#decoder.js
chunk[i++] = t; | ||
else if (t > 191 && t < 224) | ||
chunk[i++] = (t & 31) << 6 | buffer[start++] & 63; | ||
else if (t > 239 && t < 365) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of this function is that the passed in buffer is always an array of uint8s, each byte being a utf8 code unit. The condition t < 365
is a bit odd under this assumption, from profiling it actually forces v8 to deopt when t
is compared to a non uint8 value. Alternatively I may have grossly misunderstood how this is expected to work.
Hi @turbio, Would it be possible to have a test that shows the problem included in the PR? Thank you! |
@alexander-fenster yup! added a test which fails on the old |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @turbio!
protobufjs/protobuf.js#1486 released in protobufjs 6.10.2 and updated in @replit/protocol in replit/protocol#9 and released in 0.2.15
hi guys, the merged code seems not included in the latest release. |
This fixes #1473
The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk.
The chunking mechanism was added to prevent stack overflows when calling
formCharCode
with too many args. From some simple benchmarking it doesn't seem like putting utf16 code units in an array and spreading that intofromCharCode
is helping performance much anyway. I've removed the chunking code which simplifies it significantly and fixes the particular issue we've encountered.Here's a repro of the existing encoding bug in a fuzzing suite
https://repl.it/@turbio/oh-no-our-strings#decoder.js