Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v20.5.0] Buffer.allocUnsafe and TextDecoder.decode doesn't play well. #48995

Closed
JoakimCh opened this issue Aug 2, 2023 · 10 comments
Closed
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs.

Comments

@JoakimCh
Copy link

JoakimCh commented Aug 2, 2023

Version

v20.5.0

Platform

Linux dell 6.4.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.4-1 (2023-07-23) x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

Here is some code to reproduce and explore the bug.
Both Buffer.from([0x80]) and Buffer.allocUnsafe(1)[0] = 0x80 will cause unstable behaviour in TextDecoder.decode

let error1Count = 0
let error2Count = 0
for (let index = 0; index < 20000; index++) {
  try {
    const buffer = Buffer.from([0x80])   // this is unstable
    // const buffer = Buffer.allocUnsafe(1) // this is unstable
    // const buffer = Buffer.alloc(1)       // this is stable
    buffer[0] = 0x80
    const data = new TextDecoder('utf-8', {fatal: true}).decode(buffer)
    error1Count ++ // then data is the '\uFFFD' replacement character
  } catch {}
  try { // this is stable
    const data = new TextDecoder('utf-8', {fatal: true}).decode(new Uint8Array([0x80]))
    error2Count ++
  } catch {} 
}
console.log(error1Count, error2Count)

How often does it reproduce? Is there a required condition?

Sometimes I have to run the code snippet a few times before the bug is seen.

What is the expected behavior? Why is that the expected behavior?

Since I'm running the TextDecoder with {fatal: true} I expect the invalid input to ALWAYS throw an error.

What do you see instead?

A few times (pretty much random) instead of throwing an error it outputs the \uFFFD replacement character. This should not be possible when {fatal: true} is used.

Additional information

I have only tested on Linux Debian with Node v20.5.0.

@atlowChemi atlowChemi added buffer Issues and PRs related to the buffer subsystem. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. labels Aug 3, 2023
@lpinca lpinca added the confirmed-bug Issues with confirmed bugs. label Aug 3, 2023
@lpinca
Copy link
Member

lpinca commented Aug 3, 2023

I can reproduce on macOS. Buffer.from() and Buffer.allocUnsafe() use the internal Buffer pool so it might be related to that.

@lpinca
Copy link
Member

lpinca commented Aug 3, 2023

The culprit seems to be this commit 2ef13b8.

cc: @anonrig

@lpinca
Copy link
Member

lpinca commented Aug 3, 2023

FWIW this seems to fix the issue:

diff --git a/src/encoding_binding.cc b/src/encoding_binding.cc
index b65a4f868e..747df2c40c 100644
--- a/src/encoding_binding.cc
+++ b/src/encoding_binding.cc
@@ -164,9 +164,9 @@ void BindingData::DecodeUTF8(const FunctionCallbackInfo<Value>& args) {
   size_t length = buffer.length();
 
   if (has_fatal) {
-    auto result = simdutf::validate_utf8_with_errors(data, length);
+    bool is_valid = simdutf::validate_utf8(data, length);
 
-    if (result.error) {
+    if (!is_valid) {
       return node::THROW_ERR_ENCODING_INVALID_ENCODED_DATA(
           env->isolate(), "The encoded data was not valid for encoding utf-8");
     }

cc: @lemire

@lemire
Copy link
Member

lemire commented Aug 3, 2023

No string that begins with the byte value 0x80 can be valid (does not matter what comes after).

I have added tests for this upstream, specifically

  const char bad[1] = {(char)0x80};
  size_t length = 1;
  simdutf::result res = implementation.validate_utf8_with_errors(bad, length);
  ASSERT_TRUE(res.error);

We run tests with sanitizers, so an out-of-bound access or an undefined behavior would be detected.

simdutf/simdutf#265

What is the expectation in this routine...?

if (has_fatal) {
     auto result = simdutf::validate_utf8_with_errors(data, length);

     if (result.error) {
       return node::THROW_ERR_ENCODING_INVALID_ENCODED_DATA(
           env->isolate(), "The encoded data was not valid for encoding utf-8");
     }
   }

Is the expectation that the data is incorrect? If so, I would encourage validate_utf8_with_errors as it is likely more performant.

@lpinca
Copy link
Member

lpinca commented Aug 4, 2023

I think the expectation is that data is correct. Anyway the reproduction script calls validate_utf8_with_errors 20000 times and it fails only a few times on my machine. Sometimes it does not fail. validate_utf8 never fails.

luigi@imac:Desktop$ cat repro.js 
let error1Count = 0;
let error2Count = 0;
for (let index = 0; index < 20000; index++) {
  try {
    const buffer = Buffer.from([0x80]); // this is unstable
    // const buffer = Buffer.allocUnsafe(1) // this is unstable
    // const buffer = Buffer.alloc(1)       // this is stable
    buffer[0] = 0x80;
    const data = new TextDecoder('utf-8', { fatal: true }).decode(buffer);
    error1Count++; // then data is the '\uFFFD' replacement character
  } catch {}
  try {
    // this is stable
    const data = new TextDecoder('utf-8', { fatal: true }).decode(
      new Uint8Array([0x80])
    );
    error2Count++;
  } catch {}
}
console.log(error1Count, error2Count);
luigi@imac:Desktop$ node repro.js 
5 0
luigi@imac:Desktop$ node repro.js 
0 0
luigi@imac:Desktop$ node repro.js 
4 0
luigi@imac:Desktop$ node repro.js 
108 0
luigi@imac:Desktop$ node repro.js 
73 0
luigi@imac:Desktop$ node repro.js 
2 0

@lemire
Copy link
Member

lemire commented Aug 4, 2023

I am going to be adding tests upstream today. Give me 24 hours to research this, to see whether there is an issue.

@lpinca
Copy link
Member

lpinca commented Aug 4, 2023

Sure, no hurries.

@lemire
Copy link
Member

lemire commented Aug 4, 2023

I am going to be issue a patch release.

lemire added a commit to simdutf/simdutf that referenced this issue Aug 4, 2023
@lemire
Copy link
Member

lemire commented Aug 4, 2023

The issue is upstream in simdutf. I have a patch release for it https://github.com/simdutf/simdutf/releases/tag/v3.2.15

I recommend upgrading to v3.2.15. It is nearly identical to v3.2.14 except for this one issue.

cc @anonrig

@lemire
Copy link
Member

lemire commented Aug 16, 2023

Thanks @lpinca and @anonrig for updating.

Thanks @JoakimCh for reporting the issue.

UlisesGascon pushed a commit that referenced this issue Aug 17, 2023
PR-URL: #49019
Fixes: #48995
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
targos pushed a commit that referenced this issue Oct 28, 2023
PR-URL: #49019
Fixes: #48995
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
sercher added a commit to sercher/graaljs that referenced this issue Apr 25, 2024
PR-URL: nodejs/node#49019
Fixes: nodejs/node#48995
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
sercher added a commit to sercher/graaljs that referenced this issue Apr 25, 2024
PR-URL: nodejs/node#49019
Fixes: nodejs/node#48995
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Mohammed Keyvanzadeh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants