Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib: ensure TextDecoder only removes utf8 BOM on utf8 encoding #42779

Closed
wants to merge 5 commits into from

Conversation

phated
Copy link
Contributor

@phated phated commented Apr 18, 2022

While working on gulpjs/remove-bom-stream#8, I noticed that our test case to ensure a UTF-8 BOM at the beginning of a UTF-16 file wouldn't be removed; however, it was.

I dug into the PR at #30132 and noticed that it lost the utf-8 and utf-16-le checks in the refactor.

This should limit the BOM removal to just the UTF-8 encoding. There should also be a follow-up PR that adds the utf-16-le BOM removal back to the code.

@nodejs-github-bot nodejs-github-bot added encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. needs-ci PRs that need a full CI run. labels Apr 18, 2022
@Trott Trott requested a review from addaleax April 18, 2022 21:30
@phated phated force-pushed the phated/textdecoder-bom branch from 3cc2e04 to d84a4eb Compare April 18, 2022 21:32
@Trott
Copy link
Member

Trott commented Apr 18, 2022

I'm guessing lib: would be the generic choice for the start of the commit message.

@phated phated force-pushed the phated/textdecoder-bom branch from d84a4eb to 96f2c21 Compare April 18, 2022 21:45
@phated
Copy link
Contributor Author

phated commented Apr 18, 2022

Updated! I also attempted to add a test, let me know if it is wrong.

@phated phated changed the title utils: ensure TextDecoder only removes utf-8 BOM on utf-8 encoding lib: ensure TextDecoder only removes utf8 BOM on utf8 encoding Apr 18, 2022
@Trott

This comment was marked as resolved.

@phated phated force-pushed the phated/textdecoder-bom branch from 96f2c21 to 4b0cb91 Compare April 18, 2022 21:47
@Trott
Copy link
Member

Trott commented Apr 18, 2022

I also attempted to add a test, let me know if it is wrong.

Looks right to me. Anyone more encoding-knowledgable want to review?

@Trott Trott added the request-ci Add this label to start a Jenkins CI on a PR. label Apr 18, 2022
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Apr 18, 2022
@nodejs-github-bot
Copy link
Collaborator

@phated
Copy link
Contributor Author

phated commented Apr 18, 2022

@Trott can you help me with these failures so I don't need to approve the Jenkins CI to my entire GH account?

@nodejs-github-bot
Copy link
Collaborator

@Trott
Copy link
Member

Trott commented Apr 18, 2022

@Trott can you help me with these failures so I don't need to approve the Jenkins CI to my entire GH account?

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
+ actual - expected

+ 'utf-16le'
- 'utf-16-le'
         ^
    at /home/iojs/build/workspace/node-test-commit-linuxone/test/parallel/test-whatwg-encoding-custom-textdecoder.js:46:12
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/parallel/test-whatwg-encoding-custom-textdecoder.js:39:26)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: 'utf-16le',
  expected: 'utf-16-le',
  operator: 'strictEqual'
}

@phated
Copy link
Contributor Author

phated commented Apr 18, 2022

🤦 I totally looked at those conversions too. Thanks!

@Trott Trott added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Apr 18, 2022
@nodejs-github-bot
Copy link
Collaborator

@Trott
Copy link
Member

Trott commented Apr 19, 2022

@phated It appears that the change here causes a failure in test/parallel/test-whatwg-encoding-custom-textdecoder.js. Is the bug in the test or in the change here?

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
+ actual - expected

+ '믯璿攀猀琀가�'
- 'test€'
    at /home/iojs/build/workspace/node-test-commit-linux/test/parallel/test-whatwg-encoding-custom-textdecoder.js:48:12
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/test/parallel/test-whatwg-encoding-custom-textdecoder.js:39:26)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: '믯璿攀猀琀가�',
  expected: 'test€',
  operator: 'strictEqual'
}

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

@Trott thanks. It looks like a bug in my test specifically.

const dec = new TextDecoder(i);
assert.strictEqual(dec.encoding, 'utf-16le');
const res = dec.decode(buf);
assert.strictEqual(res, '믯璿攀猀琀가');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried decoding this on current node 16, the bug shows itself in an invalid character at the end: '믯璿攀猀琀가�'

@Trott Trott requested a review from lpinca April 19, 2022 21:28
@nodejs-github-bot
Copy link
Collaborator

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

If my last commit doesn't work. I might need to copy the gulp test case into here (even though it is a larger buffer)

@Trott
Copy link
Member

Trott commented Apr 19, 2022

node:assert:123
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
+ actual - expected

+ '믯璿攀猀琀가�'
- '믯璿攀猀琀가'
         ^
    at /Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015/test/parallel/test-whatwg-encoding-custom-textdecoder.js:48:12
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015/test/parallel/test-whatwg-encoding-custom-textdecoder.js:39:26)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: '믯璿攀猀琀가�',
  expected: '믯璿攀猀琀가',
  operator: 'strictEqual'
}

Node.js v19.0.0-pre

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

I don't know how to test this. In the gulp tests, we save files with different encodings and I have no idea how to craft those buffers inside a Buffer.

@phated phated closed this Apr 19, 2022
@Trott
Copy link
Member

Trott commented Apr 19, 2022

Did you mean to close this? I'd like to keep chipping away at it unless you're opening a new PR or something?

@Trott
Copy link
Member

Trott commented Apr 19, 2022

You can create a file and save it in the test temp directory. (I can write that code or point you to a sample, so you don't have to learn that API unless you think you're going to be writing more Node.js tests.)

We could also put a file permanently in the test/fixtures directory.

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

Did you mean to close this?

Yeah, I meant to close it. At this point, I can't determine if I'm crazy or not because reading the file into the repl seems to give me the expected results, but my tests fail. The furthest I got before giving up was possibly a re-encoding issue (back into a buffer).

One thing that I noticed is that \ufeff is actually the <Buffer ff fe> buffer, which is actually the utf16 BOM (not the utf8 BOM), so I'm not sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants