Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use TextEncoder and TextDecoder for utf8 strings #4513

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

seia-soto
Copy link
Member

fixes #4424

This PR replaces punycode encoder and decoder with TextEncoder and TextDecoder for utf8 strings.

  • BOM character \ufeff should be skipped when decoding to ensure the original form
  • Uint8Array.subarray doesn't copy the array but provides a direct interface to subarray
  • TextEncoder.encodeInto doesn't produce EOL character ( NULL, U+0000 ) but we don't care because TextDecoder can stop nicely when provided buffer ends

To be safe:

  • The performance should be evaluated carefully
  • The output binary size needs to be evaluated carefully

@seia-soto seia-soto added the PR: Internal 🏠 Changes only affect internals label Dec 10, 2024
@seia-soto seia-soto self-assigned this Dec 10, 2024
@seia-soto seia-soto requested a review from remusao as a code owner December 10, 2024 08:29
@seia-soto
Copy link
Member Author

seia-soto commented Dec 11, 2024

> (6635933 / 6634657) * 100 // raw > ads + trackers + annoyances (104550 network + 69076 hide)
100.01923234313395

master, af1a9c9

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49534 network + 38171 hide)
 + raw 3358745 bytes
 + gzip 1783830 bytes
 + brotli 1467652 bytes
> ads (0 network + 38171 hide)
 + raw 1825421 bytes
 + gzip 880805 bytes
 + brotli 708974 bytes
> ads (49534 network + 0 hide)
 + raw 1533629 bytes
 + gzip 905016 bytes
 + brotli 765458 bytes
> ads + trackers (102405 network + 38324 hide)
 + raw 4958193 bytes
 + gzip 2635039 bytes
 + brotli 2172546 bytes
> ads + trackers (0 network + 38324 hide)
 + raw 1847893 bytes
 + gzip 890432 bytes
 + brotli 715577 bytes
> ads + trackers (102405 network + 0 hide)
 + raw 3110601 bytes
 + gzip 1745248 bytes
 + brotli 1459959 bytes
> ads + trackers + annoyances (104585 network + 69043 hide)
 + raw 6635933 bytes
 + gzip 3468626 bytes
 + brotli 2862312 bytes
> ads + trackers + annoyances (0 network + 69043 hide)
 + raw 3455897 bytes
 + gzip 1674036 bytes
 + brotli 1363518 bytes
> ads + trackers + annoyances (104585 network + 0 hide)
 + raw 3180345 bytes
 + gzip 1793551 bytes
 + brotli 1501761 bytes

seia-soto:textencoder, 1de005e

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49501 network + 38172 hide)
 + raw 3357509 bytes
 + gzip 1782885 bytes
 + brotli 1468373 bytes
> ads (0 network + 38172 hide)
 + raw 1824997 bytes
 + gzip 880743 bytes
 + brotli 706714 bytes
> ads (49501 network + 0 hide)
 + raw 1532817 bytes
 + gzip 904485 bytes
 + brotli 764578 bytes
> ads + trackers (102370 network + 38325 hide)
 + raw 4957369 bytes
 + gzip 2634088 bytes
 + brotli 2171825 bytes
> ads + trackers (0 network + 38325 hide)
 + raw 1847925 bytes
 + gzip 890324 bytes
 + brotli 719134 bytes
> ads + trackers (102370 network + 0 hide)
 + raw 3109745 bytes
 + gzip 1744659 bytes
 + brotli 1459817 bytes
> ads + trackers + annoyances (104550 network + 69076 hide)
 + raw 6634657 bytes
 + gzip 3468231 bytes
 + brotli 2859315 bytes
> ads + trackers + annoyances (0 network + 69076 hide)
 + raw 3455473 bytes
 + gzip 1674098 bytes
 + brotli 1362984 bytes
> ads + trackers + annoyances (104550 network + 0 hide)
 + raw 3179489 bytes
 + gzip 1793145 bytes
 + brotli 1501288 bytes

@seia-soto seia-soto requested a review from chrmod December 11, 2024 07:45
@seia-soto
Copy link
Member Author

seia-soto commented Dec 12, 2024

> 147.50420889870574 / 156.1296767089117 // benchEngineDeserialization
0.944754463135876
> 147.50420889870574 / 148.7394726802865 // benchEngineSerialization
0.9916951179177841

seia-soto:textencoder, 65764e9

  • benchEngineDeserialization: 147.50420889870574 op/s
  • benchEngineSerialization: 147.50420889870574 op/s

master, de7bfb5

  • benchEngineDeserialization: 156.1296767089117 op/s
  • benchEngineSerialization: 148.7394726802865 op/s

Comment on lines 394 to 395
const { written } = TEXT_ENCODER.encodeInto(raw, this.buffer.subarray(this.pos + 4));
this.pushLength(written);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you do +4 here (32 bits), it's not useful to use pushLength since you might as well push the length always as uint32. The only reason to use pushLength is to use less bits to encode the length of small strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I'm on the fix to support dynamic positioning to utilize pushLength.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@remusao Please, review 0339bdc (#4513)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: Internal 🏠 Changes only affect internals
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Overflow error when loading uBO quick fix list
4 participants