feat: use `TextEncoder` and `TextDecoder` for utf8 strings #4513

seia-soto · 2024-12-10T08:29:19Z

This PR replaces the punycode encoder and decoder with `TextEncoder` and `TextDecoder` for UTF-8 strings.

- The BOM character \ufeff should be skipped when decoding so that strings are restored in their original form.
- `Uint8Array.subarray` doesn't copy the array; it returns a view over the same underlying memory.
- `TextEncoder.encodeInto` doesn't write a terminating character (NULL, U+0000), but that doesn't matter because `TextDecoder` stops cleanly at the end of the buffer it is given.
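
The three points above can be illustrated with a small standalone sketch. It is not the PR's code: `roundTripDemo`, `buffer`, and `pos` are hypothetical names, and passing `ignoreBOM: true` is an assumption about how BOM handling is skipped so that a leading \ufeff survives decoding.

```typescript
// Illustrative sketch only; none of these names come from the PR.
function roundTripDemo() {
  const encoder = new TextEncoder();
  // Assumption: `ignoreBOM: true` keeps a leading U+FEFF in the decoded string.
  const keepBom = new TextDecoder('utf-8', { ignoreBOM: true });
  // The default decoder strips a leading BOM from its output.
  const stripBom = new TextDecoder('utf-8');

  // Pre-fill with a sentinel so untouched bytes are easy to spot.
  const buffer = new Uint8Array(32).fill(0xff);
  const pos = 4;
  const input = '\ufeffhéllo';

  // `subarray` returns a view, so `encodeInto` writes straight into `buffer`.
  const { written } = encoder.encodeInto(input, buffer.subarray(pos));

  // No terminator is written, so the decoder is handed an exactly sized view.
  const bytes = buffer.subarray(pos, pos + written);

  return {
    written,
    firstByte: buffer[pos], // first byte of the UTF-8 encoded BOM
    byteAfter: buffer[pos + written], // still the sentinel
    kept: keepBom.decode(bytes),
    stripped: stripBom.decode(bytes),
  };
}

console.log(roundTripDemo());
```

With the default decoder the leading \ufeff is dropped, which is why the round trip would not restore the original string without the extra option.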

To be safe:

- The performance should be evaluated carefully.
- The output binary size should be evaluated carefully.

seia-soto · 2024-12-11T07:22:57Z

> (6635933 / 6634657) * 100 // raw > ads + trackers + annoyances (104550 network + 69076 hide)
100.01923234313395

master, af1a9c9

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49534 network + 38171 hide)
 + raw 3358745 bytes
 + gzip 1783830 bytes
 + brotli 1467652 bytes
> ads (0 network + 38171 hide)
 + raw 1825421 bytes
 + gzip 880805 bytes
 + brotli 708974 bytes
> ads (49534 network + 0 hide)
 + raw 1533629 bytes
 + gzip 905016 bytes
 + brotli 765458 bytes
> ads + trackers (102405 network + 38324 hide)
 + raw 4958193 bytes
 + gzip 2635039 bytes
 + brotli 2172546 bytes
> ads + trackers (0 network + 38324 hide)
 + raw 1847893 bytes
 + gzip 890432 bytes
 + brotli 715577 bytes
> ads + trackers (102405 network + 0 hide)
 + raw 3110601 bytes
 + gzip 1745248 bytes
 + brotli 1459959 bytes
> ads + trackers + annoyances (104585 network + 69043 hide)
 + raw 6635933 bytes
 + gzip 3468626 bytes
 + brotli 2862312 bytes
> ads + trackers + annoyances (0 network + 69043 hide)
 + raw 3455897 bytes
 + gzip 1674036 bytes
 + brotli 1363518 bytes
> ads + trackers + annoyances (104585 network + 0 hide)
 + raw 3180345 bytes
 + gzip 1793551 bytes
 + brotli 1501761 bytes

seia-soto:textencoder, 1de005e

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49501 network + 38172 hide)
 + raw 3357509 bytes
 + gzip 1782885 bytes
 + brotli 1468373 bytes
> ads (0 network + 38172 hide)
 + raw 1824997 bytes
 + gzip 880743 bytes
 + brotli 706714 bytes
> ads (49501 network + 0 hide)
 + raw 1532817 bytes
 + gzip 904485 bytes
 + brotli 764578 bytes
> ads + trackers (102370 network + 38325 hide)
 + raw 4957369 bytes
 + gzip 2634088 bytes
 + brotli 2171825 bytes
> ads + trackers (0 network + 38325 hide)
 + raw 1847925 bytes
 + gzip 890324 bytes
 + brotli 719134 bytes
> ads + trackers (102370 network + 0 hide)
 + raw 3109745 bytes
 + gzip 1744659 bytes
 + brotli 1459817 bytes
> ads + trackers + annoyances (104550 network + 69076 hide)
 + raw 6634657 bytes
 + gzip 3468231 bytes
 + brotli 2859315 bytes
> ads + trackers + annoyances (0 network + 69076 hide)
 + raw 3455473 bytes
 + gzip 1674098 bytes
 + brotli 1362984 bytes
> ads + trackers + annoyances (104550 network + 0 hide)
 + raw 3179489 bytes
 + gzip 1793145 bytes
 + brotli 1501288 bytes

packages/adblocker/src/data-view.ts

- ~65535 ASCII only characters

seia-soto · 2024-12-12T07:37:08Z

> 147.50420889870574 / 156.1296767089117 // benchEngineDeserialization
0.944754463135876
> 147.50420889870574 / 148.7394726802865 // benchEngineSerialization
0.9916951179177841

seia-soto:textencoder, 65764e9

benchEngineDeserialization: 147.50420889870574 op/s
benchEngineSerialization: 147.50420889870574 op/s

master, de7bfb5

benchEngineDeserialization: 156.1296767089117 op/s
benchEngineSerialization: 148.7394726802865 op/s

remusao · 2024-12-12T08:27:42Z

packages/adblocker/src/data-view.ts

+    const { written } = TEXT_ENCODER.encodeInto(raw, this.buffer.subarray(this.pos + 4));
+    this.pushLength(written);


Since you do +4 here (32 bits), there is no point in using pushLength: you might as well always push the length as a uint32. The only reason to use pushLength is to use fewer bits to encode the length of small strings.

Yup, I'm working on a fix that supports dynamic positioning so that pushLength is actually useful.

@remusao Please review 0339bdc (#4513).

Dropped the useless fast path: https://github.com/ghostery/adblocker/pull/4513/files#diff-02c5b76ebc841a4005b9541e8e09eb74aecf407dfd69e179c028c6c83a771552R93-R96
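
The idea discussed in this thread can be sketched as follows: compute the UTF-8 byte length first, write a variable-size length prefix, and only then encode the string directly after it. This is a minimal sketch, not the code in `data-view.ts`. `SketchView` and `utf8ByteLength` are hypothetical, and the prefix layout (one byte for lengths up to 127, otherwise a marker byte followed by a uint32) is an assumption about how `pushLength` and `getLength` behave.

```typescript
const TEXT_ENCODER = new TextEncoder();
const TEXT_DECODER = new TextDecoder('utf-8', { ignoreBOM: true });

// Hypothetical helper: number of bytes `str` occupies once UTF-8 encoded.
function utf8ByteLength(str: string): number {
  let bytes = 0;
  for (let i = 0; i < str.length; i += 1) {
    const cp = str.codePointAt(i) as number;
    if (cp < 0x80) bytes += 1;
    else if (cp < 0x800) bytes += 2;
    else if (cp < 0x10000) bytes += 3; // includes lone surrogates (encoded as U+FFFD)
    else {
      bytes += 4;
      i += 1; // skip the low surrogate of the pair
    }
  }
  return bytes;
}

// Hypothetical stand-in for the real data view class.
class SketchView {
  public pos = 0;
  public readonly buffer: Uint8Array;
  private readonly view: DataView;

  constructor(size: number) {
    this.buffer = new Uint8Array(size);
    this.view = new DataView(this.buffer.buffer);
  }

  // Assumed layout: 1 byte for small lengths, marker byte + uint32 otherwise.
  pushLength(length: number): void {
    if (length <= 127) {
      this.buffer[this.pos++] = length;
    } else {
      this.buffer[this.pos++] = 128;
      this.view.setUint32(this.pos, length);
      this.pos += 4;
    }
  }

  getLength(): number {
    const first = this.buffer[this.pos++];
    if (first !== 128) return first;
    const length = this.view.getUint32(this.pos);
    this.pos += 4;
    return length;
  }

  pushUTF8(raw: string): void {
    // The prefix size depends on the byte length, so it is computed up front.
    this.pushLength(utf8ByteLength(raw));
    const { written } = TEXT_ENCODER.encodeInto(raw, this.buffer.subarray(this.pos));
    this.pos += written;
  }

  getUTF8(): string {
    const length = this.getLength();
    const start = this.pos;
    this.pos += length;
    return TEXT_DECODER.decode(this.buffer.subarray(start, this.pos));
  }
}
```

Under these assumptions a short string costs a single prefix byte instead of the four bytes reserved by `this.pos + 4`, at the price of one extra pass over the string to measure it.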

feat: use TextEncoder and TextDecoder for utf8 strings

05063b5

seia-soto added the PR: Internal 🏠 Changes only affect internals label Dec 10, 2024

seia-soto self-assigned this Dec 10, 2024

seia-soto requested a review from remusao as a code owner December 10, 2024 08:29

refactor: pos calculation in pushUTF8

1de005e

seia-soto requested a review from chrmod December 11, 2024 07:45

chrmod reviewed Dec 11, 2024

chrmod approved these changes Dec 11, 2024

philipp-classen approved these changes Dec 11, 2024

seia-soto commented Dec 11, 2024

remusao reviewed Dec 11, 2024

seia-soto added 2 commits December 11, 2024 22:11

chore: save length of string in 16 bits unsigned integer

d6032eb

- ~65535 ASCII only characters

fix: use getLength and pushLength for utf8

65764e9

remusao reviewed Dec 12, 2024

seia-soto added 3 commits December 12, 2024 17:53

fix: calculate length of utf8 encoded string

0339bdc

chore: drop useless fast exit

3270e6a

refactor: reuse sizeOfLength

47dc63a

seia-soto requested review from chrmod, remusao and philipp-classen December 12, 2024 09:23
