Use TextDecoder API for decoding UTF-8 from binary data #184

Closed
jcready opened this issue Oct 30, 2021 · 27 comments
Labels
enhancement New feature or request

Comments

@jcready
Contributor

jcready commented Oct 30, 2021

Identical bug to protobufjs/protobuf.js#1473, as protobuf-ts is using the same (old) implementation. It looks like it may have been fixed in protobufjs/protobuf.js#1486, but this comment claims otherwise: protobufjs/protobuf.js#1473 (comment)

I'm not sure what the perf hit would be for just going with TextDecoder, but it may be worth investigating.

I did test out the current implementation against the test fixture: https://github.com/protobufjs/protobuf.js/blob/master/lib/utf8/tests/data/surrogate_pair_bug.txt and it did seem to suffer from the same issue.

@timostamm
Owner

"native" is TextDecoder, @protobufjs/utf8 is the old version (it looks like the new version was never published):

### 8 characters
decode @protobufjs/utf8     : 4399740.72 ops/s
decode native               : 300604.37 ops/s
### 16 characters
decode @protobufjs/utf8     : 3671259.14 ops/s
decode native               : 310615.41 ops/s
### 32 characters
decode @protobufjs/utf8     : 2433673.74 ops/s
decode native               : 311712.23 ops/s
### 64 characters
decode @protobufjs/utf8     : 1504567.02 ops/s
decode native               : 304889.64 ops/s
### 128 characters
decode @protobufjs/utf8     : 860757.37 ops/s
decode native               : 293968.72 ops/s
### 256 characters
decode @protobufjs/utf8     : 413414.67 ops/s
decode native               : 274735.92 ops/s
### 512 characters
decode @protobufjs/utf8     : 200305.31 ops/s
decode native               : 238818.75 ops/s
### 1024 characters
decode @protobufjs/utf8     : 106844.69 ops/s
decode native               : 152599.08 ops/s
### 2048 characters
decode @protobufjs/utf8     : 52236.76 ops/s
decode native               : 93704.91 ops/s
### 4096 characters
decode @protobufjs/utf8     : 27495.77 ops/s
decode native               : 49688.62 ops/s
### 8192 characters
decode @protobufjs/utf8     : 12782.08 ops/s
decode native               : 29888.75 ops/s
### 16384 characters
decode @protobufjs/utf8     : 7513.44 ops/s
decode native               : 15755.95 ops/s

@timostamm
Owner

For short strings, @protobufjs/utf8 is 10 times faster than TextDecoder. But that doesn't help if it's incorrect. I think switching to TextDecoder is the right call here.

@jcready
Contributor Author

jcready commented Oct 30, 2021

Wow, an order of magnitude worse for arguably the most common use (short strings). I bet 95% of all strings in all protobufs ever created are less than 256 characters.

@jcready
Contributor Author

jcready commented Oct 31, 2021

To be clear, I haven't actually run into this bug myself. This came up as I was trying to improve the test coverage of the protobuf-ts/runtime package. There were no specific tests for the protobufjs-utf8.ts file, so I was simply trying to add the same tests that the protobuf.js file used.

On a side note, I wasn't exactly sure how to properly test this in a way that would work for both test environments (node + browser). The protobuf.js tests rely on reading the two files from disk into a nodejs buffer and comparing the decoded results. But we can't really do the same thing as it would fail in the browser. The way I got it working at all was to follow these steps for each text file fixture: utf8.txt and surrogate_pair_bug.txt:

  1. Download .txt fixture to machine
  2. Run node
  3. require("fs").readFileSync(/* file location */).toString('base64')
  4. Manually copy the base64 string into a fixture.ts file as an exported constant (~20k chars for each fixture, R.I.P. many text editors)
  5. Manually copy the original text from each .txt file into another exported constant using template strings (`)

Then the actual test file looks like this:

import { base64decode, utf8read } from '../src';
import { surrogatePairBug_b64, surrogatePairBug_text } from './support/surrogate_pair_bug';
import { utf8_b64, utf8_text } from './support/utf8';

const utf8data = base64decode(utf8_b64);
const surrogatePairData = base64decode(surrogatePairBug_b64);

describe('utf8read()', () => {

    it('should decode an empty buffer to an empty string', () => {
        expect(utf8read(new Uint8Array())).toBe('');
    });

    it('should decode utf8 properly', () => {
        expect(utf8read(utf8data)).toBe(utf8_text);
    });

    it('should avoid the surrogate pair bug', () => {
        expect(utf8read(surrogatePairData)).toBe(surrogatePairBug_text);
    });

});

The first two tests pass :) so good news there. If you have any ideas on how this stuff could be tested in both environments w/o needing to duplicate all the data as base64 strings I'd love to hear it.

@jcready
Contributor Author

jcready commented Nov 1, 2021

According to nodejs/node#39879, TextDecoder appears to be faster in browsers than in Node, which seems strange to me. It also looks like Node's string_decoder (used by Buffer#toString()) is faster than TextDecoder#decode(). I wonder if utf8read could be something like:

const textDecoder = new TextDecoder();

// Use Buffer#toString() if running inside nodejs because it's slightly faster
export const utf8read = typeof global === 'object' && typeof Buffer === 'function'
    ? (bytes: Uint8Array): string => Buffer.from(bytes).toString()
    : (bytes: Uint8Array): string => textDecoder.decode(bytes)

This code passes all the tests.

I copied the benchmarking code from the linked issue and just added

console.time('small Buffer toString')
for (let i = 0; i < 100000; i++) Buffer.from(smallUint8).toString();
console.timeEnd('small Buffer toString')

console.time('big   Buffer toString')
for (let i = 0; i < 100000; i++) Buffer.from(bigUint8).toString()
console.timeEnd('big   Buffer toString')

These results on node 14 show it to be about 3x faster than using TextDecoder#decode:

small TextDe decode  : 253.713ms
big   TextDe decode  : 2.332s
small Buffer toString: 83.129ms
big   Buffer toString: 718.043ms

@timostamm
Owner

The naive TextDecoder implementation on Node.js has a pretty hefty performance penalty (-48%):

### read binary
google-protobuf             :      11.1   ops/s
ts-proto                    :      25.516 ops/s
protobuf-ts (speed)         :      12.256 ops/s
protobuf-ts (speed, bigint) :      10.825 ops/s
protobuf-ts (size)          :      10.686 ops/s
protobuf-ts (size, bigint)  :      10.066 ops/s
protobufjs                  :      31.17  ops/s

From the Node.js docs:

The Buffer class is a subclass of JavaScript's Uint8Array class

Which means it should be possible to use smallUint8.toString() without allocating a new Buffer. I wonder what the Node.js optimization with Buffer does for the above benchmark...
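
One caveat is worth checking before relying on this. The subclass relationship runs from Buffer to Uint8Array, so a plain Uint8Array does not inherit Buffer's toString(). The sketch below (variable names are illustrative, and it assumes Node.js) shows what toString() returns on a plain Uint8Array, and how Buffer.from(arrayBuffer, byteOffset, length) can wrap the same memory without copying it.

```javascript
const bytes = new TextEncoder().encode("hello");

// Uint8Array#toString() joins the element values; it does not decode UTF-8.
const joined = bytes.toString();

// Buffer.from(arrayBuffer, byteOffset, length) creates a view over the same
// memory, so no bytes are copied.
const view = Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength);
const decoded = view.toString("utf8");

// Writing through the original array is visible through the Buffer view.
bytes[0] = 0x6a; // "j"
const sharesMemory = view[0] === 0x6a;

console.log(joined);       // 104,101,108,108,111
console.log(decoded);      // hello
console.log(sharesMemory); // true
```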

For the surrogate pair bug, storing the data in a Uint8Array literal seems best to me, as it doesn't depend on a base64 implementation. I haven't found the time to look into it yet, but I'm wondering whether surrogate_pair_bug.txt really needs to be that large. If it can be stripped down, a Uint8Array literal with a comment would work out nicely:

// UTF-8 encoded text: hello 🌍
string_hello_world_emoji: new Uint8Array([10, 104, 101, 108, 108, 111, 32, 240, 159, 140, 141]),
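
A literal like this can be produced mechanically with TextEncoder. The snippet below is a small sketch; `text` and `utf8Bytes` are illustrative names. Note that the literal above starts with an extra 10, which looks like a length prefix (the text is 10 bytes long), but that reading is an assumption.

```javascript
const text = "hello 🌍";

// Encode to UTF-8 and print the bytes in the shape of a Uint8Array literal.
const utf8Bytes = Array.from(new TextEncoder().encode(text));
console.log(`new Uint8Array([${utf8Bytes.join(", ")}])`);
// new Uint8Array([104, 101, 108, 108, 111, 32, 240, 159, 140, 141])
```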

@jcready
Contributor Author

jcready commented Nov 1, 2021

for (let fix of fixtures) {
    console.log(`### ${fix.characters} characters`);
    const encoded = nativeEncode(fix.text);
    bench('@protobufjs/utf8     ', () => protobufJsDecode(encoded));
    bench('Buffer#toString      ', () => Buffer.from(encoded).toString());
    bench('UInt8Array#toString  ', () => encoded.toString());
    bench('TextDecoder          ', () => nativeDecode(encoded));
}

Edit: So it's no longer an order of magnitude difference. Slightly better than half speed at 8 characters, and it becomes faster at 64 characters. It would be interesting to see how TextDecoder performs in various browsers in this benchmark.

Edit 2: Wow, the @protobufjs/utf8 package is actually using the old implementation. Benching it again against the new one shows that the new one is even faster than the old one. These numbers are pretty unbelievable.

### 8 characters
NEW protobufjs utf8  : 17516949.98 ops/s
@protobufjs/utf8     :  4073841.15 ops/s
Buffer#toString      :  2512768.24 ops/s
TextDecoder#decode   :   448867.64 ops/s
### 16 characters
NEW protobufjs utf8  : 17069853.61 ops/s
@protobufjs/utf8     :  3535718.31 ops/s
Buffer#toString      :  2554735.72 ops/s
TextDecoder#decode   :   456621.28 ops/s
### 32 characters
NEW protobufjs utf8  : 16053649.46 ops/s
@protobufjs/utf8     :  2160193.4 ops/s
Buffer#toString      :  2108855.74 ops/s
TextDecoder#decode   :   441114.24 ops/s
### 64 characters
NEW protobufjs utf8  : 15981419.76 ops/s
@protobufjs/utf8     :  1432291.98 ops/s
Buffer#toString      :  1609929.12 ops/s
TextDecoder#decode   :   398494.89 ops/s
### 128 characters
NEW protobufjs utf8  : 15955608.38 ops/s
@protobufjs/utf8     :   773642.5 ops/s
Buffer#toString      :  1053945.33 ops/s
TextDecoder#decode   :   427989.21 ops/s
### 256 characters
NEW protobufjs utf8  : 17127740.7 ops/s
@protobufjs/utf8     :   365647.06 ops/s
Buffer#toString      :   638874.22 ops/s
TextDecoder#decode   :   325897.75 ops/s
### 512 characters
NEW protobufjs utf8  : 12930551.61 ops/s
@protobufjs/utf8     :   199813.77 ops/s
Buffer#toString      :   377303.82 ops/s
TextDecoder#decode   :   332516.9 ops/s
### 1024 characters
NEW protobufjs utf8  : 16933107.42 ops/s
@protobufjs/utf8     :    98654.57 ops/s
Buffer#toString      :   211425.37 ops/s
TextDecoder#decode   :   168263.15 ops/s
### 2048 characters
NEW protobufjs utf8  : 16412002.02 ops/s
@protobufjs/utf8     :    45968.61 ops/s
Buffer#toString      :   109674.33 ops/s
TextDecoder#decode   :   107678.5 ops/s
### 4096 characters
NEW protobufjs utf8  : 16765271.11 ops/s
@protobufjs/utf8     :    26562.01 ops/s
Buffer#toString      :    56064.24 ops/s
TextDecoder#decode   :    69391.86 ops/s
### 8192 characters
NEW protobufjs utf8  : 16997275.78 ops/s
@protobufjs/utf8     :    14398.42 ops/s
Buffer#toString      :    25560.07 ops/s
TextDecoder#decode   :    34114.69 ops/s
### 16384 characters
NEW protobufjs utf8  : 16878802.99 ops/s
@protobufjs/utf8     :     8658.53 ops/s
Buffer#toString      :    13107.21 ops/s
TextDecoder#decode   :    19574.82 ops/s

Full bench code:

const protobufJsUtf8 = require("@protobufjs/utf8");

function utf8_read(buffer, start, end) {
    if (end - start < 1) {
        return "";
    }

    var str = "";
    for (var i = start; i < end;) {
        var t = buffer[i++];
        if (t <= 0x7F) {
            str += String.fromCharCode(t);
        } else if (t >= 0xC0 && t < 0xE0) {
            str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xE0 && t < 0xF0) {
            str += String.fromCharCode((t & 0xF) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xF0) {
            var t2 = ((t & 7) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
            str += String.fromCharCode(0xD800 + (t2 >> 10));
            str += String.fromCharCode(0xDC00 + (t2 & 0x3FF));
        }
    }

    return str;
}

function bench(name, fn, durationSeconds = 2) {
    let startTs = performance.now();
    let endTs = startTs + durationSeconds * 1000;
    let samples = 0;
    while (performance.now() < endTs) {
        fn();
        samples++;
    }
    let durationMs = performance.now() - startTs;
    let opsPerSecond = 1000 / (durationMs / samples);
    console.log(`${name}: ${Math.round(opsPerSecond * 100) / 100} ops/s`);
}

let textDecoder = new TextDecoder();
let textEncoder = new TextEncoder();

function nativeEncode(text) {
    return textEncoder.encode(text);
}

function nativeDecode(bytes) {
    return textDecoder.decode(bytes, {stream: false})
}

function protobufJsDecode(bytes) {
    return protobufJsUtf8.read(bytes, 0, bytes.length);
}
 
const fixtures = [];
for (let i = 3; i <= 14; i++) {
    let j = Math.pow(2, i);
    fixtures.push({
        characters: j,
        text: "hello 🌍".repeat(j / 8),
    });
}

for (let fix of fixtures) {
    console.log(`### ${fix.characters} characters`);
    const encoded = nativeEncode(fix.text);
    bench('NEW protobufjs utf8  ', () => utf8_read(encoded));
    bench('@protobufjs/utf8     ', () => protobufJsDecode(encoded));
    bench('Buffer#toString      ', () => Buffer.from(encoded).toString());
    bench('TextDecoder#decode   ', () => nativeDecode(encoded));
}

@jcready
Contributor Author

jcready commented Nov 2, 2021

While investigating this I also discovered that the original implementation actually belongs to Google's closure library licensed under Apache-2.0 and was just copied over into protobuf.js.

The tests available for Google's closure library pass using protobuf.js's new implementation. What I'm really looking for is a comprehensive test of the utf8 decoding that would fail when using the non-native code, but succeed when using TextDecoder. So far I haven't had much luck finding such a test case.

If we cannot find such a test to prove that the new protobuf.js implementation is wrong I propose that we simply adopt this newer (and somehow much faster) implementation.

Edit: Perhaps the reason that shortcuts are allowed to be made in the non-native code is that it only has to be responsible for decoding complete utf8 strings and thus it doesn't need to concern itself with handling partial ones that might end in the middle of a surrogate pair which would then require replacement characters.
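
The difference between complete and partial input can be seen with TextDecoder's `stream` option. The sketch below splits the UTF-8 bytes of "hello 🌍" in the middle of the 4-byte emoji; the variable names are illustrative. Decoding the truncated head on its own produces a replacement character, while a streaming decoder holds the partial sequence until the rest arrives.

```javascript
const encoded = new TextEncoder().encode("hello 🌍"); // 10 bytes, the emoji is the last 4
const head = encoded.subarray(0, 8); // ends after the first 2 bytes of the emoji
const tail = encoded.subarray(8);    // the remaining 2 bytes of the emoji

// One-shot decoding of the truncated input has to substitute U+FFFD.
const truncated = new TextDecoder().decode(head);
const hasReplacement = truncated.includes("\uFFFD");

// A streaming decoder buffers the incomplete sequence between calls.
const streaming = new TextDecoder();
const joined = streaming.decode(head, { stream: true }) + streaming.decode(tail);

console.log(hasReplacement);        // true
console.log(joined === "hello 🌍"); // true
```

A protobuf string field is length-delimited, so the decoder always sees the whole field at once and never needs the streaming path.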

@jcready
Contributor Author

jcready commented Nov 6, 2021

I've found an example that decodes differently for the native decoder vs. both non-native versions:

// "Overlong" sequence should never be generated by a valid UTF-8 encoder
const bytes = new Uint8Array([0xc1, 0xbf]); // [ 193, 191 ]

const native  = new TextDecoder();
const encoder = new TextEncoder();

const native_decoded  = native.decode(bytes); // "��" - "replacement" characters U+FFFD
const old_pbjs_decode = old_pbjs_read(bytes); // "\u007f"
const new_pbjs_decode = new_pbjs_read(bytes); // "\u007f"

const native_round_trip   = encoder.encode(native_decoded);  // [ 239, 191, 189, 239, 191, 189 ]
const old_pbjs_round_trip = encoder.encode(old_pbjs_decode); // [ 127 ]
const new_pbjs_round_trip = encoder.encode(new_pbjs_decode); // [ 127 ]

Based on my reading of https://www.unicode.org/faq/private_use.html#nonchar9 the answer to the question of "Which behavior is correct?" seems to be "It depends." But neither native nor non-native versions preserve the input bytes during a round trip.

Edit: I also verified that the official google protobuf javascript implementation produces the same results as both pbjs versions.

@timostamm
Owner

Does Node.js respect the fatal option? An error is definitely better than silent corruption with replacement characters.

In Chrome:

new TextDecoder("utf-8", {fatal: true}).decode(new Uint8Array([0xc1, 0xbf]));

Uncaught TypeError: Failed to execute 'decode' on 'TextDecoder': The encoded data was not valid.
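
Whether a given runtime honors `fatal` can be checked with a small probe like the one below. `decodeStrict` is a hypothetical helper written for this illustration. On runtimes that implement the option as the Encoding Standard describes, the overlong sequence is reported as a TypeError instead of being replaced.

```javascript
// Decode with fatal: true and report failure instead of throwing.
function decodeStrict(bytes) {
    const decoder = new TextDecoder("utf-8", { fatal: true });
    try {
        return { ok: true, text: decoder.decode(bytes) };
    } catch (err) {
        return { ok: false, error: err };
    }
}

const overlong = decodeStrict(new Uint8Array([0xc1, 0xbf])); // invalid UTF-8
const valid = decodeStrict(new Uint8Array([0x68, 0x69]));    // "hi"

console.log(overlong.ok, valid.ok, valid.text);
```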

@jcready
Contributor Author

jcready commented Nov 7, 2021

I wonder if this could be a compiler option like in nanopb:

PB_VALIDATE_UTF8: Check whether incoming strings are valid UTF-8 sequences. Adds a small performance and code size penalty.

But we'd need to replace "small [...] penalty" with "large [...] penalty" :)

Perhaps a compile option like:

  • VALIDATE_UTF8 possible values
    • THROW - Adds a large performance penalty, but will correctly validate UTF8 strings during decoding. (uses new TextDecoder("utf-8", {fatal: true}))
    • REPLACE - Adds a large performance penalty, but will correctly replace invalid UTF8 code points with the replacement character "�". (uses new TextDecoder("utf-8", {fatal: false}))
    • IGNORE - Very fast, but could result in potentially invalid decoded UTF8 strings. Old behavior, use with caution. (Uses new pbjs utf8 decoder)
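
As a rough sketch of how the three values could map to decoders: `createUtf8Decode` and `lenientDecode` are hypothetical names, and none of this is an existing protobuf-ts option. The lenient decoder is adapted from the utf8_read function posted earlier in this thread, with the range arguments dropped.

```javascript
// Lenient decoder adapted from utf8_read above: no validation at all.
function lenientDecode(bytes) {
    let str = "";
    for (let i = 0; i < bytes.length;) {
        const t = bytes[i++];
        if (t <= 0x7f) {
            str += String.fromCharCode(t);
        } else if (t >= 0xc0 && t < 0xe0) {
            str += String.fromCharCode((t & 0x1f) << 6 | bytes[i++] & 0x3f);
        } else if (t >= 0xe0 && t < 0xf0) {
            str += String.fromCharCode((t & 0xf) << 12 | (bytes[i++] & 0x3f) << 6 | bytes[i++] & 0x3f);
        } else if (t >= 0xf0) {
            const t2 = ((t & 7) << 18 | (bytes[i++] & 0x3f) << 12 | (bytes[i++] & 0x3f) << 6 | bytes[i++] & 0x3f) - 0x10000;
            str += String.fromCharCode(0xd800 + (t2 >> 10));
            str += String.fromCharCode(0xdc00 + (t2 & 0x3ff));
        }
    }
    return str;
}

// Hypothetical factory: picks a decode function for a VALIDATE_UTF8 value.
function createUtf8Decode(mode) {
    switch (mode) {
        case "THROW": {
            const strict = new TextDecoder("utf-8", { fatal: true });
            return (bytes) => strict.decode(bytes);
        }
        case "REPLACE": {
            const replacing = new TextDecoder("utf-8", { fatal: false });
            return (bytes) => replacing.decode(bytes);
        }
        case "IGNORE":
            return lenientDecode;
        default:
            throw new Error(`unknown VALIDATE_UTF8 mode: ${mode}`);
    }
}

const overlongBytes = new Uint8Array([0xc1, 0xbf]);
console.log(JSON.stringify(createUtf8Decode("IGNORE")(overlongBytes)));
console.log(JSON.stringify(createUtf8Decode("REPLACE")(overlongBytes)));
```

Creating each TextDecoder once inside the factory keeps construction cost out of the per-string path.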

@jcready
Contributor Author

jcready commented Dec 23, 2021

So it turns out that the benchmark code I posted above had a giant bug in it which basically made the benchmark for the new protobufjs utf8 decoder a no-op (no wonder it was so fast). This line

bench('NEW protobufjs utf8  ', () => utf8_read(encoded));

The problem is that utf8_read expects three arguments, not one. 🤦 And it returns an empty string immediately if not provided all three.

Turns out that the new function is basically the same as the old version when properly benchmarked.
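
The effect is easy to reproduce in isolation. The sketch below copies utf8_read from the benchmark and calls it both ways; `buggyResult` and `fixedResult` are illustrative names. With `start` and `end` undefined, `end - start` is NaN, so the early-return guard is skipped, but the loop condition `undefined < undefined` is false and the loop body never runs.

```javascript
function utf8_read(buffer, start, end) {
    if (end - start < 1) {
        return "";
    }
    var str = "";
    for (var i = start; i < end;) {
        var t = buffer[i++];
        if (t <= 0x7F) {
            str += String.fromCharCode(t);
        } else if (t >= 0xC0 && t < 0xE0) {
            str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xE0 && t < 0xF0) {
            str += String.fromCharCode((t & 0xF) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xF0) {
            var t2 = ((t & 7) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
            str += String.fromCharCode(0xD800 + (t2 >> 10));
            str += String.fromCharCode(0xDC00 + (t2 & 0x3FF));
        }
    }
    return str;
}

const encoded = new TextEncoder().encode("hello 🌍");

// The buggy benchmark call: no start/end, so nothing is decoded.
const buggyResult = utf8_read(encoded);

// The corrected call decodes the whole buffer.
const fixedResult = utf8_read(encoded, 0, encoded.length);

console.log(JSON.stringify(buggyResult)); // ""
console.log(fixedResult === "hello 🌍");   // true
```

Comparing each candidate's output against the expected text before timing it would have caught this.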

I've also run the same benchmark code in Firefox 95.0.2, Safari 15.0.0, and Chrome 96.0.4664.110. In Firefox and Safari the native TextDecoder is always faster. In Chrome the native TextDecoder is only slower with strings of 8 and 16 characters and becomes faster at 32. Running the benchmarks against Node (v16 and v17) still produces results similar to before: the native TextDecoder is an order of magnitude slower with 8-character strings, levels off around 128, and becomes faster at 256. Buffer.from(bytes).toString() in Node is always faster than TextDecoder and becomes faster than the protobufjs code at 16-character strings.

I think it may be worth just using TextDecoder and living with the perf penalty on Node. It's definitely safer, reduces the amount of code (and tests needed), and will presumably continually improve in performance as time goes on. If the perf penalty is too big for Node then perhaps the utf8.ts file can attempt to detect if it's running inside node and use the Buffer.from(bytes).toString() implementation instead. Either way it hardly seems worth keeping the protobufjs (old or new) implementation around.

Chrome

### 8 characters
NEW protobufjs utf8  : 2214271.3 ops/s  <==  73.4% faster
TextDecoder#decode   : 1276910.0 ops/s
### 16 characters
NEW protobufjs utf8  : 1290394.0 ops/s  <==   3.9% faster
TextDecoder#decode   : 1241463.5 ops/s
### 32 characters                       <== CHANGES AT 32 CHARACTERS
NEW protobufjs utf8  : 1138826.5 ops/s 
TextDecoder#decode   : 1182473.5 ops/s  <==   3.8% faster
### 64 characters
NEW protobufjs utf8  :  856905.0 ops/s
TextDecoder#decode   :  925059.0 ops/s
### 128 characters
NEW protobufjs utf8  :  580796.0 ops/s
TextDecoder#decode   :  964745.0 ops/s
### 256 characters
NEW protobufjs utf8  :  374586.5 ops/s
TextDecoder#decode   :  780087.0 ops/s  <== 108.3% faster
### 512 characters
NEW protobufjs utf8  :  215834.0 ops/s
TextDecoder#decode   :  544806.0 ops/s
### 1024 characters
NEW protobufjs utf8  :  111727.0 ops/s
TextDecoder#decode   :  357377.5 ops/s
### 2048 characters
NEW protobufjs utf8  :   58382.5 ops/s
TextDecoder#decode   :  205289.0 ops/s
### 4096 characters
NEW protobufjs utf8  :   28693.0 ops/s
TextDecoder#decode   :  113674.0 ops/s
### 8192 characters
NEW protobufjs utf8  :   14933.5 ops/s
TextDecoder#decode   :   58760.0 ops/s
### 16384 characters
NEW protobufjs utf8  :    6836.7 ops/s
TextDecoder#decode   :   27918.0 ops/s  <== 308.4% faster

Firefox

### 8 characters
NEW protobufjs utf8  :  794338.5 ops/s
TextDecoder#decode   : 1685541.5 ops/s  <== 112.2% faster
### 16 characters
NEW protobufjs utf8  :  504443.0 ops/s
TextDecoder#decode   : 1255579.0 ops/s
### 32 characters
NEW protobufjs utf8  :  296564.5 ops/s
TextDecoder#decode   : 1037250.0 ops/s
### 64 characters
NEW protobufjs utf8  :  157725.0 ops/s
TextDecoder#decode   : 1412872.0 ops/s
### 128 characters
NEW protobufjs utf8  :   85595.0 ops/s
TextDecoder#decode   :  939345.2 ops/s
### 256 characters
NEW protobufjs utf8  :   43133.5 ops/s
TextDecoder#decode   :  632268.5 ops/s
### 512 characters
NEW protobufjs utf8  :   22369.0 ops/s
TextDecoder#decode   :  388318.9 ops/s
### 1024 characters
NEW protobufjs utf8  :   11373.5 ops/s
TextDecoder#decode   :  233765.4 ops/s
### 2048 characters
NEW protobufjs utf8  :    5738.5 ops/s
TextDecoder#decode   :  116733.0 ops/s
### 4096 characters
NEW protobufjs utf8  :    2793.5 ops/s
TextDecoder#decode   :   63643.5 ops/s
### 8192 characters
NEW protobufjs utf8  :    1428.5 ops/s
TextDecoder#decode   :   31850.0 ops/s
### 16384 characters
NEW protobufjs utf8  :     713.0 ops/s
TextDecoder#decode   :   17661.5 ops/s  <== 2,377% faster

Safari

### 8 characters
NEW protobufjs utf8  : 4287753.0 ops/s
TextDecoder#decode   : 4413562.5 ops/s  <==   2.9% faster
### 16 characters
NEW protobufjs utf8  : 2776814.5 ops/s
TextDecoder#decode   : 4274627.0 ops/s
### 32 characters
NEW protobufjs utf8  : 1533073.0 ops/s
TextDecoder#decode   : 3865311.5 ops/s  <== 152.1% faster
### 64 characters
NEW protobufjs utf8  :  809855.5 ops/s
TextDecoder#decode   : 3220247.5 ops/s
### 128 characters
NEW protobufjs utf8  :  422076.5 ops/s
TextDecoder#decode   : 2293873.0 ops/s
### 256 characters
NEW protobufjs utf8  :  213128.5 ops/s
TextDecoder#decode   : 1488038.5 ops/s
### 512 characters
NEW protobufjs utf8  :  107337.0 ops/s
TextDecoder#decode   :  965831.5 ops/s
### 1024 characters
NEW protobufjs utf8  :   53825.5 ops/s
TextDecoder#decode   :  563387.5 ops/s
### 2048 characters
NEW protobufjs utf8  :   26785.0 ops/s
TextDecoder#decode   :  317623.0 ops/s
### 4096 characters
NEW protobufjs utf8  :   13167.0 ops/s
TextDecoder#decode   :  168157.5 ops/s
### 8192 characters
NEW protobufjs utf8  :    6443.0 ops/s
TextDecoder#decode   :   87202.0 ops/s
### 16384 characters
NEW protobufjs utf8  :    3260.0 ops/s
TextDecoder#decode   :   44552.0 ops/s  <== 1,266% faster

Here is the new benchmark code you can run yourself:

function utf8_read(buffer, start, end) {
    if (end - start < 1) {
        return "";
    }

    var str = "";
    for (var i = start; i < end;) {
        var t = buffer[i++];
        if (t <= 0x7F) {
            str += String.fromCharCode(t);
        } else if (t >= 0xC0 && t < 0xE0) {
            str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xE0 && t < 0xF0) {
            str += String.fromCharCode((t & 0xF) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xF0) {
            var t2 = ((t & 7) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
            str += String.fromCharCode(0xD800 + (t2 >> 10));
            str += String.fromCharCode(0xDC00 + (t2 & 0x3FF));
        }
    }

    return str;
}

function bench(name, fn, durationSeconds = 2) {
    let startTs = performance.now();
    let endTs = startTs + durationSeconds * 1000;
    let samples = 0;
    while (performance.now() < endTs) {
        fn();
        samples++;
    }
    let durationMs = performance.now() - startTs;
    let opsPerSecond = 1000 / (durationMs / samples);
    console.log(`${name}: ${Math.round(opsPerSecond * 100) / 100} ops/s`);
}

let textDecoder = new TextDecoder();
let textEncoder = new TextEncoder();

function nativeEncode(text) {
    return textEncoder.encode(text);
}

function nativeDecode(bytes) {
    return textDecoder.decode(bytes);
}

function newJsDecode(bytes) {
    return utf8_read(bytes, 0, bytes.length);
}
 
const fixtures = [];
for (let i = 3; i <= 14; i++) {
    let j = Math.pow(2, i);
    fixtures.push({
        characters: j,
        text: "hello 🌍".repeat(j / 8),
    });
}

for (let fix of fixtures) {
    console.log(`### ${fix.characters} characters`);
    const encoded = nativeEncode(fix.text);
    bench('NEW protobufjs utf8  ', () => newJsDecode(encoded));
    bench('TextDecoder#decode   ', () => nativeDecode(encoded));
}

@timostamm timostamm added the enhancement New feature or request label Dec 30, 2021
@timostamm
Owner

timostamm commented Dec 30, 2021

Thank you for the benchmarks, @jcready!

Taking a step back:

1. Invalid UTF-8

Going with TextDecoder means invalid UTF-8 will be converted into replacement characters. (There is an option to throw, but Firefox does not support it). In my mind, it would obviously be best to not touch invalid UTF-8. But it might be acceptable. The language guide has a note about the string type:

A string must always contain UTF-8 encoded or 7-bit ASCII text [...]

2. Performance

We let users pass a TextEncoder to our BinaryWriter (via options). If we do the same for TextDecoder, Node.js users who want to be mindful about performance can actually pass in an object that uses Buffer.toString.

I think we probably want to be a bit more lax with the typings to make it easier to use, and make defaults accessible at some point in the future.

@timostamm timostamm changed the title from "Bug: utf8read produces invalid string when decoding surragate pairs" to "Use TextDecoder API for decoding UTF-8 from binary data" Dec 30, 2021
@jcready
Contributor Author

jcready commented Dec 31, 2021

"Going with TextDecoder means invalid UTF-8 will be converted into replacement characters. (There is an option to throw, but Firefox does not support it)."

Your link points to the TextDecoderStream#fatal option, but every major browser supports TextDecoder#fatal.

@timostamm
Owner

e1cd360 switches to the TextDecoder API. utf8read() is still exported, but marked deprecated. TextDecoder and TextEncoder can easily be swapped out by alternative implementations, for example to tweak performance on Node.js or for more lenient handling of invalid UTF-8.

I'll close this issue when it is released.

Updated the manual:


JavaScript uses UTF-16 for strings, but protobuf uses UTF-8. In order
to serialize to and from binary data, protobuf-ts converts between the
encodings with the TextEncoder / TextDecoder API.

Note that the protobuf language guide states:

A string must always contain UTF-8 encoded or 7-bit ASCII text [...]

If an invalid UTF-8 string is encoded in the binary format, protobuf-ts
will raise an error on decoding through the TextDecoder option fatal.
If you do not want that behaviour, use the readerFactory option to
pass your own TextDecoder instance.

As of January 2022, performance of TextDecoder on Node.js falls behind
Node.js' Buffer. In order to use Buffer to decode UTF-8, use the
readerFactory option:

const nodeBinaryReadOptions = {
    readerFactory: (bytes: Uint8Array) => new BinaryReader(bytes, {
        decode(input?: Uint8Array): string {
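            // The cast assumes `input` really is a Node.js Buffer. On a plain
            // Uint8Array, toString("utf8") would return comma-separated numbers.
            return input ? (input as Buffer).toString("utf8") : "";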
        }
    })
};
MyMessage.fromBinary(bytes, nodeBinaryReadOptions);

@timostamm
Owner

Released as v2.2.0-alpha.0 in the next channel.

@jimmywarting

Just noticed this issue was linked to my issue on the nodejs repo.
Seems like they fixed some performance stuff with regards to using TextDecoder, so I would bet on using TextDecoder over buf.toString()

I always prefer native features that aren't tied to internal Node.js modules, as it makes the code more cross-compatible and lets it work in more environments

@timostamm
Copy link
Owner

Thanks for the shout, @jimmywarting. Agree with your stance. Node.js is picking up the Web Streams API and the fetch API too.

@jcready
Contributor Author

jcready commented Jul 8, 2023

Just an update here. It looks like as of node v18.13 the performance of TextDecoder.decode() now seems to be superior to Buffer.toString() in all cases and superior to both the old and new protobufjs utf8 decoders once the string is at least 16 characters long.

### 8 characters
TextDecoder#decode   : 5,078,129.606 ops/s
OLD protobufjs utf8  : 4,906,884.129 ops/s
NEW protobufjs utf8  : 5,321,692.092 ops/s
Buffer#toString      : 2,475,237.512 ops/s

### 16 characters
TextDecoder#decode   : 4,478,302.979 ops/s
OLD protobufjs utf8  : 4,210,680.676 ops/s
NEW protobufjs utf8  : 2,394,999.767 ops/s
Buffer#toString      : 1,403,588.265 ops/s

### 32 characters
TextDecoder#decode   : 3,715,393.385 ops/s
OLD protobufjs utf8  : 2,821,617.056 ops/s
NEW protobufjs utf8  : 1,808,990.853 ops/s
Buffer#toString      :   788,605.779 ops/s

### 64 characters
TextDecoder#decode   : 2,673,300.363 ops/s
OLD protobufjs utf8  : 1,692,815.943 ops/s
NEW protobufjs utf8  : 1,301,040.713 ops/s
Buffer#toString      :   425,390.725 ops/s

@eduardhasanaj

TextEncoder/Decoder are not supported in react native. Is there any workaround or does protobuf-ts have any flag to use Buffer.toString?

@jcready
Contributor Author

jcready commented Jul 31, 2023

You can use a polyfill like https://github.com/samthor/fast-text-encoding or you can specify your own readerFactory/writerFactory when calling .fromBinary()/.toBinary():

import {BinaryReader, BinaryWriter} from "@protobuf-ts/runtime";

export const binaryReadOptions = {
    readerFactory: (bytes: Uint8Array) => new BinaryReader(bytes, {
        decode(input?: Uint8Array): string {
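            // The cast assumes `input` really is a Node.js Buffer. On a plain
            // Uint8Array, toString("utf8") would return comma-separated numbers.
            return input ? (input as Buffer).toString("utf8") : "";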
        }
    })
};

export const binaryWriteOptions = {
    writerFactory: () => new BinaryWriter({
        encode(input?: string): Uint8Array {
            return Buffer.from(input || "", "utf8");
        }
    })
};

// elsewhere
object = MyMessage.fromBinary(bytes, binaryReadOptions);
bytes  = MyMessage.toBinary(object, binaryWriteOptions);

@eduardhasanaj

eduardhasanaj commented Aug 19, 2023

@jcready Sorry for the late response. I thought I wrote to you before my summer vacations :(
First thanks for your suggestion on using a polyfill.
I tried fast-text-encoding and I got the following error:

Failed to construct TextDecoder. The fatal option is unsupported.

Will try another polyfill.

@eduardhasanaj

eduardhasanaj commented Aug 20, 2023

I added fatal mode to the polyfill.
See this comment samthor/fast-text-encoding#24 (comment) in case you need it immediately.
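
An alternative to patching the polyfill is to probe for `fatal` support and fall back to replacement-character decoding. The sketch below is only an illustration: `createUtf8Decoder` is a hypothetical helper, and `NoFatalDecoder` is a stand-in that mimics a polyfill rejecting the option. Note that the fallback silently gives up strict validation, which may or may not be acceptable.

```javascript
// Prefer strict decoding; fall back if the implementation rejects `fatal`.
function createUtf8Decoder(TextDecoderImpl = TextDecoder) {
    let decoder;
    let strict = true;
    try {
        decoder = new TextDecoderImpl("utf-8", { fatal: true });
    } catch (err) {
        strict = false;
        decoder = new TextDecoderImpl("utf-8");
    }
    return { strict, decode: (bytes) => decoder.decode(bytes) };
}

// Stand-in for a polyfill without `fatal` support (for illustration only).
class NoFatalDecoder {
    constructor(label, options) {
        if (options && options.fatal) {
            throw new Error("Failed to construct TextDecoder. The fatal option is unsupported.");
        }
        this.inner = new TextDecoder(label);
    }
    decode(bytes) {
        return this.inner.decode(bytes);
    }
}

const fallback = createUtf8Decoder(NoFatalDecoder);
const native = createUtf8Decoder();
const hiBytes = new Uint8Array([0x68, 0x69]);

console.log(fallback.strict, fallback.decode(hiBytes));
console.log(native.strict, native.decode(hiBytes));
```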

@jimmywarting

The MDN compatibility table shows that TextEncoder/TextDecoder are pretty widely available. Is a polyfill really necessary?
What environment are you supporting?
And is it something that the consumer of this library could install themselves?

@eduardhasanaj

@jimmywarting React Native.

@jimmywarting

React Native might be considered more niche compared to web and backend technologies, and its user base might not be as extensive. When I install/use protobuf, I don't need a polyfill. React Native developers could include a TextEncoder/TextDecoder polyfill while also installing protobuf. I personally prefer not to add unnecessary dependencies that I don't actually require.

@eduardhasanaj

@jimmywarting please see the last 4-5 messages for why a polyfill is needed in RN.
Of course you do not need it in Node and browsers. When I made the comment I had RN in mind.
