Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

vit-zikmund · 2024-12-11T20:50:24Z

Bug Report

Describe the bug
Per the JSON specification, every character in a JSON string is a Unicode character, and any of them can be escaped using the \uXXXX notation. A single escape only covers code points in the Basic Multilingual Plane (U+0000 - U+FFFF). Characters from higher Unicode planes, such as emojis, are specified to be escaped as a UTF-16 surrogate pair, e.g. \ud83e\udd17 for the "hugging face" emoji 🤗 (U+1F917).

An escaped surrogate pair needs to be parsed as a single character. fluent-bit instead parses it as two standalone Unicode code points (U+D83E and U+DD17), which are lone surrogates and are in fact forbidden in a well-formed Unicode string.
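For illustration, here is a minimal Python sketch of the standard UTF-16 arithmetic that turns a surrogate pair into one code point. The function name `combine_surrogates` is made up for this example and is not taken from fluent-bit's source.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point.

    Hypothetical helper for illustration only, not fluent-bit code.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83E, 0xDD17)
print(f"U+{code_point:X}", chr(code_point))  # U+1F917 🤗
```

A parser that sees \ud83e followed by \udd17 would be expected to emit this single code point rather than two separate ones.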

To Reproduce
Set up a simple stdin to stdout pipeline and pass {"text": "\ud83e\udd17"} to stdin.
The output is mangled:

[{"date":1733946314.083699,"text":"������"}]

Expected behavior
The output message should be:

[{"date":1733946229.895005,"text":"🤗"}]

Your Environment

Version used: 3.2.2
Configuration: -i stdin -o stdout
Environment name and version (e.g. Kubernetes? What version?): docker image, 3.2.2
Server type and version: n/a
Operating System and version: Fedora Linux 40
Filters and plugins: none

Additional context
This mangles data dumped by the Python json module in our destination log database. The module escapes all non-ASCII characters by default, so the dumped string is actually pure ASCII.

There's a workaround to make the dumper use Unicode strings, which don't trigger the problem in fluent-bit.
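A minimal sketch of that workaround, assuming the producer uses the standard library `json` module: passing `ensure_ascii=False` to `json.dumps` keeps the emoji as a literal character instead of an escaped surrogate pair. The `record` dict is just sample data.

```python
import json

record = {"text": "\U0001F917"}  # the hugging face emoji, U+1F917

# Default behaviour: non-ASCII characters are escaped, so the emoji
# becomes the surrogate pair that triggers the problem.
escaped = json.dumps(record)
print(escaped)  # {"text": "\ud83e\udd17"}

# Workaround: emit the character itself instead of \u escapes.
raw = json.dumps(record, ensure_ascii=False)
print(raw)  # {"text": "🤗"}
```

With `ensure_ascii=False` the output is no longer pure ASCII, so whatever writes it to fluent-bit's stdin needs to encode it as UTF-8.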


vit-zikmund · 2024-12-12T08:33:08Z

FYI, the reason there are 6 replacement characters (�) in place of the 🤗 emoji is that the standalone high surrogate U+D83E is UTF-8 encoded as ed a0 be (3 bytes that don't map to a valid character) and the low surrogate U+DD17 is encoded as ed b4 97 (the other 3 invalid bytes).
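The byte values can be checked with a short Python sketch. It only mimics the effect described above (encoding each surrogate on its own, then decoding the result as UTF-8); it is not fluent-bit's actual code path, and the variable names are illustrative.

```python
# Two lone surrogates, as a parser would hold them if it did not
# combine the escaped pair into one code point.
lone_surrogates = "\ud83e\udd17"

# "surrogatepass" lets Python encode each surrogate individually,
# producing bytes that are not valid UTF-8.
raw_bytes = lone_surrogates.encode("utf-8", errors="surrogatepass")
print(raw_bytes.hex(" "))  # ed a0 be ed b4 97

# A consumer decoding those bytes as UTF-8 substitutes U+FFFD
# for every byte it cannot interpret.
shown = raw_bytes.decode("utf-8", errors="replace")
print(len(shown))  # 6
```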

vit-zikmund added the status: waiting-for-triage label Dec 11, 2024
