Describe the bug
By specification, every character in a JSON string is Unicode, and any of them can be escaped using the \u#### notation. A single escape covers only code points in the Basic Multilingual Plane (U+0000 to U+FFFF). Code points in higher planes, such as emoji, must be encoded as a UTF-16 surrogate pair, e.g. \ud83e\udd17 for the "hugging face" emoji 🤗 (U+1F917).
Such an escaped surrogate pair must be parsed as a single character, but fluent-bit parses it as two standalone Unicode code points (U+D83E and U+DD17), which are not allowed to appear on their own in a well-formed Unicode string.
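For illustration, this is a minimal sketch of how a parser could combine the two escapes into one code point. The function name `combine_surrogates` is hypothetical and is not taken from fluent-bit; the arithmetic is the standard UTF-16 decoding rule.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83E, 0xDD17)
print(f"U+{code_point:X}")  # U+1F917
print(chr(code_point))      # the hugging face emoji
```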
To Reproduce
Set up a simple stdin to stdout pipeline and pass {"text": "\ud83e\udd17"} to stdin.
The output is mangled:
[{"date":1733946314.083699,"text":"������"}]
Expected behavior
The output message should be:
[{"date":1733946229.895005,"text":"🤗"}]
Your Environment
Version used: 3.2.2
Configuration: -i stdin -o stdout
Environment name and version (e.g. Kubernetes? What version?): docker image, 3.2.2
Server type and version: n/a
Operating System and version: Fedora Linux 40
Filters and plugins: none
Additional context
This mangles data dumped by the Python json module in our destination log database. By default the module escapes all non-ASCII characters, so the dumped string is pure ASCII.
As a workaround, the dumper can be made to emit unescaped Unicode strings, which do not trigger the problem in fluent-bit.
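The report does not say how the workaround is configured. One way to get unescaped output from the standard library, assuming `json.dumps` is the dumper in use, is its `ensure_ascii` parameter:

```python
import json

record = {"text": "\U0001F917"}  # the hugging face emoji

# Default: non-ASCII characters are escaped, so the emoji becomes
# a surrogate pair and the output is pure ASCII.
escaped = json.dumps(record)
print(escaped)  # {"text": "\ud83e\udd17"}

# Workaround sketch: emit the character itself instead of escapes.
raw = json.dumps(record, ensure_ascii=False)
print(raw)
```

Both strings are valid JSON and decode to the same value; only the escaped form hits the parsing bug. The unescaped output must be written as UTF-8 for the pipeline to receive it intact.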
FYI, there are 6 replacement characters (�) in place of the 🤗 emoji because the standalone high surrogate U+D83E is UTF-8 encoded as ed a0 be (3 bytes that do not form a valid UTF-8 sequence) and the low surrogate U+DD17 is encoded as ed b4 97 (another 3 invalid bytes). Each invalid byte is shown as one replacement character.
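The byte values above can be checked with a short Python sketch. It uses the `surrogatepass` error handler to force the encoding of lone surrogates, which only mimics what fluent-bit appears to do and is not its actual code path.

```python
# Encode each surrogate on its own, as if it were an ordinary code point.
high = "\ud83e".encode("utf-8", "surrogatepass")
low = "\udd17".encode("utf-8", "surrogatepass")
print(high.hex(" "), low.hex(" "))  # ed a0 be ed b4 97

# A strict UTF-8 consumer rejects these bytes; with errors="replace"
# each invalid byte is turned into U+FFFD.
mangled = (high + low).decode("utf-8", errors="replace")
print(len(mangled))

# The correct encoding of U+1F917 is a single 4-byte sequence.
correct = "\U0001F917".encode("utf-8")
print(correct.hex(" "))  # f0 9f a4 97
```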