Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

Open
vit-zikmund opened this issue Dec 11, 2024 · 1 comment

Comments

@vit-zikmund

Bug Report

Describe the bug
Per the JSON specification, every character in a JSON string is Unicode, and any of them can be escaped using the \u#### notation. A single escape only covers code points in the Basic Multilingual Plane (U+0000 - U+FFFF). Code points in higher Unicode planes, such as emoji, are specified to be escaped as a UTF-16 surrogate pair, e.g. \ud83e\udd17, the surrogate pair for the "hugging face" emoji 🤗 U+1F917.

This escaped surrogate pair needs to be parsed as a single character, but fluent-bit parses it as two standalone Unicode code points (i.e. U+D83E and U+DD17), which are forbidden from appearing on their own in a well-formed Unicode string.
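For reference, here is a minimal sketch of how a parser is expected to combine the two escapes into one code point. It illustrates the standard UTF-16 arithmetic only and is not fluent-bit's code; the helper name `combine_surrogates` is made up for this example.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the code point's offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83E, 0xDD17)
print(hex(code_point))  # 0x1f917

# Python's own JSON parser performs the same combination when decoding.
assert chr(code_point) == json.loads('"\\ud83e\\udd17"')
```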

To Reproduce
Set up a simple stdin-to-stdout pipeline and pass {"text": "\ud83e\udd17"} to stdin.
The output is mangled:

[{"date":1733946314.083699,"text":"������"}]

Expected behavior
The output message should be:

[{"date":1733946229.895005,"text":"🤗"}]

Your Environment

  • Version used: 3.2.2
  • Configuration: -i stdin -o stdout
  • Environment name and version (e.g. Kubernetes? What version?): docker image, 3.2.2
  • Server type and version: n/a
  • Operating System and version: Fedora Linux 40
  • Filters and plugins: none

Additional context
This mangles data dumped by the Python json module (its default is to escape all non-ASCII characters, so the dumped string is pure ASCII) in our destination log database.

There is a workaround: make the dumper emit unescaped Unicode strings, which don't trigger the problem in fluent-bit.
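A minimal sketch of that workaround on the producer side, assuming the data is written with `json.dumps` from the standard library. The `record` value is just the sample from this report.

```python
import json

record = {"text": "\U0001F917"}  # the hugging face emoji, U+1F917

# Default behaviour: non-ASCII characters are escaped, producing the
# \ud83e\udd17 surrogate pair that triggers the bug.
escaped = json.dumps(record)

# Workaround: keep the characters as-is, so the output has no \u escapes.
literal = json.dumps(record, ensure_ascii=False)

print(escaped)
print(literal)
```

With `ensure_ascii=False` the output is no longer pure ASCII, so it has to be written to the pipeline as UTF-8.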

@vit-zikmund
Author

FYI, there are 6 replacement characters (�) in place of the 🤗 emoji because the standalone high surrogate U+D83E is UTF-8 encoded as ed a0 be (3 bytes that don't map to a valid character) and the low surrogate U+DD17 is encoded as ed b4 97 (another 3 invalid bytes).
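The byte sequences can be reproduced in Python. This is only a sketch that mimics the effect: it forces the two lone surrogates through a UTF-8 encoder with the `surrogatepass` error handler and then decodes the result the way a strict UTF-8 consumer would. It says nothing about how fluent-bit produces the bytes internally, and the helper name `mangle` is made up for this example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def mangle(text: str):
    """Encode lone surrogates as if they were ordinary code points, then
    decode the bytes the way a strict UTF-8 consumer would display them."""
    raw = text.encode("utf-8", errors="surrogatepass")
    shown = raw.decode("utf-8", errors="replace")
    return raw, shown


# Two separate surrogate code points, as in the mis-parsed string.
raw, shown = mangle("\ud83e\udd17")

print(raw.hex(" "))              # bytes of both surrogates, space separated
print(shown.count(REPLACEMENT))  # number of replacement characters shown
```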
