Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix symbolize_keys: true to properly create UTF-8 symbols #246

Merged
merged 1 commit into from
Feb 8, 2022

Conversation

casperisfine
Copy link

When symbolize_keys: true was set, we'd already read the string as binary, even if the native type is UTF-8. This was causing non-ASCII keys to be improperly deserialized.

rb_str_intern will still notice strings that are ASCII only, and return ASCII symbol for them. So there's no backward compatibility concerns.

This fixes: ruby-i18n/i18n#606

@casperisfine
Copy link
Author

@tagomoris apologies for the ping, but do you think you'll have time to review this PR soon?

It's causing issue for people using i18n 1.9.1 and bootsnap, so I'd like to know if I should find some workaround to prevent more people from hitting this bug.

@tagomoris
Copy link
Member

@casperisfine Probably I'll be able to find time to review changes early next week

@@ -494,9 +494,9 @@ static inline VALUE msgpack_buffer_read_top_as_string(msgpack_buffer_t* b, size_
#endif // HAVE_RB_ENC_INTERNED_STR
}

static inline VALUE msgpack_buffer_read_top_as_symbol(msgpack_buffer_t* b, size_t length)
static inline VALUE msgpack_buffer_read_top_as_symbol(msgpack_buffer_t* b, size_t length, int utf8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msgpack_buffer_read_top_as_string defines the last argument as bool.
Is there any reason to use int here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, just an oversight, I'll fix it right now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@casperisfine casperisfine force-pushed the symbolize-keys-fix-encoding branch from 69230d6 to 0179a1d Compare February 8, 2022 12:51
Comment on lines 303 to 316
it 'MessagePack.unpack symbolize_keys preserve encoding' do
hash = { :ascii => 1, :utf8_é => 2}
loaded_hash = MessagePack.load(MessagePack.pack(hash), :symbolize_keys => true)

loaded_hash.keys[0].should == hash.keys[0]
loaded_hash.keys[0].encoding.should == hash.keys[0].encoding

loaded_hash.keys[1].should == hash.keys[1]
loaded_hash.keys[1].encoding.should == hash.keys[1].encoding

MessagePack.unpack(MessagePack.pack(hash), :symbolize_keys => true).should == hash
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case is not enough clear to me because the encodings of :ascii and :utf8_é depend on the behavior of Ruby runtime and be are not described in this test case.
IMO it will be more clear to me if the original encoding of serialized symbols and expected encoding of deserialized symbols are written explicitly.
@casperisfine What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'll update right away.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it look now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks!

When `symbolize_keys: true` was set, we'd already read the string
as binary, even if the native type is UTF-8. This was causing non-ASCII
keys to be improperly deserialized.

`rb_str_intern` will still notice strings that are ASCII only, and return
ASCII symbol for them. So there's no backward compatibility concerns.
@casperisfine casperisfine force-pushed the symbolize-keys-fix-encoding branch from 0179a1d to 01b57bd Compare February 8, 2022 12:58
@tagomoris
Copy link
Member

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants