Add specs for symbols encoding preservation #211
It is true that the encodings of symbols are not currently preserved. Breaking compatibility is not an option, so how about adding a new pair of methods to serialize and deserialize the encoding along with the content? That would provide the option to store encodings (at the cost, of course, of a larger binary). Anyway, the original problem is still unclear to me. Is the difference in Symbol encodings causing problems in your case?
Surprisingly no, it didn't. Likely because non-ASCII symbols are very rare, so it somehow works out. But I figured this was a bug worth reporting.
Agreed.
I'm not sure I understand what you are suggesting.
👍 Indeed. It's an unexpected thing to me, and good to know.
My idea is to have [...] Anyway, if no one feels that the encoding of symbols is a serious problem, a much larger symbol representation in msgpack with encoding would not be welcomed by everyone.
The last one wins. We could indeed have a flag or something to change that behavior, but then what would people do if they somehow need to support both formats? I agree this is tricky. I'm not even sure what a good way to store the encoding would be. We could probably assume that no associated encoding means UTF-8, and that would work for 99% of users. However, I know that it's not uncommon in the Japanese community to have source files in other encodings (Shift_JIS? not sure). So I'm tempted to suggest I'm not even sure how we could properly store an associated encoding, like is the ordering of
Interestingly, this is the cause of ruby-i18n/i18n#606. In short, some I18n data files have UTF-8 symbol keys, so after a roundtrip through MessagePack they come back as ASCII-8BIT, aka BINARY. I can work around it with the hack suggested above, but I figured it was worth reporting.
So actually the I18n issue wasn't with |
Ref: msgpack/msgpack-ruby#211 The default msgpack Symbol packer/unpacker is not encoding-aware, which causes all non-ASCII symbols to be unpacked with ASCII-8BIT encoding, aka BINARY. So we define a custom packer that prefixes the symbol name with the encoding index. An alternative could be to simply assume non-ASCII symbols are UTF-8, but it wouldn't work for people with non-UTF-8 source files.
Ref: msgpack/msgpack-ruby#211 The default msgpack Symbol packer/unpacker is not encoding-aware, which causes all non-ASCII symbols to be unpacked with ASCII-8BIT encoding, aka BINARY. So we define a custom packer that prefixes the symbol name with the encoding index. Note that the encoding index isn't fixed across Ruby platforms and versions, but the cache versioning should protect us from that. An alternative could be to simply assume non-ASCII symbols are UTF-8, but it wouldn't work for people with non-UTF-8 source files.
Ref: msgpack/msgpack-ruby#211 The default msgpack Symbol packer/unpacker is not encoding-aware, which causes all non-ASCII symbols to be unpacked with ASCII-8BIT encoding, aka BINARY. So we define a custom packer that prefixes the symbol name with `1` for UTF-8 symbols, and `0` for the others (ASCII or binary). If `Encoding.default_internal` is set to something MessagePack doesn't support, we entirely disable the YAML cache.
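The `1`/`0` prefix scheme described in this commit message could look roughly like the sketch below. It is not the actual bootsnap code, which is not shown in this thread: `pack_symbol` and `unpack_symbol` are hypothetical helper names, and the sketch works on plain Ruby strings rather than going through MessagePack itself.

```ruby
# Hypothetical helpers illustrating the flag-prefix idea; not bootsnap's code.

# Builds the binary payload: one flag byte followed by the symbol's bytes.
# "1" marks a UTF-8 symbol, "0" marks anything else (US-ASCII or binary).
def pack_symbol(symbol)
  flag = symbol.encoding == Encoding::UTF_8 ? "1" : "0"
  flag.b + symbol.to_s.b
end

# Reads the flag byte back and restores the UTF-8 encoding when it is set.
def unpack_symbol(payload)
  flag = payload.byteslice(0, 1)
  name = payload.byteslice(1..-1)
  name.force_encoding(Encoding::UTF_8) if flag == "1"
  name.to_sym
end

utf8_symbol = "f\u00E9e".to_sym # "\u" escapes always yield a UTF-8 String
restored    = unpack_symbol(pack_symbol(utf8_symbol))
puts restored == utf8_symbol    # the symbol survives the round trip
```

In a real setup, callbacks like these would presumably be registered as the packer and unpacker of a custom Symbol extension type.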
So after working on Shopify/bootsnap#398, I think we could actually fix this without many compatibility concerns. MessagePack strings are UTF-8 only, and
Closing in favor of #248
This PR only includes failing specs because I'm unsure how it could be fixed without being a breaking change.
Ruby Symbols, like Ruby Strings, have an associated `encoding` property. The only difference is that, when possible, it will be "cast" to `US_ASCII`.
The problem here is that extension types are stored as binary strings, so `Symbol.from_msgpack_ext` receives a String with `Encoding::BINARY` (AKA `ASCII-8BIT`), so after a roundtrip, any non-ASCII symbol will be improperly restored.
I don't know enough about the msgpack format to figure out what could be done here, and I presume breaking backward compatibility would be an issue, so storing the encoding alongside the symbol string is likely not possible.
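The encoding behavior described above can be reproduced without MessagePack at all. The sketch below is only an illustration: `binary_roundtrip` is a hypothetical helper that mimics what the unpacker effectively sees (the symbol's bytes, tagged as binary), not a function from the gem.

```ruby
# Hypothetical helper: mimics a round trip through a binary payload by
# tagging the symbol's bytes as BINARY (ASCII-8BIT) before converting back.
def binary_roundtrip(symbol)
  symbol.to_s.b.to_sym
end

ascii_symbol = :foo
utf8_symbol  = "f\u00E9e".to_sym # "\u" escapes always yield a UTF-8 String

# ASCII-only symbols are normalized to US-ASCII, so they survive unchanged.
puts ascii_symbol.encoding == Encoding::US_ASCII
puts binary_roundtrip(ascii_symbol) == ascii_symbol

# A non-ASCII symbol comes back with the binary encoding, which makes it
# a different symbol even though the bytes are identical.
restored = binary_roundtrip(utf8_symbol)
puts restored.encoding == Encoding::BINARY
puts restored == utf8_symbol
```

The last line prints `false`, which is the mismatch the failing specs are meant to capture.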
One hack I'm thinking about is that we could assume that non-ASCII symbols are UTF-8.
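A minimal sketch of that hack, written as a standalone function so it runs without the gem: `symbol_from_payload` is a hypothetical name standing in for the body of `Symbol.from_msgpack_ext`, and the fallback to the raw bytes when they are not valid UTF-8 is an extra assumption on top of the idea as stated.

```ruby
# Hypothetical stand-in for the body of Symbol.from_msgpack_ext.
# `data` is the binary payload handed to the unpacker.
def symbol_from_payload(data)
  # ASCII-only names end up as US-ASCII symbols regardless of the tag.
  return data.to_sym if data.ascii_only?

  # Assume non-ASCII names were UTF-8 (dup so the payload is left untouched).
  utf8 = data.dup.force_encoding(Encoding::UTF_8)

  # Assumption: keep the binary symbol when the bytes are not valid UTF-8.
  utf8.valid_encoding? ? utf8.to_sym : data.to_sym
end

payload = "f\u00E9e".b # the bytes of a UTF-8 symbol, tagged as BINARY
puts symbol_from_payload(payload) == "f\u00E9e".to_sym
```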
I think that would fix the issue for the vast majority of users, but it still wouldn't be quite correct for people using something other than UTF-8 as their source encoding.