Relax bare key restrictions to allow additional unicode letters and numbers #687
Comments
I wholeheartedly support this. Identifiers in non-English languages should not be discriminated against in TOML (and having to put them in quotes is a form of discrimination). [Note: the following refers to the original form of the proposal, which since then has been considerably extended.] Admittedly, with this proposal this would still only be the case for languages using the Latin script, but not for Russian, Arabic, Chinese etc. But it would still be a step in the right direction – and I see that there might be issues with allowing, say, Cyrillic letters, because they might be used to spoof a key that looks like ASCII but actually isn't. With Latin diacritics, this risk is much lower. For completeness, I'd suggest to also support Latin Extended-B (Pan-Nigerian alphabet, Pinyin, Romanian), Latin Extended Additional (Vietnamese), and Latin Extended-C (Shona, a Bantu language). |
So there would be no simple rule for a human writer to decide whether quotes are necessary. |
Do "modern C++ compilers" support most of Unicode, or only chosen subset of Latin Extended? |
@ChristianSi Certainly there's lots of additional characters we could add. My list of suggestions was in no way meant to be exhaustive and I'm hoping that a more useful set of ranges is borne out of discussion. Being an Australian who only speaks English does limit my perspective a bit here!
Technically-speaking the rule could be: quotes if you need whitespace, an escape code, or a TOML-reserved character, otherwise anything goes. My feeling is that requiring users to think about this at all is getting away from the design goals of TOML and drifting too far into Think-Like-A-Programmer territory, which risks defeating the purpose of a simple config file format that intends to "just work" the way people would expect in the layman case.
Absolutely no idea. Here's a live demo of some Unicode on Clang, GCC and MSVC; I encourage you to experiment. I'm certain if we needed a more definitive answer we could read the compiler source code (Clang or GCC) or ask the relevant developers (MSVC). |
I'm on board. I like the way Python handles alphabetical characters:
That should be doable for us, but I wonder how much it could complicate TOML implementations, which are currently just doing a simple ASCII-value-range check. This also means we'd have to add a lot to TOML's ABNF. |
@pradyunsg: I too would be fine with saying "Bare keys may contain arbitrary Unicode letters as well as ASCII digits, underscores, and dashes". But in effect this would likely mean that implementations would have to depend on some kind of Unicode library – easy in Python, where the An advantage of @marzer 's original proposal, or my somewhat modified one, is that it would be easy to enumerate the affected ranges manually, in the ABNF or in code. With arbitrary Unicode letters this likely becomes effectively impossible – especially since the ranges would have to be extended with each new version of the Unicode standard. But, of course, we might also decide to allow letters in arbitrary languages and scripts, not just Latin-based ones, and accept the Unicode dependency. |
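To make the "enumerate the ranges manually in code" option concrete, here is a minimal Python sketch of a range-table lookup that needs no Unicode library. The table below is only an illustrative excerpt (ASCII plus a few Latin blocks) and the names `BARE_KEY_LETTER_RANGES`, `in_ranges` and `is_bare_key_char` are hypothetical; a real implementation would list whatever ranges the spec ends up allowing.

```python
from bisect import bisect_right

# Illustrative excerpt only (an assumption for this sketch), sorted by start.
BARE_KEY_LETTER_RANGES = [
    (0x41, 0x5A),    # A-Z
    (0x61, 0x7A),    # a-z
    (0xC0, 0xD6),    # Latin-1 Supplement letters (skips U+00D7, multiplication sign)
    (0xD8, 0xF6),    # Latin-1 Supplement letters (skips U+00F7, division sign)
    (0xF8, 0xFF),    # Latin-1 Supplement letters
    (0x100, 0x17F),  # Latin Extended-A
]
_RANGE_STARTS = [start for start, _ in BARE_KEY_LETTER_RANGES]


def in_ranges(ch: str) -> bool:
    """Return True if the code point of `ch` falls inside one of the ranges."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(_RANGE_STARTS, cp) - 1
    return i >= 0 and cp <= BARE_KEY_LETTER_RANGES[i][1]


def is_bare_key_char(ch: str) -> bool:
    """Letters from the table, plus the ASCII digits, '_' and '-' TOML already allows."""
    return in_ranges(ch) or ch in "0123456789_-"
```

The binary search keeps the check cheap even when the table grows to several hundred ranges, which is the scale that allowing all Unicode letters would need.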
Here's a listing of all the Unicode Character Categories and all the characters that belong to each one. Unsurprisingly, it's quite long! |
@ChristianSi Given that TOML is supposed to be UTF-8 I'm inclined to think that requiring implementations use unicode machinery, hand-rolled or otherwise, isn't really a big deal, regardless of the direction this proposal takes. @pradyunsg I too like the python approach, and I don't think it would be too hard to implement. I'm currently writing a TOML library of my own and I'd be happy to build it into my utf-8 decoder as a proof-of-concept, if that's useful. |
@ChristianSi I wrote a script to scrape letter characters from the website you linked, sort them and list them as ranges. If you omit the letter categories removed, see below @pradyunsg I don't know much about ABNF, but if characters can be expressed as ranges this wouldn't be much work. |
@marzer Interesting. But the problem is that nearly all Unicode letters are in the "Letter, other" (Lo) category – 97 percent according to Wikipedia. Ignoring them, you only get the letters in alphabets that distinguish between upper case and lower case forms – Latin, Cyrillic, Greek, and a few others. But most writing systems don't – e.g. those used to write Chinese, Arabic, Hebrew, Korean, and certain Indian languages such as Tamil and Telugu know no such distinction. Hence their letters go into the "other" category. What happens when you consider all letters? I suppose ranges become a bit unwieldy? |
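The split between cased and uncased scripts is easy to check with Python's standard `unicodedata` module. This is only a quick illustration; the sample characters are my own picks, and the helper name `category_of` is hypothetical.

```python
import unicodedata

# One sample letter per script (chosen for illustration only).
SAMPLES = {
    "Latin": "a",
    "Cyrillic": "я",
    "Greek": "\u03a9",   # GREEK CAPITAL LETTER OMEGA
    "Chinese": "中",
    "Arabic": "ب",
    "Hebrew": "א",
    "Korean": "한",
    "Tamil": "த",
}


def category_of(ch: str) -> str:
    """Return the two-letter Unicode general category of a single character."""
    return unicodedata.category(ch)


for script, ch in SAMPLES.items():
    print(f"{script:9} U+{ord(ch):04X} {category_of(ch)}")
```

The cased scripts report `Ll` or `Lu`, while the Chinese, Arabic, Hebrew, Korean and Tamil samples all report `Lo`, so dropping `Lo` would exclude those scripts entirely.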
@ChristianSi Surprisingly it's not that unwieldy: removed, see below |
@ChristianSi Ah, good find. I'll update the script later tonight and see how it looks. |
Thanks for exploring this @ChristianSi and @marzer! ^>^ |
Marking this as a post-1.0 change, since I imagine this relaxation would not make any valid documents invalid -- thus, we can augment this in a non-major version bump. |
@ChristianSi Ok, I updated the script to scrape directly from the unicode consortium's character database and amended it to include all of the letter characters, and it looks like this:
Not really any worse than before, even considering it's 125634 characters. |
@marzer With what you provided, a PR could be prepared fairly quickly. Could you write that list similarly to how For reference, the part of RFC 3987 I'm referring to looks like this:
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD |
Thanks a lot for your efforts, @marzer ! That looks good so far and indeed manageable, but there are a few complications we have missed so far. I looked at what Python 3 and JavaScript allow in identifiers. In addition to the five Letter categories we have already, they allow "Letter Number" (Nl) anywhere in an identifier and "Decimal Number" (Nd) anywhere, except at the start. TOML already allows all-numeric keys (they never occur in a position where they can be confused with actual numbers) so I'd consider it reasonable to allow both these categories anywhere in a key – people using Bengali letters in keys might, for example, reasonably expect to be able to use Bengali digits as well. Together they comprise less than 900 characters, so adding them should be quite manageable. The final Number category (Other Number – No) is not allowed in identifiers in either language. Moreover, both languages allow anywhere, except at the start, "Nonspacing Mark" (Mn) and "Spacing Mark" (Mc). Now it's important to understand that in Unicode, Marks (Mx categories) are always combining characters – they become logically attached to the preceding character and modify it. For example, Mn contains the "combining grave accent" which goes over the preceding letter and modifies it; Mc contains various Bengali vowel signs which likewise modify the preceding (supposedly Bengali) letter. Hence it seems indeed important that we support these two categories too, since they are necessary to write certain words in certain languages – without them, support for multilingual bare keys would be incomplete and people might get odd error messages. It's also important that we must NOT allow them at the start of a bare key, since otherwise they would try to modify the preceding non-key character (likely a newline, space, or Finally, both JS and Python allow, except at the start, Connector Punctuation (Pc). 
That's a very short category with just 10 entries, including the underscore, which we allow already. I don't have strong feelings regarding this category, but would rather tend NOT to allow it in bare keys – we already have underscores and dashes as connectors, and, for example, the Centreline Low Line (﹎) with a tiny dot in the middle could theoretically be confused with the dots that actually separate key elements in hierarchical table names. So, to summarize, I'd propose to additionally allow Nl and Nd anywhere in a bare key, and Mn and Mc anywhere except as first character (or code point, to be more exact). In the README, we could then say:
|
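The rule summarized above (letters, Nl and Nd anywhere; Mn and Mc anywhere except the first code point; plus the underscore and dash TOML already allows) could be sketched in Python roughly as follows. This is not code from any TOML parser; `is_bare_key` is a hypothetical helper, and the result depends on the Unicode version bundled with the Python build.

```python
import unicodedata

LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}
NUMBERS = {"Nd", "Nl"}
MARKS = {"Mn", "Mc"}


def is_bare_key(key: str) -> bool:
    """Check a key against the proposed bare-key rule (a sketch, not the spec)."""
    if not key:
        return False
    for index, ch in enumerate(key):
        if ch in "_-":
            continue  # already allowed in bare keys today
        category = unicodedata.category(ch)
        if category in LETTERS or category in NUMBERS:
            continue
        if category in MARKS and index > 0:
            continue  # combining marks need a preceding code point to attach to
        return False
    return True
```

For example, `is_bare_key("naïve")` and `is_bare_key("১২৩")` (Bengali digits) are accepted, while a key starting with U+0301 or containing a dot or a space is rejected and would still need quotes.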
@ChristianSi LGTM. This proposal has significantly broadened in scope from my original thought bubble, but definitely for the better. Can the Mx codepoints appear consecutively? If not, we'd also need to clarify that codepoints from Mx categories cannot appear at the beginning of a key and immediately following another Mx codepoint. |
Indeed! Yes, consecutive Mx codepoints are allowed – they all modify the preceding letter, e.g. by placing an acute accent above and an ogonek below it. That's necessary for some languages, such as Navajo. |
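As a small illustration of consecutive marks, the following Python snippet spells an "a" with ogonek and acute accent as a base letter followed by two combining code points and prints what each one is. The helper `describe` is hypothetical and exists only for this example.

```python
import unicodedata

# Base letter "a" + COMBINING OGONEK + COMBINING ACUTE ACCENT.
text = "a\u0328\u0301"


def describe(s: str):
    """Return (code point, name, general category) for each code point in `s`."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in s
    ]


for row in describe(text):
    print(row)
```

Both marks are in category Mn, so a rule that forbade a mark directly after another mark would reject this spelling.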
Alright, I've updated the issue text to better reflect the current state of the discussion, as well as including links to my proof-of-concept implementation. I've also updated the script to generate the ABNF notation for the three relevant 'super-categories' of codepoints, which generates this:
; unicode codepoints from categories Ll, Lm, Lo, Lt, Lu
letters = %x41-5A / %x61-7A / %xAA / %xB5 /
%xBA / %xC0-D6 / %xD8-F6 / %xF8-2C1 /
%x2C6-2D1 / %x2E0-2E4 / %x2EC / %x2EE /
%x370-374 / %x376-377 / %x37A-37D / %x37F /
%x386 / %x388-38A / %x38C / %x38E-3A1 /
%x3A3-3F5 / %x3F7-481 / %x48A-52F / %x531-556 /
%x559 / %x560-588 / %x5D0-5EA / %x5EF-5F2 /
%x620-64A / %x66E-66F / %x671-6D3 / %x6D5 /
%x6E5-6E6 / %x6EE-6EF / %x6FA-6FC / %x6FF /
%x710 / %x712-72F / %x74D-7A5 / %x7B1 /
%x7CA-7EA / %x7F4-7F5 / %x7FA / %x800-815 /
%x81A / %x824 / %x828 / %x840-858 /
%x860-86A / %x8A0-8B4 / %x8B6-8C7 / %x904-939 /
%x93D / %x950 / %x958-961 / %x971-980 /
%x985-98C / %x98F-990 / %x993-9A8 / %x9AA-9B0 /
%x9B2 / %x9B6-9B9 / %x9BD / %x9CE /
%x9DC-9DD / %x9DF-9E1 / %x9F0-9F1 / %x9FC /
%xA05-A0A / %xA0F-A10 / %xA13-A28 / %xA2A-A30 /
%xA32-A33 / %xA35-A36 / %xA38-A39 / %xA59-A5C /
%xA5E / %xA72-A74 / %xA85-A8D / %xA8F-A91 /
%xA93-AA8 / %xAAA-AB0 / %xAB2-AB3 / %xAB5-AB9 /
%xABD / %xAD0 / %xAE0-AE1 / %xAF9 /
%xB05-B0C / %xB0F-B10 / %xB13-B28 / %xB2A-B30 /
%xB32-B33 / %xB35-B39 / %xB3D / %xB5C-B5D /
%xB5F-B61 / %xB71 / %xB83 / %xB85-B8A /
%xB8E-B90 / %xB92-B95 / %xB99-B9A / %xB9C /
%xB9E-B9F / %xBA3-BA4 / %xBA8-BAA / %xBAE-BB9 /
%xBD0 / %xC05-C0C / %xC0E-C10 / %xC12-C28 /
%xC2A-C39 / %xC3D / %xC58-C5A / %xC60-C61 /
%xC80 / %xC85-C8C / %xC8E-C90 / %xC92-CA8 /
%xCAA-CB3 / %xCB5-CB9 / %xCBD / %xCDE /
%xCE0-CE1 / %xCF1-CF2 / %xD04-D0C / %xD0E-D10 /
%xD12-D3A / %xD3D / %xD4E / %xD54-D56 /
%xD5F-D61 / %xD7A-D7F / %xD85-D96 / %xD9A-DB1 /
%xDB3-DBB / %xDBD / %xDC0-DC6 / %xE01-E30 /
%xE32-E33 / %xE40-E46 / %xE81-E82 / %xE84 /
%xE86-E8A / %xE8C-EA3 / %xEA5 / %xEA7-EB0 /
%xEB2-EB3 / %xEBD / %xEC0-EC4 / %xEC6 /
%xEDC-EDF / %xF00 / %xF40-F47 / %xF49-F6C /
%xF88-F8C / %x1000-102A / %x103F / %x1050-1055 /
%x105A-105D / %x1061 / %x1065-1066 / %x106E-1070 /
%x1075-1081 / %x108E / %x10A0-10C5 / %x10C7 /
%x10CD / %x10D0-10FA / %x10FC-1248 / %x124A-124D /
%x1250-1256 / %x1258 / %x125A-125D / %x1260-1288 /
%x128A-128D / %x1290-12B0 / %x12B2-12B5 / %x12B8-12BE /
%x12C0 / %x12C2-12C5 / %x12C8-12D6 / %x12D8-1310 /
%x1312-1315 / %x1318-135A / %x1380-138F / %x13A0-13F5 /
%x13F8-13FD / %x1401-166C / %x166F-167F / %x1681-169A /
%x16A0-16EA / %x16F1-16F8 / %x1700-170C / %x170E-1711 /
%x1720-1731 / %x1740-1751 / %x1760-176C / %x176E-1770 /
%x1780-17B3 / %x17D7 / %x17DC / %x1820-1878 /
%x1880-1884 / %x1887-18A8 / %x18AA / %x18B0-18F5 /
%x1900-191E / %x1950-196D / %x1970-1974 / %x1980-19AB /
%x19B0-19C9 / %x1A00-1A16 / %x1A20-1A54 / %x1AA7 /
%x1B05-1B33 / %x1B45-1B4B / %x1B83-1BA0 / %x1BAE-1BAF /
%x1BBA-1BE5 / %x1C00-1C23 / %x1C4D-1C4F / %x1C5A-1C7D /
%x1C80-1C88 / %x1C90-1CBA / %x1CBD-1CBF / %x1CE9-1CEC /
%x1CEE-1CF3 / %x1CF5-1CF6 / %x1CFA / %x1D00-1DBF /
%x1E00-1F15 / %x1F18-1F1D / %x1F20-1F45 / %x1F48-1F4D /
%x1F50-1F57 / %x1F59 / %x1F5B / %x1F5D /
%x1F5F-1F7D / %x1F80-1FB4 / %x1FB6-1FBC / %x1FBE /
%x1FC2-1FC4 / %x1FC6-1FCC / %x1FD0-1FD3 / %x1FD6-1FDB /
%x1FE0-1FEC / %x1FF2-1FF4 / %x1FF6-1FFC / %x2071 /
%x207F / %x2090-209C / %x2102 / %x2107 /
%x210A-2113 / %x2115 / %x2119-211D / %x2124 /
%x2126 / %x2128 / %x212A-212D / %x212F-2139 /
%x213C-213F / %x2145-2149 / %x214E / %x2183-2184 /
%x2C00-2C2E / %x2C30-2C5E / %x2C60-2CE4 / %x2CEB-2CEE /
%x2CF2-2CF3 / %x2D00-2D25 / %x2D27 / %x2D2D /
%x2D30-2D67 / %x2D6F / %x2D80-2D96 / %x2DA0-2DA6 /
%x2DA8-2DAE / %x2DB0-2DB6 / %x2DB8-2DBE / %x2DC0-2DC6 /
%x2DC8-2DCE / %x2DD0-2DD6 / %x2DD8-2DDE / %x2E2F /
%x3005-3006 / %x3031-3035 / %x303B-303C / %x3041-3096 /
%x309D-309F / %x30A1-30FA / %x30FC-30FF / %x3105-312F /
%x3131-318E / %x31A0-31BF / %x31F0-31FF / %x3400-4DBF /
%x4E00-9FFC / %xA000-A48C / %xA4D0-A4FD / %xA500-A60C /
%xA610-A61F / %xA62A-A62B / %xA640-A66E / %xA67F-A69D /
%xA6A0-A6E5 / %xA717-A71F / %xA722-A788 / %xA78B-A7BF /
%xA7C2-A7CA / %xA7F5-A801 / %xA803-A805 / %xA807-A80A /
%xA80C-A822 / %xA840-A873 / %xA882-A8B3 / %xA8F2-A8F7 /
%xA8FB / %xA8FD-A8FE / %xA90A-A925 / %xA930-A946 /
%xA960-A97C / %xA984-A9B2 / %xA9CF / %xA9E0-A9E4 /
%xA9E6-A9EF / %xA9FA-A9FE / %xAA00-AA28 / %xAA40-AA42 /
%xAA44-AA4B / %xAA60-AA76 / %xAA7A / %xAA7E-AAAF /
%xAAB1 / %xAAB5-AAB6 / %xAAB9-AABD / %xAAC0 /
%xAAC2 / %xAADB-AADD / %xAAE0-AAEA / %xAAF2-AAF4 /
%xAB01-AB06 / %xAB09-AB0E / %xAB11-AB16 / %xAB20-AB26 /
%xAB28-AB2E / %xAB30-AB5A / %xAB5C-AB69 / %xAB70-ABE2 /
%xAC00-D7A3 / %xD7B0-D7C6 / %xD7CB-D7FB / %xF900-FA6D /
%xFA70-FAD9 / %xFB00-FB06 / %xFB13-FB17 / %xFB1D /
%xFB1F-FB28 / %xFB2A-FB36 / %xFB38-FB3C / %xFB3E /
%xFB40-FB41 / %xFB43-FB44 / %xFB46-FBB1 / %xFBD3-FD3D /
%xFD50-FD8F / %xFD92-FDC7 / %xFDF0-FDFB / %xFE70-FE74 /
%xFE76-FEFC / %xFF21-FF3A / %xFF41-FF5A / %xFF66-FFBE /
%xFFC2-FFC7 / %xFFCA-FFCF / %xFFD2-FFD7 / %xFFDA-FFDC /
%x10000-1000B / %x1000D-10026 / %x10028-1003A / %x1003C-1003D /
%x1003F-1004D / %x10050-1005D / %x10080-100FA / %x10280-1029C /
%x102A0-102D0 / %x10300-1031F / %x1032D-10340 / %x10342-10349 /
%x10350-10375 / %x10380-1039D / %x103A0-103C3 / %x103C8-103CF /
%x10400-1049D / %x104B0-104D3 / %x104D8-104FB / %x10500-10527 /
%x10530-10563 / %x10600-10736 / %x10740-10755 / %x10760-10767 /
%x10800-10805 / %x10808 / %x1080A-10835 / %x10837-10838 /
%x1083C / %x1083F-10855 / %x10860-10876 / %x10880-1089E /
%x108E0-108F2 / %x108F4-108F5 / %x10900-10915 / %x10920-10939 /
%x10980-109B7 / %x109BE-109BF / %x10A00 / %x10A10-10A13 /
%x10A15-10A17 / %x10A19-10A35 / %x10A60-10A7C / %x10A80-10A9C /
%x10AC0-10AC7 / %x10AC9-10AE4 / %x10B00-10B35 / %x10B40-10B55 /
%x10B60-10B72 / %x10B80-10B91 / %x10C00-10C48 / %x10C80-10CB2 /
%x10CC0-10CF2 / %x10D00-10D23 / %x10E80-10EA9 / %x10EB0-10EB1 /
%x10F00-10F1C / %x10F27 / %x10F30-10F45 / %x10FB0-10FC4 /
%x10FE0-10FF6 / %x11003-11037 / %x11083-110AF / %x110D0-110E8 /
%x11103-11126 / %x11144 / %x11147 / %x11150-11172 /
%x11176 / %x11183-111B2 / %x111C1-111C4 / %x111DA /
%x111DC / %x11200-11211 / %x11213-1122B / %x11280-11286 /
%x11288 / %x1128A-1128D / %x1128F-1129D / %x1129F-112A8 /
%x112B0-112DE / %x11305-1130C / %x1130F-11310 / %x11313-11328 /
%x1132A-11330 / %x11332-11333 / %x11335-11339 / %x1133D /
%x11350 / %x1135D-11361 / %x11400-11434 / %x11447-1144A /
%x1145F-11461 / %x11480-114AF / %x114C4-114C5 / %x114C7 /
%x11580-115AE / %x115D8-115DB / %x11600-1162F / %x11644 /
%x11680-116AA / %x116B8 / %x11700-1171A / %x11800-1182B /
%x118A0-118DF / %x118FF-11906 / %x11909 / %x1190C-11913 /
%x11915-11916 / %x11918-1192F / %x1193F / %x11941 /
%x119A0-119A7 / %x119AA-119D0 / %x119E1 / %x119E3 /
%x11A00 / %x11A0B-11A32 / %x11A3A / %x11A50 /
%x11A5C-11A89 / %x11A9D / %x11AC0-11AF8 / %x11C00-11C08 /
%x11C0A-11C2E / %x11C40 / %x11C72-11C8F / %x11D00-11D06 /
%x11D08-11D09 / %x11D0B-11D30 / %x11D46 / %x11D60-11D65 /
%x11D67-11D68 / %x11D6A-11D89 / %x11D98 / %x11EE0-11EF2 /
%x11FB0 / %x12000-12399 / %x12480-12543 / %x13000-1342E /
%x14400-14646 / %x16800-16A38 / %x16A40-16A5E / %x16AD0-16AED /
%x16B00-16B2F / %x16B40-16B43 / %x16B63-16B77 / %x16B7D-16B8F /
%x16E40-16E7F / %x16F00-16F4A / %x16F50 / %x16F93-16F9F /
%x16FE0-16FE1 / %x16FE3 / %x17000-187F7 / %x18800-18CD5 /
%x18D00-18D08 / %x1B000-1B11E / %x1B150-1B152 / %x1B164-1B167 /
%x1B170-1B2FB / %x1BC00-1BC6A / %x1BC70-1BC7C / %x1BC80-1BC88 /
%x1BC90-1BC99 / %x1D400-1D454 / %x1D456-1D49C / %x1D49E-1D49F /
%x1D4A2 / %x1D4A5-1D4A6 / %x1D4A9-1D4AC / %x1D4AE-1D4B9 /
%x1D4BB / %x1D4BD-1D4C3 / %x1D4C5-1D505 / %x1D507-1D50A /
%x1D50D-1D514 / %x1D516-1D51C / %x1D51E-1D539 / %x1D53B-1D53E /
%x1D540-1D544 / %x1D546 / %x1D54A-1D550 / %x1D552-1D6A5 /
%x1D6A8-1D6C0 / %x1D6C2-1D6DA / %x1D6DC-1D6FA / %x1D6FC-1D714 /
%x1D716-1D734 / %x1D736-1D74E / %x1D750-1D76E / %x1D770-1D788 /
%x1D78A-1D7A8 / %x1D7AA-1D7C2 / %x1D7C4-1D7CB / %x1E100-1E12C /
%x1E137-1E13D / %x1E14E / %x1E2C0-1E2EB / %x1E800-1E8C4 /
%x1E900-1E943 / %x1E94B / %x1EE00-1EE03 / %x1EE05-1EE1F /
%x1EE21-1EE22 / %x1EE24 / %x1EE27 / %x1EE29-1EE32 /
%x1EE34-1EE37 / %x1EE39 / %x1EE3B / %x1EE42 /
%x1EE47 / %x1EE49 / %x1EE4B / %x1EE4D-1EE4F /
%x1EE51-1EE52 / %x1EE54 / %x1EE57 / %x1EE59 /
%x1EE5B / %x1EE5D / %x1EE5F / %x1EE61-1EE62 /
%x1EE64 / %x1EE67-1EE6A / %x1EE6C-1EE72 / %x1EE74-1EE77 /
%x1EE79-1EE7C / %x1EE7E / %x1EE80-1EE89 / %x1EE8B-1EE9B /
%x1EEA1-1EEA3 / %x1EEA5-1EEA9 / %x1EEAB-1EEBB / %x20000-2A6DD /
%x2A700-2B734 / %x2B740-2B81D / %x2B820-2CEA1 / %x2CEB0-2EBE0 /
%x2F800-2FA1D / %x30000-3134A
; 131241 codepoints in total
; unicode codepoints from categories Nd, Nl
numbers = %x30-39 / %x660-669 / %x6F0-6F9 / %x7C0-7C9 /
%x966-96F / %x9E6-9EF / %xA66-A6F / %xAE6-AEF /
%xB66-B6F / %xBE6-BEF / %xC66-C6F / %xCE6-CEF /
%xD66-D6F / %xDE6-DEF / %xE50-E59 / %xED0-ED9 /
%xF20-F29 / %x1040-1049 / %x1090-1099 / %x16EE-16F0 /
%x17E0-17E9 / %x1810-1819 / %x1946-194F / %x19D0-19D9 /
%x1A80-1A89 / %x1A90-1A99 / %x1B50-1B59 / %x1BB0-1BB9 /
%x1C40-1C49 / %x1C50-1C59 / %x2160-2182 / %x2185-2188 /
%x3007 / %x3021-3029 / %x3038-303A / %xA620-A629 /
%xA6E6-A6EF / %xA8D0-A8D9 / %xA900-A909 / %xA9D0-A9D9 /
%xA9F0-A9F9 / %xAA50-AA59 / %xABF0-ABF9 / %xFF10-FF19 /
%x10140-10174 / %x10341 / %x1034A / %x103D1-103D5 /
%x104A0-104A9 / %x10D30-10D39 / %x11066-1106F / %x110F0-110F9 /
%x11136-1113F / %x111D0-111D9 / %x112F0-112F9 / %x11450-11459 /
%x114D0-114D9 / %x11650-11659 / %x116C0-116C9 / %x11730-11739 /
%x118E0-118E9 / %x11950-11959 / %x11C50-11C59 / %x11D50-11D59 /
%x11DA0-11DA9 / %x12400-1246E / %x16A60-16A69 / %x16B50-16B59 /
%x1D7CE-1D7FF / %x1E140-1E149 / %x1E2F0-1E2F9 / %x1E950-1E959 /
%x1FBF0-1FBF9
; 886 codepoints in total
; unicode codepoints from categories Mn, Mc
combining_marks = %x300-36F / %x483-487 / %x591-5BD / %x5BF /
%x5C1-5C2 / %x5C4-5C5 / %x5C7 / %x610-61A /
%x64B-65F / %x670 / %x6D6-6DC / %x6DF-6E4 /
%x6E7-6E8 / %x6EA-6ED / %x711 / %x730-74A /
%x7A6-7B0 / %x7EB-7F3 / %x7FD / %x816-819 /
%x81B-823 / %x825-827 / %x829-82D / %x859-85B /
%x8D3-8E1 / %x8E3-903 / %x93A-93C / %x93E-94F /
%x951-957 / %x962-963 / %x981-983 / %x9BC /
%x9BE-9C4 / %x9C7-9C8 / %x9CB-9CD / %x9D7 /
%x9E2-9E3 / %x9FE / %xA01-A03 / %xA3C /
%xA3E-A42 / %xA47-A48 / %xA4B-A4D / %xA51 /
%xA70-A71 / %xA75 / %xA81-A83 / %xABC /
%xABE-AC5 / %xAC7-AC9 / %xACB-ACD / %xAE2-AE3 /
%xAFA-AFF / %xB01-B03 / %xB3C / %xB3E-B44 /
%xB47-B48 / %xB4B-B4D / %xB55-B57 / %xB62-B63 /
%xB82 / %xBBE-BC2 / %xBC6-BC8 / %xBCA-BCD /
%xBD7 / %xC00-C04 / %xC3E-C44 / %xC46-C48 /
%xC4A-C4D / %xC55-C56 / %xC62-C63 / %xC81-C83 /
%xCBC / %xCBE-CC4 / %xCC6-CC8 / %xCCA-CCD /
%xCD5-CD6 / %xCE2-CE3 / %xD00-D03 / %xD3B-D3C /
%xD3E-D44 / %xD46-D48 / %xD4A-D4D / %xD57 /
%xD62-D63 / %xD81-D83 / %xDCA / %xDCF-DD4 /
%xDD6 / %xDD8-DDF / %xDF2-DF3 / %xE31 /
%xE34-E3A / %xE47-E4E / %xEB1 / %xEB4-EBC /
%xEC8-ECD / %xF18-F19 / %xF35 / %xF37 /
%xF39 / %xF3E-F3F / %xF71-F84 / %xF86-F87 /
%xF8D-F97 / %xF99-FBC / %xFC6 / %x102B-103E /
%x1056-1059 / %x105E-1060 / %x1062-1064 / %x1067-106D /
%x1071-1074 / %x1082-108D / %x108F / %x109A-109D /
%x135D-135F / %x1712-1714 / %x1732-1734 / %x1752-1753 /
%x1772-1773 / %x17B4-17D3 / %x17DD / %x180B-180D /
%x1885-1886 / %x18A9 / %x1920-192B / %x1930-193B /
%x1A17-1A1B / %x1A55-1A5E / %x1A60-1A7C / %x1A7F /
%x1AB0-1ABD / %x1ABF-1AC0 / %x1B00-1B04 / %x1B34-1B44 /
%x1B6B-1B73 / %x1B80-1B82 / %x1BA1-1BAD / %x1BE6-1BF3 /
%x1C24-1C37 / %x1CD0-1CD2 / %x1CD4-1CE8 / %x1CED /
%x1CF4 / %x1CF7-1CF9 / %x1DC0-1DF9 / %x1DFB-1DFF /
%x20D0-20DC / %x20E1 / %x20E5-20F0 / %x2CEF-2CF1 /
%x2D7F / %x2DE0-2DFF / %x302A-302F / %x3099-309A /
%xA66F / %xA674-A67D / %xA69E-A69F / %xA6F0-A6F1 /
%xA802 / %xA806 / %xA80B / %xA823-A827 /
%xA82C / %xA880-A881 / %xA8B4-A8C5 / %xA8E0-A8F1 /
%xA8FF / %xA926-A92D / %xA947-A953 / %xA980-A983 /
%xA9B3-A9C0 / %xA9E5 / %xAA29-AA36 / %xAA43 /
%xAA4C-AA4D / %xAA7B-AA7D / %xAAB0 / %xAAB2-AAB4 /
%xAAB7-AAB8 / %xAABE-AABF / %xAAC1 / %xAAEB-AAEF /
%xAAF5-AAF6 / %xABE3-ABEA / %xABEC-ABED / %xFB1E /
%xFE00-FE0F / %xFE20-FE2F / %x101FD / %x102E0 /
%x10376-1037A / %x10A01-10A03 / %x10A05-10A06 / %x10A0C-10A0F /
%x10A38-10A3A / %x10A3F / %x10AE5-10AE6 / %x10D24-10D27 /
%x10EAB-10EAC / %x10F46-10F50 / %x11000-11002 / %x11038-11046 /
%x1107F-11082 / %x110B0-110BA / %x11100-11102 / %x11127-11134 /
%x11145-11146 / %x11173 / %x11180-11182 / %x111B3-111C0 /
%x111C9-111CC / %x111CE-111CF / %x1122C-11237 / %x1123E /
%x112DF-112EA / %x11300-11303 / %x1133B-1133C / %x1133E-11344 /
%x11347-11348 / %x1134B-1134D / %x11357 / %x11362-11363 /
%x11366-1136C / %x11370-11374 / %x11435-11446 / %x1145E /
%x114B0-114C3 / %x115AF-115B5 / %x115B8-115C0 / %x115DC-115DD /
%x11630-11640 / %x116AB-116B7 / %x1171D-1172B / %x1182C-1183A /
%x11930-11935 / %x11937-11938 / %x1193B-1193E / %x11940 /
%x11942-11943 / %x119D1-119D7 / %x119DA-119E0 / %x119E4 /
%x11A01-11A0A / %x11A33-11A39 / %x11A3B-11A3E / %x11A47 /
%x11A51-11A5B / %x11A8A-11A99 / %x11C2F-11C36 / %x11C38-11C3F /
%x11C92-11CA7 / %x11CA9-11CB6 / %x11D31-11D36 / %x11D3A /
%x11D3C-11D3D / %x11D3F-11D45 / %x11D47 / %x11D8A-11D8E /
%x11D90-11D91 / %x11D93-11D97 / %x11EF3-11EF6 / %x16AF0-16AF4 /
%x16B30-16B36 / %x16F4F / %x16F51-16F87 / %x16F8F-16F92 /
%x16FE4 / %x16FF0-16FF1 / %x1BC9D-1BC9E / %x1D165-1D169 /
%x1D16D-1D172 / %x1D17B-1D182 / %x1D185-1D18B / %x1D1AA-1D1AD /
%x1D242-1D244 / %x1DA00-1DA36 / %x1DA3B-1DA6C / %x1DA75 /
%x1DA84 / %x1DA9B-1DA9F / %x1DAA1-1DAAF / %x1E000-1E006 /
%x1E008-1E018 / %x1E01B-1E021 / %x1E023-1E024 / %x1E026-1E02A /
%x1E130-1E136 / %x1E2EC-1E2EF / %x1E8D0-1E8D6 / %x1E944-1E94A /
%xE0100-E01EF
; 2282 codepoints in total |
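For readers who want to regenerate listings like the ones above, here is a rough Python sketch of the idea. It is not the script used in this thread (that one scraped the Unicode Character Database directly); it relies on the standard `unicodedata` module, so the output follows whatever Unicode version the Python build ships (`unicodedata.unidata_version`) and may differ from the ranges posted here. The function names are hypothetical.

```python
import sys
import unicodedata


def codepoints_in(categories):
    """All code points whose general category is in `categories`, ascending."""
    return [
        cp for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in categories
    ]


def to_ranges(codepoints):
    """Collapse an ascending list of code points into inclusive (low, high) ranges."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [tuple(r) for r in ranges]


def to_abnf(name, ranges, per_line=4):
    """Format ranges as an ABNF rule, e.g. 'numbers = %x30-39 / %x660-669'."""
    terms = [f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}" for lo, hi in ranges]
    lines = [" / ".join(terms[i:i + per_line]) for i in range(0, len(terms), per_line)]
    return f"{name} = " + " /\n    ".join(lines)


if __name__ == "__main__":
    print(f"; Unicode {unicodedata.unidata_version}")
    print(to_abnf("numbers", to_ranges(codepoints_in({"Nd", "Nl"}))))
```

Because the category tables grow with each Unicode release, any ABNF generated this way would need to be regenerated, or pinned to a stated Unicode version, when the spec is updated.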
@marzer Great! @pradyunsg Assuming that one of us prepares a PR, is there any chance that this would be merged relatively quickly? Or does it have to wait until 1.0 is released in any case? |
This feels like a significant change to TOML's interpretation of being "minimal". Maybe we should ask Tom himself to bless this change? |
Is it though? The language itself will be just as minimal as before, since this change will be backwards-compatible. In fact it would actually increase the simplicity of TOML files since keys should work in a WYSIWYG way for more people, and only require quotes in very specific circumstances. It will complicate it for implementers, sure, but not all that much. |
I'm coming to this as someone who is incorporating TOML into a project with keys that will often contain symbols/punctuation. I've read through this thread and I have not seen anyone propose that keys allow any valid unicode except the symbols needed by the TOML parser itself. That would:
I'm not currently arguing this is the best approach but it seemed worth adding to the set of options in the discussion space. |
If anyone is interested in playing around with a parser that supports this tentative feature (as specified in the OP, anyways), my C++ TOML library is now in a publishable state: https://marzer.github.io/tomlplusplus/ @thoughtafter It seems as though your suggestion is very much in-line with @abelbraaksma's (which from my reading, advocates including everything except syntactically-relevant/ambiguous characters). |
In my opinion, we don't program in any natural language, including English; what we write is symbols. ASCII in programming is a set of safe symbols, not English. In high-level languages an identifier can be defined to include any character, because there is an IDE and syntax highlighting to help. But TOML is designed for INI-style files, which are usually edited without any extra tooling support. So I think it is really dangerous to allow non-ASCII characters in bare keys.
But I think it would be good for the spec to allow implementations to support bare keys in user-specified languages: you use whichever languages you are familiar with. For example, I think simplicity and nationality need not conflict; it does not have to be one or the other. Otherwise, absolute "fairness" will lead to widespread inefficiency. |
@LongTengDao it's a config file format, not a database specification or real-time streaming format; I don't think 'efficiency' is all that relevant (if you mean the computational complexity of parsing, that is). Unless you mean the efficiency of the actual implementing of the new functionality? As in, it will be a bit complex for implementers and maintainers to get this working in their parsers, thus being inefficient for them? If so, that's not even true. It's pretty easy to implement. I've done it myself, and provide relevant information in the original post. I'm not sure what other sorts of efficiencies you could mean. It wouldn't make TOML any less efficient to write (if anything it would get simpler and easier to use as a result of this proposal). |
@marzer I've never considered the difficulty of writing a parser to be a hindrance, and it's not worth considering when the task is to design a file format well. If anyone objects to this, I will be on your side. I only mean the efficiency of writing and checking. Introducing special characters too broadly will make the process of reading and writing a file stressful again. Remember, Unicode doesn't just include characters in common languages like the ones you and I use (1en or 1em width). Instead, there are many combinations of displayed, invisible, and even right-to-left characters that are in letter categories rather than in punctuation or whitespace categories. It's a nightmare if you've ever developed typesetting software like Office Word. Since then, I have been frightened by phrases like "all valid Unicode". But anyway, it doesn't affect my use. If the spec says ASCII only, I will add a user option to support any Unicode character range. If the spec says any Unicode character is valid, as you suggest, I will add a user option to limit keys to ASCII only. I think this right belongs to the user. |
@LongTengDao to be clear, my proposal isn't to support "any Unicode character", as you seem to think. It's to support a subset (letters, numbers, and some combining marks). Yeah there might be characters in those categories that are effectively garbage for our purposes but they can probably just be ignored; if it's not a character on a keyboard then someone has gone to effort to put it in their config, and if that breaks stuff then that's the life they chose. Parser library users can trivially add additional sanity-checking if they feel the need. |
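The kind of user-side sanity check mentioned above can be quite small. As a hedged Python sketch, assuming the parser hands back nested dicts and lists (as most Python TOML libraries do), a hypothetical helper `non_ascii_keys` could report every key that strays outside ASCII:

```python
def non_ascii_keys(table, path=()):
    """Yield the dotted path of every key containing a non-ASCII character."""
    for key, value in table.items():
        here = path + (key,)
        if not key.isascii():
            yield ".".join(here)
        if isinstance(value, dict):
            # Nested table: recurse with the extended path.
            yield from non_ascii_keys(value, here)
        elif isinstance(value, list):
            # Array of tables: check each table element.
            for item in value:
                if isinstance(item, dict):
                    yield from non_ascii_keys(item, here)


# Example input standing in for a parsed document.
config = {
    "name": "demo",
    "größe": 1,
    "server": {"hôte": "example", "port": 8080},
    "items": [{"ключ": 1}],
}
print(list(non_ascii_keys(config)))
```

A project that wants ASCII-only keys could run such a check after parsing and reject the file, without the spec having to forbid anything.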
@LongTengDao I believe the opposite to be true. Limiting users that are accustomed to right-to-left writing means limiting over 1.7 billion people worldwide to a system that is not native to them. What is perhaps perceived by you as "a nightmare" is perceived by others as a nightmare if it isn't allowed. Not everyone speaks English or can write in their native tongue using only ASCII characters (in fact, it is a relatively small share of the world population). Inclusion of other cultures, languages and writing systems is a good thing, and although TOML is not a programming language, many well-known programming languages embrace inclusion more than exclusion: C#, VB, F# (allows any character), Java (they allow a broader set than defined here), Ruby, Perl, XML/HTML tag names, CSS classes/id and there are many more. Unicode even has a specific TR that describes the recommended way for allowing Unicode characters in identifiers: https://unicode.org/reports/tr31/. Differences between languages will always exist, but the closer a language (or a spec like TOML) gets to TR31, the better it is for the worldwide community of thousands of languages that can write in their native tongue. If any company or individual wishes to limit the allowed set of characters in identifiers, or in coding in general, they are of course free to do so, just like coding styles exist for many programming languages, you could limit your style to "only ASCII" or whatever you prefer. And as already has been said, the proposal here is a safe subset of the Unicode language. |
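Python is one language whose identifier rules are based on TR31 (via PEP 3131), so its built-in `str.isidentifier` gives a quick feel for what a TR31-style rule accepts. The sample names and the `python_accepts` wrapper below are my own; note that TOML bare keys differ from Python identifiers in that they also allow dashes and a leading digit.

```python
# Sample names in several scripts, plus a few that Python rejects.
CANDIDATES = ["größe", "переменная", "名前", "مفتاح", "1abc", "a-b", "a.b"]


def python_accepts(name: str) -> bool:
    """True if `name` is a valid Python identifier (TR31-based rules)."""
    return name.isidentifier()


for name in CANDIDATES:
    print(f"{name!r}: {python_accepts(name)}")
```

The German, Russian, Japanese and Arabic names are accepted, while the names with a leading digit, a dash or a dot are not.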
@ChristianSi, apologies for the wait, I forgot about your question here. The precise definition of
Let's split that up. According to this, the following are not allowed as a starting character (and I believe this mostly follows current practice in TOML as well):
I'm not sure why the "tie" is forbidden as starting character in XML names (it is not a combining tie, it is spaced), but the other ones seem sensible. I can write up a TOML spec proposal for this set, and/or extend it to TR31 if that somehow makes sense, but I think it is easier for people in general to use the XML specification (without reference to XML, of course, as it is otherwise unrelated), since they already did the necessary research, it's concise, and it's trivial to implement. TR31 is quite hard to read and probably raises new questions again. |
This looks like we're missing a PR for doing this. If someone wants to pick this up, and file a PR expanding the allowed bare keys syntax to include letters from the broader unicode spec, that'd be welcome! |
@pradyunsg, done, I've created a PR in #891. I tried to be both as inclusive as possible, while maintaining simplicity for parsers. Basically the rule is now: "Any Unicode letter, letterlike character or digit, except dot", as discussed above. |
This has now been merged. Thanks everyone for their support and insights! |
This backs out the unicode bare keys from toml-lang#891. This does *not* mean we can't include it in a future 1.2 (or 1.3, or whatever); just that right now there doesn't seem to be a clear consensus regarding normalisation and which characters to include. It's already the most discussed single issue in the history of TOML. I kind of hate doing this as it seems a step backwards; in principle I think we *should* have this so I'm not against the idea of the feature as such, but things seem to be at a bit of a stalemate right now, and this will allow TOML to move forward on other issues. It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until 2019, and has only 11 upvotes. Other than that, the issue was raised only once before in 2015 as far as I can find (toml-lang#337). I also can't really find anyone asking for it in any of the HN threads on TOML. All of this means we can push forward releasing TOML 1.1, giving people access to the much more frequently requested relaxing of inline tables (toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other more minor things (e.g. `\e` has 12 upvotes in toml-lang#715). Basically, a lot more people are waiting for this, and all things considered this seems a better path forward for now, unless someone comes up with a proposal which addresses all issues (I tried and thus far failed). I proposed this over here a few months ago, and the response didn't seem too hostile to the idea: toml-lang#966 (comment)
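To illustrate why normalisation is contentious, here is a short Python sketch (my own example, not taken from the PR discussion). Two keys that render identically can be different code point sequences, so a parser that compares keys code point by code point treats them as distinct unless it normalises first. The `nfc` helper is hypothetical.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single code point, U+00E9
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def nfc(s: str) -> str:
    """Normalise to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", s)


table = {composed: 1}
print(decomposed in table)                       # lookup without normalisation
print(nfc(decomposed) in {nfc(k) for k in table})  # lookup after normalising both sides
```

Whether a TOML parser should normalise keys, reject unnormalised ones, or compare raw code points is exactly the kind of question on which no consensus was reached.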
This backs out the unicode bare keys from toml-lang#891. This does *not* mean we can't include it in a future 1.2 (or 1.3, or whatever); just that right now there doesn't seem to be a clear consensus regarding normalisation and which characters to include. It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I think we *should* have this, so I'm not against the idea of the feature as such, but things seem to be at a bit of a stalemate right now, and this will allow TOML to move forward on other fronts.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until 2019, and has only 11 upvotes. Other than that, the issue was raised only once before, in 2015, as far as I can find (toml-lang#337). I also can't really find anyone asking for it in any of the HN threads on TOML.

Reverting this means we can go forward with releasing TOML 1.1, giving people access to the much more frequently requested relaxing of inline tables (toml-lang#516, with 122 upvotes, which has come up on HN as well) and some other more minor things (e.g. `\e` has 12 upvotes in toml-lang#715). Basically, a lot more people are waiting for this, and all things considered this seems a better path forward for now, unless someone comes up with a proposal which addresses all issues (I tried and thus far failed).

I proposed this over here a few months ago, and the responses didn't seem too hostile to the idea: toml-lang#966 (comment)
I believe this would greatly improve things and solve all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.

I don't think there are ANY perfect solutions here; *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we take will need to include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it". Being conservative with these types of things is good!

- This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people use them, which is why this has gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar one doesn't". This shows up in a number of things aside from emojis:

      a.toml:
        Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
        Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
        Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
        Error:   (none)

      c.toml:
        Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
        Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
        Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
        Error:   (none)

      e.toml:
        Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
        Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written).

  People don't read specifications in great detail, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

      '＃' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50 SMALL COMMA (Other_Punctuation)
      '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
      '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue between different scripts (Latin and Cyrillic is the well-known case), but this is less of an issue since it's not syntax, and it's also something that's fundamentally unavoidable in any multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned, all solutions come with a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger.

  The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code, adding multibyte support at all will be the harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. Etc. I don't think this is really much of an issue in practice.

  I chose Unicode 9 as everyone supports this; I went back and forth on it for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability and future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it's really something that the IETF should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
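The "loop over every character and check if it's in a range table" check from the trade-offs above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code from the proposal: `ALLOWED_RANGES`, `is_bare_key_char` and `is_bare_key` are hypothetical names, and the table is a tiny hand-picked subset (ASCII, a few Latin-1, Greek and Cyrillic letters) standing in for the 716 auto-generated ranges. It uses a binary search over sorted ranges and no Unicode library.

```python
from bisect import bisect_right

# Hypothetical, illustrative subset of an "allowed characters" table:
# sorted, non-overlapping (start, end) code point ranges, both inclusive.
# A real table would be generated from the Unicode Character Database.
ALLOWED_RANGES = [
    (0x002D, 0x002D),  # '-'
    (0x0030, 0x0039),  # 0-9
    (0x0041, 0x005A),  # A-Z
    (0x005F, 0x005F),  # '_'
    (0x0061, 0x007A),  # a-z
    (0x00C0, 0x00D6),  # Latin-1 letters, first run
    (0x00D8, 0x00F6),  # Latin-1 letters, second run (skips U+00D7)
    (0x00F8, 0x00FF),  # Latin-1 letters, third run (skips U+00F7)
    (0x0391, 0x03A1),  # Greek capitals, first run
    (0x03A3, 0x03A9),  # Greek capitals, second run
    (0x03B1, 0x03C9),  # Greek small letters
    (0x0410, 0x044F),  # basic Cyrillic letters
]
_STARTS = [start for start, _ in ALLOWED_RANGES]


def is_bare_key_char(ch: str) -> bool:
    """Return True if the single character ch falls inside an allowed range."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= ALLOWED_RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """Return True if key is non-empty and every character is allowed."""
    return bool(key) and all(is_bare_key_char(ch) for ch in key)


if __name__ == "__main__":
    for candidate in ["Sinn_F\u00e9in", "Sinn_Fe\u0301in", "\u037e", "\uff03x"]:
        print(ascii(candidate), is_bare_key(candidate))
```

With this table, the precomposed spelling `Sinn_F\u00e9in` passes, while the decomposed spelling `Sinn_Fe\u0301in` is rejected because the combining mark U+0301 is not a letter or digit; that is the sense in which disallowing combining characters sidesteps normalisation for bare keys. The confusable punctuation from the examples (U+037E, U+FF03) is rejected for the same reason.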
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO. I don't think there are ANY perfect solutions here and that *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML. Advantages: - This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts. This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it". Being conservative for these type of things is good! - This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them, which is why this gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1] - It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar doesn't". 
  This shows up in a number of things aside from emojis:

      a.toml:
          Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
          Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
          Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
          Error:   (none)

      c.toml:
          Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
          Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
          Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
          Error:   (none)

      e.toml:
          Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
          Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written).

  People don't read specifications in great detail, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.
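To make the "letters and digits" rule concrete, here is a minimal Python sketch that looks up the general category of the five characters from the examples above and applies one possible reading of the rule. The helper name `is_letter_or_digit` and the exact category set (any `L*` category plus `Nd`) are assumptions for illustration; the proposal's actual generated table may draw the line differently.

```python
import unicodedata


def is_letter_or_digit(ch: str) -> bool:
    """Hypothetical rule: accept letters (L*) and decimal digits (Nd) only."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd"


# The key characters from a.toml through e.toml, written as escapes so that
# editors and normalisers cannot silently swap them for look-alikes.
samples = {
    "a.toml": "\u037e",  # GREEK QUESTION MARK
    "b.toml": "\u0387",  # GREEK ANO TELEIA
    "c.toml": "\u2013",  # EN DASH
    "d.toml": "\u207b",  # SUPERSCRIPT MINUS
    "e.toml": "\uff03",  # FULLWIDTH NUMBER SIGN
}

for name, ch in samples.items():
    print(name, f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter_or_digit(ch))
```

Under this reading all five characters get the same answer, which is the consistency the commit message is arguing for.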
- This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

      '＃' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50 SMALL COMMA (Other_Punctuation)
      '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
      '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue with different scripts (Latin and Cyrillic is well-known), but this is less of an issue since it's not syntax, and also something that's fundamentally unavoidable in any multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger.
  The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code, adding multibyte support at all will be the harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. Etc. I don't think this is really much of an issue in practice.

  I chose Unicode 9 as everyone supports this; I went back and forth on this for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it's really something that the IETF should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.
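The "loop over every character and check if it's in a range table" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ALLOWED_RANGES` is a tiny hand-written excerpt, not the generated 716-range table, and the names `in_range_table` and `is_bare_key` are hypothetical.

```python
from bisect import bisect_right

# Illustrative excerpt only; NOT the generated table from the proposal.
# Each entry is an inclusive (start, stop) pair, sorted by start.
ALLOWED_RANGES = [
    (0x30, 0x39),  # ASCII digits
    (0x41, 0x5A),  # A-Z
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6),  # Latin-1 letters, stopping before U+00D7 MULTIPLICATION SIGN
    (0xD8, 0xF6),  # Latin-1 letters, stopping before U+00F7 DIVISION SIGN
    (0xF8, 0xFF),  # remaining Latin-1 letters
]
_STARTS = [start for start, _ in ALLOWED_RANGES]


def in_range_table(codepoint: int) -> bool:
    """Binary-search for the last range starting at or before the codepoint."""
    i = bisect_right(_STARTS, codepoint) - 1
    return i >= 0 and codepoint <= ALLOWED_RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """A non-empty key made only of '-', '_' and characters in the table."""
    return bool(key) and all(c in "-_" or in_range_table(ord(c)) for c in key)


print(is_bare_key("Sinn_F\u00e9in"))  # accented letter is inside the excerpt
print(is_bare_key("a\u00d7b"))        # multiplication sign is not
```

A binary search keeps the lookup cheap even with several hundred ranges; a linear scan would also work and is arguably simpler in a language without a ready-made bisect helper.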
Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
Issue
TOML's "bare key" syntax is too restrictive. People who regularly use characters from languages other than English should be able to do so in TOML keys without additional gymnastics.
I know there's already been a lot of discussion about this but much of it was from when TOML was less established and I think it warrants revisiting.
Proposed change
Expand the set of characters accepted in bare keys to include letters and numbers from the entire Unicode space, similar to how identifiers are handled in other Unicode-compliant contexts (e.g. Python, JavaScript, etc.). Specifically:
Rationale
After reading much of the existing discussion on the issue, I've identified the points below as being the main objections. I've written a counterpoint for each.
"ASCII-only is easy to understand"
Allowing Unicode letters and numbers wouldn't change the understandability of the written word in "mostly-ASCII" contexts, excepting maybe people from English-centric countries encountering characters they otherwise rarely see and being unsure how to pronounce them. I'm one of those people and my brain seems to consume them just fine. And it's almost certainly going to improve the understandability of bare keys to people for whom an ASCII environment is not their regular one.
It also wouldn't change the semantic/syntactic understandability of the language; I'm only advocating relaxing the spec to allow letter and number characters, not anything that might be confused for a language construct (no math symbols, for instance).
"Guides users to choose simple key names"
See above. I'd argue that the keys would be no less simple with this change. I live and work in a European country and a number of my friends and colleagues have non-ASCII letters in their name (e.g. ä). I doubt they consider their names to be complex; I certainly don't. If anything, by forcing people to jump through hoops just to type in their language, we're actually making the key names more complex w.r.t. cognitive load.
"Eliminate any weirdness that could come from having to deal with undelimited Unicode"
The TOML spec dictates UTF-8, not UTF-8-ish. UTF-8 is a solved problem at this point. If a parser doesn't correctly detect and handle malformed UTF-8, I'd argue that the parser needs fixing, not that we should bend over to accommodate users who are using crap tools and libraries. It's such a solved problem that you can even portably consume it using a state machine and validate it using vector intrinsics.
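As a rough illustration of how little a parser has to do to reject malformed input up front, here is a minimal Python sketch. It relies on Python's built-in strict UTF-8 decoder rather than the state-machine or vectorised approaches mentioned above, and the helper name `is_valid_utf8` is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A well-formed key, a truncated two-byte sequence, and an overlong encoding.
print(is_valid_utf8("Sinn_F\u00e9in = 1".encode("utf-8")))
print(is_valid_utf8(b"\xc3\x28"))
print(is_valid_utf8(b"\xc0\xaf"))
```

A parser would run a check like this (or its language's equivalent) once on the whole document before tokenising, so the key-parsing code never sees malformed sequences.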
"Keys should be identifier-like"
Despite the fact that the concept of an "identifier" isn't a thing in TOML, I'll concede that in some situations this might be a concern. A reasonable example is using TOML in code generation contexts; if you used TOML keys to inform variable names, historically you'd have run into issues with non-ASCII characters in many languages, though this is no longer true. Even good old C++ supports Unicode characters in identifiers on modern compilers.
...all of which is rendered moot by the fact that TOML supports hyphens in bare keys which are often invalid in identifier contexts, so this objection is a non-starter anyway.
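A quick way to see both halves of this argument is to ask a language with Unicode-aware identifiers which existing and proposed bare keys it would accept. The sketch below uses Python's `str.isidentifier()`; the sample keys are just examples chosen for illustration.

```python
# Keys that are (or would be) legal as TOML bare keys.
candidates = [
    "Sinn_F\u00e9in",  # non-ASCII letter: not a bare key in TOML 1.0
    "my-key",          # hyphen: already a valid bare key today
    "1950",            # all digits: already a valid bare key today
]

for key in candidates:
    # str.isidentifier() applies Python's own identifier rules.
    print(repr(key), key.isidentifier())
```

The key with the accented letter passes as a Python identifier, while the two keys TOML already accepts do not, so "identifier-like" is not the line the current spec actually draws.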
"It complicates implementation"
It really doesn't. Many implementations will be able to leverage built-in helper functions or libraries for working with Unicode. For those that can't, I've put my money where my mouth is and implemented this as a proof-of-concept in my own TOML parser and I'm happy for my code to be used as a starting point:
- is_unicode_XXXXX() codepoint identity functions (generated by a script)
Of course you might argue that simply accepting UTF-8 bytes from a TOML implementation is not an option for everyone, and you'd be right; there will always be situations where only ASCII makes sense (e.g. legacy codebases). I'd respond by pointing out that detecting non-ASCII characters in a character stream is laughably trivial. Applications requiring ASCII-only can easily enforce this themselves.
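For an application that really does need ASCII-only keys, the enforcement could look something like the following Python sketch. It assumes the document has already been parsed into nested dictionaries by whatever TOML library is in use; the function name `non_ascii_keys` is hypothetical, and for brevity it does not descend into arrays of tables.

```python
from typing import Any, Dict, List


def non_ascii_keys(table: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect the dotted paths of all keys containing non-ASCII characters."""
    found: List[str] = []
    for key, value in table.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if not key.isascii():  # str.isascii() requires Python 3.7+
            found.append(dotted)
        if isinstance(value, dict):
            found.extend(non_ascii_keys(value, dotted))
    return found


# Stand-in for a parsed document; a real application would get this from
# its TOML parser.
parsed = {
    "1950": {
        "Labour": [13_266_176, 315, 617],
        "Sinn_F\u00e9in": [23_362, 0, 2],
    }
}

offending = non_ascii_keys(parsed)
if offending:
    print("non-ASCII keys rejected:", ", ".join(offending))
```

The check lives entirely in the application, so the format itself does not need to stay ASCII-only on behalf of the few consumers that require it.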