Unclear whether control characters are allowed in comments #567
Comments
It is.
If there's more people who think this would add value, we might do this. I don't want to complicate an otherwise straightforward explanation. |
The general theme in issues #566, #567, #568, #569 is that I believe the specification to be ambiguous. I assumed the text document is the complete, stand-alone, authoritative specification, while the ABNF has only experimental status. If indeed the ABNF is authoritative, perhaps these issues are resolved. |
From the point of view of a parser, it's probably easier to allow control characters (except for newline) in comments, because then the rule is "Once you see a comment marker, ignore everything until the next newline". This is simpler (at least for some parser libraries) to implement than "Once you see the comment marker, ignore all valid characters but throw an error for So I'd personally prefer for the rule to be "control characters are legal (and ignored) in comments; after a comment marker is seen, only the next newline matters." |
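To make the two rules being compared concrete, here is a minimal Python sketch. The function names `skip_comment_lenient` and `skip_comment_strict` are made up for illustration and are not taken from any TOML implementation. Both return the index of the newline that ends the comment. The strict variant assumes the control set discussed in this issue (U+0000..U+001F and U+007F, with tab allowed) and treats a CR that is immediately followed by LF as part of the line ending.

```python
def skip_comment_lenient(text: str, pos: int) -> int:
    """Lenient rule: after '#', ignore everything up to the next newline.

    Returns the index of the terminating LF, or len(text) at end of input.
    """
    assert text[pos] == "#"
    end = text.find("\n", pos)
    return len(text) if end == -1 else end


def skip_comment_strict(text: str, pos: int) -> int:
    """Strict rule: same scan, but reject control characters other than tab.

    A CR directly followed by LF is treated as a line ending, not as a
    forbidden control character (assumption made for this sketch).
    """
    assert text[pos] == "#"
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            return i
        if ch == "\r" and text[i + 1 : i + 2] == "\n":
            return i + 1
        if ch != "\t" and (ord(ch) < 0x20 or ord(ch) == 0x7F):
            raise ValueError(f"control character U+{ord(ch):04X} in comment at offset {i}")
        i += 1
    return len(text)
```

The lenient version is a single `str.find` call, while the strict one has to look at every character, which is the implementation cost this comment is pointing at.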
@rmunn On the other hand, if control chars are prohibited in comments too, they would be prohibited anywhere in a TOML document since they are also prohibited in strings of all kinds (and they are certainly not allowed outside of strings or comments). So prohibiting them everywhere might actually be quite simple for many parsers. |
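A rough sketch of that "prohibit them everywhere" approach, in Python: one pass over the whole document before any real parsing. The helper name `find_forbidden_control` is hypothetical. The sketch assumes that tab, LF, and CR-followed-by-LF are the only control sequences a document may contain, which is how the thread describes the rules; check the spec before relying on that.

```python
import re

# Any C0 control except tab (0x09), LF (0x0A) and CR (0x0D), plus DEL (0x7F),
# or a CR that is not immediately followed by LF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]|\r(?!\n)")


def find_forbidden_control(doc: str):
    """Return the offset of the first forbidden control character, or None."""
    match = _FORBIDDEN.search(doc)
    return None if match is None else match.start()
```

With a pre-scan like this, the comment rule in the tokenizer itself can stay "skip to the next newline", because anything forbidden has already been rejected.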
Currently, I am ambivalent on this issue tbh. I don't see any major gains either way so I'm inclined to say status quo would win here. I can be convinced either way though. |
@pradyunsg I assume that by "status quo" you mean the status quo as defined by the ABNF (which prohibits control chars in comments)? The original poster's point was exactly that the status quo is not defined in the human-readable spec. I too would advise clarifying this in the written spec in addition to the ABNF, for people who read English better than Backus–Naur.
Thanks everyone for the patience here -- I took a long time to come back around to this. Yes, I think forbidding control characters in comments makes sense. If there's a significant use case for control characters in comments, do holler. Meanwhile, I think what we need here is a PR clarifying in text what's already clear in the ABNF -- that control characters are not allowed in comments. |
- implemented hexfloats
- handled trailing underscores in ints and floats
- improved parse performance for single-digit integers
- handled control characters in comments (per toml-lang/toml#567)
I understand this is closed now, and I also understand the logic behind the decision, however I think the following has not been considered and might make you reconsider the decision here: If a comment with certain Unicode characters (except newline) is to be considered invalid, then this has the following drawbacks:
Bottom line of this argument, I'd like to propose a simplification: define a comment as any range of code points, until a newline is encountered. This is simpler, and clearer for end users. |
As an implementation author, I agree with @abelbraaksma above. Firstly, it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users - it results in documents being rejected for content that is commented out, which is itself an indication by the user that the content should be ignored by the parser. |
I don't seriously object to allowing control characters in comments. Are there any security concerns about them here? I could object to a delete char, If you think it's merited, let's reopen this issue, and I can compose a PR that would allow most everything but newlines in comments. |
I understand these sentiments, but currently the spec says "Control characters other than tab ... are not permitted in comments." So if we change this again, all 1.0-compatible parsers would supposedly have to be changed again. Also some implementations might conceivably check for control chars before doing any further parsing step? If so, like I wrote earlier: "prohibiting them everywhere might actually be quite simple for many parsers." Maybe this is a good case to let implementations decide on their behavior? The spec could say something like:
In that way, both rejecting control chars and ignoring them together with the rest of the comment would both be fine. This language would have to be adjusted regarding Unicode validity too, but here I think the same logic applies: some parsers might get the rejection of invalid Unicode sequences for free, because they use an OS/library function that does it for them. For them, if we require that "any range of code points" is accepted in the comment, whether valid Unicode or not, might actually make things harder. So "it's implementation-dependent" may be the best course of action too. |
@ChristianSi We don't need to say anything if implementations are going to choose their own behavior for handling comments. We just need to be simple and obvious. So we can simplify things to their essentials, and leave the rest to interpretation. It's a given that any TOML document must be a valid Unicode document encoded in UTF-8. So we need not worry about invalid code points (like surrogates) or invalid byte strings, which would yield well-defined errors of their own. One concern that wasn't addressed is whether or not the 0x0D in a Windows-type newline ought to be ignored. Since the line feed character is central to how comments work, we need only to mention that either type of newline can mark the end of a comment. I propose this text to replace the restrictions at the end of the Comments section in
That's about all that needs to be said, I think. Simple parsers can just start from the |
For now, I've implemented the stricter mechanism in TomlJ - but in a bit of a sketchy way: I lex out any comment as That works only because I know control characters aren't allowed anywhere else, so will immediately generate an error (except for |
@cleishm, not sure if this is feasible, but if there are inconsistent line endings, I'd report that as a specific error. Something like "Inconsistent line endings detected in file". Personally, I think any combination of |
@abelbraaksma For most parsers, it's probably not feasible. In my case, I use the ANTLR lexer (tokenizer) to detect newlines so that the parser only has to deal with the newline token. To give that kind of error would mean the lexer needs to be stateful - recording that it saw a newline of one type and then of a different type later. Not impossible, of course, but not trivial either. |
@cleishm, I’m not sure how the lexer in your case defines the errors, but there ought to be a location where that exception is thrown. Since it’s only ever going to be a new line character that could possibly mean “wrong/corrupt line endings” and since it could mean nothing else, you could just make the method that throws the error itself smarter, ie by simply switching over whether the incorrect token is a It’s been many years ago that I worked with ANTLR, so forgive me if I’m missing the obvious, and oversimplify things… |
As it stands, a TOML document with mixed LFs and CRLFs for line endings should not produce an error. Either line ending would be handled properly as a newline. And within multiline strings, the parser will normalize the line endings in the resulting strings. |
Yes. But I think we were talking about sole CRs here, which are explicitly disallowed, when not followed by an LF. |
Which I agree with. My point is that the error message should not complain about "inconsistent" line endings, because they actually can be inconsistent, as long as only the two permitted line endings are used. |
Totally, I didn’t mean to muddy the waters. Sorry for the confusion! |
In TOMLJ, sole CRs are already raised as errors, and a newline is tokenized from It won't handle documents where only CRs are used as line endings, but that is not currently permitted by the spec (and such documents are very uncommon now anyway). |
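For readers following the line-ending part of this thread, the behavior described in the last few comments (LF and CRLF both accepted, mixing them is fine, a lone CR is an error) could look roughly like the Python sketch below. This is not the TomlJ or ANTLR implementation; `split_toml_lines` is a name invented for this example.

```python
def split_toml_lines(doc: str) -> list[str]:
    """Split on LF or CRLF; raise ValueError on a CR not followed by LF."""
    lines = []
    start = 0
    i = 0
    while i < len(doc):
        ch = doc[i]
        if ch == "\n":
            lines.append(doc[start:i])
            start = i + 1
        elif ch == "\r":
            if doc[i + 1 : i + 2] != "\n":
                raise ValueError(f"bare carriage return at offset {i}")
            lines.append(doc[start:i])
            i += 1  # skip the LF of the CRLF pair
            start = i + 1
        i += 1
    if start < len(doc):
        lines.append(doc[start:])  # final line without a trailing newline
    return lines
```

Note that the error is about the bare CR itself, not about "inconsistent" endings, since a mix of LF and CRLF passes without complaint.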
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I fix it. No one reports a problem? There is nothing to fix so I go drink beer.

No one really reported this as a problem, so there isn't anything to fix. But it *does* introduce entirely needless churn for all TOML implementations.

Do we need to forbid *anything* in comments? Probably not. In strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

And [none of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are so close to zero that they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... another (smaller) set. That's not really a substantial change.

The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this.

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue writing a fairly large number of extensive comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only superficial bearing on actual pragmatic reality.

Fixes toml-lang#995
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I drink coffee and fix it. No one reports a problem? There is nothing to fix and I go drink beer.

No one really reported this as a problem, but it *does* introduce needless churn for all TOML implementations and the test suite.

Do we need to forbid *anything* in comments? Probably not, and in strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

[None of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are so close to zero that they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... "disallow this set of control characters" (but for a different set). That's not really a substantial or meaningful change.

The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this, so...

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue writing a fairly large number of large comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only passing familiarity with any actual pragmatic reality.

Fixes toml-lang#995
It is not clear from the specification whether control characters (U+0000 .. U+001F and U+007F) are allowed in comments.
The section Comment only states that "A hash symbol marks the rest of the line as a comment.".
This implies that any non-newline characters are allowed after the hash symbol.
Discussion about control characters only appears in the section String, which should have no implication on comments.
The ABNF definition forbids control characters (except tab) in comments, but the ABNF is not authoritative.
I think the section Comment should explicitly state which characters are allowed in comments.
Note that it would be illogical to allow e.g. form-feed inside comments, because form-feed is traditionally a stronger separator than newline.