http: Reject paths containing non-ASCII characters #3062
Conversation
http would previously accept paths with non-ASCII characters. This proved problematic, because multi-byte characters were encoded as 'binary', that is, the first byte was taken and the remaining bytes were dropped for that character.

There is no sensible way to fix this without breaking backwards compatibility for paths containing U+0080 to U+00FF characters. We already reject paths with unescaped spaces with an exception. This commit does the same for paths with non-ASCII characters.

The alternative would have been to encode paths in UTF-8, but this would silently change the behaviour for paths with single-byte non-ASCII characters (e.g. the copyright character U+00A9 ©). I find it preferable to add to the existing prohibition of bad paths with spaces.

Bug report: nodejs#2114
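The truncation described above is easy to reproduce with Node's Buffer API (a minimal sketch, assuming a Node.js runtime where 'binary' is an alias for 'latin1'):

```javascript
// 'binary' (an alias for 'latin1') keeps only the low byte of each code point.
// U+2603 '☃' (0x2603) is silently reduced to the single byte 0x03.
const truncated = Buffer.from('\u2603', 'latin1');
console.log(truncated); // <Buffer 03>
```

Any code point above U+00FF loses data this way, which is exactly the silent corruption this pull request turns into an exception.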
Hmmm.. this is a step in the right direction, but it doesn't address the full problem. Strings like
Done.
If we're throwing against invalid whitespace, might as well make it all invalid whitespace, tabs included. They may not be as much of a risk as newlines, but they're still invalid.
Done.
@@ -41,13 +41,16 @@ function ClientRequest(options, cb) {
   if (self.agent && self.agent.protocol)
     expectedProtocol = self.agent.protocol;

-  if (options.path && / /.test(options.path)) {
+  if (options.path && ! /^[\x00-\x08\x0E-\x1F\x21-\x7F]*$/.test(options.path)) {
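For reference, the new test from the diff behaves like this (a quick sketch using the regex verbatim, with example paths of my own choosing):

```javascript
// The whole path must consist of ASCII characters, excluding \t-\r and space.
const valid = /^[\x00-\x08\x0E-\x1F\x21-\x7F]*$/;
console.log(valid.test('/index.html')); // true:  plain ASCII path
console.log(valid.test('/with space')); // false: unescaped space (0x20)
console.log(valid.test('/a\tb'));       // false: tab (0x09)
console.log(valid.test('/caf\u00E9'));  // false: non-ASCII U+00E9
```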
I wonder if it wouldn't be better to use a negated character set?
The highest Unicode code point is U+10FFFF, but a higher one could be introduced in the future. That, along with the complications of non-BMP code points in JavaScript, makes writing a correct, future-proof regex with a negated class tricky. In any case, both this regex and one using a negated class are O(n), so I see no need for a change.
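For comparison, the negated-class formulation being discussed would flip the logic: match any single disallowed character rather than require the whole path to match (a sketch, not from the patch):

```javascript
// Negated class: true iff the path contains at least one disallowed character.
const bad = /[^\x00-\x08\x0E-\x1F\x21-\x7F]/;
console.log(bad.test('/with space')); // true:  contains 0x20
console.log(bad.test('/index.html')); // false: all characters allowed
```

Both forms scan the string once, which is the O(n) equivalence noted above.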
Was this fixed by #8923?
No, #8923 only rejects characters <= U+0020.
ping @nodejs/http
@jasnell Absolutely. I can rebase it or whatever is preferred now that there are merge conflicts. I can take the time to make sure the bug is still present in newer Node versions. If there is anything I can do differently this time to make sure that the pull request gets merged or rejected, let me know.
FWIW I think using a lookup table will probably yield the best performance, instead of a regexp.
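A lookup-table version might look roughly like this (a hypothetical sketch; the names are mine, not from the patch, and it mirrors the same character set as the regex above):

```javascript
// One byte per ASCII code point: 1 = allowed, 0 = rejected.
// Rejects \t \n \v \f \r, space, and anything above 0x7F,
// matching /^[\x00-\x08\x0E-\x1F\x21-\x7F]*$/.
const VALID = new Uint8Array(128).fill(1);
for (let i = 0x09; i <= 0x0D; i++) VALID[i] = 0; // \t \n \v \f \r
VALID[0x20] = 0;                                 // space

function isValidPath(path) {
  for (let i = 0; i < path.length; i++) {
    const c = path.charCodeAt(i);
    if (c > 0x7F || VALID[c] === 0) return false; // non-ASCII or disallowed
  }
  return true;
}

console.log(isValidPath('/index.html')); // true
console.log(isValidPath('/with space')); // false
```

The appeal is a single bounds check plus an array read per character, with no regex engine overhead.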
ping @Flimm, can you rebase this? Thanks.
I don't know if this can land as-is, even when rebased. Backwards compatibility is a concern (UTF-8 is used in the wild) and I wouldn't want to vouch that it works with different combinations of header/body encodings.
@bnoordhuis Can someone help me out in knowing what the process is for getting this approach approved? Do I just need to convince one person with commit rights to merge something like this?

This patch does not behave identically to previous versions, so in that sense, it is not backwards compatible. But the way it used to behave is completely broken for characters greater than U+00FF, I hope you can agree. And the way it behaves for characters U+0080 to U+00FF is also weird: it behaves leniently, even though the same code already throws an exception when it comes to spaces. An exception is already thrown when the invalid character space is given as input; all this is doing is making sure an exception is thrown for other invalid characters, instead of irreversibly throwing away data (by only considering the first byte of multi-byte characters).

@bnoordhuis If this approach is not the best one, which approach would you take instead?
@Flimm ... this has been an ongoing issue with the current HTTP/1 implementation and is a difficult problem to address. As @bnoordhuis points out, there is a significant amount of existing code that uses UTF-8 in the path that would be broken if we started rejecting such values outright. Our policy has been to avoid such breaking changes when possible unless the changes are necessary to address security concerns.

Personally, I'm a big fan of strict spec compliance, in which case rejecting is technically the right thing to do, but the backwards compatibility concerns cannot be ignored and we'll need to weigh those carefully. Another possible approach that we can take is to perform additional pct-encoding on those characters rather than throwing. Doing so would come at a performance cost and would likely also need to be carefully evaluated to ensure it wouldn't break existing code.

In terms of our process for getting things landed, however... this change qualifies as a
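The pct-encoding alternative mentioned above could look roughly like this (a sketch using the standard encodeURI, which percent-encodes non-ASCII characters as UTF-8 bytes while leaving reserved URL characters such as '/' intact; the patch itself does not do this):

```javascript
// Percent-encode the path instead of throwing: U+00E9 becomes %C3%A9.
const encoded = encodeURI('/caf\u00E9');
console.log(encoded); // '/caf%C3%A9'
```

This avoids the exception, at the cost of a pass over every outgoing path and a silent behaviour change for U+0080 to U+00FF paths.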
It's complicated. The set of characters to reject depends on the encoding used for the request headers. That in turn is influenced by the encoding of the request body, because node.js tries hard to pack the headers and the body into a single outgoing packet.

An example: U+010A ('Ċ') is fine with encoding="utf8"; it decodes to bytes C4 8A. The same code point should be rejected with encoding="binary" (or "latin1") because it decodes to byte 0A, a newline. It was arguably unwise to truncate code points > U+FF like that, but it goes back all the way to node.js v0.1.x; it's hard to change now.
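The U+010A example above can be checked directly (a sketch; 'binary' is an alias for 'latin1' in current Node):

```javascript
// The same code point yields very different bytes depending on encoding:
const utf8 = Buffer.from('\u010A', 'utf8');   // <Buffer c4 8a> -- harmless
const bin  = Buffer.from('\u010A', 'latin1'); // <Buffer 0a>    -- a raw newline
console.log(utf8, bin);
```

A newline byte smuggled into the request line is precisely the header-injection risk that makes the binary truncation dangerous, not merely lossy.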
@Flimm Sorry that we haven't landed this yet. The general process is that a PR needs at least two approvals, and no objections. Seems like it will be very hard to find consensus on this PR. How invested are you in the change?
I'll be honest, I'm not very confident that any more effort on my side is going to help. We need a core contributor to approve or disapprove the idea of this fix. It looks like everyone is focussing on the proposed fix (throwing an exception instead of silently corrupting data), but no one is focussing on the fact that Node is currently silently corrupting data. We need a fix, even if it's not this one.

I've created a separate issue for the fact that Node is silently corrupting Unicode paths in requests here: #13296, and a pull request with a test case that illustrates the bug here: #13297
I'm OK with closing this pull request. I want the bug to be fixed; it doesn't have to be through throwing an exception on Unicode input. Thanks @fhinkel and others who made sure this pull request didn't completely fall through the cracks :)