http: Reject paths containing non-ASCII characters #3062

Flimm · 2015-09-25T15:52:59Z

http would previously accept paths with non-ASCII characters. This
proved problematic, because multi-byte characters were encoded as
'binary', that is, the first byte was taken and the remaining bytes were
dropped for that character.

There is no sensible way to fix this without breaking backwards
compatibility for paths containing U+0080 to U+00FF characters.

We already reject paths with unescaped spaces with an exception. This
commit does the same for paths with non-ASCII characters too.

The alternative would have been to encode paths in UTF-8, but this would
cause the behaviour to silently change for paths with single-byte
non-ASCII characters (e.g. the copyright character U+00A9 ©). I find it
preferable to add to the existing prohibition of bad paths with
spaces.

Fixes #2114
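
As a rough illustration of the trade-off described above, the sketch below prints the bytes a path would produce under each encoding. The helper name `pathBytes` is made up for this example, and `'latin1'` is used as the current alias for the `'binary'` encoding; this is not the code path http itself uses.

```javascript
// Sketch: serialize the same path under two encodings and show the bytes as hex.
// pathBytes is a hypothetical helper, not part of the http module.
function pathBytes(path, encoding) {
  return Buffer.from(path, encoding).toString('hex');
}

// U+00A9 (©) fits in one latin1 byte but needs two bytes in UTF-8,
// so switching encodings would silently change what goes on the wire.
console.log(pathBytes('/\u00A9', 'latin1')); // 2fa9
console.log(pathBytes('/\u00A9', 'utf8'));   // 2fc2a9

// U+20AC (€) is above U+00FF: latin1 keeps only the low byte (0xAC),
// so the original character cannot be recovered.
console.log(pathBytes('/\u20AC', 'latin1')); // 2fac
console.log(pathBytes('/\u20AC', 'utf8'));   // 2fe282ac
```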


jasnell · 2015-09-25T15:59:07Z

Hmmm.. this is a step in the right direction, but it doesn't address the full problem. Strings like "a\nb" are still passed through. Can you please update the regexp to catch all whitespace (newlines, tabs, etc.)?

Flimm · 2015-09-25T16:22:06Z

Done.

jasnell · 2015-09-25T16:36:44Z

If we're throwing against invalid whitespace, might as well make it all invalid whitespace, tabs included. They may not be as much of a risk as newlines, but they're still invalid.

Flimm · 2015-10-01T10:14:07Z

Done.

Fishrock123 · 2015-10-16T04:03:28Z

lib/_http_client.js

@@ -41,13 +41,16 @@ function ClientRequest(options, cb) {
  if (self.agent && self.agent.protocol)
    expectedProtocol = self.agent.protocol;

-  if (options.path && / /.test(options.path)) {
+  if (options.path && ! /^[\x00-\x08\x0E-\x1F\x21-\x7F]*$/.test(options.path)) {


I wonder if it wouldn't be better to use a negated character set?

Flimm · 2015-10-16

The highest Unicode code point is U+10FFFF, but a higher one could be introduced in the future. That, along with the complications of non-BMP code points in JavaScript, makes writing a correct, future-proof regex with a negated class tricky. In any case, both this regex and one using a negated character class should be O(n), so I see no need for a change.
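
For readers comparing the two forms discussed here, the sketch below puts the allow-list pattern from the diff next to one possible negated variant. The negated pattern and the function names are assumptions made for this example; they do not appear in the pull request. Neither regexp uses the `u` flag, so both work on UTF-16 code units, and a non-BMP character is seen as two surrogate halves.

```javascript
// Allow-list pattern from the diff: every character must be in the set.
const ALLOWED = /^[\x00-\x08\x0E-\x1F\x21-\x7F]*$/;
// Hypothetical negated form: search for one character outside the set.
const FORBIDDEN = /[^\x00-\x08\x0E-\x1F\x21-\x7F]/;

function isValidAllowList(path) {
  return ALLOWED.test(path);
}

function isValidNegated(path) {
  return !FORBIDDEN.test(path);
}

// A few sample paths: plain ASCII, space, tab, newline, Latin-1, non-BMP.
const samples = ['/index.html', '/a b', '/a\tb', '/a\nb', '/caf\u00E9', '/\u{1F600}'];
for (const p of samples) {
  console.log(JSON.stringify(p), isValidAllowList(p), isValidNegated(p));
}
```

On these samples the two checks agree. That is only a spot check, not an argument that the negated form is equivalent for every input.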

evanlucas · 2016-12-08T23:06:47Z

Was this fixed by #8923?

bnoordhuis · 2016-12-08T23:26:30Z

No, #8923 only rejects characters <= U+0020.

jasnell · 2017-02-28T22:56:13Z

ping @nodejs/http
@Flimm ... is this still something you'd like to pursue?

Flimm · 2017-03-01T08:51:12Z

@jasnell Absolutely. I can rebase it or whatever is preferred now that there are merge conflicts. I can take the time to make sure the bug is still present in newer Node versions. If there is anything I can do differently this time to make sure that the pull request gets merged or rejected, let me know.

mscdex · 2017-03-01T09:27:11Z

FWIW I think using a lookup table will probably yield the best performance, instead of a regexp.
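
A lookup-table check along these lines might look like the sketch below. The table `VALID` and the function `isValidPath` are hypothetical names, the accepted set simply mirrors the regexp in the diff, and no performance measurement is implied.

```javascript
// Hypothetical lookup table: VALID[code] is 1 when the code unit is allowed.
// The allowed set mirrors the regexp in the diff: 0x00-0x08, 0x0E-0x1F, 0x21-0x7F.
const VALID = new Uint8Array(128);
for (let c = 0; c < 128; c++) {
  const isWhitespace = (c >= 0x09 && c <= 0x0D) || c === 0x20;
  VALID[c] = isWhitespace ? 0 : 1;
}

function isValidPath(path) {
  for (let i = 0; i < path.length; i++) {
    const code = path.charCodeAt(i);
    // Anything at or above 0x80 is outside the table and therefore rejected.
    if (code >= 128 || VALID[code] === 0) return false;
  }
  return true;
}

console.log(isValidPath('/index.html')); // true
console.log(isValidPath('/a b'));        // false
console.log(isValidPath('/caf\u00E9'));  // false
```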

fhinkel · 2017-03-26T10:49:52Z

ping @Flimm, can you rebase this? Thanks.

bnoordhuis · 2017-03-26T11:12:11Z

I don't know if this can land as-is, even when rebased. Backwards compatibility is a concern - UTF-8 is used in the wild - and I wouldn't want to vouch it works with different combinations of header/body encodings.

Flimm · 2017-03-27T09:56:02Z

@bnoordhuis Can someone help me out in knowing what the process is for getting this approach approved? Do I just need to convince one person with commit rights to merge something like this?

This patch does not behave identically to previous versions, so in that sense it is not backwards compatible. But I hope you can agree that the previous behaviour is completely broken for characters greater than U+00FF. The behaviour for characters U+0080 to U+00FF is also odd: it is lenient, even though the same code already throws an exception for spaces. An exception is already thrown when the invalid space character is given as input. All this patch does is make sure an exception is also thrown for other invalid characters, instead of irreversibly throwing away data by keeping only the first byte of multi-byte characters.

@bnoordhuis If this approach is not the best one, which approach would you take instead?

jasnell · 2017-03-27T16:16:27Z

@Flimm ... this has been an ongoing issue with the current HTTP/1 implementation and is a difficult problem to address. As @bnoordhuis points out, there is a significant amount of existing code that uses UTF8 in the path that would be broken if we started rejecting such values outright. Our policy has been to avoid such breaking changes when possible unless the changes are necessary to address security concerns. Personally, I'm a big fan of strict spec compliance, in which case rejecting is technically the right thing to do, but the backwards compatibility concerns cannot be ignored and we'll need to weigh those carefully.

Another possible approach that we can take is to perform additional pct-encoding on those characters rather than throwing. Doing so would come at a performance cost and would likely also need to be carefully evaluated to ensure it wouldn't break existing code.

In terms of our process for getting things landed, however... this change qualifies as a semver-major, which requires sign-off from at least two members of the @nodejs/ctc before it can land.
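
The pct-encoding alternative mentioned above could be sketched roughly as follows. The function `escapeNonAscii` is a hypothetical name, and the sketch assumes UTF-8 as the byte encoding behind the percent-escapes, which is what `encodeURIComponent` produces. ASCII characters, including existing `%XX` escapes, are left untouched.

```javascript
// Hypothetical helper: percent-encode every run of non-ASCII characters
// as UTF-8 bytes, leaving all ASCII characters as they are.
function escapeNonAscii(path) {
  // Matching whole runs keeps surrogate pairs together for encodeURIComponent.
  return path.replace(/[^\x00-\x7F]+/g, (run) => encodeURIComponent(run));
}

console.log(escapeNonAscii('/caf\u00E9?q=\u20AC')); // /caf%C3%A9?q=%E2%82%AC
console.log(escapeNonAscii('/already%20escaped'));  // /already%20escaped
```

Note that `encodeURIComponent` throws a `URIError` on a lone surrogate, so a real implementation would have to decide how to treat such input.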

bnoordhuis · 2017-03-27T18:54:22Z

> If this approach is not the best one, which approach would you take instead?

It's complicated. The set of characters to reject depends on the encoding used for the request headers. That in turn is influenced by the encoding of the request body because node.js tries hard to pack the headers and the body into a single outgoing packet.

An example: U+010A ('Ċ') is fine with encoding="utf8"; it encodes to bytes C4 8A. The same codepoint should be rejected with encoding="binary" (or "latin1") because it encodes to byte 0A, a newline.

It was arguably unwise to truncate codepoints > U+FF like that but it goes back all the way to node.js v0.1.x - hard to change now.
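
The U+010A example can be reproduced with a short sketch. The helper `truncatesToControl` is hypothetical and only illustrates why the set of characters to reject depends on the encoding; it is not how node.js validates paths.

```javascript
// U+010A encodes to C4 8A in UTF-8, but latin1 keeps only the low byte, 0A.
const path = '/\u010A';
console.log(Buffer.from(path, 'utf8').toString('hex'));   // 2fc48a
console.log(Buffer.from(path, 'latin1').toString('hex')); // 2f0a

// Hypothetical helper: true when latin1 truncation would put a control
// character or a space (<= 0x20) on the wire.
function truncatesToControl(str) {
  for (let i = 0; i < str.length; i++) {
    if ((str.charCodeAt(i) & 0xFF) <= 0x20) return true;
  }
  return false;
}

console.log(truncatesToControl('/\u010A')); // true: low byte is 0x0A, a newline
console.log(truncatesToControl('/\u00E9')); // false: low byte is 0xE9
```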

fhinkel · 2017-05-23T19:09:54Z

@Flimm Sorry that we haven't landed this yet. The general process is that a PR needs at least two approvals, and no objections. Seems like it will be very hard to find consensus on this PR. How invested are you in the change?

Flimm · 2017-05-30T13:31:05Z

I'll be honest, I'm not very confident that any more effort on my side is going to help. We need a core contributor to approve or disapprove the idea of this fix. It looks like everyone is focussing on the proposed fix (throwing an exception instead of silently corrupting data), but no one is focussing on the fact that Node is currently silently corrupting data. We need a fix, even if it's not this one.

I've created a separate issue for the fact that Node is silently corrupting Unicode paths in requests: #13296. I've also created a pull request with a test case that illustrates the bug: #13297

Flimm · 2017-05-30T13:55:05Z

I'm OK with closing this pull request. I want the bug to be fixed; it doesn't have to be through throwing an exception on Unicode input. Thanks @fhinkel and others who made sure this pull request didn't completely fall through the cracks :)

mscdex added the http label Sep 25, 2015

http: Reject additional whitespace chars in paths

5743cae

http: Reject tab character in path

b4d6da8

Fishrock123 reviewed Oct 16, 2015

Trott force-pushed the master branch from 1e896a6 to 082cc8d Compare December 27, 2015 02:00

jasnell added the stalled label Mar 22, 2016

estliberitas force-pushed the master branch 2 times, most recently from 7da4fd4 to c7066fb Compare April 26, 2016 05:22

jasnell added the url label Jun 7, 2016

Trott force-pushed the master branch from b0df363 to c5ce7f4 Compare September 21, 2016 00:09

rvagg force-pushed the master branch 2 times, most recently from c133999 to 83c7a88 Compare October 18, 2016 17:01

MylesBorins force-pushed the master branch from 8df7ee0 to 54fef67 Compare February 1, 2017 01:00

jasnell added the semver-major label Mar 27, 2017

refack force-pushed the master branch from 16073c0 to fbe946b Compare April 14, 2017 04:11

Flimm closed this May 30, 2017

bnoordhuis mentioned this pull request May 31, 2017

http silently corrupts the request URL when it contains non-Latin-1 codepoints #13296

Closed
