unicode fragments #150

achristensen07 · 2016-10-07T22:02:10Z

Right now there is a note: "Unfortunately not using percent-encoding is intentional as implementations with majority market share exhibit this behavior."

I think we should change "Append c to url’s fragment." to at least "If url’s scheme is a special scheme, append c to url’s fragment. Otherwise, append the result of running UTF-8 percent encode with the simple encode set on c" or something similar.

Right now Firefox and Safari percent-encode non-ASCII characters in the fragment. Chrome percent-encodes non-ASCII characters in the fragment if the scheme is not a special scheme. Edge does not percent-encode any non-ASCII characters in the fragment.

This change is needed to continue to support percent-encoding non-ASCII characters after a '#' in a data URL, which works in Safari, Chrome, and Firefox.
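A minimal sketch of the rule proposed in this comment, for readers who want to see it spelled out. It assumes the simple encode set is the C0 controls plus every code point above U+007E; the scheme list and all function names are hypothetical and only for illustration, not taken from any browser's source.

```python
# Assumption: an illustrative subset of the spec's special schemes.
SPECIAL_SCHEMES = {"http", "https", "ftp", "file", "ws", "wss"}


def in_simple_encode_set(c: str) -> bool:
    """C0 controls (U+0000 to U+001F) and all code points above U+007E."""
    cp = ord(c)
    return cp <= 0x1F or cp > 0x7E


def utf8_percent_encode(c: str) -> str:
    """Percent-encode c as UTF-8 bytes if it is in the simple encode set."""
    if not in_simple_encode_set(c):
        return c
    return "".join(f"%{b:02X}" for b in c.encode("utf-8"))


def append_to_fragment(scheme: str, fragment: str, c: str) -> str:
    """Proposed rule: special schemes keep c as-is, others encode it."""
    if scheme in SPECIAL_SCHEMES:
        return fragment + c
    return fragment + utf8_percent_encode(c)


def build_fragment(scheme: str, raw: str) -> str:
    """Run the per-code-point rule over a whole raw fragment."""
    fragment = ""
    for c in raw:
        fragment = append_to_fragment(scheme, fragment, c)
    return fragment
```

With this sketch, `build_fragment("data", "β")` gives `%CE%B2`, while `build_fragment("http", "β")` leaves `β` untouched, which is the split between special and non-special schemes described here.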

achristensen07 · 2016-10-07T22:56:18Z

Ideally, I think we'd all just switch to percent encoding all non-ASCII characters in all fragments so the output of URL parsing is always ASCII. Would everyone be willing to make such a change?

https://bugs.webkit.org/show_bug.cgi?id=163153

Reviewed by Tim Horton.

Source/WebCore:
This is needed to keep compatibility with data URLs with non-ASCII characters after a '#' which works in Chrome, Firefox, and Safari, while maintaining compatibility with Chrome, IE, and Edge which keep non-ASCII characters in the fragments of special URLs.
This was proposed to the spec in whatwg/url#150
Covered by new API tests.
* platform/URLParser.cpp:
(WebCore::URLParser::syntaxViolation): Removed assertion because we now have fragments that need percent encoding but are all ASCII.
(WebCore::URLParser::fragmentSyntaxViolation):
(WebCore::URLParser::parse):

Tools:
* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):

git-svn-id: http://svn.webkit.org/repository/webkit/trunk@206942 268f45cc-cd09-0410-ab3c-d52691b4dbfc

annevk · 2016-10-11T07:52:58Z

@valenting? I think @dbaron also wanted us to always percent-encode, to keep URLs "ASCII safe" and easier to copy-and-paste (due to whitespace and such).

valenting · 2016-10-11T08:00:42Z

I support percent encoding all characters in the hash. It turns out percent encoded strings are much easier to manage and parse.

…s in fragment
https://bugs.webkit.org/show_bug.cgi?id=163287

Reviewed by Brady Eidson.

Source/WebCore:
Based on discussion in whatwg/url#150
If that discussion decides to keep the spec as-is (which keeps non-ASCII characters in the fragment to match IE and Edge's behavior, which Chrome has followed for special schemes) then we can revert this change later after enabling the URL parser.
Making this change keeps behavior matching Safari and Firefox, as well as Chrome's handling of non-special schemes, such as data URLs.
Covered by updated API tests.
* platform/URLParser.cpp:
(WebCore::URLParser::appendToASCIIBuffer):
(WebCore::URLParser::copyURLPartsUntil):
(WebCore::URLParser::syntaxViolation):
(WebCore::URLParser::currentPosition):
(WebCore::URLParser::parse):
(WebCore::URLParser::fragmentSyntaxViolation): Deleted.
* platform/URLParser.h: No more non-ASCII characters in canonicalized URLs.

Tools:
* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):

git-svn-id: http://svn.webkit.org/repository/webkit/trunk@207152 268f45cc-cd09-0410-ab3c-d52691b4dbfc

annevk · 2016-10-19T12:43:45Z

It seems HTML already decodes fragment identifiers in https://html.spec.whatwg.org/multipage/browsers.html#the-indicated-part-of-the-document so presumably that would keep working.

But I suspect there might be websites that depend on fragment not being encoded?

@achristensen07 did you verify that all APIs for URLs in WebKit return the URL with the fragment encoded as well? Or in order to have URLs with the fragment encoded would we need to add special casing to APIs?
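To illustrate why an always-encoded fragment could keep working with HTML's fragment lookup, here is a simplified sketch: try the fragment as-is first, then retry with the percent-decoded, UTF-8-decoded form. The function name and the set-of-ids model of a document are assumptions for illustration; the HTML algorithm has further steps (such as the empty fragment and `top`) that are left out.

```python
from typing import Optional, Set
from urllib.parse import unquote_to_bytes


def find_indicated_id(ids: Set[str], fragment: str) -> Optional[str]:
    """Return the element id a fragment would point at, or None.

    `ids` stands in for the ids present in a document (hypothetical model).
    """
    # First attempt: the fragment exactly as it appears in the URL.
    if fragment in ids:
        return fragment
    # Second attempt: percent-decode to bytes, then decode as UTF-8.
    # Invalid byte sequences become U+FFFD rather than raising.
    decoded = unquote_to_bytes(fragment).decode("utf-8", errors="replace")
    if decoded in ids:
        return decoded
    return None
```

Under this model, a document with an element whose id is `β` is still reachable through `#%CE%B2`, so encoding the fragment in the parser does not by itself break navigation to that element.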

achristensen07 · 2016-10-20T02:20:05Z

In WebKit, fragments are encoded/serialized as they are parsed, so at no point does a URL contain an unencoded fragment.

annevk · 2016-10-20T13:13:24Z

@achristensen07 that's understood, but my question is whether there are any APIs, such as location.hash, that might yield them unencoded (essentially with the accessor doing its own conversion).

achristensen07 · 2016-10-20T16:07:26Z

I do not think we ever percent-decode fragments in WebKit. Going to the indicated part of the document with percent-encoded fragments or non-ASCII fragments which we percent encode doesn't work in WebKit right now, which we need to fix.

dbaron · 2016-10-20T16:42:43Z

> I think @dbaron also wanted us to always percent-encode, to keep URLs "ASCII safe" and easier to copy-and-paste (due to whitespace and such).

In particular, the argument I made was in Mozilla bug 1148861 (comment 0 and comment 25).

annevk · 2016-10-21T06:09:16Z

So Gecko seems to be encoding too (although the address bar decodes). So that leaves Chrome and Edge. @foolip who's a good contact for URL parsing in Chrome?

sleevi · 2016-10-21T13:47:13Z

I'm still lurking and watching these bugs, @annevk, at least for our low-level implementation (and how it relates to the URL bar). When it gets to the Blink side, @mikewest is good to poke just to make sure I don't stuff it all up :)

foolip · 2016-10-21T14:08:12Z

Thanks @sleevi. @sof has also poked a bit at the Blink-side URL.

annevk · 2016-10-21T14:21:53Z

@sleevi I guess the question is then whether you'd consider always encoding fragments. (Or is that already happening and is the Blink side doing magic in the API layer?)

sleevi · 2016-10-21T14:27:08Z

@annevk Yeah, I'm planning on looking into this today and trying to get an answer. I seem to recall that it's similar to Gecko, so I suspect our URL substrate is decoding, and any encoding would need to be done at the Blink glue to that (the KURL and bindings side). The upside is that the change would be easier to make, since it would only affect Chrome (our URL code is shared with a growing number of Google products due to Cronet embedding); the downside is that it may require a more extensive audit of where we expose those URLs in a web-visible form.

I'm not hearing a request to change address bar behavior, but did I misunderstand the remarks re: FF? My gut is that it seems to make sense to change that too, to ensure authors aren't going to copy/paste?

annevk · 2016-10-21T14:35:54Z

The address bar is a hard problem. If we always show everything encoded, we basically make it useless for some set of pages and people, e.g., Wikipedia articles with titles in Kanji visited by folks that can read Japanese. On the other hand, I don't think most people should really be looking at the address bar beyond the domain (and the rest can be hidden until you need to copy), but folks are going to disagree on that one and only Safari is really that far ahead right now.

There's also the aspect that we cannot dictate the address bar since it's UX. And you could potentially show decoded, but copy encoded, etc. I think for this issue we should consider it out-of-scope, and just consider the effects we can directly observe through JavaScript and following links in documents.

annevk · 2016-10-28T11:55:02Z

@sleevi any update? I'm inclined to go with always encoding the fragment since it would be the most consistent (URL members would only ever contain ASCII) and it appears that's what Firefox and Safari get away with.

annevk · 2016-12-08T02:27:50Z

I created a PR to change the specification here. I'd appreciate review. @achristensen07 do you know if this is already tested? I can change the tests.

annevk · 2016-12-08T02:54:52Z

FWIW, I started working on a patch for the tests and I noticed that Firefox has a few differences from my proposed change. It does not seem to drop U+0000 and it seems to preserve and encode newline and tab code points. Not sure that's what we want to keep given what we specified for other setters.

achristensen07 · 2016-12-08T03:00:13Z

This is tested by these existing tests:
Parsing: <#β> against http://example.org/foo/bar
Parsing: <http://www.google.com/foo?bar=baz# »> against about:blank
Parsing: <data:test# »> against about:blank
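As a rough illustration of what always encoding the fragment would do to the inputs in these tests, the sketch below encodes a raw fragment with the simple encode set (assumed here to be the C0 controls plus every code point above U+007E). The function name is hypothetical, and the results shown follow from that assumption rather than from the test suite itself.

```python
def encode_fragment(fragment: str) -> str:
    """Percent-encode every code point in the (assumed) simple encode set."""
    out = []
    for ch in fragment:
        cp = ord(ch)
        if cp <= 0x1F or cp > 0x7E:
            # Encode as UTF-8, one %XX escape per byte.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            # Printable ASCII, including the space, is left alone.
            out.append(ch)
    return "".join(out)


beta = encode_fragment("β")     # '%CE%B2'
guillemet = encode_fragment(" »")  # ' %C2%BB' (the space is not encoded)
```

Both results are pure ASCII, which is the property argued for in this thread. The space stays unencoded under this set, which is the open question raised about spaces.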

annevk · 2016-12-08T18:48:53Z

web-platform-tests/wpt#4298 has an update to the tests. I want to be sure though about what we want to encode, although I guess this change is still a step in the right direction so maybe we should make it regardless.

@valenting thoughts? Should we encode spaces as well? Is it okay to drop U+0000?

valenting · 2016-12-08T19:29:35Z

Right now Firefox drops \r\n\t in the URL constructor, and trims U+0000 to U+0020.
In the setters we encode all C0 control and space characters. This is true for pathname, search and hash.
I don't have any preference about this, but we should at least try to be consistent.

achristensen07 · 2016-12-08T20:23:50Z

Chrome and Safari don't encode spaces. I propose using the simple encode set.

annevk · 2016-12-08T20:29:46Z

@valenting the consistency we have now is that \r\n\t are always dropped. And trimming of U+0000 to U+0020 happens only for the full URL (not setters). Note that it should also happen for <a> and such, not just the URL constructor.

The main problem with not encoding spaces is that it's harder to copy-and-paste those URLs and drop them elsewhere, since most tooling assumes URLs do not contain spaces.

Anyway, if the test changes look okay I suggest we land this and then if there are still concerns we can fiddle with the details in a new issue.
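The consistency rule described in this comment can be sketched as two small preprocessing steps: tab and newline removal always applies, while trimming of U+0000 to U+0020 applies only to a full URL string and not to setter input. The function names are hypothetical, and this is only a sketch of the behaviour as described here, not an implementation from any browser.

```python
# Code points U+0000 through U+0020 (C0 controls and space).
C0_CONTROL_OR_SPACE = "".join(chr(cp) for cp in range(0x21))

# Deletion table for tab, LF and CR.
TAB_AND_NEWLINE = {0x09: None, 0x0A: None, 0x0D: None}


def strip_tab_and_newline(s: str) -> str:
    """Drop every tab, LF and CR, wherever it appears in the input."""
    return s.translate(TAB_AND_NEWLINE)


def preprocess_full_url(s: str) -> str:
    """Full URL input: trim leading/trailing U+0000..U+0020, then strip."""
    return strip_tab_and_newline(s.strip(C0_CONTROL_OR_SPACE))


def preprocess_setter_value(s: str) -> str:
    """Setter input (e.g. hash): no trimming, only tab/newline removal."""
    return strip_tab_and_newline(s)
```

For example, `preprocess_full_url("  \x00http://exa\tmple.org/#fr\nag \x1f")` yields `http://example.org/#frag`, whereas `preprocess_setter_value(" a\tb\r\n ")` keeps the surrounding spaces and yields `" ab "`.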

valenting · 2016-12-08T20:39:47Z

👍

zcorpan · 2016-12-08T22:30:17Z

Related issue #125

annevk · 2016-12-09T01:44:02Z

Thanks @zcorpan, we can use that as the follow-up for spaces.

This patch contains the following changes:

url: make IPv4 parser more spec compliant
* Return int64_t from ParseNumber to prevent overflow for valid big numbers
* Don't throw when there are more than 4 parts (it cannot be an IP address)
* Correctly interpret the address and don't always throw when there are numbers > 255
Ref: https://url.spec.whatwg.org/#concept-ipv4-parser
Fixes: nodejs#10306

url: percent encode fragment to follow spec change
Ref: whatwg/url#150
Ref: whatwg/url@373dbed

url: fix URL#search setter
The check for empty string must be done before removing the leading '?'.
Ref: https://url.spec.whatwg.org/#dom-url-search

url: set port to null if an empty string is given
This is to follow a spec change.
Ref: whatwg/url#113

url: fix parsing of paths with Windows drive letter

test: update WHATWG URL test fixtures

PR-URL: nodejs#10317
Reviewed-By: James M Snell
Reviewed-By: Benjamin Gruenbaum

annevk added a commit that referenced this issue Dec 8, 2016

Percent encode fragments too

be6cdb6

Fixes #150.

annevk mentioned this issue Dec 8, 2016

Percent encode fragments too #169

Merged

annevk added a commit to web-platform-tests/wpt that referenced this issue Dec 8, 2016

URL: percent encode in fragments

b263150

See whatwg/url#150.

annevk mentioned this issue Dec 8, 2016

URL: percent encode in fragments web-platform-tests/wpt#4298

Merged

annevk added a commit to web-platform-tests/wpt that referenced this issue Dec 9, 2016

URL: percent encode in fragments

8202564

See whatwg/url#150.

annevk closed this as completed in #169 Dec 9, 2016

annevk added a commit that referenced this issue Dec 9, 2016

Percent encode fragments too

373dbed

Now all components of a URL can be represented using ASCII strings or integers. Tests: web-platform-tests/wpt#4298. Fixes #150.

frewsxcv mentioned this issue Dec 9, 2016

Implement URL spec changes regarding 'percent encode fragments' servo/rust-url#246

Closed

rmisev mentioned this issue Sep 18, 2017

Consider percent-encoding more characters in "fragment state" #344

Closed
