Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode fragments #150

Closed
achristensen07 opened this issue Oct 7, 2016 · 25 comments
Closed

unicode fragments #150

achristensen07 opened this issue Oct 7, 2016 · 25 comments

Comments

@achristensen07
Copy link
Collaborator

Right now there is a note: "Unfortunately not using percent-encoding is intentional as implementations with majority market share exhibit this behavior."

I think we should change "Append c to url’s fragment." to at least "If url’s scheme is a special scheme, append c to url’s fragment. Otherwise, append the result of running UTF-8 percent encode with the simple encode set on c" or something similar.

Right now Firefox and Safari percent-encode non-ASCII characters in the fragment. Chrome percent-encodes non-ASCII characters in the fragment if the scheme is not a special scheme. Edge does not percent-encode any non-ASCII characters in the fragment.

This change is needed to continue to support percent-encoding non-ascii characters after a '#' in a data URL, which works in Safari, Chrome, and Firefox.

@achristensen07
Copy link
Collaborator Author

Ideally, I think we'd all just switch to percent encoding all non-ASCII characters in all fragments so the output of URL parsing is always ASCII. Would everyone be willing to make such a change?

kisg pushed a commit to paul99/webkit-mips that referenced this issue Oct 8, 2016
https://bugs.webkit.org/show_bug.cgi?id=163153

Reviewed by Tim Horton.

Source/WebCore:

This is needed to keep compatibility with data URLs with non-ASCII characters after a '#'
which works in Chrome, Firefox, and Safari, while maintaining compatibility with Chrome, IE, and Edge
which keep non-ASCII characters in the fragments of special URLs.
This was proposed to the spec in whatwg/url#150

Covered by new API tests.

* platform/URLParser.cpp:
(WebCore::URLParser::syntaxViolation):
Removed assertion because we now have fragments that need percent encoding but are all ASCII.
(WebCore::URLParser::fragmentSyntaxViolation):
(WebCore::URLParser::parse):

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):



git-svn-id: http://svn.webkit.org/repository/webkit/trunk@206942 268f45cc-cd09-0410-ab3c-d52691b4dbfc
@annevk
Copy link
Member

annevk commented Oct 11, 2016

@valenting? I think @dbaron also wanted us to always percent-encode, to keep URLs "ASCII safe" and easier to copy-and-paste (due to whitespace and such).

@valenting
Copy link
Collaborator

I support percent encoding all characters in the hash. It turns out percent encoded strings are much easier to manage and parse.

kisg pushed a commit to paul99/webkit-mips that referenced this issue Oct 11, 2016
…s in fragment

https://bugs.webkit.org/show_bug.cgi?id=163287

Reviewed by Brady Eidson.

Source/WebCore:

Based on discussion in whatwg/url#150
If that discussion decides to keep the spec as-is (which keeps non-ASCII characters in the fragment
to match IE and Edge's behavior, which Chrome has followed for special schemes) then we can revert
this change later after enabling the URL parser.  Making this change keeps behavior matching Safari
and Firefox, as well as Chrome's handling of non-special schemes, such as data URLs.

Covered by updated API tests.

* platform/URLParser.cpp:
(WebCore::URLParser::appendToASCIIBuffer):
(WebCore::URLParser::copyURLPartsUntil):
(WebCore::URLParser::syntaxViolation):
(WebCore::URLParser::currentPosition):
(WebCore::URLParser::parse):
(WebCore::URLParser::fragmentSyntaxViolation): Deleted.
* platform/URLParser.h:
No more non-ASCII characters in canonicalized URLs.

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):



git-svn-id: http://svn.webkit.org/repository/webkit/trunk@207152 268f45cc-cd09-0410-ab3c-d52691b4dbfc
@annevk
Copy link
Member

annevk commented Oct 19, 2016

It seems HTML already decodes fragment identifiers in https://html.spec.whatwg.org/multipage/browsers.html#the-indicated-part-of-the-document so presumably that would keep working.

But I suspect there might be websites that depend on fragment not being encoded?

@achristensen07 did you verify that all APIs for URLs in WebKit return the URL with the fragment encoded as well? Or in order to have URLs with the fragment encoded would we need to add special casing to APIs?

@achristensen07
Copy link
Collaborator Author

In WebKit, the fragments are encoded/serialized as they are parsed, so there is no time in which there will be an unencoded fragment in a URL

@annevk
Copy link
Member

annevk commented Oct 20, 2016

@achristensen07 that's understood, but my question is if there's any APIs such as location.hash that might yield them as unencoded (essentially with the accessor doing its conversion).

@achristensen07
Copy link
Collaborator Author

I do not think we ever percent-decode fragments in WebKit. Going to the indicated part of the document with percent-encoded fragments or non-ASCII fragments which we percent encode doesn't work in WebKit right now, which we need to fix.

@dbaron
Copy link
Member

dbaron commented Oct 20, 2016

I think @dbaron also wanted us to always percent-encode, to keep URLs "ASCII safe" and easier to copy-and-paste (due to whitespace and such).

In particular, the argument I made was in Mozilla bug 1148861 (comment 0 and comment 25).

@annevk
Copy link
Member

annevk commented Oct 21, 2016

So Gecko seems to be encoding too (although the address bar decodes). So that leaves Chrome and Edge. @foolip who's a good contact for URL parsing in Chrome?

@sleevi
Copy link

sleevi commented Oct 21, 2016

I'm still lurking watching these bugs, @annevk , at least for our low-level implementation (and how it relates to URL bar). When it get's to Blink side, @mikewest is good to poke just to make sure I don't stuff it all up :)

@foolip
Copy link
Member

foolip commented Oct 21, 2016

Thanks @sleevi. @sof has also poked a bit at the Blink-side URL.

@annevk
Copy link
Member

annevk commented Oct 21, 2016

@sleevi I guess the question is then whether you'd consider always encoding fragments. (Or is that already happening and is the Blink side doing magic in the API layer?)

@sleevi
Copy link

sleevi commented Oct 21, 2016

@annevk Yeah, I'm planning on looking into this today and trying to get an answer. I seem to recall that it's similar to Gecko, and so I suspect our URL substrate is decoding, and any encoding would need to be done at the Blink glue to that (the KURL and bindings side). On the upside, it makes it easier to change when it would only affect Chrome (our URL code is shared with a growing number of Google products due to Cronet embedding), the downside is it may require a more extensive audit of where we expose those URLs to a web-visible form.

I'm not hearing a request to change address bar behavior, but did I misunderstand the remarks re: FF? My gut is that it seems to make sense to change that too, to ensure authors aren't going to copy/paste?

@annevk
Copy link
Member

annevk commented Oct 21, 2016

The address bar is a hard problem. If we always show everything encoded, we basically make it useless for some set of pages and people, e.g., Wikipedia articles with titles in Kanji visited by folks that can read Japanese. On the other hand, I don't think most people should really be looking at the address bar beyond the domain (and the rest can be hidden until you need to copy), but folks are going to disagree on that one and only Safari is really that far ahead right now.

There's also the aspect that we cannot dictate the address bar since it's UX. And you could potentially show decoded, but copy encoded, etc. I think for this issue we should consider it out-of-scope, and just consider the effects we can directly observe through JavaScript and following links in documents.

@annevk
Copy link
Member

annevk commented Oct 28, 2016

@sleevi any update? I'm inclined to go with always encoding the fragment since it would be the most consistent (URL members would only ever contain ASCII) and it appears that's what Firefox and Safari get away with.

annevk added a commit that referenced this issue Dec 8, 2016
@annevk
Copy link
Member

annevk commented Dec 8, 2016

I created a PR to change the specification here. I'd appreciate review. @achristensen07 do you know if this is already tested? I can change the tests.

@annevk
Copy link
Member

annevk commented Dec 8, 2016

FWIW, I started working on a patch for the tests and I noticed that Firefox has a few differences from my proposed change. It does not seem to drop U+0000 and it seems to preserve and encode newline and tab code points. Not sure that's what we want to keep given what we specified for other setters.

@achristensen07
Copy link
Collaborator Author

This is tested by these existing tests:
Parsing: <#β> against http://example.org/foo/bar
Parsing: <http://www.google.com/foo?bar=baz# »> against about:blank
Parsing: <data:test# »> against about:blank

annevk added a commit to web-platform-tests/wpt that referenced this issue Dec 8, 2016
@annevk
Copy link
Member

annevk commented Dec 8, 2016

web-platform-tests/wpt#4298 has an update to the tests. I want to be sure though about what we want to encode, although I guess this change is still a step in the right direction so maybe we should make it regardless.

@valenting thoughts? Should we encode spaces as well? Is it okay to drop U+0000?

@valenting
Copy link
Collaborator

Right now Firefox drops \r\n\t in the URL constructor, and trims U+0000 to U+0020.
In the setters we encode all C0-and space characters. This is true for pathname, search and hash.
I don't have any preference about this, but we should at least try to be consistent.

@achristensen07
Copy link
Collaborator Author

Chrome and Safari don't encode spaces. I propose using the simple encode set

@annevk
Copy link
Member

annevk commented Dec 8, 2016

@valenting the consistency we have now is that \r\n\t are always dropped. And trimming of U+0000 to U+0020 happens only for the full URL (not setters). Note that it should also happen for <a> and such, not just the URL constructor.

The main problem with not encoding spaces is that it's harder to copy-and-paste those URLs and drop them elsewhere. Since most tooling assumes URLs do not contain spaces.

Anyway, if the test changes look okay I suggest we land this and then if there are still concerns we can fiddle with the details in a new issue.

@valenting
Copy link
Collaborator

👍

@zcorpan
Copy link
Member

zcorpan commented Dec 8, 2016

Related issue #125

@annevk
Copy link
Member

annevk commented Dec 9, 2016

Thanks @zcorpan, we can use that as the follow up for spaces.

annevk added a commit to web-platform-tests/wpt that referenced this issue Dec 9, 2016
annevk added a commit that referenced this issue Dec 9, 2016
Now all components of a URL can be represented using ASCII strings or integers.

Tests: web-platform-tests/wpt#4298.

Fixes #150.
targos added a commit to targos/node that referenced this issue Jan 1, 2017
This patch contains the following changes:

url: make IPv4 parser more spec compliant

* Return int64_t from ParseNumber to prevent overflow for valid big numbers
* Don't throw when there are more than 4 parts (it cannot be an IP
address)
* Correctly interpret the address and don't always throw when there are
numbers > 255

Ref: https://url.spec.whatwg.org/#concept-ipv4-parser
Fixes: nodejs#10306

url: percent encode fragment to follow spec change

Ref: whatwg/url#150
Ref: whatwg/url@373dbed

url: fix URL#search setter

The check for empty string must be done before removing the leading '?'.

Ref: https://url.spec.whatwg.org/#dom-url-search

url: set port to null if an empty string is given

This is to follow a spec change.

Ref: whatwg/url#113

url: fix parsing of paths with Windows drive letter

test: update WHATWG URL test fixtures

PR-URL: nodejs#10317
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
evanlucas pushed a commit to nodejs/node that referenced this issue Jan 3, 2017
This patch contains the following changes:

url: make IPv4 parser more spec compliant

* Return int64_t from ParseNumber to prevent overflow for valid big numbers
* Don't throw when there are more than 4 parts (it cannot be an IP
address)
* Correctly interpret the address and don't always throw when there are
numbers > 255

Ref: https://url.spec.whatwg.org/#concept-ipv4-parser
Fixes: #10306

url: percent encode fragment to follow spec change

Ref: whatwg/url#150
Ref: whatwg/url@373dbed

url: fix URL#search setter

The check for empty string must be done before removing the leading '?'.

Ref: https://url.spec.whatwg.org/#dom-url-search

url: set port to null if an empty string is given

This is to follow a spec change.

Ref: whatwg/url#113

url: fix parsing of paths with Windows drive letter

test: update WHATWG URL test fixtures

PR-URL: #10317
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
evanlucas pushed a commit to nodejs/node that referenced this issue Jan 4, 2017
This patch contains the following changes:

url: make IPv4 parser more spec compliant

* Return int64_t from ParseNumber to prevent overflow for valid big numbers
* Don't throw when there are more than 4 parts (it cannot be an IP
address)
* Correctly interpret the address and don't always throw when there are
numbers > 255

Ref: https://url.spec.whatwg.org/#concept-ipv4-parser
Fixes: #10306

url: percent encode fragment to follow spec change

Ref: whatwg/url#150
Ref: whatwg/url@373dbed

url: fix URL#search setter

The check for empty string must be done before removing the leading '?'.

Ref: https://url.spec.whatwg.org/#dom-url-search

url: set port to null if an empty string is given

This is to follow a spec change.

Ref: whatwg/url#113

url: fix parsing of paths with Windows drive letter

test: update WHATWG URL test fixtures

PR-URL: #10317
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
ryanhaddad pushed a commit to WebKit/WebKit that referenced this issue Dec 22, 2020
…s in fragment

https://bugs.webkit.org/show_bug.cgi?id=163287

Reviewed by Brady Eidson.

Source/WebCore:

Based on discussion in whatwg/url#150
If that discussion decides to keep the spec as-is (which keeps non-ASCII characters in the fragment
to match IE and Edge's behavior, which Chrome has followed for special schemes) then we can revert
this change later after enabling the URL parser.  Making this change keeps behavior matching Safari
and Firefox, as well as Chrome's handling of non-special schemes, such as data URLs.

Covered by updated API tests.

* platform/URLParser.cpp:
(WebCore::URLParser::appendToASCIIBuffer):
(WebCore::URLParser::copyURLPartsUntil):
(WebCore::URLParser::syntaxViolation):
(WebCore::URLParser::currentPosition):
(WebCore::URLParser::parse):
(WebCore::URLParser::fragmentSyntaxViolation): Deleted.
* platform/URLParser.h:
No more non-ASCII characters in canonicalized URLs.

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):



Canonical link: https://commits.webkit.org/181119@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@207152 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

7 participants