Keep non-UTF-8 encoded URLs (Python 3) #19

sebastian-nagel · 2017-01-19T14:56:43Z

Surt with Python 2.x keeps URLs with non-ASCII characters in the percent-encoded path intact:

http://onlinestreet.de/strassen/in-M%FCnchen.html

(letters in hex characters are lowercased)

de,onlinestreet)/strassen/in-m%fcnchen.html

With Python 3.x the latin-1-encoded character is substituted by the replacement character:

de,onlinestreet)/strassen/in-m%ef%bf%bdnchen.html
http://onlinestreet.de/strassen/in-m%ef%bf%bdnchen.html

Since UTF-8 as character encoding was introduced by RFC3986 in 2005, there may be still many URLs which use a different encoding.

2d4bde5 adds a test to catch this problem.

The text was updated successfully, but these errors were encountered:

wumpus · 2017-05-18T18:45:56Z

SURT is a one-way transformation... do you think that there will be websites that have BOTH the latin-1 and utf-8 version of the url, with different content?

sebastian-nagel · 2017-05-18T20:17:39Z

Sure, it's a one way transformation. But the result should be the same, independent from implementation (Java vs. Python) or runtime (Python 2 vs. 3), otherwise the CDXs are not portable and require that exactly the same software and runtime is used for both writing the cdx and reading (look-up). I've tried to solve a similar issue for Java (commoncrawl/ia-web-commons#6) without finding an easy solution (moving from String to char[] would change everything). Hopefully, over time all websites follow RFC3986.

wumpus · 2017-05-19T02:49:21Z

Thank you, that makes more sense!

In which case, may I suggest that the result -- whether the input is %-encoded utf8 bytes or %-encoded latin-1 bytes -- ought to be encoded utf8 bytes.

i.e. I suggest m%ef%bf%bdnchen.html in the surt either way.

That's different from what your suggested test is specifying.

The benefit of having the surt be the same is that I suspect any website which used to have latin-1 encoding and changed to utf8 will show the same content for both (maybe via redirect from old to new.) The risk is that there might be different content, or, a crawler might become confused by the redir to the same surt() and only crawl the redir and not the content. Which is always a risk for other situations in the crawler.

Basically, I'm pointing out that there's crawl policies implied by the choice of surt algorithm. If 2 urls surt differently, surt-using crawlers will be willing to crawl them both.

Maybe I'm thinking too hard :-)

sebastian-nagel · 2017-08-10T11:19:12Z

i.e. I suggest m%ef%bf%bdnchen.html in the surt either way.

Really? Unescaped it's "m�nchen.html". I guess you mean "m%c3%bcnchen.html" resp. "münchen.html". That would be the ideal solution, of course, regarding look-up and to avoid duplicates. The problem is that a the encoding (if not UTF-8) is not necessarily latin-1. It can be any encoding dependent on language and country, e.g.,

iso-8859-2: http://www.nerosty.cz/nerosty/m%EC%EF/3360_m%EC%EF (decoded http://www.nerosty.cz/nerosty/měď/3360_měď)
cp-1251: http://io.ua/c14/z/Osen/%CE%F1%E5%ED%FC (http://io.ua/c14/z/Osen/Осень)

I've extract a list of URLs with non-utf8 percent encoding from Common Crawl's July 2017 crawl: only 500,000 URLs or 0.02% of the entire crawl. Many of the sites also accept utf-8 encoded URLs. However, internal links are non-utf8. That's how they get into the crawl.

wumpus · 2017-08-17T18:25:01Z

Sorry, I did mean %c3%b !

Thanks for counting the urls affected by this issue, even with the English focus of CC it's hard to imagine that failing to dedup non-UTF8 properly will be a significant issue.

tfmorris · 2023-08-23T23:34:08Z

@sebastian-nagel mentioned on the common-crawl mailing list that this may have been resolved by 6b8e656

sebastian-nagel mentioned this issue Jan 19, 2017

WaybackURLKeyMaker to keep non-utf8 percent encodings commoncrawl/ia-web-commons#6

Open

tfmorris linked a pull request Aug 27, 2023 that will close this issue

Test whether non-UTF-8 encoded paths in URLs are left intact #30

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keep non-UTF-8 encoded URLs (Python 3) #19

Keep non-UTF-8 encoded URLs (Python 3) #19

sebastian-nagel commented Jan 19, 2017

wumpus commented May 18, 2017

sebastian-nagel commented May 18, 2017

wumpus commented May 19, 2017 •

edited

Loading

sebastian-nagel commented Aug 10, 2017

wumpus commented Aug 17, 2017

tfmorris commented Aug 23, 2023

Keep non-UTF-8 encoded URLs (Python 3) #19

Keep non-UTF-8 encoded URLs (Python 3) #19

Comments

sebastian-nagel commented Jan 19, 2017

wumpus commented May 18, 2017

sebastian-nagel commented May 18, 2017

wumpus commented May 19, 2017 • edited Loading

sebastian-nagel commented Aug 10, 2017

wumpus commented Aug 17, 2017

tfmorris commented Aug 23, 2023

wumpus commented May 19, 2017 •

edited

Loading