Keep non-UTF-8 encoded URLs (Python 3) #19
SURT is a one-way transformation... do you think there will be websites that have BOTH the latin-1 and the utf-8 version of the URL, with different content?
Sure, it's a one-way transformation. But the result should be the same regardless of implementation (Java vs. Python) or runtime (Python 2 vs. 3); otherwise the CDXs are not portable, and exactly the same software and runtime is required both for writing the CDX and for reading it (look-up). I've tried to solve a similar issue for Java (commoncrawl/ia-web-commons#6) without finding an easy solution (moving from String to char[] would change everything). Hopefully, over time, all websites will follow RFC 3986.
Thank you, that makes more sense! In that case, may I suggest that the result -- whether the input is %-encoded utf-8 bytes or %-encoded latin-1 bytes -- ought to be percent-encoded utf-8 bytes, i.e. m%ef%bf%bdnchen.html in the surt either way. That's different from what your suggested test specifies. The benefit of having the surt be the same is that I suspect any website which used to use latin-1 encoding and changed to utf-8 will show the same content for both (maybe via a redirect from the old URL to the new one). The risk is that there might be different content, or that a crawler might become confused by a redirect to the same surt() and only crawl the redirect and not the content -- which is always a risk in other crawler situations, too. Basically, I'm pointing out that there are crawl policies implied by the choice of surt algorithm: if two URLs surt differently, surt-using crawlers will be willing to crawl them both. Maybe I'm thinking too hard :-)
Really? Unescaped it's "m�nchen.html". I guess you mean "m%c3%bcnchen.html", i.e. "münchen.html". That would be the ideal solution, of course, regarding look-up and to avoid duplicates. The problem is that the encoding (if not UTF-8) is not necessarily latin-1; it can be any encoding, dependent on language and country.
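For illustration of that ambiguity (this example is not from the original comment): the same raw byte behind a percent-escape decodes to different characters depending on which legacy encoding is assumed, so latin-1 is only one guess among many.

```python
# One raw byte, several plausible interpretations (the byte 0xE4 is chosen
# purely for illustration -- it sits behind "%E4" in a percent-encoded path).
raw = b"\xe4"

print(raw.decode("latin-1"))     # 'ä'  (Western European)
print(raw.decode("iso-8859-7"))  # 'δ'  (Greek)
print(raw.decode("cp1251"))      # 'д'  (Cyrillic, Windows)
print(raw.decode("koi8-r"))      # 'Д'  (Cyrillic, KOI8)

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")     # a lone 0xE4 is an invalid UTF-8 sequence
```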
I've extracted a list of URLs with non-UTF-8 percent-encoding from Common Crawl's July 2017 crawl: only 500,000 URLs, or 0.02% of the entire crawl. Many of the sites also accept UTF-8-encoded URLs; however, their internal links are non-UTF-8, which is how they get into the crawl.
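A minimal sketch of how such URLs could be detected (an assumption; the actual extraction code is not shown in the thread):

```python
from urllib.parse import unquote_to_bytes

def has_non_utf8_percent_encoding(url: str) -> bool:
    """True if the URL's percent-escaped bytes do not form valid UTF-8."""
    try:
        unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

has_non_utf8_percent_encoding("http://example.com/m%FCnchen.html")     # True
has_non_utf8_percent_encoding("http://example.com/m%C3%BCnchen.html")  # False
```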
Sorry, I did mean %c3%bc! Thanks for counting the URLs affected by this issue; even allowing for the English focus of CC, it's hard to imagine that failing to dedup non-UTF-8 URLs properly will be a significant issue.
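A minimal sketch of the normalization the two comments converge on (hypothetical; this is not what the library currently does, and latin-1 is only one assumed fallback encoding):

```python
from urllib.parse import quote, unquote_to_bytes

def normalize_path_to_utf8(path: str) -> str:
    """Re-encode a percent-encoded path as lowercase percent-encoded UTF-8."""
    raw = unquote_to_bytes(path)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # assumed fallback; the real encoding may differ
    return quote(text.encode("utf-8")).lower()

normalize_path_to_utf8("m%FCnchen.html")     # 'm%c3%bcnchen.html'
normalize_path_to_utf8("m%C3%BCnchen.html")  # 'm%c3%bcnchen.html'
```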
@sebastian-nagel mentioned on the common-crawl mailing list that this may have been resolved by 6b8e656.
Surt with Python 2.x keeps URLs with non-ASCII characters in the percent-encoded path intact:
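The original code example did not survive extraction; the following is a reconstruction from the discussion, with the example URL and exact output assumed:

```python
# Python 2.x
from surt import surt

surt('http://example.com/m%FCnchen.html')
# 'com,example)/m%fcnchen.html'
```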
(the letters in the hex escapes are lowercased)
With Python 3.x, the latin-1-encoded character is substituted by the Unicode replacement character (U+FFFD):
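Again a reconstruction under the same assumptions (the exact casing of the output is not preserved in the thread):

```python
# Python 3.x
from surt import surt

surt('http://example.com/m%FCnchen.html')
# roughly: 'com,example)/m%ef%bf%bdnchen.html'
```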
Since UTF-8 was only introduced as the character encoding for URIs by RFC 3986 in 2005, there may still be many URLs which use a different encoding.
2d4bde5 adds a test to catch this problem.
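A sketch of what such a regression test might look like (hypothetical; the actual test is in commit 2d4bde5, and the test name and expected string here are assumptions):

```python
from surt import surt

def test_non_utf8_percent_encoding_is_kept():
    # Expected: the %FC escape survives (lowercased) instead of being
    # replaced by the percent-encoded replacement character.
    assert surt('http://example.com/m%FCnchen.html') == 'com,example)/m%fcnchen.html'
```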