
Keep non-UTF-8 encoded URLs (Python 3) #19

Open · sebastian-nagel opened this issue Jan 19, 2017 · 6 comments · May be fixed by #30

Comments

@sebastian-nagel

Surt under Python 2.x keeps URLs with non-ASCII characters in the percent-encoded path intact (the hex letters of the escape are lowercased):

http://onlinestreet.de/strassen/in-M%FCnchen.html

de,onlinestreet)/strassen/in-m%fcnchen.html

Under Python 3.x the latin-1-encoded character is substituted by the Unicode replacement character, both in the SURT and in the canonicalized URL:

de,onlinestreet)/strassen/in-m%ef%bf%bdnchen.html
http://onlinestreet.de/strassen/in-m%ef%bf%bdnchen.html
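
A minimal standard-library sketch of where the replacement character comes from under Python 3 (this shows the general Python 3 percent-decoding behaviour, not necessarily the exact code path inside surt):

```python
from urllib.parse import quote, unquote

# Python 3's unquote() decodes percent-escapes as UTF-8 with errors='replace',
# so the Latin-1 byte 0xFC ("ü") turns into U+FFFD, the replacement character.
path = unquote("in-M%FCnchen.html")   # -> 'in-M\ufffdnchen.html'

# Re-quoting the lowercased string yields the %ef%bf%bd sequence seen above
# (quote() emits uppercase hex; surt lowercases the escapes).
print(quote(path.lower()))            # -> 'in-m%EF%BF%BDnchen.html'
```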

Since UTF-8 as the character encoding was introduced only in 2005 by RFC 3986, there may still be many URLs that use a different encoding.

2d4bde5 adds a test to catch this problem.

@wumpus commented May 18, 2017

SURT is a one-way transformation... do you think that there will be websites that have BOTH the latin-1 and utf-8 version of the url, with different content?

@sebastian-nagel (Author)

Sure, it's a one-way transformation. But the result should be the same, independent of the implementation (Java vs. Python) and the runtime (Python 2 vs. 3); otherwise the CDXs are not portable and require that exactly the same software and runtime is used both for writing the CDX and for reading it (look-up). I've tried to solve a similar issue for Java (commoncrawl/ia-web-commons#6) without finding an easy solution (moving from String to char[] would change everything). Hopefully, over time, all websites will follow RFC 3986.
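
To illustrate the portability concern with a toy example (hypothetical keys, not real CDX records): a sorted index written with one implementation's SURTs cannot be searched with keys produced by the other.

```python
import bisect

# Index written with the Python 2 behaviour (original %fc byte kept):
index = sorted([
    "de,onlinestreet)/strassen/in-m%fcnchen.html",
    "de,onlinestreet)/strassen/other.html",
])

# A reader on Python 3 computes a different key for the same URL,
# so the binary-search look-up misses the existing record:
query = "de,onlinestreet)/strassen/in-m%ef%bf%bdnchen.html"
pos = bisect.bisect_left(index, query)
print(pos < len(index) and index[pos] == query)  # False
```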

@wumpus commented May 19, 2017

Thank you, that makes more sense!

In which case, may I suggest that the result (whether the input is %-encoded utf-8 bytes or %-encoded latin-1 bytes) ought to be %-encoded utf-8 bytes.

i.e. I suggest m%ef%bf%bdnchen.html in the surt either way.

That's different from what your suggested test is specifying.

The benefit of having the surt be the same is that I suspect any website which used to have latin-1 encoding and changed to utf-8 will show the same content for both (maybe via a redirect from the old URL to the new one). The risk is that there might be different content, or that a crawler might become confused by the redirect to the same surt() and only crawl the redirect and not the content; that is always a risk for the crawler in other situations too.

Basically, I'm pointing out that there are crawl policies implied by the choice of surt algorithm. If two URLs surt differently, surt-using crawlers will be willing to crawl them both.

Maybe I'm thinking too hard :-)

@sebastian-nagel (Author)

> i.e. I suggest m%ef%bf%bdnchen.html in the surt either way.

Really? Unescaped that's "m�nchen.html". I guess you mean "m%c3%bcnchen.html", i.e. "münchen.html". That would be the ideal solution, of course, regarding look-up and to avoid duplicates. The problem is that the encoding (if it is not UTF-8) is not necessarily latin-1; it can be any encoding, depending on language and country.
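
A rough sketch of what such a normalization could look like, assuming a best-effort charset guess (try UTF-8 first, fall back to latin-1); this is not surt's actual behaviour, and the fallback is exactly the weak spot, because the true encoding may be something else entirely:

```python
from urllib.parse import quote, unquote_to_bytes

def normalize_path(path, fallback="latin-1"):
    """Re-encode percent-escaped path bytes as UTF-8 escapes and lowercase the result.

    The fallback charset is only a guess; a path that really uses some
    other legacy encoding would be transcoded incorrectly.
    """
    raw = unquote_to_bytes(path)       # percent-escapes -> raw bytes
    try:
        text = raw.decode("utf-8")     # already valid UTF-8: keep it
    except UnicodeDecodeError:
        text = raw.decode(fallback)    # guess latin-1 for everything else
    return quote(text.encode("utf-8")).lower()

print(normalize_path("/strassen/in-M%FCnchen.html"))     # /strassen/in-m%c3%bcnchen.html
print(normalize_path("/strassen/in-M%C3%BCnchen.html"))  # the same key
```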

I've extracted a list of URLs with non-utf8 percent-encoding from Common Crawl's July 2017 crawl: only 500,000 URLs, or 0.02% of the entire crawl. Many of the sites also accept utf-8-encoded URLs; however, their internal links are non-utf8, which is how these URLs get into the crawl.

@wumpus commented Aug 17, 2017

Sorry, I did mean %c3%bc!

Thanks for counting the URLs affected by this issue. Even with the English focus of CC, it's hard to imagine that failing to dedup non-UTF-8 URLs properly will be a significant issue.

@tfmorris

@sebastian-nagel mentioned on the common-crawl mailing list that this may have been resolved by 6b8e656.
