Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different URLs that contains “www” in the host name are incorrectly canonicalized. #28

Open
kritikagarg opened this issue Jun 27, 2022 · 0 comments · May be fixed by #31
Open

Different URLs that contains “www” in the host name are incorrectly canonicalized. #28

kritikagarg opened this issue Jun 27, 2022 · 0 comments · May be fixed by #31

Comments

@kritikagarg
Copy link

The following URLs are incorrectly canonicalized with SURT as "com)/".

SURT = "com)/"
1. https://www1355544.com/
2. https://www3288.com/
3. https://www504778.com/
4. https://www556798.com/
5. https://www57912.com/

Problem is that 'www\d*.' is removed from the URL, which leaves "com)/" as the SURT.

_RE_WWWDIGITS = re.compile(b'www\d*\.')

the other instance where the URLs were correctly canonicalized by removing 'www\d*.'

SURT = "jp,co,daily)/"
1. https://daily.co.jp/
2. https://www4.daily.co.jp/
3. https://www8.daily.co.jp/
4. https://www9.daily.co.jp/

We proposed to update the regex where it should only remove the 'www\d*.' if at least two dots are present in the hostname.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant