Broken web links #320
16 comments · 17 replies
-
The issue cf-convention/cf-convention.github.io#493 is an example of an issue with a broken-link report that the checker cron job has opened and then updated with new comments. When the action (i.e. the cron job) is re-run manually, "new" errors appear and others disappear. The issue has been updated with the report for this "manual" check. IMO, there are two pending actions that we need to discuss:
It might be useful to add this to the agenda of the next Information Management Team meeting @cf-convention/info-mgmt
-
Dear Antonio @cofinoa, I agree that this issue would be good to discuss at the information management team meeting. Meanwhile, thanks for the work you've already done to improve it. I like the more informative report. Here are some thoughts:
Best wishes, Jonathan
-
Hi Antonio @cofinoa and Jonathan @JonathanGregory, I agree it would be good to discuss this at the next CF info-mgmt team meeting. I will add it to the agenda for that meeting.
-
In this week's link checker output, with the more tolerant timeout thanks to Antonio, there is only one broken link, namely http://coastwatch.pfeg.noaa.gov/erddap/convert/units.html. It's working now, but I've noticed that we quite often have problems with it. It's referred to in the FAQ, thus:
It's potentially useful information, so it seems a shame to remove it, but since it persistently goes wrong, I suggest we could deactivate it as a link, e.g.:
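A minimal sketch of what deactivating the link could look like in the FAQ's Markdown (the surrounding wording is hypothetical; the point is that the URL stays visible as plain text rather than as a clickable link). Note that the checker may still extract bare URLs from plain text, in which case excluding the URL in the checker config remains the fallback:

```markdown
<!-- Before: a live link that the checker follows -->
You can convert units with the
[ERDDAP units converter](https://coastwatch.pfeg.noaa.gov/erddap/convert/units.html).

<!-- After: the URL is shown as inline code, so readers can still copy it,
     but it is no longer rendered as a clickable link -->
You can convert units with the ERDDAP units converter at
`https://coastwatch.pfeg.noaa.gov/erddap/convert/units.html`.
```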
Is that a reasonable approach for this and other similar problems detected by the link checker? Antonio @cofinoa, why does the link checker append two reports to the issue, not just one?
-
Regarding the ERDDAP site: I have several times checked shortly after receiving the link checker report, and it always works without problems. Could it be that the site blocks web spiders or something similar? Further down that web page there is an alternative way of using the service from within software tools. Would it be possible to keep the link as is in the document, but have the link checker use one of the provided examples (e.g. this) to test availability?
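If lychee's `remap` option were used for this (an assumption on my part: `remap` rewrites a URL that matches a pattern to a different URL before checking, so the document can keep the human-readable link), a sketch in `.lychee/config.toml` might look like the following; the machine-friendly request URL is a hypothetical placeholder rather than one of the page's actual examples:

```toml
# Sketch only: the document keeps linking to the HTML page, but the
# checker requests a machine-friendly endpoint instead.
# Each entry is "<pattern> <replacement URL>".
remap = [
  "https://coastwatch.pfeg.noaa.gov/erddap/convert/units.html https://coastwatch.pfeg.noaa.gov/erddap/convert/units.txt?UDUNITS=degree_C",
]
```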
-
Antonio @cofinoa, would it be possible to suppress the output completely when there are no timeouts, errors or "unknown"s? That would be helpful because there would be no addition to the issue when nothing is wrong, and we would not be alerted needlessly.
-
@cofinoa @larsbarring Can someone send me information on which ERDDAP site is being accessed, the request being made, and the IP that it is coming from? Also when it failed (date and time if available) so I can look in the appropriate log files.
-
@JonathanGregory I will look, but first, can you change the URL to https? It would still help if I have the IP; we have gotten much more aggressive in blocking, and particularly if you are coming from a data center (Digital Ocean, AWS, Google, Microsoft) there is a chance you have been blocked because of the actions of others. These days about 80%-90% of our "bad actors" are coming from data centers.
-
And right now it dawned on me that we should check all problematic links to see if they are plain http links that should be changed to https.
-
@JonathanGregory In our case the timeout was because the service was rebooting; you just happened to hit that window. But these days different sites have different policies about http, depending on IT. Some may refuse you; some, like us, redirect you. Usually something is put in the header as well.
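As an aside, a quick way to see how any given site treats plain http (assuming curl is available) is to follow the redirects and look at the status lines and Location headers:

```sh
# -s silent, -I request headers only, -L follow redirects
curl -sIL http://coastwatch.pfeg.noaa.gov/erddap/convert/units.html \
  | grep -iE '^(HTTP|location)'
```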
-
@cf-convention/info-mgmt I have updated the workflow for the link check cron job; you can find more details in the workflow itself. As agreed, if the link check is successful, it will not be reported, regardless of the status of the issue, and the issue should be closed manually.
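A minimal sketch of the "report only on failure" wiring in a GitHub Actions job (the step names, the lychee-action version, and the reporting step are my assumptions; the authoritative logic is in the workflow file itself):

```yaml
# Sketch only: run the checker, then touch the tracking issue
# only if the link check step actually failed.
- name: Check links on the built site
  uses: lycheeverse/lychee-action@v1
  with:
    args: --config .lychee/config.toml ./_site

- name: Append the report to the tracking issue
  if: failure()   # skipped entirely when all links are fine
  run: echo "append the lychee report to the open issue here"
```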
-
Thanks, Antonio @cofinoa. We didn't hear from the link checker this morning, and nothing has been added to issue #493. No news is good news, I presume!
-
Dear Roy @rmendels and Antonio @cofinoa, at 0835 UTC today the CF link-checker again reported the ERDDAP units page as broken; NB it now has the https URL. It's good to see that the existing issue was automatically reopened according to plan, Antonio - thanks! Jonathan
-
https://coastwatch.pfeg.noaa.gov/erddap/convert/units.html didn't want to talk to CF again today. Unless there are other ways round this problem, I suggest that we should either deactivate the link (we can leave the URL in the text, but not as a link) or exclude it from the checker. What do you think, @cofinoa and @rmendels?
-
All -- has this all been concluded? If so, let's close this. If not, then maybe make an issue out of the remaining items?
-
I'm closing this discussion with the following summary of status.

Summary of Decisions:
Pending:
Note: lychee config file:

```toml
verbose = "error"
no_progress = true
timeout = 300 # maximum HTTP request timeout (default is 20 seconds)
max-retries = 10 # increase from default (3 retries)
retry-wait-time = 2 # Minimum wait time in seconds between retries of failed requests (default: 1)
accept = ["200", "429", "403"]
exclude = [
"cfeditor.ceda.ac.uk", # standard_name_rule, vocabularies, discussion
"https://mailman.cgd.ucar.edu/pipermail/cf-metadata", # discussion, governance
#BEGIN Data/cf-standard-names/
"http://glossary.ametsoc.org/wiki",
"https://www.unidata.ucar.edu/software/udunits/udunits-current/doc/udunits",
"https://www.unidata.ucar.edu/software/udunits/udunits-2.2.28/udunits2.html",
"https://www.sciencedirect.com/science/article/pii/0967063793901018",
"https://www.ipcc.ch/ipccreports/tar/wg1/273.htm",
"http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata",
"http://gcmd.nasa.gov/Resources/valids",
#END Data/cf-standard-names/
# "http://mmisw.org/ont", # faq (TIMEOUT)
# "https://mmisw.org/ont", # faq (TIMEOUT)
"http://www.cgd.ucar.edu/cms/eaton/cf-metadata/clivar_article.pdf", # Data/cf-documents/cf-governance/cf2_whitepaper_final.html
"http://www.cgd.ucar.edu/cms/eaton/cf-metadata/CF-current.html", # Data/cf-documents/requirements-recommendations
"https://www.usbr.gov/lc/socal/reports/SMappend_C.pdf", # Data/area-type-table/**/build/area-type-table.html
"https://cf-trac.llnl.gov/trac/", # 2018-Workshop, 2019-Workshop
"http://mailman.cgd.ucar.edu/pipermail/cf-metadata", # 2019-Workshop
"https://www.wonder.me", # 2021-Workshop
"https://figshare.com/account/articles/24633939", # 2023-Workshop
"https://figshare.com/account/articles/24633894", # 2023-Workshop
# "https://github.com/", # Uncomment if you hit GitHub Rate Limit: https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api
### QUARANTINE
"https://coastwatch.pfeg.noaa.gov/erddap/convert/units.html", # faq
"https://github.com/orgs/cf-convention/projects/1", # Meetings/2020-Workshop.md
"Data/cf-standard-names/current/build/kwic_index_for_cf_standard_names.html", # vocabularies (temporal issue with KWIC generator)
"Data/cf-standard-names/86/build/kwic_index_for_cf_standard_names.html", # vocabularies (temporal issue with KWIC generator)
###
]
exclude_path = [
# Jekyll post build directory (i.e. _site)
"_site/Data/cf-standard-names/docs/guidelines.html",
"_site/Data/cf-conventions/",
"_site/Data/Trac-tickets/",
"_site/GDT/", # some HTML docs are invalid input encoded, choking the link checker
"_site/CF-beta/", # some HTML docs are invalid input encoded, choking the link checker
]
```
-
Topic for discussion
Moving the discussion about broken web links from issue cf-convention/cf-convention.github.io#318.
Two workflows/actions have been created to check links:

- `check_jekyll_build.yml`: an action with two main jobs, triggered when a PR is created:
  - A. check links in Markdown files (`./**/*.md`)
  - B. check that Jekyll can build the website
- `check_links_cron.yml`: an action that runs on Mondays (see the trigger sketch after this list), also with two main jobs:
  - C. check that Jekyll can build the website
  - D. check links on the site built in job C; if that check fails, a new issue is opened: #490
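For orientation, the Monday schedule in `check_links_cron.yml` presumably uses a cron trigger along these lines (the exact time of day is an assumption):

```yaml
# Sketch of the weekly trigger; in cron syntax, day-of-week 1 = Monday.
on:
  schedule:
    - cron: "0 8 * * 1"
  workflow_dispatch: {}   # optional: allows manual runs from the Actions tab
```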
The exclusion rules are in `.lychee/config.toml`, which is used by both actions, but we can create different configs for each action if needed. Currently, the following URLs are being excluded:
Some of the excluded URLs are spurious broken links, which are only temporarily broken. Others are permanently broken, and we need to decide what to do about them [1]. Also, some paths are excluded, mainly because they contain documents with invalid encoding or many broken relative links (e.g. Trac-tickets):
[1] For example, for `https://www.ipcc.ch/ipccreports/tar/wg1/273.htm` we could link to a capture from the Wayback Machine: https://web.archive.org/web/20181104000136/http://www.ipcc.ch/ipccreports/tar/wg1/273.htm
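A sketch of that substitution in the page's Markdown (the link text is hypothetical):

```markdown
<!-- Before: the dead IPCC link -->
[IPCC TAR WG1, Appendix](https://www.ipcc.ch/ipccreports/tar/wg1/273.htm)

<!-- After: the same page via its Wayback Machine capture -->
[IPCC TAR WG1, Appendix](https://web.archive.org/web/20181104000136/http://www.ipcc.ch/ipccreports/tar/wg1/273.htm)
```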