[WIP] glutton_fatcat consolidation support #508

Draft
wants to merge 1,775 commits into
base: master

bnewbold (Contributor) commented Oct 4, 2019

This PR probably needs a bit of work, but I'm posting it here for an early review of whether I have taken the correct approach.

These patches add a new consolidation option: glutton_fatcat. This calls a patched version of biblio-glutton which returns metadata in the fatcat 'release' schema (instead of the Crossref API schema). The motivation is to support additional works which may not have Crossref DOIs (e.g., some JALC or DataCite DOIs not in the Crossref API bulk corpus, or things like arXiv papers with no DOIs at all). Upstreaming these changes here would avoid having to maintain a separate patchset, and also brings in some small improvements.

In addition to the consolidation option, these patches include better support for the rawName attribute, some extra setters/getters for identifiers (and support for Wikidata QIDs), and a change in how URLs are output.

This entire code path can be publicly tested at: https://grobid.qa.fatcat.wiki

The glutton_fatcat biblio-glutton code can be browsed at https://github.com/bnewbold/biblio-glutton, on the branch 'fatcat'. See the long comment thread in that repo: kermitt2/biblio-glutton#33. My changes to biblio-glutton will probably be harder to merge upstream, but I'm happy to try. I don't think the API between GROBID and biblio-glutton would need to change significantly, so these GROBID-side patches should work fine.

kermitt2 and others added 30 commits February 12, 2019 06:24
* Use numerical mapping when ocr is not activated.
kermitt2 and others added 27 commits September 11, 2019 10:25
fatcat_ident, wikidata_qid. Also pass arxiv_id around in more places for
consistency.

This changes link order to:

- arxiv.org: always available/reliable
- web link in URL: could be better than DOI-based match (eg, if website)
- fatcat.wiki: should be a superset of other OA links, and more reliable/stable
- unpaywall OA link: better than doi, though links not as stable over time
- doi.org: fallback
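
The link order above amounts to a simple first-match priority list. A minimal sketch of that idea follows, in Python for brevity rather than GROBID's Java; the source keys (`arxiv`, `web`, `fatcat`, `unpaywall`, `doi`), the `pick_link` helper, and the example URLs are all hypothetical names chosen for illustration, not identifiers from the GROBID code base.

```python
from typing import Dict, Optional

# Priority order as described in the commit message above.
# The keys are hypothetical labels for this sketch, not GROBID field names.
LINK_PRIORITY = ["arxiv", "web", "fatcat", "unpaywall", "doi"]


def pick_link(candidates: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the first non-empty URL following LINK_PRIORITY, or None."""
    for source in LINK_PRIORITY:
        url = candidates.get(source)
        if url:  # skips both missing keys and empty strings
            return url
    return None


# Illustrative inputs only; these URLs are placeholders.
example = {
    "doi": "https://doi.org/10.1234/example",
    "fatcat": "https://fatcat.wiki/release/example",
}
chosen = pick_link(example)  # the fatcat link outranks the DOI fallback
```

Keeping the order in a single list means a later reordering only touches one line, assuming each source is already resolved to a URL string before selection.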

Unless in code review we decide to actually rename this variable for
legibility.

This commit *should not* be merged into master!

coveralls commented Oct 4, 2019

Coverage decreased (-0.1%) to 37.19% when pulling 3969c87 on bnewbold:fatcat into 472324a on kermitt2:master.

lfoppiano marked this pull request as draft on November 20, 2024, 16:38