[WIP] glutton_fatcat consolidation support #508
Draft: bnewbold wants to merge 1,775 commits into kermitt2:master from bnewbold:fatcat
update lexicon
* Use numerical mapping when ocr is not activated.
Improvement in evaluation framework
Avoid duplicated body part in the abstract
Improved dehyphenisation
fatcat_ident, wikidata_qid. also pass arxiv_id around in more places for consistency.
This changes link order to:
- arxiv.org: always available/reliable
- web link in URL: could be better than DOI-based match (eg, if website)
- fatcat.wiki: should be a superset of other OA links, and more reliable/stable
- unpaywall OA link: better than doi, though links not as stable over time
- doi.org: fallback
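The link order above can be read as a simple first-match priority list. The following is a minimal sketch of that idea, not the code from this PR: the function name `pick_link`, its parameter names, and the fatcat.wiki URL pattern are assumptions made for illustration, and "web link in URL" is interpreted here as a link already present in the citation.

```python
from typing import Optional


def pick_link(
    arxiv_id: Optional[str] = None,
    web_url: Optional[str] = None,
    fatcat_ident: Optional[str] = None,
    oa_url: Optional[str] = None,
    doi: Optional[str] = None,
) -> Optional[str]:
    """Return the first available link, following the priority in the commit message.

    All names here are hypothetical; GROBID's actual fields may differ.
    """
    candidates = [
        # 1. arxiv.org: always available/reliable
        f"https://arxiv.org/abs/{arxiv_id}" if arxiv_id else None,
        # 2. web link found in the citation itself
        web_url or None,
        # 3. fatcat.wiki landing page (URL pattern is an assumption)
        f"https://fatcat.wiki/release/{fatcat_ident}" if fatcat_ident else None,
        # 4. unpaywall OA link
        oa_url or None,
        # 5. doi.org: fallback
        f"https://doi.org/{doi}" if doi else None,
    ]
    # First non-empty candidate wins; None if nothing is known.
    return next((c for c in candidates if c), None)
```

With this ordering, a reference that has both an arXiv identifier and a DOI resolves to the arxiv.org link, and the doi.org link is only used when nothing else is known.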
Unless in code review we decide to actually rename this variable for legibility.
This commit *should not* be merged into master!
This PR probably needs a bit of work, but I'm posting it here for an early review of whether I have taken the correct approach.

These patches add a new consolidation option: `glutton_fatcat`. This calls a patched version of biblio-glutton which returns metadata in the fatcat 'release' schema (instead of the Crossref API schema). The motivation is to support additional works which may not have Crossref DOIs (eg, some JALC or Datacite DOIs not in the Crossref API bulk corpus, or things like arxiv papers with no DOIs at all). The motivation to upstream these changes here is to avoid having to maintain a separate patchset, and to also include some small improvements.

In addition to the consolidation option, these patches include better support for the `rawName` attribute, some extra setters/getters for identifiers (and support for Wikidata QIDs), and a change in how URLs are output.

This entire code path can be publicly tested at: https://grobid.qa.fatcat.wiki

The `glutton_fatcat` biblio-glutton code can be browsed at https://github.com/bnewbold/biblio-glutton, on the branch 'fatcat'. See long comment thread in that repo: kermitt2/biblio-glutton#33. My changes to biblio-glutton will probably be harder to merge upstream, but I'm happy to try. I don't think the API (between GROBID/biblio-glutton) would need to change significantly, so these GROBID-side patches should work fine.
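To illustrate the identifier handling described above, here is a rough sketch of pulling identifiers out of a fatcat-style 'release' record. It is not taken from this PR: the helper name `extract_identifiers` is hypothetical, and the field names (`ident`, and `ext_ids` with `doi`, `arxiv`, `wikidata_qid`) are assumptions about the release schema that should be checked against the fatcat API documentation.

```python
from typing import Any, Dict, Optional


def extract_identifiers(release: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Collect the identifiers mentioned in this PR from a release-like dict.

    Field names are assumed, not confirmed by the PR text.
    """
    # Tolerate a missing or null ext_ids object.
    ext_ids = release.get("ext_ids") or {}
    return {
        "fatcat_ident": release.get("ident"),
        "doi": ext_ids.get("doi"),
        "arxiv_id": ext_ids.get("arxiv"),
        "wikidata_qid": ext_ids.get("wikidata_qid"),
    }


# Example with a made-up record: an arXiv paper that has no DOI.
record = {"ident": "example-ident", "ext_ids": {"arxiv": "1234.5678v1"}}
ids = extract_identifiers(record)
```

A record without a DOI still yields a usable `arxiv_id` and `fatcat_ident`, which is the case the Crossref-only consolidation path cannot cover.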