Add meta stats fields #290

Open
tokee opened this issue May 30, 2022 · 0 comments

Comments

@tokee
Collaborator

tokee commented May 30, 2022

Running a web archive is often about managing scale, and about learning from experience when building the next iteration. Related to #205, which provides statistics aimed at quantitative analyses of content, we could use some index metrics:

  • doc_term_count: The total number of terms across all fields in the document (copyFields in Solr might increase this a bit)
  • doc_term_chars: The total number of characters across all terms in all fields in the document (this will only be an approximation due to number fields)

This would help locate "large documents" and subsequently make informed adjustments to the field limits in the config file for the next full index.

Technically it would be simple to implement as a post-analysis hook that iterates over warc-indexer's Solr Document representation.
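A rough sketch of what such a hook could compute, written in Python for brevity rather than against warc-indexer's actual classes. Everything here is an assumption for illustration: the document is modelled as a plain dict of field name to value (or list of values), the names `term_stats` and `add_meta_stats` are hypothetical, and terms are approximated by splitting on whitespace, which will not match whatever analyzers the Solr schema applies.

```python
from typing import Any, Dict, Tuple


def _as_values(value: Any) -> list:
    """Normalise a single-valued or multi-valued field to a list of values."""
    if isinstance(value, (list, tuple, set)):
        return list(value)
    return [value]


def term_stats(doc: Dict[str, Any]) -> Tuple[int, int]:
    """Return (term count, term character count) over all fields in doc.

    Assumption: terms are whitespace-separated tokens of the string form of
    each value. Number fields are counted via their string form, so the
    character count is only an approximation.
    """
    term_count = 0
    term_chars = 0
    for value in doc.values():
        for item in _as_values(value):
            if item is None:
                continue
            terms = str(item).split()
            term_count += len(terms)
            term_chars += sum(len(term) for term in terms)
    return term_count, term_chars


def add_meta_stats(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical post-analysis hook: add the two proposed fields to doc."""
    # Compute the stats before adding the new fields so they do not count themselves.
    count, chars = term_stats(doc)
    doc["doc_term_count"] = count
    doc["doc_term_chars"] = chars
    return doc
```

Calling `add_meta_stats` on a document as the last step before it is sent to Solr would store both values alongside the other fields. Counting happens before the two new fields are written, so they are not included in their own totals.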
