Running a web archive is often about managing scale, and about learning from experience when building the next iteration. Related to #205, which provides statistics aimed at quantitative analyses of content, we could use a couple of index metrics:
doc_term_count: The total number of terms in all the fields in the document (copyFields in Solr might increase this a bit)
doc_term_chars: The total number of characters in all the terms in all the fields in the document (this will only be an approximation due to number fields)
This would help locate "large documents" and subsequently make qualified adjustments to the field limits in the config file for the next full index.
Technically it would be simple to implement, as a post-analysis hook that iterates over warc-indexer's Solr document representation.
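A rough sketch of what such a hook could look like, assuming whitespace tokenization as the term approximation. Note that this is purely illustrative: a plain Map stands in for warc-indexer's actual Solr document representation, and the class and method names are made up for the example.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the proposed post-analysis hook. A plain Map stands
// in for warc-indexer's Solr document representation; a real implementation
// would iterate over the SolrInputDocument fields instead.
public class DocTermMetrics {

    /**
     * Approximates the two proposed metrics by splitting every field value on
     * whitespace: index 0 is doc_term_count, index 1 is doc_term_chars.
     */
    public static long[] computeMetrics(Map<String, List<String>> doc) {
        long termCount = 0;
        long termChars = 0;
        for (List<String> values : doc.values()) {
            for (String value : values) {
                for (String term : value.split("\\s+")) {
                    if (!term.isEmpty()) {
                        termCount++;
                        termChars += term.length();
                    }
                }
            }
        }
        return new long[] { termCount, termChars };
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = Map.of(
                "title", List.of("Example page"),
                "content", List.of("Some extracted body text"));
        long[] metrics = computeMetrics(doc);
        // In the real hook these would be written back into the document
        // as the doc_term_count / doc_term_chars fields.
        System.out.println("doc_term_count=" + metrics[0]);  // 6
        System.out.println("doc_term_chars=" + metrics[1]);  // 32
    }
}
```

Running the hook after analysis (rather than during) keeps it independent of field-level tokenizers, at the cost of the approximation already noted for non-text fields.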