2021-04-29 ALTO Board Meeting Minutes

  1. Welcome [All]
  2. Find and tell a non-offensive, maybe self-deprecating joke before the meeting begins and/or after it ends. [All]
  3. Review recent schema issues:
  4. Preliminary findings from paper for the 2021 ICDAR conference on inconsistencies in accuracy use between characters and words. [Clemens]
  5. Board membership:
    • Candidate nomination from KB
    • Chair position available
  6. Other business. [All]

Attending members

  • Art Rhyno
  • Clemens Neudecker
  • Frederick Zarndt
  • Hany A. Abdellatif
  • Jean-Philippe Moreux
  • Nate Trail
  • Raju Buddharaju
  • Sébastien Cretin
  • Stefan Pletschacher

Minutes

wrt agenda item 3. Review recent schema issues:

wrt agenda item 4. Preliminary findings from paper on inconsistencies in accuracy use between characters and words.

Clemens gave a sneak preview of the work that his team has been doing for the HIP workshop, which is co-located with the ICDAR conference this year in Lausanne, Switzerland. He reminded the Board that there are still two weeks to submit a paper to HIP, and that there are plans for possible onsite participation. Clemens described the many challenges of evaluating OCR accuracy. There are too many documents to rely solely on ground truth, but sampling techniques tend to ignore layout analysis. For users interested in using OCR documents for advanced processing, such as natural language processing or distant reading, it is essential to retain the correct sequence of paragraphs, columns, etc. Most of the common metrics do not address this.

The team compared many of the common tools, including those from IMPACT, the PRImA evaluation tools and the dinglehopper tool from the Qurator project. The comparisons were carried out on two datasets: one consisted of historical books and the other of newspapers. Even for the same measure, for example word error rate, the tools produced results that differed substantially. This could be due to implementation details or a lack of specification, for example, how to treat ligatures or how to combine code points. Clemens also noted differences in alignment between the ground truth and the OCR results, and the sometimes inconsistent treatment of different languages by the same tool.
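To illustrate how much an unspecified detail like ligature or code point handling can move a score, here is a minimal Python sketch. It is not taken from the paper and does not reproduce how IMPACT, PRImA or dinglehopper compute their metrics. The function names (`levenshtein`, `error_rate`) and the sample strings are hypothetical, and the error rate is assumed to be the plain edit distance divided by the length of the ground truth.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences; insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(gt, ocr, unit="char", form=None):
    """Character or word error rate of `ocr` against ground truth `gt`.

    `form` is an optional Unicode normalization form ("NFC", "NFKC", ...).
    Assumption: rate = edit distance / ground-truth length.
    """
    if form is not None:
        gt = unicodedata.normalize(form, gt)
        ocr = unicodedata.normalize(form, ocr)
    if unit == "word":
        gt_seq, ocr_seq = gt.split(), ocr.split()
    else:
        gt_seq, ocr_seq = list(gt), list(ocr)
    if not gt_seq:
        raise ValueError("ground truth must not be empty")
    return levenshtein(gt_seq, ocr_seq) / len(gt_seq)


# Hypothetical sample: the ground truth uses the "fi" ligature (U+FB01) and a
# precomposed e-acute (U+00E9); the OCR output uses plain "f" + "i" and
# "e" followed by a combining acute accent (U+0301).
gt = "\ufb01ne caf\u00e9"
ocr = "fine cafe\u0301"

cer_raw = error_rate(gt, ocr)                # no normalization
cer_nfc = error_rate(gt, ocr, form="NFC")    # composes the accent, keeps the ligature
cer_nfkc = error_rate(gt, ocr, form="NFKC")  # also expands the ligature
wer_raw = error_rate(gt, ocr, unit="word")
wer_nfkc = error_rate(gt, ocr, unit="word", form="NFKC")
```

On this sample the two strings look identical to a reader, yet the character error rate is 0.5 without normalization, 0.25 under NFC and 0.0 under NFKC, while the word error rate drops from 1.0 to 0.0. Two tools that silently make different choices here would report very different accuracy for the same OCR output.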

ALTO could have a role in clarifying some of the varied aspects of OCR, such as code points for characters and the use of spaces. Consistent metrics for layout analysis are also needed. PRImA, in particular, has done good work on this, but there is still a lack of standardization that needs to be addressed. Clemens hopes the paper will be finished soon and can share a copy when it is completed.

Art asked about the extent to which OCR quality impacts downstream analysis in areas such as named entity recognition. Clemens suggested that text analysis projects usually need an evaluation phase to help measure the appropriateness of the OCR for the task. Kudos were given to the IMPACT Historical Document Images dataset and the ENP Image and Ground Truth Dataset of Historical Newspapers, which include reading order. Stefan noted the difficulty of using OCR metrics with funding agencies and other decision-makers, who want to see a single number for use in evaluation. Clemens echoed this point, noting that applications for funding sometimes require an internal OCR accuracy rate to be provided. Frederick asked how the results of this work could be used in the future. The importance of providing the underlying data and the ability to reproduce results was emphasised, and the identification of areas for further research is a key direction, for example, defining better metrics within OCR projects. Clemens flagged the work by David Smith and others in the Research Agenda for Historical and Multilingual OCR report, which includes a recommendation for better measures to reflect use cases from the DH community. Clemens also underlined the need for cultural heritage organizations to provide transparent information on the quality of the OCR results used in services and projects, so that users can make informed decisions.

Frederick raised the use of accuracy metrics for open and commercial OCR engines: would it be possible for OCR systems and services to use common metrics, so that OCR options can be compared more easily when selecting one? There was general agreement that better metrics would help everyone involved in OCR provision and use. Frederick described a use case involving Ancient Greek, where the challenges of provisioning OCR solutions for a non-mainstream language might also provide a sandbox for defining accuracy.

wrt agenda item 5. Board membership

Evelien has identified someone at the Koninklijke Bibliotheek who could step into her spot on the Board, and Art will send out a CV for the candidate. Art also reminded the Board that the Chair position will be available in January 2022, and encouraged anyone interested to contact him.

wrt agenda item 6. Other business:

Clemens provided a link to a background document to give some sense of the type of OCR metrics used in Germany for funding. Ashok and others expressed support for the prospect of meeting face-to-face at ICDAR, and Stefan noted that conference funding and procedures may take some time to stabilize as conferences start to offer in-person registration again.

The next Board meeting will likely be held in June.