allow intermediate PAGE annotation for word segmentation ambiguity #72
Input from different OCR engines should be different PAGE-XML files in the "standard" format, where words are annotated, not glyphs and whitespace.

I'd suggest having a top-level mechanism to define this different "profile", since the semantics change if a PAGE consumer is supposed to expect whitespace in words. Maybe a keyword in the

To represent glyph alternatives indeed requires

Should this format handle line segmentation alignment?

Probably all elements should have an ID, to make it possible to re-map them to the "standard" format sources, or to reference them directly in addition to by

As for the requirements for representation, @finkf can probably offer the most informed opinion.
Could the PAGE Layout Evaluation XML format be helpful here? https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/layout-evaluation/schema/layouteval.xsd
I skimmed over it, but I did not see anything related to text. It seems to be a layout-evaluation schema. What did I miss?
Also just skimmed it. The mechanism for classifying errors in there could help make the alignment more explicit and less conventional. You are the experts: if it works for you and is documented, reusable, and standards-compliant, go for it :)
@kba Point 3 was about representing the output of multi-OCR alignment on the line level (not on the page level), so it concerns word segmentation (not line segmentation).

@finkf When we discussed this last week, your 2 ideas of how to do this with PAGE either lost information (Word level) or assumed we could reserve some characters (codepoints) for the empty symbol (TextLine level). So I came up with this. Do you think you can change cisocrgroup/cis-ocrd-py along those lines? (Or should I give it a try?) This proposal goes beyond merely representing alignment, though, cf. points 1 and 2. Remember we also want glyph alternatives from OCR (aligned or not), and still aspire to use PAGE to integrate the language model.

@kba As far as I understand it, there is no

After a glance at layouteval.xsd, I side with @finkf: this does not help here IMHO. It can certainly represent segmentation errors on all levels. But we want to bring together different segmentations, irrespective of which is 'right'. The same goes for error metric vs. confidence level.
Ping @splet @chris1010010 for further opinions.
@bertsky @kba @cneud @finkf @splet |
@tboenig We had a lively discussion about this one in Rostock. Personally, I do not like the one-word-per-line option, for aesthetic reasons. Assigning myself as a reminder to explore other options, namely ways to represent n-fold alignments within XML.
Some considerations which might help swing the decision between a specialised PAGE annotation (with deviating semantics) and a new customised XML format.

Pro PAGE:

Contra PAGE:

Point 2 goes both ways and needs to be cleared up first IMO.
@bertsky Could you please provide self-contained examples which may help us in developing a solution as discussed in Rostock (i.e. with another level of representation, effectively preventing the abuse of
Sorry to get back so late, but this problem seems to be a Gordian knot of sorts. Getting good real-life example data entails having some OCR which can already give these representations, which in turn entails taking action to extend the API of tesseract, ocropy or another promising candidate, for which of course a good illustrative example (and visualisation) would be invaluable. Before I put more effort into that, please consider the following proposal for an extension of PAGE for lattices – so that we get an adequate representation of alternative segmentations without cheating. Representing a graph in a structure-oriented XML schema like PAGE is impossible: it can only describe a tree. So one needs a pointer-oriented schema; GraphML is a popular example. For PAGE this means we should introduce:
And this should be possible on any level of granularity, depending on use-cases. So here it goes:
The semantics of this would be straightforward: the arcs begin and end on the respective nodes, all nodes must be connected, there must be no cycles, etc. Note that the lattice is a terminal level – when used on some hierarchy level, it replaces the normal option available on that level (which can go deeper).
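The constraints just stated can be checked mechanically. A minimal sketch (plain Python with illustrative data structures, not the proposed PAGE elements) that verifies arcs reference declared nodes, the graph is acyclic, and every node lies on some path from the begin node to the end node:

```python
def validate_lattice(nodes, arcs, begin, end):
    """nodes: set of node ids; arcs: list of (begin, end) pairs."""
    # 1. Arcs must begin and end on declared nodes.
    for b, e in arcs:
        if b not in nodes or e not in nodes:
            return False
    # 2. No cycles: a topological sort (Kahn's algorithm) must consume every node.
    indeg = {n: 0 for n in nodes}
    for _, e in arcs:
        indeg[e] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for b, e in arcs:
            if b == n:
                indeg[e] -= 1
                if indeg[e] == 0:
                    queue.append(e)
    if seen != len(nodes):
        return False
    # 3. Connectivity: every node must be reachable from `begin` going forward
    #    and from `end` going backward, i.e. lie on some begin-to-end path.
    def reachable(start, edges):
        stack, found = [start], {start}
        while stack:
            n = stack.pop()
            for b, e in edges:
                if b == n and e not in found:
                    found.add(e)
                    stack.append(e)
        return found
    fwd = reachable(begin, arcs)
    bwd = reachable(end, [(e, b) for b, e in arcs])
    return nodes == fwd & bwd

# Arc topology of a small example lattice with nodes 1..9:
nodes = {1, 2, 3, 4, 5, 6, 7, 8, 9}
arcs = [(1, 2), (1, 3), (3, 2), (2, 4), (4, 5),
        (5, 6), (4, 7), (7, 6), (6, 8), (8, 9)]
print(validate_lattice(nodes, arcs, 1, 9))             # True
print(validate_lattice(nodes, arcs + [(9, 1)], 1, 9))  # False: adds a cycle
```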
Sure! So here is what the above (artificial) example could look like:

<TextLine ...><Coords points=.../>
<TextSegmentLattice begin="1" end="9"> <!-- instead of a sequence of Word -->
<TextSegmentNode id="1"/>
<TextSegmentNode id="2"/>
<TextSegmentArc begin="1" end="2" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode>m</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentNode id="3"/>
<TextSegmentArc begin="1" end="3" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.75"><Unicode>n</Unicode></TextEquiv>
<TextEquiv index="1" conf="0.65"><Unicode>r</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentArc begin="3" end="2" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode>i</Unicode></TextEquiv>
<TextEquiv index="1" conf="0.6"><Unicode>r</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentNode id="4"/>
<TextSegmentArc begin="2" end="4" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode>y</Unicode></TextEquiv>
<TextEquiv index="1" conf="0.8"><Unicode>v</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentNode id="5"/> <!-- explicit space: -->
<TextSegmentArc begin="4" end="5" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode> </Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentNode id="6"/>
<TextSegmentArc begin="5" end="6" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode>p</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentNode id="7"/>
<TextSegmentArc begin="4" end="7" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.8"><Unicode> ,</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentArc begin="7" end="6" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode>o</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentNode id="8"/>
<TextSegmentArc begin="6" end="8" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode>a</Unicode></TextEquiv>
<TextEquiv index="1" conf="0.7"><Unicode>e</Unicode></TextEquiv>
</TextSegmentArc>
<TextSegmentNode id="9"/>
<TextSegmentArc begin="8" end="9" id=... language=...><Coords points=.../>
<TextEquiv index="0" conf="0.9"><Unicode>y</Unicode></TextEquiv>
</TextSegmentArc>
</TextSegmentLattice>
<TextEquiv><Unicode>my pay</Unicode></TextEquiv>
</TextLine>

Allow me to elaborate a little. By comparison, with the current purely hierarchical schema we had a tree of:
– with implicit white space (unless the old merge rules were to be re-activated). Whereas with the proposed extension we would additionally get:
– with explicit white space (in TextEquiv), and possibly empty arcs (e.g. to get a single final node).

Of course, it depends on the use case which granularity level is chosen throughout documents, or even mixed within single pages. But just what might the use cases be for this extension? As said in the opening statement, multi-OCR alignment and post-correction would benefit a lot. (Post-correction could be automatic, interactive, or a combination thereof, i.e. supervised adaptation.) They are not impossible without true lattices. But the current, tree-shaped representation – which by

What all those use cases have in common is that they need a representation for processing data (as opposed to GT data): they build upon the OCR search space, not the authoritative fulltext. This is of course not entirely new to PAGE, as TextEquiv lists already allow giving OCR alternatives. But it would be more rigorous with a lattice extension. BTW, contrary to what one might expect at first glance, this does not in fact worsen PAGE's consistency issue: as argued earlier, producing TextEquiv on multiple levels makes sense only for certain purposes:
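To illustrate how a consumer could decode such an annotation, here is a sketch (illustrative Python, not part of the proposal) of best-path extraction over the example lattice above by dynamic programming, taking only the top TextEquiv alternative of each arc:

```python
# Arcs of the example lattice: (begin_node, end_node, [(text, conf), ...]).
arcs = [
    (1, 2, [("m", 0.9)]),
    (1, 3, [("n", 0.75), ("r", 0.65)]),
    (3, 2, [("i", 0.9), ("r", 0.6)]),
    (2, 4, [("y", 0.9), ("v", 0.8)]),
    (4, 5, [(" ", 0.9)]),
    (5, 6, [("p", 0.9)]),
    (4, 7, [(" ,", 0.8)]),
    (7, 6, [("o", 0.9)]),
    (6, 8, [("a", 0.9), ("e", 0.7)]),
    (8, 9, [("y", 0.9)]),
]

def best_path(arcs, begin, end):
    # best[n] = (score, text) of the best partial path from `begin` to n
    best = {begin: (1.0, "")}
    order = [1, 3, 2, 4, 5, 7, 6, 8, 9]  # a topological order of the nodes
    for n in order:
        if n not in best:
            continue
        score, text = best[n]
        for b, e, equivs in arcs:
            if b != n:
                continue
            t, c = max(equivs, key=lambda tc: tc[1])  # top alternative only
            if e not in best or score * c > best[e][0]:
                best[e] = (score * c, text + t)
    return best[end]

score, text = best_path(arcs, 1, 9)
print(repr(text))       # 'my pay'
print(round(score, 6))  # 0.531441
```

The best path reproduces the line-level TextEquiv of the example ("my pay"), which illustrates the consistency argument: the line-level text can be derived from the lattice rather than annotated independently.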
I would like to add that there is a recent proposal in ALTO regarding segmentation ambiguity as well (albeit not lattice-based and only for the Glyph level).
I wonder if we should add options for custom extensions in PAGE (like in ALTO), where you can put arbitrary custom XML elements that are exempt from validation. See this from the ALTO schema:
Thanks, @chris1010010. It is, of course, your decision, but that option would also make it impossible to enforce the new element as a terminal alternative to line, word and glyph sequences (via

I have now made a draft version of the schema for my proposal at https://github.com/bertsky/PAGE-XML. The very last changeset adds the

BTW, I have also tried to reduce redundancy and code duplication by refactoring the shared elements into
Okay, thanks @bertsky. Does anyone have more information on:
Dear all,
@chris1010010 Dear Christian, too bad, but thanks for detailing your reasons! Do you still want me to separate the 3 run-up changesets mentioned above (without the actual lattice extension) and prepare a pull request from them? See here for a comparison including the changelog. @kba, do you think we can still attempt an extension within OCR-D (perhaps under a different namespace, maybe by patching the schema on the fly in core)?
@chris1010010 I have now separated the lattice extension proposal from the other, purely cosmetic commits and made a PR from the latter. @kba I updated the lattice extension proposal (with a forced push), because I found a minor validation error which xmllint had not reported earlier. (Apparently, you cannot use attribute type
@bertsky
@bertsky fyi, within ALTO we are currently investigating CITlab's confidence matrices (or
@cneud Thanks for bringing this to attention. Yes, I am aware of CITlab's confmat approach/format. (In fact, I have linked to it above and mentioned the possibility to extend PAGE with a

But as argued in my proposal for Tesseract (yes, this is the most recent), I believe using a lattice instead of a matrix
(As for point 3, one could probably train a sequence-to-sequence decoder for that task as well, but I am not certain of it, and it still would not work in complicated cases like Tesseract.) Please feel free to refute any of those points! I would be more than happy to get that discussion going again. (I am sorry to have let it die earlier with Gundram and Günter.) Is the discussion for ALTO public, so I can catch up with it?
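To make the confmat-vs-lattice comparison concrete, a sketch with invented numbers: a confidence matrix assigns character alternatives to fixed output positions, which corresponds to a purely linear ("sausage") lattice, while a general lattice can additionally carry arcs spanning several positions, i.e. segmentation alternatives that a position-indexed matrix cannot express.

```python
# Confmat: one alternatives distribution per fixed output position (invented).
confmat = [
    {"m": 0.6, "r": 0.4},  # position 0
    {"n": 0.5, "y": 0.4},  # position 1
]

def sausage_from_confmat(confmat):
    """Embed a confmat into lattice arcs (begin_node, end_node, alternatives)."""
    return [(i, i + 1, alts) for i, alts in enumerate(confmat)]

def best_path(arcs, begin, end):
    """Highest-scoring (score, text) path; node ids are ints in ascending
    topological order, so sorting them is a valid processing order here."""
    best = {begin: (1.0, "")}
    for n in sorted({b for b, _, _ in arcs} | {e for _, e, _ in arcs}):
        if n not in best:
            continue
        s, t = best[n]
        for b, e, alts in arcs:
            if b == n:
                ch, c = max(alts.items(), key=lambda kv: kv[1])
                if e not in best or s * c > best[e][0]:
                    best[e] = (s * c, t + ch)
    return best[end]

lattice = sausage_from_confmat(confmat)
print(best_path(lattice, 0, 2)[1])  # per-position reading: 'mn'
# A genuine lattice arc spanning both positions, e.g. reading "rn" as one "m";
# no cell of the matrix could hold this segmentation alternative:
lattice.append((0, 2, {"m": 0.7}))
print(best_path(lattice, 0, 2)[1])  # the spanning reading now wins: 'm'
```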
@bertsky Thanks for the elaboration. Initial discussions were held in the ALTO board meeting alongside DATeCH 2019 last week, but as soon as we have something public, I'll post the link here and let you know.
In the OCR-D workflow, there are several steps that likely require input or output to be able to represent word segmentation ambiguity and confidence values of word boundaries (whitespace characters):

1. Each individual OCR engine might be able to provide such information. Post-correction could benefit from it (importing whitespace characters with confidence as but one of many alternative characters within the input graph). Tesseract LSTM models definitely have this, at least internally (see also "fill PAGE's glyphs and its variants and confidences via GetIterator() in recognize.py" ocrd_tesserocr#7, which is distinct but related). PAGE output needs to incorporate those without loss.

2. Character-based language models need whitespace characters both for input and for output, at least when going below the `TextLine` level: for input, because the tokenization implicit in `Word` annotation is not guaranteed to be a trivial whitespace join (especially at punctuation); and for output, because the LM's whitespace probabilities would otherwise have to be thrown at (the confidence attributes of) neighbouring elements.

3. Alignment of multiple OCR results produces symbol pairings including the empty symbol (insertion/deletion) and the whitespace symbol (tokenization ambiguity) for each line. Since there is no reserved character for the empty symbol within `TextLine:TextEquiv:Unicode` – all of Unicode is possible here, except control characters, which are forbidden in CDATA – we cannot use it to encode such alignments. We do have a natural representation for the empty symbol at `Glyph:TextEquiv:Unicode`, but Glyph already necessitates a strict (hierarchical) Word segmentation, which would break tokenization ambiguity again.

Thus, it seems advisable to allow a PAGE annotation as interface at those particular steps which deviates from the standard in

- `WordType` (by using only 1 Word per Line, by convention), and
- `GlyphType` (by using whitespace within its TextEquiv).

The alternative would be to either not use PAGE at all there or lose important information by design.
Example for multi-OCR alignment:
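As an illustrative sketch of the kind of alignment meant here (plain Levenshtein dynamic programming; not a multi-way alignment): aligning two hypothetical OCR readings of one line yields symbol pairings that include the empty symbol, which is represented out-of-band (here as `None`), since no Unicode character can be reserved for it in TextEquiv.

```python
def align(a, b):
    """Pairwise character alignment of two strings by edit distance.
    Returns a list of (symbol_a, symbol_b) pairs; None is the empty symbol."""
    # D[i][j] = edit distance between a[:i] and b[:j]
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i
    for j in range(1, len(b) + 1):
        D[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i][j] = min(D[i - 1][j] + 1,                               # deletion
                          D[i][j - 1] + 1,                               # insertion
                          D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))      # (mis)match
    # Backtrace into symbol pairings.
    pairs, i, j = [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            pairs.append((a[i - 1], None))  # empty symbol in b
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # empty symbol in a
            j -= 1
    return pairs[::-1]

# Whitespace ambiguity shows up as a (' ', None) pairing:
print(align("my pay", "mypav"))
```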
@kba @wrznr @finkf @lschiffer