Handwritten documents are increasingly common in current projects. Although ALTO can already be used to describe page layout and text for this type of material, I think there is still room for improvement. One recent change concerned the baseline definition, which was changed from a float value (the y coordinate of the line) to PointsType, because the baseline of handwritten text is not a straight line. There are probably many more issues related to this topic that we can discuss and improve.
This issue is intended as a place to collect ideas for further discussion; from here we will pick out the most important topics and create individual issues.
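To illustrate the baseline change mentioned above, here is a minimal Python sketch of how a consumer might accept both forms of the BASELINE value. It assumes the points form is a string of numbers separated by commas and/or whitespace; the function name `parse_baseline` and its `hpos` and `width` parameters are hypothetical and not part of any ALTO tooling.

```python
import re


def parse_baseline(value, hpos=0.0, width=0.0):
    """Return a baseline as a list of (x, y) tuples.

    Assumption: `value` is either a single number (the older form, a y
    coordinate only) or a list of coordinates separated by commas and/or
    whitespace (the points form).
    """
    numbers = [float(tok) for tok in re.split(r"[,\s]+", value.strip()) if tok]
    if not numbers:
        raise ValueError("empty baseline value")
    if len(numbers) == 1:
        # Older form: a straight horizontal line spanning the text line.
        y = numbers[0]
        return [(hpos, y), (hpos + width, y)]
    if len(numbers) % 2 != 0:
        raise ValueError("odd number of coordinates in baseline")
    # Points form: pair the numbers up as (x, y).
    return list(zip(numbers[0::2], numbers[1::2]))
```

Normalising both forms to a polyline lets downstream code treat printed and handwritten lines the same way.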
I asked some people from Transkribus why they chose PAGE instead of ALTO, and what ALTO is missing to be a better format for the handwriting community. Here is the answer:
"As far as I remember, we chose PAGE as
- it was designed specifically with GT in mind and there was already a good amount of training data available in that format.
- researchers in the project often already had IO libraries for the format
- it allows to define polygonal regions/lines for cropping (I think the major OCR formats only allowed rectangular blocks back then. Correct me if I am wrong)
  - to capture bent/twisted lines in handwriting
  - to separate overlapping lines as far as possible (e.g. ascenders/descenders of characters might still cross other lines)
- baselines with multiple points were added quickly on request in 2013
- the text representation can be added on any level (regions, lines, words) without the need to go into more detail if not needed. Most tools in the project worked on line level only and therefore this was most important to have."
From this I see one topic we may want to consider for the future (some of the features that were missing at the time, such as polyline baselines and polygonal shapes on all levels, have since been added):
Allow CONTENT on any level, without the need to go deeper into the structure if not needed (e.g. the full text of a line directly on the TextLine). The question is whether we keep the deeper structure mandatory for ALTO producers while making life easier for consumers, or make the details optional on every level (this could lead to a very simple ALTO file containing just plain text in a single block). This might be useful from a GT perspective; from the point of view of presentation systems it may not be useful at all.
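As a rough sketch of what this proposal could mean for a consumer, the Python snippet below reads the text of a line either from a line-level CONTENT attribute or, if that is absent, from the String and SP children. The CONTENT attribute on TextLine is purely hypothetical here (it is the idea under discussion, not current ALTO), and the helper name `line_text` and the v4 namespace constant are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the ALTO v4 namespace; adjust for other schema versions.
ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v4#}"


def line_text(text_line):
    """Return the text of a TextLine element."""
    # Hypothetical: CONTENT directly on TextLine, as proposed in this thread.
    direct = text_line.get("CONTENT")
    if direct is not None:
        return direct
    # Otherwise rebuild the text from the word-level children.
    parts = []
    for child in text_line:
        if child.tag == ALTO_NS + "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == ALTO_NS + "SP":
            parts.append(" ")
    return "".join(parts)


WORD_LEVEL = ET.fromstring(
    '<TextLine xmlns="http://www.loc.gov/standards/alto/ns-v4#">'
    '<String CONTENT="Dear"/><SP/><String CONTENT="Sir"/>'
    "</TextLine>"
)
LINE_LEVEL = ET.fromstring(
    '<TextLine xmlns="http://www.loc.gov/standards/alto/ns-v4#" CONTENT="Dear Sir"/>'
)
```

With this fallback, `line_text(WORD_LEVEL)` and `line_text(LINE_LEVEL)` return the same string, so a consumer that only needs line text would not have to care which level the producer filled in.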
When working with Transkribus-SWT to generate GT, my colleagues and I ran into trouble several times because we forgot to synchronize text line and word contents. The major advantage (IMHO) of ALTO compared to PAGE is the single storage point for OCR content, especially when one aims to create GT at least on word level, as we do.
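The synchronization problem described here can be checked mechanically when text is stored on both levels. The following Python sketch works on already-extracted strings rather than on PAGE XML itself; the function name `out_of_sync_lines` and the tuple layout are hypothetical, and it assumes words are joined by single spaces.

```python
def out_of_sync_lines(lines):
    """Return the ids of lines whose line text differs from their words.

    `lines` is an iterable of (line_id, line_text, word_texts) tuples.
    Whitespace is normalised on both sides before comparing.
    """
    mismatched = []
    for line_id, line_text, word_texts in lines:
        from_words = " ".join(" ".join(word_texts).split())
        from_line = " ".join(line_text.split())
        if from_words != from_line:
            mismatched.append(line_id)
    return mismatched
```

A format with a single storage point for the text does not need such a check at all, which is the advantage pointed out above.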
Allowing content on text line level might introduce problems with reading order as well when mixing RTL and LTR languages in the same line.