Issue with sentence segmentation offsets #753

kermitt2 · 2021-04-29T00:58:45Z

In the following example

https://arxiv.org/pdf/2103.12028v1.pdf

there are cases of wrong sentence segmentation, with sentence offsets apparently shifted by a few characters, so that words get cut. This happens whichever sentence segmenter is selected, OpenNLP or Pragmatic Segmenter:

<s>Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. T</s>
<s>e quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European...</s>

As it happens with both segmenters, which use different offset calculation methods, it might be due to issues with character encoding.
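A quick way to spot this kind of shift, whatever its cause, is to check whether a sentence boundary falls between two word characters. The following is only a minimal sketch under stated assumptions: `SentenceOffsetCheck`, `findWordCuts` and `splitsWord` are hypothetical names and not part of GROBID, and sentences are represented as `{start, end}` pairs with an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GROBID: flags sentence offsets that cut a word.
final class SentenceOffsetCheck {

    /** Returns the indices of sentences whose start or end offset falls inside a word. */
    static List<Integer> findWordCuts(String text, List<int[]> sentences) {
        List<Integer> suspicious = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            int start = sentences.get(i)[0];
            int end = sentences.get(i)[1]; // exclusive
            if (splitsWord(text, start) || splitsWord(text, end)) {
                suspicious.add(i);
            }
        }
        return suspicious;
    }

    /** True when the characters on both sides of the boundary are letters or digits. */
    static boolean splitsWord(String text, int boundary) {
        if (boundary <= 0 || boundary >= text.length()) {
            return false; // the very start or end of the text cannot split a word
        }
        return Character.isLetterOrDigit(text.charAt(boundary - 1))
                && Character.isLetterOrDigit(text.charAt(boundary));
    }
}
```

On the example above, a first sentence ending in "90%. T" would be flagged, because its end offset sits between "T" and "h".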

lfoppiano · 2021-04-29T02:25:01Z

I haven't tested but I think this could be fixed by PR #701

kermitt2 · 2021-04-29T03:08:21Z

Just tested... PR #701 does not fix it unfortunately, same error.
But this is an opportunity to both check this PR and fix the bug :)

lfoppiano · 2021-05-10T23:57:55Z

I've added some tests, and the output of the method getSentenceOffsets is correct (PR #701).

I did some debugging to understand where the issue is, and IMHO it is at line ~244 of SentenceUtilities, where the upper bound is increased:

newPosition.end += pushedEnd+1;

In this specific case, the problem seems to lie in the synchronisation between layout tokens and sentences.
Also, there is a pos=0 reset after switching to a new sentence, so the synchronisation starts again from scratch, and newPosition.end ends up being modified based on tokens that are not in the current sentence.
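To make the idea concrete, here is a rough sketch of an end-of-sentence extension that stays in sync with the tokens. It is not the code in SentenceUtilities: `MarkerAwareOffsets`, `Tok` and `extendOverMarkers` are hypothetical names, and it assumes that each token carries its own character offsets and that tokens and sentences are both sorted by offset. The token cursor is carried from one sentence to the next instead of being reset, and the end is pushed only over superscript tokens that start exactly where the sentence ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the GROBID implementation.
final class MarkerAwareOffsets {

    /** Stand-in for a layout token: character offsets (end exclusive) plus a superscript flag. */
    static final class Tok {
        final int start;
        final int end;
        final boolean superscript;

        Tok(int start, int end, boolean superscript) {
            this.start = start;
            this.end = end;
            this.superscript = superscript;
        }
    }

    /** Extends each {start, end} sentence over superscript tokens adjacent to its end. */
    static List<int[]> extendOverMarkers(List<int[]> sentences, List<Tok> tokens) {
        List<int[]> result = new ArrayList<>();
        int cursor = 0; // carried across sentences, never reset to 0
        for (int[] sentence : sentences) {
            int end = sentence[1];
            // Skip every token that starts before the sentence end.
            while (cursor < tokens.size() && tokens.get(cursor).start < end) {
                cursor++;
            }
            // Absorb only superscript tokens that begin exactly at the current end.
            while (cursor < tokens.size()
                    && tokens.get(cursor).superscript
                    && tokens.get(cursor).start == end) {
                end = tokens.get(cursor).end;
                cursor++;
            }
            result.add(new int[] {sentence[0], end});
        }
        return result;
    }
}
```

With this shape, a superscript token in the middle of a sentence, or one belonging to a different sentence, cannot move the end offset.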

lfoppiano · 2021-08-13T08:02:43Z

I'm revising this issue and the somehow related PR #701.

The comment states that we need to compose the "text" without the forbidden elements (references). However, IMHO we should keep these references in the text as well, run the segmentation and then remove them, shouldn't we?

I also notice that while the layout tokens contain all the tokens (including references), the text is a mixture:

CCAligned ) is a 119language 1 parallel dataset built off 68 snapshots of Common Crawl. Documents are aligned if they are in the same language according to FastText LangID (Joulin et al., 2016(Joulin et al., , 2017, and have the same URL but for a differing language code. These alignments are refined with cross-lingual LASER embeddings (Artetxe and Schwenk, 2019). For sentence-level data, they split on newlines and align with LASER, but perform no further filtering. Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. The quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European (da, cr, sl, sk, lt, et) languages of the TED corpus (Qi et al., 2018), where it compares favorably to systems built on crawled sentences from WikiMatrix and ParaCrawl   (Qi et al., 2018); WMT-5: cs, de, fi, lv, ro. POS/DEP-5: part-of-speech labeling and dependency parsing for bg, ca, da, fi, id.

kermitt2 · 2021-08-13T08:54:10Z

(removing comment, it was more for #811 !)
#811 (comment)

but it's relevant to the fact that we don't remove references, we just keep track of their positions. The text at this stage is not modified after segmentation. The rest of the method re-injects the tags into the segmented text, but doesn't touch the text.

lfoppiano · 2021-08-16T01:10:46Z

> but it's relevant to the fact that we don't remove references, we just keep track of their positions. The text at this stage is not modified after segmentation. The rest of the method re-injects the tags into the segmented text, but doesn't touch the text.

Ok, so the forbidden spans and the text are in sync.

I've pushed a test (currently ignored) that reproduces the issue. The problem seems to be generated at this point:

grobid/grobid-core/src/main/java/org/grobid/core/utilities/SentenceUtilities.java

Line 234 in cdb52ad

if (this.isValidSuperScriptNumericalReferenceMarker(nextToken)) {

The layout token inspection ends up out of sync with the sentences, as you mentioned before.
However, I'm a bit lost here :-)

The layout token at index 25 has "superscript" = true (I set it explicitly in the test; without it, it would work) and is causing the chain reaction. However, it is inspected only when we are at the sentence with index = 4, the one that appears with incorrect positions.

The following is unrelated.
I notice that some references (e.g. the first reference in the text, (El-Kishky et al., 2020)) are present in the layout tokens but missing in the text:

CCAligned (El-Kishky et al., 2020) is a 119-
language 1 parallel dataset built off 68 snapshots 
of Common Crawl. Documents are aligned if they 
[...]

CCAligned ) is a 119language 1 parallel dataset built off 68 snapshots of Common Crawl.[...]

lfoppiano · 2022-07-28T09:03:13Z

Since I've done some swimming in this part of the code, I've checked again with a fresh mind.

It seems that footnote 1 (superscript=True) triggers the if at line 234, which increases the upper limit of the sentence.
Maybe we should just check that such a token is not in the reference list?
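One possible reading of that check, as a hedged sketch only: before a superscript token is allowed to push the sentence end, test whether its offsets overlap one of the already known reference spans. `ReferenceSpanGuard`, `overlapsAny` and `mayExtendSentence` are hypothetical names, and spans are assumed to be `{start, end}` pairs with an exclusive end; none of this is taken from the GROBID code base.

```java
import java.util.List;

// Hypothetical guard, not part of GROBID.
final class ReferenceSpanGuard {

    /** True when the half-open range [start, end) overlaps any {start, end} span in the list. */
    static boolean overlapsAny(int start, int end, List<int[]> spans) {
        for (int[] span : spans) {
            if (start < span[1] && span[0] < end) {
                return true;
            }
        }
        return false;
    }

    /** A superscript token may push the sentence end only if it is not inside a known reference span. */
    static boolean mayExtendSentence(boolean superscript, int start, int end,
                                     List<int[]> referenceSpans) {
        return superscript && !overlapsAny(start, end, referenceSpans);
    }
}
```

Whether the condition should be "not in the reference list" or the opposite depends on how the reference spans are built upstream, so the boolean in `mayExtendSentence` is the part to adapt.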

kermitt2 self-assigned this Apr 29, 2021

kermitt2 added the bug label Apr 29, 2021

lfoppiano linked a pull request Jul 29, 2022 that will close this issue

Improvement of the recovery of Pragmatic Segmenter sentence segmentation text wrt to the original text offsets #701

Open
