Tagger bug (assertion length mismatch) #77

proycon · 2019-06-22T09:06:30Z

Input file: mlp08:/scratch/proycon/HuygensING-stellingwerff-1-1_62beb0b1-1ddd-4ef2-866d-21362b73ab83.translated.folia.xml

  frog-:Sat Jun 22 10:38:45 2019 Frogging in/HuygensING-stellingwerff-1-1_62beb0b1-1ddd-4ef2-866d-21362b73ab83.translated.folia.xml
  frog: cgn_tagger_mod.cxx:214: void CGNTagger::add_tags(const std::vector<folia::Word*>&, const frog_data&) const: Assertion `wv.size() == fd.size()' failed.
  .command.sh: line 25: 23648 Aborted                 frog $opts --override tokenizer.rulesFile=tokconfig-nld-historical -x --xmldir "out/" --textclass contemporary --nostdout --testdir in/ --retry

To be investigated further. My first guess is that not all words have text with textclass contemporary, and the tagger can't handle that?
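A quick way to test that guess is to list the words that lack a text element of the requested class. This is a minimal sketch using only the Python standard library, not part of Frog or libfolia. The function name `words_missing_textclass` and the tiny sample document (including the second word, "ende") are made up for illustration, and the sample is a stripped-down fragment rather than a complete, valid FoLiA file.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"  # FoLiA XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id


def words_missing_textclass(xml_text, textclass="contemporary"):
    """Return the xml:id of every <w> without a direct <t> child of `textclass`.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for w in root.iter(f"{{{FOLIA_NS}}}w"):
        # A <t> without a class attribute is in the default text class "current".
        classes = {t.get("class", "current") for t in w.findall(f"{{{FOLIA_NS}}}t")}
        if textclass not in classes:
            missing.append(w.get(XML_ID))
    return missing


# Invented sample: w.1 has a contemporary text, w.2 does not.
sample = """<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <w xml:id="w.1"><t>opte</t><t class="contemporary">op die</t></w>
  <w xml:id="w.2"><t>ende</t></w>
</FoLiA>"""

print(words_missing_textclass(sample))
```

An empty result on the real input file would rule this guess out and point to a different cause.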


kosloot · 2019-07-10T11:58:47Z

UPDATE:
This revealed some issues in libfolia, which I could solve with an ugly hack.
Now working on the problems within Frog.

kosloot · 2019-07-10T12:36:09Z

Well. This error is caused by a space in the word:

                <w xml:id="HuygensING-stellingwerff-1-1_62beb0b1-1ddd-4ef2-866d-21362b73ab83.text.text.1.body.1.div1.1.div2.2.div3.1.p.1.s.2.w.81" class="WORD" datetime="2017-06-27T14:09:02" set="tokconfig-nld">
                  <t>opte</t>
                  <t class="contemporary">op die</t>
                  <lemma class="MNW:39184⊕90021" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmaid_withcompounds.foliaset.ttl"/>
                  <lemma class="op⊕die" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmatext_withcompounds.foliaset.ttl"/>
                  <metric class="modernisationsource" value="inthistlexicon"/>
                </w>

frog tries to handle this as 2 separate words, 'op' and 'die', but that contradicts the tokenization that is already present.

I suppose this should be changed into op⊕die like in the lemma?

kosloot · 2019-07-10T14:10:42Z

Ok, I found the problem.
Frog is perfectly capable of handling 'words' like 'op die', but in this case the inserted character is not a regular space but a NARROW NO-BREAK SPACE (U+202F, UTF-8 bytes e2 80 af).

When it is replaced by a normal space, it should work, as frog will replace embedded spaces by '_'.
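To make the difference visible, the following sketch inspects the offending character and then applies the replacement described above. It is plain Python for illustration, not Frog's actual code, and the helper name `normalise_word` is made up here.

```python
import unicodedata

NNBSP = "\u202f"  # the character found between 'op' and 'die'

text = "op" + NNBSP + "die"

print(unicodedata.name(NNBSP))      # NARROW NO-BREAK SPACE
print(NNBSP.encode("utf-8").hex())  # e280af
print(" " in text)                  # False: there is no regular space in the word


def normalise_word(word):
    """Hypothetical helper: turn U+202F into a regular space, then join with '_'."""
    return word.replace(NNBSP, " ").replace(" ", "_")


print(normalise_word(text))  # op_die
```

The two strings look identical in most editors and terminals, which is why the byte-level check is useful when debugging this kind of mismatch.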


kosloot · 2019-07-10T15:41:32Z

I improved the Frog code to handle a wide range of embedded "spaces". Testing in progress.
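As an illustration of what handling a wide range of embedded spaces can look like, here is a minimal Python sketch. It is an assumption about the general approach, not the change that went into Frog: it treats any character that Python considers whitespace, or that has Unicode general category `Zs` (space separator), as a space, and joins the remaining pieces with '_'. The function names are hypothetical.

```python
import unicodedata


def is_space_like(ch):
    """True for ASCII whitespace and for any Unicode space separator (category Zs)."""
    return ch.isspace() or unicodedata.category(ch) == "Zs"


def join_embedded_spaces(word, joiner="_"):
    """Replace each run of space-like characters inside `word` by one `joiner`."""
    parts = []
    current = []
    for ch in word:
        if is_space_like(ch):
            if current:  # close the piece collected so far
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return joiner.join(parts)


print(join_embedded_spaces("op\u202fdie"))     # narrow no-break space
print(join_embedded_spaces("op\u00a0die"))     # no-break space
print(join_embedded_spaces("op \u2009 die"))   # mixed run of spaces
```

All three calls produce `op_die`, so a single `<w>` element keeps yielding a single token regardless of which space variant it contains.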

proycon · 2019-07-10T20:31:51Z

My test collection passes now. If there's nothing else left to do, I think we can close this.

kosloot · 2019-07-11T07:42:31Z

I assume this is fixed.

proycon added bug TAGGER labels Jun 22, 2019

proycon assigned proycon and kosloot Jun 22, 2019

kosloot pushed a commit to LanguageMachines/libfolia that referenced this issue Jul 8, 2019

fix for problem detected in LanguageMachines/frog#77

b2635a8

proycon added a commit to LanguageMachines/foliautils that referenced this issue Jul 10, 2019

frog should be able to deal with spaces now, no need for ugly non-breaking space hack that now causes other problems in frog (LanguageMachines/frog#77)

0be49db

proycon added a commit to LanguageMachines/foliautils that referenced this issue Jul 10, 2019

frog should be able to deal with spaces now, no need for ugly non-breaking space hack that now causes other problems in frog (LanguageMachines/frog#77)

4e2b95c

kosloot closed this as completed Jul 11, 2019
