Tagger bug (assertion length mismatch) #77

Closed
proycon opened this issue Jun 22, 2019 · 6 comments

@proycon
Member

proycon commented Jun 22, 2019

Input file: mlp08:/scratch/proycon/HuygensING-stellingwerff-1-1_62beb0b1-1ddd-4ef2-866d-21362b73ab83.translated.folia.xml

  frog-:Sat Jun 22 10:38:45 2019 Frogging in/HuygensING-stellingwerff-1-1_62beb0b1-1ddd-4ef2-866d-21362b73ab83.translated.folia.xml
  frog: cgn_tagger_mod.cxx:214: void CGNTagger::add_tags(const std::vector<folia::Word*>&, const frog_data&) const: Assertion `wv.size() == fd.size()' failed.
  .command.sh: line 25: 23648 Aborted                 frog $opts --override tokenizer.rulesFile=tokconfig-nld-historical -x --xmldir "out/" --textclass contemporary --nostdout --testdir in/ --retry

To be investigated further. My first guess is that not all words have text in the textclass contemporary, and the tagger can't handle that?
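
One way to test that guess is to list the `<w>` elements that have no `<t>` in the requested text class. The following is a minimal sketch, not part of Frog or libfolia: the helper names and the sample fragment (including the word "ende") are made up for illustration, tags are matched by local name so no particular namespace URI is assumed, and a `<t>` without a `class` attribute is assumed to belong to FoLiA's default text class `current`.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree exposes xml:id


def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def words_missing_textclass(xml_text, textclass="contemporary"):
    """Hypothetical helper: ids of <w> elements lacking a <t> in `textclass`."""
    root = ET.fromstring(xml_text)
    missing = []
    for el in root.iter():
        if local(el.tag) != "w":
            continue
        # Only direct <t> children count as the word's own text content.
        # Assumption: a <t> without a class attribute has class "current".
        classes = {t.get("class", "current") for t in el if local(t.tag) == "t"}
        if textclass not in classes:
            missing.append(el.get(XML_ID))
    return missing


# Made-up fragment: the second word has no contemporary text.
sample = """<s xmlns="http://ilk.uvt.nl/folia">
  <w xml:id="w.1"><t>opte</t><t class="contemporary">op die</t></w>
  <w xml:id="w.2"><t>ende</t></w>
</s>"""

print(words_missing_textclass(sample))  # ['w.2']
```

An empty result for the real input file would rule this hypothesis out.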

kosloot pushed a commit to LanguageMachines/libfolia that referenced this issue Jul 8, 2019
@kosloot
Collaborator

kosloot commented Jul 10, 2019

UPDATE:
This revealed some issues in libfolia, which I was able to solve with an ugly hack.
Now working on the problems within Frog.

@kosloot
Collaborator

kosloot commented Jul 10, 2019

Well. This error is caused by a space in the word:

                <w xml:id="HuygensING-stellingwerff-1-1_62beb0b1-1ddd-4ef2-866d-21362b73ab83.text.text.1.body.1.div1.1.div2.2.div3.1.p.1.s.2.w.81" class="WORD" datetime="2017-06-27T14:09:02" set="tokconfig-nld">
                  <t>opte</t>
                  <t class="contemporary">op die</t>
                  <lemma class="MNW:39184⊕90021" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmaid_withcompounds.foliaset.ttl"/>
                  <lemma class="op⊕die" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmatext_withcompounds.foliaset.ttl"/>
                  <metric class="modernisationsource" value="inthistlexicon"/>
                </w>

Frog tries to handle this as two separate words, 'op' and 'die', but that contradicts the tokenization already present in the document.

I suppose this should be changed into op⊕die like in the lemma?
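
The mismatch behind the assertion can be illustrated with a small sketch. It is not Frog's code: `check_alignment` and the second word are hypothetical, and plain whitespace splitting stands in for whatever the tagger does internally. It only shows the failure mode, where one existing `<w>` yields two tagger tokens, so the counts that `wv.size() == fd.size()` compares no longer agree.

```python
def check_alignment(word_texts):
    """Hypothetical check: compare the number of existing words with the
    number of tokens produced by naively splitting each text on whitespace."""
    tokens = [tok for text in word_texts for tok in text.split()]
    return len(word_texts), len(tokens)


# Two pre-tokenized words; the first one carries an embedded space.
n_words, n_tokens = check_alignment(["op die", "stad"])
print(n_words, n_tokens)  # 2 3
```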

@kosloot
Collaborator

kosloot commented Jul 10, 2019

Ok, I found the problem.
Frog is perfectly capable of handling 'words' like 'op die', but in this case the separator is not a regular space but a NARROW NO-BREAK SPACE (U+202F, UTF-8 bytes e2 80 af).

When it is replaced by a normal space, it should work, as Frog will replace embedded spaces by '_'.
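
As an illustration of the idea, here is a minimal sketch in Python rather than Frog's C++; `join_embedded_spaces` is a hypothetical helper, not a Frog or libfolia function. It relies on the fact that `\s` in a Python 3 `str` pattern matches Unicode whitespace, including U+00A0 and U+202F, so both the regular and the narrow no-break variant end up as the same joined token.

```python
import re
import unicodedata

# Any run of Unicode whitespace, not just U+0020.
_EMBEDDED_WS = re.compile(r"\s+")


def join_embedded_spaces(text, joiner="_"):
    """Hypothetical helper: collapse whitespace inside one token into `joiner`."""
    return _EMBEDDED_WS.sub(joiner, text.strip())


narrow = "op\u202fdie"

# Confirm which character is involved and how it is encoded.
print(unicodedata.name(narrow[2]))  # NARROW NO-BREAK SPACE
print(narrow.encode("utf-8"))       # b'op\xe2\x80\xafdie'

print(join_embedded_spaces("op die"))  # op_die
print(join_embedded_spaces(narrow))    # op_die
```

Normalizing the input beforehand (as suggested above) and widening the set of characters treated as embedded spaces are two routes to the same result; this sketch shows the second.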

proycon added a commit to LanguageMachines/foliautils that referenced this issue Jul 10, 2019
proycon added a commit to LanguageMachines/foliautils that referenced this issue Jul 10, 2019
@kosloot
Collaborator

kosloot commented Jul 10, 2019

I improved the Frog code to handle a wide range of embedded "spaces". Testing in progress.

@proycon
Member Author

proycon commented Jul 10, 2019

My test collection passes now. If there's nothing else left to do, I think we can close this.

@kosloot
Collaborator

kosloot commented Jul 11, 2019

I assume this is fixed.

@kosloot kosloot closed this as completed Jul 11, 2019