
Empty or incomplete hypotheses #667

Open
ncakhoa opened this issue Nov 8, 2022 · 10 comments

Comments

@ncakhoa

ncakhoa commented Nov 8, 2022

When I trained a streaming stateless conformer (transducer_stateless2), I ran into the same situation as issue #403 during decoding: decoding with fast_beam_search_nbest_LG and an LG graph gives a lot of empty hypotheses.

I tried to fix it by following the solution in #403, but I couldn't find any use-max argument.
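
For reference, a minimal sketch of how the empty hypotheses can be counted; it assumes the recogs-*.txt format written by the decode script, with one "cut_id: ref=[...]" line and one "cut_id: hyp=[...]" line per utterance (as in the examples later in this thread), and the file path is a placeholder.

# Sketch: count empty hypotheses in a recogs-*.txt file from decode.py.
# Assumes each utterance is written as two lines:
#   <cut_id>: ref=['w1', 'w2', ...]
#   <cut_id>: hyp=['w1', 'w2', ...]
# The path below is a placeholder.
import ast

total = 0
empty = 0
with open("fast_beam_search_nbest_LG/recogs-test.txt") as f:
    for line in f:
        if ": hyp=" not in line:
            continue
        total += 1
        hyp = ast.literal_eval(line.split("hyp=", 1)[1])
        if not hyp:
            empty += 1

print(f"{empty} of {total} hypotheses are empty")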

@csukuangfj
Collaborator

Are you using the latest k2 (i.e., the master branch of k2)?

@ncakhoa
Author

ncakhoa commented Nov 9, 2022

I use k2 version 1.19.dev20220922
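
A quick, hedged way to confirm which k2 build Python is actually loading (useful when switching to a build from master):

# Sketch: show where k2 is imported from and its version string, if one is exposed.
import k2

print(k2.__file__)                            # path of the imported package
print(getattr(k2, "__version__", "unknown"))  # version string, if this build provides one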

@csukuangfj
Collaborator

I use k2 version 1.19.dev20220922

Could you try the latest one from the master?
https://k2-fsa.github.io/k2/installation/from_source.html

@ncakhoa
Author

ncakhoa commented Nov 9, 2022

I use k2 version 1.19.dev20220922

Could you try the latest one from the master? https://k2-fsa.github.io/k2/installation/from_source.html

I have tried it, but it didn't reduce the number of empty hypotheses.

@ncakhoa
Author

ncakhoa commented Nov 9, 2022

I also tried greedy search, and it outputs the correct tokens, so I think the problem is in fast_beam_search_nbest_LG.
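
To make that comparison concrete, here is a rough sketch (same assumed recogs-*.txt format as above, hypothetical file names) that lists the cuts whose hypothesis is empty under fast_beam_search_nbest_LG but non-empty under greedy_search:

# Sketch: cuts that come out empty with fast_beam_search_nbest_LG but not with greedy_search.
# File names are placeholders; lines are assumed to look like "<cut_id>: hyp=[...]".
import ast

def load_hyps(path):
    hyps = {}
    with open(path) as f:
        for line in f:
            if ": hyp=" in line:
                cut_id, rest = line.split(": hyp=", 1)
                hyps[cut_id.strip()] = ast.literal_eval(rest.strip())
    return hyps

greedy = load_hyps("greedy_search/recogs-test.txt")
lg = load_hyps("fast_beam_search_nbest_LG/recogs-test.txt")

for cut_id, hyp in lg.items():
    if not hyp and greedy.get(cut_id):
        print(cut_id, "greedy:", greedy[cut_id])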

@armusc
Contributor

armusc commented Dec 4, 2022

Hi,

Has anyone else experienced something like this? I'm getting similar results when the LG graph is used in decoding:

head -2 beam_search/errs-test-beam_size_4-epoch-50-avg-25-beam_search-beam-size-4.txt
%WER = 17.32
Errors: 494 insertions, 738 deletions, 3337 substitutions, over 26379 reference words (22304 correct)

head -2 fast_beam_search/errs-test-beam_15.0_max_contexts_8_max_states_64-epoch-50-avg-25-beam-15.0-max-contexts-8-max-states-64.txt
%WER = 18.22
Errors: 464 insertions, 1042 deletions, 3299 substitutions, over 26379 reference words (22038 correct)

head -2 greedy_search/errs-test-greedy_search-epoch-50-avg-25-context-2-max-sym-per-frame-1.txt
%WER = 18.06
Errors: 465 insertions, 899 deletions, 3399 substitutions, over 26379 reference words (22081 correct)

head -2 modified_beam_search/errs-test-beam_size_4-epoch-50-avg-25-modified_beam_search-beam-size-4.txt
%WER = 17.45
Errors: 484 insertions, 755 deletions, 3364 substitutions, over 26379 reference words (22260 correct)

head -2 fast_beam_search_nbest/errs-test-beam_15.0_max_contexts_8_max_states_64_num_paths_100_nbest_scale_0.5-epoch-50-avg-25-beam-15.0-max-contexts-8-max-states-64-nbest-scale-0.5-num-paths-100.txt
%WER = 17.62
Errors: 485 insertions, 795 deletions, 3369 substitutions, over 26379 reference words (22215 correct)

head -2 fast_beam_search_nbest_LG/errs-test-beam_20.0_max_contexts_8_max_states_64_num_paths_200_nbest_scale_0.5_ngram_lm_scale_0.01-epoch-50-avg-25-beam-20.0-max-contexts-8-max-states-64-nbest-scale-0.5-num-paths-200-ngram-lm-scale-0.01.txt
%WER = 21.19
Errors: 1131 insertions, 859 deletions, 3600 substitutions, over 26379 reference words (21920 correct)

There's a big degradation in WER with fast_beam_search_nbest_LG, and no difference whether a 2-gram or a 3-gram is used.
I stress that LG- or HLG-based decoding methods are especially useful in all those situations where the model needs to be adapted on text-only data, or where a word lexicon and arbitrary word pronunciations must be imposed, which is a very common scenario in industrial applications.

@csukuangfj
Collaborator

Could you please check your errs-xxx file and see how many errors are caused by OOV words when LG is used?

@armusc
Contributor

armusc commented Dec 5, 2022

Out of the 26379 words in the eval corpus there are 438 OOV word occurrences w.r.t. the word list in L, i.e. a 1.66% OOV ratio (a sketch of how that ratio can be computed is given below).
A rule of thumb I was told in the past is that in closed-vocabulary ASR every OOV causes about 1.5 word errors because of side effects in recognition, so a 1.66% OOV ratio might (empirically) translate into roughly a 2.5% absolute WER degradation.
Then again, there are errors on those same OOVs in the non-LG methods as well, so not everything comes from that: in the 17.62% WER from fast_beam_search_nbest decoding there are 272 errors from those same words (I just grepped the OOV word list against the sub/del/ins errors and summed).
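
A rough sketch of the OOV-ratio computation, assuming a lang_dir/words.txt with one word (optionally followed by an integer id) per line and a plain-text file with one reference transcript per line; both paths are placeholders.

# Sketch: OOV rate of the reference text w.r.t. the word list in L.
# Paths are placeholders; words.txt is assumed to contain "word [id]" per line.
with open("data/lang/words.txt") as f:
    vocab = {line.split()[0] for line in f if line.strip()}

total_words = 0
oov_words = 0
with open("ref.txt") as f:  # one reference transcript per line
    for line in f:
        for word in line.split():
            total_words += 1
            if word not in vocab:
                oov_words += 1

print(f"OOV rate: {oov_words} / {total_words} = {100.0 * oov_words / total_words:.2f}%")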

- When I use HLG-based decoding (say the 1best or nbest method) in conformer_ctc, I get a more reasonable 18.7-18.8 WER with the same L and G.
- I am also surprised that using a bigram or a trigram G does not really change the result.
- I have a few cases where the ending part of the utterance is not decoded, which reminded me of this thread; that does not seem to happen with the other methods, but it has only happened occasionally, so I cannot really generalize this observation.

ex:
1e15a26c-6a37-45c4-abd5-c62eba481801: ref=['de', 'nieuwe', 'programmatieregeling', 'om', 'dit', 'mogelijk', 'te', 'maken']
1e15a26c-6a37-45c4-abd5-c62eba481801: hyp=['de', 'nieuwe', "programma's", 'om', 'dit', 'hoofd']

293b82f6-2407-4d08-8a27-93dc690c2313: ref=['dat', 'zou', 'niet', 'nodig', 'zijn', 'als', 'hij', 'in', 'deze', 'cockpit', 'zou', 'vliegen']
293b82f6-2407-4d08-8a27-93dc690c2313: hyp=['dat', 'zou', 'niet', 'nodig', 'zijn', 'als', 'die', 'in', 'deze', 'kop']
10681284-d597-4ca4-9ae6-e9b5d633231c: ref=['in', 'de', 'toekomst', 'willen', 'wij', 'absoluut', 'die', 'domeinscholen']
10681284-d597-4ca4-9ae6-e9b5d633231c: hyp=['in', 'de', 'toekomst', 'willen', 'wij', 'absolute', 'doet']
which I do not see with fast_beam_search_nbest (not LG-based)

Now, maybe what I could do is train your latest model, where a CTC output is combined with the transducer, so that HLG decoding can be done on the CTC output, and see what I obtain.

@csukuangfj
Collaborator

- When I use HLG-based decoding (say the 1best or nbest method) in conformer_ctc, I get a more reasonable 18.7-18.8 WER with the same L and G.
- I am also surprised that using a bigram or a trigram G does not really change the result.

Do you mean it is not helpful for HLG decoding?

@armusc
Contributor

armusc commented Dec 5, 2022

That remark referred to fast_beam_search_nbest_LG: using a bigram or a trigram G has not changed the results.

When I use a G in first-pass HLG decoding with conformer_ctc and then rescore with a 4-gram (for example, whole-lattice rescoring), results are improved (let's say a 7-8% relative improvement in this specific case).
