[BUGFIX] Chapter-3 Fixes bugs for word segmentation model in HMM #39

Open
wants to merge 4 commits into master

Conversation


@chenw23 chenw23 commented Dec 9, 2020

First, thanks to you and all the other authors and contributors of the book and this GitHub repository for providing material that helps beginners learn basic and advanced natural language processing techniques with deep neural networks. The provided code makes it especially convenient to reproduce the results in the book.

Nevertheless, there is a small defect that may prevent the algorithm provided in the book from working in some scenarios.
In Chapter 3, you introduce an algorithm for word segmentation using an HMM. It is very nice that you handle the case where an unknown character appears in the middle of the input sentence, for which a plain HMM cannot produce a prediction. However, when an unknown character appears as the first character of a sentence, the algorithm stops working entirely.

This case can be fixed simply by checking whether the first character appeared in the training set and giving it a default value if it did not. Since in Chinese a rarely used character appearing first in a sentence is most likely the beginning of a word, I assign it the B label here. This setting barely hurts overall prediction accuracy, since unknown characters make up only a small fraction of Chinese text; the main goal of this change is to keep the algorithm from crashing.
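The check described above can be sketched as follows. This is a minimal illustration rather than the exact patch: `patch_first_char` is a hypothetical helper, and `emit_p` stands in for the book's `B_dic`, a dict mapping each BMES state to per-character emission probabilities.

```python
def patch_first_char(char, states, emit_p):
    # If the leading character was never seen during training, give it a
    # nominal emission probability under the 'B' (word-Beginning) state so
    # that Viterbi initialization does not zero out every path.
    if not any(char in emit_p[y] for y in states):
        emit_p['B'][char] = 1.0
    return emit_p

# Hypothetical emission table in which '譬' was unseen in training:
emit_p = {'B': {'我': 0.2}, 'M': {}, 'E': {}, 'S': {'的': 0.3}}
patch_first_char('譬', 'BMES', emit_p)
print(emit_p['B']['譬'])  # 1.0
```

With the table patched this way, the `max()` in the first Viterbi step always has at least one non-zero candidate, so the `ValueError` below cannot occur.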

For your reference, I've attached the running logs of the original code and my modified code (in this pull request) below:

  • Your existing code and output
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')
text = '譬如'
res = hmm.cut(text)
print(text)
print(str(list(res)))
譬如

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-09fa8779fc79> in <module>
      5 res = hmm.cut(text)
      6 print(text)
----> 7 print(str(list(res)))

<ipython-input-1-b11d71c4e220> in cut(self, text)
    140         if not self.load_para:
    141             self.try_load_model(os.path.exists(self.model_file))
--> 142         prob, pos_list = self.viterbi(text, self.state_list, self.Pi_dic, self.A_dic, self.B_dic)
    143         begin, next = 0, 0
    144         for i, char in enumerate(text):

<ipython-input-1-b11d71c4e220> in viterbi(self, text, states, start_p, trans_p, emit_p)
    121             for y in states:
    122                 emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0 #设置未知字单独成词
--> 123                 (prob, state) = max(
    124                     [(V[t - 1][y0] * trans_p[y0].get(y, 0) *
    125                       emitP, y0)

ValueError: max() arg is an empty sequence
  • My code and output
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')
text = '譬如'
res = hmm.cut(text)
print(text)
print(str(list(res)))
譬如
['譬如']

This pull request partially answers the questions in #34

@chenw23 chenw23 changed the title [BUGFIX] Chapere-3 Fixes bugs for word segmentation model in HMM [BUGFIX] Chapter-3 Fixes bugs for word segmentation model in HMM Dec 9, 2020

chenw23 commented Dec 9, 2020

The commit 8077641 above fixes another bug in the HMM word segmentation model.
Since the HMM computes the probability of a sentence character by character from left to right, the running product becomes extremely small for characters late in a long sentence and eventually underflows to zero, at which point the algorithm can no longer produce predictions.
This bug can be fixed easily by splitting the input sentence into fragments at its native punctuation marks.

For your reference, I've attached the running logs of the original code and my modified code (in this pull request) below:

  • Your existing code and output:
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')

text = '丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。'
res = hmm.cut(text)
print(text)
print(str(list(res)))
丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-110a64a94244> in <module>
      5 res = hmm.cut(text)
      6 print(text)
----> 7 print(str(list(res)))

<ipython-input-1-b11d71c4e220> in cut(self, text)
    140         if not self.load_para:
    141             self.try_load_model(os.path.exists(self.model_file))
--> 142         prob, pos_list = self.viterbi(text, self.state_list, self.Pi_dic, self.A_dic, self.B_dic)
    143         begin, next = 0, 0
    144         for i, char in enumerate(text):

<ipython-input-1-b11d71c4e220> in viterbi(self, text, states, start_p, trans_p, emit_p)
    121             for y in states:
    122                 emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0 #设置未知字单独成词
--> 123                 (prob, state) = max(
    124                     [(V[t - 1][y0] * trans_p[y0].get(y, 0) *
    125                       emitP, y0)

ValueError: max() arg is an empty sequence
  • My code and output
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')

text = '丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。'
res = []
# split at each full-width comma, then re-attach the comma to the following
# fragment so the punctuation survives in the segmented output
split_text = text.split(",")
for i in range(1, len(split_text)):
    split_text[i] = "," + split_text[i]
# segment each short fragment independently to avoid probability underflow
for fragment in split_text:
    for segment in list(hmm.cut(fragment)):
        res.append(segment)
print(text)
print(str(res))
丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。
['丰子', '恺', '在', '1974年', '专门', '为', '他作', '了', '一幅斗方', '毛笔', '画', ':', '“', '种瓜', '得瓜', '”', ',', '并题', '词', ':', '“', '世庆', '贤', '台', '雅赏', '”', '(', '见', '图', ')', '。', '如今', '这幅', '珍贵', '的', '墨存', '还', '挂', '在', '胡', '世庆', '一家', '三口', '10平', '方米', '左右', '的', '斗室', '中', '。', '丰子', '恺', '把', '自己', '几', '十年', '人生', '阅历', '浓缩', '成三句话', '送', '给', '他—', '——', '“', '多', '读书', ',', '广结交', ',', '少', '说', '话', '”', ',', '把', '他', '引为', '忘年', '知己', '。']
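The underflow that motivates the split can be demonstrated in isolation: multiplying many sub-1 probabilities, as Viterbi does per character, eventually reaches floating-point zero. The numbers here are illustrative only, not taken from the model.

```python
p = 1.0
for _ in range(400):  # e.g. 400 characters, each step multiplying in
    p *= 1e-3         # a small transition * emission probability
print(p)  # 0.0 — the product has underflowed below the smallest float
```

A common alternative to splitting the sentence is to carry log-probabilities and sum them instead of multiplying raw probabilities.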

This commit in this pull request fixes #25

This design can train the model better and yields higher prediction accuracy.

chenw23 commented Dec 10, 2020

The commit 36aabd3 above enhances the model design in the HMM word segmentation model.
In the original version, the emission count in B_dic is not recorded for the first character of each line. In practice, this degrades the trained model: on my test sets, prediction accuracy drops by around 1%. Changing this design as in 36aabd3 improves the model's performance.
For convenient comparison, I copied the code comparison below:

  • Original Code
for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        self.Pi_dic[v] += 1  # 每个句子的第一个字的状态,用于计算初始状态概率
    else:
        self.A_dic[line_state[k - 1]][v] += 1  # 计算转移概率
        self.B_dic[line_state[k]][word_list[k]] = \
            self.B_dic[line_state[k]].get(word_list[k], 0) + 1.0  # 计算发射概率
  • Changed Code
for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        self.Pi_dic[v] += 1  # 每个句子的第一个字的状态,用于计算初始状态概率
    else:
        self.A_dic[line_state[k - 1]][v] += 1  # 计算转移概率
    self.B_dic[line_state[k]][word_list[k]] = \
        self.B_dic[line_state[k]].get(word_list[k], 0) + 1.0  # 计算发射概率
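To see the effect of moving the emission count out of the else branch, here is a minimal, self-contained run of the changed loop on a single hypothetical two-character word (variable names follow the book's code, with the `self.` prefix dropped):

```python
line_state = ['B', 'E']  # BMES labels for the two-character word 中国
word_list = ['中', '国']
Pi_dic = {s: 0 for s in 'BMES'}
A_dic = {s: {t: 0 for t in 'BMES'} for s in 'BMES'}
B_dic = {s: {} for s in 'BMES'}
count = {s: 0 for s in 'BMES'}

for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        Pi_dic[v] += 1                    # initial-state count
    else:
        A_dic[line_state[k - 1]][v] += 1  # transition count
    # emission count now recorded at every position, including k == 0
    B_dic[v][word_list[k]] = B_dic[v].get(word_list[k], 0) + 1.0

print(B_dic['B'])  # {'中': 1.0} — under the original code this entry was never counted
```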


chenw23 commented Dec 10, 2020

The commit d456e60 above enhances the model design in the HMM word segmentation model.
In practice, the last character of a labeled sentence must end a word, i.e., carry state E or S, which is the property this change relies on. On my test sets, this affects prediction accuracy by around 0.5%, and changing the design as in d456e60 improves the model's performance.
For convenient comparison, I copied the code comparison below:

  • Original Code
if emit_p['M'].get(text[-1], 0)> emit_p['S'].get(text[-1], 0):
    (prob, state) = max([(V[len(text) - 1][y], y) for y in ('E','M')])
else:
    (prob, state) = max([(V[len(text) - 1][y], y) for y in states])
  • Changed Code
(prob, state) = max((V[len(text) - 1][y], y) for y in ('E', 'S'))
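The rationale: in BMES tagging, a well-formed segmentation can only terminate in E (end of a multi-character word) or S (a single-character word), so the final max is restricted to those two states regardless of the last character's emission probabilities. A tiny illustration with hypothetical last-column Viterbi scores:

```python
# Hypothetical Viterbi scores for the last character of a sentence
V_last = {'B': 0.40, 'M': 0.25, 'E': 0.20, 'S': 0.15}
# Even though 'B' scores highest, it cannot legally end a sentence,
# so the choice is restricted to the terminating states E and S.
prob, state = max((V_last[y], y) for y in ('E', 'S'))
print(prob, state)  # 0.2 E
```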
