[BUGFIX] Chapter-3 Fixes bugs for word segmentation model in HMM #39

Open
wants to merge 4 commits into master

Conversation


@chenw23 chenw23 commented Dec 9, 2020

First, thanks to you and all the other authors and contributors of the book and this GitHub repository for providing material that helps beginners learn basic and advanced natural language processing techniques with deep neural networks. The provided code makes it especially convenient to reproduce the results in the book.

Nevertheless, there is a small defect that may prevent the algorithm provided in the book from working in some scenarios.
In Chapter 3, you introduce an algorithm for word segmentation using an HMM. It is very nice that you handle the case where an unknown character appears in the middle of the input sentence, for which a plain HMM cannot produce a prediction. However, when an unknown character appears as the first character of a sentence, the algorithm stops working entirely.

This case can be fixed simply by checking whether the first character appeared in the training set and giving it a default value if it did not. Since in Chinese a rarely used character appearing first in a sentence is most likely the beginning of a word, I assign it the B label here. This setting barely hurts overall prediction accuracy, since unknown characters make up only a small fraction of Chinese text; the main goal of this change is to keep the algorithm from crashing.
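The check described above can be sketched as follows. This is a minimal illustration rather than the exact patch: `patch_first_char` is a hypothetical helper, and `emit_p` stands in for the book's `B_dic`, a dict mapping each BMES state to per-character emission probabilities.

```python
def patch_first_char(char, states, emit_p):
    # If the leading character was never seen during training, give it a
    # nominal emission probability under the 'B' (word-Beginning) state so
    # that Viterbi initialization does not zero out every path.
    if not any(char in emit_p[y] for y in states):
        emit_p['B'][char] = 1.0
    return emit_p

# Hypothetical emission table in which '譬' was unseen in training:
emit_p = {'B': {'我': 0.2}, 'M': {}, 'E': {}, 'S': {'的': 0.3}}
patch_first_char('譬', 'BMES', emit_p)
print(emit_p['B']['譬'])  # 1.0
```

With the table patched this way, the `max()` in the first Viterbi step always has at least one non-zero candidate, so the `ValueError` below cannot occur.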

For your reference, I've attached the running logs of the original code and my modified code (in this pull request) below:

  • Your existing code and output
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')
text = '譬如'
res = hmm.cut(text)
print(text)
print(str(list(res)))
譬如

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-09fa8779fc79> in <module>
      5 res = hmm.cut(text)
      6 print(text)
----> 7 print(str(list(res)))

<ipython-input-1-b11d71c4e220> in cut(self, text)
    140         if not self.load_para:
    141             self.try_load_model(os.path.exists(self.model_file))
--> 142         prob, pos_list = self.viterbi(text, self.state_list, self.Pi_dic, self.A_dic, self.B_dic)
    143         begin, next = 0, 0
    144         for i, char in enumerate(text):

<ipython-input-1-b11d71c4e220> in viterbi(self, text, states, start_p, trans_p, emit_p)
    121             for y in states:
    122                 emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0 #设置未知字单独成词
--> 123                 (prob, state) = max(
    124                     [(V[t - 1][y0] * trans_p[y0].get(y, 0) *
    125                       emitP, y0)

ValueError: max() arg is an empty sequence
  • My code and output
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')
text = '譬如'
res = hmm.cut(text)
print(text)
print(str(list(res)))
譬如
['譬如']

This pull request partially answers the questions in #34

@chenw23 chenw23 changed the title [BUGFIX] Chapere-3 Fixes bugs for word segmentation model in HMM [BUGFIX] Chapter-3 Fixes bugs for word segmentation model in HMM Dec 9, 2020

chenw23 commented Dec 9, 2020

The commit 8077641 above fixes another bug in the HMM word segmentation model.
Since the HMM computes the probability of a sentence character by character from left to right, the running product becomes extremely small for characters late in a long sentence and eventually underflows to zero, at which point the algorithm can no longer produce predictions.
This bug can be fixed easily by splitting the input sentence into fragments at its native punctuation marks.

For your reference, I've attached the running logs of the original code and my modified code (in this pull request) below:

  • Your existing code and output:
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')

text = '丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。'
res = hmm.cut(text)
print(text)
print(str(list(res)))
丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-110a64a94244> in <module>
      5 res = hmm.cut(text)
      6 print(text)
----> 7 print(str(list(res)))

<ipython-input-1-b11d71c4e220> in cut(self, text)
    140         if not self.load_para:
    141             self.try_load_model(os.path.exists(self.model_file))
--> 142         prob, pos_list = self.viterbi(text, self.state_list, self.Pi_dic, self.A_dic, self.B_dic)
    143         begin, next = 0, 0
    144         for i, char in enumerate(text):

<ipython-input-1-b11d71c4e220> in viterbi(self, text, states, start_p, trans_p, emit_p)
    121             for y in states:
    122                 emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0 #设置未知字单独成词
--> 123                 (prob, state) = max(
    124                     [(V[t - 1][y0] * trans_p[y0].get(y, 0) *
    125                       emitP, y0)

ValueError: max() arg is an empty sequence
  • My code and output
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')

text = '丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。'
res = []
# split at each full-width comma, then re-attach the comma to the following
# fragment so the punctuation survives in the segmented output
split_text = text.split(",")
for i in range(1, len(split_text)):
    split_text[i] = "," + split_text[i]
# segment each short fragment independently to avoid probability underflow
for fragment in split_text:
    for segment in list(hmm.cut(fragment)):
        res.append(segment)
print(text)
print(str(res))
丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。
['丰子', '恺', '在', '1974年', '专门', '为', '他作', '了', '一幅斗方', '毛笔', '画', ':', '“', '种瓜', '得瓜', '”', ',', '并题', '词', ':', '“', '世庆', '贤', '台', '雅赏', '”', '(', '见', '图', ')', '。', '如今', '这幅', '珍贵', '的', '墨存', '还', '挂', '在', '胡', '世庆', '一家', '三口', '10平', '方米', '左右', '的', '斗室', '中', '。', '丰子', '恺', '把', '自己', '几', '十年', '人生', '阅历', '浓缩', '成三句话', '送', '给', '他—', '——', '“', '多', '读书', ',', '广结交', ',', '少', '说', '话', '”', ',', '把', '他', '引为', '忘年', '知己', '。']
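The underflow that motivates the split can be demonstrated in isolation: multiplying many sub-1 probabilities, as Viterbi does per character, eventually reaches floating-point zero. The numbers here are illustrative only, not taken from the model.

```python
p = 1.0
for _ in range(400):  # e.g. 400 characters, each step multiplying in
    p *= 1e-3         # a small transition * emission probability
print(p)  # 0.0 — the product has underflowed below the smallest float
```

A common alternative to splitting the sentence is to carry log-probabilities and sum them instead of multiplying raw probabilities.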

This commit in this pull request fixes #25

This design can train the model better and yields higher prediction accuracy.

chenw23 commented Dec 10, 2020

The commit 36aabd3 above enhances the model design in the HMM word segmentation model.
In the original version, the emission count in B_dic is not recorded for the first character of each line. In practice, this degrades the trained model: on my test sets, prediction accuracy drops by around 1%. Changing this design as in 36aabd3 improves the model's performance.
For convenient comparison, I copied the code comparison below:

  • Original Code
for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        self.Pi_dic[v] += 1  # 每个句子的第一个字的状态,用于计算初始状态概率
    else:
        self.A_dic[line_state[k - 1]][v] += 1  # 计算转移概率
        self.B_dic[line_state[k]][word_list[k]] = \
            self.B_dic[line_state[k]].get(word_list[k], 0) + 1.0  # 计算发射概率
  • Changed Code
for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        self.Pi_dic[v] += 1  # 每个句子的第一个字的状态,用于计算初始状态概率
    else:
        self.A_dic[line_state[k - 1]][v] += 1  # 计算转移概率
    self.B_dic[line_state[k]][word_list[k]] = \
        self.B_dic[line_state[k]].get(word_list[k], 0) + 1.0  # 计算发射概率
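To see the effect of moving the emission count out of the else branch, here is a minimal, self-contained run of the changed loop on a single hypothetical two-character word (variable names follow the book's code, with the `self.` prefix dropped):

```python
line_state = ['B', 'E']  # BMES labels for the two-character word 中国
word_list = ['中', '国']
Pi_dic = {s: 0 for s in 'BMES'}
A_dic = {s: {t: 0 for t in 'BMES'} for s in 'BMES'}
B_dic = {s: {} for s in 'BMES'}
count = {s: 0 for s in 'BMES'}

for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        Pi_dic[v] += 1                    # initial-state count
    else:
        A_dic[line_state[k - 1]][v] += 1  # transition count
    # emission count now recorded at every position, including k == 0
    B_dic[v][word_list[k]] = B_dic[v].get(word_list[k], 0) + 1.0

print(B_dic['B'])  # {'中': 1.0} — under the original code this entry was never counted
```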


chenw23 commented Dec 10, 2020

The commit d456e60 above enhances the model design in the HMM word segmentation model.
In practice, the last character of a labeled sentence must end a word, i.e., carry state E or S, which is the property this change relies on. On my test sets, this affects prediction accuracy by around 0.5%, and changing the design as in d456e60 improves the model's performance.
For convenient comparison, I copied the code comparison below:

  • Original Code
if emit_p['M'].get(text[-1], 0)> emit_p['S'].get(text[-1], 0):
    (prob, state) = max([(V[len(text) - 1][y], y) for y in ('E','M')])
else:
    (prob, state) = max([(V[len(text) - 1][y], y) for y in states])
  • Changed Code
(prob, state) = max((V[len(text) - 1][y], y) for y in ('E', 'S'))
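The rationale: in BMES tagging, a well-formed segmentation can only terminate in E (end of a multi-character word) or S (a single-character word), so the final max is restricted to those two states regardless of the last character's emission probabilities. A tiny illustration with hypothetical last-column Viterbi scores:

```python
# Hypothetical Viterbi scores for the last character of a sentence
V_last = {'B': 0.40, 'M': 0.25, 'E': 0.20, 'S': 0.15}
# Even though 'B' scores highest, it cannot legally end a sentence,
# so the choice is restricted to the terminating states E and S.
prob, state = max((V_last[y], y) for y in ('E', 'S'))
print(prob, state)  # 0.2 E
```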
