[BUGFIX] Chapter-3 Fixes bugs for word segmentation model in HMM #39
The commit 8077641 above fixes another bug in the HMM word segmentation model. For your reference, I've attached the running logs of the original code and my modified code (in this pull request) below:
Original code:
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')
text = '丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。'
res = hmm.cut(text)
print(text)
print(str(list(res)))
Modified code (this pull request):
hmm = HMM()
hmm.train('./data/trainCorpus.txt_utf8')
text = '丰子恺在1974年专门为他作了一幅斗方毛笔画:“种瓜得瓜”,并题词:“世庆贤台雅赏”(见图)。如今这幅珍贵的墨存还挂在胡世庆一家三口10平方米左右的斗室中。丰子恺把自己几十年人生阅历浓缩成三句话送给他———“多读书,广结交,少说话”,把他引为忘年知己。'
res = []
split_text = text.split(",")
for i in range(1, len(split_text)):
    split_text[i] = "," + split_text[i]
for fragment in split_text:
    for segment in list(hmm.cut(fragment)):
        res.append(segment)
print(text)
print(str(res))
The commit in this pull request fixes #25.
This design trains the model better and achieves higher prediction accuracy.
The commit 36aabd3 above improves the design of the HMM word segmentation model.
Original code:
for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        self.Pi_dic[v] += 1  # state of each sentence's first character, used to estimate the initial state probabilities
    else:
        self.A_dic[line_state[k - 1]][v] += 1  # count state transitions (transition probabilities)
        self.B_dic[line_state[k]][word_list[k]] = \
            self.B_dic[line_state[k]].get(word_list[k], 0) + 1.0  # count state-character emissions (emission probabilities)
Modified code (this pull request):
for k, v in enumerate(line_state):
    count[v] += 1
    if k == 0:
        self.Pi_dic[v] += 1  # state of each sentence's first character, used to estimate the initial state probabilities
    else:
        self.A_dic[line_state[k - 1]][v] += 1  # count state transitions (transition probabilities)
    self.B_dic[line_state[k]][word_list[k]] = \
        self.B_dic[line_state[k]].get(word_list[k], 0) + 1.0  # count emissions for every position, including the first character
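To see what moving the emission update out of the else branch changes: with the original indentation, the emission count for each sentence's first character is skipped, so a character that only ever appears sentence-initially in the corpus never receives an emission probability. A toy illustration (plain dictionaries standing in for the class attributes, not the code from the repository):

```python
from collections import defaultdict

line_state = ['B', 'E']   # BMES tags for a two-character word
word_list = ['中', '国']

B_old = defaultdict(dict)  # emissions with the original indentation
B_new = defaultdict(dict)  # emissions with the fixed indentation

for k, v in enumerate(line_state):
    if k > 0:
        # original code: emission counted only inside the else branch
        B_old[v][word_list[k]] = B_old[v].get(word_list[k], 0) + 1.0
    # fixed code: emission counted at every position, including k == 0
    B_new[v][word_list[k]] = B_new[v].get(word_list[k], 0) + 1.0

print(dict(B_old))  # {'E': {'国': 1.0}} -> '中' never gets an emission count
print(dict(B_new))  # {'B': {'中': 1.0}, 'E': {'国': 1.0}}
```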
…better performance results. This closes nlpinaction#24, closes nlpinaction#25, and closes nlpinaction#34.
The commit d456e60 above improves the design of the HMM word segmentation model.
Original code:
if emit_p['M'].get(text[-1], 0) > emit_p['S'].get(text[-1], 0):
    (prob, state) = max([(V[len(text) - 1][y], y) for y in ('E', 'M')])
else:
    (prob, state) = max([(V[len(text) - 1][y], y) for y in states])
Modified code (this pull request):
(prob, state) = max((V[len(text) - 1][y], y) for y in ('E', 'S'))
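The change matters because in the BMES tagging scheme a well-formed segmentation can only end in E (end of a multi-character word) or S (single-character word), so the Viterbi termination step should pick the best path only among those two states. A minimal, self-contained sketch of the corrected termination step (the `V` values are made up for illustration):

```python
# V[t][y]: toy probability of the best tag path ending in state y at
# position t (invented numbers, not real model output).
V = [
    {'B': 0.6, 'M': 0.1, 'E': 0.2, 'S': 0.1},
    {'B': 0.1, 'M': 0.3, 'E': 0.4, 'S': 0.2},
]

# A valid segmentation must finish on 'E' or 'S', so the final state is
# chosen only among those two, as in the corrected line above.
prob, state = max((V[-1][y], y) for y in ('E', 'S'))
print(prob, state)  # 0.4 E
```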
First, thanks to you and all the other authors and contributors of the book and this GitHub repository for providing material that helps beginners learn basic and advanced natural language processing techniques with deep neural networks. The provided code makes it especially convenient to reproduce the results in the book.
Nevertheless, there is a small defect that may prevent the algorithm you provide in the book from working in some scenarios.
In Chapter 3, you introduce the algorithm for word segmentation using an HMM. It is very nice that you consider the case where an unknown word appears in the middle of the input sentence, for which a typical HMM cannot provide predictions. However, there is also the case where an unknown character appears as the first character of a sentence, which causes the algorithm to crash.
This case can be fixed simply by checking whether the first character appeared in the training set and assigning it a default value if it did not. Since in Chinese a rarely used character appearing at the start of a sentence is most likely the beginning of a word, I assign it the label B here. This choice barely affects general prediction accuracy, since unknown characters make up only a small fraction of Chinese text; the main goal of this fix is to keep the algorithm from crashing.
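A minimal sketch of the check described above (the function and variable names are my own illustration, not necessarily the code in this pull request):

```python
def ensure_first_char_known(emit_p, first_char, default_state='B', eps=1e-8):
    """If the sentence's first character never appeared in training,
    give it a tiny emission probability under the 'B' (begin) state so
    the Viterbi initialization has something to work with.
    Illustrative sketch of the fix described above."""
    if not any(first_char in emit_p[s] for s in emit_p):
        emit_p[default_state][first_char] = eps
    return emit_p

# Toy emission table: the rare character '魑' was never seen in training.
emit_p = {'B': {'中': 0.5}, 'M': {}, 'E': {}, 'S': {'的': 0.3}}
ensure_first_char_known(emit_p, '魑')
print(emit_p['B']['魑'])  # 1e-08
```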
For your reference, I've attached the running logs of the original code and my modified code (in this pull request) below:
This pull request partially addresses the questions in #34.