Better Chinese support #10
Comments
I originally wanted to implement such a feature, but I couldn't afford to commit the time or take on the maintenance burden it would require. At one point I even implemented a Japanese parser, but Yomichan does a much better job, and the parser added too many dependencies. I didn't know of a dictionary-based way of splitting words before.
If you have any questions about the architecture of the program, feel free to ask!
Also, I was actually considering parsing the sentences with something like jieba. It uses a more sophisticated algorithm to split words and may work for words the dictionary does not cover (such as proper names).
I barely know a thing about programming, let alone coding, but you're talking about using spaCy, right? It supports 64 languages, so I guess that would work for all the other languages vocabesieve supports.
I am using
In fact, not only Chinese but also Japanese and Korean have this problem. In Vietnamese, spaces separate syllables; in Thai and Lao, spaces separate sentences.
@GrimPixel @BenMueller
I just knew about tools for word segmentation and saw you needed them. I have no experience with them.
Is this something that still needs work? Has there been any progress in the last few years? I'm happy to take a look at it if it's needed.
Since Chinese, unlike English, does not separate words or names with spaces, double-clicking a word simply selects the entire sentence.
The Firefox/Chrome extension Zhongwen and the Android app Pleco are examples of software that use various methods of automatically detecting dictionary words in running text (Zhongwen bundles CC-CEDICT and comes in at only 3.6 MB zipped).
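One simple way to detect the word under the cursor is to look for the longest dictionary entry that starts at the clicked character. The sketch below is only an illustration of that idea, not a description of how Zhongwen or Pleco actually work; the function name `longest_match_at`, the `max_len` limit, and the tiny word set are all assumptions made for the example.

```python
def longest_match_at(text, index, words, max_len=8):
    """Return the longest entry of `words` that starts at text[index].

    Falls back to the single character at `index` when nothing matches.
    """
    # Try the longest candidate first, then shrink one character at a time.
    for length in range(min(max_len, len(text) - index), 0, -1):
        candidate = text[index:index + length]
        if candidate in words:
            return candidate
    return text[index:index + 1]


# Tiny hypothetical word list standing in for the CC-CEDICT headwords.
SAMPLE_WORDS = {"我", "喜欢", "学习", "中文", "学"}

sentence = "我喜欢学习中文"
print(longest_match_at(sentence, 1, SAMPLE_WORDS))  # 喜欢
print(longest_match_at(sentence, 3, SAMPLE_WORDS))  # 学习
```

A double-click handler could call such a function with the character offset of the click and select the returned span instead of the whole sentence.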
It would be advantageous to integrate CC-CEDICT as a dictionary option for Chinese and to leverage it to help select words in a Chinese sentence. I'm willing to contribute some code to help do this, but I'd like some input from the primary developer before doing so.
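As a rough sketch of what this could look like, the snippet below parses lines in the CC-CEDICT text format (`Traditional Simplified [pinyin] /gloss/gloss/`) and splits a sentence with greedy forward maximum matching. The function names, the `max_len` limit, and the sample entries are hypothetical and only for illustration; this is not the project's code, and greedy matching can mis-split ambiguous strings or miss words that are not in the dictionary.

```python
import re

# CC-CEDICT line format: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def parse_cedict(lines):
    """Map each headword (traditional and simplified) to (pinyin, glosses) pairs."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        match = CEDICT_LINE.match(line)
        if match is None:
            continue
        trad, simp, pinyin, glosses = match.groups()
        entry = (pinyin, glosses.split("/"))
        entries.setdefault(simp, []).append(entry)
        if trad != simp:
            entries.setdefault(trad, []).append(entry)
    return entries


def segment(text, entries, max_len=8):
    """Greedy forward maximum matching; unknown characters become one-character words."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in entries:
                words.append(piece)
                i += length
                break
    return words


# Illustrative entries written in the CC-CEDICT format (not copied from the real file).
SAMPLE = """\
# hypothetical excerpt
中文 中文 [Zhong1 wen2] /Chinese language/
學習 学习 [xue2 xi2] /to learn/to study/
喜歡 喜欢 [xi3 huan5] /to like/to be fond of/
我 我 [wo3] /I/me/my/
""".splitlines()

entries = parse_cedict(SAMPLE)
print(segment("我喜欢学习中文", entries))  # ['我', '喜欢', '学习', '中文']
```

The same `entries` mapping could serve both purposes mentioned above: looking up definitions and deciding where word boundaries fall.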