[Question] How to break text into words? #2794
-
First of all, thanks for the crate! I am trying to write something like this ICU demo for computing ICU segments, using icu4x in Rust, but I could not find how to use icu::segmenter::WordBreakSegmenter in a locale-aware fashion. Using unicode_segmentation::UnicodeSegmentation is pretty straightforward, but it only implements the Default Word Boundary Specification. I would like to do the same job in a locale-aware way. I know that icu::segmenter::WordBreakSegmenter is behind the […]. As a benchmark, I would like to break this text in Primitive Irish (locale […]) into words. Thanks in advance!
Replies: 2 comments
-
Like ICU4C, ICU4X has dictionary-based segmentation for Chinese, Japanese, Korean, Thai, Burmese, Lao, and Khmer. It also has machine-learning-based segmentation for Thai, Burmese, Lao, and Khmer. @makotokato @aethanyc Do we support […]?
-
No, we don't. We currently only have the same set of dictionaries as ICU4C, so for pgl-Ogam segmentation, we'll use the Default Word Boundary Specification in UAX #29.
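
For anyone landing here later: since there is no Ogham-specific dictionary, the default rules should in practice break Ogham text at U+1680 OGHAM SPACE MARK. The snippet below is only a rough standard-library stand-in for that behaviour. It is not the UAX #29 algorithm and it does not call the ICU4X API. The function name `split_on_white_space` and the `SAMPLE` string are made up for illustration; the sample is a placeholder, not the text from the original question.

```rust
/// Placeholder Ogham sample (hypothetical, not the text from the question):
/// two runs of Ogham letters separated by U+1680 OGHAM SPACE MARK.
const SAMPLE: &str =
    "\u{168B}\u{1690}\u{168A}\u{1694}\u{1680}\u{168B}\u{1692}\u{1689}\u{1691}\u{1694}";

/// Rough approximation of word splitting for Ogham text.
///
/// `str::split_whitespace` splits on characters with the Unicode
/// White_Space property, which includes U+1680 OGHAM SPACE MARK.
/// Unlike a real UAX #29 segmenter, this drops the separators and
/// does not treat punctuation such as the feather marks specially.
fn split_on_white_space(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

Calling `split_on_white_space(SAMPLE)` yields the two letter runs on either side of the space mark. For real use, a UAX #29 implementation (unicode_segmentation, or the ICU4X word segmenter) is the better choice, since it also reports the separators and punctuation as their own segments.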