[Question] How to break text into words? #2794

saona-raimundo · 2022-10-26T21:08:54Z

First of all, thanks for the crate!!

I would like to build something like this ICU demo for computing ICU segments, using icu4x in Rust, but I could not find how to use icu::segmenter::WordBreakSegmenter in a locale-aware fashion.

Using unicode_segmentation::UnicodeSegmentation is pretty straightforward, but it only implements the Default Word Boundary Specification. I would like to do the same job in a locale-aware way.

I know that icu::segmenter::WordBreakSegmenter is behind the experimental flag, but I was wondering whether this is possible today. Its constructor does not seem to be locale-aware (but maybe I am wrong).

As a benchmark, I would like to break this text in Primitive Irish (locale pgl, written in the Ogham alphabet)

ᚁᚔ ᚇᚔᚂᚔᚄ ᚇᚒᚔᚈ ᚃᚓᚔᚅ

into words ᚁᚔ, ᚇᚔᚂᚔᚄ, ᚇᚒᚔᚈ, and ᚃᚓᚔᚅ.

Thanks in advance!

Answered by aethanyc

sffc (Maintainer) · 2022-10-26T22:11:35Z

Like ICU4C, ICU4X has dictionary-based segmentation for Chinese, Japanese, Korean, Thai, Burmese, Lao, and Khmer. It also has machine-learning-based segmentation for Thai, Burmese, Lao, and Khmer.

@makotokato @aethanyc Do we support pgl-Ogam segmentation?

aethanyc (Maintainer) · 2022-10-27T22:00:42Z

> @makotokato @aethanyc Do we support pgl-Ogam segmentation?

No, we don't. We currently only have the same set of dictionaries as ICU4C, so for pgl-Ogam segmentation we fall back to the Default Word Boundary Specification in UAX #29.
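
For the Ogham sample in the question, the default rules should already give the expected result: Ogham letters are ordinary alphabetic characters, so the only boundaries fall around the spaces. The sketch below is a standard-library stand-in that reproduces that expected output. It is not a UAX #29 implementation and does not use ICU4X or unicode_segmentation; `split_on_whitespace` is a hypothetical helper name introduced for illustration only.

```rust
/// Hypothetical helper, not part of ICU4X or unicode_segmentation.
/// Splits on Unicode whitespace, which includes U+1680 OGHAM SPACE MARK
/// as well as the ordinary space U+0020.
fn split_on_whitespace(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // Sample text from the question (Primitive Irish, Ogham script).
    let sample = "ᚁᚔ ᚇᚔᚂᚔᚄ ᚇᚒᚔᚈ ᚃᚓᚔᚅ";

    // Print one word per line.
    for word in split_on_whitespace(sample) {
        println!("{word}");
    }
}
```

Whitespace splitting matches the default word boundaries only for simple input like this one. Text with punctuation, apostrophes, or numbers needs a real UAX #29 segmenter.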
