Unicode sentence boundaries #24

tomcumming · 2017-05-06T22:16:12Z

This is an implementation of the Unicode sentence break specification, including changes to the Python scripts to fetch the sentence break test data.

I welcome any advice on improving it.

It passes all the tests provided here: http://www.unicode.org/Public/9.0.0/ucd/auxiliary/SentenceBreakTest.txt
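For readers unfamiliar with the test data mentioned above, here is a minimal sketch of how one line of a UCD break test file could be turned into expected segments. In these files `÷` marks a boundary, `×` marks "no boundary", the remaining tokens are hexadecimal code points, and everything after `#` is a comment. The function name `parse_break_test_line` and the sample line are hypothetical and are not taken from this PR's scripts.

```python
def parse_break_test_line(line):
    """Parse one break-test line into a list of expected segments.

    Hypothetical helper for illustration; not the code used in this PR.
    Returns None for blank or comment-only lines.
    """
    # Everything after '#' is a human-readable comment.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None

    segments = []
    current = ""
    for token in data.split():
        if token == "÷":
            # Boundary: close the segment collected so far, if any.
            if current:
                segments.append(current)
                current = ""
        elif token == "×":
            # No boundary here: keep accumulating into the same segment.
            continue
        else:
            # A code point written in hexadecimal, e.g. 0041 for 'A'.
            current += chr(int(token, 16))
    if current:
        segments.append(current)
    return segments


# Made-up sample line in the same format as the test file.
sample = "÷ 0048 × 0069 × 002E × 0020 ÷ 004F × 006B ÷\t# illustrative only"
print(parse_break_test_line(sample))
```

Each parsed line gives the list of sentences that a conforming splitter should produce for the concatenated input, which is how a generated test table can be checked against an iterator.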

Manishearth · 2017-05-16T20:53:23Z

Sorry for letting this stagnate! I'm too busy to review this right now, but will try to get to it soon!

Sentence boundaries are something I've wanted implemented here for a while 😄

Manishearth · 2017-05-26T17:40:07Z

(Still no time to look at this, apologies. I really hope to get to it soon.)

rth · 2019-05-06T22:04:08Z

This PR looks quite good and it would be great to have this functionality. Could I help in any way?

tomcumming · 2019-05-07T13:05:57Z

@rth I can split this out into another crate if required?

rth · 2019-05-08T13:20:17Z

@tomcumming I would really like to use this implementation (and compare it with other sentence-splitting approaches) in rth/vtext#51. Having this implementation in the unicode-segmentation crate would be ideal, but if it is unlikely to be reviewed in the near future, putting it in some other crate could be a workaround.

Any chance @Manishearth that you would have some review bandwidth for this, or could suggest someone who could review it?

Manishearth · 2019-05-09T20:33:08Z

@rth mind doing a review yourself as well? I can also try and review, but I don't think I'd be able to give this a proper thorough review and would feel more comfortable if more people have gone through it.

rth · 2019-05-09T20:54:08Z

Sure, I'll try to review it in the next few days.

Manishearth

Code looks correct! Mostly want more documentation.

src/lib.rs

src/sentence.rs

tomcumming · 2019-05-13T18:34:13Z

@Manishearth @rth I have updated the PR including requested changes

Manishearth · 2019-05-13T20:17:42Z

Looks good! @rth want to do a second review?

rth

Thanks a lot for the review, @Manishearth!

I went through the code in more detail. I find it quite readable and don't really have anything to add (though I am fairly new to Rust and don't know that much about the Unicode segmentation specs).

I can confirm that src/tables.rs and src/testdata.rs in this PR can be regenerated in their current state with the included Python scripts, but this requires the following change in scripts/unicode.py:

-        os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"
+        os.system("curl -O http://www.unicode.org/Public/9.0.0/ucd/%s"

as otherwise the data for the latest Unicode version (12.0) is downloaded.
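One way to avoid this kind of drift is to build the download URL from an explicit version instead of the moving UNIDATA alias. The sketch below only illustrates the idea. The names `UCD_BASE` and `ucd_url` are hypothetical and are not part of scripts/unicode.py, and the default version simply mirrors the 9.0.0 path used in the diff above.

```python
# Hypothetical helper, not taken from scripts/unicode.py.
UCD_BASE = "http://www.unicode.org/Public"


def ucd_url(filename, version="9.0.0"):
    """Return the URL of a UCD file for a pinned Unicode version.

    Pinning the version keeps regenerated tables and test data stable
    when a newer Unicode release is published under UNIDATA.
    """
    return "%s/%s/ucd/%s" % (UCD_BASE, version, filename)


# The sentence break test file referenced at the top of this PR.
print(ucd_url("auxiliary/SentenceBreakTest.txt"))
```

With a helper like this, moving to a newer Unicode release becomes a deliberate one-line change rather than a side effect of when the script happens to be run.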

src/lib.rs

src/sentence.rs

tomcumming · 2019-05-15T09:10:20Z

Fixing the URL for the test data should probably be another PR.

rth · 2019-05-15T09:13:18Z

> Fixing the URL for test data should probably be another PR

Yes, I'll do it.

Thanks @tomcumming, I don't have any other comments.

Manishearth · 2019-05-15T16:06:40Z

Thank you! I'll push a release soonish

Manishearth · 2019-05-15T16:20:50Z

Published 1.3.0. Thanks for the work on this, and sorry for the delay in reviewing!

Fetch and generate sentence tests, property table

fa10dd3

tomcumming changed the title ~~Code review please~~ Code review please (Unicode sentence boundary partial implementation) May 6, 2017

tomcumming force-pushed the master branch from 14dbeb8 to 93b0d56 Compare May 16, 2017 20:08

Added forward iterator for unicode sentences

7ac6f29

Passes all tests in the examples provided here: http://www.unicode.org/Public/9.0.0/ucd/auxiliary/SentenceBreakTest.txt

tomcumming force-pushed the master branch from 93b0d56 to 7ac6f29 Compare May 16, 2017 20:10

tomcumming changed the title ~~Code review please (Unicode sentence boundary partial implementation)~~ Unicode sentence boundaries May 16, 2017

rth mentioned this pull request May 3, 2019

Add sentence splitter rth/vtext#51

Closed

Manishearth reviewed May 9, 2019

tomcumming added 2 commits May 13, 2019 19:06

Adds unicode_sentences and split_sentence_bound_indices

50058a5

Documentation and code reorg

9c7abf2

Manishearth approved these changes May 13, 2019

rth reviewed May 14, 2019

src/lib.rs (resolved)

rth reviewed May 14, 2019

src/sentence.rs (resolved)

rth mentioned this pull request May 15, 2019

MAINT Fixes for Python scripts #54

Merged

Manishearth merged commit c7a6b6f into unicode-rs:master May 15, 2019
