-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode sentence boundaries #24
Conversation
Passes all tests in the examples provided here: http://www.unicode.org/Public/9.0.0/ucd/auxiliary/SentenceBreakTest.txt
Sorry for letting this stagnate! I'm rather busy to review this right now, but will try to get to it soon! Sentence boundaries is something I've wanted implemented here for a while 😄 |
(still no time to look at this, apologies. Really hope to get to it soon) |
This PR looks quite good and it would be great to have this functionality. Could I help in any way ? |
@rth I can split this out into another crate if required? |
@tomcumming I would really like to use this implementation (and compare it with other sentences splitting approaches) in rth/vtext#51 . Having this implementation in the Any chance @Manishearth that you would have some review bandwidth for this, or could suggest someone who could review it? |
@rth mind doing a review yourself as well? I can also try and review, but I don't think I'd be able to give this a proper thorough review and would feel more comfortable if more people have gone through it. |
Sure, I'll try to review it in the next few days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks correct! Mostly want more documentation.
@Manishearth @rth I have updated the PR including requested changes |
Looks good! @rth want to do a second review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the review @Manishearth !
I went through the code in more detail, I find it quite readable and I don't really have anything to add. (Though I am fairly new to rust and don't know that much about Unicode segmentation specs).
I can confirm that src/tables.rs
and src/testdata.rs
in this PR can be re-generated in their current state with the included python scripts, but they require setting,
scripts/unicode.py
- os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"
+ os.system("curl -O http://www.unicode.org/Public/9.0.0/ucd/%s"
as otherwise data for latest Unicode 12.0 is downloaded.
Fixing the URL for test data should probably be another PR |
Yes, I'll do it. Thanks @tomcumming I don't have any other comments. |
Thank you! I'll push a release soonish |
Published 1.3.0. Thanks for the work on this, and sorry for the delay in reviewing! |
This is an implementation of the sentence breaks specification, including changes to the python files to grab the sentence break test data.
I welcome any advice for improving.