Offsets incorrect #192
Comments
Thank you for the bug report. I should have added sentences to the unit tests that contain characters consisting of multiple bytes. Stupid me, I forgot that Rust indices are byte indices but Python indices are character indices. So the indices need to be converted. I'm sorry, I'm going to fix it as soon as possible.
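To illustrate the conversion described above, here is a minimal sketch in Python. It assumes the offsets coming from the Rust side are UTF-8 byte offsets; the helper name `byte_to_char_offset` is hypothetical and not part of the library's API, and the actual fix in the linked commit may work differently.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python character offset.

    Hypothetical helper for illustration only.
    """
    encoded = text.encode("utf-8")
    if not 0 <= byte_offset <= len(encoded):
        raise ValueError("byte offset out of range")
    # Decoding the byte prefix yields exactly the characters that
    # precede the offset, so its length is the character offset.
    # A UnicodeDecodeError here means the offset falls inside a
    # multi-byte character.
    return len(encoded[:byte_offset].decode("utf-8"))


text = "Parlez-vous français? Ich spreche Deutsch."
# "ç" occupies two bytes in UTF-8, so every byte offset after it
# is one larger than the matching character offset.
byte_start = text.encode("utf-8").index(b"Ich")
char_start = byte_to_char_offset(text, byte_start)
print(byte_start, char_start, text[char_start:char_start + 3])
```

Re-encoding the text on every call is wasteful for many offsets; a real implementation would more likely walk the string once and build a lookup table.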
Oh, no worries at all. This is such an awesome repository. Thanks for the great work!
In the meantime, you can use the latest 1.3 release. This is the pure Python implementation where the indices are handled correctly. It's just slower than version 2.0.
@boltonn I've fixed the bug now in version 2.0.1. See commit pemistahl/lingua-rs@d64963b if you are interested in the details. Please try again. Feel free to open a new issue if you encounter other problems. Thanks again.
Below are two minimal examples where the offsets do not seem correct. In the first, it correctly identifies both languages but gives a start_index past the length of the text. In the second, the result does not match the README. When run on a document of about 2k characters, the end_index has been as high as 7k.
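A quick way to check for this symptom is to test whether reported offsets fit inside the text at all. The sketch below is only an illustration: `offsets_in_bounds` is a hypothetical helper, and the sample text is made up rather than taken from the report. An end_index of 7k on a 2k-character document is consistent with byte offsets, since UTF-8 uses up to four bytes per character.

```python
def offsets_in_bounds(text: str, start: int, end: int) -> bool:
    """Return True if (start, end) is a valid character span of text.

    Hypothetical sanity check for illustration only.
    """
    return 0 <= start <= end <= len(text)


# Made-up sample: each of these characters takes three bytes in UTF-8.
text = "日本語" * 10
char_len = len(text)                   # length in characters
byte_len = len(text.encode("utf-8"))   # length in UTF-8 bytes

print(char_len, byte_len)
print(offsets_in_bounds(text, 0, char_len))  # character-based end fits
print(offsets_in_bounds(text, 0, byte_len))  # byte-based end overshoots
```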