Offsets incorrect #192

Closed
boltonn opened this issue Nov 20, 2023 · 4 comments
Labels
bug Something isn't working

Comments

boltonn commented Nov 20, 2023

Below are two minimal examples where the offsets do not seem correct. In the first, the detector correctly identifies both languages but returns a start_index past the end of the text. In the second, the result does not match the README. On a document of about 2k characters, end_index has been as high as 7k.

from lingua import Language, LanguageDetectorBuilder

# example 1: Chinese followed by English
detector = LanguageDetectorBuilder.from_all_spoken_languages().with_low_accuracy_mode().build()
text = "他能在多大程度上对此施加影响是很重要的,因为无论结果如何,他都将难脱干系。\n\n相关主题内容\nThis is an example English sentence."
for result in detector.detect_multiple_languages_of(text):
    print(f"{result.language.name}: '{text[result.start_index:result.end_index]}'")

# example 2: the sentence from the README
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
sentence = "Parlez-vous français? " + \
    "Ich spreche Französisch nur ein bisschen. " + \
    "A little bit is better than nothing."
for result in detector.detect_multiple_languages_of(sentence):
    print(f"{result.language.name}: '{sentence[result.start_index:result.end_index]}'")

# example 1 output
CHINESE: '他能在多大程度上对此施加影响是很重要的,因为无论结果如何,他都将难脱干系。

相关主题内容
This is an example English sentence.'
ENGLISH: ''

# example 2 output
FRENCH: 'Parlez-vous français? I'
GERMAN: 'ch spreche Französisch nur ein bisschen. A '
ENGLISH: 'little bit is better than nothing.'
pemistahl added the bug label Nov 21, 2023

pemistahl (Owner) commented

Thank you for the bug report. I should have added sentences to the unit tests that contain multi-byte characters. Stupid me, I forgot that Rust string indices are byte indices but Python string indices are character indices, so the indices need to be converted. I'm sorry, I'm going to fix it as soon as possible.
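To illustrate the mismatch described above, here is a minimal sketch of converting a UTF-8 byte offset into a Python character offset. The helper `byte_to_char_index` is hypothetical and is not part of lingua's API, nor is it necessarily how the library fixes the problem. The byte offsets 0 and 23 for the French span are an assumption inferred from the example 2 output in the report, where the slice runs one character too far because "ç" occupies two bytes.

```python
def byte_to_char_index(text: str, byte_index: int) -> int:
    """Map a UTF-8 byte offset in `text` to a Python character offset.

    Hypothetical helper for illustration; not part of lingua's API.
    Raises UnicodeDecodeError if byte_index falls inside a multi-byte character.
    """
    # Take the first byte_index bytes and count how many characters they decode to.
    prefix = text.encode("utf-8")[:byte_index]
    return len(prefix.decode("utf-8"))


sentence = (
    "Parlez-vous français? "
    "Ich spreche Französisch nur ein bisschen. "
    "A little bit is better than nothing."
)

# Assumed byte offsets of the French span, inferred from the reported output.
byte_start, byte_end = 0, 23
char_start = byte_to_char_index(sentence, byte_start)
char_end = byte_to_char_index(sentence, byte_end)
print(repr(sentence[char_start:char_end]))  # 'Parlez-vous français? '
```

Decoding a prefix costs time proportional to the offset for every call. For many spans over a long document, a single pass that builds a byte-to-character lookup table would probably be a better fit.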

boltonn (Author) commented Nov 21, 2023

Oh, no worries at all. This is such an awesome repository. Thanks for the great work!

pemistahl (Owner) commented

In the meantime, you can use the latest 1.3 release. This is the pure Python implementation where the indices are handled correctly. It's just slower than version 2.0.

pemistahl (Owner) commented Nov 23, 2023

@boltonn I've fixed the bug now in version 2.0.1. See commit pemistahl/lingua-rs@d64963b if you are interested in the details. Please try again. Feel free to open a new issue if you encounter other problems. Thanks again.
