Offsets incorrect #192

boltonn · 2023-11-20T22:55:42Z

Below are two minimal examples in which the offsets appear to be incorrect. In the first, both languages are identified correctly, but a start_index is reported that lies past the end of the text. In the second, the result does not match the README. On a document of about 2k characters, I have seen end_index values as high as 7k.

from lingua import Language, LanguageDetectorBuilder

# example 1
detector = LanguageDetectorBuilder.from_all_spoken_languages().with_low_accuracy_mode().build()
text = "他能在多大程度上对此施加影响是很重要的，因为无论结果如何，他都将难脱干系。\n\n相关主题内容\nThis is an example English sentence."
# Slice the original text with the reported offsets to see what each result covers
for result in detector.detect_multiple_languages_of(text):
    print(f"{result.language.name}: '{text[result.start_index:result.end_index]}'")


# example 2
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
sentence = (
    "Parlez-vous français? "
    "Ich spreche Französisch nur ein bisschen. "
    "A little bit is better than nothing."
)
for result in detector.detect_multiple_languages_of(sentence):
    print(f"{result.language.name}: '{sentence[result.start_index:result.end_index]}'")

# example 1 output
CHINESE: '他能在多大程度上对此施加影响是很重要的，因为无论结果如何，他都将难脱干系。

相关主题内容
This is an example English sentence.'
ENGLISH: ''

# example 2 output
FRENCH: 'Parlez-vous français? I'
GERMAN: 'ch spreche Französisch nur ein bisschen. A '
ENGLISH: 'little bit is better than nothing.'
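
One way to check whether the reported offsets might be UTF-8 byte positions rather than character positions is to compute the byte spans of the three sentences by hand and slice the string with them. The sketch below does not call the library at all; the helper `byte_spans` is a hypothetical name introduced only for this check, and the sentence boundaries are assumed to be the ones the README shows. Each accented character (ç, ö) takes two bytes, so every slice drifts one position further to the right. CJK characters take three bytes each, which would also be consistent with offsets far beyond the length of a mostly Chinese text.

```python
# Sentences from example 2, split at the boundaries the README shows.
segments = [
    "Parlez-vous français? ",
    "Ich spreche Französisch nur ein bisschen. ",
    "A little bit is better than nothing.",
]
sentence = "".join(segments)


def byte_spans(parts):
    """Return (start, end) UTF-8 byte offsets for consecutive parts.

    Hypothetical helper for this check only; not part of the library.
    """
    spans, pos = [], 0
    for part in parts:
        size = len(part.encode("utf-8"))
        spans.append((pos, pos + size))
        pos += size
    return spans


# Assumption under test: byte offsets are being used as character offsets.
misused = [sentence[start:end] for start, end in byte_spans(segments)]
for piece in misused:
    print(repr(piece))
```

If the printed slices match the example 2 output above, the byte-offset explanation is at least plausible.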


pemistahl · 2023-11-21T23:13:39Z

Thank you for the bug report. I should have added sentences to the unit tests that contain characters consisting of multiple bytes. Stupid me, I forgot that Rust indices are byte indices but Python indices are character indices. So the indices need to be converted. I'm sorry, I'm going to fix it as soon as possible.
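
For readers who want to see what such a conversion could look like on the Python side, here is a minimal sketch. It is not the fix that went into the library; `byte_to_char_index` is a hypothetical helper, and it assumes the offsets are UTF-8 byte positions that fall on character boundaries.

```python
def byte_to_char_index(text: str, byte_index: int) -> int:
    """Map a UTF-8 byte offset into `text` to a Python character offset.

    Hypothetical helper, not part of the library. Raises UnicodeDecodeError
    if `byte_index` falls in the middle of a multi-byte character.
    """
    # Decode only the prefix up to the byte offset and count its characters.
    return len(text.encode("utf-8")[:byte_index].decode("utf-8"))


text = (
    "他能在多大程度上对此施加影响是很重要的，因为无论结果如何，他都将难脱干系。"
    "\n\n相关主题内容\nThis is an example English sentence."
)

# Locate the English sentence by its byte offset, then convert to a character offset.
byte_start = text.encode("utf-8").find(b"This")
char_start = byte_to_char_index(text, byte_start)
print(byte_start, char_start, repr(text[char_start:]))
```

Each call re-encodes the whole string, so this is only suitable for a quick check; converting many offsets on a long document would call for a single pass over the text.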

boltonn · 2023-11-21T23:55:54Z

Oh, no worries at all. This is such an awesome repository. Thanks for the great work!

pemistahl · 2023-11-22T05:59:20Z

In the meantime, you can use the latest 1.3 release. This is the pure Python implementation where the indices are handled correctly. It's just slower than version 2.0.

pemistahl · 2023-11-23T22:42:22Z

@boltonn I've fixed the bug now in version 2.0.1. See commit pemistahl/lingua-rs@d64963b if you are interested in the details. Please try again. Feel free to open a new issue if you encounter other problems. Thanks again.

pemistahl added the bug Something isn't working label Nov 21, 2023

pemistahl closed this as completed Nov 23, 2023
