Minimum Text Length Threshold for Reliable Language Detection in Langdetect #110

Open
Chetan-Yeola opened this issue Nov 3, 2023 · 1 comment

Comments

@Chetan-Yeola

What is considered a 'short text' in langdetect, and is there a specific minimum text length threshold for reliable language detection?

@jeanbaptisteb

@Chetan-Yeola According to the presentation page of this other library, langdetect performs poorly on texts about as long as Twitter messages ("For very short text snippets such as Twitter messages, they do not provide adequate results."). That suggests anything under 280 characters might give poor results, assuming the page does not exaggerate the problem. However, the page is a bit vague, and the threshold (if any) might be higher than 280 characters. It probably also depends on the language (I guess some languages are much easier to detect than others, e.g. Hebrew, which uses an alphabet few other languages use, vs. Spanish, which is very similar to other Romance languages).

But you could test this automatically with a large sample of short texts taken from various language editions of Wikipedia, to see whether the error rate is acceptable for your requirements. The page does not mention the classification error rate they observed to support this statement, so if your own error-rate requirements are fairly liberal, it may be worth taking the time to test.
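
For what it's worth, here is a rough sketch of how such a test could look. It is only an illustration under some assumptions: `truncate` and `error_rate_by_length` are hypothetical helper names of mine, the three sample sentences are placeholders (a real test would use thousands of labelled snippets, e.g. from Wikipedia dumps), and the lengths tried are arbitrary. The only langdetect calls used are `detect` and `DetectorFactory.seed`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def truncate(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars characters and trim whitespace."""
    return text.strip()[:max_chars].strip()


def error_rate_by_length(
    samples: Iterable[Tuple[str, str]],
    detect_fn: Callable[[str], str],
    lengths: List[int],
) -> Dict[int, float]:
    """Return {length: share of samples misclassified at that length}.

    samples yields (expected_language_code, text) pairs; every text is
    truncated to each length in `lengths` before being classified.
    """
    errors: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for expected_lang, text in samples:
        for n in lengths:
            snippet = truncate(text, n)
            if not snippet:
                continue  # nothing left to classify
            totals[n] += 1
            try:
                predicted = detect_fn(snippet)
            except Exception:
                predicted = None  # a detector failure counts as an error
            if predicted != expected_lang:
                errors[n] += 1
    return {n: errors[n] / totals[n] for n in lengths if totals[n]}


if __name__ == "__main__":
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed

    # Placeholder data: replace with a large labelled sample.
    samples = [
        ("en", "This is a short example sentence written in English for testing detection."),
        ("es", "Este es un texto breve de ejemplo escrito en español para probar la detección."),
        ("he", "זוהי דוגמה קצרה לטקסט בעברית לצורך בדיקת זיהוי שפה."),
    ]
    rates = error_rate_by_length(samples, detect, [10, 20, 50, 100, 280])
    for n, rate in rates.items():
        print(f"<= {n:3d} chars: error rate {rate:.1%}")
```

Plotting the error rate against length, per language, should show where (if anywhere) it drops below what you can tolerate, which would be your practical "minimum length" for that language mix.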
