Minimum Text Length Threshold for Reliable Language Detection in Langdetect #110

Chetan-Yeola · 2023-11-03T11:20:15Z

What is considered a 'short text' in langdetect, and is there a specific minimum text length threshold for reliable language detection?

jeanbaptisteb · 2024-05-02T09:14:21Z

@Chetan-Yeola According to the presentation page of this other library, langdetect performs poorly on texts about as long as Twitter messages ("For very short text snippets such as Twitter messages, they do not provide adequate results."). That suggests anything under 280 characters might give poor results, assuming the page does not exaggerate the problem. However, the page is a bit vague, and the threshold (if any) might be higher than 280 characters. It also probably depends on the language: I guess that some languages are much easier to detect than others. For example, compare detecting Hebrew, which uses a rare alphabet, with detecting Spanish, which is very similar to other Romance languages.

But you could test this automatically with a large sample of short texts taken from Wikipedia editions in various languages, to see whether the error rate is acceptable for your requirements. The page above does not mention the classification error rate behind its statement, so if your own error-rate requirements are very liberal, it may be worth taking the time to test.
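A minimal sketch of such a test, in case it helps. It truncates each labelled text to several lengths and reports the share of wrong guesses per length. The helper names (`error_rate_by_length`, `run_with_langdetect`) and the sample format are my own assumptions, not part of langdetect. The only langdetect calls used are `detect` and `DetectorFactory.seed`, and collecting the Wikipedia samples is left to you.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def error_rate_by_length(
    samples: Iterable[Tuple[str, str]],
    lengths: List[int],
    detect_fn: Callable[[str], str],
) -> Dict[int, float]:
    """Return {length: error rate} for texts truncated to each length.

    samples   -- (expected_language_code, text) pairs (hypothetical format)
    lengths   -- character lengths to test, e.g. [20, 50, 100, 280]
    detect_fn -- any function mapping a text to a language code
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for expected, text in samples:
        for n in lengths:
            if len(text) < n:
                continue  # sample is too short to fill this length bucket
            try:
                guess = detect_fn(text[:n])
            except Exception:
                guess = None  # a failed detection is counted as an error
            totals[n] += 1
            if guess != expected:
                errors[n] += 1
    # Lengths for which no sample was long enough are left out.
    return {n: errors[n] / totals[n] for n in lengths if totals[n]}


def run_with_langdetect(
    samples: Iterable[Tuple[str, str]], lengths: List[int]
) -> Dict[int, float]:
    """Same measurement, using langdetect as the detector (imported lazily)."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # make langdetect's results reproducible
    return error_rate_by_length(samples, lengths, detect)
```

Calling `run_with_langdetect(samples, [20, 50, 100, 280])` on a few thousand samples per language should show at which length the error rate drops to something you can accept. Computing the rate per language as well would show whether the threshold differs between, say, Hebrew and Spanish.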
