This repository provides access to two critical benchmarks designed to advance the development and evaluation of large language models (LLMs) in Thai: ThaiCLI for evaluating cultural intelligence and Thai-H6 for assessing core language capabilities. These benchmarks are tailored for Thai, an under-represented language in LLM research, to promote a deeper understanding of both linguistic and cultural nuances.
Dataset Availability: The datasets are scheduled for release in mid-November.
As the significance of large language models continues to grow, there is an increasing need for evaluation frameworks that rigorously assess both language proficiency and cultural understanding—especially in languages like Thai, which are under-represented in LLM research.
- ThaiCLI evaluates LLM performance on Thai-specific cultural intelligence tasks, offering insights into how well models can understand and respond to culturally sensitive queries.
- Thai-H6 adapts six global benchmarks to evaluate core linguistic capabilities in Thai, providing a solid foundation for Thai LLM assessment.
ThaiCLI is specifically designed to evaluate LLMs' comprehension of cultural and societal norms in Thailand. It includes two question formats, Factoid and Instruction, spanning seven themes: the royal family, religion, culture, economy, humanity, lifestyle, and politics.
Question Format | Theme | # of Samples |
---|---|---|
Factoid | Royal Family | 520 |
Factoid | Religion | 220 |
Factoid | Culture | 210 |
Factoid | Economy | 210 |
Factoid | Humanity | 210 |
Factoid | Lifestyle | 210 |
Factoid | Politics | 210 |
Factoid | Total | 1,790 |
Instruction | Royal Family | 25 |
Instruction | Religion | 25 |
Instruction | Culture | 10 |
Instruction | Economy | 10 |
Instruction | Humanity | 10 |
Instruction | Lifestyle | 10 |
Instruction | Politics | 10 |
Instruction | Total | 100 |
- Factoid Questions: Conversational questions related to daily life in Thailand.
- Instruction Tasks: Culturally contextualized tasks that require the LLM to follow specific instructions.
- Chosen Answers: Reflect cultural sensitivity and inclusivity.
- Rejected Answers: Demonstrate a lack of awareness of or sensitivity toward Thai cultural norms. (An illustrative record is sketched below.)
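Because each Instruction task ships with a chosen and a rejected response, a sample can be read as a standard preference pair. The record below is a minimal sketch; the field names and example text are assumptions for illustration, not the released schema.

```python
# Hypothetical ThaiCLI instruction record; field names and text are illustrative
# placeholders, not the released schema.
sample = {
    "theme": "Religion",
    "instruction": "Explain how a visitor should behave when entering a Thai temple.",
    "chosen": "A response that respects Thai Buddhist etiquette (modest dress, removing shoes, ...)",
    "rejected": "A response that dismisses or ignores those norms",
}
```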
ThaiCLI responses are scored by an external LLM judge (the latest stable GPT-4o release), and results are reported below for both closed-source and open-source models.
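For concreteness, here is a minimal sketch of such an LLM-as-judge loop using the OpenAI Python client. The rubric wording, the 0-10 scale, and the aggregation are assumptions for illustration, not the exact judging prompt used for ThaiCLI.

```python
# Minimal LLM-as-judge sketch; the rubric and scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are evaluating an assistant's answer to a question about Thailand.\n"
    "Rate the answer from 0 to 10 for correctness and cultural sensitivity "
    "toward Thai norms. Reply with the number only.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def judge(question: str, answer: str) -> float:
    """Ask GPT-4o to score a single model response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return float(response.choices[0].message.content.strip())

# Per-sample ratings would then be averaged to produce per-split scores.
```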
Closed-source models:

Model | ThaiCLI (Avg.) | Factoid | Instruction |
---|---|---|---|
GPT-4o | 8.39 | 8.42 | 8.35 |
GPT-4 Turbo | 7.31 | 7.56 | 7.05 |
GPT-4o Mini | 8.10 | 8.16 | 8.04 |
GPT-3.5 Turbo | 5.86 | 6.72 | 4.99 |
Gemini Pro | 7.45 | 7.36 | 7.54 |
Claude Sonnet | 8.17 | 8.20 | 8.14 |
Open-source models:

Model | ThaiCLI (Avg.) | Factoid | Instruction |
---|---|---|---|
Meta-Llama-3.1-8B-Instruct | 4.85 | 5.95 | 3.75 |
Meta-Llama-3.1-70B-Instruct | 5.49 | 5.86 | 5.11 |
Qwen2-72B-Instruct | 6.15 | 6.96 | 5.34 |
Llama-3-Typhoon-v1.5x-70b-Instruct | 5.97 | 6.75 | 5.19 |
Sailor-14B-Chat | 5.66 | 6.51 | 4.81 |
SeaLLMs-v3-7B-Chat | 6.23 | 7.05 | 5.41 |
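The ThaiCLI (Avg.) column appears to be the simple mean of the Factoid and Instruction scores; for example, GPT-4o: (8.42 + 8.35) / 2 ≈ 8.39.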
Thai-H6 adapts six globally recognized benchmarks (ARC, HellaSwag, MMLU, TruthfulQA, GSM8K, and Winogrande) to evaluate the core capabilities of LLMs in the Thai language.
Dataset Name | # of Samples |
---|---|
th-ARC | 1,222 |
th-HellaSwag | 10,052 |
th-MMLU | 14,585 |
th-TruthfulQA | 817 |
th-GSM8K | 1,324 |
th-Winogrande | 1,272 |
Each dataset tests different reasoning, knowledge, and language understanding tasks, providing comprehensive coverage of LLM performance in Thai.
We adopt the same evaluation strategy for each dataset as in the original English H6 benchmark.
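As a rough illustration of that strategy: most H6 tasks are multiple-choice and are typically scored by comparing the model's log-likelihood of each answer choice under a few-shot prompt and taking the highest-scoring choice, while GSM8K is scored by matching the generated answer. The sketch below shows the log-likelihood comparison with Hugging Face transformers; the model name is a placeholder and the tokenization boundary handling is simplified.

```python
# Sketch of log-likelihood multiple-choice scoring (H6-style); not the exact
# harness configuration used for Thai-H6.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def choice_loglikelihood(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of each token given the preceding context.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lls = log_probs[torch.arange(targets.size(0)), targets]
    # Keep only the tokens belonging to the answer choice (assumes the prompt
    # tokenizes to the same prefix on its own, which is an approximation).
    return token_lls[prompt_ids.size(1) - 1:].sum().item()

def predict(prompt: str, choices: list[str]) -> int:
    """Index of the choice with the highest log-likelihood."""
    scores = [choice_loglikelihood(prompt, c) for c in choices]
    return scores.index(max(scores))
```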
Model | Thai-H6 (Avg.) | th-ARC | th-HellaSwag | th-MMLU | th-TruthfulQA | th-Winogrande | th-GSM8K |
---|---|---|---|---|---|---|---|
Meta-Llama-3.1-8B-Instruct | 52.42 | 39.59 | 52.01 | 53.63 | 44.81 | 65.59 | 58.91 |
Meta-Llama-3.1-70B-Instruct | 63.89 | 54.10 | 65.34 | 71.30 | 51.80 | 73.48 | 67.32 |
Qwen2-72B-Instruct | 68.80 | 58.11 | 70.12 | 75.78 | 62.03 | 73.80 | 73.01 |
Llama-3-Typhoon-v1.5x-70b-Instruct | 65.48 | 54.86 | 64.73 | 69.10 | 53.24 | 73.24 | 77.71 |
Sailor-14B-Chat | 56.11 | 47.44 | 61.82 | 54.12 | 52.50 | 70.64 | 50.11 |
SeaLLMs-v3-7B-Chat | 51.85 | 46.76 | 56.05 | 60.61 | 48.24 | 66.61 | 32.83 |
If you find our dataset useful, please cite it as follows:
@misc{kim2024representingunderrepresentedculturalcore,
title={Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models},
author={Dahyun Kim and Sukyung Lee and Yungi Kim and Attapol Rutherford and Chanjun Park},
year={2024},
eprint={2410.04795},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.04795},
}