[📄ArXiv Paper](https://arxiv.org/abs/2412.13147) [📚LeaderBoard]
- [2024.12.18] We release the ArXiv Paper of GPassK. 🎉🎉🎉
G-Pass@k is a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both a model's peak performance potential and its stability. It is accompanied by LiveMathBench, a dynamic benchmark of challenging, contemporary mathematical problems designed to minimize the risk of data leakage during evaluation. To track the latest performance and stability of LLMs, we will keep updating the benchmark with new competition-level mathematical problems and report the latest G-Pass@k results of models on it.
Formally, G-Pass@k is defined as

$$\text{G-Pass@}k = \mathbb{E}_{\text{questions}}\left[\frac{\binom{c}{k}}{\binom{n}{k}}\right]$$

where $n$ represents the total number of generations per question, and $c$ denotes the number of generations resulting in correct solutions. Its generalization to a tolerance threshold $\tau$ is

$$\text{G-Pass@}k_{\tau} = \mathbb{E}_{\text{questions}}\left[\sum_{j=\lceil \tau \cdot k \rceil}^{c} \frac{\binom{c}{j} \cdot \binom{n-c}{k-j}}{\binom{n}{k}}\right]$$

where $\lceil \tau \cdot k \rceil$ denotes the smallest integer greater than or equal to $\tau \cdot k$. Finally, mG-Pass@k is defined as

$$\text{mG-Pass@}k = 2\int_{0.5}^{1.0} \text{G-Pass@}k_{\tau}\,\mathrm{d}\tau = \frac{2}{k} \sum_{i=\lceil 0.5 \cdot k \rceil + 1}^{k} \text{G-Pass@}k_{\frac{i}{k}}$$

Intuitively, mG-Pass@k provides an interpolated estimate of the area under the $\text{G-Pass@}k_{\tau}$ curve, serving as a comprehensive metric that integrates all G-Pass@k values with $\tau$ ranging from 0.5 to 1.0.
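To make the definitions concrete, here is a minimal Python sketch that evaluates both metrics for a single question from $n$, $c$, $k$, and $\tau$ as defined above; averaging the per-question values over the benchmark yields the expectations. The function names are ours for illustration and are not part of any released API.

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """G-Pass@k_tau for one question: the probability that, when drawing k of
    the n generations without replacement, at least ceil(tau * k) of the drawn
    generations are correct (a hypergeometric tail probability)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    m = ceil(tau * k)  # minimum number of correct generations among the k drawn
    return sum(
        comb(c, j) * comb(n - c, k - j) for j in range(m, min(c, k) + 1)
    ) / comb(n, k)

def mg_pass_at_k(n: int, c: int, k: int) -> float:
    """mG-Pass@k for one question: interpolated area under G-Pass@k_tau
    for tau in [0.5, 1.0]."""
    start = ceil(0.5 * k) + 1
    return 2.0 / k * sum(g_pass_at_k(n, c, k, i / k) for i in range(start, k + 1))

# Example: 16 generations for a question, 10 of them correct, evaluated at k = 8.
print(g_pass_at_k(16, 10, 8, 0.5))  # at least half of the 8 samples correct
print(g_pass_at_k(16, 10, 8, 1.0))  # all 8 samples correct
print(mg_pass_at_k(16, 10, 8))
```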
The current release of the benchmark is the LiveMathBench-202412 version.
OpenCompass is a toolkit for evaluating the performance of large language models (LLMs). To use GPassK in OpenCompass, you can follow the steps below:
Coming Soon...
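Until the official steps are published, here is a rough sketch of what an OpenCompass evaluation config for LiveMathBench could look like. The `livemathbench_gen` dataset config and the model config path are assumed names for illustration only and may differ from the eventual official instructions:

```python
# eval_livemathbench.py -- hypothetical OpenCompass config sketch.
# Both imported config paths below are placeholders, not confirmed names.
from mmengine.config import read_base

with read_base():
    from .datasets.livemathbench.livemathbench_gen import livemathbench_datasets  # placeholder
    from .models.hf_internlm.hf_internlm2_5_7b_chat import models  # placeholder

datasets = livemathbench_datasets
```

Such a config would typically be launched with `python run.py configs/eval_livemathbench.py`. Note that computing G-Pass@k requires sampling at least $n \geq k$ generations per question.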
If you use GPassK in your research, please cite the following paper:
```bibtex
@misc{liu2024llmscapablestablereasoning,
    title={Are Your LLMs Capable of Stable Reasoning?},
    author={Junnan Liu and Hongwei Liu and Linchen Xiao and Ziyi Wang and Kuikun Liu and Songyang Gao and Wenwei Zhang and Songyang Zhang and Kai Chen},
    year={2024},
    eprint={2412.13147},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2412.13147},
}
```