[📄ArXiv Paper](https://arxiv.org/abs/2412.13147) [📚LeaderBoard]
- [2024.12.18] We release the ArXiv Paper of GPassK. 🎉🎉🎉
G-Pass@k is a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both a model's peak performance potential and its stability. It is accompanied by LiveMathBench, a dynamic benchmark of challenging, contemporary mathematical problems designed to minimize the risk of data leakage during evaluation. To track the latest performance and stability of LLMs, we will keep updating the benchmark with new competition-level mathematical problems and report the latest G-Pass@k results of models on it.
Formally, G-Pass@k is defined as

$$\text{G-Pass@}k = \mathbb{E}_{\text{questions}}\left[\frac{\binom{c}{k}}{\binom{n}{k}}\right]$$

where $n$ represents the total number of generations per question, and $c$ denotes the number of generations resulting in correct solutions. Its generalization to a tolerance threshold $\tau$ is

$$\text{G-Pass@}k_{\tau} = \mathbb{E}_{\text{questions}}\left[\sum_{j=\lceil \tau \cdot k \rceil}^{c} \frac{\binom{c}{j} \cdot \binom{n-c}{k-j}}{\binom{n}{k}}\right]$$

where $\lceil \tau \cdot k \rceil$ denotes the smallest integer greater than or equal to $\tau \cdot k$. Finally, mG-Pass@k is defined as

$$\text{mG-Pass@}k = 2\int_{0.5}^{1.0} \text{G-Pass@}k_{\tau}\,\mathrm{d}\tau = \frac{2}{k} \sum_{i=\lceil 0.5 \cdot k \rceil + 1}^{k} \text{G-Pass@}k_{\frac{i}{k}}$$

Intuitively, mG-Pass@k provides an interpolated estimate of the area under the $\text{G-Pass@}k_{\tau}$ curve, serving as a comprehensive metric that integrates all G-Pass@k values with $\tau$ ranging from 0.5 to 1.0.
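To make the definitions concrete, here is a minimal Python sketch that evaluates both metrics for a single question from $n$, $c$, $k$, and $\tau$ as defined above; averaging the per-question values over the benchmark yields the expectations. The function names are ours for illustration and are not part of any released API.

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """G-Pass@k_tau for one question: the probability that, when drawing k of
    the n generations without replacement, at least ceil(tau * k) of the drawn
    generations are correct (a hypergeometric tail probability)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    m = ceil(tau * k)  # minimum number of correct generations among the k drawn
    return sum(
        comb(c, j) * comb(n - c, k - j) for j in range(m, min(c, k) + 1)
    ) / comb(n, k)

def mg_pass_at_k(n: int, c: int, k: int) -> float:
    """mG-Pass@k for one question: interpolated area under G-Pass@k_tau
    for tau in [0.5, 1.0]."""
    start = ceil(0.5 * k) + 1
    return 2.0 / k * sum(g_pass_at_k(n, c, k, i / k) for i in range(start, k + 1))

# Example: 16 generations for a question, 10 of them correct, evaluated at k = 8.
print(g_pass_at_k(16, 10, 8, 0.5))  # at least half of the 8 samples correct
print(g_pass_at_k(16, 10, 8, 1.0))  # all 8 samples correct
print(mg_pass_at_k(16, 10, 8))
```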
The current release of the benchmark is the LiveMathBench-202412 version.
OpenCompass is a toolkit for evaluating the performance of large language models (LLMs). To use GPassK in OpenCompass, you can follow the steps below:
Coming Soon...
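Until the official steps are published, here is a rough sketch of what an OpenCompass evaluation config for LiveMathBench could look like. The `livemathbench_gen` dataset config and the model config path are assumed names for illustration only and may differ from the eventual official instructions:

```python
# eval_livemathbench.py -- hypothetical OpenCompass config sketch.
# Both imported config paths below are placeholders, not confirmed names.
from mmengine.config import read_base

with read_base():
    from .datasets.livemathbench.livemathbench_gen import livemathbench_datasets  # placeholder
    from .models.hf_internlm.hf_internlm2_5_7b_chat import models  # placeholder

datasets = livemathbench_datasets
```

Such a config would typically be launched with `python run.py configs/eval_livemathbench.py`. Note that computing G-Pass@k requires sampling at least $n \geq k$ generations per question.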
If you use GPassK in your research, please cite the following paper:
```bibtex
@misc{liu2024llmscapablestablereasoning,
    title={Are Your LLMs Capable of Stable Reasoning?},
    author={Junnan Liu and Hongwei Liu and Linchen Xiao and Ziyi Wang and Kuikun Liu and Songyang Gao and Wenwei Zhang and Songyang Zhang and Kai Chen},
    year={2024},
    eprint={2412.13147},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2412.13147},
}
```