SciBench

SciBench is a novel benchmark for college-level scientific problems sourced from instructional textbooks. The benchmark is designed to evaluate the complex reasoning capabilities, strong domain knowledge, and advanced calculation skills of LLMs.

We developed an innovative evaluation protocol for a detailed analysis of reasoning abilities. This involves instructing LLMs to self-identify and categorize their errors within a predefined set of capabilities. This process offers a fine-grained understanding of where the models are falling short.

Update

Our paper has been accepted for ICML 2024.
Our dataset is now accessible at Hugging Face Datasets.
The multimodal dataset is available in the ./dataset/img folder.
Our dataset has been updated with minor changes. The previous version can be accessed in the "old" branch. For the latest results based on our most current dataset, please visit our website.

Data

The SciBench dataset is stored in the dataset/original folder in JSON format. Each file is a list of dictionaries and can be loaded with the following script. Each file corresponds to one textbook; the textbooks are described in detail in the paper.

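import json

subject = 'atkins'  # textbook identifier; each JSON file corresponds to one textbook
with open("./dataset/original/{}.json".format(subject), encoding='utf-8') as json_file:
    problems = json.load(json_file)  # list of dictionaries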
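
Once a file is loaded, it can be useful to check its shape before going further. The sketch below relies only on what is stated above (each file is a list of dictionaries) and assumes nothing about field names. The helper name `summarize_problems` and the sample records are hypothetical stand-ins, not part of the repository.

```python
from collections import Counter


def summarize_problems(problems):
    """Return the number of problems and how often each key appears."""
    if not isinstance(problems, list):
        raise TypeError("expected a list of dictionaries")
    key_counts = Counter()
    for problem in problems:
        key_counts.update(problem.keys())
    return len(problems), dict(key_counts)


# Hypothetical stand-in records; real data comes from ./dataset/original/.
sample = [{"question": "q1", "answer": "1.0"}, {"question": "q2"}]
count, keys = summarize_problems(sample)
print(count, keys)
```

Passing the `problems` list loaded above in place of `sample` would show which keys are present in a given textbook file and whether every record carries them.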

Evaluation

To evaluate LLMs on our dataset, please refer to the eval folder.

Analysis (Evaluation Protocol)

The evaluation protocol starts by comparing LLM solutions with reference (correct) solutions, with the assistance of human annotators, to identify the reasons for errors. These reasons are then summarized into ten essential scientific problem-solving skills with which LLMs may struggle. Subsequently, an LLM verifier automatically attributes each incorrectly answered problem to the lack of a specific skill. The resulting error profiles show which skills a given prompting strategy improves and allow direct comparison across strategies.
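
As an illustration of what an error profile could look like, the sketch below aggregates per-problem skill attributions into shares. It is a minimal example under assumptions, not the code used in the paper: the function name `error_profile`, the skill labels, and the sample attributions are all hypothetical.

```python
from collections import Counter


def error_profile(attributed_skills):
    """Map each skill label to its share of the incorrectly answered problems."""
    counts = Counter(attributed_skills)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {skill: n / total for skill, n in counts.most_common()}


# Hypothetical verifier output: one skill label per incorrectly answered problem.
zero_shot = ["calculation", "calculation", "logical_reasoning", "calculation"]
few_shot = ["calculation", "logical_reasoning"]

profile_zero = error_profile(zero_shot)
profile_few = error_profile(few_shot)
print(profile_zero)
print(profile_few)
```

Comparing two such profiles side by side is one way to see which lacking skills account for a larger or smaller share of errors under different prompting strategies.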

Run evaluation protocol

After running the evaluation step, use its output to run the evaluation protocol. The --setting argument selects the experiment setting, one of zero_nosys, zero, zeroCot, few, fewCot, python, or wolfram; each is explained in the eval folder.

cd eval
OPENAI_API_KEY=your_key python ana_error.py --setting your_setting
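
To run the protocol over several settings in one go, a small wrapper can help. The sketch below is a hypothetical convenience script, not part of the repository: the names `build_command` and `run_protocol` are illustrative, and it assumes `ana_error.py` accepts `--setting` exactly as in the command above and that `OPENAI_API_KEY` is already exported in the environment.

```python
import os
import subprocess

# Settings listed in this README.
SETTINGS = ["zero_nosys", "zero", "zeroCot", "few", "fewCot", "python", "wolfram"]


def build_command(setting):
    """Build the ana_error.py command line for one experiment setting."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting}")
    return ["python", "ana_error.py", "--setting", setting]


def run_protocol(settings, eval_dir="eval"):
    """Run ana_error.py once per setting from inside the eval folder."""
    if "OPENAI_API_KEY" not in os.environ:
        raise RuntimeError("OPENAI_API_KEY is not set")
    for setting in settings:
        subprocess.run(build_command(setting), cwd=eval_dir, check=True)


# Preview the commands without executing them.
for s in SETTINGS:
    print(" ".join(build_command(s)))
```

Calling `run_protocol(["zero", "zeroCot"])` from the repository root would then execute the two corresponding commands in sequence and stop at the first failure.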

Citation

If you find our paper useful, please cite it:

@inproceedings{wang2024scibench,
author = {Wang, Xiaoxuan and Hu, Ziniu and Lu, Pan and Zhu, Yanqiao and Zhang, Jieyu and Subramaniam, Satyen and Loomba, Arjun R. and Zhang, Shichang and Sun, Yizhou and Wang, Wei},
title = {{SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models}},
booktitle = {Proceedings of the Forty-First International Conference on Machine Learning},
year = {2024},
}
