Skip to content

AIFlames/Reflection-Bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🪞Reflection-Bench: probing AI intelligence with reflection


Las Meninas, Diego Velázquez, 1656

Motivation: to what extent LLMs possess intelligence?

Geoffrey Hinton endorsed the legislation of SB 1047, the strict AI safety bill, because he thinks that LLMs are actually reasoning and understanding. But Yann LeCun criticized that Hinton's premature worries are originated from the overestimation of LLMs intelligence in prediction, planning, common sense etc. Obviously, this longlasting debate on AI intelligence directly impacts our understanding of AI trustworthiness and regulation.

Lots of studies explore this question from various aspects such as reasoning, planning, cognitive flexibility, self cognition etc. These angles, however, seem to be interconnected in certain under-explained way related to epistemology of AI systems. So we aim to clarify this enigma from the lens of cognitive science and evaluate the general process of intelligence underlying above aspects.

Reflection: the general process of intelligent systems


Reflection & Meta-reflection

Intelligent systems existing in the uncertain world must interact with, learn about, and adapt to the environment. One emerging school in cognitive science describes these systems, from first principles, as predictive machines that keep predicting what will happen next with their internal models. Such a smart energy-saving strategy allows systems adapting to the environment flexibly only focusing on minimizing the unexpected by updating thoughts or actions. We define the general process of such intelligence in everyday life as reflection - predicting based on priors, making decision leading to desired state, perceiving mismatch between observation and prediction, updating prior belief accordingly. Reflection is a complex capability requiring cognitive elements including perception, memory, belief updating, decision making, prediction, counterfactual thinking, and meta-reflection.

Therefore, we can evaluate intelligence of AI systems by focusing on this general process of intelligence as well, i.e., Reflection-Bench.

Reflection-Bench

Reflection-Bench involves 7 tasks corresponding to different cognitive elements required in reflection.


Reflection-Bench: assessment architecture

Experiment

Cognition focus Task Trials Sessions
Perception Oddball Paradigm 50 3
Working memory N-back task, n=2 52 2
Belief updating Probability Reversal Task (PRT), p = 0.9 40 2
Decision making Wisconsin Card Sorting Task 108 2
Prediction Weather Prediction Task 100 2
Counterfactual Double-Choice Iowa Gambling Task 100 2
Meta-reflection Meta-PRT, interval = 3, p = 1 60 2
Model $/1 M input tokens $/ 1 M output tokens Actual Cost ($)
o1-preview 15 60 281
o1-mini 3 12 57
gpt-4 10 30 45
gpt-4o 5 15 20.1
gpt-4o-mini 0.15 0.6 0.6
claude-3.5-sonnet 3 15 16.5
gemini-1.5-pro 3.5 10.5 12
llama-3.1-405b 6 6 18
llama-3.1-70b 0.35 0.4 1.48
llama-3.1-8b 0.05 0.05 0.27
qwen-2.5-72b 0.57 1.71 0
qwen-2.5-32b 0.5 1 0
qwen-2.5-14b 0.28 0.56 0

Results


Performances of 13 moels on Reflection-Bench

Paper

For detailed information about Reflection-Bench, please read our papaer!

You can cite Reflection-Bench as:

@misc{li2024reflectionbenchprobingaiintelligence,
      title={Reflection-Bench: probing AI intelligence with reflection}, 
      author={Lingyu Li and Yixu Wang and Haiquan Zhao and Shuqi Kong and Yan Teng and Chunbo Li and Yingchun Wang},
      year={2024},
      eprint={2410.16270},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.16270}, 
}

About

probing AI intelligence with reflection

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published