HumanEval.jl

This project is a Julia version of HumanEval. Our goal is to gain a better understanding of how well the latest LLMs perform with the Julia programming language.

model	evalplus *	basic **
gpt-4-0125-preview	0.774	0.823
gpt-4-turbo	0.756	0.823
mistral-large-instruct-2407	0.744	0.823
gpt-4o	0.738	0.817
claude-3-5-sonnet-20240620	0.72	0.823
gpt-4-1106-preview	0.72	0.805
DeepSeek-Coder-V2-Instruct	0.695	0.774
DeepSeek-V2-Chat	0.689	0.756
Llama-3.1-405B-Instruct	0.628	0.744
claude-3-opus-20240229	0.61	0.689
Qwen2-72B-Instruct	0.598	0.665
Phind-CodeLlama-34B-v2	0.591	0.659
gpt-3.5-turbo-0125	0.591	0.652
mistral-large-latest	0.573	0.659
gpt-3.5-turbo-0613	0.567	0.64
gpt-3.5-turbo-1106	0.555	0.628
DeepSeek-Coder-33B-instruct	0.543	0.598
Magicoder-S-DS-6.7B	0.543	0.616
WizardCoder-33B-V1.1	0.543	0.604
Qwen1.5-110B-Chat	0.53	0.598
yi-large	0.524	0.652
deepseek-coder-6.7b-instruct	0.488	0.549
CodeLlama-70b-Instruct-hf	0.457	0.561
code-millenials-34b	0.439	0.5
Magicoder-S-CL-7B	0.402	0.463
CodeLlama-34b-Instruct-hf	0.311	0.366
Starling-LM-7B-alpha	0.299	0.354
Yi-34B-Chat	0.232	0.317

* evalplus: scores are calculated based on test cases from both HumanEval and evalplus.

** basic: scores are calculated based on test cases from HumanEval only.

By default, all results are pass@1 scores obtained with greedy decoding. Models are deployed with vLLM, which uses the predefined chat template stored in the tokenizer. Feel free to create an issue if you'd like to see other models evaluated.
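
As a rough illustration of what a pass@1 score means under greedy decoding: there is one completion per problem, so the score is simply the fraction of problems whose completion passes every test. The sketch below assumes the 164 problems of the original HumanEval; the `pass_at_1` helper and the pass count of 127 are hypothetical, chosen only to show the arithmetic, and are not taken from this project's code or logs.

```shell
#!/usr/bin/env bash
# pass_at_1 PASSED TOTAL
# Prints passed / total rounded to three decimals.
# With a single greedy completion per problem, this ratio is pass@1.
pass_at_1() {
  local passed="$1" total="$2"
  awk -v p="$passed" -v t="$total" 'BEGIN {
    if (t <= 0) { print "total must be positive" > "/dev/stderr"; exit 1 }
    printf "%.3f\n", p / t
  }'
}

# Hypothetical counts: 127 of 164 problems pass all their tests.
pass_at_1 127 164   # prints 0.774
```

Because decoding is greedy, rerunning the same model against the same endpoint should in principle reproduce the same completions, which is why a single sample per problem is used here rather than a sampled pass@k estimate.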

Getting Started

First, deploy the model you'd like to evaluate behind an OpenAI-compatible endpoint, such as vLLM or Ollama. You'll need OPENAI_API_KEY and OPENAI_BASE_URL in the next step.

To test models from Anthropic, you should set ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL instead.
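
Before starting an evaluation, it can help to confirm that the endpoint is reachable and to see which model names it serves. This is a minimal sketch, assuming the server exposes the OpenAI-style `GET /models` route and accepts the key as a bearer token. The helper names `models_url` and `list_models` are made up for this example, and the default base URL is just the README's sample value.

```shell
#!/usr/bin/env bash
# models_url: build the model-listing URL from OPENAI_BASE_URL.
# A trailing slash on the base URL is dropped to avoid a double "//".
models_url() {
  local base="${OPENAI_BASE_URL:-http://localhost:8000/v1}"
  printf '%s/models\n' "${base%/}"
}

# list_models: query the endpoint, sending OPENAI_API_KEY as a bearer token.
# Returns a non-zero exit status if the server is unreachable or
# answers with an HTTP error.
list_models() {
  curl --fail --silent --show-error \
    -H "Authorization: Bearer ${OPENAI_API_KEY:-}" \
    "$(models_url)"
}
```

With a server running, export OPENAI_API_KEY and OPENAI_BASE_URL and call `list_models`; the names in the JSON response are the candidates for the MODEL variable.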

Evaluate with Docker

docker run -it --rm \
  -v /PATH/TO/SAVE/RESULTS/generations:/workspace/HumanEval.jl/generations \
  -e OPENAI_API_KEY=YOUR_SECRET \
  -e OPENAI_BASE_URL=http://localhost:8000/v1 \
  -e RETESTITEMS_NWORKERS=16 \
  -e RETESTITEMS_TESTITEM_TIMEOUT=15 \
  -e MODEL=gpt-3.5-turbo-0613 \
  ghcr.io/01-ai/humaneval.jl:latest

/PATH/TO/SAVE/RESULTS/generations: this folder will contain the raw responses from the model, the extracted Julia code snippets, and the unit test results.
YOUR_SECRET: this should be the same as the key you provided when deploying the server.
RETESTITEMS_NWORKERS: the number of workers used to run tests. Adjust it to the number of cores in your test environment.
RETESTITEMS_TESTITEM_TIMEOUT: the timeout for each test item, in seconds. The default of 15 seconds should be enough to pass all the test cases.
MODEL: the model name you specified when deploying the model. If you use vLLM, it should be the same as the value of --served-model-name.
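
One way to pick a value for RETESTITEMS_NWORKERS is to read the core count from the system instead of hard-coding it. This is a minimal sketch, assuming a Linux or macOS shell; `detect_cores` is a hypothetical helper and not part of this project.

```shell
#!/usr/bin/env bash
# detect_cores: print the number of online CPU cores.
# Tries nproc (GNU coreutils) first, then getconf, and falls back to 1
# if neither is available.
detect_cores() {
  nproc 2>/dev/null \
    || getconf _NPROCESSORS_ONLN 2>/dev/null \
    || echo 1
}

export RETESTITEMS_NWORKERS="$(detect_cores)"
echo "RETESTITEMS_NWORKERS=${RETESTITEMS_NWORKERS}"
```

The exported value can then be passed through to the container with `-e RETESTITEMS_NWORKERS="$RETESTITEMS_NWORKERS"` in place of the fixed 16 in the command above.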

Evaluate with local development environment

Make sure you have the latest Julia installed.
Clone and enter the root of this project.
Start the Julia REPL with the following command:

OPENAI_API_KEY=debug OPENAI_BASE_URL=http://localhost:8000/v1 RETESTITEMS_NWORKERS=16 RETESTITEMS_TESTITEM_TIMEOUT=15 MODEL=gpt-3.5-turbo-0613 julia --project

The environment variables have the same meaning as above.

Execute the following commands in the Julia REPL:

julia> import Pkg; Pkg.instantiate();

julia> include("src/evaluation.jl")

julia> evaluate("YOUR_MODEL_NAME")

Once finished, the results will be displayed. You may find more details under the generations directory.

Related Work

nuprl/MultiPL-E contains Julia prompts translated from the original Python version of HumanEval. However, in my (limited) Julia programming experience, those prompts are not always accurate or idiomatic.
Julia-LLM-Leaderboard focuses on practicality and simplicity.
EvalPlus Leaderboard

Future Work

Explore advanced techniques to improve LLMs' performance on code in general, especially iterative code refinement.
Julia-specific LLM training and fine-tuning. We want to know the minimum requirements for training a code LLM.
Improve the Yi series models' performance on code.

We're hiring! If you're interested in working on code LLMs at 01.ai, please contact yi@01.ai.

FAQ

Acknowledgement

This project relies heavily on many features provided by ReTestItems.jl. Many thanks to Nick Robinson for his help during development.

