Want to see how your favorite local LLMs fare against Humanity's Last Exam? Or what difference quantization can make?
This repo aims to allow anyone to get up and running with Humanity's Last Exam (or similar benchmarks!) and Ollama locally.
The official HLE repo with evaluation scripts is notoriously hard to use, only lightly documented, and built to work solely with the OpenAI API. While Ollama exposes an OpenAI API compatible endpoint, this project takes a two-way approach, featuring both a pure Ollama API implementation and an OpenAI API compatible backend to show what's possible.
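To give a rough idea of the difference, here is approximately what the two request styles look like against a local Ollama instance. The payloads are illustrative only; the requests this project actually sends will differ:

```sh
# Ollama's native chat API (roughly what a pure Ollama backend talks to)
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "stream": false
  }'

# The same request against Ollama's OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}]
  }'
```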
Important
The quality of the benchmark results depends entirely on how well the judge model does its job. If it judges poorly, good models may look worse and bad models better. Make sure to choose a strong judge model and verify results yourself.
There are ongoing problems with the quality of the judge model's responses. Answers are still frequently misjudged. Please exercise caution or review results manually until cutting-edge models can consistently tell correct from incorrect responses.
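If you want to spot-check a verdict by hand, one low-tech option is to re-ask the judge model directly via the Ollama CLI. The prompt below is purely illustrative and not the prompt this project uses; fill in the placeholders from the question and the recorded answers:

```sh
# Illustrative manual spot check against the judge model (phi4 is just an example)
ollama run phi4:latest "Question: <question>. Reference answer: <reference>. Candidate answer: <candidate>. Is the candidate answer correct? Answer yes or no."
```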
It's luckily simple! First of all, make sure you have a Hugging Face account. The HLE dataset is gated, which means you will need to authenticate in order to use it. You will also need to visit the HLE page on Hugging Face and agree to your information being shared in order to gain access.
python3 -m venv .venv
. ./.venv/bin/activate
pip install -r ./requirements.txt
Generate a Hugging Face access token here and copy it to your clipboard. Then, run
huggingface-cli login
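If you prefer a non-interactive setup (for example in a script or CI job), the Hugging Face CLI also accepts the token directly, and you can confirm that authentication worked:

```sh
# Non-interactive alternative: pass the access token directly (placeholder token shown)
huggingface-cli login --token hf_xxxxxxxxxxxxxxxx

# Verify that you are logged in
huggingface-cli whoami
```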
Now you're all set! For example, run
python3 ./src/eval.py --model=gemma3 --judge=llama3:8b --num-questions=150
to begin the exam for the model! Results will also be written to an output file in the project root directory, ending in `.results.json`.
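The exact schema of that file is specific to this project, but you can quickly inspect it with Python's built-in JSON pretty-printer (the filename below is only an example; use whatever file the run produced):

```sh
# Pretty-print a results file; replace the filename with the one in your project root
python3 -m json.tool ./gemma3.results.json
```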
Tip: You can specify several models separated by commas to make them compete against each other. You can, and must, specify only one judge model (the model that will rate the answers), and it's highly recommended to choose a judge that isn't among the models taking the exam.
Important: Do not simply perform separate runs with `--num-questions` specified, as each run will independently pick a different random set of questions from the dataset. If you want to compare models on a limited number of questions, use the tip described above.
For text-only models, specify `--only-text` to use only the text subset of the HLE dataset.
Tip
You can also use any OpenAI API compatible endpoint by providing the `--backend=openai` flag. Make sure to set the `HLE_EVAL_API_KEY` and `HLE_EVAL_ENDPOINT` environment variables (see the example below).
Please note that image input (vision) is still unstable for OpenAI endpoints: while it works, it consumes an absurd amount of tokens (which you may be billed for!) and is not recommended for use. You can use a lighter variant by setting `USE_EXPERIMENTAL_IMAGE_UPLOAD` to `True` in `src/constants.py`, but this does not work with every endpoint.
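For example, a run against a hosted OpenAI API compatible endpoint might look like the following. The endpoint URL, API key, and model names are placeholders; adjust them for your provider:

```sh
# Placeholders only: adjust endpoint, key and model names for your provider
export HLE_EVAL_ENDPOINT="https://api.example.com/v1"
export HLE_EVAL_API_KEY="sk-..."
python3 ./src/eval.py --backend=openai --model=gpt-4o-mini --judge=gpt-4o --num-questions=50 --only-text
```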
Below is a comparison of two commonly used models. Each was asked 100 questions from the text-only subset, and the answers were judged by Microsoft's Phi-4 (14b).
These are by no means rigorously collected, high-quality results and are therefore not representative.
$ python3 ./src/eval.py --model=llama3:8b,mistral:7b --judge=phi4:latest --num-questions=100 --only-text
[...]
hle-eval-ollama: INFO - llama3:8b: 7 correct, 89 wrong (7.29 percent)
hle-eval-ollama: INFO - mistral:7b: 13 correct, 83 wrong (13.54 percent)
Note: these benchmark results were captured using automatic response judging, which, as mentioned above, is still relatively unreliable. The results are not representative and should not be cited.
- `HLE_EVAL_ENDPOINT`: specifies the host to connect to.
- `HLE_EVAL_API_KEY`: specifies the Bearer token to use for authentication.
Huge thanks to the creators of Humanity's Last Exam for the extraordinarily hard questions!
Also, huge thanks to the Ollama contributors and creators of all the packages used for this project!