CTFChallenge Benchmark is a comprehensive evaluation suite designed to assess large language models (LLMs) in the domain of cybersecurity. The benchmark leverages high-quality, real-world CTF (Capture The Flag) challenges collected from past competitions across the internet, ensuring a rigorous and practical test of a model's security reasoning capabilities.
Unlike traditional objective scoring methods—which rely on exact answer matching and are prone to "answer surfing" (where models exploit memorization rather than demonstrate genuine understanding)—this benchmark employs a subjective scoring system. Evaluations are conducted by human experts who assess the reasoning process, methodology, and partial progress, minimizing the risk of models achieving high scores through rote memorization.
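To make the rubric idea concrete, below is a minimal, hypothetical sketch (in Python) of what a single expert-graded record could look like under such a scheme. The field names, 0-1 score ranges, category labels, and equal weighting are illustrative assumptions, not part of the benchmark's actual specification.

```python
from dataclasses import dataclass

# Hypothetical illustration only: sketches one expert-graded submission with
# rubric dimensions for reasoning, methodology, and partial progress, instead
# of a binary exact-match flag check.

@dataclass
class GradedSubmission:
    challenge_id: str          # made-up identifier, e.g. "crypto-example-01"
    category: str              # e.g. "rev", "pwn", "crypto", "web", "forensics"
    model_transcript: str      # the model's full solution attempt
    reasoning_score: float     # 0-1: soundness of the deductive process
    methodology_score: float   # 0-1: appropriateness of chosen tools/techniques
    progress_score: float      # 0-1: partial progress toward the flag
    grader_notes: str = ""     # free-form comments from the human expert

    def overall(self) -> float:
        """Unweighted mean of the three rubric dimensions (illustrative choice)."""
        return (self.reasoning_score + self.methodology_score + self.progress_score) / 3


if __name__ == "__main__":
    record = GradedSubmission(
        challenge_id="crypto-example-01",
        category="crypto",
        model_transcript="...model's step-by-step attack on the cipher...",
        reasoning_score=0.8,
        methodology_score=0.7,
        progress_score=0.5,
        grader_notes="Identified the nonce reuse correctly; did not recover the flag.",
    )
    print(f"{record.challenge_id}: overall {record.overall():.2f}")
```

The point of such a record is that partial but sound work earns credit, whereas a memorized flag with no supporting reasoning does not.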
CTF problems are chosen as the foundation of this benchmark because they naturally reflect real cybersecurity tasks, such as reverse engineering, binary exploitation, cryptography, web security, and forensic analysis. This ensures that the evaluation goes beyond mere fact recall and instead tests a model’s ability to perform logical deduction, problem-solving, and technical creativity—key skills in practical security work.
By focusing on CTF challenges and expert review of the solution process, CTFChallenge Benchmark provides a more meaningful and realistic assessment of a language model's true cybersecurity proficiency than exact-match scoring alone.