AthenaBench

Paper: AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence (WAITI Workshop 2025) — https://arxiv.org/abs/2511.01144

AthenaBench is a benchmark for evaluating language models on a shared set of cyber threat intelligence (CTI) tasks. Full benchmark datasets live in benchmark/, with matching mini subsets under benchmark-mini/ for quick iteration. To add your model’s results, open a pull request that includes your scored outputs under runs/<model>/ (and optionally runs-mini/<model>/) plus an updated entry in the results tables in this README.
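At a glance, the paths this README refers to (a rough sketch of the layout, not an exhaustive listing):

    athena_eval/         runner, evaluator, and config.yaml
    benchmark/           full task datasets (JSONL)
    benchmark-mini/      matching mini subsets used by --mini
    runs/<model>/        scored outputs on the full benchmark
    runs-mini/<model>/   scored outputs on the mini subsets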

Setup

  1. Install dependencies

    pip install -r requirements.txt
    git lfs install
    git lfs pull

    Git LFS is required to fetch the large benchmark artifacts.

  2. Configure models and credentials in athena_eval/config.yaml. Each entry specifies a provider (openai, gemini, huggingface, or dummy) and a model name. API keys can be placed in the environment or in a .env file, which is auto-loaded. Example .env:

    OPENAI_API_KEY=""
    GEMINI_API_KEY=""
    HF_TOKEN=""
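
For the model entries themselves, a hypothetical config.yaml sketch (the provider names come from this README, but the exact keys and nesting are assumptions, so mirror the shipped athena_eval/config.yaml rather than copying this verbatim):

    # Hypothetical layout; verify against the real athena_eval/config.yaml.
    models:
      gpt-4o:
        provider: openai
        model: gpt-4o
      gemini-2.5-pro:
        provider: gemini
        model: gemini-2.5-pro
      local-dummy:
        provider: dummy
        model: dummy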
    

Run the Benchmark

Full datasets

Generate predictions on the full benchmark (writes to runs/<model>/<task>.jsonl):

python -m athena_eval.run --model gpt-4o --task RCM
  • Omit --model or --task to iterate over all configured entries.
  • Evaluation runs by default; add --no-evaluate to skip scoring during generation.
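
For example:

    # Run every configured model on every configured task
    python -m athena_eval.run

    # Generate predictions now, score them later with athena_eval.evaluate
    python -m athena_eval.run --model gpt-4o --task RCM --no-evaluate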

Re-evaluate existing predictions:

python -m athena_eval.evaluate --model gpt-4o --task RCM
  • CKT uses benchmark/athena-cti-ckt-3k.jsonl (only the 3k subset ships with this repo; there is no full CKT file). If the unscored prediction file is missing, the evaluator falls back to the existing *-scored.jsonl and computes metrics from it without rewriting the file (see the sketch below).
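
A minimal sketch of that fallback behavior, assuming the runs/<model>/<task>.jsonl naming used above (the real evaluator may name and parse files differently):

    import json
    from pathlib import Path

    def load_predictions(run_dir: Path, task: str):
        """Prefer unscored predictions; fall back to *-scored.jsonl if they are absent."""
        unscored = run_dir / f"{task}.jsonl"
        scored = run_dir / f"{task}-scored.jsonl"
        source = unscored if unscored.exists() else scored
        with source.open() as fh:
            records = [json.loads(line) for line in fh]
        # The flag tells the caller whether to score in place or just recompute
        # metrics from the already-scored file without rewriting it.
        return records, source == scored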

Mini subsets

Use the lightweight mini splits (writes to runs-mini/<model>/<task>.jsonl):

python -m athena_eval.run --mini --model gpt-4o --task RCM
python -m athena_eval.evaluate --mini --model gpt-4o --task RCM
  • The --mini flag swaps each dataset path for its counterpart in benchmark-mini/. The evaluator reads from runs-mini/ if present; otherwise it maps full-run outputs onto the mini records by prompt_hash (see the sketch below). As with full runs, if only scored artifacts exist, metrics are computed from *-scored.jsonl without rewriting.
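
A rough illustration of that prompt_hash mapping (a hypothetical helper, not the evaluator's actual code; it assumes every record carries a prompt_hash field):

    import json
    from pathlib import Path

    def select_mini_predictions(full_run: Path, mini_dataset: Path):
        """Keep only the full-run predictions whose prompt_hash occurs in the mini split."""
        with mini_dataset.open() as fh:
            mini_hashes = {json.loads(line)["prompt_hash"] for line in fh}
        with full_run.open() as fh:
            return [rec for rec in (json.loads(line) for line in fh)
                    if rec.get("prompt_hash") in mini_hashes]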

Dataset inventory

  • Full benchmark (benchmark/): athena-cti-ckt-3k.jsonl, athena-cti-ate.jsonl, athena-cti-rcm.jsonl, athena-cti-rms.jsonl, athena-cti-taa.jsonl, athena-cti-vsp.jsonl.
  • Mini subsets (benchmark-mini/): aligned smaller splits for each task (e.g., athena-cti-ckt-3k.jsonl), used by --mini.
  • Task names used with --task must match the keys in athena_eval/config.yaml (e.g., CKT, ATE, RCM, RMS, TAA, VSP).
  • Existing outputs in runs/ and runs-mini/ are scored-only for most tasks; regenerate if you need fresh unscored prediction files.
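
To confirm Git LFS actually materialized these files (rather than leaving small pointer stubs), a quick check along these lines can help; the glob pattern is an assumption based on the inventory above:

    import json
    from pathlib import Path

    # A Git LFS pointer stub starts with a non-JSON "version ..." line, so the
    # first json.loads call fails for files that were never pulled.
    for path in sorted(Path("benchmark").glob("athena-cti-*.jsonl")):
        with path.open() as fh:
            first = fh.readline()
            remaining = sum(1 for _ in fh)
        try:
            json.loads(first)
            print(f"{path.name}: {remaining + 1} records")
        except json.JSONDecodeError:
            print(f"{path.name}: not valid JSONL; run `git lfs pull`")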

Benchmark Results

Full Benchmark

| Model | CKT (Accuracy) | ATE (Accuracy) | RCM (Accuracy) | RMS (F1-score) | VSP (Accuracy) | TAA (Accuracy) | Combined |
|---|---|---|---|---|---|---|---|
| GPT-4 | 78.7 | 35.8 | 63.1 | 15.1 | 84.7 | 31.0 | 51.4 |
| GPT-4o | 85.2 | 51.6 | 71.3 | 20.2 | 84.7 | 35.0 | 58.0 |
| GPT-5 | 92.0 | 76.0 | 71.6 | 32.6 | 85.4 | 39.0 | 66.1 |
| Gemini-2.5-flash | 85.1 | 51.6 | 65.1 | 13.4 | 78.5 | 30.0 | 54.0 |
| Gemini-2.5-pro | 89.1 | 76.2 | 71.2 | 28.4 | 85.4 | 31.0 | 63.6 |
| Qwen3-4B | 74.7 | 5.6 | 45.4 | 4.8 | 79.6 | 15.0 | 37.5 |
| Qwen3-8B | 75.7 | 11.8 | 48.9 | 5.5 | 82.6 | 16.0 | 40.1 |
| Qwen3-14B | 78.6 | 19.4 | 54.1 | 7.0 | 80.3 | 17.0 | 42.7 |
| Llama 3.1-8B | 71.8 | 16.4 | 42.8 | 3.6 | 74.0 | 24.0 | 38.8 |
| Llama 3-70b-Instruct | 78.9 | 31.6 | 56.7 | 11.1 | 63.8 | 22.0 | 44.0 |
| Llama 3.3-70b-Instruct | 81.4 | 30.4 | 60.0 | 11.1 | 70.1 | 26.0 | 46.5 |
| Llama-Primus-Merged | 76.3 | 33.8 | 56.6 | 6.6 | 71.9 | 17.0 | 43.7 |

Mini Benchmark

| Model | CKT (Accuracy) | ATE (Accuracy) | RCM (Accuracy) | RMS (F1-score) | VSP (Accuracy) | TAA (Accuracy) | Combined |
|---|---|---|---|---|---|---|---|
| GPT-4 | 80.3 | 43.0 | 60.0 | 13.7 | 84.2 | 26.0 | 51.2 |
| GPT-4o | 87.7 | 59.0 | 68.0 | 19.9 | 85.9 | 30.0 | 58.4 |
| GPT-5 | 96.0 | 77.0 | 69.5 | 33.0 | 88.3 | 30.0 | 65.6 |
| Gemini-2.5-flash | 87.7 | 57.0 | 64.5 | 14.0 | 78.3 | 22.0 | 53.9 |
| Gemini-2.5-pro | 91.0 | 77.0 | 68.0 | 29.0 | 86.7 | 24.0 | 62.6 |
| Qwen3-4B | 76.3 | 8.0 | 43.5 | 5.8 | 78.2 | 16.0 | 38.0 |
| Qwen3-8B | 75.3 | 13.0 | 45.5 | 6.8 | 82.6 | 20.0 | 40.5 |
| Qwen3-14B | 82.7 | 21.0 | 49.0 | 8.5 | 78.0 | 16.0 | 42.5 |
| Llama 3.1-8B | 74.0 | 16.0 | 41.0 | 5.4 | 74.1 | 24.0 | 39.1 |
| Llama 3-70b-Instruct | 81.0 | 37.0 | 54.5 | 10.9 | 63.4 | 24.0 | 45.1 |
| Llama 3.3-70b-Instruct | 81.7 | 44.0 | 59.0 | 11.5 | 69.7 | 22.0 | 48.0 |
| Llama-Primus-Merged | 79.7 | 32.0 | 51.0 | 6.4 | 71.8 | 18.0 | 43.1 |


Citation

If you use this code or datasets, please cite both papers:

@article{alam2025athenabench,
  title={AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={Alam, Md Tanvirul and Bhusal, Dipkamal and Ahmad, Salman and Rastogi, Nidhi and Worth, Peter},
  journal={arXiv preprint arXiv:2511.01144},
  year={2025}
}

@article{alam2024ctibench,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={Alam, Md Tanvirul and Bhusal, Dipkamal and Nguyen, Le and Rastogi, Nidhi},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={50805--50825},
  year={2024}
}
