This repository benchmarks Greek-capable language models across seven supported NLP tasks: Grammatical Error Correction (GEC), Machine Translation (MT), Intent Classification, Legal Text Classification, Named Entity Recognition (NER), Part-of-Speech (POS) Tagging, and Summarization. It brings together the task datasets, task-specific prompting logic, a unified Python runner, and Colab notebooks for repeated evaluation with Ollama-based local models. The repository also preserves older exploratory notebooks and analyses under misc/, but the main runnable benchmark surface is now organized around the current seven-task suite and its task-specific Colab entrypoints.
- Explore the Greek datasets in data.csv.
- Use the reorganized Colab notebooks under notebooks/colab.
- Legacy exploratory notebooks now live under misc/notebooks.
The supported task set is:
`gec`, `machine_translation`, `intent_classification`, `legal_classification`, `ner`, `pos`, `summarization`
- Create and activate a virtual environment.
- Install the dependencies:
```
pip install -r requirements.txt
```

- Start Ollama and pull the models you want to compare, for example:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

Use `scripts/run_all_benchmarks.py` to run one task or the whole benchmark suite.
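Before launching a run, you can confirm the pulls succeeded from Python via Ollama's local REST API (it listens on `http://localhost:11434` by default, and `GET /api/tags` lists the installed models). A minimal sketch; `installed_models` is our own helper name, not part of the repository:

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# With Ollama running, fetch the live list (commented out so the sketch runs offline):
#   from urllib.request import urlopen
#   print(installed_models(urlopen("http://localhost:11434/api/tags").read().decode()))

# Offline illustration with a sample payload:
sample = '{"models": [{"name": "qwen2.5:7b-instruct"}, {"name": "llama3.1:8b"}]}'
print(installed_models(sample))  # ['qwen2.5:7b-instruct', 'llama3.1:8b']
```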
Run all supported tasks:
```
python scripts/run_all_benchmarks.py --task all --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b --sample-size 100
```

Run a single task:

```
python scripts/run_all_benchmarks.py --task ner --models qwen2.5:7b-instruct llama3.1:8b --sample-size 100
```

Run on the full available dataset for a task:

```
python scripts/run_all_benchmarks.py --task summarization --sample-size 0
```

Run a deterministic capped test-set profile that keeps the already small tasks at full size and trims only the large ones, using the first instances of each:

```
python scripts/run_all_benchmarks.py --task all --sample-size 0 --task-cap-profile reasonable
```

Run repeated Monte Carlo-style sampled evaluations (mean + SEM):

```
python scripts/run_all_benchmarks.py --task all --sample-size 100 --repeats 5
```

The compatibility entrypoint `suite_benchmark.py` forwards to the same runner, so this also works:
```
python suite_benchmark.py --task all
```

Outputs are written under `results/full_benchmark_suite/`.
For long-running server work, a clearer layout is to keep sampled and full-dataset runs separate, for example:
```
results/server_runs/
  completed_runs/
    20260326_235652_full_suite_default_models_sample100_volume_labels/
  full_test_set/
    20260327_XXXXXX_full_suite_default_models_full_test/
```

Pass `--output-dir` explicitly when you want to keep a run in one of these directories.
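A timestamped run directory in the style above can be generated and handed to `--output-dir`. A sketch; the naming scheme simply mirrors the example layout and is not required by the runner:

```python
from datetime import datetime
from pathlib import Path

def run_dir(root: str, label: str) -> Path:
    """Build a timestamped run directory, e.g. <root>/20260326_235652_<label>."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"{stamp}_{label}"

out = run_dir("results/server_runs/completed_runs", "full_suite_default_models_sample100")
print(out)  # pass str(out) as --output-dir
```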
When `--repeats` is 1 (the default, a single run):

- `{task}_summary.csv`
- `{task}_predictions.csv`
- `{task}_visualization.html`
- `all_tasks_summary.csv` (when `--task all` is used)
When `--repeats` > 1 (Monte Carlo mode):

- `{task}/repeat_XX/{task}_summary.csv`
- `{task}/repeat_XX/{task}_predictions.csv`
- `{task}/repeat_XX/{task}_visualization.html`
- `{task}/{task}_summary_with_sem.csv`
- `{task}/{task}_repeat_summaries.csv`
- `{task}/{task}_repeat_predictions.csv`
- `all_tasks_summary_with_sem.csv` (when `--task all` is used)
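The `*_summary_with_sem.csv` files aggregate the per-repeat scores; the statistics are simply the mean and the standard error of the mean over repeats, as in this sketch (the scores below are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_sem(scores):
    """Mean and standard error of the mean over repeated runs."""
    m = mean(scores)
    sem = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
    return m, sem

# e.g. five repeated sampled runs of one task/model pair
f1_per_repeat = [0.71, 0.68, 0.73, 0.70, 0.69]
m, sem = mean_and_sem(f1_per_repeat)
print(f"{m:.3f} +/- {sem:.3f}")  # 0.702 +/- 0.009
```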
Useful flags:
- `--task`: one of `all`, `gec`, `machine_translation`, `intent_classification`, `legal_classification`, `ner`, `pos`, `summarization`
- `--models`: one or more Ollama model names
- `--sample-size`: number of examples to score; use `0` for the full dataset
- `--repeats`: optional; number of repeated sampled runs (default: 1). Use a value greater than 1 together with `--sample-size > 0`.
- `--random-state`: sampling seed
- `--output-dir`: where result files are written
- `--task-cap-profile`: optional deterministic per-task caps. `reasonable` keeps `gec`, `intent_classification`, and `pos` at full test size, caps `legal_classification` and `ner` at 500, caps `summarization` at 300, and caps machine translation to the first 500 evaluation pairs per target language.
- `--temperature`: Ollama sampling temperature
- `--num-predict`: maximum output tokens
- `--timeout-seconds`: request timeout per generation
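When scripting sweeps over models or seeds, the same flags can be assembled programmatically. A sketch using only the flags documented above; the output directory name is just an example:

```python
import subprocess

models = ["qwen2.5:7b-instruct", "aya-expanse:8b", "llama3.1:8b"]
cmd = [
    "python", "scripts/run_all_benchmarks.py",
    "--task", "all",
    "--models", *models,
    "--sample-size", "100",
    "--repeats", "5",
    "--random-state", "42",
    "--output-dir", "results/sweeps/seed42",
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually launch the run
```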
To run the benchmark on a remote Linux server:
- Clone the repository and enter it:
```
git clone https://github.com/greek-nlp/benchmark.git
cd benchmark
```

- Create and activate a virtual environment:

```
python -m venv .venv
source .venv/bin/activate
```

- Install the dependencies:

```
pip install -r requirements.txt
```

- Install Ollama on the server and start it:

```
ollama serve
```

- In another shell, pull the models you want to benchmark:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

- Run the benchmark:

```
python scripts/run_all_benchmarks.py --task all --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b --sample-size 100
```

To run a single task on the server:

```
python scripts/run_all_benchmarks.py --task ner --models qwen2.5:7b-instruct llama3.1:8b --sample-size 100
```

To keep a long benchmark running after disconnecting, use tmux or screen. For example:

```
tmux new -s benchmark
python scripts/run_all_benchmarks.py --task all --sample-size 100
```

Server outputs are written under:
```
results/full_benchmark_suite/
```

The current Colab entrypoints are the task-specific notebooks under `notebooks/colab`.
These notebooks follow the same general pattern:
- install dependencies
- start Ollama
- pull selected models
- run repeated Monte Carlo-style evaluations
- save results and zip outputs
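The final "zip outputs" step can be done with the standard library. A sketch; the directory and archive names are illustrative, not fixed by the notebooks:

```python
import shutil

def zip_results(results_dir: str = "results", archive_name: str = "benchmark_results") -> str:
    """Archive the results tree into <archive_name>.zip so it can be downloaded from Colab in one file."""
    return shutil.make_archive(archive_name, "zip", root_dir=results_dir)

# zip_results()  # creates benchmark_results.zip next to the notebook
```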
To benchmark accessible Greek-capable LLMs for grammatical error correction locally, use gec_benchmark.py with Ollama.
- Create and activate a virtual environment.
- Install the dependencies:
```
pip install pandas pywer zenodo-get wget datasets conll-df openpyxl
```

- Start Ollama and pull the models you want to compare, for example:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

- Run the benchmark:

```
python gec_benchmark.py --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b --sample-size 100
```

The benchmark uses the KorreDt dataset, prompts each model to correct Modern Greek text, and writes:

- `results/gec_ollama/gec_benchmark_summary.csv`
- `results/gec_ollama/gec_benchmark_predictions.csv`
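GEC output is typically scored against the reference correction with word error rate, which is what the `pywer` dependency above provides. A self-contained sketch of the metric itself, so the numbers in the summary CSV are easy to interpret:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over tokens, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Two of four tokens lack the reference accents, so WER = 0.5:
print(wer("το σπίτι είναι μεγάλο", "το σπιτι ειναι μεγάλο"))
```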
For repeated sampled runs with mean and standard error of the mean (SEM), use suite_benchmark_monte_carlo.py.
How to run it:
- Create and activate a virtual environment.
- Install the dependencies:
```
pip install -r requirements.txt
```

- Start Ollama and pull the models you want to compare, for example:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

- Run one task:

```
python suite_benchmark_monte_carlo.py --task ner --models qwen2.5:7b-instruct llama3.1:8b --sample-size 100 --num-splits 5
```

- Run all supported tasks:

```
python suite_benchmark_monte_carlo.py --task all --sample-size 100 --num-splits 5 --data-limit-per-task 500 --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b
```

To resume a long run on a server or after a Colab disconnect:

```
python suite_benchmark_monte_carlo.py --task all --sample-size 100 --num-splits 5 --resume
```

`--num-splits` controls how many repeated sampled runs are performed per task. `--data-limit-per-task` caps the task dataset before sampling; use `0` to keep the full dataset. The older `--repeats` flag still works as an alias for `--num-splits`.
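The effect of `--resume` is to skip any split whose outputs are already on disk. The idea can be sketched as below; the filenames here are simplified stand-ins, not the runner's internals:

```python
from pathlib import Path

def pending_splits(task_dir: str, num_splits: int) -> list[int]:
    """Indices of repeats that have no saved summary yet, resume-style."""
    return [
        i for i in range(1, num_splits + 1)
        if not (Path(task_dir) / f"repeat_{i:02d}" / "summary.csv").exists()
    ]

print(pending_splits("results/suite_monte_carlo/ner", 5))  # all five on a fresh checkout
```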
Useful flags:
- `--task`: run a single task such as `ner`, `gec`, or `summarization`, or use `all`.
- `--models`: one or more Ollama model names.
- `--sample-size`: how many examples to score in each split. Use `0` for the full available dataset after any task cap.
- `--num-splits`: how many repeated sampled runs to perform per task.
- `--data-limit-per-task`: maximum number of examples to keep per task before sampling.
- `--resume`: reuse already saved split outputs instead of recomputing them.
This writes:
- `results/suite_monte_carlo/{task}/repeat_XX/{task}_summary.csv`
- `results/suite_monte_carlo/{task}/{task}_summary_with_sem.csv`
- `results/suite_monte_carlo/all_tasks_summary_with_sem.csv`
- `results/suite_monte_carlo/performance_by_task.csv`
This work is licensed under a Creative Commons Attribution 4.0 International License.
