This repository benchmarks Greek-capable language models across seven supported NLP tasks: Grammatical Error Correction (GEC), Machine Translation (MT), Intent Classification, Legal Text Classification, Named Entity Recognition (NER), Part-of-Speech (POS) Tagging, and Summarization. It brings together the task datasets, task-specific prompting logic, a unified Python runner, and Colab notebooks for repeated evaluation with Ollama-based local models. The repository also preserves older exploratory notebooks and analyses under misc/, but the main runnable benchmark surface is now organized around the current seven-task suite and its task-specific Colab entrypoints.
- Explore the Greek datasets in data.csv.
- Use the reorganized Colab notebooks under notebooks/colab.
- Legacy exploratory notebooks now live under misc/notebooks.
The supported task set is:
`gec`, `machine_translation`, `intent_classification`, `legal_classification`, `ner`, `pos`, `summarization`
- Create and activate a virtual environment.
- Install the dependencies:
```
pip install -r requirements.txt
```

- Start Ollama and pull the models you want to compare, for example:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

Use `scripts/run_all_benchmarks.py` to run one task or the whole benchmark suite.
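Before launching a run, you can confirm the pulls succeeded from Python via Ollama's local REST API (it listens on `http://localhost:11434` by default, and `GET /api/tags` lists the installed models). A minimal sketch; `installed_models` is our own helper name, not part of the repository:

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# With Ollama running, fetch the live list (commented out so the sketch runs offline):
#   from urllib.request import urlopen
#   print(installed_models(urlopen("http://localhost:11434/api/tags").read().decode()))

# Offline illustration with a sample payload:
sample = '{"models": [{"name": "qwen2.5:7b-instruct"}, {"name": "llama3.1:8b"}]}'
print(installed_models(sample))  # ['qwen2.5:7b-instruct', 'llama3.1:8b']
```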
Run all supported tasks:
```
python scripts/run_all_benchmarks.py --task all --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b --sample-size 100
```

Run a single task:

```
python scripts/run_all_benchmarks.py --task ner --models qwen2.5:7b-instruct llama3.1:8b --sample-size 100
```

Run on the full available dataset for a task:

```
python scripts/run_all_benchmarks.py --task summarization --sample-size 0
```

Run a deterministic capped test-set profile that keeps the already small tasks at full size and trims only the large ones, using the first instances of each:

```
python scripts/run_all_benchmarks.py --task all --sample-size 0 --task-cap-profile reasonable
```

Run repeated Monte Carlo-style sampled evaluations (mean + SEM):

```
python scripts/run_all_benchmarks.py --task all --sample-size 100 --repeats 5
```

The compatibility entrypoint `suite_benchmark.py` forwards to the same runner, so this also works:
```
python suite_benchmark.py --task all
```

Outputs are written under `results/full_benchmark_suite/`.
For long-running server work, a clearer layout is to keep sampled and full-dataset runs separate, for example:
```
results/server_runs/
  completed_runs/
    20260326_235652_full_suite_default_models_sample100_volume_labels/
  full_test_set/
    20260327_XXXXXX_full_suite_default_models_full_test/
```

Pass `--output-dir` explicitly when you want to keep a run in one of these directories.
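A timestamped run directory in the style above can be generated and handed to `--output-dir`. A sketch; the naming scheme simply mirrors the example layout and is not required by the runner:

```python
from datetime import datetime
from pathlib import Path

def run_dir(root: str, label: str) -> Path:
    """Build a timestamped run directory, e.g. <root>/20260326_235652_<label>."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"{stamp}_{label}"

out = run_dir("results/server_runs/completed_runs", "full_suite_default_models_sample100")
print(out)  # pass str(out) as --output-dir
```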
When `--repeats` is 1 (the default, a single run):

- `{task}_summary.csv`
- `{task}_predictions.csv`
- `{task}_visualization.html`
- `all_tasks_summary.csv` (when `--task all` is used)
When `--repeats` > 1 (Monte Carlo mode):

- `{task}/repeat_XX/{task}_summary.csv`
- `{task}/repeat_XX/{task}_predictions.csv`
- `{task}/repeat_XX/{task}_visualization.html`
- `{task}/{task}_summary_with_sem.csv`
- `{task}/{task}_repeat_summaries.csv`
- `{task}/{task}_repeat_predictions.csv`
- `all_tasks_summary_with_sem.csv` (when `--task all` is used)
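The `*_summary_with_sem.csv` files aggregate the per-repeat scores; the statistics are simply the mean and the standard error of the mean over repeats, as in this sketch (the scores below are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_sem(scores):
    """Mean and standard error of the mean over repeated runs."""
    m = mean(scores)
    sem = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
    return m, sem

# e.g. five repeated sampled runs of one task/model pair
f1_per_repeat = [0.71, 0.68, 0.73, 0.70, 0.69]
m, sem = mean_and_sem(f1_per_repeat)
print(f"{m:.3f} +/- {sem:.3f}")  # 0.702 +/- 0.009
```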
Useful flags:
- `--task`: one of `all`, `gec`, `machine_translation`, `intent_classification`, `legal_classification`, `ner`, `pos`, `summarization`
- `--models`: one or more Ollama model names
- `--sample-size`: number of examples to score; use `0` for the full dataset
- `--repeats`: optional; number of repeated sampled runs (default: 1). Use a value greater than 1 together with `--sample-size > 0`.
- `--random-state`: sampling seed
- `--output-dir`: where result files are written
- `--task-cap-profile`: optional deterministic per-task caps. `reasonable` keeps `gec`, `intent_classification`, and `pos` at full test size, caps `legal_classification` and `ner` at 500, caps `summarization` at 300, and caps machine translation to the first 500 evaluation pairs per target language.
- `--temperature`: Ollama sampling temperature
- `--num-predict`: maximum output tokens
- `--timeout-seconds`: request timeout per generation
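When scripting sweeps over models or seeds, the same flags can be assembled programmatically. A sketch using only the flags documented above; the output directory name is just an example:

```python
import subprocess

models = ["qwen2.5:7b-instruct", "aya-expanse:8b", "llama3.1:8b"]
cmd = [
    "python", "scripts/run_all_benchmarks.py",
    "--task", "all",
    "--models", *models,
    "--sample-size", "100",
    "--repeats", "5",
    "--random-state", "42",
    "--output-dir", "results/sweeps/seed42",
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually launch the run
```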
To run the benchmark on a remote Linux server:
- Clone the repository and enter it:
```
git clone https://github.com/greek-nlp/benchmark.git
cd benchmark
```

- Create and activate a virtual environment:

```
python -m venv .venv
source .venv/bin/activate
```

- Install the dependencies:

```
pip install -r requirements.txt
```

- Install Ollama on the server and start it:

```
ollama serve
```

- In another shell, pull the models you want to benchmark:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

- Run the benchmark:

```
python scripts/run_all_benchmarks.py --task all --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b --sample-size 100
```

To run a single task on the server:

```
python scripts/run_all_benchmarks.py --task ner --models qwen2.5:7b-instruct llama3.1:8b --sample-size 100
```

To keep a long benchmark running after disconnecting, use tmux or screen. For example:

```
tmux new -s benchmark
python scripts/run_all_benchmarks.py --task all --sample-size 100
```

Server outputs are written under:
```
results/full_benchmark_suite/
```

The current Colab entrypoints are the task-specific notebooks under `notebooks/colab`.
These notebooks follow the same general pattern:
- install dependencies
- start Ollama
- pull selected models
- run repeated Monte Carlo-style evaluations
- save results and zip outputs
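The final "zip outputs" step can be done with the standard library. A sketch; the directory and archive names are illustrative, not fixed by the notebooks:

```python
import shutil

def zip_results(results_dir: str = "results", archive_name: str = "benchmark_results") -> str:
    """Archive the results tree into <archive_name>.zip so it can be downloaded from Colab in one file."""
    return shutil.make_archive(archive_name, "zip", root_dir=results_dir)

# zip_results()  # creates benchmark_results.zip next to the notebook
```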
To benchmark accessible Greek-capable LLMs for grammatical error correction locally, use gec_benchmark.py with Ollama.
- Create and activate a virtual environment.
- Install the dependencies:
```
pip install pandas pywer zenodo-get wget datasets conll-df openpyxl
```

- Start Ollama and pull the models you want to compare, for example:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

- Run the benchmark:

```
python gec_benchmark.py --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b --sample-size 100
```

The benchmark uses the KorreDt dataset, prompts each model to correct Modern Greek text, and writes:

- `results/gec_ollama/gec_benchmark_summary.csv`
- `results/gec_ollama/gec_benchmark_predictions.csv`
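GEC output is typically scored against the reference correction with word error rate, which is what the `pywer` dependency above provides. A self-contained sketch of the metric itself, so the numbers in the summary CSV are easy to interpret:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over tokens, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Two of four tokens lack the reference accents, so WER = 0.5:
print(wer("το σπίτι είναι μεγάλο", "το σπιτι ειναι μεγάλο"))
```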
For repeated sampled runs with mean and standard error of the mean (SEM), use suite_benchmark_monte_carlo.py.
How to run it:
- Create and activate a virtual environment.
- Install the dependencies:
```
pip install -r requirements.txt
```

- Start Ollama and pull the models you want to compare, for example:

```
ollama pull qwen2.5:7b-instruct
ollama pull aya-expanse:8b
ollama pull llama3.1:8b
```

- Run one task:

```
python suite_benchmark_monte_carlo.py --task ner --models qwen2.5:7b-instruct llama3.1:8b --sample-size 100 --num-splits 5
```

- Run all supported tasks:

```
python suite_benchmark_monte_carlo.py --task all --sample-size 100 --num-splits 5 --data-limit-per-task 500 --models qwen2.5:7b-instruct aya-expanse:8b llama3.1:8b
```

To resume a long run on a server or after a Colab disconnect:

```
python suite_benchmark_monte_carlo.py --task all --sample-size 100 --num-splits 5 --resume
```

`--num-splits` controls how many repeated sampled runs are performed per task. `--data-limit-per-task` caps the task dataset before sampling; use `0` to keep the full dataset. The older `--repeats` flag still works as an alias for `--num-splits`.
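The effect of `--resume` is to skip any split whose outputs are already on disk. The idea can be sketched as below; the filenames here are simplified stand-ins, not the runner's internals:

```python
from pathlib import Path

def pending_splits(task_dir: str, num_splits: int) -> list[int]:
    """Indices of repeats that have no saved summary yet, resume-style."""
    return [
        i for i in range(1, num_splits + 1)
        if not (Path(task_dir) / f"repeat_{i:02d}" / "summary.csv").exists()
    ]

print(pending_splits("results/suite_monte_carlo/ner", 5))  # all five on a fresh checkout
```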
Useful flags:
- `--task`: run a single task such as `ner`, `gec`, or `summarization`, or use `all`.
- `--models`: one or more Ollama model names.
- `--sample-size`: how many examples to score in each split. Use `0` for the full available dataset after any task cap.
- `--num-splits`: how many repeated sampled runs to perform per task.
- `--data-limit-per-task`: maximum number of examples to keep per task before sampling.
- `--resume`: reuse already saved split outputs instead of recomputing them.
This writes:
- `results/suite_monte_carlo/{task}/repeat_XX/{task}_summary.csv`
- `results/suite_monte_carlo/{task}/{task}_summary_with_sem.csv`
- `results/suite_monte_carlo/all_tasks_summary_with_sem.csv`
- `results/suite_monte_carlo/performance_by_task.csv`
This work is licensed under a Creative Commons Attribution 4.0 International License.
