A modular, command-line evaluation framework for the VIVID benchmark — Vietnamese Idioms for Validation and Interpretation Depth.
Supports all benchmark tasks from the paper: generative explanation evaluation, LLM-as-a-Judge scoring, and discriminative classification — for both open-source models (via vLLM) and API models (OpenAI, Gemini).
- Benchmark & Method Overview
- Project Structure
- Installation
- Quick Start
- Commands
- Supported Models
- Output Files
- Resume / Incremental Runs
- Module Reference
VIVID (Vietnamese Idioms for Validation and Interpretation Depth) is a culturally grounded benchmark for evaluating LLMs’ Vietnamese figurative language understanding. The dataset contains 1,707 Vietnamese idioms and proverbs, and the released benchmark set used for evaluation contains 1,636 idiom–explanation pairs after human validation.
Each idiom/proverb is annotated with five complexity characteristics that are especially error-prone for LLMs:
- Only literal expressions / literal over-metaphorization
- Pragmatic nuances (sarcasm/irony/negative connotations)
- Uncommon vocabulary
- Archaic/outdated terms
- Customary / folk-knowledge-based expressions
In addition, each item is categorized into 7 semantic themes: Love, Virtues, Criticism, Work and Nature, Society, Life Lessons, Other.
This CLI framework reproduces all evaluation tracks described in the paper:
- Generative explanation evaluation (`generate`) — produce Vietnamese explanations for idioms/proverbs.
- LLM-as-a-Judge scoring (`judge`) — score explanations on a 0–5 scale with an aspect-based rubric.
- Discriminative classification (`discriminate`) — multiple-choice topic/pattern classification using lm-eval-harness.
We use a two-step evaluation setup with GPT-4.1 as the judge, and compare prompting strategies for judging: Zero-shot, Demonstration, Chain-of-Thought, Aspect-based.
Aspect-based Evaluation shows the strongest alignment with human judgment (Cohen’s κ = 0.792), so this repo defaults to aspect-based judging.
A manual evaluation on a random set of 200 samples reports strong agreement between two native Vietnamese annotators: Cohen’s κ = 0.913 and Pearson correlation = 0.912 on a 0–5 scale.
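As a reference for the agreement metric, Cohen's κ for two annotators can be computed directly; a minimal sketch in pure Python (the toy labels below are illustrative, not drawn from the dataset):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                   # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)   # chance agreement
    return (po - pe) / (1 - pe)

# Two annotators scoring six explanations on the 0-5 scale
print(cohen_kappa([5, 4, 0, 3, 5, 1], [5, 4, 0, 2, 5, 1]))  # ≈ 0.793
```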
```
.
├── vivid_eval.py                  # CLI entry point — run all tasks from here
├── requirements.txt
├── vivid/                         # Core framework package
│   ├── __init__.py
│   ├── constants.py               # Paths, model lists, API URLs
│   ├── utils.py                   # Shared helpers (formatting, file I/O, etc.)
│   ├── prompts.py                 # All prompt builders (zero-shot, few-shot, judge)
│   ├── generate.py                # Task: explanation generation (vLLM + API)
│   ├── judge.py                   # Task: LLM-as-a-Judge scoring
│   ├── discriminate.py            # Task: topic/pattern classification via lm_eval
│   └── pipeline.py                # Task: full generate → judge pipeline
├── dataset/
│   ├── VIVID_Dataset.csv          # 1,636 idioms with ground-truth explanations
│   ├── VIVID_Semantic_Themes.csv  # 7 semantic theme labels
│   └── VIVID_Linguistic_Complexity_Taxonomys.csv # 5 complexity trait labels
├── evaluation/
│   └── discriminative/
│       ├── topic.yaml             # lm_eval task config: topic classification
│       ├── topic.json             # lm_eval test data: topic classification
│       ├── pattern.yaml           # lm_eval task config: pattern classification
│       └── pattern.json           # lm_eval test data: pattern classification
└── results/                       # All outputs are saved here (auto-created)
```
```bash
pip install vllm openai pandas tqdm lm-eval
```

Set your API key as an environment variable (needed for `judge`, and for API-based `generate` / `discriminate`):
```bash
# OpenAI
export OPENAI_API_KEY=sk-...

# Gemini (via Google AI Studio)
export OPENAI_API_KEY=AIza...
```

```bash
# 1. Generate explanations with an open-source model
python vivid_eval.py generate --model Qwen/Qwen3-14B --prompt zero-shot

# 2. Score the output with GPT-4.1 as judge
python vivid_eval.py judge \
    --input results/generate_Qwen_Qwen3-14B_zero-shot_<timestamp>.csv \
    --model-col Qwen_Qwen3-14B_zero_shot_explanation

# 3. Run topic + pattern classification
python vivid_eval.py discriminate --model Qwen/Qwen3-14B --task both

# Or run steps 1 + 2 together
python vivid_eval.py full-pipeline --model Qwen/Qwen3-14B --prompt zero-shot
```

All commands are run through `vivid_eval.py`. Every command accepts `--help` for full option details.
```bash
python vivid_eval.py --help
python vivid_eval.py generate --help
python vivid_eval.py judge --help
python vivid_eval.py discriminate --help
python vivid_eval.py full-pipeline --help
```
Generates a Vietnamese explanation for each idiom/proverb in the dataset. Works with both local open-source models (via vLLM) and hosted API models.
Output column added to the CSV: `<model_tag>_<zero_shot|few_shot>_explanation`
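The column name can be reconstructed from the examples in this README; a hypothetical sketch (the real logic lives in the `vivid` package and may differ):

```python
def explanation_column(model_id: str, prompt: str) -> str:
    """Hypothetical reconstruction of the column naming convention:
    slashes in the model ID become underscores (cf. safe_model_name)
    and the prompt style is snake_cased."""
    tag = model_id.replace("/", "_")
    return f"{tag}_{prompt.replace('-', '_')}_explanation"

print(explanation_column("Qwen/Qwen3-14B", "zero-shot"))
# Qwen_Qwen3-14B_zero_shot_explanation
```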
Open-source model (runs locally via vLLM):

```bash
python vivid_eval.py generate \
    --model Qwen/Qwen3-14B \
    --prompt zero-shot

# Multi-GPU with bfloat16
python vivid_eval.py generate \
    --model meta-llama/Llama-4-Scout-17B-16E \
    --prompt few-shot \
    --tensor-parallel 2 \
    --dtype bfloat16 \
    --batch-size 32
```

API model:

```bash
python vivid_eval.py generate \
    --model gpt-4o \
    --prompt few-shot \
    --api-key sk-...

python vivid_eval.py generate \
    --model gemini-2.5-flash \
    --prompt zero-shot \
    --api-key AIza...
```

All options:
| Flag | Default | Description |
|---|---|---|
| `--model` | (required) | HuggingFace model ID or API model name |
| `--prompt` | `zero-shot` | `zero-shot` or `few-shot` |
| `--dataset` | `dataset/VIVID_Dataset.csv` | Path to input dataset |
| `--batch-size` | `64` | vLLM inference batch size |
| `--max-tokens` | `150` | Max tokens to generate per explanation |
| `--temperature` | `0.7` | Sampling temperature |
| `--tensor-parallel` | `1` | Number of GPUs for tensor parallelism (vLLM) |
| `--gpu-memory-util` | `0.85` | Fraction of GPU memory to use (vLLM) |
| `--dtype` | `auto` | Model dtype: `auto`, `float16`, `bfloat16` |
| `--max-model-len` | `None` | Override max context length (vLLM) |
| `--api-key` | env var | API key (overrides `OPENAI_API_KEY`) |
| `--api-delay` | `1.0` | Seconds to wait between API calls |
| `--results-dir` | `results/` | Directory to save output CSV |
Output: `results/generate_<model>_<prompt>_<timestamp>.csv`
Scores model-generated explanations using GPT-4.1 as an aspect-based judge, the strategy validated at Cohen's κ = 0.792 against human annotators in the paper.
Each explanation is scored 0–5 across four criteria (semantic accuracy, nuance, fluency, completeness), and an overall similarity score is returned. A score of 0 on criterion 1 (semantic accuracy) automatically sets the overall score to 0.
Takes as input the CSV produced by `generate`, and adds a `<col>_score` column.
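The gating rule can be illustrated with a short sketch; note that the averaging step here is an assumption for illustration only (the judge model returns the overall score itself):

```python
def overall_score(semantic: int, nuance: int, fluency: int, completeness: int) -> int:
    """Combine four 0-5 aspect scores into one overall score.

    A 0 on criterion 1 (semantic accuracy) forces the overall score
    to 0, regardless of the other aspects.
    """
    if semantic == 0:
        return 0
    # Illustrative aggregation only; the real judge returns its own overall score.
    return round((semantic + nuance + fluency + completeness) / 4)

print(overall_score(0, 5, 5, 5))  # 0 — gated by semantic accuracy
print(overall_score(4, 3, 5, 4))  # 4
```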
```bash
python vivid_eval.py judge \
    --input results/generate_Qwen_Qwen3-14B_zero-shot_20250101_120000.csv \
    --model-col Qwen_Qwen3-14B_zero_shot_explanation \
    --api-key sk-...

# Use a different judge model
python vivid_eval.py judge \
    --input results/generate_gpt-4o_few-shot_<timestamp>.csv \
    --model-col gpt-4o_few_shot_explanation \
    --judge-model gpt-4o \
    --api-key sk-...
```

All options:
| Flag | Default | Description |
|---|---|---|
| `--input` | (required) | CSV file from the `generate` step |
| `--model-col` | (required) | Column name containing LLM explanations to score |
| `--judge-model` | `gpt-4.1` | Judge model to use |
| `--api-key` | env var | OpenAI API key |
| `--api-delay` | `1.0` | Seconds to wait between API calls |
| `--results-dir` | `results/` | Directory to save outputs |
Terminal output includes:

```
============================================================
 Judge Results Summary
============================================================
 Model column       Qwen_Qwen3-14B_zero_shot_explanation
 Judge model        gpt-4.1
 Total rows         1636
 Scored rows        1636
 Errors / skipped   0
 Mean score         1.0900
 Median score       1.0000
 Score 0 (%)        42.3%
 Score 5 (%)        3.1%

Score distribution:
  0/5  ████████████████ 692
  1/5  ████████ 412
  2/5  ████ 198
  3/5  ██ 134
  4/5  █ 108
  5/5  92
```
Outputs:
- `results/judge_<model_col>_<timestamp>.csv` — full dataset with scores added
- `results/judge_<model_col>_<timestamp>.json` — summary statistics
Runs multiple-choice classification tasks using the lm-evaluation-harness. Supports both API models and open-source models (via the vLLM backend in lm_eval). Uses exact-match evaluation and 3-shot prompting by default.
Two tasks are available:
| Task | `--task` value | Description |
|---|---|---|
| Topic classification | `topic` | Classify each idiom into 1 of 7 semantic themes |
| Pattern classification | `pattern` | Identify 1 or more of 5 linguistic complexity traits (multi-label) |
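For the multi-label pattern task, exact match means the predicted trait set must equal the gold set; a minimal sketch (the trait names below are shorthand for illustration):

```python
def exact_match_accuracy(preds, golds):
    """Multi-label exact match: a prediction scores 1 only when the
    predicted set of traits equals the gold set exactly."""
    hits = sum(set(p) == set(g) for p, g in zip(preds, golds))
    return hits / len(golds)

# Illustrative shorthand labels for two of the 5 complexity traits
preds = [["archaic-terms"], ["literal-only", "folk-knowledge"]]
golds = [["archaic-terms"], ["literal-only"]]
print(exact_match_accuracy(preds, golds))  # 0.5
```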
API model:

Before running any command with an API model (e.g., `gpt-4o`, `gemini-2.5-flash`), export your API key:

```bash
export OPENAI_API_KEY=sk-...
```

```bash
python vivid_eval.py discriminate \
    --model gpt-4o \
    --task both \
    --rebuild-eval-data \
    --api-key sk-...
```

Open-source model (vLLM backend in lm_eval):

```bash
python vivid_eval.py discriminate \
    --model Qwen/Qwen3-14B \
    --rebuild-eval-data \
    --task both

# Multi-GPU, bfloat16, restricted context
python vivid_eval.py discriminate \
    --model meta-llama/Llama-4-Scout-17B-16E \
    --task topic \
    --tensor-parallel 2 \
    --dtype bfloat16 \
    --batch-size 16 \
    --rebuild-eval-data \
    --max-model-len 4096
```

All options:
| Flag | Default | Description |
|---|---|---|
| `--model` | (required) | HuggingFace model ID or API model name |
| `--task` | `both` | `topic`, `pattern`, or `both` |
| `--num-fewshot` | `3` | Number of few-shot examples passed to lm_eval |
| `--batch-size` | `32` | Inference batch size for open-source models |
| `--tensor-parallel` | `1` | Number of GPUs for tensor parallelism (vLLM) |
| `--gpu-memory-util` | `0.85` | Fraction of GPU memory to use (vLLM) |
| `--dtype` | `auto` | Model dtype: `auto`, `float16`, `bfloat16` |
| `--max-model-len` | `None` | Override max context length (vLLM) |
| `--api-key` | env var | API key for hosted models |
| `--results-dir` | `results/` | Directory to save outputs |
Output: `results/discriminate_<model>_<task>_<timestamp>.json`
Convenience command that runs generate and judge back-to-back in a single call, automatically passing the generate output CSV into the judge step.
```bash
# Open-source model
python vivid_eval.py full-pipeline \
    --model Qwen/Qwen3-14B \
    --prompt zero-shot \
    --api-key sk-...

# API model, few-shot
python vivid_eval.py full-pipeline \
    --model gpt-4o \
    --prompt few-shot \
    --api-key sk-...
```

Accepts all flags from both `generate` and `judge`. The `--api-key` is used for both the generation step (if the model is API-based) and the judge step.
| Model | HuggingFace ID |
|---|---|
| Llama-4-Scout (109B) | meta-llama/Llama-4-Scout-17B-16E |
| DeepSeek-R1-Distill (14B) | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| Qwen3 (14B) | Qwen/Qwen3-14B |
| SEA-LION (8B) | aisingapore/Llama-SEA-LION-v3-8B-IT |
| Vistral (7B) | Viet-Mistral/Vistral-7B-Chat |
| GreenMind (14B) | GreenNode/GreenMind-Medium-14B-R1 |
| VinaLLaMA (7B) | vilm/vinallama-7b-chat |
Any model on HuggingFace compatible with vLLM will also work.
| Provider | Model IDs |
|---|---|
| OpenAI | `gpt-4o`, `gpt-4.1`, `gpt-4o-mini` |
| Google | `gemini-2.5-flash`, `gemini-2.0-flash` |
Model type is detected automatically from the model name — no extra flags needed.
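A hypothetical sketch of what prefix-based detection might look like (the actual logic lives in the `vivid` package and may differ):

```python
def is_api_model(model: str) -> bool:
    """Treat known API-name prefixes as hosted models; anything else
    is assumed to be a HuggingFace ID served locally via vLLM.
    (Illustrative only — not the framework's real implementation.)"""
    return model.lower().startswith(("gpt-", "gemini-"))

print(is_api_model("gpt-4o"))          # True
print(is_api_model("Qwen/Qwen3-14B"))  # False
```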
All outputs are written to results/ (configurable via --results-dir) with a timestamp in the filename to avoid overwriting previous runs.
| Command | Output file(s) |
|---|---|
| `generate` | `generate_<model>_<prompt>_<timestamp>.csv` |
| `judge` | `judge_<model_col>_<timestamp>.csv` + `.json` |
| `discriminate` | `discriminate_<model>_<task>_<timestamp>.json` + lm_eval raw output |
| `full-pipeline` | All of the above from both steps |
The .csv files contain the full dataset with new columns appended. The .json files contain summary statistics suitable for reporting.
Example output CSV columns after a full pipeline run:

```
phrase | ground_truth_explanation | Qwen_Qwen3-14B_zero_shot_explanation | Qwen_Qwen3-14B_zero_shot_score
```
All tasks check for already-computed rows before starting:
- `generate` — skips rows where the explanation column is already filled. Re-run the same command with the same model and prompt to continue an interrupted run.
- `judge` — skips rows where the score column is already filled. Pass the partially-scored CSV as `--input` to continue.
This means it is safe to interrupt any run and restart it without losing progress.
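The skip check can be sketched with pandas (assumed convention: a row counts as done when the target column holds a non-empty value; the real implementation may differ):

```python
import pandas as pd

def rows_to_process(df: pd.DataFrame, col: str) -> list:
    """Return indices of rows whose target column is still empty."""
    if col not in df.columns:
        return list(df.index)  # fresh run: every row is pending
    filled = df[col].notna() & (df[col].astype(str).str.strip() != "")
    return list(df.index[~filled])

df = pd.DataFrame({"phrase": ["a", "b", "c"],
                   "expl": ["done", None, ""]})
print(rows_to_process(df, "expl"))  # [1, 2]
```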
The vivid/ package can also be imported directly in your own scripts.
```python
from vivid.prompts import build_zero_shot_message, build_few_shot_message, build_judge_prompt
from vivid.utils import safe_model_name, timestamp, extract_number
from vivid.generate import run_generate
from vivid.judge import run_judge
from vivid.discriminate import run_discriminate
from vivid.pipeline import run_full_pipeline
```

| Module | Contents |
|---|---|
| `vivid/constants.py` | Default paths, model lists, `GEMINI_BASE_URL` |
| `vivid/utils.py` | `timestamp`, `safe_model_name`, `ensure_results_dir`, `extract_number`, `print_section`, `print_summary_table`, `print_score_distribution` |
| `vivid/prompts.py` | `build_zero_shot_message`, `build_few_shot_message`, `build_judge_prompt` |
| `vivid/generate.py` | `run_generate` — vLLM and API explanation generation |
| `vivid/judge.py` | `run_judge` — GPT-4.1 aspect-based scoring |
| `vivid/discriminate.py` | `run_discriminate` — lm_eval topic/pattern classification |
| `vivid/pipeline.py` | `run_full_pipeline` — generate → judge in sequence |
| `vivid_eval.py` | CLI entry point — argument parsing and dispatch only |