
VIVID Evaluation Framework

A modular, command-line evaluation framework for the VIVID benchmark — Vietnamese Idioms for Validation and Interpretation Depth.

Supports all benchmark tasks from the paper: generative explanation evaluation, LLM-as-a-Judge scoring, and discriminative classification — for both open-source models (via vLLM) and API models (OpenAI, Gemini).


Benchmark & Method Overview

VIVID (Vietnamese Idioms for Validation and Interpretation Depth) is a culturally grounded benchmark for evaluating LLMs’ Vietnamese figurative language understanding. The dataset contains 1,707 Vietnamese idioms and proverbs, and the released benchmark set used for evaluation contains 1,636 idiom–explanation pairs after human validation.

Dataset labels

Each idiom/proverb is annotated with five complexity characteristics that are especially error-prone for LLMs:

  1. Only literal expressions / literal over-metaphorization
  2. Pragmatic nuances (sarcasm/irony/negative connotations)
  3. Uncommon vocabulary
  4. Archaic/outdated terms
  5. Customary / folk-knowledge-based expressions

In addition, each item is categorized into 7 semantic themes: Love, Virtues, Criticism, Work and Nature, Society, Life Lessons, Other.

Evaluation tasks supported by this framework

This CLI framework reproduces all evaluation tracks described in the paper:

  • Generative explanation evaluation (generate) — produce Vietnamese explanations for idioms/proverbs.
  • LLM-as-a-Judge scoring (judge) — score explanations on a 0–5 scale with an aspect-based rubric.
  • Discriminative classification (discriminate) — multiple-choice topic/pattern classification using lm-eval-harness.

Scoring protocol (LLM-as-a-Judge)

We use a two-step evaluation setup with GPT-4.1 as the judge and compare four prompting strategies for judging: Zero-shot, Demonstration, Chain-of-Thought, and Aspect-based.

Aspect-based Evaluation shows the strongest alignment with human judgment (Cohen’s κ = 0.792), so this repo defaults to aspect-based judging.

Human evaluation reference

A manual evaluation on a random set of 200 samples reports strong agreement between two native Vietnamese annotators: Cohen’s κ = 0.913 and Pearson correlation = 0.912 on a 0–5 scale.
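For reference, unweighted Cohen's κ over a 0–5 scale can be computed as below. This is a minimal pure-Python sketch of the standard statistic, not code from this repo, and it assumes the unweighted variant (the paper does not state whether a weighted κ was used):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa between two annotators' label lists."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items both annotators labeled identically
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement under independent per-annotator label distributions
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((count_a[label] / n) * (count_b[label] / n) for label in count_a)
    return (p_o - p_e) / (1 - p_e)
```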



Project Structure

```
.
├── vivid_eval.py                        # CLI entry point — run all tasks from here
├── requirements.txt
├── vivid/                               # Core framework package
│   ├── __init__.py
│   ├── constants.py                     # Paths, model lists, API URLs
│   ├── utils.py                         # Shared helpers (formatting, file I/O, etc.)
│   ├── prompts.py                       # All prompt builders (zero-shot, few-shot, judge)
│   ├── generate.py                      # Task: explanation generation (vLLM + API)
│   ├── judge.py                         # Task: LLM-as-a-Judge scoring
│   ├── discriminate.py                  # Task: topic/pattern classification via lm_eval
│   └── pipeline.py                      # Task: full generate → judge pipeline
├── dataset/
│   ├── VIVID_Dataset.csv                # 1,636 idioms with ground-truth explanations
│   ├── VIVID_Semantic_Themes.csv        # 7 semantic theme labels
│   └── VIVID_Linguistic_Complexity_Taxonomys.csv  # 5 complexity trait labels
├── evaluation/
│   └── discriminative/
│       ├── topic.yaml                   # lm_eval task config: topic classification
│       ├── topic.json                   # lm_eval test data: topic classification
│       ├── pattern.yaml                 # lm_eval task config: pattern classification
│       └── pattern.json                 # lm_eval test data: pattern classification
└── results/                             # All outputs are saved here (auto-created)
```

Installation

```shell
# Install from the pinned requirements, or grab the core dependencies directly
pip install -r requirements.txt
# or:
pip install vllm openai pandas tqdm lm-eval
```

Set your API key as an environment variable (needed for judge, and for API-based generate / discriminate):

```shell
# OpenAI
export OPENAI_API_KEY=sk-...

# Gemini (via Google AI Studio): the Gemini key is also read from OPENAI_API_KEY,
# since requests go through Google's OpenAI-compatible endpoint
export OPENAI_API_KEY=AIza...
```

Quick Start

```shell
# 1. Generate explanations with an open-source model
python vivid_eval.py generate --model Qwen/Qwen3-14B --prompt zero-shot

# 2. Score the output with GPT-4.1 as judge
python vivid_eval.py judge \
    --input results/generate_Qwen_Qwen3-14B_zero-shot_<timestamp>.csv \
    --model-col Qwen_Qwen3-14B_zero_shot_explanation

# 3. Run topic + pattern classification
python vivid_eval.py discriminate --model Qwen/Qwen3-14B --task both

# Or run steps 1 + 2 together
python vivid_eval.py full-pipeline --model Qwen/Qwen3-14B --prompt zero-shot
```

Commands

All commands are run through vivid_eval.py. Every command accepts --help for full option details.

```shell
python vivid_eval.py --help
python vivid_eval.py generate --help
python vivid_eval.py judge --help
python vivid_eval.py discriminate --help
python vivid_eval.py full-pipeline --help
```

generate

Generates a Vietnamese explanation for each idiom/proverb in the dataset. Works with both local open-source models (via vLLM) and hosted API models.

Output column added to the CSV: <model_tag>_<zero_shot|few_shot>_explanation
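The column name is derived from the model ID and prompt style; a minimal sketch of the assumed naming convention (the repo's actual logic lives in `safe_model_name` in `vivid/utils.py` and may differ in detail):

```python
def explanation_column(model: str, prompt: str) -> str:
    """Assumed naming convention for the generated-explanation column."""
    model_tag = model.replace("/", "_")    # HuggingFace IDs contain '/'
    prompt_tag = prompt.replace("-", "_")  # 'zero-shot' -> 'zero_shot'
    return f"{model_tag}_{prompt_tag}_explanation"
```

This reproduces the column name used in the Quick Start example, `Qwen_Qwen3-14B_zero_shot_explanation`.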

Open-source model (runs locally via vLLM):

```shell
python vivid_eval.py generate \
    --model Qwen/Qwen3-14B \
    --prompt zero-shot

# Multi-GPU with bfloat16
python vivid_eval.py generate \
    --model meta-llama/Llama-4-Scout-17B-16E \
    --prompt few-shot \
    --tensor-parallel 2 \
    --dtype bfloat16 \
    --batch-size 32
```

API model:

```shell
# OpenAI
python vivid_eval.py generate \
    --model gpt-4o \
    --prompt few-shot \
    --api-key sk-...

# Gemini
python vivid_eval.py generate \
    --model gemini-2.5-flash \
    --prompt zero-shot \
    --api-key AIza...
```

All options:

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | (required) | HuggingFace model ID or API model name |
| `--prompt` | `zero-shot` | `zero-shot` or `few-shot` |
| `--dataset` | `dataset/VIVID_Dataset.csv` | Path to input dataset |
| `--batch-size` | `64` | vLLM inference batch size |
| `--max-tokens` | `150` | Max tokens to generate per explanation |
| `--temperature` | `0.7` | Sampling temperature |
| `--tensor-parallel` | `1` | Number of GPUs for tensor parallelism (vLLM) |
| `--gpu-memory-util` | `0.85` | Fraction of GPU memory to use (vLLM) |
| `--dtype` | `auto` | Model dtype: `auto`, `float16`, `bfloat16` |
| `--max-model-len` | `None` | Override max context length (vLLM) |
| `--api-key` | env var | API key (overrides `OPENAI_API_KEY`) |
| `--api-delay` | `1.0` | Seconds to wait between API calls |
| `--results-dir` | `results/` | Directory to save output CSV |

Output: results/generate_<model>_<prompt>_<timestamp>.csv


judge

Scores model-generated explanations using GPT-4.1 as an aspect-based judge, the strategy validated at Cohen's κ = 0.792 against human annotators in the paper.

Each explanation is scored 0–5 across four criteria (semantic accuracy, nuance, fluency, completeness), and an overall similarity score is returned. A score of 0 on criterion 1 (semantic accuracy) automatically sets the overall score to 0.
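The gating rule can be sketched as below. This is illustrative only: the field names are assumptions, not the repo's actual response parser, and only the criterion-1 gate is documented above.

```python
def gated_overall(scores: dict) -> int:
    """Apply the criterion-1 gate to a parsed judge response.

    `scores` is assumed to hold the four 0-5 criterion scores plus the
    judge's overall similarity score.
    """
    if scores["semantic_accuracy"] == 0:
        # Wrong meaning: the explanation scores 0 overall regardless of
        # nuance, fluency, or completeness.
        return 0
    return scores["overall"]
```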

Takes as input the CSV produced by generate, and adds a <col>_score column.

```shell
python vivid_eval.py judge \
    --input results/generate_Qwen_Qwen3-14B_zero-shot_20250101_120000.csv \
    --model-col Qwen_Qwen3-14B_zero_shot_explanation \
    --api-key sk-...

# Use a different judge model
python vivid_eval.py judge \
    --input results/generate_gpt-4o_few-shot_<timestamp>.csv \
    --model-col gpt-4o_few_shot_explanation \
    --judge-model gpt-4o \
    --api-key sk-...
```

All options:

| Flag | Default | Description |
|------|---------|-------------|
| `--input` | (required) | CSV file from the generate step |
| `--model-col` | (required) | Column name containing LLM explanations to score |
| `--judge-model` | `gpt-4.1` | Judge model to use |
| `--api-key` | env var | OpenAI API key |
| `--api-delay` | `1.0` | Seconds to wait between API calls |
| `--results-dir` | `results/` | Directory to save outputs |

Terminal output includes:

```
============================================================
  Judge Results Summary
============================================================
  Model column              Qwen_Qwen3-14B_zero_shot_explanation
  Judge model               gpt-4.1
  Total rows                1636
  Scored rows               1636
  Errors / skipped          0
  Mean score                1.0900
  Median score              1.0000
  Score 0 (%)               42.3%
  Score 5 (%)               3.1%

  Score distribution:
    0/5  ████████████████          692
    1/5  ████████                  412
    2/5  ████                      198
    3/5  ██                        134
    4/5  █                         108
    5/5                             92
```

Outputs:

  • results/judge_<model_col>_<timestamp>.csv — full dataset with scores added
  • results/judge_<model_col>_<timestamp>.json — summary statistics

discriminate

Runs multiple-choice classification tasks using the lm-evaluation-harness. Supports both API models and open-source models (via the vLLM backend in lm_eval). Uses exact-match evaluation and 3-shot prompting by default.

Two tasks are available:

| Task | `--task` value | Description |
|------|----------------|-------------|
| Topic classification | `topic` | Classify each idiom into 1 of 7 semantic themes |
| Pattern classification | `pattern` | Identify 1 or more of 5 linguistic complexity traits (multi-label) |
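For the multi-label pattern task, exact match naturally means the predicted trait set equals the gold set; a one-line sketch of that assumed semantics (label IDs are illustrative):

```python
def multilabel_exact_match(predicted, gold) -> bool:
    """True only if the model selects exactly the gold set of complexity traits."""
    return set(predicted) == set(gold)
```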

API model: Before running any command with an API model (e.g., gpt-4o, gemini-2.5-flash), export your API key:

```shell
export OPENAI_API_KEY=sk-...
python vivid_eval.py discriminate \
    --model gpt-4o \
    --task both \
    --rebuild-eval-data \
    --api-key sk-...
```

Open-source model (vLLM backend in lm_eval):

```shell
python vivid_eval.py discriminate \
    --model Qwen/Qwen3-14B \
    --rebuild-eval-data \
    --task both

# Multi-GPU, bfloat16, restricted context
python vivid_eval.py discriminate \
    --model meta-llama/Llama-4-Scout-17B-16E \
    --task topic \
    --tensor-parallel 2 \
    --dtype bfloat16 \
    --batch-size 16 \
    --rebuild-eval-data \
    --max-model-len 4096
```

All options:

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | (required) | HuggingFace model ID or API model name |
| `--task` | `both` | `topic`, `pattern`, or `both` |
| `--num-fewshot` | `3` | Number of few-shot examples passed to lm_eval |
| `--batch-size` | `32` | Inference batch size for open-source models |
| `--tensor-parallel` | `1` | Number of GPUs for tensor parallelism (vLLM) |
| `--gpu-memory-util` | `0.85` | Fraction of GPU memory to use (vLLM) |
| `--dtype` | `auto` | Model dtype: `auto`, `float16`, `bfloat16` |
| `--max-model-len` | `None` | Override max context length (vLLM) |
| `--api-key` | env var | API key for hosted models |
| `--results-dir` | `results/` | Directory to save outputs |

Output: results/discriminate_<model>_<task>_<timestamp>.json


full-pipeline

Convenience command that runs generate and judge back-to-back in a single call, automatically passing the generate output CSV into the judge step.

# Open-source model
python vivid_eval.py full-pipeline \
    --model Qwen/Qwen3-14B \
    --prompt zero-shot \
    --api-key sk-...
# API model, few-shot
python vivid_eval.py full-pipeline \
    --model gpt-4o \
    --prompt few-shot \
    --api-key sk-...

Accepts all flags from both generate and judge. The --api-key is used for both the generation step (if the model is API-based) and the judge step.


Supported Models

Open-source (run locally via vLLM)

| Model | HuggingFace ID |
|-------|----------------|
| Llama-4-Scout (109B) | `meta-llama/Llama-4-Scout-17B-16E` |
| DeepSeek-R1-Distill (14B) | `deepseek-ai/DeepSeek-R1-Distill-Qwen-14B` |
| Qwen3 (14B) | `Qwen/Qwen3-14B` |
| SEA-LION (8B) | `aisingapore/Llama-SEA-LION-v3-8B-IT` |
| Vistral (7B) | `Viet-Mistral/Vistral-7B-Chat` |
| GreenMind (14B) | `GreenNode/GreenMind-Medium-14B-R1` |
| VinaLLaMA (7B) | `vilm/vinallama-7b-chat` |

Any vLLM-compatible model on HuggingFace should also work.

API models

| Provider | Model IDs |
|----------|-----------|
| OpenAI | `gpt-4o`, `gpt-4.1`, `gpt-4o-mini` |
| Google | `gemini-2.5-flash`, `gemini-2.0-flash` |

Model type is detected automatically from the model name — no extra flags needed.
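A sketch of the kind of name-based routing implied here. The prefixes below are an assumption for illustration, not the repo's actual check (see `vivid/constants.py` for the real model lists):

```python
API_MODEL_PREFIXES = ("gpt-", "gemini-")  # assumed heuristic, not the repo's exact list

def is_api_model(model_name: str) -> bool:
    """Route hosted-model names to the API backend, everything else to vLLM."""
    return model_name.lower().startswith(API_MODEL_PREFIXES)
```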


Output Files

All outputs are written to results/ (configurable via --results-dir) with a timestamp in the filename to avoid overwriting previous runs.

| Command | Output file(s) |
|---------|----------------|
| `generate` | `generate_<model>_<prompt>_<timestamp>.csv` |
| `judge` | `judge_<model_col>_<timestamp>.csv` + `.json` |
| `discriminate` | `discriminate_<model>_<task>_<timestamp>.json` + lm_eval raw output |
| `full-pipeline` | All of the above from both steps |

The .csv files contain the full dataset with new columns appended. The .json files contain summary statistics suitable for reporting.

Example output CSV columns after a full pipeline run:

```
phrase | ground_truth_explanation | Qwen_Qwen3-14B_zero_shot_explanation | Qwen_Qwen3-14B_zero_shot_score
```
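The scored CSV can then be analyzed with standard tools; a stdlib-only sketch that recomputes two of the summary statistics shown above (`score_summary` is a hypothetical helper, not part of the repo; the score column follows the `<col>_score` convention):

```python
import csv
import statistics

def score_summary(csv_path: str, score_col: str) -> dict:
    """Summarize a judge output CSV: count, mean score, and percent of zeros."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Skip rows whose score cell is empty (e.g. interrupted runs)
        scores = [int(row[score_col]) for row in csv.DictReader(f) if row[score_col]]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "zero_pct": 100 * scores.count(0) / len(scores),
    }
```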

Resume / Incremental Runs

All tasks check for already-computed rows before starting:

  • generate — skips rows where the explanation column is already filled. Re-run the same command with the same model and prompt to continue an interrupted run.
  • judge — skips rows where the score column is already filled. Pass the partially-scored CSV as --input to continue.

This means it is safe to interrupt any run and restart it without losing progress.
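The assumed skip logic amounts to filtering on empty cells; a minimal sketch (illustrative only, the actual checks live in `vivid/generate.py` and `vivid/judge.py`):

```python
def pending_rows(rows, column):
    """Indices of rows whose target column is still empty and needs computing."""
    return [i for i, row in enumerate(rows)
            if not str(row.get(column) or "").strip()]
```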


Module Reference

The vivid/ package can also be imported directly in your own scripts.

```python
from vivid.prompts import build_zero_shot_message, build_few_shot_message, build_judge_prompt
from vivid.utils import safe_model_name, timestamp, extract_number
from vivid.generate import run_generate
from vivid.judge import run_judge
from vivid.discriminate import run_discriminate
from vivid.pipeline import run_full_pipeline
```
| Module | Contents |
|--------|----------|
| `vivid/constants.py` | Default paths, model lists, `GEMINI_BASE_URL` |
| `vivid/utils.py` | `timestamp`, `safe_model_name`, `ensure_results_dir`, `extract_number`, `print_section`, `print_summary_table`, `print_score_distribution` |
| `vivid/prompts.py` | `build_zero_shot_message`, `build_few_shot_message`, `build_judge_prompt` |
| `vivid/generate.py` | `run_generate` — vLLM and API explanation generation |
| `vivid/judge.py` | `run_judge` — GPT-4.1 aspect-based scoring |
| `vivid/discriminate.py` | `run_discriminate` — lm_eval topic/pattern classification |
| `vivid/pipeline.py` | `run_full_pipeline` — generate → judge in sequence |
| `vivid_eval.py` | CLI entry point — argument parsing and dispatch only |