
VIVID Evaluation Framework

A modular, command-line evaluation framework for the VIVID benchmark — Vietnamese Idioms for Validation and Interpretation Depth.

Supports all benchmark tasks from the paper: generative explanation evaluation, LLM-as-a-Judge scoring, and discriminative classification — for both open-source models (via vLLM) and API models (OpenAI, Gemini).


Benchmark & Method Overview

VIVID (Vietnamese Idioms for Validation and Interpretation Depth) is a culturally grounded benchmark for evaluating LLMs’ Vietnamese figurative language understanding. The dataset contains 1,707 Vietnamese idioms and proverbs, and the released benchmark set used for evaluation contains 1,636 idiom–explanation pairs after human validation.

Dataset labels

Each idiom/proverb is annotated with five complexity characteristics that are especially error-prone for LLMs:

  1. Only literal expressions / literal over-metaphorization
  2. Pragmatic nuances (sarcasm/irony/negative connotations)
  3. Uncommon vocabulary
  4. Archaic/outdated terms
  5. Customary / folk-knowledge-based expressions

In addition, each item is categorized into 7 semantic themes: Love, Virtues, Criticism, Work and Nature, Society, Life Lessons, Other.

Evaluation tasks supported by this framework

This CLI framework reproduces all evaluation tracks described in the paper:

  • Generative explanation evaluation (generate) — produce Vietnamese explanations for idioms/proverbs.
  • LLM-as-a-Judge scoring (judge) — score explanations on a 0–5 scale with an aspect-based rubric.
  • Discriminative classification (discriminate) — multiple-choice topic/pattern classification using lm-eval-harness.

Scoring protocol (LLM-as-a-Judge)

We use a two-step evaluation setup with GPT-4.1 as the judge and compare four prompting strategies for judging: Zero-shot, Demonstration, Chain-of-Thought, and Aspect-based.

Aspect-based Evaluation shows the strongest alignment with human judgment (Cohen’s κ = 0.792), so this repo defaults to aspect-based judging.

Human evaluation reference

A manual evaluation on a random set of 200 samples reports strong agreement between two native Vietnamese annotators: Cohen’s κ = 0.913 and Pearson correlation = 0.912 on a 0–5 scale.
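For reference, unweighted Cohen's κ over a 0–5 scale can be computed as below. This is a minimal pure-Python sketch of the standard statistic, not code from this repo, and it assumes the unweighted variant (the paper does not state whether a weighted κ was used):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa between two annotators' label lists."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items both annotators labeled identically
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement under independent per-annotator label distributions
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((count_a[label] / n) * (count_b[label] / n) for label in count_a)
    return (p_o - p_e) / (1 - p_e)
```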



Project Structure

```
.
├── vivid_eval.py                        # CLI entry point — run all tasks from here
├── requirements.txt
├── vivid/                               # Core framework package
│   ├── __init__.py
│   ├── constants.py                     # Paths, model lists, API URLs
│   ├── utils.py                         # Shared helpers (formatting, file I/O, etc.)
│   ├── prompts.py                       # All prompt builders (zero-shot, few-shot, judge)
│   ├── generate.py                      # Task: explanation generation (vLLM + API)
│   ├── judge.py                         # Task: LLM-as-a-Judge scoring
│   ├── discriminate.py                  # Task: topic/pattern classification via lm_eval
│   └── pipeline.py                      # Task: full generate → judge pipeline
├── dataset/
│   ├── VIVID_Dataset.csv                # 1,636 idioms with ground-truth explanations
│   ├── VIVID_Semantic_Themes.csv        # 7 semantic theme labels
│   └── VIVID_Linguistic_Complexity_Taxonomys.csv  # 5 complexity trait labels
├── evaluation/
│   └── discriminative/
│       ├── topic.yaml                   # lm_eval task config: topic classification
│       ├── topic.json                   # lm_eval test data: topic classification
│       ├── pattern.yaml                 # lm_eval task config: pattern classification
│       └── pattern.json                 # lm_eval test data: pattern classification
└── results/                             # All outputs are saved here (auto-created)
```

Installation

```shell
# Install from the pinned requirements, or grab the core dependencies directly
pip install -r requirements.txt
# or:
pip install vllm openai pandas tqdm lm-eval
```

Set your API key as an environment variable (needed for judge, and for API-based generate / discriminate):

```shell
# OpenAI
export OPENAI_API_KEY=sk-...

# Gemini (via Google AI Studio): the Gemini key is also read from OPENAI_API_KEY,
# since requests go through Google's OpenAI-compatible endpoint
export OPENAI_API_KEY=AIza...
```

Quick Start

```shell
# 1. Generate explanations with an open-source model
python vivid_eval.py generate --model Qwen/Qwen3-14B --prompt zero-shot

# 2. Score the output with GPT-4.1 as judge
python vivid_eval.py judge \
    --input results/generate_Qwen_Qwen3-14B_zero-shot_<timestamp>.csv \
    --model-col Qwen_Qwen3-14B_zero_shot_explanation

# 3. Run topic + pattern classification
python vivid_eval.py discriminate --model Qwen/Qwen3-14B --task both

# Or run steps 1 + 2 together
python vivid_eval.py full-pipeline --model Qwen/Qwen3-14B --prompt zero-shot
```

Commands

All commands are run through vivid_eval.py. Every command accepts --help for full option details.

```shell
python vivid_eval.py --help
python vivid_eval.py generate --help
python vivid_eval.py judge --help
python vivid_eval.py discriminate --help
python vivid_eval.py full-pipeline --help
```

generate

Generates a Vietnamese explanation for each idiom/proverb in the dataset. Works with both local open-source models (via vLLM) and hosted API models.

Output column added to the CSV: <model_tag>_<zero_shot|few_shot>_explanation
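The column name is derived from the model ID and prompt style; a minimal sketch of the assumed naming convention (the repo's actual logic lives in `safe_model_name` in `vivid/utils.py` and may differ in detail):

```python
def explanation_column(model: str, prompt: str) -> str:
    """Assumed naming convention for the generated-explanation column."""
    model_tag = model.replace("/", "_")    # HuggingFace IDs contain '/'
    prompt_tag = prompt.replace("-", "_")  # 'zero-shot' -> 'zero_shot'
    return f"{model_tag}_{prompt_tag}_explanation"
```

This reproduces the column name used in the Quick Start example, `Qwen_Qwen3-14B_zero_shot_explanation`.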

Open-source model (runs locally via vLLM):

```shell
python vivid_eval.py generate \
    --model Qwen/Qwen3-14B \
    --prompt zero-shot

# Multi-GPU with bfloat16
python vivid_eval.py generate \
    --model meta-llama/Llama-4-Scout-17B-16E \
    --prompt few-shot \
    --tensor-parallel 2 \
    --dtype bfloat16 \
    --batch-size 32
```

API model:

```shell
# OpenAI
python vivid_eval.py generate \
    --model gpt-4o \
    --prompt few-shot \
    --api-key sk-...

# Gemini
python vivid_eval.py generate \
    --model gemini-2.5-flash \
    --prompt zero-shot \
    --api-key AIza...
```

All options:

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | (required) | HuggingFace model ID or API model name |
| `--prompt` | `zero-shot` | `zero-shot` or `few-shot` |
| `--dataset` | `dataset/VIVID_Dataset.csv` | Path to input dataset |
| `--batch-size` | `64` | vLLM inference batch size |
| `--max-tokens` | `150` | Max tokens to generate per explanation |
| `--temperature` | `0.7` | Sampling temperature |
| `--tensor-parallel` | `1` | Number of GPUs for tensor parallelism (vLLM) |
| `--gpu-memory-util` | `0.85` | Fraction of GPU memory to use (vLLM) |
| `--dtype` | `auto` | Model dtype: `auto`, `float16`, `bfloat16` |
| `--max-model-len` | `None` | Override max context length (vLLM) |
| `--api-key` | env var | API key (overrides `OPENAI_API_KEY`) |
| `--api-delay` | `1.0` | Seconds to wait between API calls |
| `--results-dir` | `results/` | Directory to save output CSV |

Output: results/generate_<model>_<prompt>_<timestamp>.csv


judge

Scores model-generated explanations using GPT-4.1 as an aspect-based judge, the strategy validated at Cohen's κ = 0.792 against human annotators in the paper.

Each explanation is scored 0–5 across four criteria (semantic accuracy, nuance, fluency, completeness), and an overall similarity score is returned. A score of 0 on criterion 1 (semantic accuracy) automatically sets the overall score to 0.
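The gating rule can be sketched as below. This is illustrative only: the field names are assumptions, not the repo's actual response parser, and only the criterion-1 gate is documented above.

```python
def gated_overall(scores: dict) -> int:
    """Apply the criterion-1 gate to a parsed judge response.

    `scores` is assumed to hold the four 0-5 criterion scores plus the
    judge's overall similarity score.
    """
    if scores["semantic_accuracy"] == 0:
        # Wrong meaning: the explanation scores 0 overall regardless of
        # nuance, fluency, or completeness.
        return 0
    return scores["overall"]
```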

Takes as input the CSV produced by generate, and adds a <col>_score column.

```shell
python vivid_eval.py judge \
    --input results/generate_Qwen_Qwen3-14B_zero-shot_20250101_120000.csv \
    --model-col Qwen_Qwen3-14B_zero_shot_explanation \
    --api-key sk-...

# Use a different judge model
python vivid_eval.py judge \
    --input results/generate_gpt-4o_few-shot_<timestamp>.csv \
    --model-col gpt-4o_few_shot_explanation \
    --judge-model gpt-4o \
    --api-key sk-...
```

All options:

| Flag | Default | Description |
|------|---------|-------------|
| `--input` | (required) | CSV file from the generate step |
| `--model-col` | (required) | Column name containing LLM explanations to score |
| `--judge-model` | `gpt-4.1` | Judge model to use |
| `--api-key` | env var | OpenAI API key |
| `--api-delay` | `1.0` | Seconds to wait between API calls |
| `--results-dir` | `results/` | Directory to save outputs |

Terminal output includes:

```
============================================================
  Judge Results Summary
============================================================
  Model column              Qwen_Qwen3-14B_zero_shot_explanation
  Judge model               gpt-4.1
  Total rows                1636
  Scored rows               1636
  Errors / skipped          0
  Mean score                1.0900
  Median score              1.0000
  Score 0 (%)               42.3%
  Score 5 (%)               3.1%

  Score distribution:
    0/5  ████████████████          692
    1/5  ████████                  412
    2/5  ████                      198
    3/5  ██                        134
    4/5  █                         108
    5/5                             92
```

Outputs:

  • results/judge_<model_col>_<timestamp>.csv — full dataset with scores added
  • results/judge_<model_col>_<timestamp>.json — summary statistics

discriminate

Runs multiple-choice classification tasks using the lm-evaluation-harness. Supports both API models and open-source models (via the vLLM backend in lm_eval). Uses exact-match evaluation and 3-shot prompting by default.

Two tasks are available:

| Task | `--task` value | Description |
|------|----------------|-------------|
| Topic classification | `topic` | Classify each idiom into 1 of 7 semantic themes |
| Pattern classification | `pattern` | Identify 1 or more of 5 linguistic complexity traits (multi-label) |
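For the multi-label pattern task, exact match naturally means the predicted trait set equals the gold set; a one-line sketch of that assumed semantics (label IDs are illustrative):

```python
def multilabel_exact_match(predicted, gold) -> bool:
    """True only if the model selects exactly the gold set of complexity traits."""
    return set(predicted) == set(gold)
```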

API model: Before running any command with an API model (e.g., gpt-4o, gemini-2.5-flash), export your API key:

```shell
export OPENAI_API_KEY=sk-...
python vivid_eval.py discriminate \
    --model gpt-4o \
    --task both \
    --rebuild-eval-data \
    --api-key sk-...
```

Open-source model (vLLM backend in lm_eval):

```shell
python vivid_eval.py discriminate \
    --model Qwen/Qwen3-14B \
    --rebuild-eval-data \
    --task both

# Multi-GPU, bfloat16, restricted context
python vivid_eval.py discriminate \
    --model meta-llama/Llama-4-Scout-17B-16E \
    --task topic \
    --tensor-parallel 2 \
    --dtype bfloat16 \
    --batch-size 16 \
    --rebuild-eval-data \
    --max-model-len 4096
```

All options:

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | (required) | HuggingFace model ID or API model name |
| `--task` | `both` | `topic`, `pattern`, or `both` |
| `--num-fewshot` | `3` | Number of few-shot examples passed to lm_eval |
| `--batch-size` | `32` | Inference batch size for open-source models |
| `--tensor-parallel` | `1` | Number of GPUs for tensor parallelism (vLLM) |
| `--gpu-memory-util` | `0.85` | Fraction of GPU memory to use (vLLM) |
| `--dtype` | `auto` | Model dtype: `auto`, `float16`, `bfloat16` |
| `--max-model-len` | `None` | Override max context length (vLLM) |
| `--api-key` | env var | API key for hosted models |
| `--results-dir` | `results/` | Directory to save outputs |

Output: results/discriminate_<model>_<task>_<timestamp>.json


full-pipeline

Convenience command that runs generate and judge back-to-back in a single call, automatically passing the generate output CSV into the judge step.

# Open-source model
python vivid_eval.py full-pipeline \
    --model Qwen/Qwen3-14B \
    --prompt zero-shot \
    --api-key sk-...
# API model, few-shot
python vivid_eval.py full-pipeline \
    --model gpt-4o \
    --prompt few-shot \
    --api-key sk-...

Accepts all flags from both generate and judge. The --api-key is used for both the generation step (if the model is API-based) and the judge step.


Supported Models

Open-source (run locally via vLLM)

| Model | HuggingFace ID |
|-------|----------------|
| Llama-4-Scout (109B) | `meta-llama/Llama-4-Scout-17B-16E` |
| DeepSeek-R1-Distill (14B) | `deepseek-ai/DeepSeek-R1-Distill-Qwen-14B` |
| Qwen3 (14B) | `Qwen/Qwen3-14B` |
| SEA-LION (8B) | `aisingapore/Llama-SEA-LION-v3-8B-IT` |
| Vistral (7B) | `Viet-Mistral/Vistral-7B-Chat` |
| GreenMind (14B) | `GreenNode/GreenMind-Medium-14B-R1` |
| VinaLLaMA (7B) | `vilm/vinallama-7b-chat` |

Any vLLM-compatible model on HuggingFace should also work.

API models

| Provider | Model IDs |
|----------|-----------|
| OpenAI | `gpt-4o`, `gpt-4.1`, `gpt-4o-mini` |
| Google | `gemini-2.5-flash`, `gemini-2.0-flash` |

Model type is detected automatically from the model name — no extra flags needed.
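A sketch of the kind of name-based routing implied here. The prefixes below are an assumption for illustration, not the repo's actual check (see `vivid/constants.py` for the real model lists):

```python
API_MODEL_PREFIXES = ("gpt-", "gemini-")  # assumed heuristic, not the repo's exact list

def is_api_model(model_name: str) -> bool:
    """Route hosted-model names to the API backend, everything else to vLLM."""
    return model_name.lower().startswith(API_MODEL_PREFIXES)
```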


Output Files

All outputs are written to results/ (configurable via --results-dir) with a timestamp in the filename to avoid overwriting previous runs.

| Command | Output file(s) |
|---------|----------------|
| `generate` | `generate_<model>_<prompt>_<timestamp>.csv` |
| `judge` | `judge_<model_col>_<timestamp>.csv` + `.json` |
| `discriminate` | `discriminate_<model>_<task>_<timestamp>.json` + lm_eval raw output |
| `full-pipeline` | All of the above from both steps |

The .csv files contain the full dataset with new columns appended. The .json files contain summary statistics suitable for reporting.

Example output CSV columns after a full pipeline run:

```
phrase | ground_truth_explanation | Qwen_Qwen3-14B_zero_shot_explanation | Qwen_Qwen3-14B_zero_shot_score
```
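The scored CSV can then be analyzed with standard tools; a stdlib-only sketch that recomputes two of the summary statistics shown above (`score_summary` is a hypothetical helper, not part of the repo; the score column follows the `<col>_score` convention):

```python
import csv
import statistics

def score_summary(csv_path: str, score_col: str) -> dict:
    """Summarize a judge output CSV: count, mean score, and percent of zeros."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Skip rows whose score cell is empty (e.g. interrupted runs)
        scores = [int(row[score_col]) for row in csv.DictReader(f) if row[score_col]]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "zero_pct": 100 * scores.count(0) / len(scores),
    }
```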

Resume / Incremental Runs

All tasks check for already-computed rows before starting:

  • generate — skips rows where the explanation column is already filled. Re-run the same command with the same model and prompt to continue an interrupted run.
  • judge — skips rows where the score column is already filled. Pass the partially-scored CSV as --input to continue.

This means it is safe to interrupt any run and restart it without losing progress.
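The assumed skip logic amounts to filtering on empty cells; a minimal sketch (illustrative only, the actual checks live in `vivid/generate.py` and `vivid/judge.py`):

```python
def pending_rows(rows, column):
    """Indices of rows whose target column is still empty and needs computing."""
    return [i for i, row in enumerate(rows)
            if not str(row.get(column) or "").strip()]
```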


Module Reference

The vivid/ package can also be imported directly in your own scripts.

```python
from vivid.prompts import build_zero_shot_message, build_few_shot_message, build_judge_prompt
from vivid.utils import safe_model_name, timestamp, extract_number
from vivid.generate import run_generate
from vivid.judge import run_judge
from vivid.discriminate import run_discriminate
from vivid.pipeline import run_full_pipeline
```
| Module | Contents |
|--------|----------|
| `vivid/constants.py` | Default paths, model lists, `GEMINI_BASE_URL` |
| `vivid/utils.py` | `timestamp`, `safe_model_name`, `ensure_results_dir`, `extract_number`, `print_section`, `print_summary_table`, `print_score_distribution` |
| `vivid/prompts.py` | `build_zero_shot_message`, `build_few_shot_message`, `build_judge_prompt` |
| `vivid/generate.py` | `run_generate` — vLLM and API explanation generation |
| `vivid/judge.py` | `run_judge` — GPT-4.1 aspect-based scoring |
| `vivid/discriminate.py` | `run_discriminate` — lm_eval topic/pattern classification |
| `vivid/pipeline.py` | `run_full_pipeline` — generate → judge in sequence |
| `vivid_eval.py` | CLI entry point — argument parsing and dispatch only |