
Evaluate on Primary Benchmarks (CVQA, xMMMU, m-ArenaHard, GlobalMGSM) #20

@engichang1467

Description


Run the best Phase 2 checkpoint(s) through the four primary benchmarks to establish initial v0 performance numbers. These results serve as the first quantitative validation of the Tiny Aya Vision pipeline and provide the baseline against which Phase 3 ablations (merge ratio, LoRA rank, script-specific compression) will be compared.

Benchmarks

| Benchmark | Type | Metric | Languages |
|---|---|---|---|
| CVQA | Culturally diverse VQA | VQA accuracy | 31 |
| xMMMU | Multimodal reasoning | Accuracy | 7 |
| m-ArenaHard | Open-ended text generation | Win rate | 23 |
| GlobalMGSM | Mathematical reasoning | Accuracy | 35 |

CVQA and xMMMU measure visual grounding; m-ArenaHard and GlobalMGSM measure text retention. Together they validate that multimodal training has not catastrophically degraded text capabilities.

Context

  • The evaluation harness should already be set up from Phase 1.
  • Results should be compared against: (1) Phase 1 blind baselines (Tiny Aya Base, no image input), (2) external baselines (Qwen3-VL-2B, Gemma 3-1B, etc.) from Phase 1.
  • Compute delta_vision = Score(Tiny Aya Vision) - Score(blind baseline) per benchmark to quantify the vision encoder's contribution.
  • If the merge-ratio ablation (issue #18, Implement Cross-Modal Merging and Ablate Merge Ratios, alpha = 0.3–0.7) is complete, evaluate the optimal-alpha checkpoint; otherwise, evaluate the best available checkpoint.
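The delta_vision computation above can be sketched as a simple per-benchmark difference. All score values below are placeholders for illustration, not real results:

```python
# Sketch of the delta_vision computation: per-benchmark difference between
# the vision model and the blind (no-image-input) baseline.
# All scores are PLACEHOLDER values, not real results.
vision_scores = {"CVQA": 42.0, "xMMMU": 35.5}  # Score(Tiny Aya Vision)
blind_scores = {"CVQA": 28.0, "xMMMU": 31.0}   # Score(blind baseline)

delta_vision = {
    bench: round(vision_scores[bench] - blind_scores[bench], 2)
    for bench in vision_scores
}
print(delta_vision)  # positive values => the vision encoder contributes
```

A positive delta on CVQA/xMMMU indicates the model is actually using the image, not just exploiting language priors.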

Dependencies

Acceptance Criteria

  • All four benchmarks (CVQA, xMMMU, m-ArenaHard, GlobalMGSM) are run on the best Phase 2 checkpoint.
  • Per-benchmark scores are recorded in a results table.
  • delta_vision is computed and reported for CVQA and xMMMU (visual grounding benchmarks).
  • Text retention comparison: m-ArenaHard and GlobalMGSM scores compared to Tiny Aya Base (text-only) numbers from Phase 1.
  • Results compared against external baselines (Qwen3-VL-2B at minimum).
  • Any evaluation failures or anomalies are documented.
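A minimal sketch of the results-table assembly implied by the criteria above, comparing the Phase 2 checkpoint against the blind and external baselines. Model names other than Qwen3-VL-2B and all score values are placeholders:

```python
# Build the per-benchmark results table described in the acceptance criteria.
# All scores are PLACEHOLDER values; real numbers come from the eval runs.
BENCHMARKS = ["CVQA", "xMMMU", "m-ArenaHard", "GlobalMGSM"]

results = {
    "Tiny Aya Vision": {"CVQA": 42.0, "xMMMU": 35.5, "m-ArenaHard": 50.0, "GlobalMGSM": 40.0},
    "Blind baseline":  {"CVQA": 28.0, "xMMMU": 31.0, "m-ArenaHard": 51.0, "GlobalMGSM": 41.0},
    "Qwen3-VL-2B":     {"CVQA": 45.0, "xMMMU": 38.0, "m-ArenaHard": 55.0, "GlobalMGSM": 43.0},
}

header = "| Model | " + " | ".join(BENCHMARKS) + " |"
sep = "|" + "---|" * (len(BENCHMARKS) + 1)
rows = [
    "| " + model + " | " + " | ".join(f"{scores[b]:.1f}" for b in BENCHMARKS) + " |"
    for model, scores in results.items()
]
table = "\n".join([header, sep] + rows)
print(table)
```

Keeping the table in this machine-readable form makes the Phase 3 ablation comparisons (merge ratio, LoRA rank, script-specific compression) a straightforward diff against these v0 numbers.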

Estimated Effort

1–2 days (assuming the evaluation harness is already functional from Phase 1)
