
Evaluate on Primary Benchmarks (CVQA, xMMMU, m-ArenaHard, GlobalMGSM) #20

@engichang1467

Description


Run the best Phase 2 checkpoint(s) through the four primary benchmarks to establish initial v0 performance numbers. These results serve as the first quantitative validation of the Tiny Aya Vision pipeline and provide the baseline against which Phase 3 ablations (merge ratio, LoRA rank, script-specific compression) will be compared.

Benchmarks

| Benchmark | Type | Metric | Languages |
|---|---|---|---|
| CVQA | Culturally diverse VQA | VQA accuracy | 31 |
| xMMMU | Multimodal reasoning | Accuracy | 7 |
| m-ArenaHard | Open-ended text generation | Win rate | 23 |
| GlobalMGSM | Mathematical reasoning | Accuracy | 35 |

CVQA and xMMMU measure visual grounding; m-ArenaHard and GlobalMGSM measure text retention. Together they validate that multimodal training has not catastrophically degraded text capabilities.

Context

  • The evaluation harness should already be set up from Phase 1.
  • Results should be compared against: (1) Phase 1 blind baselines (Tiny Aya Base, no image input), (2) external baselines (Qwen3-VL-2B, Gemma 3-1B, etc.) from Phase 1.
  • Compute delta_vision = Score(Tiny Aya Vision) - Score(blind baseline) per benchmark to quantify the vision encoder's contribution.
  • If the merge-ratio ablation (issue #18, Implement Cross-Modal Merging and Ablate Merge Ratios, alpha = 0.3–0.7) is complete, evaluate the optimal-alpha checkpoint; otherwise, evaluate the best available checkpoint.
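The delta_vision computation above can be sketched as a simple per-benchmark difference. All score values below are placeholders for illustration, not real results:

```python
# Sketch of the delta_vision computation: per-benchmark difference between
# the vision model and the blind (no-image-input) baseline.
# All scores are PLACEHOLDER values, not real results.
vision_scores = {"CVQA": 42.0, "xMMMU": 35.5}  # Score(Tiny Aya Vision)
blind_scores = {"CVQA": 28.0, "xMMMU": 31.0}   # Score(blind baseline)

delta_vision = {
    bench: round(vision_scores[bench] - blind_scores[bench], 2)
    for bench in vision_scores
}
print(delta_vision)  # positive values => the vision encoder contributes
```

A positive delta on CVQA/xMMMU indicates the model is actually using the image, not just exploiting language priors.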

Dependencies

Acceptance Criteria

  • All four benchmarks (CVQA, xMMMU, m-ArenaHard, GlobalMGSM) are run on the best Phase 2 checkpoint.
  • Per-benchmark scores are recorded in a results table.
  • delta_vision is computed and reported for CVQA and xMMMU (visual grounding benchmarks).
  • Text retention comparison: m-ArenaHard and GlobalMGSM scores compared to Tiny Aya Base (text-only) numbers from Phase 1.
  • Results compared against external baselines (Qwen3-VL-2B at minimum).
  • Any evaluation failures or anomalies are documented.
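A minimal sketch of the results-table assembly implied by the criteria above, comparing the Phase 2 checkpoint against the blind and external baselines. Model names other than Qwen3-VL-2B and all score values are placeholders:

```python
# Build the per-benchmark results table described in the acceptance criteria.
# All scores are PLACEHOLDER values; real numbers come from the eval runs.
BENCHMARKS = ["CVQA", "xMMMU", "m-ArenaHard", "GlobalMGSM"]

results = {
    "Tiny Aya Vision": {"CVQA": 42.0, "xMMMU": 35.5, "m-ArenaHard": 50.0, "GlobalMGSM": 40.0},
    "Blind baseline":  {"CVQA": 28.0, "xMMMU": 31.0, "m-ArenaHard": 51.0, "GlobalMGSM": 41.0},
    "Qwen3-VL-2B":     {"CVQA": 45.0, "xMMMU": 38.0, "m-ArenaHard": 55.0, "GlobalMGSM": 43.0},
}

header = "| Model | " + " | ".join(BENCHMARKS) + " |"
sep = "|" + "---|" * (len(BENCHMARKS) + 1)
rows = [
    "| " + model + " | " + " | ".join(f"{scores[b]:.1f}" for b in BENCHMARKS) + " |"
    for model, scores in results.items()
]
table = "\n".join([header, sep] + rows)
print(table)
```

Keeping the table in this machine-readable form makes the Phase 3 ablation comparisons (merge ratio, LoRA rank, script-specific compression) a straightforward diff against these v0 numbers.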

Estimated Effort

1–2 days (assuming the evaluation harness is already functional from Phase 1)
