Run the best Phase 2 checkpoint(s) through the four primary benchmarks to establish initial v0 performance numbers. These results serve as the first quantitative validation of the Tiny Aya Vision pipeline and provide the baseline against which Phase 3 ablations (merge ratio, LoRA rank, script-specific compression) will be compared.
## Benchmarks
| Benchmark | Type | Metric | Languages |
|---|---|---|---|
| CVQA | Culturally diverse VQA | VQA accuracy | 31 |
| xMMMU | Multimodal reasoning | Accuracy | 7 |
| m-ArenaHard | Open-ended text generation | Win rate | 23 |
| GlobalMGSM | Mathematical reasoning | Accuracy | 35 |
CVQA and xMMMU measure visual grounding; m-ArenaHard and GlobalMGSM measure text retention. Together they validate that multimodal training has not catastrophically degraded text capabilities.
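The table above can be captured as a small registry for looping the evaluation harness over all four benchmarks; this is a sketch, and the field names (`type`, `metric`, `languages`) are assumptions, not the harness's actual schema:

```python
# Hypothetical benchmark registry mirroring the table above.
# Field names are illustrative assumptions, not the real harness config.
BENCHMARKS = {
    "CVQA":        {"type": "vqa",        "metric": "accuracy", "languages": 31},
    "xMMMU":       {"type": "multimodal", "metric": "accuracy", "languages": 7},
    "m-ArenaHard": {"type": "open_gen",   "metric": "win_rate", "languages": 23},
    "GlobalMGSM":  {"type": "math",       "metric": "accuracy", "languages": 35},
}

# Split by what each benchmark validates.
VISUAL = {"CVQA", "xMMMU"}                       # visual grounding
TEXT_RETENTION = {"m-ArenaHard", "GlobalMGSM"}   # text retention
```

Keeping the split explicit makes it easy to report the two groups separately, as the acceptance criteria require.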
## Context
- The evaluation harness should already be set up from Phase 1.
- Results should be compared against: (1) Phase 1 blind baselines (Tiny Aya Base, no image input), (2) external baselines (Qwen3-VL-2B, Gemma 3-1B, etc.) from Phase 1.
- Compute `delta_vision = Score(Tiny Aya Vision) - Score(blind baseline)` per benchmark to quantify the vision encoder's contribution.
- If the merge-ratio ablation (#18, Implement Cross-Modal Merging and Ablate Merge Ratios (alpha = 0.3--0.7)) is complete, evaluate the optimal-alpha checkpoint. Otherwise, evaluate the best available checkpoint.
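The `delta_vision` computation described above is simple enough to sketch directly; the scores below are invented placeholders, with real numbers coming from the Phase 1 blind runs and the Phase 2 evaluation:

```python
# delta_vision = Score(Tiny Aya Vision) - Score(blind baseline), per benchmark.
# All scores below are placeholders for illustration only.
vision_scores = {"CVQA": 41.2, "xMMMU": 33.5}
blind_scores  = {"CVQA": 35.0, "xMMMU": 31.0}

def delta_vision(vision, blind):
    """Per-benchmark contribution of the vision encoder over the blind baseline."""
    return {bench: vision[bench] - blind[bench] for bench in vision}

deltas = delta_vision(vision_scores, blind_scores)
```

A positive delta on CVQA and xMMMU indicates the model is actually using the image input rather than answering from text priors alone.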
## Dependencies
- At least one trained and merged checkpoint from #17 (Train Adapter and Projection Layers on LLaVA-Pretrain Alignment Mix) and #18 (Implement Cross-Modal Merging and Ablate Merge Ratios (alpha = 0.3--0.7)).
- Phase 1 evaluation harness and blind baseline numbers.
## Acceptance Criteria
- All four benchmarks (CVQA, xMMMU, m-ArenaHard, GlobalMGSM) are run on the best Phase 2 checkpoint.
- Per-benchmark scores are recorded in a results table.
- `delta_vision` is computed and reported for CVQA and xMMMU (visual grounding benchmarks).
- Text retention comparison: m-ArenaHard and GlobalMGSM scores compared to Tiny Aya Base (text-only) numbers from Phase 1.
- Results compared against external baselines (Qwen3-VL-2B at minimum).
- Any evaluation failures or anomalies are documented.
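The criteria above call for a per-benchmark results table; a minimal sketch of assembling one in markdown follows, where every score is a placeholder rather than a measured result, and the baseline column would hold the blind baseline for visual benchmarks and Tiny Aya Base for text-retention benchmarks:

```python
# Sketch: render the per-benchmark results table as markdown.
# All numbers below are illustrative placeholders, not measured results.
def to_markdown(rows):
    """rows: list of (benchmark, vision_score, baseline_score) tuples."""
    lines = [
        "| Benchmark | Tiny Aya Vision | Baseline | Delta |",
        "|---|---|---|---|",
    ]
    for name, score, base in rows:
        lines.append(f"| {name} | {score:.1f} | {base:.1f} | {score - base:+.1f} |")
    return "\n".join(lines)

table = to_markdown([
    ("CVQA", 41.2, 35.0),         # vs. blind baseline
    ("m-ArenaHard", 28.0, 29.5),  # vs. Tiny Aya Base (text-only)
])
```

Emitting markdown directly keeps the table paste-ready for the issue tracker and for the Phase 3 ablation comparisons.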
## Estimated Effort
1–2 days (assuming the evaluation harness is already functional from Phase 1)