Project: Culturally Grounded Multilingual RAG Evaluation
Goal: Evaluate whether retrieval-grounded generation improves factual accuracy on culturally grounded knowledge sources, especially in underrepresented languages.
Languages
- English
- Uzbek
Datasets
- MIRACL retrieval dataset
- TyDi QA multilingual QA
- Uzbek Wikipedia corpus
- Optional: Lex.uz legal corpus
Experiment Variables
retrieval_mode:
- none
- vector
chunk_size:
- 256
- 512
chunk_overlap:
- 32
- 64
top_k:
- 3
- 5
prompt_style:
- baseline
- grounded
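The five variables above form a full factorial grid of 2^5 = 32 configurations. A minimal sketch of expanding it, using the variable names and values exactly as listed (the function and constant names are illustrative):

```python
import itertools

# Values copied from the Experiment Variables section.
GRID = {
    "retrieval_mode": ["none", "vector"],
    "chunk_size": [256, 512],
    "chunk_overlap": [32, 64],
    "top_k": [3, 5],
    "prompt_style": ["baseline", "grounded"],
}

def expand_grid(grid):
    """Expand a dict of lists into one config dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]

configs = expand_grid(GRID)  # 32 config dicts
```

Note that when retrieval_mode is "none", the chunking and top_k settings have no effect, so the matrix could be deduplicated before running if compute is tight.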
Evaluation Metrics
Primary
- grounded_answer_score
Secondary
- hallucination_rate
- unsupported_claim_rate
- retrieval_recall_at_k
- latency
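Of the metrics above, retrieval_recall_at_k has a standard definition: the fraction of gold-relevant passages that appear in the top-k retrieved list. A sketch, assuming relevant_ids is the gold set of passage ids for a question (names are illustrative):

```python
def retrieval_recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant passage ids found among the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0  # no gold passages for this question
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

For example, retrieving ["a", "b", "c", "d"] against gold {"a", "d"} at k=3 finds one of two relevant passages, giving 0.5.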
Experiment Rules
- All experiments must be config-driven
- Never overwrite previous experiment outputs
- Every run must produce:
- JSONL outputs
- CSV metrics
- experiment metadata
- Raw datasets must never be modified
- New experiments must create new folders under results/
- A small smoke test must pass before running large batch jobs
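The output rules above (new folder per run, never overwrite, three required artifacts) can be sketched as a small run-writer. The folder naming scheme and file names (outputs.jsonl, metrics.csv, metadata.json) are illustrative assumptions, not fixed by the spec:

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

def create_run_dir(base="results"):
    """Create a fresh, timestamped folder under results/; never overwrite."""
    run_id = datetime.now(timezone.utc).strftime("run_%Y%m%dT%H%M%S_%f")
    run_dir = Path(base) / run_id
    run_dir.mkdir(parents=True, exist_ok=False)  # raises if the folder exists
    return run_dir

def write_run_outputs(run_dir, records, metrics, config):
    """Write the three required artifacts: JSONL outputs, CSV metrics, metadata."""
    # JSONL outputs: one record (question, answer, retrieved passages) per line
    with open(run_dir / "outputs.jsonl", "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    # CSV metrics: a single summary row keyed by metric name
    with open(run_dir / "metrics.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(metrics))
        writer.writeheader()
        writer.writerow(metrics)
    # Experiment metadata: the exact config the run used
    (run_dir / "metadata.json").write_text(json.dumps(config, indent=2))
```

Using `exist_ok=False` makes the "never overwrite previous outputs" rule fail loudly instead of silently clobbering a prior run.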
Execution Phases
Phase 1: Environment setup and dataset download
Phase 2: Corpus preprocessing and chunking
Phase 3: Vector index construction
Phase 4: Baseline evaluation
Phase 5: Full experiment matrix
Phase 6: Aggregation and report generation
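The chunking step in Phase 2, driven by the chunk_size and chunk_overlap variables, can be sketched as a sliding window over an already-tokenized sequence (tokenization is assumed to happen upstream; the function name is illustrative):

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token sequence into windows of chunk_size tokens,
    each sharing chunk_overlap tokens with the previous window."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reaches the end of the corpus
    return chunks
```

With chunk_size=256 and chunk_overlap=32, consecutive chunks advance by 224 tokens, so a 1000-token document yields 5 chunks and each chunk's last 32 tokens reappear at the start of the next.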