- [ ] Create prompts/ with versioned prompt templates (e.g., prompts/prompt_v1.py, prompts/prompt_v2.py). - [ ] Build an A/B test runner in test/ab_test.py that runs multiple prompt templates against the same dataset and produces comparative metrics. - [ ] Store results and recommend the best model/prompt per metric.