Fine-tune prompt templates and add A/B prompt tests

- [ ] Create prompts/ with versioned prompt templates (e.g., prompts/prompt_v1.py, prompts/prompt_v2.py).

- [ ] Build an A/B test runner in test/ab_test.py that runs multiple prompt templates against the same dataset and produces comparative metrics.

- [ ] Store the A/B test results and recommend the best model/prompt for each metric.
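
A minimal sketch of how the three tasks above could fit together, under several assumptions: the templates are plain format strings (inlined here in place of `prompts/prompt_v1.py` and `prompts/prompt_v2.py`), the model is any callable from prompt to text (`stub_model` is a deterministic placeholder, not a real API), and every metric is "higher is better". The names `run_ab_test`, `recommend`, `save_results`, `exact_match`, and `brevity` are illustrative and are not prescribed by this task.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-ins for prompts/prompt_v1.py and prompts/prompt_v2.py.
PROMPTS: Dict[str, str] = {
    "prompt_v1": "Answer the question: {question}",
    "prompt_v2": "Question: {question}\nAnswer briefly.",
}

# Tiny illustrative dataset; a real run would load a shared evaluation set.
DATASET: List[Dict[str, str]] = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What is the capital of France?", "expected": "Paris"},
]

Metric = Callable[[str, str], float]


def stub_model(prompt: str) -> str:
    """Deterministic placeholder for a real model call (assumption)."""
    answer = "4" if "2 + 2" in prompt else "Paris"
    return answer if "briefly" in prompt else f"The answer is {answer}."


def exact_match(output: str, expected: str) -> float:
    """1.0 if the stripped output equals the expected answer, else 0.0."""
    return float(output.strip() == expected)


def brevity(output: str, expected: str) -> float:
    """Ratio of expected length to output length, capped at 1.0."""
    return min(1.0, len(expected) / max(len(output), 1))


METRICS: Dict[str, Metric] = {"exact_match": exact_match, "brevity": brevity}


def run_ab_test(
    prompts: Dict[str, str],
    dataset: List[Dict[str, str]],
    model: Callable[[str], str],
    metrics: Dict[str, Metric],
) -> Dict[str, Dict[str, float]]:
    """Run every template on the same dataset; return mean score per metric."""
    results: Dict[str, Dict[str, float]] = {}
    for name, template in prompts.items():
        outputs = [model(template.format(question=row["question"])) for row in dataset]
        results[name] = {
            metric_name: mean(fn(out, row["expected"]) for out, row in zip(outputs, dataset))
            for metric_name, fn in metrics.items()
        }
    return results


def recommend(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick the highest-scoring prompt per metric (ties go to the first prompt)."""
    metric_names = list(next(iter(results.values())))
    return {m: max(results, key=lambda name: results[name][m]) for m in metric_names}


def save_results(results: Dict[str, Dict[str, float]], path: str) -> None:
    """Persist the comparison as JSON so later runs can be compared."""
    Path(path).write_text(json.dumps(results, indent=2))
```

One possible usage: call `run_ab_test(PROMPTS, DATASET, stub_model, METRICS)`, pass the result to `save_results` with a path of your choice, and print `recommend(results)`. Injecting the model as a callable keeps the runner testable without network access; swapping in a real client only requires a function with the same prompt-to-text shape.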