# End-to-End Demo Walkthrough

This document walks through a complete demo of the evaluation-driven fine-tuning loop.

## What the demo does

1. **Loads 20 QA training examples** from `demo_data.jsonl`
2. **Fine-tunes** Llama-3.1-8B with LoRA for 5 steps (Round 1)
3. **Evaluates** the model on 5 test questions
4. **Checks the threshold**: if accuracy < 75%, triggers Round 2
5. **Adjusts hyperparameters**: reduces the learning rate by 40% (0.0003 → 0.00018)
6. **Repeats** for up to 3 rounds or until the threshold is met (see the loop sketch below)
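
The sketch below shows the shape of that loop. It is a minimal illustration, not the demo's actual code: `train_one_round` and `run_evaluations` are stand-ins for the demo's training and evaluation steps, and only the numbers listed above (initial LR 0.0003, 0.6x decay, threshold 0.75, 3 rounds) come from the demo settings.

```python
# Minimal sketch of the evaluation-driven loop (helper names are illustrative).
lr = 0.0003          # initial learning rate
threshold = 0.75     # eval_threshold in demo_config.json
max_rounds = 3       # max_rounds in demo_config.json

for round_num in range(1, max_rounds + 1):
    print(f"=== Training round {round_num}/{max_rounds} ===")
    checkpoint = train_one_round(lr=lr)   # LoRA fine-tuning step(s) on Tinker
    score = run_evaluations(checkpoint)   # fraction of test questions answered correctly
    if score >= threshold:
        print(f"Target met: {score:.4f} >= {threshold}. Stopping.")
        break
    lr *= 0.6                             # reduce LR by 40% before the next round
```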

## Expected output

```
╔════════════════════════════════════════════════════════════╗
║          Tinker Evaluation-Driven Fine-Tuning Demo         ║
╚════════════════════════════════════════════════════════════╝

✓ Tinker API key found
📦 Installing dependencies...
✓ Dependencies installed

🚀 Starting evaluation-driven training loop...
  - Base model: meta-llama/Llama-3.1-8B-Instruct
  - Training data: 20 examples (demo_data.jsonl)
  - Max rounds: 3
  - Eval threshold: 0.75
  - Initial LR: 0.0003 (decays by 0.6x per round)

Creating LoRA training client for meta-llama/Llama-3.1-8B-Instruct...
Loaded 20 examples from demo_data.jsonl
Filtered to 20 valid examples
Prepared 20 training datums

=== Training round 1/3 ===
Saving model checkpoint...
Checkpoint saved at tinker://checkpoint-abc123

Running evaluations...
  Running 5 test questions...
  ✗ Question 1: Incorrect
  ✓ Question 2: Correct
  ✗ Question 3: Incorrect
  ✓ Question 4: Correct
  ✗ Question 5: Incorrect
  Evaluation complete: 2/5 correct
  Accuracy: 40.00%
Evaluation score: 0.4000

Score below threshold (0.75). Preparing next round...

=== Training round 2/3 ===
Saving model checkpoint...
Checkpoint saved at tinker://checkpoint-def456

Running evaluations...
  Running 5 test questions...
  ✓ Question 1: Correct
  ✓ Question 2: Correct
  ✓ Question 3: Correct
  ✗ Question 4: Incorrect
  ✓ Question 5: Correct
  Evaluation complete: 4/5 correct
  Accuracy: 80.00%
Evaluation score: 0.8000

Target met: 0.8000 >= 0.75. Stopping.

Training loop completed.

✅ Demo complete!

What happened:
  1. Loaded 20 training examples
  2. Fine-tuned model with LoRA on Tinker infrastructure
  3. Evaluated model on QA tasks
  4. If score < 0.75: adjusted LR and started Round 2
  5. Repeated until threshold met or max rounds reached
```

## Key observations

### Round 1
- **LR**: 0.0003
- **Score**: ~40% (below threshold)
- **Action**: Decay LR to 0.00018, start Round 2

### Round 2
- **LR**: 0.00018 (60% of previous)
- **Score**: ~80% (above threshold)
- **Action**: Stop training ✓
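
The learning rate follows a simple geometric decay across rounds. A quick sketch to reproduce the schedule (the 0.0003 starting point and 0.6 factor come from the demo settings; a third round is only reached if Round 2 also misses the threshold):

```python
initial_lr = 0.0003
decay = 0.6

# Print the learning rate used in each of the (up to) three rounds.
for round_num in range(1, 4):
    print(round_num, round(initial_lr * decay ** (round_num - 1), 6))
# 1 0.0003
# 2 0.00018
# 3 0.000108
```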

## Customizing the demo

### Use your own data
Replace `demo_data.jsonl` with your own JSONL file, one example per line:

```json
{"instruction": "Your instruction here", "output": "Expected output"}
```
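
Before kicking off a round with new data, a small sanity check catches missing fields early. The snippet below is illustrative rather than part of the demo; it only assumes the `instruction`/`output` fields shown above.

```python
import json

# Load the JSONL file and keep only examples with both required fields.
with open("demo_data.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

valid = [ex for ex in examples if ex.get("instruction") and ex.get("output")]
print(f"Loaded {len(examples)} examples, {len(valid)} valid")
```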

### Adjust the threshold
Edit `demo_config.json` (JSON does not allow comments, so the file holds only keys and values):

```json
{
  "eval_threshold": 0.85,
  "max_rounds": 5
}
```

Raising `eval_threshold` makes the stopping requirement stricter; raising `max_rounds` allows more training rounds before giving up.

### Enable EvalOps tracking
Set the API key in your environment and update the config:

```bash
export EVALOPS_API_KEY=your-key
```

```json
{
  "evalops_enabled": true,
  "evalops_test_suite_id": "your-suite-id"
}
```

Every round will then be tracked in EvalOps with full metrics and checkpoint URIs.

### Use real Inspect AI tasks
Replace the simple evaluator in `run_evaluations()` with Inspect AI integration:

```python
from inspect_ai import eval_async  # eval() is synchronous; eval_async() suits an async run_evaluations()

# Run actual Inspect AI tasks. The task names assume the corresponding tasks
# are installed and registered (e.g. via the inspect_evals package).
results = await eval_async(
    tasks=["ifeval", "mmlu"],
    model=model_path,
    model_args={"renderer": renderer_name},
)
```
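
The loop's threshold check needs a single number, so the returned logs have to be reduced to one score. A hedged sketch of that reduction follows; the attribute names (`results.scores`, `metrics`, `value`) reflect recent `inspect_ai` versions and may need adjusting for yours.

```python
def overall_accuracy(logs):
    """Average the 'accuracy' metric across the returned EvalLog objects."""
    values = []
    for log in logs:
        if log.results is None:  # a failed eval has no results
            continue
        for score in log.results.scores:
            metric = score.metrics.get("accuracy")
            if metric is not None:
                values.append(metric.value)
    return sum(values) / len(values) if values else 0.0
```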

## Production checklist

Before using in production:

- [ ] Replace `simple_eval.py` with real Inspect AI task integration
- [ ] Implement proper data pipeline (deduplication, quality filters)
- [ ] Add batching and multiple steps per round
- [ ] Enable gradient accumulation and mixed precision
- [ ] Add checkpointing and resume capability
- [ ] Configure EvalOps integration for centralized tracking
- [ ] Set up alerting for threshold violations
- [ ] Add data selection based on failure analysis
- [ ] Tune hyperparameters for your specific domain