Commit b9ac4f3

Link to DEMO.md from README
1 parent 7de0db7 commit b9ac4f3

File tree: 1 file changed (+157 −0 lines)


DEMO.md

Lines changed: 157 additions & 0 deletions
# End-to-End Demo Walkthrough

This document walks through a complete run of the evaluation-driven fine-tuning loop.

## What the demo does

1. **Loads 20 QA training examples** from `demo_data.jsonl`
2. **Fine-tunes** Llama-3.1-8B with LoRA for 5 steps (Round 1)
3. **Evaluates** the model on 5 test questions
4. **Checks the threshold**: if accuracy < 75%, triggers Round 2
5. **Adjusts hyperparameters**: reduces the learning rate by 40% (0.0003 → 0.00018)
6. **Repeats** for up to 3 rounds or until the threshold is met (the full loop is sketched below)
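The control flow behind these steps fits in a few lines. The sketch below is illustrative only: `train_one_round` and `evaluate` are placeholders for the demo's actual training and evaluation helpers, and the constants mirror the configuration shown in the expected output.

```python
# Sketch of the demo's control flow, not its exact implementation.
# train_one_round and evaluate are placeholders for the demo's real helpers.
MAX_ROUNDS = 3
EVAL_THRESHOLD = 0.75
LR_DECAY = 0.6  # learning rate shrinks to 60% after each failed round

def run_loop(examples, train_one_round, evaluate, initial_lr=3e-4):
    lr = initial_lr
    checkpoint = None
    for round_idx in range(1, MAX_ROUNDS + 1):
        print(f"=== Training round {round_idx}/{MAX_ROUNDS} ===")
        checkpoint = train_one_round(examples, learning_rate=lr)  # LoRA fine-tune (5 steps in the demo)
        score = evaluate(checkpoint)  # fraction of the 5 test questions answered correctly
        print(f"Evaluation score: {score:.4f}")
        if score >= EVAL_THRESHOLD:
            print(f"Target met: {score:.4f} >= {EVAL_THRESHOLD}. Stopping.")
            break
        lr *= LR_DECAY  # 0.0003 -> 0.00018 -> 0.000108
        print(f"Score below threshold ({EVAL_THRESHOLD}). Preparing next round...")
    return checkpoint
```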
## Expected output

```
╔════════════════════════════════════════════════════════════╗
║         Tinker Evaluation-Driven Fine-Tuning Demo          ║
╚════════════════════════════════════════════════════════════╝

✓ Tinker API key found
📦 Installing dependencies...
✓ Dependencies installed

🚀 Starting evaluation-driven training loop...
- Base model: meta-llama/Llama-3.1-8B-Instruct
- Training data: 20 examples (demo_data.jsonl)
- Max rounds: 3
- Eval threshold: 0.75
- Initial LR: 0.0003 (decays by 0.6x per round)

Creating LoRA training client for meta-llama/Llama-3.1-8B-Instruct...
Loaded 20 examples from demo_data.jsonl
Filtered to 20 valid examples
Prepared 20 training datums

=== Training round 1/3 ===
Saving model checkpoint...
Checkpoint saved at tinker://checkpoint-abc123

Running evaluations...
Running 5 test questions...
✗ Question 1: Incorrect
✓ Question 2: Correct
✗ Question 3: Incorrect
✓ Question 4: Correct
✗ Question 5: Incorrect
Evaluation complete: 2/5 correct
Accuracy: 40.00%
Evaluation score: 0.4000

Score below threshold (0.75). Preparing next round...

=== Training round 2/3 ===
Saving model checkpoint...
Checkpoint saved at tinker://checkpoint-def456

Running evaluations...
Running 5 test questions...
✓ Question 1: Correct
✓ Question 2: Correct
✓ Question 3: Correct
✗ Question 4: Incorrect
✓ Question 5: Correct
Evaluation complete: 4/5 correct
Accuracy: 80.00%
Evaluation score: 0.8000

Target met: 0.8000 >= 0.75. Stopping.

Training loop completed.

✅ Demo complete!

What happened:
1. Loaded 20 training examples
2. Fine-tuned model with LoRA on Tinker infrastructure
3. Evaluated model on QA tasks
4. If score < 0.75: adjusted LR and started Round 2
5. Repeated until threshold met or max rounds reached
```
## Key observations

### Round 1
- **LR**: 0.0003
- **Score**: ~40% (below threshold)
- **Action**: Decay LR to 0.00018, start Round 2

### Round 2
- **LR**: 0.00018 (60% of previous)
- **Score**: ~80% (above threshold)
- **Action**: Stop training ✓
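The learning rate follows the 0.6x-per-round decay described above; checking the schedule is simple arithmetic (this snippet uses no project code):

```python
lr = 0.0003
for round_idx in range(1, 4):
    print(f"Round {round_idx}: lr = {lr:.6f}")  # 0.000300, 0.000180, 0.000108
    lr *= 0.6
```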
## Customizing the demo

### Use your own data
Replace `demo_data.jsonl` with your own JSONL file (one object per line):

```json
{"instruction": "Your instruction here", "output": "Expected output"}
```
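A small loader like the one below catches malformed lines before they reach training. It is a generic illustration, not the demo's own loading code (the demo's loader also filters examples, as the expected output shows):

```python
import json

def load_examples(path):
    """Read instruction/output pairs from a JSONL file, skipping invalid lines."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                print(f"Skipping line {lineno}: not valid JSON")
                continue
            if "instruction" in record and "output" in record:
                examples.append(record)
            else:
                print(f"Skipping line {lineno}: missing 'instruction' or 'output'")
    return examples

examples = load_examples("demo_data.jsonl")
print(f"Loaded {len(examples)} examples")
```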
### Adjust the threshold
Edit `demo_config.json`. Raising `eval_threshold` makes the requirement stricter; raising `max_rounds` allows more rounds:

```json
{
  "eval_threshold": 0.85,
  "max_rounds": 5
}
```
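The demo reads these values at startup. A minimal version of that read, with default values assumed to match the demo's behavior, looks like this:

```python
import json

with open("demo_config.json", encoding="utf-8") as f:
    config = json.load(f)

eval_threshold = config.get("eval_threshold", 0.75)  # assumed default
max_rounds = config.get("max_rounds", 3)             # assumed default
print(f"Stop once accuracy >= {eval_threshold} or after {max_rounds} rounds")
```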
### Enable EvalOps tracking
Set the environment variable and update the config:

```bash
export EVALOPS_API_KEY=your-key
```

```json
{
  "evalops_enabled": true,
  "evalops_test_suite_id": "your-suite-id"
}
```

Every round will now be tracked in EvalOps with full metrics and checkpoint URIs.
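What "tracked" means in practice depends on your EvalOps setup; the sketch below only illustrates the idea of posting one round's metrics. The endpoint URL, payload fields, and auth header are assumptions for illustration, not EvalOps's documented API:

```python
import os
import requests  # third-party HTTP client, assumed available

def report_round(round_idx, score, checkpoint_uri, suite_id):
    """Hypothetical sketch: push one round's metrics to an EvalOps-style endpoint."""
    resp = requests.post(
        "https://evalops.example/api/runs",  # placeholder URL, not a real endpoint
        headers={"Authorization": f"Bearer {os.environ['EVALOPS_API_KEY']}"},
        json={
            "test_suite_id": suite_id,
            "round": round_idx,
            "accuracy": score,
            "checkpoint_uri": checkpoint_uri,
        },
        timeout=30,
    )
    resp.raise_for_status()
```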
### Use real Inspect AI tasks
Replace the simple evaluator in `run_evaluations()` with Inspect AI integration:

```python
from inspect_ai import eval_async

# Run actual Inspect AI tasks against the fine-tuned checkpoint
# (eval_async is the awaitable variant of inspect_ai's eval entry point)
results = await eval_async(
    tasks=["ifeval", "mmlu"],
    model=model_path,
    model_args={"renderer": renderer_name},
)
```
## Production checklist

Before using in production:

- [ ] Replace `simple_eval.py` with real Inspect AI task integration
- [ ] Implement a proper data pipeline (deduplication, quality filters)
- [ ] Add batching and multiple steps per round (a rough sketch follows this list)
- [ ] Enable gradient accumulation and mixed precision
- [ ] Add checkpointing and resume capability
- [ ] Configure EvalOps integration for centralized tracking
- [ ] Set up alerting for threshold violations
- [ ] Add data selection based on failure analysis
- [ ] Tune hyperparameters for your specific domain
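For the batching item above, one way to move beyond the demo's 5 flat steps is to chunk the datums into batches and take several optimizer steps per round. The sketch below is generic: `step_fn` stands in for whatever forward/backward-plus-optimizer call your training stack provides, and is not a specific Tinker API:

```python
def batches(datums, batch_size):
    """Yield successive batches of training datums."""
    for start in range(0, len(datums), batch_size):
        yield datums[start:start + batch_size]

def train_round(datums, step_fn, batch_size=8, steps_per_round=20):
    """Illustrative only: several optimizer steps per round instead of the demo's 5."""
    step = 0
    while step < steps_per_round:
        for batch in batches(datums, batch_size):
            step_fn(batch)  # stand-in for the real forward/backward + optimizer step
            step += 1
            if step >= steps_per_round:
                break
```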
