# Key Improvements Based on Tinker Docs Review

## ✅ Implemented

### 1. **Proper Renderers with Loss Masking**
- **What**: Using `renderers.build_supervised_example()` instead of naive tokenization
- **Why**: Ensures loss is computed only on assistant outputs, not prompts, and aligns with the model's chat format
- **Impact**: Fixes fundamental training correctness
- **Code**: See `data_loader.py`; falls back gracefully if tinker-cookbook is unavailable (sketch below)
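
A minimal sketch of the pattern, assuming a `training_client` as in #2 and the cookbook's `get_renderer()` helper; the renderer name, message format, and the exact `build_supervised_example()` signature are assumptions to check against the tinker-cookbook source:

```python
from tinker_cookbook import renderers

# Tokenizer comes from the training client; the renderer name must match the model.
tokenizer = training_client.get_tokenizer()
renderer = renderers.get_renderer("llama3", tokenizer)  # assumed renderer name

messages = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A low-rank adaptation method for fine-tuning."},
]

# Tokenizes the conversation and masks the loss so only assistant tokens
# contribute (prompt tokens get weight 0). The module-level call shape follows
# the bullet above; it may instead be a method on the renderer.
datum = renderers.build_supervised_example(renderer, messages)
```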

### 2. **LoRA Rank Configuration**
- **What**: Added `lora_rank` parameter (default: 16)
- **Why**: Controls adapter capacity and training cost
- **Impact**: Allows tuning the speed/quality tradeoff
- **Code**: `config_schema.py`; passed to `create_lora_training_client(rank=...)` (sketch below)
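
For reference, a sketch of the wiring (the model name is a placeholder):

```python
import tinker

service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.1-8B",  # placeholder
    rank=16,  # cfg.lora_rank: higher rank = more adapter capacity and more cost
)
```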

### 3. **save_weights_for_sampler() for Evaluation**
- **What**: Using `save_weights_for_sampler()` instead of `save_state()` for eval
- **Why**: `save_state()` also captures optimizer state (needed for resuming); eval only needs weights
- **Impact**: Faster checkpointing, correct eval pattern
- **Code**: `trainer_with_eval.py`, line ~234 (sketch below)
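
A condensed version of that call sequence; the checkpoint name is illustrative, and the future-returning shape follows the Tinker docs:

```python
# Sampler-only checkpoint: weights, no optimizer state.
sampler_future = training_client.save_weights_for_sampler(name=f"eval-step-{step}")
sampler_path = sampler_future.result().path

# The returned path can be handed straight to a sampling client for eval.
sampling_client = service_client.create_sampling_client(model_path=sampler_path)
```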

### 4. **Recommended Learning Rate Formula**
- **What**: Tinker's LR formula: `LR = 5e-5 × 10 × (2000/H_m)^P_m`
- **Why**: Tuned for LoRA + Tinker infrastructure across model sizes
- **Impact**: Better convergence, less hyperparameter search
- **Code**: `hyperparam_utils.py`, `get_recommended_lr(model_name)` (sketch below)
- **Usage**: Set `use_recommended_lr: true` in config
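
A minimal sketch of the formula, assuming `H_m` is the model's hidden size and `P_m` a model-family exponent; in the real `get_recommended_lr(model_name)` both would come from a per-model lookup:

```python
def recommended_lr(hidden_size: int, power: float) -> float:
    """Tinker's recommended LoRA LR: 5e-5 * 10 * (2000 / H_m) ** P_m."""
    return 5e-5 * 10 * (2000 / hidden_size) ** power

# Hypothetical values for an 8B model (hidden size 4096, P_m = 1.0):
lr = recommended_lr(hidden_size=4096, power=1.0)  # ~2.4e-4
```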

### 5. **Warmup + Cosine LR Schedule**
- **What**: Linear warmup → cosine decay to `min_lr`
- **Why**: Stabilizes training; standard best practice
- **Impact**: Reduces instability, improves final metrics
- **Code**: `hyperparam_utils.py`, `get_lr_with_warmup(step, ...)` (sketch below)
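
A self-contained sketch of such a schedule (the parameter list beyond `step` is assumed, since the real signature is elided above):

```python
import math

def get_lr_with_warmup(step, total_steps, warmup_steps, peak_lr, min_lr):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```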

## 🚧 To Implement (Production-Ready)

### 6. **Async Futures for Overlapping Requests**
- **What**: Use `forward_backward_async()` and `optim_step_async()` with batching
- **Why**: Overlaps compute and network I/O, and avoids missing Tinker's ~10s clock cycles
- **Impact**: 2-3x faster training throughput
- **Code**: Replace `run_training_round()` with an async micro-batching loop, roughly as below (the `schedule_kwargs` placeholder stands in for the elided LR-schedule parameters)
- **Docs**: https://tinker-docs.thinkingmachines.ai/async

```python
from tinker import types

async def run_training_round_async(training_client, datums, cfg, step_offset):
    batches = [datums[i:i + cfg.batch_size]
               for i in range(0, len(datums), cfg.batch_size)]
    num_steps = min(cfg.steps_per_round, len(batches))

    # Submit one forward/backward + optimizer step per micro-batch without
    # waiting on results, so requests overlap Tinker's clock cycles.
    futures = []
    for i in range(num_steps):
        lr = get_lr_with_warmup(step=step_offset + i, **schedule_kwargs)
        fb_future = await training_client.forward_backward_async(batches[i], "cross_entropy")
        opt_future = await training_client.optim_step_async(types.AdamParams(learning_rate=lr))
        futures.append((fb_future, opt_future))

    # Block on the queued results only after everything has been submitted.
    for fb_future, opt_future in futures:
        await fb_future.result_async()  # carries loss/metrics for logging
        await opt_future.result_async()
```

### 7. **Proper Inspect AI Integration**
- **What**: Use `InspectAPIFromTinkerSampling` with actual Inspect tasks
- **Why**: Real evals instead of simulated scores
- **Impact**: Demo credibility, production evals
- **Code**: In `run_evaluations()`, create a sampling client from the saved weights URI, roughly as below (wrapping the API in inspect_ai's `Model` is an assumption; check `inspect_utils` for the intended entry point):

```python
from inspect_ai import eval_async
from inspect_ai.model import GenerateConfig, Model
from tinker_cookbook.eval.inspect_utils import InspectAPIFromTinkerSampling

sampling_client = service_client.create_sampling_client(model_path=model_path)
api = InspectAPIFromTinkerSampling(
    renderer_name=renderer_name,
    model_name=model_name,
    sampling_client=sampling_client,
)
results = await eval_async(tasks=tasks, model=Model(api=api, config=GenerateConfig()))
```

### 8. **Batch Multiple Examples Per Step**
- **What**: Currently doing 1 step per round; should do `steps_per_round` steps with batching, as in the #6 loop
- **Why**: Realistic training (the docs suggest 100+ steps minimum)
- **Impact**: Proper gradient signal, better results
- **Dependency**: Requires #6 (async futures); see the sizing check below
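
A quick sizing check against that 100-step floor (`cfg.num_rounds` and the other field names are assumptions based on `config_schema.py`):

```python
# With len(datums) examples and batch_size B, one pass over the data
# yields at most len(datums) // B optimizer steps.
steps_per_round = min(cfg.steps_per_round, len(datums) // cfg.batch_size)
total_steps = cfg.num_rounds * steps_per_round
assert total_steps >= 100, f"only {total_steps} steps planned; docs suggest 100+"
```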

### 9. **Resume from Checkpoint**
- **What**: `load_state()` to continue from a saved checkpoint
- **Why**: Long training runs, experimentation, recovery
- **Code**: Add a `--resume` flag + `load_state(checkpoint_uri)` before the loop (sketch below)
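
A sketch of the wiring, assuming `load_state()` follows the same future-returning pattern as the other training-client calls:

```python
# Restores weights *and* optimizer state, so training continues where it stopped.
if args.resume:  # hypothetical CLI flag
    training_client.load_state(args.checkpoint_uri).result()
```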

### 10. **Model-Specific Defaults**
- **What**: Auto-detect the renderer from the model name, validate compatibility
- **Why**: Prevents silent failures from renderer/model mismatch
- **Code**: Add a model→renderer mapping in config validation (sketch below)
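
For example (the table entries are hypothetical; renderer names follow tinker-cookbook conventions and should be verified per model):

```python
MODEL_RENDERERS = {
    "meta-llama/Llama-3.1-8B-Instruct": "llama3",
    "Qwen/Qwen3-8B": "qwen3",
}

def validate_renderer(model_name: str, renderer_name: str) -> None:
    expected = MODEL_RENDERERS.get(model_name)
    if expected is not None and expected != renderer_name:
        raise ValueError(
            f"{model_name} expects renderer {expected!r}, got {renderer_name!r}"
        )
```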

## Priority for Next PR

1. **Async futures + batching** (#6 + #8): biggest performance/correctness win
2. **Inspect AI integration** (#7): makes the demo real and production-ready
3. **Resume capability** (#9): a common production need

## Notes

- The current implementation works but is simplified for demo purposes
- Renderers fall back gracefully if tinker-cookbook is unavailable
- The LR scheduler is optional; a static LR works for simple demos
- All improvements maintain backward compatibility with existing configs