To validate that the stack produces a model that is as easy to fine-tune as Llama models, we convert the FSDP checkpoint into a Hugging Face (HF) checkpoint and fine-tune it using popular fine-tuning configurations and data mixes.
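Mechanically, converting a checkpoint between layouts largely amounts to renaming parameter keys in the state dict. The sketch below illustrates the idea with hypothetical key names (they are placeholders, not the exact names used by our stack or by `transformers`), using plain dicts so the renaming logic stands on its own:

```python
# Hypothetical key map from a Llama/FSDP-style checkpoint layout to an
# HF-style layout. Exact names vary by training stack; these are
# illustrative placeholders, not the precise mapping we used.
KEY_TEMPLATES = {
    "tok_embeddings.weight": "model.embed_tokens.weight",
    "norm.weight": "model.norm.weight",
    "output.weight": "lm_head.weight",
    "layers.{i}.attention.wq.weight": "model.layers.{i}.self_attn.q_proj.weight",
    "layers.{i}.attention.wk.weight": "model.layers.{i}.self_attn.k_proj.weight",
    "layers.{i}.attention.wv.weight": "model.layers.{i}.self_attn.v_proj.weight",
    "layers.{i}.attention.wo.weight": "model.layers.{i}.self_attn.o_proj.weight",
}

def remap_state_dict(state_dict: dict, n_layers: int) -> dict:
    """Rename checkpoint keys; the values (tensors) pass through unchanged."""
    # Expand the per-layer templates into concrete key names.
    mapping = {}
    for src, dst in KEY_TEMPLATES.items():
        if "{i}" in src:
            for i in range(n_layers):
                mapping[src.format(i=i)] = dst.format(i=i)
        else:
            mapping[src] = dst
    # Keys without a known mapping are kept as-is.
    return {mapping.get(k, k): v for k, v in state_dict.items()}
```

In practice the renamed state dict is then loaded into the target model class and saved in the HF format, after which the standard fine-tuning tooling applies unchanged.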
Specifically, we follow Allen AI's open-instruct framework, using the TULU v2 stack as-is (DeepSpeed, the TULU v2 data mixture, and the recommended configuration for Llama 2 models). The tuned model scores are presented below; LlamaT remains competitive with the Llama 2 baseline on most tasks and improves on TruthfulQA. We did not run a hyperparameter search for fine-tuning LlamaT, and the optimal hyperparameters for LlamaT could differ from those for Llama 2, since the two models likely followed different learning rate schedules during pretraining.
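To make the schedule point concrete, here is a minimal sketch of two common pretraining schedules (linear warmup followed by cosine or linear decay). All constants are illustrative, not the settings used for either model:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int, decay: str = "cosine") -> float:
    """Toy learning-rate schedule: linear warmup, then cosine or linear decay.
    Models trained under different schedules end pretraining at different
    effective learning rates, which can shift the best fine-tuning settings."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if decay == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (1.0 - progress)  # linear decay to zero
```

The two decays agree at the start and end of training but differ everywhere in between, so checkpoints taken under each see different optimization trajectories.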
| Evaluation metric | Llama 2-7B (baseline) | LlamaT-7B |
|---|---|---|
| MMLU (5-shot, weighted avg) | 0.53 | 0.49 |
| ARC Challenge | 0.48 | 0.43 |
| ARC Easy | 0.73 | 0.67 |
| BoolQ | 0.82 | 0.82 |
| COPA | 0.89 | 0.86 |
| HellaSwag | 0.76 | 0.75 |
| OpenBookQA | 0.47 | 0.42 |
| PIQA | 0.79 | 0.79 |
| SciQ | 0.93 | 0.91 |
| Winogrande | 0.71 | 0.65 |
| TruthfulQA | 0.45 | 0.46 |
| GSM8K (8-shot) | 0.00 | 0.00 |
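The MMLU row reports a weighted average: per-subject accuracies weighted by the number of questions in each subject, so large subjects count proportionally more than small ones. A minimal sketch of that aggregation (the subject names and counts in the test are made up, not actual MMLU figures):

```python
def weighted_avg(per_subject: dict) -> float:
    """per_subject maps subject -> (accuracy, n_questions).
    Returns accuracy weighted by question count, as in the MMLU row above."""
    total = sum(n for _, n in per_subject.values())
    return sum(acc * n for acc, n in per_subject.values()) / total
```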