
Commit 2aaa87f

Phase 2: experiment tracking, data tools, model evaluation
- Add SQLite experiment tracker (~/.soup/experiments.db) with auto-logging of config, per-step metrics, hardware info, and eval results
- Add soup runs commands: list, show (with plotext loss curves), compare, delete
- Integrate tracker into soup train (auto start_run/finish_run/fail_run)
- Add soup data convert (alpaca/sharegpt/chatml bidirectional conversion)
- Add soup data merge (concatenate datasets with optional shuffle)
- Add soup data dedup (MinHash near-duplicate removal via datasketch)
- Add soup data stats (length percentiles, token counts, language detection)
- Add soup eval (lm-evaluation-harness wrapper with tracker integration)
- Add reverse format conversion: messages_to_format() in data/formats.py
- Add extended_stats() to data/validator.py
- Update monitoring callback to log metrics to tracker
- Add plotext to deps, datasketch as optional [data] dep
- Update README and CLAUDE.md with Phase 2 docs
- 70 tests passing, ruff clean

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a2a0f2c commit 2aaa87f

File tree

18 files changed: +2005 −47 lines


CLAUDE.md

Lines changed: 11 additions & 4 deletions
@@ -35,16 +35,23 @@ soup train --config soup.yaml
 → data/loader.py (load file or HF dataset → normalize format)
 → trainer/sft.py (load model → quantize → apply LoRA → train)
 → monitoring/callback.py + display.py (live Rich dashboard)
+→ experiment/tracker.py (log run + metrics to SQLite)
 → save LoRA adapter to output/
 ```
 
 **Config system:** `config/schema.py` is the single source of truth. All YAML fields are validated by Pydantic models (`SoupConfig`, `TrainingConfig`, `LoraConfig`, `DataConfig`). Templates (chat/code/medical) live as YAML strings in this file.
 
-**Data pipeline:** `data/loader.py` handles local files (JSONL/JSON/CSV/Parquet) and HuggingFace datasets. `data/formats.py` auto-detects and normalizes alpaca/sharegpt/chatml formats into a unified `{"messages": [...]}` structure.
+**Data pipeline:** `data/loader.py` handles local files (JSONL/JSON/CSV/Parquet) and HuggingFace datasets. `data/formats.py` auto-detects and normalizes alpaca/sharegpt/chatml formats into a unified `{"messages": [...]}` structure. Also supports reverse conversion via `messages_to_format()`.
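The reverse conversion added in this commit is only named, not shown. A minimal sketch of what converting a unified record back to alpaca format might look like — the function name and exact signature of the real `messages_to_format()` in `data/formats.py` are not in this diff, so treat this as illustrative:

```python
def messages_to_alpaca(example):
    """Convert a unified {"messages": [...]} record back to alpaca format.

    Hypothetical sketch: the real messages_to_format() dispatches on a
    target format name and may handle system prompts differently.
    """
    messages = example["messages"]
    # System prompt (if any) maps to alpaca's "input" field here.
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    user = next(m["content"] for m in messages if m["role"] == "user")
    assistant = next(m["content"] for m in messages if m["role"] == "assistant")
    return {"instruction": user, "input": system, "output": assistant}
```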
 
-**Trainer:** `trainer/sft.py` (`SFTTrainerWrapper`) wraps HuggingFace's SFTTrainer with auto quantization (BitsAndBytes), LoRA (PEFT), and batch size estimation. Heavy ML imports are lazy (inside methods) so CLI stays fast for non-training commands.
+**Trainer:** `trainer/sft.py` (`SFTTrainerWrapper`) and `trainer/dpo.py` (`DPOTrainerWrapper`) wrap HuggingFace's SFTTrainer/DPOTrainer with auto quantization (BitsAndBytes), LoRA (PEFT), and batch size estimation. Heavy ML imports are lazy (inside methods) so CLI stays fast for non-training commands.
 
-**Monitoring:** `monitoring/callback.py` is a HuggingFace `TrainerCallback` that streams metrics to `monitoring/display.py` (Rich Live panel at 2Hz).
+**Monitoring:** `monitoring/callback.py` is a HuggingFace `TrainerCallback` that streams metrics to `monitoring/display.py` (Rich Live panel at 2Hz) and optionally to the experiment tracker.
+
+**Experiment tracking:** `experiment/tracker.py` (`ExperimentTracker`) stores runs, per-step metrics, and eval results in SQLite at `~/.soup/experiments.db`. Automatically integrated into `soup train`. Commands: `soup runs`, `soup runs show`, `soup runs compare`, `soup runs delete`.
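The tracker's actual schema is not shown in this commit; a minimal sketch of the SQLite-backed pattern it describes, with table and column names that are guesses rather than the real `experiment/tracker.py` layout:

```python
import sqlite3


class ExperimentTracker:
    """Sketch of a SQLite-backed run tracker (schema is hypothetical)."""

    def __init__(self, db_path=":memory:"):
        # The real tracker opens ~/.soup/experiments.db on disk.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, status TEXT)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metrics (run_id TEXT, step INTEGER, loss REAL)"
        )

    def start_run(self, run_id):
        self.conn.execute("INSERT INTO runs VALUES (?, 'running')", (run_id,))

    def log_metric(self, run_id, step, loss):
        # Called per logging step by the monitoring callback.
        self.conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (run_id, step, loss))

    def finish_run(self, run_id):
        self.conn.execute(
            "UPDATE runs SET status = 'finished' WHERE run_id = ?", (run_id,)
        )
```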
+
+**Data tools:** `commands/data.py` provides inspect, validate, convert (between alpaca/sharegpt/chatml), merge, dedup (MinHash via datasketch), and stats (extended statistics with plotext histograms).
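The dedup command uses MinHash via datasketch; the idea it approximates is Jaccard similarity over shingles. A self-contained sketch of that idea using exact Jaccard (the real implementation uses datasketch's MinHash/LSH to avoid the quadratic comparison cost):

```python
def shingles(text, k=3):
    """Character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def dedup(texts, threshold=0.8):
    """Keep a text only if its similarity to every kept text is below threshold.

    Exact O(n^2) version for illustration; soup data dedup approximates
    this with MinHash signatures from datasketch.
    """
    kept, kept_shingles = [], []
    for text in texts:
        sh = shingles(text)
        if all(jaccard(sh, prev) < threshold for prev in kept_shingles):
            kept.append(text)
            kept_shingles.append(sh)
    return kept
```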
+
+**Eval:** `commands/eval.py` wraps lm-evaluation-harness for model evaluation on standard benchmarks (mmlu, gsm8k, etc.) with results saved to the experiment tracker.
 
 ## Code Conventions
 
@@ -53,7 +60,7 @@ soup train --config soup.yaml
 - **Config validation:** Always Pydantic v2 (BaseModel + Field)
 - **CLI framework:** Typer with `rich_markup_mode="rich"`
 - **Output:** Use `rich.console.Console` — never bare `print()`
-- **Lazy imports:** Heavy deps (torch, transformers, peft) are imported inside functions, not at module level
+- **Lazy imports:** Heavy deps (torch, transformers, peft, datasketch, lm_eval, plotext) are imported inside functions, not at module level
 - **Variable naming:** Avoid single-letter names (ruff E741) — use `entry`, `part`, `length` instead of `l`
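The lazy-import convention in the list above can be sketched as follows (illustrative only: `json` stands in for torch/transformers/peft so the snippet runs anywhere):

```python
def fast_command() -> str:
    """Non-training commands never pay the heavy import cost."""
    return "ok"


def train_command() -> str:
    # Heavy dependency imported lazily, inside the function body, so that
    # `import soup_cli` and commands like this one's siblings stay fast.
    # In soup-cli this would be torch/transformers/peft; json is a stand-in.
    import json
    return json.dumps({"status": "ok"})
```

This keeps CLI startup (and `soup --help`) fast because the module can be imported without triggering the heavy dependency at all.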
 
 ## Git Workflow

README.md

Lines changed: 48 additions & 1 deletion
@@ -162,6 +162,51 @@ soup data inspect ./data/train.jsonl
 
 # Validate format
 soup data validate ./data/train.jsonl --format alpaca
+
+# Convert between formats
+soup data convert ./data/train.jsonl --to sharegpt --output converted.jsonl
+
+# Merge multiple datasets
+soup data merge data1.jsonl data2.jsonl --output merged.jsonl --shuffle
+
+# Remove near-duplicates (requires: pip install 'soup-cli[data]')
+soup data dedup ./data/train.jsonl --threshold 0.8
+
+# Extended statistics (length distribution, token counts, languages)
+soup data stats ./data/train.jsonl
+```
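`soup data stats` reports length percentiles via `extended_stats()`. The core computation can be sketched as follows — the function name and nearest-rank method here are illustrative, not necessarily what `data/validator.py` does:

```python
def length_percentiles(lengths, percentiles=(50, 90, 99)):
    """Nearest-rank percentiles over a list of example lengths.

    Hypothetical helper: the real extended_stats() also reports token
    counts and detected languages.
    """
    ordered = sorted(lengths)
    result = {}
    for p in percentiles:
        # nearest-rank index: ceil(p/100 * n) - 1, clamped to valid range
        rank = max(0, min(len(ordered) - 1, -(-p * len(ordered) // 100) - 1))
        result[f"p{p}"] = ordered[rank]
    return result
```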
+
+## Experiment Tracking
+
+Every `soup train` run is automatically tracked in a local SQLite database (`~/.soup/experiments.db`).
+
+```bash
+# List all training runs
+soup runs
+
+# Show detailed info + loss curve for a run
+soup runs show run_20260223_143052_a1b2
+
+# Compare two runs side by side
+soup runs compare run_1 run_2
+
+# Delete a run
+soup runs delete run_1
+```
+
+## Model Evaluation
+
+Evaluate models on standard benchmarks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
+
+```bash
+# Install eval dependencies
+pip install 'soup-cli[eval]'
+
+# Evaluate on benchmarks
+soup eval --model ./output --benchmarks mmlu,gsm8k,hellaswag
+
+# Link results to a training run
+soup eval --model ./output --benchmarks mmlu --run-id run_20260223_143052_a1b2
 ```
 
 ## Features
@@ -178,7 +223,9 @@ soup data validate ./data/train.jsonl --format alpaca
 | HuggingFace datasets support | ✅ |
 | Interactive model chat | ✅ |
 | Push to HuggingFace Hub | ✅ |
-| Experiment tracking | 🔜 |
+| Experiment tracking (SQLite) | ✅ |
+| Data tools (convert, merge, dedup, stats) | ✅ |
+| Model evaluation (lm-eval) | ✅ |
 | Web dashboard | 🔜 |
 | Cloud mode (BYOG) | 🔜 |

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -34,10 +34,12 @@ dependencies = [
     "bitsandbytes>=0.41.0",
     "accelerate>=0.25.0",
     "huggingface-hub>=0.16.0",
+    "plotext>=5.2.0",
 ]
 
 [project.optional-dependencies]
 eval = ["lm-eval>=0.4.0"]
+data = ["datasketch>=1.6.0"]
 dev = ["pytest>=7.0", "ruff>=0.1.0", "pytest-cov>=4.0"]
 ui = ["fastapi>=0.104.0", "uvicorn>=0.24.0"]

soup_cli/cli.py

Lines changed: 7 additions & 2 deletions
@@ -4,7 +4,7 @@
 from rich.console import Console
 
 from soup_cli import __version__
-from soup_cli.commands import chat, data, init, push, train
+from soup_cli.commands import chat, data, eval, init, push, runs, train
 
 console = Console()
 
@@ -20,7 +20,12 @@
 app.command()(train.train)
 app.command()(chat.chat)
 app.command()(push.push)
-app.add_typer(data.app, name="data", help="Dataset tools: inspect, convert, validate.")
+app.add_typer(
+    data.app, name="data",
+    help="Dataset tools: inspect, convert, merge, dedup, validate, stats.",
+)
+app.add_typer(runs.app, name="runs", help="Experiment tracking: list, show, compare runs.")
+app.command(name="eval")(eval.eval_model)
 
 
 @app.command()
