
Commit 2aaa87f

Phase 2: experiment tracking, data tools, model evaluation
- Add SQLite experiment tracker (~/.soup/experiments.db) with auto-logging of config, per-step metrics, hardware info, and eval results
- Add soup runs commands: list, show (with plotext loss curves), compare, delete
- Integrate tracker into soup train (auto start_run/finish_run/fail_run)
- Add soup data convert (alpaca/sharegpt/chatml bidirectional conversion)
- Add soup data merge (concatenate datasets with optional shuffle)
- Add soup data dedup (MinHash near-duplicate removal via datasketch)
- Add soup data stats (length percentiles, token counts, language detection)
- Add soup eval (lm-evaluation-harness wrapper with tracker integration)
- Add reverse format conversion: messages_to_format() in data/formats.py
- Add extended_stats() to data/validator.py
- Update monitoring callback to log metrics to tracker
- Add plotext to deps, datasketch as optional [data] dep
- Update README and CLAUDE.md with Phase 2 docs
- 70 tests passing, ruff clean

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a2a0f2c commit 2aaa87f

File tree

18 files changed: +2005 −47 lines


CLAUDE.md

Lines changed: 11 additions & 4 deletions
@@ -35,16 +35,23 @@ soup train --config soup.yaml
 → data/loader.py (load file or HF dataset → normalize format)
 → trainer/sft.py (load model → quantize → apply LoRA → train)
 → monitoring/callback.py + display.py (live Rich dashboard)
+→ experiment/tracker.py (log run + metrics to SQLite)
 → save LoRA adapter to output/
 ```
 
 **Config system:** `config/schema.py` is the single source of truth. All YAML fields are validated by Pydantic models (`SoupConfig`, `TrainingConfig`, `LoraConfig`, `DataConfig`). Templates (chat/code/medical) live as YAML strings in this file.
 
-**Data pipeline:** `data/loader.py` handles local files (JSONL/JSON/CSV/Parquet) and HuggingFace datasets. `data/formats.py` auto-detects and normalizes alpaca/sharegpt/chatml formats into a unified `{"messages": [...]}` structure.
+**Data pipeline:** `data/loader.py` handles local files (JSONL/JSON/CSV/Parquet) and HuggingFace datasets. `data/formats.py` auto-detects and normalizes alpaca/sharegpt/chatml formats into a unified `{"messages": [...]}` structure. Also supports reverse conversion via `messages_to_format()`.
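The reverse conversion added in this commit is only named, not shown. A minimal sketch of what converting a unified record back to alpaca format might look like — the function name and exact signature of the real `messages_to_format()` in `data/formats.py` are not in this diff, so treat this as illustrative:

```python
def messages_to_alpaca(example):
    """Convert a unified {"messages": [...]} record back to alpaca format.

    Hypothetical sketch: the real messages_to_format() dispatches on a
    target format name and may handle system prompts differently.
    """
    messages = example["messages"]
    # System prompt (if any) maps to alpaca's "input" field here.
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    user = next(m["content"] for m in messages if m["role"] == "user")
    assistant = next(m["content"] for m in messages if m["role"] == "assistant")
    return {"instruction": user, "input": system, "output": assistant}
```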
 
-**Trainer:** `trainer/sft.py` (`SFTTrainerWrapper`) wraps HuggingFace's SFTTrainer with auto quantization (BitsAndBytes), LoRA (PEFT), and batch size estimation. Heavy ML imports are lazy (inside methods) so CLI stays fast for non-training commands.
+**Trainer:** `trainer/sft.py` (`SFTTrainerWrapper`) and `trainer/dpo.py` (`DPOTrainerWrapper`) wrap HuggingFace's SFTTrainer/DPOTrainer with auto quantization (BitsAndBytes), LoRA (PEFT), and batch size estimation. Heavy ML imports are lazy (inside methods) so CLI stays fast for non-training commands.
 
-**Monitoring:** `monitoring/callback.py` is a HuggingFace `TrainerCallback` that streams metrics to `monitoring/display.py` (Rich Live panel at 2Hz).
+**Monitoring:** `monitoring/callback.py` is a HuggingFace `TrainerCallback` that streams metrics to `monitoring/display.py` (Rich Live panel at 2Hz) and optionally to the experiment tracker.
+
+**Experiment tracking:** `experiment/tracker.py` (`ExperimentTracker`) stores runs, per-step metrics, and eval results in SQLite at `~/.soup/experiments.db`. Automatically integrated into `soup train`. Commands: `soup runs`, `soup runs show`, `soup runs compare`, `soup runs delete`.
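The tracker's actual schema is not shown in this commit; a minimal sketch of the SQLite-backed pattern it describes, with table and column names that are guesses rather than the real `experiment/tracker.py` layout:

```python
import sqlite3


class ExperimentTracker:
    """Sketch of a SQLite-backed run tracker (schema is hypothetical)."""

    def __init__(self, db_path=":memory:"):
        # The real tracker opens ~/.soup/experiments.db on disk.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, status TEXT)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metrics (run_id TEXT, step INTEGER, loss REAL)"
        )

    def start_run(self, run_id):
        self.conn.execute("INSERT INTO runs VALUES (?, 'running')", (run_id,))

    def log_metric(self, run_id, step, loss):
        # Called per logging step by the monitoring callback.
        self.conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (run_id, step, loss))

    def finish_run(self, run_id):
        self.conn.execute(
            "UPDATE runs SET status = 'finished' WHERE run_id = ?", (run_id,)
        )
```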
+
+**Data tools:** `commands/data.py` provides inspect, validate, convert (between alpaca/sharegpt/chatml), merge, dedup (MinHash via datasketch), and stats (extended statistics with plotext histograms).
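The dedup command uses MinHash via datasketch; the idea it approximates is Jaccard similarity over shingles. A self-contained sketch of that idea using exact Jaccard (the real implementation uses datasketch's MinHash/LSH to avoid the quadratic comparison cost):

```python
def shingles(text, k=3):
    """Character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def dedup(texts, threshold=0.8):
    """Keep a text only if its similarity to every kept text is below threshold.

    Exact O(n^2) version for illustration; soup data dedup approximates
    this with MinHash signatures from datasketch.
    """
    kept, kept_shingles = [], []
    for text in texts:
        sh = shingles(text)
        if all(jaccard(sh, prev) < threshold for prev in kept_shingles):
            kept.append(text)
            kept_shingles.append(sh)
    return kept
```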
+
+**Eval:** `commands/eval.py` wraps lm-evaluation-harness for model evaluation on standard benchmarks (mmlu, gsm8k, etc.) with results saved to the experiment tracker.
 
 ## Code Conventions
 
@@ -53,7 +60,7 @@ soup train --config soup.yaml
 - **Config validation:** Always Pydantic v2 (BaseModel + Field)
 - **CLI framework:** Typer with `rich_markup_mode="rich"`
 - **Output:** Use `rich.console.Console` — never bare `print()`
-- **Lazy imports:** Heavy deps (torch, transformers, peft) are imported inside functions, not at module level
+- **Lazy imports:** Heavy deps (torch, transformers, peft, datasketch, lm_eval, plotext) are imported inside functions, not at module level
 - **Variable naming:** Avoid single-letter names (ruff E741) — use `entry`, `part`, `length` instead of `l`
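The lazy-import convention in the list above can be sketched as follows (illustrative only: `json` stands in for torch/transformers/peft so the snippet runs anywhere):

```python
def fast_command() -> str:
    """Non-training commands never pay the heavy import cost."""
    return "ok"


def train_command() -> str:
    # Heavy dependency imported lazily, inside the function body, so that
    # `import soup_cli` and commands like this one's siblings stay fast.
    # In soup-cli this would be torch/transformers/peft; json is a stand-in.
    import json
    return json.dumps({"status": "ok"})
```

This keeps CLI startup (and `soup --help`) fast because the module can be imported without triggering the heavy dependency at all.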
 
 ## Git Workflow

README.md

Lines changed: 48 additions & 1 deletion
@@ -162,6 +162,51 @@ soup data inspect ./data/train.jsonl
 
 # Validate format
 soup data validate ./data/train.jsonl --format alpaca
+
+# Convert between formats
+soup data convert ./data/train.jsonl --to sharegpt --output converted.jsonl
+
+# Merge multiple datasets
+soup data merge data1.jsonl data2.jsonl --output merged.jsonl --shuffle
+
+# Remove near-duplicates (requires: pip install 'soup-cli[data]')
+soup data dedup ./data/train.jsonl --threshold 0.8
+
+# Extended statistics (length distribution, token counts, languages)
+soup data stats ./data/train.jsonl
+```
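`soup data stats` reports length percentiles via `extended_stats()`. The core computation can be sketched as follows — the function name and nearest-rank method here are illustrative, not necessarily what `data/validator.py` does:

```python
def length_percentiles(lengths, percentiles=(50, 90, 99)):
    """Nearest-rank percentiles over a list of example lengths.

    Hypothetical helper: the real extended_stats() also reports token
    counts and detected languages.
    """
    ordered = sorted(lengths)
    result = {}
    for p in percentiles:
        # nearest-rank index: ceil(p/100 * n) - 1, clamped to valid range
        rank = max(0, min(len(ordered) - 1, -(-p * len(ordered) // 100) - 1))
        result[f"p{p}"] = ordered[rank]
    return result
```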
+
+## Experiment Tracking
+
+Every `soup train` run is automatically tracked in a local SQLite database (`~/.soup/experiments.db`).
+
+```bash
+# List all training runs
+soup runs
+
+# Show detailed info + loss curve for a run
+soup runs show run_20260223_143052_a1b2
+
+# Compare two runs side by side
+soup runs compare run_1 run_2
+
+# Delete a run
+soup runs delete run_1
+```
+
+## Model Evaluation
+
+Evaluate models on standard benchmarks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
+
+```bash
+# Install eval dependencies
+pip install 'soup-cli[eval]'
+
+# Evaluate on benchmarks
+soup eval --model ./output --benchmarks mmlu,gsm8k,hellaswag
+
+# Link results to a training run
+soup eval --model ./output --benchmarks mmlu --run-id run_20260223_143052_a1b2
 ```
 
 ## Features
@@ -178,7 +223,9 @@ soup data validate ./data/train.jsonl --format alpaca
 | HuggingFace datasets support | ✅ |
 | Interactive model chat | ✅ |
 | Push to HuggingFace Hub | ✅ |
-| Experiment tracking | 🔜 |
+| Experiment tracking (SQLite) | ✅ |
+| Data tools (convert, merge, dedup, stats) | ✅ |
+| Model evaluation (lm-eval) | ✅ |
 | Web dashboard | 🔜 |
 | Cloud mode (BYOG) | 🔜 |

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -34,10 +34,12 @@ dependencies = [
     "bitsandbytes>=0.41.0",
     "accelerate>=0.25.0",
     "huggingface-hub>=0.16.0",
+    "plotext>=5.2.0",
 ]
 
 [project.optional-dependencies]
 eval = ["lm-eval>=0.4.0"]
+data = ["datasketch>=1.6.0"]
 dev = ["pytest>=7.0", "ruff>=0.1.0", "pytest-cov>=4.0"]
 ui = ["fastapi>=0.104.0", "uvicorn>=0.24.0"]

soup_cli/cli.py

Lines changed: 7 additions & 2 deletions
@@ -4,7 +4,7 @@
 from rich.console import Console
 
 from soup_cli import __version__
-from soup_cli.commands import chat, data, init, push, train
+from soup_cli.commands import chat, data, eval, init, push, runs, train
 
 console = Console()
 
@@ -20,7 +20,12 @@
 app.command()(train.train)
 app.command()(chat.chat)
 app.command()(push.push)
-app.add_typer(data.app, name="data", help="Dataset tools: inspect, convert, validate.")
+app.add_typer(
+    data.app, name="data",
+    help="Dataset tools: inspect, convert, merge, dedup, validate, stats.",
+)
+app.add_typer(runs.app, name="runs", help="Experiment tracking: list, show, compare runs.")
+app.command(name="eval")(eval.eval_model)
 
 
 @app.command()
