Add CLAUDE.md for Claude Code project context

MakazhanAlpamys · claude · MakazhanAlpamys · commit 7665e7c3aa35 · 2026-02-20T16:31:33.000+05:00
Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,65 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Build & Development Commands
+
+```bash
+# Install in dev mode (editable + test deps)
+pip install -e ".[dev]"
+
+# Run all tests
+pytest tests/ -v --tb=short
+
+# Run a single test file
+pytest tests/test_config.py -v
+
+# Run a single test
+pytest tests/test_data.py::test_detect_alpaca_format -v
+
+# Lint
+ruff check soup_cli/ tests/
+
+# Lint with auto-fix
+ruff check --fix soup_cli/ tests/
+```
+
+## Architecture
+
+Soup is a CLI-first tool for LLM fine-tuning. The core flow:
+
+```
+soup train --config soup.yaml
+  → config/loader.py    (YAML → Pydantic SoupConfig)
+  → utils/gpu.py        (detect CUDA/MPS/CPU, estimate batch size)
+  → data/loader.py      (load file or HF dataset → normalize format)
+  → trainer/sft.py      (load model → quantize → apply LoRA → train)
+  → monitoring/callback.py + display.py  (live Rich dashboard)
+  → save LoRA adapter to output/
+```
+
+**Config system:** `config/schema.py` is the single source of truth. All YAML fields are validated by Pydantic models (`SoupConfig` → `TrainingConfig` → `LoraConfig`, `DataConfig`). Templates (chat/code/medical) live as YAML strings in this file.
+
+**Data pipeline:** `data/loader.py` handles local files (JSONL/JSON/CSV/Parquet) and HuggingFace datasets. `data/formats.py` auto-detects and normalizes alpaca/sharegpt/chatml formats into a unified `{"messages": [...]}` structure.
+
+**Trainer:** `trainer/sft.py` (`SFTTrainerWrapper`) wraps HuggingFace's SFTTrainer with auto quantization (BitsAndBytes), LoRA (PEFT), and batch size estimation. Heavy ML imports are lazy (inside methods) so CLI stays fast for non-training commands.
+
+**Monitoring:** `monitoring/callback.py` is a HuggingFace `TrainerCallback` that streams metrics to `monitoring/display.py` (Rich Live panel at 2Hz).
+
+## Code Conventions
+
+- **Line length:** 100 chars (ruff enforced)
+- **Linter:** ruff with E, F, I, N, W rules
+- **Config validation:** Always Pydantic v2 (BaseModel + Field)
+- **CLI framework:** Typer with `rich_markup_mode="rich"`
+- **Output:** Use `rich.console.Console` — never bare `print()`
+- **Lazy imports:** Heavy deps (torch, transformers, peft) are imported inside functions, not at module level
+- **Variable naming:** Avoid single-letter names (ruff E741) — use `entry`, `part`, `length` instead of `l`
+
+## Git Workflow
+
+- Repo: https://github.com/MakazhanAlpamys/Soup
+- Branch: `main`
+- CI: GitHub Actions runs ruff lint + pytest on Python 3.9/3.11/3.12
+- Always run `ruff check soup_cli/ tests/` before committing
+- Always run `pytest tests/ -v` before committing