
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Development Commands

Forgather CLI (forgather) - Primary Interface

The forgather command is the main way to interact with Forgather projects. It's available in PATH and provides comprehensive project management capabilities.

# Basic usage
forgather [-p PROJECT_DIR] [-t CONFIG_TEMPLATE] <subcommand>

# Help
forgather --help
forgather <subcommand> --help

# Common project exploration commands
forgather index                    # Show project overview as markdown
forgather ls                       # Show project name, short description, and all available configurations.
forgather ls -r                    # As above, but recursively search all sub-directories for projects and list them.
forgather tlist                    # List all available template files
forgather tlist --format md        # Show template inheritance hierarchy for all templates as markdown.
forgather [-t config.yaml] pp      # Show preprocessed configuration; run before attempting to train!
forgather [-t config.yaml] trefs    # Show template inheritance hierarchy, starting with configuration template.
forgather [-t config.yaml] targets # List available output targets

# Configuration development and debugging
forgather tlist | xargs grep SEARCH_PATTERN     # Search all templates for pattern
forgather -t config.yaml pp        # Useful for diagnosing configuration errors

# Training
forgather -t config.yaml train                     # Train with default settings
forgather -t config.yaml train -d 0,1              # Train on specific GPUs
forgather -t config.yaml train --dry-run           # Show command without executing

# Get head and tail of training output logs
head -n 10 output_models/my_custom_model/runs/my_custom_model_2025-06-25T03-16-59/trainer_logs.json
tail -n 10 output_models/my_custom_model/runs/my_custom_model_2025-06-25T03-16-59/trainer_logs.json

# Get config used by training run
cat output_models/my_custom_model/runs/my_custom_model_2025-06-25T03-16-59/config.yaml

Training Job Control

Forgather supports external control of running training jobs through the forgather control commands. This enables real-time interaction with distributed training for hyperparameter experimentation, checkpoint management, and graceful shutdown.

# List all discoverable training jobs
forgather control list

# Get detailed status of a specific job  
forgather control status JOB_ID

# Control running training jobs
forgather control save JOB_ID           # Save checkpoint (triggers evaluation if configured)
forgather control stop JOB_ID           # Gracefully stop training (saves final checkpoint)
forgather control save-stop JOB_ID      # Save checkpoint then gracefully stop  
forgather control abort JOB_ID          # Abort immediately without saving (useful for failed hyperparameter experiments)

# Job management
forgather control cleanup               # Remove dead job files
forgather control cleanup --force      # Skip confirmation

To enable control in your training jobs, add the TrainerControlCallback:

from forgather.ml.trainer.callbacks import TrainerControlCallback

callbacks = [
    TrainerControlCallback(
        job_id="my_experiment",  # Optional: auto-generated if not provided
    ),
    # ... your other callbacks
]

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    callbacks=callbacks
)

Typical workflow:

  1. Start training job: forgather -t config.yaml train
  2. In another terminal: forgather control list to see running jobs
  3. Save checkpoint on-demand: forgather control save JOB_ID
  4. Gracefully stop when satisfied: forgather control stop JOB_ID
  5. Or abort failed experiments: forgather control abort JOB_ID

The system works with distributed training: commands sent to any rank are automatically coordinated across all processes. See examples/trainer_control/ for a complete working example and docs/trainers/trainer-control.md for full documentation.

Inference

Forgather includes a basic OpenAI-compatible inference server and client.

Start server

# Load model in directory using AutoModelForCausalLM.from_pretrained()
# This defaults to bfloat16 on cuda:0
forgather inf server -m MODEL_PATH

# Load model from latest Forgather checkpoint
forgather inf server -c -m MODEL_PATH

Start client

# Start in interactive (chat) mode
forgather inf client

# Perform text completion on prompt
forgather inf client --completion "Once upon a time"

# Get response to single message
forgather inf client --message "Tell me a story"

Detailed inference instructions are located in 'tools/inference_server/README.md'

vLLM Distributed Inference

Forgather models support distributed inference with vLLM for high-throughput serving with tensor and pipeline parallelism.

Validate vLLM Support

from forgather.ml.model_conversion import validate_vllm_plans, print_model_structure
from transformers import AutoModelForCausalLM

# Load trained model
model = AutoModelForCausalLM.from_pretrained("output_models/my_model")

# Print model structure
print_model_structure(model, max_depth=4)

# Validate vLLM plans
if getattr(model, '_tp_plan', None):
    is_valid = validate_vllm_plans(
        model,
        tp_plan=model._tp_plan,
        pp_plan=getattr(model, '_pp_plan', None),
        strict=True,
    )

Deploy with vLLM

# Single-GPU inference
vllm serve output_models/my_model --trust-remote-code

# Tensor parallel (4 GPUs)
vllm serve output_models/my_model --trust-remote-code --tensor-parallel-size 4

# Tensor + Pipeline parallel (8 GPUs: 2 PP stages, 4 TP per stage)
vllm serve output_models/my_model \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --dtype bfloat16 \
    --max-model-len 8192

Adding vLLM Support to Custom Models

Most transformer models (Llama, DeepOne, etc.) include vLLM support by default. For custom models, add vLLM plans to the [model_code_generator] section:

# In your model configuration
[model_code_generator]
    == super()

    # vLLM Support
    tp_plan:
        "model.layer_stack.layers.*.attention.query_linear": "colwise"
        "model.layer_stack.layers.*.attention.key_linear": "colwise"
        "model.layer_stack.layers.*.attention.value_linear": "colwise"
        "model.layer_stack.layers.*.attention.output_linear": "rowwise"
        # ... additional layers
    pp_plan:
        "model.input_encoder": [["input_ids"], ["hidden_states"]]
        "model.layer_stack": [["hidden_states", "attention_mask"], ["hidden_states"]]
        "model.output_decoder": [["hidden_states"], ["logits"]]

See templatelib/base/models/causal_lm/vllm_plans.yaml for the complete reference template. For detailed information, see docs/inference/vllm_integration.md

Examples from Real Projects

# Working with tiny_llama tutorial project
cd examples/tutorials/tiny_llama
forgather ls                                        # List: train_tiny_llama.yaml, etc.
forgather -t train_tiny_llama.yaml pp               # Show pre-processed configuration.
forgather -t train_tiny_llama.yaml train            # Train with selected configuration.

Workspace and Project Creation

Forgather uses a two-level structure: Workspaces contain Projects. Use the forgather ws commands to create and manage both.

Creating a New Workspace

# Basic workspace creation
forgather ws init --name "My ML Workspace" --description "Machine learning research experiments" --forgather-dir /path/to/forgather

# With additional template search paths
forgather ws init --name "Advanced Workspace" --description "Advanced ML experiments" --forgather-dir /path/to/forgather /extra/templates/path /another/path

# With no default search paths (minimal workspace)
forgather ws init --name "Minimal Workspace" --description "Clean minimal setup" --forgather-dir /path/to/forgather --no-defaults

This creates a forgather_workspace/ directory containing:

  • README.md - Workspace documentation
  • base_directories.yaml - Base directory configuration
  • meta_defaults.yaml - Template search paths and workspace metadata

Creating a New Project in a Workspace

# Basic project creation (directory name auto-generated from project name)
forgather ws project --name "Sentiment Analysis" --description "BERT-based sentiment analysis experiments"

# With custom settings
forgather ws project --name "Image Classification" --description "CNN experiments" --config-prefix "experiments" --default-config "baseline.yaml" custom_directory_name

This creates a project directory with:

  • README.md - Project documentation
  • meta.yaml - Project metadata extending workspace defaults
  • templates/configs/{default_config} - Default configuration template

Typical Workspace Setup Workflow

  1. Create workspace: forgather ws init --name "My Research" --description "ML experiments" --forgather-dir /path/to/forgather
  2. Create project(s): forgather ws project --name "Project 1" --description "First experiment"
  3. Navigate to project: cd project_1
  4. List configurations: forgather ls
  5. Test configuration: forgather pp
  6. Train model: forgather train

Project Installation

pip install -e .

Architecture Overview

Forgather is a configuration-driven ML framework built on template inheritance and code generation. The core abstraction is the Project, which encapsulates an ML experiment through a sophisticated template system.

Key Components

Project System

  • Project (src/forgather/project.py): Central abstraction managing configuration and code generation
  • MetaConfig (src/forgather/meta_config.py): Defines project metadata and template search paths
  • ConfigEnvironment (src/forgather/config.py): Handles template preprocessing with Jinja2 + YAML

Template Hierarchy

templatelib/base/         # Abstract base templates (trainers, models, datasets)
templatelib/examples/     # Reusable example definitions (models, datasets, tokenizers)
examples/*/templates/     # Project-specific templates
modelsrc/transformer/     # Reusable transformer components

Configuration Language

  • Jinja2 preprocessing with custom line statement syntax (-- if, -- set, -- extends, -- block)
  • Custom YAML tags: !call, !factory, !partial, !var, !singleton
  • Template inheritance via -- extends and -- block/-- endblock
  • IMPORTANT YAML Tag Distinctions:
    • !partial: Constructs a Python partial function (produces Callable type)
    • !singleton: Lazy object, called once and cached for subsequent accesses
    • !factory: Called every time it's accessed (not cached)
    • When calling a tag with no arguments, add an empty argument list ([])
    • See docs/configuration/syntax-reference.md for details and examples; a brief sketch follows below
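
A minimal sketch of the three tags, using the tag syntax shown elsewhere in this file (the torch targets are illustrative, not from this repository):

# !partial binds arguments and produces a callable; nothing is invoked yet
optimizer: !partial:torch.optim:AdamW
    lr: 1.0e-4

# !singleton is constructed lazily on first access, then cached and reused
shared_generator: !singleton:torch:Generator []

# !factory is invoked on every access, returning a new object each time
fresh_generator: !factory:torch:Generator []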

Training System

Trainer Classes (src/forgather/ml/)

  • BaseTrainer / SimpleTrainer (basic single-GPU trainer)
  • AccelTrainer (multi-GPU via Accelerate)
  • PipelineTrainer (pipeline parallelism)
  • Custom optimizers in src/forgather/ml/optim/ (AdamW, SGD, AdaFactor, Apollo, etc.)
  • Extensible callback system for logging and checkpointing

Model Management

  • Dynamic model construction from configuration graphs
  • Code generation: Templates → YAML → Node Graph → Python Code → Objects
  • Generated models stored in output_models/ as standalone Python code
  • Transformer components in modelsrc/transformer/

Project Structure

Each project follows this pattern:

project_dir/
├── meta.yaml              # Extends forgather_workspace/meta_defaults.yaml
├── templates/
│   ├── project.yaml       # Main project template
│   ├── configs/           # Experiment configurations
│   └── experiments/       # Alternative config organization
├── output_models/         # Generated model code and training runs
└── project_index.ipynb    # Interactive exploration notebook

Development Workflow

Working with Created Projects

  • After creating a project, cd into the project directory to work with it
  • The generated default config is minimal - extend it by inheriting from base templates:
    -- extends "types/training_script/causal_lm/causal_lm.yaml"
    
    -- block construct_new_model
        -- include 'models/llama.yaml'
    -- endblock construct_new_model
    
    -- block optimizer
    optimizer: &optimizer !partial:torch.optim:AdamW
        lr: 1.0e-4
    -- endblock optimizer
  • Key template inheritance patterns:
    • Use -- extends "template_name.yaml" for single inheritance
    • Use -- include 'template_name.yaml' to include template content
    • Override template blocks with -- block name / -- endblock
    • Use == super() to include parent block content
  • Add additional config files in templates/configs/ for different experiments
  • Test configurations immediately: forgather ls then forgather pp
  • Use forgather meta to see workspace/project structure and template search paths

Configuration Validation

  • ALWAYS run forgather ls to validate all configurations after making changes
  • Failed configs show as "PARSE ERROR" instead of their descriptive names
  • Use forgather -t config.yaml pp to debug preprocessing issues
  • Check for syntax errors, missing imports, and template reference issues (debug loop sketched below)
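
For example, a typical debug loop (the config name here is hypothetical):

forgather ls                         # a broken config shows "PARSE ERROR" instead of its description
forgather -t broken_config.yaml pp   # re-run preprocessing to surface the underlying error
# Fix the template, then re-check:
forgather ls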

Interactive Development

  • Use project_index.ipynb notebooks for experiment development
  • Load projects with Project("config.yaml")
  • Materialize from configurations: model_factory, train_dataset = proj("model", "train_dataset")

Template Development

  • Templates use Jinja2 with custom line statement syntax
  • Inherit via -- extends 'template_name.yaml'
  • Override sections with -- block section_name / -- endblock
  • Include other templates with -- include 'template_name.yaml'
  • Define inline templates with a header line: #-------------------- template.name --------------------
  • Jinja2 inheritance, via 'extends', only allows a single parent template. When overriding blocks from multiple parents, use the 'include and extend' pattern. Example:
-- extends "types/training_script/causal_lm/causal_lm.yaml"
-- block optimizer
# Project override
optimizer: &optimizer !partial:torch.optim:AdamW
    lr: 1.0e-3
-- endblock optimizer

-- block construct_new_model
    ## Includes inline template.
    -- include 'project.model_config'
-- endblock construct_new_model

# Inline template definition
#-------------------- project.model_config --------------------

-- block model_config
    == super()
    # Project overrides
    hidden_size: 512
-- endblock model_config
  • For definitive syntax guide, see "docs/configuration/syntax-reference.md"

Code Generation

  • Models materialized as standalone Python code in output_models/
  • Generated code is self-contained and deployable (see the loading sketch below)
  • Training runs stored in output_models/model_name/runs/
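
A sketch of reloading a generated model, assuming it registers with the transformers auto classes (the vLLM section above serves such models with --trust-remote-code):

from transformers import AutoModelForCausalLM

# The output directory contains the generated, self-contained model source,
# so trust_remote_code is required to execute it.
model = AutoModelForCausalLM.from_pretrained(
    "output_models/my_model",  # hypothetical model path
    trust_remote_code=True,
)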

Important Patterns

Project Loading

from forgather.project import Project
proj = Project("train_tiny_llama.yaml")
training_script = proj()
model_factory = proj("model")
model = model_factory()

Template Inheritance

-- extends 'types/training_script/causal_lm/causal_lm.yaml'
-- block config_metadata
    == super()
    -- set ns.config_name = "My Experiment"
-- endblock

Training Script Usage

  • Use the -p flag to specify the project directory (see the example below)
  • Config files are relative to project templates directory
  • Supports distributed training with proper logging levels
  • Generated models include training artifacts and source code
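
For example, training the tiny_llama tutorial project from the repository root:

forgather -p examples/tutorials/tiny_llama -t train_tiny_llama.yaml train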

The framework emphasizes systematic experimentation through template-based configuration management, enabling reproducible ML experiments with modular, reusable components.

Key Example Projects

Refer to these when creating new projects.

  • A template project to copy when starting a new one: "examples/template_project/"
  • Projects overview: "examples/tutorials/projects_overview/"
  • Forgather project structure: "examples/tutorials/project_composition/"
  • Model training tutorial project: "examples/tutorials/tiny_llama/"
  • Attention mechanisms testing project: "examples/tiny_experiments/attention/"

Common Issues and Solutions

  • Missing import errors (e.g., Callable not imported): Add missing imports to affected files
  • YAML tag errors: Use !partial for function objects, !singleton/!factory for function calls
  • Configuration validation: Run forgather ls to check all configs parse correctly
  • Complex64 serialization: RoPE models may fail to save due to safetensors limitations with complex tensors

Style

Follow existing style conventions. Avoid emojis.