- Project Structure: Complete directory hierarchy with proper Python package structure
- Dependencies: All required packages configured in `pyproject.toml` (FastAPI, SQLAlchemy, Celery, torch, transformers, adapters)
- Configuration: Environment-based settings with `.env` support
- Database: PostgreSQL with SQLAlchemy models for sessions, training runs, checkpoints, and samplers
- Authentication: X-API-Key middleware with `tml-` prefix validation
- Schemas: Complete Pydantic models matching Tinker SDK types exactly
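The prefix check behind the X-API-Key middleware can be sketched as a plain validation function (the `MIN_SECRET_LENGTH` threshold is illustrative, not from the source; the real middleware wires this into FastAPI):

```python
# Sketch of the API-key rule: keys must start with "tml-" and carry
# some secret material after the prefix. Threshold is an assumption.
API_KEY_PREFIX = "tml-"
MIN_SECRET_LENGTH = 8  # illustrative, not a documented requirement

def is_valid_api_key(key: str) -> bool:
    """Return True when the key has the tml- prefix and a non-trivial secret."""
    if not key.startswith(API_KEY_PREFIX):
        return False
    secret = key[len(API_KEY_PREFIX):]
    return len(secret) >= MIN_SECRET_LENGTH
```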
Service Endpoints:
- `GET /api/v1/get_server_capabilities` - Get supported models ✅
- `POST /api/v1/create_session` - Create new session ✅
- `POST /api/v1/session_heartbeat` - Keep session alive ✅
- `GET /api/v1/health` - Health check ✅
Training Endpoints (all with real computation):
- `POST /api/v1/create_model` - Create LoRA training session ✅
- `POST /api/v1/forward_backward` - Execute actual LoRA forward/backward pass ✅
- `POST /api/v1/optim_step` - Apply actual optimizer step ✅
- `POST /api/v1/save_weights` - Save checkpoint to filesystem ✅
- `POST /api/v1/load_weights` - Load checkpoint from filesystem ✅
- `POST /api/v1/get_info` - Get model info ✅
Sampling Endpoints:
- `POST /api/v1/create_sampling_session` - Create sampling session ✅
- `POST /api/v1/asample` - Generate text samples ✅ (supports base model + checkpoint)
Future Polling:
- `POST /api/v1/retrieve_future` - Poll async operation results ✅
Management Endpoints:
- `GET /api/v1/training_runs` - List all training runs ✅
- `GET /api/v1/weights/{model_id}/list` - List checkpoints ✅
- `DELETE /api/v1/weights/{model_id}/checkpoint/{checkpoint_id}` - Delete checkpoint ✅
- `GET /api/v1/sessions/{session_id}` - Get session info ✅
Telemetry:
- `POST /api/v1/telemetry` - Send telemetry data ✅
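On the client side, `retrieve_future` polling reduces to a loop like the one below. The response shape (`{"status": ..., "result": ...}`) and the injectable `fetch` callable are assumptions for illustration, standing in for an HTTP POST to `/api/v1/retrieve_future`:

```python
import time

def poll_future(fetch, future_id, interval_s=0.01, timeout_s=5.0):
    """Poll an async operation until it completes.

    `fetch` stands in for a POST to /api/v1/retrieve_future and is assumed
    to return {"status": "pending"} or {"status": "done", "result": ...}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        payload = fetch(future_id)
        if payload.get("status") == "done":
            return payload["result"]
        time.sleep(interval_s)  # back off before the next poll
    raise TimeoutError(f"future {future_id} did not complete in {timeout_s}s")
```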
- LoRA Trainer: Complete implementation using adapters library
- Initialize base models from HuggingFace (float32 for training stability)
- Add and configure LoRA adapters with custom rank
- Forward/backward pass with cross-entropy loss
- Adam optimizer with gradient clipping and grad norm reporting
- Save/load adapter weights and optimizer state
- Unique adapter names per training run (supports concurrent models)
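The forward/backward step the trainer performs can be sketched with a tiny stand-in model: a frozen `Linear` plays the base model, only the low-rank A/B factors receive gradients, and the step applies cross-entropy loss, gradient clipping with norm reporting, and Adam. Shapes, rank, and hyperparameters are illustrative only:

```python
# Minimal sketch of one forward/backward + clipped Adam step on a
# LoRA-style adapter. Not the project's actual trainer code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, vocab, rank = 16, 32, 4

base = torch.nn.Linear(d, vocab, bias=False)
base.weight.requires_grad_(False)                 # frozen base weights
A = torch.nn.Parameter(torch.randn(rank, d) * 0.01)
B = torch.nn.Parameter(torch.zeros(vocab, rank))  # zero-init: start == base

opt = torch.optim.Adam([A, B], lr=1e-2)
x = torch.randn(8, d)
targets = torch.randint(0, vocab, (8,))

logits = base(x) + x @ A.t() @ B.t()              # base output + LoRA delta
loss = F.cross_entropy(logits, targets)
loss.backward()
grad_norm = torch.nn.utils.clip_grad_norm_([A, B], max_norm=1.0)
opt.step()
```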
- Trainer Manager: In-process training management with result caching
- Per-model trainer instances
- Thread-safe result storage for future polling
- Automatic device selection (CUDA/CPU)
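The two mechanics named above, thread-safe result storage and automatic device selection, could look roughly like this (class and function names are hypothetical; the real manager also tracks per-model trainer instances):

```python
import threading

class ResultStore:
    """Thread-safe cache for results later served via future polling.
    Minimal sketch under assumed semantics, not the project's class."""
    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def put(self, future_id, result):
        with self._lock:
            self._results[future_id] = result

    def get(self, future_id):
        with self._lock:
            return self._results.get(future_id)

def pick_device():
    """Prefer CUDA when available, else CPU. The import is local so the
    sketch also runs where torch is not installed."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"
```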
- Model Manager: Caching system for base models and tokenizers
- Checkpoint Storage: Filesystem-based storage
- Tinker Path: `tinker://` URI format for checkpoint addressing
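A `tinker://` checkpoint address can be split with the standard library; the exact path layout (model id as authority, checkpoint id as path) is an assumption for illustration:

```python
from urllib.parse import urlparse

def parse_tinker_path(uri: str):
    """Split a tinker:// address into (model_id, checkpoint_id).
    Layout is assumed: tinker://<model_id>/<checkpoint_id>."""
    parsed = urlparse(uri)
    if parsed.scheme != "tinker":
        raise ValueError(f"not a tinker URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")
```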
- Base model sampling: Direct generation from HuggingFace models
- Checkpoint sampling: Load trained adapter weights and generate
- Support for temperature, top_p, top_k, seed, stop sequences
- Logprobs computation
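The sampling parameters above combine in one per-step filtering pass; a self-contained sketch over a plain list of logits (the server applies the same idea to model outputs each decoding step, and this simplified nucleus cutoff is an assumption, not the exact implementation):

```python
import math, random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Temperature scaling, top_k truncation, top_p (nucleus) filtering,
    then seeded sampling. Illustrative sketch only."""
    rng = random.Random(seed)
    scaled = [l / max(temperature, 1e-8) for l in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # softmax over the surviving candidates
    m = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - m) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    if top_p < 1.0:                       # nucleus filtering
        kept, cum = [], 0.0
        for idx, p in zip(order, probs):
            kept.append((idx, p))
            cum += p
            if cum >= top_p:
                break
        order = [i for i, _ in kept]
        total = sum(p for _, p in kept)
        probs = [p / total for _, p in kept]
    return rng.choices(order, weights=probs, k=1)[0]
```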
- Docker: Multi-stage Dockerfile with CUDA support
- docker-compose: Complete orchestration (API, Celery, PostgreSQL, Redis)
- .env: Environment configuration template
- End-to-end training test: Verifies create→train→save→load workflow
- Loss decrease test: Verifies real training (loss goes from 40.7 to 0.075)
- Sampling test: Verifies base model and checkpoint sampling
- SDK compatibility test: Passes with official Tinker SDK v0.8.0
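The loss-decrease test boils down to one pattern: record the loss across optimization steps and assert the final value sits far below the initial one. A toy scalar quadratic stands in for the real LoRA training loop here; the numbers are not the project's:

```python
# Shape of the loss-decrease check: loss must fall by orders of magnitude.
def train_steps(w0=10.0, lr=0.3, steps=20):
    w, losses = w0, []
    for _ in range(steps):
        losses.append(w * w)   # toy loss L(w) = w^2
        w -= lr * 2 * w        # gradient step, dL/dw = 2w
    return losses

losses = train_steps()
```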
- Additional loss functions (PPO, DPO, importance sampling)
- Custom loss function support
- Vision model support (multimodal)
- Multi-user support with proper isolation
- Resource quotas and management
- Database migrations (Alembic setup)
- vLLM integration for faster sampling
- Kubernetes deployment manifests
- Monitoring and metrics (Prometheus/Grafana)
- Auto-scaling based on load
- S3 checkpoint storage backend
- Distributed training support
Core Features: Fully functional
- ✅ Real LoRA training with loss computation
- ✅ Gradient descent with measurable loss decrease
- ✅ Checkpoint save/load with filesystem persistence
- ✅ Text generation from base model and trained adapters
- ✅ Full Tinker SDK v0.8.0 compatibility
- ✅ All core endpoints implemented and tested
```bash
# Start dependencies
docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=opentinker postgres:16
docker run -d --name redis -p 6379:6379 redis:7

# Install and run
cp .env.example .env
uv sync
uv run uvicorn open_tinker.main:app --host 0.0.0.0 --port 8000

# Test
uv run python test_training.py
uv run python examples/quickstart_test.py  # requires: uv sync --extra dev
```