3 changes: 2 additions & 1 deletion .gitignore
Expand Up @@ -41,5 +41,6 @@ dist/
*.egg-info/
*.pt

.kiro
.kiro/*
!.kiro/steering/
venv
122 changes: 122 additions & 0 deletions .kiro/steering/development.md
@@ -0,0 +1,122 @@
# Development Guidelines

## Repository Access

DJL, DJL-Serving, and LMI are open-source projects under the deepjavalibrary GitHub organization.

### Getting Started

1. Complete Open Source Training
2. Link GitHub account with AWS/Amazon
3. Join the deepjavalibrary GitHub organization
4. Request access to djl-admin or djl-committer groups

### Key Repositories

- https://github.com/deepjavalibrary/djl
- https://github.com/deepjavalibrary/djl-serving
- https://github.com/deepjavalibrary/djl-demo

## Development Workflow

### Setup
```bash
# Fork repos, then clone and track upstream
git clone git@github.com:<username>/djl-serving.git
cd djl-serving
git remote add upstream https://github.com/deepjavalibrary/djl-serving

# Sync with upstream
git fetch upstream && git rebase upstream/master && git push
```

### Making Changes
```bash
git checkout -b my-feature-branch
# Make changes
git add . && git commit -m "Description"
git push -u origin my-feature-branch
# Create PR from fork to upstream/master via GitHub UI
```

## Building LMI Containers

### Container Types

**DLC and DockerHub:**
- LMI-vLLM
- LMI-TensorRT-LLM
- LMI-Neuron

**DockerHub Only:**
- CPU-Full (PyTorch/OnnxRuntime/MxNet/TensorFlow)
- CPU (no engines bundled)
- PyTorch-GPU
- Aarch64 (Graviton support)

### Build Process

```bash
# Prepare build
cd djl-serving
rm -rf serving/docker/distributions
./gradlew clean && ./gradlew --refresh-dependencies :serving:dockerDeb -Psnapshot

# Get versions
cd serving/docker
export DJL_VERSION=$(awk -F '=' '/djl / {gsub(/ ?"/, "", $2); print $2}' ../../gradle/libs.versions.toml)
export SERVING_VERSION=$(awk -F '=' '/serving / {gsub(/ ?"/, "", $2); print $2}' ../../gradle/libs.versions.toml)

# Build specific container
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} lmi
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} tensorrt-llm
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} pytorch-inf2
```

See `serving/docker/docker-compose.yml` for all available targets.
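The `awk` one-liners in the build snippet can be sanity-checked against a minimal, made-up stand-in for `gradle/libs.versions.toml` (the version numbers below are placeholders, not the project's real versions):

```shell
# Tiny stand-in for gradle/libs.versions.toml (placeholder versions)
cat > /tmp/libs.versions.toml <<'EOF'
djl = "0.33.0"
serving = "0.33.0"
EOF

# Same extraction as the build snippet: split on '=', strip spaces and quotes
DJL_VERSION=$(awk -F '=' '/djl / {gsub(/ ?"/, "", $2); print $2}' /tmp/libs.versions.toml)
SERVING_VERSION=$(awk -F '=' '/serving / {gsub(/ ?"/, "", $2); print $2}' /tmp/libs.versions.toml)
echo "$DJL_VERSION $SERVING_VERSION"
```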

## Testing

### Local Integration Tests

```bash
cd tests/integration
OVERRIDE_TEST_CONTAINER=<image_name> python -m pytest tests.py::<TestClass>::<test_name>

# Example
OVERRIDE_TEST_CONTAINER=deepjavalibrary/djl-serving:lmi python -m pytest tests.py::TestVllm1_g6::test_gemma_2b
```

Full test suite: `tests/integration/tests.py`

## Key Development Areas (Priority Order)

### DJL-Serving
1. **Python Engine** - `engines/python/setup/djl_python/` (vLLM, TensorRT-LLM, rolling batch, chat completions)
2. **Python Engine Java** - `engines/python/src/main/java/ai/djl/python/engine/`
3. **WLM** - `wlm/` (backend ML/DL engine integration)
4. **Serving** - `serving/` (frontend web server)

### DJL (Less Frequent)
PyTorch, HuggingFace Tokenizer, OnnxRuntime, Rust/Candle engines

## CI/CD Workflows

### DJL Repository
- `continuous.yml` - PR checks
- `native_jni_s3_pytorch.yml` - Publish native code to S3
- `nightly_publish.yml` - SNAPSHOT to Maven
- `serving-publish.yml` - DJL-Serving SNAPSHOT to S3

### DJL-Serving Repository
- `nightly.yml` - Build containers → Run tests → Publish to staging
- `docker-nightly-publish.yml` - Build/publish to dev repo (ad-hoc)
- `integration.yml` - Run all tests with custom image (ad-hoc)
- `docker_publish.yml` - Sync dev to staging
- `integration_execute.yml` - Single test on specific instance

## Versioning
- **DJL** → Maven (stable + SNAPSHOT)
- **DJL-Serving** → S3 (stable + SNAPSHOT)
- **Source** → `gradle/libs.versions.toml`
- **Nightly** → SNAPSHOT, **Release** → Stable
133 changes: 133 additions & 0 deletions .kiro/steering/partitioning.md
@@ -0,0 +1,133 @@
# Model Partitioning and Optimization

The partition system (`serving/docker/partition/`) provides tools for model preparation, including tensor parallelism sharding, quantization, and multi-node setup.

## Core Scripts

### partition.py - Main Entry Point
Handles S3 download, requirements install, partitioning, quantization (AWQ/FP8), S3 upload.

**Features:** HF downloads, `OPTION_*` env vars, MPI mode, auto-cleanup

```bash
python partition.py \
--model-id <hf_model_id_or_s3_uri> \
--tensor-parallel-degree 4 \
--quantization awq \
--save-mp-checkpoint-path /tmp/output
```

### run_partition.py - Custom Handlers
Invokes user-provided partition handlers via `partition_handler` property.

### run_multi_node_setup.py - Cluster Coordination
Coordinates multi-node setup: workers query the leader for model info, download the model, exchange SSH keys, and report readiness.

**Env Vars:** `DJL_LEADER_ADDR`, `LWS_LEADER_ADDR`, `DJL_CACHE_DIR`

### trt_llm_partition.py - TensorRT-LLM Compilation
Builds TensorRT engines with BuildConfig (batch/seq limits), QuantConfig (AWQ/FP8/SmoothQuant), CalibConfig (calibration data).

### SageMaker Neo Integration

Partition scripts power **SageMaker Neo's CreateOptimizationJob API**, a managed service for compilation, quantization, and sharding.

**API:** https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateOptimizationJob.html

**Optimization Types:**
1. Compilation (TensorRT-LLM engines)
2. Quantization (AWQ, FP8)
3. Sharding (Fast Model Loader TP)

**Neo Environment Variables:**
- `SM_NEO_INPUT_MODEL_DIR`, `SM_NEO_OUTPUT_MODEL_DIR`
- `SM_NEO_COMPILATION_PARAMS` (JSON config)
- `SERVING_FEATURES` (vllm, trtllm)

**Neo Scripts:**
- `sm_neo_dispatcher.py` - Routes jobs: vllm→Quantize/Shard, trtllm→Compile
- `sm_neo_trt_llm_partition.py` - TensorRT-LLM compilation
- `sm_neo_quantize.py` - Quantization workflows
- `sm_neo_utils.py` - Env var helpers

**Workflow:**
CreateOptimizationJob(source S3, config, output S3, container) → Neo launches container → Dispatcher routes → Handler optimizes → Artifacts to output S3 → Deploy to SageMaker
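The dispatcher's routing step can be sketched in shell, assuming only what the list above states (vllm → quantize/shard, trtllm → compile); the function name and output strings are illustrative, not the real `sm_neo_dispatcher.py` interface:

```shell
# Hypothetical routing sketch keyed on SERVING_FEATURES (illustrative only)
route_job() {
  case "${SERVING_FEATURES:-}" in
    vllm)   echo "quantize-or-shard" ;;   # handled by sm_neo_quantize.py
    trtllm) echo "compile" ;;             # handled by sm_neo_trt_llm_partition.py
    *)      echo "unsupported: ${SERVING_FEATURES:-unset}" >&2; return 1 ;;
  esac
}

SERVING_FEATURES=vllm
route_job
```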

## Quantization

### AWQ (4-bit, AutoAWQ library)
```properties
option.quantize=awq
option.awq_zero_point=true
option.awq_block_size=128
option.awq_weight_bit_width=4
option.awq_mm_version=GEMM
option.awq_ignore_layers=lm_head
```

### FP8 (llm-compressor, CNN/DailyMail calibration)
```properties
option.quantize=fp8
option.fp8_scheme=FP8_DYNAMIC
option.fp8_ignore=lm_head
option.calib_size=512
option.max_model_len=2048
```

## Multi-Node

### MPI Mode (engine=MPI or TP > 1)
```bash
mpirun -N <tp_degree> --allow-run-as-root \
--mca btl_vader_single_copy_mechanism none \
-x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 \
python run_partition.py --properties '{...}'
```

### Cluster Setup (LeaderWorkerSet/K8s)
1. Leader generates SSH keys
2. Workers query `/cluster/models` for model info
3. Workers download model, exchange SSH keys via `/cluster/sshpublickey`
4. Workers report to `/cluster/status?message=OK`
5. Leader loads model
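The worker-side calls can be sketched as shell functions against the endpoints listed above; `DJL_LEADER_ADDR` comes from the env vars documented earlier, while the port, HTTP method, and key path are assumptions for illustration:

```shell
# Hypothetical worker-side sketch of the coordination endpoints (functions are
# defined, not invoked here; host:port, POST method, and key path are assumed)
leader="${DJL_LEADER_ADDR:-leader.example.internal:8080}"

fetch_model_info() { curl -sf "http://${leader}/cluster/models"; }
send_ssh_pubkey()  { curl -sf -X POST --data-binary @"$HOME/.ssh/id_rsa.pub" \
                       "http://${leader}/cluster/sshpublickey"; }
report_ready()     { curl -sf "http://${leader}/cluster/status?message=OK"; }
```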

## Configuration

### properties_manager.py
Loads `serving.properties`, merges `OPTION_*` env vars, validates, generates output.

**Key Properties:**
- `option.model_id` - HF model ID or S3 URI
- `option.tensor_parallel_degree`, `option.pipeline_parallel_degree`
- `option.save_mp_checkpoint_path` - Output dir
- `option.quantize` - awq, fp8, static_int8
- `engine` - Python, MPI
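The `OPTION_*` merge follows a naming convention: an env var `OPTION_FOO_BAR` maps onto the property key `option.foo_bar`. A quick sketch of that mapping (the serving.properties content is a made-up example):

```shell
# Made-up serving.properties for illustration
cat > /tmp/serving.properties <<'EOF'
engine=Python
option.model_id=meta-llama/Llama-2-7b-hf
option.tensor_parallel_degree=4
EOF

# OPTION_TENSOR_PARALLEL_DEGREE overrides option.tensor_parallel_degree:
# strip the OPTION_ prefix and lowercase the rest
env_var="OPTION_TENSOR_PARALLEL_DEGREE"
prop_key="option.$(echo "${env_var#OPTION_}" | tr '[:upper:]' '[:lower:]')"
echo "$prop_key"
```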

### utils.py Helpers
`get_partition_cmd()`, `extract_python_jar()`, `load_properties()`, `update_kwargs_with_env_vars()`, `remove_option_from_properties()`, `load_hf_config_and_tokenizer()`

## Container Integration
Scripts at `/opt/djl/partition/` invoked via:
1. Neo compilation (`sm_neo_dispatcher.py`)
2. Container startup (on-the-fly partitioning)
3. Management API (dynamic registration)

## Common Workflows

```bash
# Tensor Parallelism
python partition.py --model-id meta-llama/Llama-2-70b-hf \
--tensor-parallel-degree 8 --save-mp-checkpoint-path /tmp/output

# AWQ Quantization
python partition.py --model-id meta-llama/Llama-2-7b-hf \
--quantization awq --save-mp-checkpoint-path /tmp/output

# TensorRT-LLM Engine
python trt_llm_partition.py --properties_dir /opt/ml/model \
--trt_llm_model_repo /tmp/engine --model_path /tmp/model \
--tensor_parallel_degree 4 --pipeline_parallel_degree 1
```

## Error Handling
Non-zero exit on failure, real-time stdout/stderr, cleanup on success, S3 upload only after success.
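That contract can be illustrated with stub commands (the function names are invented for the sketch; the real scripts drive Python subprocesses):

```shell
# Stub pipeline illustrating "upload only after success" (names invented)
partition_step() { return "$1"; }          # stub: argument is the exit code
upload_step()    { echo "uploaded"; }      # stands in for the S3 upload

run_job() {
  if partition_step "$1"; then
    upload_step                            # reached only when partitioning succeeded
  else
    echo "partition failed" >&2
    return 1                               # non-zero exit propagates to the caller
  fi
}
```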
33 changes: 33 additions & 0 deletions .kiro/steering/product.md
@@ -0,0 +1,33 @@
# DJL Serving - Product Overview

High-performance universal model serving solution powered by Deep Java Library (DJL). Serves ML models through REST APIs with automatic scaling, dynamic batching, and multi-engine support.

## Architecture

**3-Layer Design:**
1. **Frontend** - Netty HTTP server (Inference + Management APIs)
2. **Workflows** - Multi-model execution pipelines
3. **WorkLoadManager (WLM)** - Worker thread pools with batching/routing

- **Python Engine** - Runs Python-based models and custom handlers
- **LMI** - Large Model Inference with vLLM, TensorRT-LLM, HuggingFace Accelerate

## Supported Models

PyTorch TorchScript, SKLearn models, ONNX, Python scripts, XGBoost, Sentencepiece, HuggingFace models

## Primary Use Cases

1. **LLM Serving** - Optimized backends (vLLM, TensorRT-LLM) with LoRA adapters
2. **Multi-Model Endpoints** - Version management, workflows
3. **Custom Handlers** - Python preprocessing/postprocessing
4. **Embeddings & Multimodal** - Text embeddings, vision-language models
5. **AWS Integration** - SageMaker deployment, Neo optimization (compilation, quantization, sharding)

## Key Features

- Auto-scaling worker threads based on load
- Dynamic batching for throughput optimization
- Multi-engine support (serve different frameworks simultaneously)
- Plugin architecture for extensibility
- OpenAPI-compatible REST endpoints