3 changes: 2 additions & 1 deletion .gitignore
Expand Up @@ -41,5 +41,6 @@ dist/
*.egg-info/
*.pt

.kiro
.kiro/*
!.kiro/steering/
venv
122 changes: 122 additions & 0 deletions .kiro/steering/development.md
@@ -0,0 +1,122 @@
# Development Guidelines

## Repository Access

DJL, DJL-Serving, and LMI are open-source projects under the deepjavalibrary GitHub organization.

### Getting Started

1. Complete Open Source Training
2. Link GitHub account with AWS/Amazon
3. Join the deepjavalibrary GitHub organization
4. Request access to djl-admin or djl-committer groups

### Key Repositories

- https://github.com/deepjavalibrary/djl
- https://github.com/deepjavalibrary/djl-serving
- https://github.com/deepjavalibrary/djl-demo

## Development Workflow

### Setup
```bash
# Fork repos, then clone and track upstream
git clone git@github.com:<username>/djl-serving.git
cd djl-serving
git remote add upstream https://github.com/deepjavalibrary/djl-serving

# Sync with upstream
git fetch upstream && git rebase upstream/master && git push
```

### Making Changes
```bash
git checkout -b my-feature-branch
# Make changes
git add . && git commit -m "Description"
git push -u origin my-feature-branch
# Create PR from fork to upstream/master via GitHub UI
```

## Building LMI Containers

### Container Types

**DLC and DockerHub:**
- LMI-vLLM
- LMI-TensorRT-LLM
- LMI-Neuron

**DockerHub Only:**
- CPU-Full (PyTorch/OnnxRuntime/MxNet/TensorFlow)
- CPU (no engines bundled)
- PyTorch-GPU
- Aarch64 (Graviton support)

### Build Process

```bash
# Prepare build
cd djl-serving
rm -rf serving/docker/distributions
./gradlew clean && ./gradlew --refresh-dependencies :serving:dockerDeb -Psnapshot

# Get versions
cd serving/docker
export DJL_VERSION=$(awk -F '=' '/djl / {gsub(/ ?"/, "", $2); print $2}' ../../gradle/libs.versions.toml)
export SERVING_VERSION=$(awk -F '=' '/serving / {gsub(/ ?"/, "", $2); print $2}' ../../gradle/libs.versions.toml)

# Build specific container
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} lmi
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} tensorrt-llm
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} pytorch-inf2
```

See `serving/docker/docker-compose.yml` for all available targets.
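The `awk` one-liners in the build snippet can be sanity-checked against a minimal, made-up stand-in for `gradle/libs.versions.toml` (the version numbers below are placeholders, not the project's real versions):

```shell
# Tiny stand-in for gradle/libs.versions.toml (placeholder versions)
cat > /tmp/libs.versions.toml <<'EOF'
djl = "0.33.0"
serving = "0.33.0"
EOF

# Same extraction as the build snippet: split on '=', strip spaces and quotes
DJL_VERSION=$(awk -F '=' '/djl / {gsub(/ ?"/, "", $2); print $2}' /tmp/libs.versions.toml)
SERVING_VERSION=$(awk -F '=' '/serving / {gsub(/ ?"/, "", $2); print $2}' /tmp/libs.versions.toml)
echo "$DJL_VERSION $SERVING_VERSION"
```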

## Testing

### Local Integration Tests

```bash
cd tests/integration
OVERRIDE_TEST_CONTAINER=<image_name> python -m pytest tests.py::<TestClass>::<test_name>

# Example
OVERRIDE_TEST_CONTAINER=deepjavalibrary/djl-serving:lmi python -m pytest tests.py::TestVllm1_g6::test_gemma_2b
```

Full test suite: `tests/integration/tests.py`

## Key Development Areas (Priority Order)

### DJL-Serving
1. **Python Engine** - `engines/python/setup/djl_python/` (vLLM, TensorRT-LLM, rolling batch, chat completions)
2. **Python Engine Java** - `engines/python/src/main/java/ai/djl/python/engine/`
3. **WLM** - `wlm/` (backend ML/DL engine integration)
4. **Serving** - `serving/` (frontend web server)

### DJL (Less Frequent)
PyTorch, HuggingFace Tokenizer, OnnxRuntime, Rust/Candle engines

## CI/CD Workflows

### DJL Repository
- `continuous.yml` - PR checks
- `native_jni_s3_pytorch.yml` - Publish native code to S3
- `nightly_publish.yml` - SNAPSHOT to Maven
- `serving-publish.yml` - DJL-Serving SNAPSHOT to S3

### DJL-Serving Repository
- `nightly.yml` - Build containers → Run tests → Publish to staging
- `docker-nightly-publish.yml` - Build/publish to dev repo (ad-hoc)
- `integration.yml` - Run all tests with custom image (ad-hoc)
- `docker_publish.yml` - Sync dev to staging
- `integration_execute.yml` - Single test on specific instance

## Versioning
- **DJL** → Maven (stable + SNAPSHOT)
- **DJL-Serving** → S3 (stable + SNAPSHOT)
- **Source** → `gradle/libs.versions.toml`
- **Nightly** → SNAPSHOT, **Release** → Stable
133 changes: 133 additions & 0 deletions .kiro/steering/partitioning.md
@@ -0,0 +1,133 @@
# Model Partitioning and Optimization

The partition system (`serving/docker/partition/`) provides tools for model preparation, including tensor parallelism sharding, quantization, and multi-node setup.

## Core Scripts

### partition.py - Main Entry Point
Handles S3 download, requirements install, partitioning, quantization (AWQ/FP8), S3 upload.

**Features:** HF downloads, `OPTION_*` env vars, MPI mode, auto-cleanup

```bash
python partition.py \
--model-id <hf_model_id_or_s3_uri> \
--tensor-parallel-degree 4 \
--quantization awq \
--save-mp-checkpoint-path /tmp/output
```

### run_partition.py - Custom Handlers
Invokes user-provided partition handlers via `partition_handler` property.

### run_multi_node_setup.py - Cluster Coordination
Coordinates multi-node setup: workers query the leader for model info, download the model, exchange SSH keys, and report readiness.

**Env Vars:** `DJL_LEADER_ADDR`, `LWS_LEADER_ADDR`, `DJL_CACHE_DIR`

### trt_llm_partition.py - TensorRT-LLM Compilation
Builds TensorRT engines with BuildConfig (batch/seq limits), QuantConfig (AWQ/FP8/SmoothQuant), CalibConfig (calibration data).

### SageMaker Neo Integration

Partition scripts power **SageMaker Neo's CreateOptimizationJob API**, a managed service for compilation, quantization, and sharding.

**API:** https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateOptimizationJob.html

**Optimization Types:**
1. Compilation (TensorRT-LLM engines)
2. Quantization (AWQ, FP8)
3. Sharding (Fast Model Loader TP)

**Neo Environment Variables:**
- `SM_NEO_INPUT_MODEL_DIR`, `SM_NEO_OUTPUT_MODEL_DIR`
- `SM_NEO_COMPILATION_PARAMS` (JSON config)
- `SERVING_FEATURES` (vllm, trtllm)

**Neo Scripts:**
- `sm_neo_dispatcher.py` - Routes jobs: vllm→Quantize/Shard, trtllm→Compile
- `sm_neo_trt_llm_partition.py` - TensorRT-LLM compilation
- `sm_neo_quantize.py` - Quantization workflows
- `sm_neo_utils.py` - Env var helpers

**Workflow:**
CreateOptimizationJob(source S3, config, output S3, container) → Neo launches container → Dispatcher routes → Handler optimizes → Artifacts to output S3 → Deploy to SageMaker
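The dispatcher's routing step can be sketched in shell, assuming only what the list above states (vllm → quantize/shard, trtllm → compile); the function name and output strings are illustrative, not the real `sm_neo_dispatcher.py` interface:

```shell
# Hypothetical routing sketch keyed on SERVING_FEATURES (illustrative only)
route_job() {
  case "${SERVING_FEATURES:-}" in
    vllm)   echo "quantize-or-shard" ;;   # handled by sm_neo_quantize.py
    trtllm) echo "compile" ;;             # handled by sm_neo_trt_llm_partition.py
    *)      echo "unsupported: ${SERVING_FEATURES:-unset}" >&2; return 1 ;;
  esac
}

SERVING_FEATURES=vllm
route_job
```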

## Quantization

### AWQ (4-bit, AutoAWQ library)
```properties
option.quantize=awq
option.awq_zero_point=true
option.awq_block_size=128
option.awq_weight_bit_width=4
option.awq_mm_version=GEMM
option.awq_ignore_layers=lm_head
```

### FP8 (llm-compressor, CNN/DailyMail calibration)
```properties
option.quantize=fp8
option.fp8_scheme=FP8_DYNAMIC
option.fp8_ignore=lm_head
option.calib_size=512
option.max_model_len=2048
```

## Multi-Node

### MPI Mode (engine=MPI or TP > 1)
```bash
mpirun -N <tp_degree> --allow-run-as-root \
--mca btl_vader_single_copy_mechanism none \
-x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 \
python run_partition.py --properties '{...}'
```

### Cluster Setup (LeaderWorkerSet/K8s)
1. Leader generates SSH keys
2. Workers query `/cluster/models` for model info
3. Workers download model, exchange SSH keys via `/cluster/sshpublickey`
4. Workers report to `/cluster/status?message=OK`
5. Leader loads model
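The worker-side calls can be sketched as shell functions against the endpoints listed above; `DJL_LEADER_ADDR` comes from the env vars documented earlier, while the port, HTTP method, and key path are assumptions for illustration:

```shell
# Hypothetical worker-side sketch of the coordination endpoints (functions are
# defined, not invoked here; host:port, POST method, and key path are assumed)
leader="${DJL_LEADER_ADDR:-leader.example.internal:8080}"

fetch_model_info() { curl -sf "http://${leader}/cluster/models"; }
send_ssh_pubkey()  { curl -sf -X POST --data-binary @"$HOME/.ssh/id_rsa.pub" \
                       "http://${leader}/cluster/sshpublickey"; }
report_ready()     { curl -sf "http://${leader}/cluster/status?message=OK"; }
```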

## Configuration

### properties_manager.py
Loads `serving.properties`, merges `OPTION_*` env vars, validates, generates output.

**Key Properties:**
- `option.model_id` - HF model ID or S3 URI
- `option.tensor_parallel_degree`, `option.pipeline_parallel_degree`
- `option.save_mp_checkpoint_path` - Output dir
- `option.quantize` - awq, fp8, static_int8
- `engine` - Python, MPI
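The `OPTION_*` merge follows a naming convention: an env var `OPTION_FOO_BAR` maps onto the property key `option.foo_bar`. A quick sketch of that mapping (the serving.properties content is a made-up example):

```shell
# Made-up serving.properties for illustration
cat > /tmp/serving.properties <<'EOF'
engine=Python
option.model_id=meta-llama/Llama-2-7b-hf
option.tensor_parallel_degree=4
EOF

# OPTION_TENSOR_PARALLEL_DEGREE overrides option.tensor_parallel_degree:
# strip the OPTION_ prefix and lowercase the rest
env_var="OPTION_TENSOR_PARALLEL_DEGREE"
prop_key="option.$(echo "${env_var#OPTION_}" | tr '[:upper:]' '[:lower:]')"
echo "$prop_key"
```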

### utils.py Helpers
`get_partition_cmd()`, `extract_python_jar()`, `load_properties()`, `update_kwargs_with_env_vars()`, `remove_option_from_properties()`, `load_hf_config_and_tokenizer()`

## Container Integration
Scripts at `/opt/djl/partition/` invoked via:
1. Neo compilation (`sm_neo_dispatcher.py`)
2. Container startup (on-the-fly partitioning)
3. Management API (dynamic registration)

## Common Workflows

```bash
# Tensor Parallelism
python partition.py --model-id meta-llama/Llama-2-70b-hf \
--tensor-parallel-degree 8 --save-mp-checkpoint-path /tmp/output

# AWQ Quantization
python partition.py --model-id meta-llama/Llama-2-7b-hf \
--quantization awq --save-mp-checkpoint-path /tmp/output

# TensorRT-LLM Engine
python trt_llm_partition.py --properties_dir /opt/ml/model \
--trt_llm_model_repo /tmp/engine --model_path /tmp/model \
--tensor_parallel_degree 4 --pipeline_parallel_degree 1
```

## Error Handling
Non-zero exit on failure, real-time stdout/stderr, cleanup on success, S3 upload only after success.
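That contract can be illustrated with stub commands (the function names are invented for the sketch; the real scripts drive Python subprocesses):

```shell
# Stub pipeline illustrating "upload only after success" (names invented)
partition_step() { return "$1"; }          # stub: argument is the exit code
upload_step()    { echo "uploaded"; }      # stands in for the S3 upload

run_job() {
  if partition_step "$1"; then
    upload_step                            # reached only when partitioning succeeded
  else
    echo "partition failed" >&2
    return 1                               # non-zero exit propagates to the caller
  fi
}
```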
33 changes: 33 additions & 0 deletions .kiro/steering/product.md
@@ -0,0 +1,33 @@
# DJL Serving - Product Overview

High-performance universal model serving solution powered by Deep Java Library (DJL). Serves ML models through REST APIs with automatic scaling, dynamic batching, and multi-engine support.

## Architecture

**3-Layer Design:**
1. **Frontend** - Netty HTTP server (Inference + Management APIs)
2. **Workflows** - Multi-model execution pipelines
3. **WorkLoadManager (WLM)** - Worker thread pools with batching/routing

- **Python Engine** - Runs Python-based models and custom handlers
- **LMI** - Large Model Inference with vLLM, TensorRT-LLM, HuggingFace Accelerate

## Supported Models

PyTorch TorchScript, SKLearn models, ONNX, Python scripts, XGBoost, Sentencepiece, HuggingFace models

## Primary Use Cases

1. **LLM Serving** - Optimized backends (vLLM, TensorRT-LLM) with LoRA adapters
2. **Multi-Model Endpoints** - Version management, workflows
3. **Custom Handlers** - Python preprocessing/postprocessing
4. **Embeddings & Multimodal** - Text embeddings, vision-language models
5. **AWS Integration** - SageMaker deployment, Neo optimization (compilation, quantization, sharding)

## Key Features

- Auto-scaling worker threads based on load
- Dynamic batching for throughput optimization
- Multi-engine support (serve different frameworks simultaneously)
- Plugin architecture for extensibility
- OpenAPI-compatible REST endpoints