
Commit f0fb0b7

dagil-nvidia, athreesh, and claude authored
docs: Cherry-pick PR #5380 documentation updates to release/0.8.0 (#5440)
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 702412d commit f0fb0b7

15 files changed, +1832 −43 lines

docs/_sections/fault_tolerance.rst

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+Fault Tolerance
+===============
+
+.. toctree::
+   :maxdepth: 1
+
+   Overview <../fault_tolerance/README.md>
+   Request Migration <../fault_tolerance/request_migration.md>
+   Request Cancellation <../fault_tolerance/request_cancellation.md>
+   Graceful Shutdown <../fault_tolerance/graceful_shutdown.md>
+   Request Rejection <../fault_tolerance/request_rejection.md>
+   Testing <../fault_tolerance/testing.md>

docs/backends/sglang/README.md

Lines changed: 7 additions & 2 deletions
@@ -163,14 +163,19 @@ docker run \
 
 Below we provide a guide that lets you run all of our common deployment patterns on a single node.
 
-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)
 
-Start using [Docker Compose](../../../deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
 
 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```
 
+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also pass `--kv_store file` to use file-system-based discovery.
+> - **NATS** is optional: it is only needed when using KV routing with events (the default). You can disable it with the `--no-kv-events` flag for prediction-based routing.
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
+
 > [!TIP]
 > Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
 >
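To make the note above concrete, here is a minimal sketch of launching a worker without either service. The `python -m dynamo.sglang` entrypoint and the model name are illustrative assumptions; the `--kv_store file` and `--no-kv-events` flags are the ones described in the note:

```bash
# Minimal sketch, assuming a dynamo.sglang entrypoint; the model name is
# illustrative. --kv_store file replaces etcd with file-system discovery,
# and --no-kv-events drops the NATS dependency (prediction-based routing).
python -m dynamo.sglang --model-path Qwen/Qwen2.5-0.5B-Instruct --kv_store file --no-kv-events
```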

docs/backends/trtllm/README.md

Lines changed: 7 additions & 2 deletions
@@ -70,14 +70,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 Below we provide a guide that lets you run all of the common deployment patterns on a single node.
 
-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)
 
-Start using [Docker Compose](../../../deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
 
 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```
 
+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also pass `--kv_store file` to use file-system-based discovery.
+> - **NATS** is optional: it is only needed when using KV routing with events (the default). You can disable it with the `--no-kv-events` flag for prediction-based routing.
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
+
 ### Build container
 
 ```bash

docs/backends/vllm/README.md

Lines changed: 7 additions & 2 deletions
@@ -55,14 +55,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 Below we provide a guide that lets you run all of the common deployment patterns on a single node.
 
-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)
 
-Start using [Docker Compose](../../../deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
 
 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```
 
+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also pass `--kv_store file` to use file-system-based discovery.
+> - **NATS** is optional: it is only needed when using KV routing with events (the default). You can disable it with the `--no-kv-events` flag for prediction-based routing.
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
+
 ### Pull or build container
 
 We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
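For readers who want to pull rather than build, a hedged sketch follows; the registry path is inferred from the NGC Catalog link above, and the image name and tag are placeholders to be looked up there:

```bash
# Placeholder image path and tag; the nvcr.io org/team path is an inference
# from the NGC Catalog link above, and the real artifact name lives there.
docker pull nvcr.io/nvidia/ai-dynamo/<image-name>:<tag>
```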

docs/design_docs/distributed_runtime.md

Lines changed: 48 additions & 23 deletions
@@ -19,56 +19,81 @@ limitations under the License.
 
 ## Overview
 
-Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
+Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., Python bindings can be found in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS). `DistributedRuntime` follows a hierarchical structure:
 
-- `DistributedRuntime`: This is the highest-level object that exposes the distributed runtime interface. It maintains connections to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
+- `DistributedRuntime`: This is the highest-level object that exposes the distributed runtime interface. It manages connections to discovery backends (K8s API or etcd) and optional messaging (NATS for KV events), and handles lifecycle with cancellation tokens.
 - `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from one another.
 - `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
 - `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
 
 While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
 
-For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple workers:
+For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:
 
-- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the `Processor`.
-- `Processor`: When a new request arrives, `Processor` applies the chat template and performs the tokenization.
-Then, it routes the request to the `Worker`.
-- `Worker` components (e.g., `VllmDecodeWorker`, `SGLangDecodeWorker`, `TrtllmWorker`): Perform the actual computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
+- `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
+- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
 
-Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-agg`). Then, under their namespace, they have their own `Component`s: `Frontend` uses the `make_engine` function which handles HTTP serving and routing automatically, while worker components create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate`, `flush_cache`, or `clear_kv_blocks`. The `Frontend` component doesn't explicitly create endpoints - instead, the `make_engine` function handles the HTTP server and worker discovery. Worker components create their endpoints programmatically using the `component.endpoint()` method. Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("worker")`), and their `Endpoint`s are created using the `component.endpoint()` method.
+Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:
+
+- `Frontend` uses the `make_engine` function, which handles HTTP serving, request preprocessing, and worker discovery automatically
+- Worker components register with names like `backend`, `prefill`, `decode`, or `encoder` depending on their role
+- Workers register endpoints like `generate`, `clear_kv_blocks`, or `load_metrics`
+
+Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("backend")`), and their `Endpoint`s are created using the `component.endpoint()` method.
 
 ## Initialization
 
-In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic mode, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on etcd.
+In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are multiple modes for `DistributedRuntime` initialization based on the deployment environment.
 
 ```{caution}
-The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
+The hierarchy and naming may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts remain the same.
 ```
 
-- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following services:
-  - etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
-  - NATS (optional): for KV event messaging and router replica sync. NATS is enabled by default but can be disabled via the `enable_nats` parameter (e.g., using `--no-kv-events` flag). When NATS is disabled, the system operates in approximate mode without KV event persistence. Also legacy nats based request_plane is supported.
+### Service Discovery Backends
+
+The `DistributedRuntime` supports two service discovery backends, configured via `DYN_DISCOVERY_BACKEND`:
+
+- **KV Store Discovery** (`DYN_DISCOVERY_BACKEND=kv_store`): Uses etcd for service discovery. **This is the global default** for all deployments unless explicitly overridden.
+
+- **Kubernetes Discovery** (`DYN_DISCOVERY_BACKEND=kubernetes`): Uses native Kubernetes resources (DynamoWorkerMetadata CRD, EndpointSlices) for service discovery. **Must be explicitly set.** The Dynamo operator automatically sets this environment variable for Kubernetes deployments. **No etcd required.**
 
-etcd and NATS are global services (there could be multiple instances for high availability).
+> **Note:** There is no automatic detection of the deployment environment. The runtime always defaults to `kv_store`. For Kubernetes deployments, the operator injects `DYN_DISCOVERY_BACKEND=kubernetes` into pod environments.
 
-For etcd, it also creates a primary lease and spins up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task fails, the primary lease is revoked or expires and the kv pairs stored with this lease_id are removed.
-- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and are not registered in etcd. They provide the root path for all components under this `Namespace`.
-- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` as the service identifier and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
-- `Endpoint`: When an `Endpoint` object is created and started, it performs two key registrations:
-  - NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
-  - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (e.g., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by the different primary `lease_id`s of their `DistributedRuntime`s.
+When using Kubernetes discovery, the KV store backend automatically switches to in-memory storage since etcd is not needed.
+
+### Runtime Initialization
+
+- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections based on the discovery backend:
+  - **Kubernetes mode**: Uses the K8s API for service registration via the DynamoWorkerMetadata CRD. No external dependencies required.
+  - **KV Store mode**: Connects to etcd for service discovery. Creates a primary lease with a background keep-alive task. All objects registered under this `DistributedRuntime` use this lease_id to maintain their lifecycle.
+  - **NATS** (optional): Used for KV event messaging when using KV-aware routing. Can be disabled via the `--no-kv-events` flag, which enables prediction-based routing without event persistence.
+  - **Request Plane**: TCP by default. Can be configured to use HTTP or NATS via the `DYN_REQUEST_PLANE` environment variable.
+- `Namespace`: `Namespace`s are primarily a logical grouping mechanism. They provide the root path for all components under this `Namespace`.
+- `Component`: When a `Component` object is created, it registers a service in the internal registry of the `DistributedRuntime`, which tracks all services and endpoints.
+- `Endpoint`: When an `Endpoint` object is created and started, it performs registration based on the discovery backend:
+  - **Kubernetes mode**: Endpoint information is stored in DynamoWorkerMetadata CRD resources, which are watched by other components for discovery.
+  - **KV Store mode**: Endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that endpoints of different workers of the same type (e.g., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id`s.
 
 ## Calling Endpoints
 
-Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the names of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject, of the available `Endpoint`s.
+Dynamo uses a `Client` object to call an endpoint. When a `Client` is created, it is given the names of the `Namespace`, `Component`, and `Endpoint`. It then watches for endpoint changes:
+
+- **Kubernetes mode**: Watches DynamoWorkerMetadata CRD resources for endpoint updates.
+- **KV Store mode**: Sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`.
+
+The watcher continuously updates the `Client` with information about available `Endpoint`s.
 
 The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
 
 - `random`: randomly select an endpoint to hit
 - `round_robin`: select endpoints in round-robin order
-- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
+- `direct`: direct the request to a specific endpoint by specifying the instance ID
+
+After selecting which endpoint to hit, the `Client` sends the request using the configured request plane (TCP by default). The request plane handles the actual transport:
 
-After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
+- **TCP** (default): Direct TCP connection with connection pooling
+- **HTTP**: HTTP/2-based transport
+- **NATS**: Message broker-based transport (legacy)
 
 ## Examples
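To ground the configuration surface this diff documents, here is a small hedged sketch. The `DYN_DISCOVERY_BACKEND` values and the etcd key layout are quoted from the text above; the value spellings for `DYN_REQUEST_PLANE` and the example namespace/component names are assumptions:

```bash
# Select the discovery backend (documented above; kv_store is the default,
# and the Dynamo operator injects "kubernetes" on K8s deployments).
export DYN_DISCOVERY_BACKEND=kv_store

# Select the request plane; TCP is the default. Exact value spellings assumed.
export DYN_REQUEST_PLANE=tcp

# With the kv_store backend, registered endpoints live under this etcd prefix;
# the namespace "vllm-agg" and component "backend" are illustrative.
etcdctl get --prefix "/services/vllm-agg/backend/"
```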
