
The Docker backend uses the adapter pattern to provide a unified interface, making it easy to switch between Docker and Podman without code changes.
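The adapter pattern mentioned above can be illustrated with a small sketch. The class and method names here (`ContainerRuntime`, `DockerAdapter`, `PodmanAdapter`, `run`) are assumptions for illustration, not the SDK's actual types; the point is that caller code targets one interface while runtime-specific details live in adapters.

```python
# Illustration of the adapter pattern described above. Class and method
# names are assumptions for this sketch, not the SDK's actual types.
from abc import ABC, abstractmethod


class ContainerRuntime(ABC):
    """Unified interface that backend code is written against."""

    @abstractmethod
    def run(self, image: str, name: str) -> str: ...


class DockerAdapter(ContainerRuntime):
    def run(self, image: str, name: str) -> str:
        return f"docker run --name {name} {image}"


class PodmanAdapter(ContainerRuntime):
    def run(self, image: str, name: str) -> str:
        return f"podman run --name {name} {image}"


def launch(runtime: ContainerRuntime) -> str:
    # Caller code is identical regardless of the underlying runtime.
    return runtime.run("pytorch/pytorch:latest", "node-0")


print(launch(DockerAdapter()))
print(launch(PodmanAdapter()))
```

Swapping runtimes is then a one-line change at the call site, which is what allows switching between Docker and Podman without touching training code.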

## Architecture

The Container Backend with Docker uses a local orchestration layer to manage TrainJobs within Docker containers. This ensures environment parity between your local machine and production Kubernetes clusters.

```mermaid
graph LR
User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]

SDK -->|1. Pull| Image[Docker Image]
SDK -->|2. Net| Net[Bridge Network]
SDK -->|3. Run| Daemon[Docker Daemon]

subgraph DockerEnv [Local Docker Environment]
direction TB
Daemon -->|Spawn| C1[Node 0]
Daemon -->|Spawn| C2[Node 1]
C1 <-->|DDP| C2
end

C1 -->|4. Logs| Logs[Stream Logs]
C1 -.->|5. Clean| Remove[Auto-Remove]
```
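The numbered steps in the diagram can be sketched as a plan the SDK assembles before talking to the Docker daemon. This is a minimal illustration, not the SDK's real code: the field names and the `{job}-node-{rank}` naming scheme are assumptions that mirror the per-node hostname convention used for distributed jobs.

```python
# Illustrative sketch of the orchestration plan behind the diagram above.
# Not the SDK's real code: names and fields are assumptions.


def plan_train_job(job_name: str, image: str, num_nodes: int = 2) -> dict:
    """Build the pull/network/run plan for a local Docker TrainJob."""
    network = f"{job_name}-net"  # step 2: dedicated bridge network
    containers = [
        {
            "name": f"{job_name}-node-{rank}",  # hostname on the bridge network
            "image": image,                     # step 1: image to pull
            "network": network,
            "auto_remove": True,                # step 5: clean up on exit
        }
        for rank in range(num_nodes)
    ]
    return {"image": image, "network": network, "containers": containers}


plan = plan_train_job("mnist", "pytorch/pytorch:latest", num_nodes=2)
print(plan["containers"][0]["name"])  # mnist-node-0
```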

## Prerequisites

### Required Software

```python
finally:
    client.delete_job(job_name)
```

## Architecture

The Local Process Backend orchestrates native OS processes. It bypasses container runtimes such as Docker, instead managing the lifecycle of your training script through isolated Python virtual environments (venvs).

```mermaid
graph LR
User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]

SDK -->|1. Create| Venv[Python Venv]
Venv -->|2. Install| Deps[Dependencies]
SDK -->|3. Extract| Script[Training Script .py]

subgraph LocalExec [Local Execution]
direction TB
Deps --> Process[Python Process]
Script --> Process
end

Process -->|4. Logs| Logs[Stream Logs]
Process -.->|5. Clean| Cleanup[Delete Venv]
```
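The lifecycle in the diagram can be reproduced with nothing but the standard library: create a venv, write the training script to a file, run it with the venv's interpreter, capture the logs, then delete the venv. This is a minimal sketch of the same sequence of steps, not the backend's actual implementation; paths and the toy script are illustrative.

```python
# Minimal sketch of the Local Process Backend lifecycle using only the
# standard library. Paths and the toy training script are illustrative.
import os
import shutil
import subprocess
import tempfile
import venv

workdir = tempfile.mkdtemp(prefix="trainjob-")
venv_dir = os.path.join(workdir, "venv")

# Step 1: create an isolated venv (pip skipped to keep the sketch fast;
# step 2 would normally `pip install` the job's dependencies here).
venv.EnvBuilder(with_pip=False).create(venv_dir)

# Step 3: extract the training script to a .py file.
script = os.path.join(workdir, "train.py")
with open(script, "w") as f:
    f.write("print('training step done')\n")

# Locate the venv's interpreter (bin/ on POSIX, Scripts\ on Windows).
bindir = "Scripts" if os.name == "nt" else "bin"
python = os.path.join(venv_dir, bindir,
                      "python.exe" if os.name == "nt" else "python")

# Step 4: run the process and capture its logs.
result = subprocess.run([python, script], capture_output=True, text=True)
print(result.stdout.strip())

# Step 5: delete the venv and workspace.
shutil.rmtree(workdir)
```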

## How It Works

Understanding the internal workflow helps with debugging and optimization:

```python
backend_config = ContainerBackendConfig(
)
```

## Architecture

The Container Backend with Podman uses a local orchestration layer to manage TrainJobs within Podman containers. This ensures environment parity between your local machine and production Kubernetes clusters.

```mermaid
graph LR
User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]

SDK -->|1. Prep| PodConfig[Podman Config]
SDK -->|2. Mount| LocalDir[Local Dir Mounts]
SDK -->|3. Exec| Podman[Podman CLI/API]

subgraph PodmanEnv [Podman Container - Rootless]
direction TB
Podman --> Process[Training Process]
Process --> Security[User Namespace Isolation]
end

Process -->|4. Logs| Logs[Stream Logs]
Process -->|5. Clean| Exit[Exit & Cleanup]
```
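The "Exec" step in the diagram amounts to assembling a rootless `podman run` invocation. The sketch below builds such an argument list; `--rm`, `-v`, `-e`, and `--userns=keep-id` are real Podman flags, but the exact flag set the SDK emits is an assumption for illustration.

```python
# Sketch of assembling a rootless `podman run` invocation like the one the
# diagram's "Exec" step implies. The flag set is an assumption, not
# necessarily what the SDK emits.


def podman_run_args(image: str, name: str, mounts: dict, env: dict) -> list:
    args = ["podman", "run", "--rm", "--name", name,
            "--userns=keep-id"]          # map the host user into the container
    for host, ctr in mounts.items():     # step 2: local directory mounts
        args += ["-v", f"{host}:{ctr}"]
    for key, val in env.items():
        args += ["-e", f"{key}={val}"]
    args.append(image)
    return args


argv = podman_run_args(
    "pytorch/pytorch:latest",
    "train-node-0",
    mounts={"/home/user/data": "/workspace/data"},
    env={"RANK": "0"},
)
print(" ".join(argv))
```

Because Podman runs rootless, `--userns=keep-id` keeps file ownership on the mounted directories consistent between host and container.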

### Workflow Detail
1. **Image Management:** The SDK identifies the required training image. If `pull_policy` is set, it ensures the latest image is available.
2. **Network Creation:** A dedicated bridge network is created for the job so that containers (nodes) can communicate via hostnames (e.g., `job-node-0`).
3. **Container Spawning:** The SDK instructs Podman to start containers, injecting environment variables such as `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` to enable distributed frameworks (e.g., PyTorch DDP).
4. **Log Streaming:** Logs are streamed from the containers back to the SDK's `TrainerClient`.
5. **Lifecycle Management:** Once the training process exits, the SDK handles the removal of containers and the temporary network if `auto_remove=True`.
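Steps 2 and 3 above can be sketched as a small helper that derives each node's hostname and environment from the job name and node count. The variable names (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) come from the text; the helper itself and its default port are illustrative, not the SDK's actual code.

```python
# Sketch of the per-node environment described in steps 2-3. The variable
# names come from the workflow above; the helper itself is illustrative.


def node_environment(job_name: str, world_size: int, master_port: int = 29500):
    """Yield (hostname, env) for each node of a distributed TrainJob."""
    master_addr = f"{job_name}-node-0"  # rank 0 is reachable by hostname
    for rank in range(world_size):
        env = {
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "RANK": str(rank),
            "WORLD_SIZE": str(world_size),
        }
        yield f"{job_name}-node-{rank}", env


for hostname, env in node_environment("train", 2):
    print(hostname, env["RANK"])
```

Every node receives the same `MASTER_ADDR` (rank 0's hostname on the job network), which is what lets PyTorch DDP rendezvous across containers.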

## Multi-Node Distributed Training

The Podman backend automatically sets up networking and environment variables for distributed training.