From 72015bc385f8e6349551c4040b8adbd221f89f21 Mon Sep 17 00:00:00 2001
From: sh4shv4t
Date: Wed, 4 Feb 2026 17:24:18 +0530
Subject: [PATCH 1/2] doc(trainer): add architecture diagrams to local execution guides

Signed-off-by: sh4shv4t
---
 .../local-execution-mode/docker.md            | 23 ++++++++++++++
 .../local-execution-mode/local_process.md     | 22 +++++++++++++
 .../local-execution-mode/podman.md            | 31 +++++++++++++++++++
 3 files changed, 76 insertions(+)

diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
index 9a51358c40..05b03a784b 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
@@ -15,6 +15,29 @@ The Container Backend with Docker enables you to run distributed TrainJobs in is
 
 The Docker backend uses the adapter pattern to provide a unified interface, making it easy to switch between Docker and Podman without code changes.
 
+## Architecture
+
+The Container Backend with Docker uses a local orchestration layer to manage TrainJobs within Docker containers. This ensures environment parity between your local machine and production Kubernetes clusters.
+
+```mermaid
+graph LR
+    User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
+
+    SDK -->|1. Pull| Image[Docker Image]
+    SDK -->|2. Net| Net[Bridge Network]
+    SDK -->|3. Run| Daemon[Docker Daemon]
+
+    subgraph DockerEnv [Local Docker Environment]
+        direction TB
+        Daemon -->|Spawn| C1[Node 0]
+        Daemon -->|Spawn| C2[Node 1]
+        C1 <-->|DDP| C2
+    end
+
+    C1 -->|4. Logs| Logs[Stream Logs]
+    C1 -.->|5. Clean| Remove[Auto-Remove]
+```
+
 ## Prerequisites
 
 ### Required Software
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
index beedb3cf92..523a5f7673 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
@@ -217,6 +217,28 @@ finally:
     client.delete_job(job_name)
 ```
 
+## Architecture
+
+The Local Process Backend operates by orchestrating native OS processes. It bypasses container runtimes like Docker, instead managing the lifecycle of your training script through isolated Python Virtual Environments (venvs).
+
+```mermaid
+graph LR
+    User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
+
+    SDK -->|1. Create| Venv[Python Venv]
+    Venv -->|2. Install| Deps[Dependencies]
+    SDK -->|3. Extract| Script[Training Script .py]
+
+    subgraph LocalExec [Local Execution]
+        direction TB
+        Deps --> Process[Python Process]
+        Script --> Process
+    end
+
+    Process -->|4. Logs| Logs[Stream Logs]
+    Process -.->|5. Clean| Cleanup[Delete Venv]
+```
+
 ## How It Works
 
 Understanding the internal workflow helps with debugging and optimization:
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
index f01d3e663e..8e72a5c9d2 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
@@ -168,6 +168,37 @@ backend_config = ContainerBackendConfig(
 )
 ```
 
+## Architecture
+
+The Container Backend with Docker uses a local orchestration layer to manage TrainJobs within Docker containers. This ensures environment parity between your local machine and production Kubernetes clusters. 
+
+```mermaid
+graph LR
+    User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
+
+    SDK -->|1. Prep| PodConfig[Podman Config]
+    SDK -->|2. Mount| LocalDir[Local Dir Mounts]
+    SDK -->|3. Exec| Podman[Podman CLI/API]
+
+    subgraph PodmanEnv [Podman Container - Rootless]
+        direction TB
+        Podman --> Process[Training Process]
+        Process --> Security[User Namespace Isolation]
+    end
+
+    Process -->|4. Logs| Logs[Stream Logs]
+    Process -->|5. Clean| Exit[Exit & Cleanup]
+```
+
+
+### Workflow Detail
+1. **Image Management:** The SDK identifies the required training image. If `pull_policy` is set, it ensures the latest image is available.
+2. **Network Creation:** A dedicated Docker bridge network is created for the job to allow containers (nodes) to communicate via hostnames (e.g., `job-node-0`).
+3. **Container Spawning:** The SDK instructs the Docker Daemon to start containers. It injects environment variables like `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` to enable distributed frameworks (e.g., PyTorch DDP).
+4. **Log Streaming:** Logs are streamed from the containers back to the SDK's `TrainerClient`.
+5. **Lifecycle Management:** Once the training process exits, the SDK handles the removal of containers and the temporary network if `auto_remove=True`.
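The environment injection described in step 3 of the workflow above can be sketched as follows. This is an illustrative helper, not SDK code: the function name and default port are hypothetical, and only the variable names (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) and the `job-node-0` hostname convention come from the workflow description.

```python
# Hypothetical sketch of the per-node environment the SDK is described as
# injecting (step 3 above). The "<job>-node-0" hostname convention comes from
# the workflow text; the helper and its defaults are illustrative only.

def distributed_env(job_name: str, rank: int, world_size: int,
                    master_port: int = 29500) -> dict:
    """Build the env vars one container (node) needs for e.g. PyTorch DDP."""
    return {
        "MASTER_ADDR": f"{job_name}-node-0",  # rank 0 reachable via bridge-network DNS
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),                    # this node's global rank
        "WORLD_SIZE": str(world_size),        # total number of nodes in the job
    }

# Example: the environments for a 2-node job named "my-train-job"
envs = [distributed_env("my-train-job", rank=r, world_size=2) for r in range(2)]
```

Every node receives the same `MASTER_ADDR` (the rank-0 container's hostname); only `RANK` differs per container, which is what lets a distributed framework tell the nodes apart.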
+
 ## Multi-Node Distributed Training
 
 The Podman backend automatically sets up networking and environment variables for distributed training:

From 11700f742f6b1a82d111da73a98bc65b8f51d130 Mon Sep 17 00:00:00 2001
From: sh4shv4t
Date: Sat, 7 Feb 2026 04:42:32 +0530
Subject: [PATCH 2/2] docs(trainer): fix architecture diagrams and documentation accuracy
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- docker.md: show logs streaming from all nodes, clarify conditional cleanup
- podman.md: correct architecture text (Docker→Podman), align diagram with implementation, remove incorrect workflow details
- local_process.md: update diagram to reflect bash script generation and single subprocess execution

These changes address reviewer feedback and align documentation with actual SDK implementation.

Signed-off-by: sh4shv4t
---
 .../local-execution-mode/docker.md            |  3 ++-
 .../local-execution-mode/local_process.md     | 15 +++++------
 .../local-execution-mode/podman.md            | 27 ++++++++-----------
 3 files changed, 20 insertions(+), 25 deletions(-)

diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
index 05b03a784b..e64f07294f 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
@@ -35,7 +35,8 @@ graph LR
     end
 
     C1 -->|4. Logs| Logs[Stream Logs]
-    C1 -.->|5. Clean| Remove[Auto-Remove]
+    C2 -->|4. Logs| Logs
+    SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
 ```
 
 ## Prerequisites
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
index 523a5f7673..e347d21dfb 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
@@ -225,18 +225,17 @@ The Local Process Backend operates by orchestrating native OS processes. It bypa
 graph LR
     User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
 
-    SDK -->|1. Create| Venv[Python Venv]
-    Venv -->|2. Install| Deps[Dependencies]
-    SDK -->|3. Extract| Script[Training Script .py]
+    SDK -->|1. Generate| Script[Bash Script]
 
-    subgraph LocalExec [Local Execution]
+    subgraph LocalExec ["Single Subprocess (bash -c)"]
         direction TB
-        Deps --> Process[Python Process]
-        Script --> Process
+        Script -->|2. Execute| Venv[Create Venv + pip]
+        Venv --> Deps[Install Dependencies]
+        Deps --> Train[Run Entrypoint]
+        Train -.-> Clean["Delete Venv (if cleanup_venv)"]
     end
 
-    Process -->|4. Logs| Logs[Stream Logs]
-    Process -.->|5. Clean| Cleanup[Delete Venv]
+    Train -->|3. Logs| Logs[Stream Logs]
 ```
 
 ## How It Works
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
index 8e72a5c9d2..ff60c0a8c1 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
@@ -170,35 +170,30 @@
 
 ## Architecture
 
-The Container Backend with Docker uses a local orchestration layer to manage TrainJobs within Docker containers. This ensures environment parity between your local machine and production Kubernetes clusters. 
+The Container Backend with Podman uses a local orchestration layer to manage TrainJobs within Podman containers. This ensures environment parity between your local machine and production Kubernetes clusters.
 
 ```mermaid
 graph LR
     User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
 
-    SDK -->|1. Prep| PodConfig[Podman Config]
-    SDK -->|2. Mount| LocalDir[Local Dir Mounts]
-    SDK -->|3. Exec| Podman[Podman CLI/API]
+    SDK -->|1. Pull| Image[Podman Image]
+    SDK -->|2. Net| Net[DNS-Enabled Bridge Network]
+    SDK -->|3. Run| Podman[Podman Engine]
 
-    subgraph PodmanEnv [Podman Container - Rootless]
+    subgraph PodmanEnv [Local Podman Environment]
         direction TB
-        Podman --> Process[Training Process]
-        Process --> Security[User Namespace Isolation]
+        Podman -->|Spawn| C1[Node 0]
+        Podman -->|Spawn| C2[Node 1]
+        C1 <-->|DDP| C2
     end
 
-    Process -->|4. Logs| Logs[Stream Logs]
-    Process -->|5. Clean| Exit[Exit & Cleanup]
+    C1 -->|4. Logs| Logs[Stream Logs]
+    C2 -->|4. Logs| Logs
+    SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
 ```
 
-
-### Workflow Detail
-1. **Image Management:** The SDK identifies the required training image. If `pull_policy` is set, it ensures the latest image is available.
-2. **Network Creation:** A dedicated Docker bridge network is created for the job to allow containers (nodes) to communicate via hostnames (e.g., `job-node-0`).
-3. **Container Spawning:** The SDK instructs the Docker Daemon to start containers. It injects environment variables like `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` to enable distributed frameworks (e.g., PyTorch DDP).
-4. **Log Streaming:** Logs are streamed from the containers back to the SDK's `TrainerClient`.
-5. **Lifecycle Management:** Once the training process exits, the SDK handles the removal of containers and the temporary network if `auto_remove=True`.
-
 ## Multi-Node Distributed Training
 
 The Podman backend automatically sets up networking and environment variables for distributed training:
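The corrected local_process.md diagram above describes the Local Process Backend as generating a bash script that runs in a single subprocess (`bash -c`): create a venv, install dependencies, run the entrypoint, and delete the venv if `cleanup_venv` is set. A rough sketch of that script generation, under stated assumptions — the helper name and argument layout are hypothetical and not the SDK's actual internals, only the step sequence and the `cleanup_venv` flag name come from the diagram:

```python
# Illustrative sketch of the bash script the Local Process Backend is described
# as generating (venv -> pip install -> entrypoint -> optional cleanup),
# executed as one subprocess via `bash -c`. The function and its arguments are
# hypothetical; only the step sequence comes from the documentation above.
import shlex

def build_local_script(venv_dir: str, packages: list, entrypoint: str,
                       cleanup_venv: bool = True) -> str:
    venv = shlex.quote(venv_dir)
    steps = [
        f"python -m venv {venv}",          # 1. create the isolated venv
        f". {venv}/bin/activate",          # 2. activate it for later steps
    ]
    if packages:
        pkgs = " ".join(shlex.quote(p) for p in packages)
        steps.append(f"pip install {pkgs}")  # 3. install dependencies
    steps.append(entrypoint)                 # 4. run the training entrypoint
    if cleanup_venv:
        steps.append(f"rm -rf {venv}")       # 5. delete venv (if cleanup_venv)
    return " && ".join(steps)                # one chained command for `bash -c`

# The backend would then run something like:
#   subprocess.Popen(["bash", "-c", script]) and stream its stdout as logs.
script = build_local_script(".venv", ["torch"], "python train.py")
```

Chaining with `&&` means a failing step (e.g. `pip install`) aborts the run, and the venv is only removed after a successful entrypoint — consistent with the single-subprocess lifecycle the diagram shows.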