diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
index 9a51358c40..e64f07294f 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
@@ -15,6 +15,30 @@ The Container Backend with Docker enables you to run distributed TrainJobs in is
 
 The Docker backend uses the adapter pattern to provide a unified interface, making it easy to switch between Docker and Podman without code changes.
 
+## Architecture
+
+The Container Backend with Docker uses a local orchestration layer to manage TrainJobs within Docker containers. This ensures environment parity between your local machine and production Kubernetes clusters.
+
+```mermaid
+graph LR
+    User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
+
+    SDK -->|1. Pull| Image[Docker Image]
+    SDK -->|2. Net| Net[Bridge Network]
+    SDK -->|3. Run| Daemon[Docker Daemon]
+
+    subgraph DockerEnv [Local Docker Environment]
+        direction TB
+        Daemon -->|Spawn| C1[Node 0]
+        Daemon -->|Spawn| C2[Node 1]
+        C1 <-->|DDP| C2
+    end
+
+    C1 -->|4. Logs| Logs[Stream Logs]
+    C2 -->|4. Logs| Logs
+    SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
+```
+
 ## Prerequisites
 
 ### Required Software
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
index beedb3cf92..e347d21dfb 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
@@ -217,6 +217,27 @@ finally:
     client.delete_job(job_name)
 ```
 
+## Architecture
+
+The Local Process Backend operates by orchestrating native OS processes. It bypasses container runtimes like Docker, instead managing the lifecycle of your training script through isolated Python Virtual Environments (venvs).
+
+```mermaid
+graph LR
+    User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
+
+    SDK -->|1. Generate| Script[Bash Script]
+
+    subgraph LocalExec ["Single Subprocess (bash -c)"]
+        direction TB
+        Script -->|2. Execute| Venv[Create Venv + pip]
+        Venv --> Deps[Install Dependencies]
+        Deps --> Train[Run Entrypoint]
+        Train -.-> Clean["Delete Venv (if cleanup_venv)"]
+    end
+
+    Train -->|3. Logs| Logs[Stream Logs]
+```
+
 ## How It Works
 
 Understanding the internal workflow helps with debugging and optimization:
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
index f01d3e663e..ff60c0a8c1 100644
--- a/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
+++ b/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
@@ -168,6 +168,32 @@ backend_config = ContainerBackendConfig(
 )
 ```
 
+## Architecture
+
+The Container Backend with Podman uses a local orchestration layer to manage TrainJobs within Podman containers. This ensures environment parity between your local machine and production Kubernetes clusters.
+
+```mermaid
+graph LR
+    User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
+
+    SDK -->|1. Pull| Image[Podman Image]
+    SDK -->|2. Net| Net[DNS-Enabled Bridge Network]
+    SDK -->|3. Run| Podman[Podman Engine]
+
+    subgraph PodmanEnv [Local Podman Environment]
+        direction TB
+        Podman -->|Spawn| C1[Node 0]
+        Podman -->|Spawn| C2[Node 1]
+        C1 <-->|DDP| C2
+    end
+
+    C1 -->|4. Logs| Logs[Stream Logs]
+    C2 -->|4. Logs| Logs
+    SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
+```
+
+
+
 ## Multi-Node Distributed Training
 
 The Podman backend automatically sets up networking and environment variables for distributed training: