
Commit f1480f5

Add DGXCloudExecutor docs and update execution guide (#192)
* Add DGXCloudExecutor docs and update execution guide
  Signed-off-by: Hemil Desai <[email protected]>
* fix
  Signed-off-by: Hemil Desai <[email protected]>

---------

Signed-off-by: Hemil Desai <[email protected]>
1 parent 1f79d9c commit f1480f5


1 file changed (+68, -3 lines)


docs/source/guides/execution.md

Lines changed: 68 additions & 3 deletions
````diff
@@ -37,9 +37,10 @@ The packager support matrix is described below:
 | Executor | Packagers |
 |----------|----------|
 | LocalExecutor | run.Packager |
-| DockerExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager |
-| SlurmExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager |
-| SkypilotExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager |
+| DockerExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
+| SlurmExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
+| SkypilotExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
+| DGXCloudExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
 
 `run.Packager` is a passthrough base packager.
 
````
````diff
@@ -78,6 +79,27 @@ You can use `run.PatternPackager` to package your code by specifying `include_pa
 cd {relative_path} && find {relative_include_pattern} -type f
 ```
 
+`run.HybridPackager` allows combining multiple packagers into a single archive. This is useful when you need to package different parts of your project using different strategies (e.g., a git archive for committed code and a pattern packager for generated artifacts).
+
+Each sub-packager in the `sub_packagers` dictionary is assigned a key, which becomes the directory name under which its contents are placed in the final archive. If `extract_at_root` is set to `True`, all contents are placed directly in the root of the archive, potentially overwriting files if names conflict.
+
+Example:
+```python
+import nemo_run as run
+import os
+
+hybrid_packager = run.HybridPackager(
+    sub_packagers={
+        "code": run.GitArchivePackager(subpath="src"),
+        "configs": run.PatternPackager(include_pattern="configs/*.yaml", relative_path=os.getcwd()),
+    }
+)
+
+# Usage with an executor:
+# executor.packager = hybrid_packager
+```
+This creates an archive in which the contents of `src` are placed under a `code/` directory and the matched `configs/*.yaml` files under a `configs/` directory.
+
 ### Defining Executors
 Next, we'll describe details on setting up each of the executors below.
 
````
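To make the archive layout described in the added `run.HybridPackager` docs concrete, here is a minimal sketch using only Python's standard `tarfile` module. It is not how `nemo_run` implements `run.HybridPackager`. The helpers `make_tar` and `combine_archives` are hypothetical. The sketch takes one already-built tar archive per sub-packager key and copies each file under a `<key>/` prefix, or to the archive root when `extract_at_root=True`. The rule that a later key's file wins a name clash is an assumption of this sketch.

```python
from __future__ import annotations

import io
import tarfile


def make_tar(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed in-memory tar archive from {path: contents}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def combine_archives(sub_archives: dict[str, bytes], extract_at_root: bool = False) -> bytes:
    """Merge per-key tar archives into one, mimicking the documented layout (hypothetical).

    Each key becomes a top-level directory (like the keys of sub_packagers).
    With extract_at_root=True every file lands at the archive root instead,
    and a later archive's file replaces an earlier one with the same path.
    """
    merged: dict[str, bytes] = {}
    for key, blob in sub_archives.items():
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as src:
            for member in src.getmembers():
                if not member.isfile():
                    continue  # this sketch only carries regular files
                data = src.extractfile(member).read()
                path = member.name if extract_at_root else f"{key}/{member.name}"
                merged[path] = data  # later entries overwrite earlier ones
    return make_tar(merged)


if __name__ == "__main__":
    code = make_tar({"train.py": b"print('train')\n"})
    configs = make_tar({"base.yaml": b"lr: 0.1\n"})
    combined = combine_archives({"code": code, "configs": configs})
    with tarfile.open(fileobj=io.BytesIO(combined)) as tar:
        print(sorted(tar.getnames()))  # ['code/train.py', 'configs/base.yaml']
```

Merging through a dictionary keyed by path makes the "potentially overwriting" caveat explicit. With `extract_at_root=True`, two sub-packagers that both produce `train.py` collapse into a single entry. That is why keeping the default per-key directories is usually the safer choice.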
````diff
@@ -199,3 +221,46 @@ executor = your_skypilot_cluster(nodes=8, devices=8, container_image="your-nemo-
 ```
 
 As demonstrated in the examples, defining executors in Python offers great flexibility. You can easily mix and match things like common environment variables, and the separation of tasks from executors enables you to run the same configured task on any supported executor.
+
+#### DGXCloudExecutor
+
+The `DGXCloudExecutor` integrates with a DGX Cloud cluster's Run:ai API to launch distributed jobs. It uses REST API calls to authenticate, identify the target project and cluster, and submit the job specification.
+
+> **_WARNING:_** Currently, the `DGXCloudExecutor` is only supported when launching experiments *from* a pod running on the DGX Cloud cluster itself. Furthermore, this launching pod must have access to a Persistent Volume Claim (PVC) where the experiment/job directories will be created, and this same PVC must also be configured to be mounted by the job being launched.
+
+Here's an example configuration:
+
+```python
+def your_dgx_executor(nodes: int, gpus_per_node: int, container_image: str):
+    # Ensure these are set correctly for your DGX Cloud environment
+    # You might fetch these from environment variables or a config file
+    base_url = "YOUR_DGX_CLOUD_API_ENDPOINT"  # e.g., https://<cluster-name>.<domain>/api/v1
+    app_id = "YOUR_RUNAI_APP_ID"
+    app_secret = "YOUR_RUNAI_APP_SECRET"
+    project_name = "YOUR_RUNAI_PROJECT_NAME"
+    # Define the PVC that will be mounted in the job pods
+    # Ensure your NEMORUN_HOME is located under the mount path specified here
+    pvc_name = "your-pvc-k8s-name"  # The Kubernetes name of the PVC
+    pvc_mount_path = "/your_custom_path"  # The path where the PVC will be mounted inside the container
+
+    executor = run.DGXCloudExecutor(
+        base_url=base_url,
+        app_id=app_id,
+        app_secret=app_secret,
+        project_name=project_name,
+        container_image=container_image,
+        nodes=nodes,
+        gpus_per_node=gpus_per_node,
+        pvcs=[{"name": pvc_name, "path": pvc_mount_path}],
+        # Optional: custom environment variables (common_envs() is assumed to return a dict defined elsewhere)
+        env_vars=common_envs(),
+        # packager=run.GitArchivePackager()  # Choose an appropriate packager
+    )
+    return executor
+
+# Example usage:
+# executor = your_dgx_executor(nodes=4, gpus_per_node=8, container_image="your-nemo-image")
+
+```
+
+For a complete end-to-end example using DGX Cloud with NeMo, refer to the [NVIDIA DGX Cloud NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html).
````
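The warning and the `NEMORUN_HOME` comment in the added `DGXCloudExecutor` docs reduce to one precondition. The directory where experiments are written must sit on a PVC that is also mounted into the job. Here is a minimal pre-flight sketch that checks this condition and also reads the connection settings from environment variables, as the example's comments suggest. The helper names `load_dgx_settings` and `job_dir_on_pvc` and the `DGX_*` variable names are hypothetical and not part of `nemo_run`. The `pvcs` argument reuses the `{"name": ..., "path": ...}` dictionaries from the example configuration.

```python
from __future__ import annotations

import os
from pathlib import PurePosixPath

# Hypothetical setting names; they mirror the DGXCloudExecutor arguments in the example.
_SETTING_KEYS = ("base_url", "app_id", "app_secret", "project_name")


def load_dgx_settings(prefix: str = "DGX_") -> dict[str, str]:
    """Read connection settings from the environment (hypothetical variable names).

    For prefix "DGX_" this reads DGX_BASE_URL, DGX_APP_ID, DGX_APP_SECRET and
    DGX_PROJECT_NAME, raising KeyError for the first one that is missing.
    """
    settings: dict[str, str] = {}
    for key in _SETTING_KEYS:
        var = prefix + key.upper()
        if var not in os.environ:
            raise KeyError(f"missing environment variable {var}")
        settings[key] = os.environ[var]
    return settings


def job_dir_on_pvc(job_dir: str, pvcs: list[dict[str, str]]) -> str | None:
    """Return the name of the first PVC whose mount path contains job_dir.

    Paths are compared lexically (no symlink or ".." resolution), so pass
    absolute, normalized paths. Returns None if no mount path matches.
    """
    target = PurePosixPath(job_dir)
    for pvc in pvcs:
        mount = PurePosixPath(pvc["path"])
        if target == mount or mount in target.parents:
            return pvc["name"]
    return None


if __name__ == "__main__":
    pvcs = [{"name": "your-pvc-k8s-name", "path": "/your_custom_path"}]
    home = os.environ.get("NEMORUN_HOME")
    if home is None:
        print("NEMORUN_HOME is not set")
    else:
        print(job_dir_on_pvc(home, pvcs) or f"{home} is not under any PVC mount path")
```

The check compares path components rather than string prefixes, so a mount at `/mnt/pvc` does not falsely match `/mnt/pvc2/...`. The keys returned by `load_dgx_settings` match the `base_url`, `app_id`, `app_secret` and `project_name` arguments shown in the example. In this sketch they could therefore be passed as `run.DGXCloudExecutor(**load_dgx_settings(), ...)`.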
