
Commit 5052391

Updated documentation for deploying RDMA-enabled shared node pools. (#60)

- BUGFIX: Fixed a timing issue related to webhook deployment.
- BUGFIX: ADB deployment was breaking because OCPUs are deprecated for ADB; changed to ECPUs.
1 parent c80e047 commit 5052391

14 files changed (+147, -64 lines)


docs/api_documentation/README.md

Lines changed: 7 additions & 2 deletions
@@ -11,9 +11,14 @@
 | recipe_id | string | Yes | One of the following: `llm_inference_nvidia`, `lora_finetune_nvidia`, or `mlcommons_lora_finetune_nvidia` |
 | deployment_name | string | Yes | Any deployment name to identify the deployment details easily. Must be unique from other recipe deployments. |
 | recipe_mode | string | Yes | One of the following: `service`, `job`, `update`, or `shared_node_pool`. Enter `service` for inference recipe deployments, `job` for fine-tuning recipe deployments, `update` for updating existing deployments (currently only supported for MIG), and `shared_node_pool` for creating a shared node pool. |
-| recipe_node_labels | object | No | Additional labels to apply to a node pool in the form `{"label": "value"}` |
+| recipe_node_labels | object[string][string] | No | Additional labels to apply to a node pool in the form `{"label": "value"}` |
 | service_endpoint_domain | string | No | Required for inference recipe deployments. Inference endpoint will point to this domain. |
-| recipe_max_pods_per_node | int | No | Allow a node to schedule more pods than the Kubernetes default of 31. Required for certain MIG configurations, which can slice up to 56 times. |
+| recipe_max_pods_per_node | int | No | Allow a node to schedule more pods than the Kubernetes default of 31. Required for certain MIG configurations, which can slice up to 56 times. |
+| recipe_availability_domain | string | No | Required for RDMA-enabled shared node pool deployments. Optional for shared node pool (non-RDMA) and recipe deployments. |
+| recipe_public_ssh_key | string | No | Optionally adds an SSH key to RDMA-enabled node pools for connectivity via SSH. |
+| recipe_node_image_ocid | string | No | Required for RDMA-enabled shared node pool deployments. Optional for shared node pool (non-RDMA) and recipe deployments. |
+| recipe_container_memory_size | int | No | Memory in GB that the recipe must have to schedule. This is both the amount of memory a node must have available to schedule the recipe and an upper bound for the container. |
+| recipe_container_cpu_count | int | No | Number of CPUs that the recipe must have to schedule. This is both the number of CPU cores a node must have available to schedule the recipe and an upper bound for the container. |
 | recipe_container_port | string | No | Required for inference recipe deployments. Inference endpoint will point to this port. |
 | recipe_node_shape | string | Yes | Enter the shape of the node that you want to deploy the recipe onto. Example: `BM.GPU.A10.4` |
 | recipe_node_pool_size | int | Yes | Number of nodes that you want to allocate for this recipe deployment. Ensure you have sufficient capacity. This feature is under development. Always enter 1. |
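
For orientation, here is a hypothetical `/deployment` payload exercising several of the parameters above, including the new container sizing fields. All values are illustrative placeholders, not a validated configuration; consult the table for which fields your recipe actually requires.

```
{
  "recipe_id": "llm_inference_nvidia",
  "recipe_mode": "service",
  "deployment_name": "inference-example",
  "recipe_node_shape": "BM.GPU.A10.4",
  "recipe_node_pool_size": 1,
  "recipe_container_port": "8000",
  "recipe_container_memory_size": 100,
  "recipe_container_cpu_count": 30,
  "recipe_node_labels": {"team": "ml-example"}
}
```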

docs/custom_blueprints/blueprint_json_schema.json

Lines changed: 12 additions & 0 deletions
@@ -55,6 +55,12 @@
     "recipe_max_pods_per_node": {
       "type": "integer"
     },
+    "recipe_availability_domain": {
+      "type": "string"
+    },
+    "recipe_public_ssh_key": {
+      "type": "string"
+    },
     "recipe_node_pool_size": {
       "type": "integer"
     },
@@ -70,6 +76,9 @@
     "recipe_node_selector_arch": {
       "type": "string"
     },
+    "recipe_node_image_ocid": {
+      "type": "string"
+    },
     "recipe_flex_shape_ocpu_count": {
       "type": "integer"
     },
@@ -242,6 +251,9 @@
     "recipe_container_memory_size": {
       "type": "integer"
     },
+    "recipe_container_cpu_count": {
+      "type": "integer"
+    },
     "input_object_storage": {
       "type": "array",
       "items": {

docs/iam_policies/README.md

Lines changed: 2 additions & 0 deletions
@@ -96,4 +96,6 @@ Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to read virtual-netw
 Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to inspect compartments in compartment {compartment_name}

 Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to manage cluster-node-pools in compartment {compartment_name}
+
+Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to {CLUSTER_JOIN} in compartment {compartment_name}
 ```

docs/multi_node_inference/README.md

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ Use multi-node inference whenever you are trying to use a very large model that

 ## RDMA + Multinode Inference

-Want to use RDMA with multinode inference? [See here for details](../using_rdma_enabled_node_pools)
+Want to use RDMA with multinode inference? [See here for details](../using_rdma_enabled_node_pools/README.md)

 ## How to use it?

@@ -72,7 +72,7 @@ The following parameters are required:

 - `multinode_num_nodes_to_use_from_shared_pool` -> the total number of nodes (as an integer) you want to use to serve this model. This number must be less than the size of the shared node pool, and will only use schedulable nodes in the pool.

-- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": "true"` -> If you have deployed an HPC cluster with RDMA enabled for node pools - [see here for details](../deploy_ai_blueprints_onto_hpc_cluster) - enable RDMA communication between nodes (currently only supported for BM.GPU.H100.8). This will fail validation if RDMA is not supported for shape type, or node is missing appropriate labels described in linked doc.
+- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": true` -> If you have provisioned RDMA-enabled shared node pools in your cluster, this enables RDMA communication between nodes. It will fail validation if RDMA is not supported for the shape type, or if a node is missing the appropriate labels described in the [linked doc](../using_rdma_enabled_node_pools/README.md).

 - [OPTIONAL] `recipe_readiness_probe_params` -> Readiness probe to ensure that the service is ready to serve requests. Parameter details can be found [here](../startup_liveness_readiness_probes/README.md).

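The key correction in this hunk is that the flag is now a real JSON boolean rather than the string `"true"`. A minimal, hypothetical fragment of a multi-node deployment with RDMA enabled (all other required fields omitted) would therefore read:

```
{
  "recipe_use_shared_node_pool": true,
  "multinode_num_nodes_to_use_from_shared_pool": 2,
  "multinode_rdma_enabled_in_shared_pool": true
}
```
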
docs/using_rdma_enabled_node_pools/README.md

Lines changed: 72 additions & 36 deletions
@@ -1,33 +1,82 @@
 # Using RDMA Enabled Node Pools

-Currently, AI Blueprints does not have the ability to provision node pools with RDMA configured. However, it can use node pools which were configured previously. Blueprints support for deploying node pools with RDMA configured is coming soon.
+Remote Direct Memory Access (RDMA) is a protocol that lets one node read from or write to the memory of another node without involving either machine's CPU or operating system, enabling true zero-copy data transfers and dramatically reducing latency and CPU overhead. In large-scale AI workloads such as multi-node training with AllReduce or disaggregated LLM inference, RDMA can yield tremendous performance gains by significantly reducing communication and copy overhead between nodes.

-If you already have a cluster with RDMA enabled node pools, jump to [install AI Blueprints onto an existing cluster](./README.md#install-ai-blueprints-onto-existing-cluster).
+Blueprints uses [OCI cluster networks with instance pools](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/managingclusternetworks.htm) to provision RDMA-enabled node pools.

-Additionally, RDMA is currently only supported for H100 shapes, but A100, H200, and B200 shapes are being added in short order.
+Note: stick to our supported shapes, as these have been validated with Blueprints. To request additional shape support, open an issue on GitHub.

-## Optional - Deploy an HPC Cluster with the OCI-OKE-HPC Quickstart
+RDMA is currently supported for:

-[The oci-hpc-oke quickstart](https://github.com/oracle-quickstart/oci-hpc-oke) provides a straightforward way to deploy RDMA enabled node pools into an OKE cluster. Follow that quickstart with these helpful tips to deploy an OKE cluster with an RDMA enabled node pool.
+- BM.GPU.H100.8
+- BM.GPU.H200.8
+- BM.GPU.B4.8

-If you do not use this method, you will need to bring your own cluster with a node pool with RDMA connectivity to use recipes with RDMA enabled.
+Additional shape support is coming soon.

-Tips to look out for in the oci-hpc-oke stack:
+## Provision RDMA-enabled shared node pools with Blueprints

-**Tip 1**: The main GitHub readme for that repository provides PAR links to images required for GPU nodes with RDMA connectivity.
-- Right click on the appropriate combination (IE GPU driver 560 & CUDA 12.6) and copy link to get the PAR.
-- Go to the tenancy + region in which you'd like to import the image to be used during the quickstart deployment. Follow [this doc](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/custom-images-import.htm#listing-custom-images) to import the custom image into that tenancy / region.
-- Once the image is done importing, it will be usable during cluster deployment.
+This section describes the steps to provision RDMA-enabled shared node pools.
+
+If you already have a cluster with RDMA enabled node pools, for example [from this quickstart](https://github.com/oracle-quickstart/oci-hpc-oke), jump to [install AI Blueprints onto an existing cluster](./README.md#install-ai-blueprints-onto-existing-cluster).
+
+If not, proceed below.
+
+## Required policies
+
+The specific policy required for RDMA-enabled shared node pools is:
+```
+Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to {CLUSTER_JOIN} in compartment {compartment_name}
+```
+
+The fine-grained policy list for blueprints can be found [here](../iam_policies/README.md).
+
+## Import a custom image
+
+The [oci-hpc-oke quickstart](https://github.com/oracle-quickstart/oci-hpc-oke?tab=readme-ov-file#images-to-use) provides node images with the proper drivers and libraries for RDMA connectivity between nodes, for various GPU driver / CUDA toolkit versions. One of these images must be imported into your tenancy in the correct region (and possibly compartment, depending on policies) to provision RDMA-enabled shared node pools.
+
+- Right click on the appropriate combination (e.g., GPU driver 560 & CUDA 12.6) and copy the link to get the PAR
+- Log in to the tenancy + region in which you'd like to import the image
+- In the console, click the hamburger in the top left -> Compute -> Instances -> Custom Images
+- Go to the compartment in which you'd like to import the image, then click "Import image"
+- Set the compartment in "Create in compartment", name the image, then ensure the OS is set to "Ubuntu", as these are Ubuntu images
+- Click the radio button "Import from an Object Storage URL"
+- Paste the PAR URL retrieved above into the Object Storage URL box
+- For image type, select "OCI"
+- Add any tags you'd like, then click "Import image" at the bottom
+- Once the image is done importing (30 minutes to an hour), it will be usable during cluster deployment
+- To use the image in recipes, you will need to retrieve the image OCID
+
+[This doc](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/custom-images-import.htm#listing-custom-images) provides complete details for all image importing options.
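
If you prefer the CLI to the console, a rough equivalent of the import flow above is sketched below. This is an untested outline: the display name and compartment OCID are placeholders, and the full flag set is documented under `oci compute image import from-object-uri --help`.

```
# Import a custom image from the quickstart PAR URL (values are placeholders)
oci compute image import from-object-uri \
  --uri "<PAR_URL_COPIED_FROM_QUICKSTART>" \
  --compartment-id "ocid1.compartment.oc1..example" \
  --display-name "oke-rdma-ubuntu-gpu560-cuda126" \
  --operating-system "Ubuntu"

# After the import completes, list images to retrieve the image OCID for recipes
oci compute image list \
  --compartment-id "ocid1.compartment.oc1..example" \
  --display-name "oke-rdma-ubuntu-gpu560-cuda126"
```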
+
+## Deploying an RDMA-enabled shared node pool
+
+Once the image has been imported, it is possible to deploy a shared node pool with RDMA connectivity via AI Blueprints.
+
+In addition to the parameters described in [the shared node pool doc](../shared_node_pools/README.md#without-selector), the following additional parameters are required:
+- `"recipe_availability_domain": "<FULL AD NAME>"` -> the full name of the availability domain where you have capacity for the nodes. Examples: `"TrcQ:AP-MELBOURNE-1-AD-1"`, `"TrcQ:EU-FRANKFURT-1-AD-3"`. These can generally be found in the console via Hamburger (top left) -> Governance & Administration -> Tenancy Management -> Limits, Quotas and Usage
+
+- `"recipe_node_image_ocid": "<image ocid>"` -> the OCID of the custom image you imported
+
+- `"multinode_rdma_enabled_in_shared_pool": true` -> boolean telling Blueprints to set up RDMA. **Important** - if this is left off, Blueprints will provision the specified shape as a shared node pool without RDMA connectivity, and this cannot be undone except by deleting and recreating the pool.
+
+- `"shared_node_pool_size": >1` -> this must be a number greater than 1, as RDMA is fundamentally **inter-node** connectivity.
+
+See this [example blueprint](./rdma_shared_node_pool.json).
+
+Populate the example with the correct shape, AD, and image OCID, and paste it into the `/deployment` API endpoint to deploy a 2-node RDMA-enabled pool which can be used for downstream blueprints.
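
The linked file is not reproduced in this diff, so as a sketch only, a filled-in blueprint built from the parameters above (with placeholder AD and OCID values) might look like this; the `rdma_shared_node_pool.json` in the repository is authoritative:

```
{
  "recipe_mode": "shared_node_pool",
  "deployment_name": "h100-rdma-pool",
  "recipe_node_shape": "BM.GPU.H100.8",
  "shared_node_pool_size": 2,
  "recipe_availability_domain": "TrcQ:EU-FRANKFURT-1-AD-3",
  "recipe_node_image_ocid": "ocid1.image.oc1.eu-frankfurt-1.example",
  "multinode_rdma_enabled_in_shared_pool": true
}
```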
+
+## Using RDMA-enabled nodes in a blueprint
+
+To use RDMA in a blueprint, the following fields must be added to deploy to the nodes configured in the previous step:
+- `"recipe_use_shared_node_pool": true` -> RDMA is only supported in shared pool mode
+- `"multinode_rdma_enabled_in_shared_pool": true` -> lets Blueprints know that this deployment should use RDMA configurations in the backend
+- `"multinode_num_nodes_to_use_from_shared_pool": 2` -> number of nodes from the RDMA-enabled pool to use for this deployment
+
+[This blueprint](./rdma_distributed_inference.json) performs, as an example, a multi-node distributed inference deployment of Llama-3.1-405b-Instruct to 2 H100 nodes communicating via RDMA and serves it over a public endpoint. 405b with fp16 was selected because the weights are too large to load onto a single BM.GPU.H100.8: loading them takes around 900GB of GPU vRAM.
+
+The `recipe_container_env` has been left in so you can see in the pod logs that the nodes are communicating via RDMA, but it can be removed to minimize bloat in the logs.
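
To spot that RDMA traffic in practice, a quick check along the following lines can help. The pod name is a placeholder (find it via `kubectl get pods`), and the grep pattern is just a convenience filter for RDMA/NCCL transport messages, not an official marker:

```
# List pods for the deployment, then scan a pod's logs for RDMA/NCCL lines
kubectl get pods
kubectl logs <pod-name> | grep -i -E "rdma|nccl"
```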

-**Tip 2**: During deployment of the HPC cluster, here is a description of the fields - Pay special attention to **Workers: Operational** and **Workers: GPU+RDMA**:
-- Create policies: Create the policies needed by the cluster. If this is unchecked, this must be done manually
-- Network: Creates the VCN for the cluster with appropriate security rules
-- Bastion & Operator: Gives ssh access to worker nodes and an internal operator to access kubernetes cluster on the operator node
-- OKE Cluster: The configuration of the OKE cluster nodes operating the control plane and pods for cluster management - should be CPU nodes
-- **Workers: Operational [REQUIRED]**: IMPORTANT - this is a common "pitfall". If enabling an RDMA node pool in the section below, put these in the same availability domain as the **Workers: GPU + RDMA**. You can use CPUs for this, such as the default VM.Standard.E5.Flex, but **the image should be the same as the one you use for the GPU+RDMA**, as these nodes will be used to check RDMA health, so they need the software stack.
-- Workers: CPU [OPTIONAL]: Non-RDMA CPU worker nodes to stand up with cluster (leave off as AI Blueprints can provision these if required)
-- Workers: GPU [OPTIONAL]: Non-RDMA GPU worker nodes to stand up with cluster (leave off as AI Blueprints can provision these if required)
-- **Workers: GPU+RDMA [REQUIRED]**: If you desire RDMA nodes, provision at least 2 of these, which will configure with RDMA connectivity. Put these in the same Availability domain as the **Workers: Operational** and use the same image.

 ## Install AI Blueprints onto existing cluster

@@ -40,6 +89,7 @@ For nodes which were recently deployed by the oci-hpc-oke stack, or if you finis
 - run `kubectl get nodes`
 - run `kubectl describe node <nodename>` on each node until you find the node you want to add which is one of the nodes with RDMA connectivity
 - The private ip appears under the `Name` field of the output of `kubectl get nodes`.
+- Alternatively, find them in the console in "Instances" for your tenancy/region/compartment
 2. Go to the stack and click "Application information". Click the API Url.
 - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
 3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
@@ -62,20 +112,6 @@ For nodes which were recently deployed by the oci-hpc-oke stack, or if you finis
 ```
 8. Repeat steps 5-7 for each node you'd like to add, updating `recipe_node_name` and incrementing `deployment_name` fields for each deployment until you've added all RDMA enabled nodes you'd like to add to the cluster.
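
As a convenience for the node-identification loop in step 1, the inspection can be sketched as below; the grep filter is a heuristic for surfacing RDMA-related labels and resources in the `describe` output, not an official query:

```
# List nodes (the Name column holds the private IP), then inspect candidates
kubectl get nodes -o wide
kubectl describe node <nodename> | grep -i rdma
```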

-## Using RDMA enabled nodes in a recipe
-
-Blueprints supported RDMA shapes:
-- BM.GPU.H100.8
+## Issues

-Additional shapes coming soon.
-
-To use RDMA in a blueprint, the following fields must be added to deploy to the nodes configured in the previous step:
-- `"recipe_use_shared_node_pool": true` -> RDMA is only supported in shared pool mode
-- `"multinode_rdma_enabled_in_shared_pool": true` -> Lets blueprints know that this deployment should use RDMA configurations in the backend
-- `"multinode_num_nodes_to_use_from_shared_pool": 2` -> Number of nodes from RDMA enabled pool to use for this deployment
-
-## Example Recipe
-
-[This recipe](./rdma_distributed_inference.json) performs a multi-node distributed inference deployment of Llama-3.1-405b-Instruct to 2 H100 nodes communicating with RDMA and serves it over a public endpoint as an example.
-
-The `recipe_container_env` has been left in so you can see that the nodes are communicating via RDMA in the pod logs, but this can be removed to minimize bloat in the logs.
+To report an issue with RDMA deployments, please open an issue on GitHub.
