
Commit 5052391

Updated documentation for deploying RDMA-enabled shared node pools. (#60)

- BUGFIX: Fixed a timing issue related to webhook deployment.
- BUGFIX: ADB deployment was breaking because OCPUs are deprecated for ADB; changed to ECPUs.
1 parent c80e047 commit 5052391

14 files changed (+147, -64 lines)


docs/api_documentation/README.md

Lines changed: 7 additions & 2 deletions
@@ -11,9 +11,14 @@
 | recipe_id | string | Yes | One of the following: `llm_inference_nvidia`, `lora_finetune_nvidia`, or `mlcommons_lora_finetune_nvidia` |
 | deployment_name | string | Yes | Any deployment name to identify the deployment details easily. Must be unique from other recipe deployments. |
 | recipe_mode | string | Yes | One of the following: `service`, `job`, `update`, or `shared_node_pool`. Enter `service` for inference recipe deployments, `job` for fine-tuning recipe deployments, `update` for updating existing deployments (currently only supported for MIG), and `shared_node_pool` for creating a shared node pool. |
-| recipe_node_labels | object | No | Additional labels to apply to a node pool in the form `{"label": "value"}` |
+| recipe_node_labels | object[string][string] | No | Additional labels to apply to a node pool in the form `{"label": "value"}` |
 | service_endpoint_domain | string | No | Required for inference recipe deployments. Inference endpoint will point to this domain. |
-| recipe_max_pods_per_node | int | No | Allow a node to schedule more pods than the Kubernetes default of 31. Required for certain MIG configurations, which can slice up to 56 times. |
+| recipe_max_pods_per_node | int | No | Allow a node to schedule more pods than the Kubernetes default of 31. Required for certain MIG configurations, which can slice up to 56 times. |
+| recipe_availability_domain | string | No | Required for RDMA-enabled shared node pool deployments. Optional for shared node pool (non-RDMA) and recipe deployments. |
+| recipe_public_ssh_key | string | No | Optionally adds an SSH key to RDMA-enabled node pools for connectivity via SSH. |
+| recipe_node_image_ocid | string | No | Required for RDMA-enabled shared node pool deployments. Optional for shared node pool (non-RDMA) and recipe deployments. |
+| recipe_container_memory_size | int | No | Memory in GB that the recipe must have to schedule. This is both the amount of memory a node must have available to schedule the recipe and an upper bound for the container. |
+| recipe_container_cpu_count | int | No | Number of CPUs that the recipe must have to schedule. This is both the number of CPU cores a node must have available to schedule the recipe and an upper bound for the container. |
 | recipe_container_port | string | No | Required for inference recipe deployments. Inference endpoint will point to this port. |
 | recipe_node_shape | string | Yes | Enter the shape of the node that you want to deploy the recipe onto. Example: `BM.GPU.A10.4` |
 | recipe_node_pool_size | int | Yes | Number of nodes that you want to allocate for this recipe deployment. Ensure you have sufficient capacity. This feature is under development. Always enter 1. |
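
For orientation, here is a hypothetical `/deployment` payload exercising several of the parameters above, including the new container sizing fields. All values are illustrative placeholders, not a validated configuration; consult the table for which fields your recipe actually requires.

```
{
  "recipe_id": "llm_inference_nvidia",
  "recipe_mode": "service",
  "deployment_name": "inference-example",
  "recipe_node_shape": "BM.GPU.A10.4",
  "recipe_node_pool_size": 1,
  "recipe_container_port": "8000",
  "recipe_container_memory_size": 100,
  "recipe_container_cpu_count": 30,
  "recipe_node_labels": {"team": "ml-example"}
}
```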

docs/custom_blueprints/blueprint_json_schema.json

Lines changed: 12 additions & 0 deletions
@@ -55,6 +55,12 @@
     "recipe_max_pods_per_node": {
       "type": "integer"
     },
+    "recipe_availability_domain": {
+      "type": "string"
+    },
+    "recipe_public_ssh_key": {
+      "type": "string"
+    },
     "recipe_node_pool_size": {
       "type": "integer"
     },
@@ -70,6 +76,9 @@
     "recipe_node_selector_arch": {
       "type": "string"
     },
+    "recipe_node_image_ocid": {
+      "type": "string"
+    },
     "recipe_flex_shape_ocpu_count": {
       "type": "integer"
     },
@@ -242,6 +251,9 @@
     "recipe_container_memory_size": {
       "type": "integer"
     },
+    "recipe_container_cpu_count": {
+      "type": "integer"
+    },
     "input_object_storage": {
       "type": "array",
       "items": {

docs/iam_policies/README.md

Lines changed: 2 additions & 0 deletions
@@ -96,4 +96,6 @@ Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to read virtual-netw
 Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to inspect compartments in compartment {compartment_name}

 Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to manage cluster-node-pools in compartment {compartment_name}
+
+Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to {CLUSTER_JOIN} in compartment {compartment_name}
 ```

docs/multi_node_inference/README.md

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ Use multi-node inference whenever you are trying to use a very large model that

 ## RDMA + Multinode Inference

-Want to use RDMA with multinode inference? [See here for details](../using_rdma_enabled_node_pools)
+Want to use RDMA with multinode inference? [See here for details](../using_rdma_enabled_node_pools/README.md)

 ## How to use it?

@@ -72,7 +72,7 @@ The following parameters are required:

 - `multinode_num_nodes_to_use_from_shared_pool` -> the total number of nodes (as an integer) you want to use to serve this model. This number must be less than the size of the shared node pool, and will only use schedulable nodes in the pool.

-- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": "true"` -> If you have deployed an HPC cluster with RDMA enabled for node pools - [see here for details](../deploy_ai_blueprints_onto_hpc_cluster) - enable RDMA communication between nodes (currently only supported for BM.GPU.H100.8). This will fail validation if RDMA is not supported for shape type, or node is missing appropriate labels described in linked doc.
+- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": true` -> If you have provisioned RDMA-enabled shared node pools in your cluster, this enables RDMA communication between nodes. It will fail validation if RDMA is not supported for the shape type, or if a node is missing the appropriate labels described in the [linked doc](../using_rdma_enabled_node_pools/README.md).

 - [OPTIONAL] `recipe_readiness_probe_params` -> Readiness probe to ensure that the service is ready to serve requests. Parameter details can be found [here](../startup_liveness_readiness_probes/README.md).

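The key correction in this hunk is that the flag is now a real JSON boolean rather than the string `"true"`. A minimal, hypothetical fragment of a multi-node deployment with RDMA enabled (all other required fields omitted) would therefore read:

```
{
  "recipe_use_shared_node_pool": true,
  "multinode_num_nodes_to_use_from_shared_pool": 2,
  "multinode_rdma_enabled_in_shared_pool": true
}
```
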
docs/using_rdma_enabled_node_pools/README.md

Lines changed: 72 additions & 36 deletions
@@ -1,33 +1,82 @@
 # Using RDMA Enabled Node Pools

-Currently, AI Blueprints does not have the ability to provision node pools with RDMA configured. However, it can use node pools which were configured previously. Blueprints support for deploying node pools with RDMA configured is coming soon.
+Remote Direct Memory Access (RDMA) is a protocol that lets one node read from or write to the memory of another node without involving either machine's CPU or operating system, enabling true zero-copy data transfers and dramatically reducing latency and CPU overhead. In large-scale AI workloads such as multi-node training with AllReduce or disaggregated LLM inference, RDMA can yield tremendous performance gains by significantly reducing communication and copy overhead between nodes.

-If you already have a cluster with RDMA enabled node pools, jump to [install AI Blueprints onto an existing cluster](./README.md#install-ai-blueprints-onto-existing-cluster).
+Blueprints uses [OCI cluster networks with instance pools](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/managingclusternetworks.htm) to provision RDMA-enabled node pools.

-Additionally, RDMA is currently only supported for H100 shapes, but A100, H200, and B200 shapes are being added in short order.
+Note: stick to our supported shapes, as these have been validated with Blueprints. To request additional shape support, open an issue on GitHub.

-## Optional - Deploy an HPC Cluster with the OCI-OKE-HPC Quickstart
+RDMA is currently supported for:

-[The oci-hpc-oke quickstart](https://github.com/oracle-quickstart/oci-hpc-oke) provides a straightforward way to deploy RDMA enabled node pools into an OKE cluster. Follow that quickstart with these helpful tips to deploy an OKE cluster with an RDMA enabled node pool.
+- BM.GPU.H100.8
+- BM.GPU.H200.8
+- BM.GPU.B4.8

-If you do not use this method, you will need to bring your own cluster with a node pool with RDMA connectivity to use recipes with RDMA enabled.
+Additional shape support is coming soon.

-Tips to look out for in the oci-hpc-oke stack:
+## Provision RDMA-enabled shared node pools with Blueprints

-**Tip 1**: The main GitHub readme for that repository provides PAR links to images required for GPU nodes with RDMA connectivity.
-- Right click on the appropriate combination (IE GPU driver 560 & CUDA 12.6) and copy link to get the PAR.
-- Go to the tenancy + region in which you'd like to import the image to be used during the quickstart deployment. Follow [this doc](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/custom-images-import.htm#listing-custom-images) to import the custom image into that tenancy / region.
-- Once the image is done importing, it will be usable during cluster deployment.
+This section describes the steps to provision RDMA-enabled shared node pools.
+
+If you already have a cluster with RDMA enabled node pools, for example [from this quickstart](https://github.com/oracle-quickstart/oci-hpc-oke), jump to [install AI Blueprints onto an existing cluster](./README.md#install-ai-blueprints-onto-existing-cluster).
+
+If not, proceed below.
+
+## Required policies
+
+The specific policy required for RDMA-enabled shared node pools is:
+```
+Allow dynamic-group 'IdentityDomainName'/'DynamicGroupName' to {CLUSTER_JOIN} in compartment {compartment_name}
+```
+
+The fine-grained policy list for blueprints can be found [here](../iam_policies/README.md).
+
+## Import a custom image
+
+The [oci-hpc-oke quickstart](https://github.com/oracle-quickstart/oci-hpc-oke?tab=readme-ov-file#images-to-use) provides node images with the proper drivers and libraries for RDMA connectivity between nodes, for various GPU driver / CUDA toolkit versions. One of these images must be imported into your tenancy in the correct region (and possibly compartment, depending on policies) to provision RDMA-enabled shared node pools.
+
+- Right click on the appropriate combination (e.g., GPU driver 560 & CUDA 12.6) and copy the link to get the PAR
+- Log in to the tenancy + region in which you'd like to import the image
+- In the console, click the hamburger in the top left -> Compute -> Instances -> Custom Images
+- Go to the compartment in which you'd like to import the image, then click "Import image"
+- Set the compartment in "Create in compartment", name the image, then ensure the OS is set to "Ubuntu", as these are Ubuntu images
+- Click the radio button "Import from an Object Storage URL"
+- Paste the PAR URL retrieved above into the Object Storage URL box
+- For image type, select "OCI"
+- Add any tags you'd like, then click "Import image" at the bottom
+- Once the image is done importing (30 minutes to an hour), it will be usable during cluster deployment
+- To use the image in recipes, you will need to retrieve the image OCID
+
+[This doc](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/custom-images-import.htm#listing-custom-images) provides complete details for all image importing options.
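
If you prefer the CLI to the console, a rough equivalent of the import flow above is sketched below. This is an untested outline: the display name and compartment OCID are placeholders, and the full flag set is documented under `oci compute image import from-object-uri --help`.

```
# Import a custom image from the quickstart PAR URL (values are placeholders)
oci compute image import from-object-uri \
  --uri "<PAR_URL_COPIED_FROM_QUICKSTART>" \
  --compartment-id "ocid1.compartment.oc1..example" \
  --display-name "oke-rdma-ubuntu-gpu560-cuda126" \
  --operating-system "Ubuntu"

# After the import completes, list images to retrieve the image OCID for recipes
oci compute image list \
  --compartment-id "ocid1.compartment.oc1..example" \
  --display-name "oke-rdma-ubuntu-gpu560-cuda126"
```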
+
+## Deploying an RDMA-enabled shared node pool
+
+Once the image has been imported, it is possible to deploy a shared node pool with RDMA connectivity via AI Blueprints.
+
+In addition to the parameters described in [the shared node pool doc](../shared_node_pools/README.md#without-selector), the following additional parameters are required:
+- `"recipe_availability_domain": "<FULL AD NAME>"` -> the full name of the availability domain where you have capacity for the nodes. Examples: `"TrcQ:AP-MELBOURNE-1-AD-1"`, `"TrcQ:EU-FRANKFURT-1-AD-3"`. These can generally be found in the console via Hamburger (top left) -> Governance & Administration -> Tenancy Management -> Limits, Quotas and Usage
+
+- `"recipe_node_image_ocid": "<image ocid>"` -> the OCID of the custom image you imported
+
+- `"multinode_rdma_enabled_in_shared_pool": true` -> boolean telling Blueprints to set up RDMA. **Important** - if this is left off, Blueprints will provision the specified shape as a shared node pool without RDMA connectivity, and this cannot be undone except by deleting and recreating the pool.
+
+- `"shared_node_pool_size": >1` -> this must be a number greater than 1, as RDMA is fundamentally **inter-node** connectivity.
+
+See this [example blueprint](./rdma_shared_node_pool.json).
+
+Populate the example with the correct shape, AD, and image OCID, and paste it into the `/deployment` API endpoint to deploy a 2-node RDMA-enabled pool which can be used for downstream blueprints.
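
The linked file is not reproduced in this diff, so as a sketch only, a filled-in blueprint built from the parameters above (with placeholder AD and OCID values) might look like this; the `rdma_shared_node_pool.json` in the repository is authoritative:

```
{
  "recipe_mode": "shared_node_pool",
  "deployment_name": "h100-rdma-pool",
  "recipe_node_shape": "BM.GPU.H100.8",
  "shared_node_pool_size": 2,
  "recipe_availability_domain": "TrcQ:EU-FRANKFURT-1-AD-3",
  "recipe_node_image_ocid": "ocid1.image.oc1.eu-frankfurt-1.example",
  "multinode_rdma_enabled_in_shared_pool": true
}
```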
+
+## Using RDMA-enabled nodes in a blueprint
+
+To use RDMA in a blueprint, the following fields must be added to deploy to the nodes configured in the previous step:
+- `"recipe_use_shared_node_pool": true` -> RDMA is only supported in shared pool mode
+- `"multinode_rdma_enabled_in_shared_pool": true` -> lets Blueprints know that this deployment should use RDMA configurations in the backend
+- `"multinode_num_nodes_to_use_from_shared_pool": 2` -> number of nodes from the RDMA-enabled pool to use for this deployment
+
+[This blueprint](./rdma_distributed_inference.json) performs, as an example, a multi-node distributed inference deployment of Llama-3.1-405b-Instruct to 2 H100 nodes communicating via RDMA and serves it over a public endpoint. 405b with fp16 was selected because the weights are too large to load onto a single BM.GPU.H100.8: loading them takes around 900GB of GPU vRAM.
+
+The `recipe_container_env` has been left in so you can see in the pod logs that the nodes are communicating via RDMA, but it can be removed to minimize bloat in the logs.
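
To spot that RDMA traffic in practice, a quick check along the following lines can help. The pod name is a placeholder (find it via `kubectl get pods`), and the grep pattern is just a convenience filter for RDMA/NCCL transport messages, not an official marker:

```
# List pods for the deployment, then scan a pod's logs for RDMA/NCCL lines
kubectl get pods
kubectl logs <pod-name> | grep -i -E "rdma|nccl"
```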

-**Tip 2**: During deployment of the HPC cluster, here is a description of the fields - Pay special attention to **Workers: Operational** and **Workers: GPU+RDMA**:
-- Create policies: Create the policies needed by the cluster. If this is unchecked, this must be done manually
-- Network: Creates the VCN for the cluster with appropriate security rules
-- Bastion & Operator: Gives ssh access to worker nodes and an internal operator to access kubernetes cluster on the operator node
-- OKE Cluster: The configuration of the OKE cluster nodes operating the control plane and pods for cluster management - should be CPU nodes
-- **Workers: Operational [REQUIRED]**: IMPORTANT - this is a common "pitfall". If enabling an RDMA node pool in the section below, put these in the same availability domain as the **Workers: GPU + RDMA**. You can use CPUs for this, such as the default VM.Standard.E5.Flex, but **the image should be the same as the one you use for the GPU+RDMA**, as these nodes will be used to check RDMA health, so they need the software stack.
-- Workers: CPU [OPTIONAL]: Non-RDMA CPU worker nodes to stand up with cluster (leave off as AI Blueprints can provision these if required)
-- Workers: GPU [OPTIONAL]: Non-RDMA GPU worker nodes to stand up with cluster (leave off as AI Blueprints can provision these if required)
-- **Workers: GPU+RDMA [REQUIRED]**: If you desire RDMA nodes, provision at least 2 of these, which will configure with RDMA connectivity. Put these in the same Availability domain as the **Workers: Operational** and use the same image.

 ## Install AI Blueprints onto existing cluster

@@ -40,6 +89,7 @@ For nodes which were recently deployed by the oci-hpc-oke stack, or if you finis
 - run `kubectl get nodes`
 - run `kubectl describe node <nodename>` on each node until you find the node you want to add which is one of the nodes with RDMA connectivity
 - The private ip appears under the `Name` field of the output of `kubectl get nodes`.
+- Alternatively, find them in the console in "Instances" for your tenancy/region/compartment
 2. Go to the stack and click "Application information". Click the API Url.
 - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
 3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
@@ -62,20 +112,6 @@ For nodes which were recently deployed by the oci-hpc-oke stack, or if you finis
 ```
 8. Repeat steps 5-7 for each node you'd like to add, updating `recipe_node_name` and incrementing `deployment_name` fields for each deployment until you've added all RDMA enabled nodes you'd like to add to the cluster.
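
As a convenience for the node-identification loop in step 1, the inspection can be sketched as below; the grep filter is a heuristic for surfacing RDMA-related labels and resources in the `describe` output, not an official query:

```
# List nodes (the Name column holds the private IP), then inspect candidates
kubectl get nodes -o wide
kubectl describe node <nodename> | grep -i rdma
```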

-## Using RDMA enabled nodes in a recipe
-
-Blueprints supported RDMA shapes:
-- BM.GPU.H100.8
+## Issues

-Additional shapes coming soon.
-
-To use RDMA in a blueprint, the following fields must be added to deploy to the nodes configured in the previous step:
-- `"recipe_use_shared_node_pool": true` -> RDMA is only supported in shared pool mode
-- `"multinode_rdma_enabled_in_shared_pool": true` -> Lets blueprints know that this deployment should use RDMA configurations in the backend
-- `"multinode_num_nodes_to_use_from_shared_pool": 2` -> Number of nodes from RDMA enabled pool to use for this deployment
-
-## Example Recipe
-
-[This recipe](./rdma_distributed_inference.json) performs a multi-node distributed inference deployment of Llama-3.1-405b-Instruct to 2 H100 nodes communicating with RDMA and serves it over a public endpoint as an example.
-
-The `recipe_container_env` has been left in so you can see that the nodes are communicating via RDMA in the pod logs, but this can be removed to minimize bloat in the logs.
+To report an issue with RDMA deployments, please open an issue on GitHub.
