Adding AMD plugin and metrics tracking for MI300x. (#101)
* Adding AMD plugin and metrics tracking to support MI300x.
- Added example blueprint of MI300x with Llama4 Maverick.
- Added example blueprint of MI300x shared node pool.
- Updated API documentation to include local_filesystem and input_file_system
- Added MI300x specs to RDMA table with link to HPC image
* Renamed local_directory_path to node_directory_path.
* Added AMD metrics exporter version to software versions, and added the bring your own pattern.
* Cleanup.
| input_object_storage | object | Yes | Name of bucket to mount at location “mount_location”. Mount size will be `volume_size_in_gbs`. Will copy all objects in bucket to mount location. Store your LLM model (and in the case of fine-tuning blueprints, your input dataset as well) in this bucket. Example: `[{"bucket_name": "corrino_hf_oss_models", "mount_location": "/models", "volume_size_in_gbs": 500}]`|
| output_object_storage | object | No | Required for fine-tuning deployments. Name of bucket to mount at location `mount_location`. Mount size will be `volume_size_in_gbs`. Will copy all items written here during program runtime to bucket on program completion. Example: `[{"bucket_name": "output", "mount_location": "/output", "volume_size_in_gbs": 500}]`|
+| input_file_system | object | No | Required for shared storage; serves as both input and output storage. The OCI File System OCID and Mount Target OCID are used to mount the file system at `mount_location`. Mount size will be `volume_size_in_gbs`. This works as an NFS mount, so any data written persists to file storage. Example: `[{"file_system_ocid": "ocid..._", "mount_target_ocid": "ocid...", "mount_location": "/models", "volume_size_in_gbs": 500}]`|
+| local_filesystem | object | No | Local filesystem path to mount into the container. The path is read/write and local to the node the container runs on. Written data persists on the node and is subject to the node's available storage. Example: `[{"mount_location": "/models", "node_directory_path": "/mnt/nvme/models"}]`|
| recipe_image_uri | string | Yes | Location of the recipe container image. Each recipe points to a specific container image. See the recipe.json examples below. Example: `iduyx1qnmway/corrino-devops-repository:vllmv0901`|
| recipe_container_command_args | string | No | Container init arguments to pass. Each recipe has specific container arguments that it expects. See the Blueprint Arguments section below for details. Example: `["--model","$(Model_Path)","--tensor-parallel-size","$(tensor_parallel_size)"]`|
| recipe_container_env | string | No | Values of the recipe container init arguments. See the Blueprint Arguments section below for details. Example: `[{"key": "tensor_parallel_size","value": "2"},{"key": "model_name","value": "NousResearch/Meta-Llama-3.1-8B-Instruct"},{"key": "Model_Path","value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"}]`|
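The `recipe_container_command_args` and `recipe_container_env` examples in the table appear to work together: each `$(name)` placeholder in the arguments matches a `key` in the environment list. The Python sketch below illustrates that relationship, assuming each placeholder is replaced by the matching `value`. The helper `expand_args` is hypothetical and not part of the blueprint tooling; however the platform actually performs the substitution, it is not done by code like this.

```python
import re

# Hypothetical helper (not part of the blueprint tooling): replaces $(name)
# placeholders in each argument with the value of the matching env entry.
def expand_args(args, env):
    values = {item["key"]: item["value"] for item in env}
    pattern = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")
    # Assumption: placeholders with no matching key are left unchanged.
    return [
        pattern.sub(lambda m: values.get(m.group(1), m.group(0)), arg)
        for arg in args
    ]

# Values copied from the examples in the table above.
args = ["--model", "$(Model_Path)", "--tensor-parallel-size", "$(tensor_parallel_size)"]
env = [
    {"key": "tensor_parallel_size", "value": "2"},
    {"key": "model_name", "value": "NousResearch/Meta-Llama-3.1-8B-Instruct"},
    {"key": "Model_Path", "value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"},
]

expanded = expand_args(args, env)
print(expanded)
```

Under this assumption, `$(Model_Path)` resolves to the `/models/...` path and `$(tensor_parallel_size)` to `2`, while `model_name` goes unused because no argument references it.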
docs/custom_blueprints/blueprint_json_schema.json
31 additions & 0 deletions
@@ -434,6 +434,37 @@
       }
     }
   },
+  "local_filesystem": {
+    "type": "array",
+    "description": "Local filesystem path to mount to container. This will be read / write path, and is local to the node the container runs on. Any written data will persist to node, and will be subject to available storage on node.",
+    "items": {
+      "additionalProperties": false,
+      "required": [
+        "node_directory_path",
+        "mount_location"
+      ],
+      "properties": {
+        "node_directory_path": {
+          "type": "string",
+          "description": "The actual directory path on the node to mount to the container.",
+          "examples": ["/mnt/nvme/models"]
+        },
+        "mount_location": {
+          "type": "string",
+          "description": "The mount location in the container.",
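The visible part of the `local_filesystem` schema hunk requires two string properties per item and forbids any others. As a rough illustration, the Python sketch below applies those same rules to a blueprint fragment by hand. It mirrors only the lines shown above (the hunk is truncated, so the full schema may add constraints), and the function name `check_local_filesystem` is hypothetical.

```python
# Assumption: only the two properties visible in the truncated hunk exist.
REQUIRED_KEYS = {"node_directory_path", "mount_location"}

def check_local_filesystem(entries):
    """Return a list of problems; an empty list means the entries look valid."""
    if not isinstance(entries, list):
        return ["local_filesystem must be an array"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"item {i}: must be an object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        extra = entry.keys() - REQUIRED_KEYS  # "additionalProperties": false
        if missing:
            problems.append(f"item {i}: missing {sorted(missing)}")
        if extra:
            problems.append(f"item {i}: unexpected {sorted(extra)}")
        for key in sorted(REQUIRED_KEYS & entry.keys()):
            if not isinstance(entry[key], str):
                problems.append(f"item {i}: {key} must be a string")
    return problems

# Example value taken from the API documentation row for local_filesystem.
good = [{"mount_location": "/models", "node_directory_path": "/mnt/nvme/models"}]
print(check_local_filesystem(good))
```

A full JSON Schema validator would be the more robust choice in practice; the hand-written check is used here only to keep the sketch dependency-free.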
docs/sample_blueprints/other/using_rdma_enabled_node_pools/README.md
7 additions & 0 deletions
@@ -32,6 +32,7 @@ RDMA is currently supported for:
 - BM.GPU.H200.8
 - BM.GPU.B200.8
 - BM.GPU.B4.8
+- BM.GPU.MI300X.8

 Additional shape support is coming soon.

@@ -69,6 +70,7 @@ One of the images in the table below must be imported into your tenancy in the c
 - Once the image is done importing (30 minutes to an hour), it will be usable during cluster deployment
 - To use the image in recipes, you will need to retrieve the image OCID

+**Note**: Clicking any of the links below will download a large image file to your computer (~20GB). It is best to copy the link and paste it directly into the console when importing the custom image.

 **Note**: B200 requires Driver version 570 and CUDA >= 12.8. Ensure correct PAR for compatibility with B200.

@@ -81,6 +83,11 @@ One of the images in the table below must be imported into your tenancy in the c
[This doc](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/custom-images-import.htm#listing-custom-images) provides complete details for all image importing options.
docs/sample_blueprints/platform_features/shared_node_pools/README.md
8 additions & 0 deletions
@@ -14,6 +14,14 @@ Shared node pools are compatible with any blueprint and support all OCI compute
 1. Specifying the Availability Domain of the instance type
 2. Specifying the custom image OCID to use for the node

+**Note**: Clicking the link in the table below will download a large image file to your computer (~20GB). It is best to copy the link and paste it in your console to import the image as described in [this document section](../../other/using_rdma_enabled_node_pools/README.md).