Commit 714ede9

More blueprint category refinement (#98)
* add more blueprint categories and fix links
* fix whisper transcription readme and remove unneeded file
* update the blueprint json schema to match the new blueprint categories
* Offline + Online inference benchmark (#99)
* docs for offline inference
* removed edit line
* online inference readme
* better readme with extra pre-filled samples for offline inference
* added sample json files
* added deployment json files
* addressed PR comments
* changed file names to indicate the workload, addressed comments on the PR for offline inference
* minor edit - offline readme
* Add offline and online inference blueprints and configuration files
  - Introduced new JSON and YAML files for offline inference benchmarks using SGLang and vLLM backends.
  - Added README documentation for both offline and online inference blueprints, detailing usage, supported scenarios, and sample configurations.
  - Removed outdated README files for offline and online inference to streamline documentation.

---------
Co-authored-by: ssraghavan-oci <[email protected]>

* Update Llama Stack documentation and JSON schema
  - Updated README files to reflect the new location of the Llama Stack blueprint under partner blueprints.
  - Added a new JSON configuration file for the Llama Stack deployment, detailing the components and their configurations.
  - Introduced a README for the Llama Stack under partner blueprints, providing an overview, installation notes, and usage instructions.
  - Enhanced the blueprint JSON schema to include a new category for partner blueprints.

---------
Co-authored-by: ssraghavan-oci <[email protected]>
1 parent 76d5853 commit 714ede9

File tree: 92 files changed, +678 / -196 lines


INSTALLING_ONTO_EXISTING_CLUSTER_README.md

Lines changed: 3 additions & 3 deletions
@@ -83,7 +83,7 @@ If you have existing node pools in your original OKE cluster that you'd like Blu
 - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
 3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
 4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box.
-5. Paste in the sample blueprint json found [here](docs/sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json).
+5. Paste in the sample blueprint json found [here](docs/sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json).
 6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above.
 7. Click "POST". This is a fast operation.
 8. Wait about 20 seconds and refresh the page. It should look like:
@@ -108,10 +108,10 @@ If you have existing node pools in your original OKE cluster that you'd like Blu
 - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
 3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
 4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box.
-5. If you added a node from [Step 4](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-4-add-existing-nodes-to-cluster-optional), use the following shared node pool [blueprint](docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json).
+5. If you added a node from [Step 4](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-4-add-existing-nodes-to-cluster-optional), use the following shared node pool [blueprint](docs/sample_blueprints/platform_features/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json).
 - Depending on the node shape, you will need to change:
 `"recipe_node_shape": "BM.GPU.A10.4"` to match your shape.
-6. If you did not add a node, or just want to deploy a fresh node, use the following [blueprint](docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json).
+6. If you did not add a node, or just want to deploy a fresh node, use the following [blueprint](docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json).
 7. Paste the blueprint you selected into context box on the deployment page and click "POST"
 8. To monitor the deployment, go back to "Api Root" and click "deployment_logs".
 - If you are deploying without a shared node pool, it can take 10-30 minutes to bring up a node, depending on shape and whether it is bare-metal or virtual.
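Step 6 in the first hunk above edits the pasted blueprint before posting it. As a minimal sketch (not the full contents of the linked add_node_to_control_plane.json, which carries additional fields), the edit amounts to setting `recipe_node_name` to the worker node's private IP found in step 1; the IP below is a placeholder:

```json
{
  "recipe_node_name": "10.0.10.123"
}
```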

README.md

Lines changed: 10 additions & 10 deletions
@@ -52,16 +52,16 @@ After you install OCI AI Blueprints to an OKE cluster in your tenancy, you can d
 
 | Blueprint | Description |
 | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
-| [**LLM & VLM Inference with vLLM**](docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/README.md) | Deploy Llama 2/3/3.1 7B/8B models using NVIDIA GPU shapes and the vLLM inference engine with auto-scaling. |
-| [**Llama Stack**](docs/sample_blueprints/workload_blueprints/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments with unified API for inference, RAG, and telemetry. |
-| [**Fine-Tuning Benchmarking**](docs/sample_blueprints/workload_blueprints/lora-benchmarking/README.md) | Run MLCommons quantized Llama-2 70B LoRA finetuning on A100 for performance benchmarking. |
-| [**LoRA Fine-Tuning**](docs/sample_blueprints/workload_blueprints/lora-fine-tuning/README.md) | LoRA fine-tuning of custom or HuggingFace models using any dataset. Includes flexible hyperparameter tuning. |
-| [**GPU Performance Benchmarking**](docs/sample_blueprints/workload_blueprints/gpu-health-check/README.md) | Comprehensive evaluation of GPU performance to ensure optimal hardware readiness before initiating any intensive computational workload. |
-| [**CPU Inference**](docs/sample_blueprints/workload_blueprints/cpu-inference/README.md) | Leverage Ollama to test CPU-based inference with models like Mistral, Gemma, and more. |
-| [**Multi-node Inference with RDMA and vLLM**](docs/sample_blueprints/workload_blueprints/multi-node-inference/README.md) | Deploy Llama-405B sized LLMs across multiple nodes with RDMA using H100 nodes with vLLM and LeaderWorkerSet. |
-| [**Autoscaling Inference with vLLM**](docs/sample_blueprints/platform_feature_blueprints/auto_scaling/README.md) | Serve LLMs with auto-scaling using KEDA, which scales to multiple GPUs and nodes using application metrics like inference latency. |
-| [**LLM Inference with MIG**](docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/README.md) | Deploy LLMs to a fraction of a GPU with Nvidia’s multi-instance GPUs and serve them with vLLM. |
-| [**Job Queuing**](docs/sample_blueprints/platform_feature_blueprints/teams/README.md) | Take advantage of job queuing and enforce resource quotas and fair sharing between teams. |
+| [**LLM & VLM Inference with vLLM**](docs/sample_blueprints/model_serving/llm_inference_with_vllm/README.md) | Deploy Llama 2/3/3.1 7B/8B models using NVIDIA GPU shapes and the vLLM inference engine with auto-scaling. |
+| [**Llama Stack**](docs/sample_blueprints/partner_blueprints/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments with unified API for inference, RAG, and telemetry. |
+| [**Fine-Tuning Benchmarking**](docs/sample_blueprints/gpu_benchmarking/lora-benchmarking/README.md) | Run MLCommons quantized Llama-2 70B LoRA finetuning on A100 for performance benchmarking. |
+| [**LoRA Fine-Tuning**](docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/README.md) | LoRA fine-tuning of custom or HuggingFace models using any dataset. Includes flexible hyperparameter tuning. |
+| [**GPU Performance Benchmarking**](docs/sample_blueprints/gpu_health_check/gpu-health-check/README.md) | Comprehensive evaluation of GPU performance to ensure optimal hardware readiness before initiating any intensive computational workload. |
+| [**CPU Inference**](docs/sample_blueprints/model_serving/cpu-inference/README.md) | Leverage Ollama to test CPU-based inference with models like Mistral, Gemma, and more. |
+| [**Multi-node Inference with RDMA and vLLM**](docs/sample_blueprints/model_serving/multi-node-inference/README.md) | Deploy Llama-405B sized LLMs across multiple nodes with RDMA using H100 nodes with vLLM and LeaderWorkerSet. |
+| [**Autoscaling Inference with vLLM**](docs/sample_blueprints/model_serving/auto_scaling/README.md) | Serve LLMs with auto-scaling using KEDA, which scales to multiple GPUs and nodes using application metrics like inference latency. |
+| [**LLM Inference with MIG**](docs/sample_blueprints/model_serving/mig_multi_instance_gpu/README.md) | Deploy LLMs to a fraction of a GPU with Nvidia’s multi-instance GPUs and serve them with vLLM. |
+| [**Job Queuing**](docs/sample_blueprints/platform_features/teams/README.md) | Take advantage of job queuing and enforce resource quotas and fair sharing between teams. |
 
 ## Support & Contact
 
docs/about.md

Lines changed: 5 additions & 5 deletions
@@ -36,8 +36,8 @@
 | ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
 | **Customize Blueprints** | Tailor existing OCI AI Blueprints to suit your exact AI workload needs—everything from hyperparameters to node counts and hardware. | [Read More](custom_blueprints/README.md) |
 | **Updating OCI AI Blueprints** | Keep your OCI AI Blueprints environment current with the latest control plane and portal updates. | [Read More](../INSTALLING_ONTO_EXISTING_CLUSTER_README.md) |
-| **Shared Node Pool** | Use longer-lived resources (e.g., bare metal nodes) across multiple blueprints or to persist resources after a blueprint is undeployed. | [Read More](sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md) |
-| **Auto-Scaling** | Automatically adjust resource usage based on infrastructure or application-level metrics to optimize performance and costs. | [Read More](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md) |
+| **Shared Node Pool** | Use longer-lived resources (e.g., bare metal nodes) across multiple blueprints or to persist resources after a blueprint is undeployed. | [Read More](sample_blueprints/platform_features/shared_node_pools/README.md) |
+| **Auto-Scaling** | Automatically adjust resource usage based on infrastructure or application-level metrics to optimize performance and costs. | [Read More](sample_blueprints/model_serving/auto_scaling/README.md) |
 
 ---
 
@@ -76,13 +76,13 @@ A:
 A: Deploy a vLLM blueprint, then use a tool like LLMPerf to run benchmarking against your inference endpoint. Contact us for more details.
 
 **Q: Where can I see the full list of blueprints?**
-A: All available blueprints are listed [here](sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/README.md). If you need something custom, please let us know.
+A: All available blueprints are listed [here](sample_blueprints/other/exisiting_cluster_installation/README.md). If you need something custom, please let us know.
 
 **Q: How do I check logs for troubleshooting?**
 A: Use `kubectl` to inspect pod logs in your OKE cluster.
 
 **Q: Does OCI AI Blueprints support auto-scaling?**
-A: Yes, we leverage KEDA for application-driven auto-scaling. See [documentation](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md).
+A: Yes, we leverage KEDA for application-driven auto-scaling. See [documentation](sample_blueprints/model_serving/auto_scaling/README.md).
 
 **Q: Which GPUs are compatible?**
 A: Any NVIDIA GPUs available in your OCI region (A10, A100, H100, etc.).
@@ -91,4 +91,4 @@ A: Any NVIDIA GPUs available in your OCI region (A10, A100, H100, etc.).
 A: Yes, though testing on clusters running other workloads is ongoing. We recommend a clean cluster for best stability.
 
 **Q: How do I run multiple blueprints on the same node?**
-A: Enable shared node pools. [Read more here](sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md).
+A: Enable shared node pools. [Read more here](sample_blueprints/platform_features/shared_node_pools/README.md).

docs/api_documentation.md

Lines changed: 10 additions & 10 deletions
@@ -36,11 +36,11 @@
 | recipe_container_env | string | No | Values of the recipe container init arguments. See the Blueprint Arguments section below for details. Example: `[{"key": "tensor_parallel_size","value": "2"},{"key": "model_name","value": "NousResearch/Meta-Llama-3.1-8B-Instruct"},{"key": "Model_Path","value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"}]` |
 | skip_capacity_validation | boolean | No | Determines whether validation checks on shape capacity are performed before initiating deployment. If your deployment is failing validation due to capacity errors but you believe this not to be true, you should set `skip_capacity_validation` to be `true` in the recipe JSON to bypass all checks for Shape capacity. |
 
-For autoscaling parameters, visit [autoscaling](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md).
+For autoscaling parameters, visit [autoscaling](sample_blueprints/model_serving/auto_scaling/README.md).
 
-For multinode inference parameters, visit [multinode inference](sample_blueprints/workload_blueprints/multi-node-inference/README.md)
+For multinode inference parameters, visit [multinode inference](sample_blueprints/model_serving/multi-node-inference/README.md)
 
-For MIG parameters, visit [MIG shared pool configurations](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json), [update MIG configuration](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json), and [MIG recipe configuration](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json).
+For MIG parameters, visit [MIG shared pool configurations](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), [update MIG configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), and [MIG recipe configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json).
 
 ### Blueprint Container Arguments
 
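The two parameters described in the hunk above are supplied in the deployment JSON alongside the rest of the blueprint fields. A minimal illustrative fragment, reusing the example values from the table (the surrounding blueprint fields are omitted here, and whether `recipe_container_env` is passed as a JSON array or as an escaped string should be confirmed against the linked sample blueprints; the array form below is an assumption):

```json
{
  "recipe_container_env": [
    { "key": "tensor_parallel_size", "value": "2" },
    { "key": "model_name", "value": "NousResearch/Meta-Llama-3.1-8B-Instruct" },
    { "key": "Model_Path", "value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct" }
  ],
  "skip_capacity_validation": true
}
```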
@@ -94,13 +94,13 @@ This recipe deploys the vLLM container image. Follow the vLLM docs to pass the c
 There are 3 blueprints that we are providing out of the box. Following are example recipe.json snippets that you can use to deploy the blueprints quickly for a test run.
 |Blueprint|Scenario|Sample JSON|
 |----|----|----
-|LLM Inference using NVIDIA shapes and vLLM|Deployment with default Llama-3.1-8B model using PAR|View sample JSON here [here](sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json)
-|MLCommons Llama-2 Quantized 70B LORA Fine-Tuning on A100|Default deployment with model and dataset ingested using PAR|View sample JSON here [here](sample_blueprints/workload_blueprints/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json)
-|LORA Fine-Tune Blueprint|Open Access Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/open_model_open_dataset_hf.backend.json)
-|LORA Fine-Tune Blueprint|Closed Access Model Open Access Dataset Download from Huggingface (Valid Auth Token Is Required!!)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/closed_model_open_dataset_hf.backend.json)
-|LORA Fine-Tune Blueprint|Bucket Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json)
-|LORA Fine-Tune Blueprint|Get Model from Bucket in Another Region / Tenancy using Pre-Authenticated_Requests (PAR) Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_model_open_dataset.backend.json)
-|LORA Fine-Tune Blueprint|Bucket Model Bucket Checkpoint Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json)
+|LLM Inference using NVIDIA shapes and vLLM|Deployment with default Llama-3.1-8B model using PAR|View sample JSON here [here](sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json)
+|MLCommons Llama-2 Quantized 70B LORA Fine-Tuning on A100|Default deployment with model and dataset ingested using PAR|View sample JSON here [here](sample_blueprints/gpu_benchmarking/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json)
+|LORA Fine-Tune Blueprint|Open Access Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/open_model_open_dataset_hf.backend.json)
+|LORA Fine-Tune Blueprint|Closed Access Model Open Access Dataset Download from Huggingface (Valid Auth Token Is Required!!)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/closed_model_open_dataset_hf.backend.json)
+|LORA Fine-Tune Blueprint|Bucket Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json)
+|LORA Fine-Tune Blueprint|Get Model from Bucket in Another Region / Tenancy using Pre-Authenticated_Requests (PAR) Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_model_open_dataset.backend.json)
+|LORA Fine-Tune Blueprint|Bucket Model Bucket Checkpoint Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json)
 
 ## Undeploy a Blueprint
 
docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ If you have existing node pools in your original OKE cluster that you'd like Blu
 2. Go to the stack and click "Application information". Click the API Url.
 3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
 4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box.
-5. Paste in the sample blueprint json found [here](../../sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json).
+5. Paste in the sample blueprint json found [here](../../sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json).
 6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above.
 7. Click "POST". This is a fast operation.
 8. Wait about 20 seconds and refresh the page. It should look like:
