diff --git a/INSTALLING_ONTO_EXISTING_CLUSTER_README.md b/INSTALLING_ONTO_EXISTING_CLUSTER_README.md index e89158e..3c26107 100644 --- a/INSTALLING_ONTO_EXISTING_CLUSTER_README.md +++ b/INSTALLING_ONTO_EXISTING_CLUSTER_README.md @@ -83,7 +83,7 @@ If you have existing node pools in your original OKE cluster that you'd like Blu - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side. 3. Login with the `Admin Username` and `Admin Password` in the Application information tab. 4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box. -5. Paste in the sample blueprint json found [here](docs/sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json). +5. Paste in the sample blueprint json found [here](docs/sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json). 6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above. 7. Click "POST". This is a fast operation. 8. Wait about 20 seconds and refresh the page. It should look like: @@ -108,10 +108,10 @@ If you have existing node pools in your original OKE cluster that you'd like Blu - If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side. 3. Login with the `Admin Username` and `Admin Password` in the Application information tab. 4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box. -5. If you added a node from [Step 4](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-4-add-existing-nodes-to-cluster-optional), use the following shared node pool [blueprint](docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json). +5. If you added a node from [Step 4](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-4-add-existing-nodes-to-cluster-optional), use the following shared node pool [blueprint](docs/sample_blueprints/platform_features/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json). - Depending on the node shape, you will need to change: `"recipe_node_shape": "BM.GPU.A10.4"` to match your shape. -6. If you did not add a node, or just want to deploy a fresh node, use the following [blueprint](docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json). +6. If you did not add a node, or just want to deploy a fresh node, use the following [blueprint](docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json). 7. Paste the blueprint you selected into context box on the deployment page and click "POST" 8. To monitor the deployment, go back to "Api Root" and click "deployment_logs". - If you are deploying without a shared node pool, it can take 10-30 minutes to bring up a node, depending on shape and whether it is bare-metal or virtual. 
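As an alternative to pasting JSON into the browsable API page, the same POST can be issued from the command line. The snippet below is an illustrative sketch only: it assumes the deployment endpoint linked from "Api Root" accepts HTTP basic authentication with the `Admin Username` and `Admin Password` from the Application information tab, and the API URL, credentials, and blueprint file name are placeholders for your own values.

```bash
# Illustrative only: POST a blueprint JSON to the deployment endpoint with curl.
# Edit fields such as "recipe_node_name" or "recipe_node_shape" in the JSON first,
# then replace <api-url>, <admin-username>, and <admin-password> with your values.
curl -u "<admin-username>:<admin-password>" \
     -H "Content-Type: application/json" \
     -d @add_node_to_control_plane.json \
     "<api-url>/deployment/"
```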
diff --git a/README.md b/README.md index 10a5bd9..f8409c7 100644 --- a/README.md +++ b/README.md @@ -52,16 +52,16 @@ After you install OCI AI Blueprints to an OKE cluster in your tenancy, you can d | Blueprint | Description | | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | -| [**LLM & VLM Inference with vLLM**](docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/README.md) | Deploy Llama 2/3/3.1 7B/8B models using NVIDIA GPU shapes and the vLLM inference engine with auto-scaling. | -| [**Llama Stack**](docs/sample_blueprints/workload_blueprints/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments with unified API for inference, RAG, and telemetry. | -| [**Fine-Tuning Benchmarking**](docs/sample_blueprints/workload_blueprints/lora-benchmarking/README.md) | Run MLCommons quantized Llama-2 70B LoRA finetuning on A100 for performance benchmarking. | -| [**LoRA Fine-Tuning**](docs/sample_blueprints/workload_blueprints/lora-fine-tuning/README.md) | LoRA fine-tuning of custom or HuggingFace models using any dataset. Includes flexible hyperparameter tuning. | -| [**GPU Performance Benchmarking**](docs/sample_blueprints/workload_blueprints/gpu-health-check/README.md) | Comprehensive evaluation of GPU performance to ensure optimal hardware readiness before initiating any intensive computational workload. | -| [**CPU Inference**](docs/sample_blueprints/workload_blueprints/cpu-inference/README.md) | Leverage Ollama to test CPU-based inference with models like Mistral, Gemma, and more. | -| [**Multi-node Inference with RDMA and vLLM**](docs/sample_blueprints/workload_blueprints/multi-node-inference/README.md) | Deploy Llama-405B sized LLMs across multiple nodes with RDMA using H100 nodes with vLLM and LeaderWorkerSet. | -| [**Autoscaling Inference with vLLM**](docs/sample_blueprints/platform_feature_blueprints/auto_scaling/README.md) | Serve LLMs with auto-scaling using KEDA, which scales to multiple GPUs and nodes using application metrics like inference latency. | -| [**LLM Inference with MIG**](docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/README.md) | Deploy LLMs to a fraction of a GPU with Nvidia’s multi-instance GPUs and serve them with vLLM. | -| [**Job Queuing**](docs/sample_blueprints/platform_feature_blueprints/teams/README.md) | Take advantage of job queuing and enforce resource quotas and fair sharing between teams. | +| [**LLM & VLM Inference with vLLM**](docs/sample_blueprints/model_serving/llm_inference_with_vllm/README.md) | Deploy Llama 2/3/3.1 7B/8B models using NVIDIA GPU shapes and the vLLM inference engine with auto-scaling. | +| [**Llama Stack**](docs/sample_blueprints/other/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments with unified API for inference, RAG, and telemetry. | +| [**Fine-Tuning Benchmarking**](docs/sample_blueprints/gpu_benchmarking/lora-benchmarking/README.md) | Run MLCommons quantized Llama-2 70B LoRA finetuning on A100 for performance benchmarking. | +| [**LoRA Fine-Tuning**](docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/README.md) | LoRA fine-tuning of custom or HuggingFace models using any dataset. Includes flexible hyperparameter tuning. 
| +| [**GPU Performance Benchmarking**](docs/sample_blueprints/gpu_health_check/gpu-health-check/README.md) | Comprehensive evaluation of GPU performance to ensure optimal hardware readiness before initiating any intensive computational workload. | +| [**CPU Inference**](docs/sample_blueprints/model_serving/cpu-inference/README.md) | Leverage Ollama to test CPU-based inference with models like Mistral, Gemma, and more. | +| [**Multi-node Inference with RDMA and vLLM**](docs/sample_blueprints/model_serving/multi-node-inference/README.md) | Deploy Llama-405B sized LLMs across multiple nodes with RDMA using H100 nodes with vLLM and LeaderWorkerSet. | +| [**Autoscaling Inference with vLLM**](docs/sample_blueprints/model_serving/auto_scaling/README.md) | Serve LLMs with auto-scaling using KEDA, which scales to multiple GPUs and nodes using application metrics like inference latency. | +| [**LLM Inference with MIG**](docs/sample_blueprints/model_serving/mig_multi_instance_gpu/README.md) | Deploy LLMs to a fraction of a GPU with Nvidia’s multi-instance GPUs and serve them with vLLM. | +| [**Job Queuing**](docs/sample_blueprints/platform_features/teams/README.md) | Take advantage of job queuing and enforce resource quotas and fair sharing between teams. | ## Support & Contact diff --git a/docs/about.md b/docs/about.md index 2dd3985..aed9498 100644 --- a/docs/about.md +++ b/docs/about.md @@ -36,8 +36,8 @@ | ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- | | **Customize Blueprints** | Tailor existing OCI AI Blueprints to suit your exact AI workload needs—everything from hyperparameters to node counts and hardware. | [Read More](custom_blueprints/README.md) | | **Updating OCI AI Blueprints** | Keep your OCI AI Blueprints environment current with the latest control plane and portal updates. | [Read More](../INSTALLING_ONTO_EXISTING_CLUSTER_README.md) | -| **Shared Node Pool** | Use longer-lived resources (e.g., bare metal nodes) across multiple blueprints or to persist resources after a blueprint is undeployed. | [Read More](sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md) | -| **Auto-Scaling** | Automatically adjust resource usage based on infrastructure or application-level metrics to optimize performance and costs. | [Read More](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md) | +| **Shared Node Pool** | Use longer-lived resources (e.g., bare metal nodes) across multiple blueprints or to persist resources after a blueprint is undeployed. | [Read More](sample_blueprints/platform_features/shared_node_pools/README.md) | +| **Auto-Scaling** | Automatically adjust resource usage based on infrastructure or application-level metrics to optimize performance and costs. | [Read More](sample_blueprints/model_serving/auto_scaling/README.md) | --- @@ -76,13 +76,13 @@ A: A: Deploy a vLLM blueprint, then use a tool like LLMPerf to run benchmarking against your inference endpoint. Contact us for more details. **Q: Where can I see the full list of blueprints?** -A: All available blueprints are listed [here](sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/README.md). If you need something custom, please let us know. +A: All available blueprints are listed [here](sample_blueprints/other/exisiting_cluster_installation/README.md). 
If you need something custom, please let us know. **Q: How do I check logs for troubleshooting?** A: Use `kubectl` to inspect pod logs in your OKE cluster. **Q: Does OCI AI Blueprints support auto-scaling?** -A: Yes, we leverage KEDA for application-driven auto-scaling. See [documentation](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md). +A: Yes, we leverage KEDA for application-driven auto-scaling. See [documentation](sample_blueprints/model_serving/auto_scaling/README.md). **Q: Which GPUs are compatible?** A: Any NVIDIA GPUs available in your OCI region (A10, A100, H100, etc.). @@ -91,4 +91,4 @@ A: Any NVIDIA GPUs available in your OCI region (A10, A100, H100, etc.). A: Yes, though testing on clusters running other workloads is ongoing. We recommend a clean cluster for best stability. **Q: How do I run multiple blueprints on the same node?** -A: Enable shared node pools. [Read more here](sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md). +A: Enable shared node pools. [Read more here](sample_blueprints/platform_features/shared_node_pools/README.md). diff --git a/docs/api_documentation.md b/docs/api_documentation.md index e78125d..08efdf3 100644 --- a/docs/api_documentation.md +++ b/docs/api_documentation.md @@ -36,11 +36,11 @@ | recipe_container_env | string | No | Values of the recipe container init arguments. See the Blueprint Arguments section below for details. Example: `[{"key": "tensor_parallel_size","value": "2"},{"key": "model_name","value": "NousResearch/Meta-Llama-3.1-8B-Instruct"},{"key": "Model_Path","value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"}]` | | skip_capacity_validation | boolean | No | Determines whether validation checks on shape capacity are performed before initiating deployment. If your deployment is failing validation due to capacity errors but you believe this not to be true, you should set `skip_capacity_validation` to be `true` in the recipe JSON to bypass all checks for Shape capacity. | -For autoscaling parameters, visit [autoscaling](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md). +For autoscaling parameters, visit [autoscaling](sample_blueprints/model_serving/auto_scaling/README.md). -For multinode inference parameters, visit [multinode inference](sample_blueprints/workload_blueprints/multi-node-inference/README.md) +For multinode inference parameters, visit [multinode inference](sample_blueprints/model_serving/multi-node-inference/README.md) -For MIG parameters, visit [MIG shared pool configurations](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json), [update MIG configuration](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json), and [MIG recipe configuration](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json). +For MIG parameters, visit [MIG shared pool configurations](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), [update MIG configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), and [MIG recipe configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json). ### Blueprint Container Arguments @@ -94,13 +94,13 @@ This recipe deploys the vLLM container image. Follow the vLLM docs to pass the c There are 3 blueprints that we are providing out of the box. 
Following are example recipe.json snippets that you can use to deploy the blueprints quickly for a test run. |Blueprint|Scenario|Sample JSON| |----|----|---- -|LLM Inference using NVIDIA shapes and vLLM|Deployment with default Llama-3.1-8B model using PAR|View sample JSON here [here](sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json) -|MLCommons Llama-2 Quantized 70B LORA Fine-Tuning on A100|Default deployment with model and dataset ingested using PAR|View sample JSON here [here](sample_blueprints/workload_blueprints/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json) -|LORA Fine-Tune Blueprint|Open Access Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/open_model_open_dataset_hf.backend.json) -|LORA Fine-Tune Blueprint|Closed Access Model Open Access Dataset Download from Huggingface (Valid Auth Token Is Required!!)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/closed_model_open_dataset_hf.backend.json) -|LORA Fine-Tune Blueprint|Bucket Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json) -|LORA Fine-Tune Blueprint|Get Model from Bucket in Another Region / Tenancy using Pre-Authenticated_Requests (PAR) Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_model_open_dataset.backend.json) -|LORA Fine-Tune Blueprint|Bucket Model Bucket Checkpoint Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json) +|LLM Inference using NVIDIA shapes and vLLM|Deployment with default Llama-3.1-8B model using PAR|View sample JSON [here](sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json) +|MLCommons Llama-2 Quantized 70B LORA Fine-Tuning on A100|Default deployment with model and dataset ingested using PAR|View sample JSON [here](sample_blueprints/gpu_benchmarking/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json) +|LORA Fine-Tune Blueprint|Open Access Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/open_model_open_dataset_hf.backend.json) +|LORA Fine-Tune Blueprint|Closed Access Model Open Access Dataset Download from Huggingface (Valid Auth Token Is Required!!)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/closed_model_open_dataset_hf.backend.json) +|LORA Fine-Tune Blueprint|Bucket Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json) +|LORA Fine-Tune Blueprint|Get Model from Bucket in Another Region / Tenancy using Pre-Authenticated Requests (PAR) Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_model_open_dataset.backend.json) +|LORA Fine-Tune Blueprint|Bucket Model Bucket Checkpoint Open Access Dataset Download from Huggingface (no token required)|View sample JSON 
[here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json) ## Undeploy a Blueprint diff --git a/docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md b/docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md index a0047c6..6c4f4bd 100644 --- a/docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md +++ b/docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md @@ -21,7 +21,7 @@ If you have existing node pools in your original OKE cluster that you'd like Blu 2. Go to the stack and click "Application information". Click the API Url. 3. Login with the `Admin Username` and `Admin Password` in the Application information tab. 4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box. -5. Paste in the sample blueprint json found [here](../../sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json). +5. Paste in the sample blueprint json found [here](../../sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json). 6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above. 7. Click "POST". This is a fast operation. 8. Wait about 20 seconds and refresh the page. It should look like: diff --git a/docs/common_workflows/working_with_large_models/README.md b/docs/common_workflows/working_with_large_models/README.md index f5da6a8..5d08ae5 100644 --- a/docs/common_workflows/working_with_large_models/README.md +++ b/docs/common_workflows/working_with_large_models/README.md @@ -40,7 +40,7 @@ Steps: 1. Create a bucket in object storage in the same region as the shared node pool (decrease copy times). In our example, we will call this something similar to the name of the model we plan to use: `llama3290Bvisioninstruct` -2. Once the bucket is finished creating, deploy [this blueprint](../../sample_blueprints/platform_feature_blueprints/model_storage/download_closed_hf_model_to_object_storage.json) to copy `meta-llama/Llama-3.2-90B-Vision-Instruct` to the bucket you created. +2. Once the bucket is finished creating, deploy [this blueprint](../../sample_blueprints/other/model_storage/download_closed_hf_model_to_object_storage.json) to copy `meta-llama/Llama-3.2-90B-Vision-Instruct` to the bucket you created. - **Note**: The blueprint assumes you created the bucket using the name `llama3290Bvisioninstruct`. If you changed the name, you will also need to modify it in the example blueprint. diff --git a/docs/custom_blueprints/blueprint_json_schema.json b/docs/custom_blueprints/blueprint_json_schema.json index b888a8c..9a38519 100644 --- a/docs/custom_blueprints/blueprint_json_schema.json +++ b/docs/custom_blueprints/blueprint_json_schema.json @@ -599,14 +599,34 @@ "description": "Classifies the blueprint by intent.", "oneOf": [ { - "const": "workload_blueprint", - "title": "Workload blueprint", - "description": "End‑to‑end workloads such as inference, fine‑tuning, benchmarking, or health‑checking that deliver a runnable solution." + "const": "gpu_benchmarking", + "title": "GPU Benchmarking", + "description": "Benchmarks for measuring GPU performance, compute throughput, memory bandwidth, and hardware utilization across different workloads and configurations." 
}, { - "const": "platform_feature_blueprint", - "title": "Platform‑feature blueprint", - "description": "Demonstrates how to use a specific OCI AI Blueprints capability (autoscaling, shared pools, MIG, etc.) that users can copy into other blueprints." + "const": "gpu_health_check", + "title": "GPU Health Check", + "description": "Diagnostic tools and health monitoring solutions for validating GPU functionality, detecting hardware issues, and ensuring optimal GPU cluster operations." + }, + { + "const": "model_fine_tuning", + "title": "Model Fine-tuning", + "description": "End-to-end solutions for fine-tuning pre-trained machine learning models on custom datasets, including parameter-efficient methods like LoRA and full fine-tuning approaches." + }, + { + "const": "model_serving", + "title": "Model Serving", + "description": "Inference and model serving solutions for deploying trained models as scalable services, including real-time inference, batch processing, and multi-model serving scenarios." + }, + { + "const": "platform_features", + "title": "Platform Features", + "description": "Demonstrations of specific OCI AI Blueprints platform capabilities such as autoscaling, shared node pools, MIG configurations, storage integrations, and networking features." + }, + { + "const": "other", + "title": "Other", + "description": "General-purpose blueprints and specialized use cases that don't fit into the standard categories, including experimental workflows and custom integrations." } ] }, diff --git a/docs/sample_blueprints/README.md b/docs/sample_blueprints/README.md index 7f50b74..af4e4aa 100644 --- a/docs/sample_blueprints/README.md +++ b/docs/sample_blueprints/README.md @@ -12,18 +12,18 @@ You may use any blueprint JSON from these categories as the payload in the `/dep | Feature Category | Type | Documentation | Description | | ---------------------------------------------------------------- | -------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------------------- | -| [Autoscaling](platform_feature_blueprints/auto_scaling/README.md) | Inference | [Guide](platform_feature_blueprints/auto_scaling/README.md) | Scale inference workloads based on traffic load with automatic pod and node scaling | -| [CPU Inference](workload_blueprints/cpu-inference/README.md) | Inference | [Guide](workload_blueprints/cpu-inference/README.md) | Deploy CPU-based inference with Ollama for cost-effective and GPU-free model serving | -| [Existing Cluster Installation](platform_feature_blueprints/exisiting_cluster_installation/README.md) | Infrastructure | [Guide](platform_feature_blueprints/exisiting_cluster_installation/README.md) | Deploy OCI AI Blueprints on your existing OKE cluster without creating new infrastructure | -| [GPU Health Check](workload_blueprints/gpu-health-check/README.md) | Diagnostics | [Guide](workload_blueprints/gpu-health-check/README.md) | Comprehensive GPU health validation and diagnostics for production readiness | -| [vLLM Inference](workload_blueprints/llm_inference_with_vllm/README.md) | Inference | [Guide](workload_blueprints/llm_inference_with_vllm/README.md) | Deploy large language models using vLLM for high-performance inference | -| [Llama Stack](workload_blueprints/llama-stack/README.md) | Application | [Guide](workload_blueprints/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments | -| [LoRA 
Benchmarking](workload_blueprints/lora-benchmarking/README.md) | Training | [Guide](workload_blueprints/lora-benchmarking/README.md) | Benchmark fine-tuning performance using MLCommons methodology | -| [LoRA Fine-Tuning](workload_blueprints/lora-fine-tuning/README.md) | Training | [Guide](workload_blueprints/lora-fine-tuning/README.md) | Efficiently fine-tune large language models using Low-Rank Adaptation | -| [Multi-Instance GPU](platform_feature_blueprints/mig_multi_instance_gpu/README.md) | Infrastructure | [Guide](platform_feature_blueprints/mig_multi_instance_gpu/README.md) | Partition H100 GPUs into multiple isolated instances for efficient resource sharing | -| [Model Storage](platform_feature_blueprints/model_storage/README.md) | Storage | [Guide](platform_feature_blueprints/model_storage/README.md) | Download and store models from HuggingFace to OCI Object Storage | -| [Multi-Node Inference](workload_blueprints/multi-node-inference/README.md) | Inference | [Guide](workload_blueprints/multi-node-inference/README.md) | Scale large language model inference across multiple GPU nodes | -| [Shared Node Pools](platform_feature_blueprints/shared_node_pools/README.md) | Infrastructure | [Guide](platform_feature_blueprints/shared_node_pools/README.md) | Create persistent node pools for efficient blueprint deployment | -| [Teams](platform_feature_blueprints/teams/README.md) | Management | [Guide](platform_feature_blueprints/teams/README.md) | Enforce resource quotas and fair sharing between teams using Kueue | -| [RDMA Node Pools](platform_feature_blueprints/using_rdma_enabled_node_pools/README.md) | Infrastructure | [Guide](platform_feature_blueprints/using_rdma_enabled_node_pools/README.md) | Enable high-performance inter-node communication using Remote Direct Memory Access | -| [Startup & Health Probes](platform_feature_blueprints/startup_liveness_readiness_probes/README.md) | Configuration | [Guide](platform_feature_blueprints/startup_liveness_readiness_probes/README.md) | Configure application health monitoring and startup validation | +| [Autoscaling](model_serving/auto_scaling/README.md) | Inference | [Guide](model_serving/auto_scaling/README.md) | Scale inference workloads based on traffic load with automatic pod and node scaling | +| [CPU Inference](model_serving/cpu-inference/README.md) | Inference | [Guide](model_serving/cpu-inference/README.md) | Deploy CPU-based inference with Ollama for cost-effective and GPU-free model serving | +| [Existing Cluster Installation](other/exisiting_cluster_installation/README.md) | Infrastructure | [Guide](other/exisiting_cluster_installation/README.md) | Deploy OCI AI Blueprints on your existing OKE cluster without creating new infrastructure | +| [GPU Health Check](gpu_health_check/gpu-health-check/README.md) | Diagnostics | [Guide](gpu_health_check/gpu-health-check/README.md) | Comprehensive GPU health validation and diagnostics for production readiness | +| [vLLM Inference](model_serving/llm_inference_with_vllm/README.md) | Inference | [Guide](model_serving/llm_inference_with_vllm/README.md) | Deploy large language models using vLLM for high-performance inference | +| [Llama Stack](other/llama-stack/README.md) | Application | [Guide](other/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments | +| [LoRA Benchmarking](gpu_benchmarking/lora-benchmarking/README.md) | Training | [Guide](gpu_benchmarking/lora-benchmarking/README.md) | Benchmark fine-tuning performance using MLCommons 
methodology | +| [LoRA Fine-Tuning](model_fine_tuning/lora-fine-tuning/README.md) | Training | [Guide](model_fine_tuning/lora-fine-tuning/README.md) | Efficiently fine-tune large language models using Low-Rank Adaptation | +| [Multi-Instance GPU](model_serving/mig_multi_instance_gpu/README.md) | Infrastructure | [Guide](model_serving/mig_multi_instance_gpu/README.md) | Partition H100 GPUs into multiple isolated instances for efficient resource sharing | +| [Model Storage](other/model_storage/README.md) | Storage | [Guide](other/model_storage/README.md) | Download and store models from HuggingFace to OCI Object Storage | +| [Multi-Node Inference](model_serving/multi-node-inference/README.md) | Inference | [Guide](model_serving/multi-node-inference/README.md) | Scale large language model inference across multiple GPU nodes | +| [Shared Node Pools](platform_features/shared_node_pools/README.md) | Infrastructure | [Guide](platform_features/shared_node_pools/README.md) | Create persistent node pools for efficient blueprint deployment | +| [Teams](platform_features/teams/README.md) | Management | [Guide](platform_features/teams/README.md) | Enforce resource quotas and fair sharing between teams using Kueue | +| [RDMA Node Pools](other/using_rdma_enabled_node_pools/README.md) | Infrastructure | [Guide](other/using_rdma_enabled_node_pools/README.md) | Enable high-performance inter-node communication using Remote Direct Memory Access | +| [Startup & Health Probes](platform_features/startup_liveness_readiness_probes/README.md) | Configuration | [Guide](platform_features/startup_liveness_readiness_probes/README.md) | Configure application health monitoring and startup validation | diff --git a/docs/sample_blueprints/workload_blueprints/lora-benchmarking/README.md b/docs/sample_blueprints/gpu_benchmarking/lora-benchmarking/README.md similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-benchmarking/README.md rename to docs/sample_blueprints/gpu_benchmarking/lora-benchmarking/README.md diff --git a/docs/sample_blueprints/workload_blueprints/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json b/docs/sample_blueprints/gpu_benchmarking/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json rename to docs/sample_blueprints/gpu_benchmarking/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json diff --git a/docs/sample_blueprints/workload_blueprints/gpu-health-check/README.md b/docs/sample_blueprints/gpu_health_check/gpu-health-check/README.md similarity index 99% rename from docs/sample_blueprints/workload_blueprints/gpu-health-check/README.md rename to docs/sample_blueprints/gpu_health_check/gpu-health-check/README.md index 43dc2d9..18881c5 100644 --- a/docs/sample_blueprints/workload_blueprints/gpu-health-check/README.md +++ b/docs/sample_blueprints/gpu_health_check/gpu-health-check/README.md @@ -1,4 +1,4 @@ -# Health Check +# GPU Health Check #### Comprehensive GPU health validation and diagnostics for production readiness diff --git a/docs/sample_blueprints/workload_blueprints/gpu-health-check/healthcheck_fp16_a10.json b/docs/sample_blueprints/gpu_health_check/gpu-health-check/healthcheck_fp16_a10.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/gpu-health-check/healthcheck_fp16_a10.json rename to 
docs/sample_blueprints/gpu_health_check/gpu-health-check/healthcheck_fp16_a10.json diff --git a/docs/sample_blueprints/workload_blueprints/gpu-health-check/healthcheck_fp16_h100.json b/docs/sample_blueprints/gpu_health_check/gpu-health-check/healthcheck_fp16_h100.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/gpu-health-check/healthcheck_fp16_h100.json rename to docs/sample_blueprints/gpu_health_check/gpu-health-check/healthcheck_fp16_h100.json diff --git a/docs/sample_blueprints/workload_blueprints/gpu-health-check/healthcheck_fp32_a10.json b/docs/sample_blueprints/gpu_health_check/gpu-health-check/healthcheck_fp32_a10.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/gpu-health-check/healthcheck_fp32_a10.json rename to docs/sample_blueprints/gpu_health_check/gpu-health-check/healthcheck_fp32_a10.json diff --git a/docs/sample_blueprints/workload_blueprints/lora-fine-tuning/README.md b/docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/README.md similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-fine-tuning/README.md rename to docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/README.md diff --git a/docs/sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_checkpoint_bucket_model_open_dataset.backend.json b/docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_checkpoint_bucket_model_open_dataset.backend.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_checkpoint_bucket_model_open_dataset.backend.json rename to docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_checkpoint_bucket_model_open_dataset.backend.json diff --git a/docs/sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_model_open_dataset.backend.json b/docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_model_open_dataset.backend.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_model_open_dataset.backend.json rename to docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_model_open_dataset.backend.json diff --git a/docs/sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json b/docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json rename to docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json diff --git a/docs/sample_blueprints/workload_blueprints/lora-fine-tuning/closed_model_open_dataset_hf.backend.json b/docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/closed_model_open_dataset_hf.backend.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-fine-tuning/closed_model_open_dataset_hf.backend.json rename to docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/closed_model_open_dataset_hf.backend.json diff --git a/docs/sample_blueprints/workload_blueprints/lora-fine-tuning/open_model_open_dataset_hf.backend.json b/docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/open_model_open_dataset_hf.backend.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/lora-fine-tuning/open_model_open_dataset_hf.backend.json rename to docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/open_model_open_dataset_hf.backend.json 
diff --git a/docs/sample_blueprints/platform_feature_blueprints/auto_scaling/README.md b/docs/sample_blueprints/model_serving/auto_scaling/README.md similarity index 99% rename from docs/sample_blueprints/platform_feature_blueprints/auto_scaling/README.md rename to docs/sample_blueprints/model_serving/auto_scaling/README.md index 23a089e..bb5c468 100644 --- a/docs/sample_blueprints/platform_feature_blueprints/auto_scaling/README.md +++ b/docs/sample_blueprints/model_serving/auto_scaling/README.md @@ -192,7 +192,7 @@ Pod auto-scaling allows a blueprint to scale within a single node, up to the num #### Additional Considerations: -Pod autoscaling can be paired with startup and liveness probes to verify that a blueprint is both ready to receive requests and continuing to function properly. For more information, visit [our startup and liveness probe doc](../startup_liveness_readiness_probes/README.md). +Pod autoscaling can be paired with startup and liveness probes to verify that a blueprint is both ready to receive requests and continuing to function properly. For more information, visit [our startup and liveness probe doc](../../platform_features/startup_liveness_readiness_probes/README.md). ## Node + Pod Auto-Scaling (Scaling Beyond a Single Node) diff --git a/docs/sample_blueprints/platform_feature_blueprints/auto_scaling/autoscaling_blueprint.json b/docs/sample_blueprints/model_serving/auto_scaling/autoscaling_blueprint.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/auto_scaling/autoscaling_blueprint.json rename to docs/sample_blueprints/model_serving/auto_scaling/autoscaling_blueprint.json diff --git a/docs/sample_blueprints/workload_blueprints/cpu-inference/README.md b/docs/sample_blueprints/model_serving/cpu-inference/README.md similarity index 100% rename from docs/sample_blueprints/workload_blueprints/cpu-inference/README.md rename to docs/sample_blueprints/model_serving/cpu-inference/README.md diff --git a/docs/sample_blueprints/workload_blueprints/cpu-inference/cpu-inference-gemma.json b/docs/sample_blueprints/model_serving/cpu-inference/cpu-inference-gemma.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/cpu-inference/cpu-inference-gemma.json rename to docs/sample_blueprints/model_serving/cpu-inference/cpu-inference-gemma.json diff --git a/docs/sample_blueprints/workload_blueprints/cpu-inference/cpu-inference-mistral-bm.json b/docs/sample_blueprints/model_serving/cpu-inference/cpu-inference-mistral-bm.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/cpu-inference/cpu-inference-mistral-bm.json rename to docs/sample_blueprints/model_serving/cpu-inference/cpu-inference-mistral-bm.json diff --git a/docs/sample_blueprints/workload_blueprints/cpu-inference/cpu-inference-mistral-vm.json b/docs/sample_blueprints/model_serving/cpu-inference/cpu-inference-mistral-vm.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/cpu-inference/cpu-inference-mistral-vm.json rename to docs/sample_blueprints/model_serving/cpu-inference/cpu-inference-mistral-vm.json diff --git a/docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/README.md b/docs/sample_blueprints/model_serving/llm_inference_with_vllm/README.md similarity index 100% rename from docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/README.md rename to docs/sample_blueprints/model_serving/llm_inference_with_vllm/README.md diff --git 
a/docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-closed-hf-model.json b/docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-closed-hf-model.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-closed-hf-model.json rename to docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-closed-hf-model.json diff --git a/docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-model-from-obj-storage.json b/docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-model-from-obj-storage.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-model-from-obj-storage.json rename to docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-model-from-obj-storage.json diff --git a/docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model-api-key-functionality.json b/docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model-api-key-functionality.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model-api-key-functionality.json rename to docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model-api-key-functionality.json diff --git a/docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json b/docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json rename to docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/README.md b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/README.md similarity index 99% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/README.md rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/README.md index 3f417c4..11218d2 100644 --- a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/README.md +++ b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/README.md @@ -128,7 +128,7 @@ There are two ways to apply a mig configuration to a node pool. #### shared_node_pool: -Apart from the existing requirements for a shared node pool found [here](../shared_node_pools/README.md), the following are additional requirements / options for MIG: +Apart from the existing requirements for a shared node pool found [here](../../platform_features/shared_node_pools/README.md), the following are additional requirements / options for MIG: - `"shared_node_pool_mig_config"` - the mig configuration to apply to each node in the node pool. Possible values are in the [Mig Configurations](#mig-configurations). This will apply the configuration to each node in the pool, but if you want to update a specific node, that can be done via the `update` mode described in the next section. - `"recipe_max_pods_per_node"`: [OPTIONAL: DEFAULT = 90] - by default, since MIG can slice up to 56 times for a full BM.GPU.H100.8, the default 31 pods by OKE is insufficient. As part of shared_node_pool deployment for MIG, this value is increased to 90 to fit all slice configurations + some buffer room. The maximum value is purportedly 110.
It is not recommended to change this value, as it can not be modified after deployment of a pool. In order to change it, a node must be removed from the pool and re-added with the new value. diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_enabled_shared_node_pool.json b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_enabled_shared_node_pool.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_enabled_shared_node_pool.json rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_enabled_shared_node_pool.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_multiple_replicas.json b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_multiple_replicas.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_multiple_replicas.json rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_multiple_replicas.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica_10gb.json b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica_10gb.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica_10gb.json rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica_10gb.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_slices.png b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_slices.png similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_slices.png rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_slices.png diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_update_node_with_node_name.json b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_update_node_with_node_name.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_update_node_with_node_name.json rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_update_node_with_node_name.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_update_shared_pool_with_node_pool_name.json b/docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_update_shared_pool_with_node_pool_name.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_update_shared_pool_with_node_pool_name.json rename to docs/sample_blueprints/model_serving/mig_multi_instance_gpu/mig_update_shared_pool_with_node_pool_name.json diff --git a/docs/sample_blueprints/workload_blueprints/multi-node-inference/README.md 
b/docs/sample_blueprints/model_serving/multi-node-inference/README.md similarity index 95% rename from docs/sample_blueprints/workload_blueprints/multi-node-inference/README.md rename to docs/sample_blueprints/model_serving/multi-node-inference/README.md index 4dad738..dcd36cb 100644 --- a/docs/sample_blueprints/workload_blueprints/multi-node-inference/README.md +++ b/docs/sample_blueprints/model_serving/multi-node-inference/README.md @@ -53,13 +53,13 @@ Use multi-node inference whenever you are trying to use a very large model that ## RDMA + Multinode Inference -Want to use RDMA with multinode inference? [See here for details](../../platform_feature_blueprints/using_rdma_enabled_node_pools/README.md) +Want to use RDMA with multinode inference? [See here for details](../../other/using_rdma_enabled_node_pools/README.md) ## How to use it? We are using [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [Ray](https://github.com/ray-project/ray) using the [LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) to manage state between multiple nodes. -In order to use multi-node inference in an OCI Blueprint, first deploy a shared node pool with blueprints using [this recipe](../../platform_feature_blueprints/shared_node_pools/shared_node_pool_A10_VM.json). +In order to use multi-node inference in an OCI Blueprint, first deploy a shared node pool with blueprints using [this recipe](../../platform_features/shared_node_pools/shared_node_pool_A10_VM.json). Then, use the following blueprint to deploy serving software: [LINK](multinode_inference_VM_A10.json) @@ -93,9 +93,9 @@ The following parameters are required: - `multinode_num_nodes_to_use_from_shared_pool` -> the total number of nodes (as an integer) you want to use to serve this model. This number must be less than the size of the shared node pool, and will only use schedulable nodes in the pool. -- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": true` -> If you have provisioned RDMA enabled shared node pools in your cluster - enable RDMA communication between nodes. This will fail validation if RDMA is not supported for shape type, or node is missing appropriate labels described in [linked doc](../../platform_feature_blueprints/using_rdma_enabled_node_pools/README.md). +- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": true` -> If you have provisioned RDMA enabled shared node pools in your cluster - enable RDMA communication between nodes. This will fail validation if RDMA is not supported for shape type, or node is missing appropriate labels described in [linked doc](../../other/using_rdma_enabled_node_pools/README.md). -- [OPTIONAL] `recipe_readiness_probe_params` -> Readiness probe to ensure that service is ready to serve requests. Parameter details found [here](../../platform_feature_blueprints/startup_liveness_readiness_probes/README.md). +- [OPTIONAL] `recipe_readiness_probe_params` -> Readiness probe to ensure that service is ready to serve requests. Parameter details found [here](../../platform_features/startup_liveness_readiness_probes/README.md). ## Requirements @@ -113,7 +113,7 @@ Follow these 6 simple steps to deploy your multi-node inference using OCI AI Blu 1. **Deploy your shared node pool** - Deploy a shared node pool containing at least 2 nodes for inference. Note: Existing shared node pools can be used! 
- - as a template, follow [this BM.A10](../../platform_feature_blueprints/shared_node_pools/shared_node_pool_A10_BM.json) or [this VM.A10](../../platform_feature_blueprints/shared_node_pools/shared_node_pool_A10_VM.json). + - as a template, follow [this BM.A10](../../platform_features/shared_node_pools/shared_node_pool_A10_BM.json) or [this VM.A10](../../platform_features/shared_node_pools/shared_node_pool_A10_VM.json). 2. **Create Your Deployment Blueprint** - Create a JSON configuration (blueprint) that defines your RayCluster. Key parameters include: - `"recipe_mode": "service"` diff --git a/docs/sample_blueprints/workload_blueprints/multi-node-inference/multinode_inference_BM_A10.json b/docs/sample_blueprints/model_serving/multi-node-inference/multinode_inference_BM_A10.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/multi-node-inference/multinode_inference_BM_A10.json rename to docs/sample_blueprints/model_serving/multi-node-inference/multinode_inference_BM_A10.json diff --git a/docs/sample_blueprints/workload_blueprints/multi-node-inference/multinode_inference_VM_A10.json b/docs/sample_blueprints/model_serving/multi-node-inference/multinode_inference_VM_A10.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/multi-node-inference/multinode_inference_VM_A10.json rename to docs/sample_blueprints/model_serving/multi-node-inference/multinode_inference_VM_A10.json diff --git a/docs/sample_blueprints/model_serving/offline-inference-infra/README.md b/docs/sample_blueprints/model_serving/offline-inference-infra/README.md new file mode 100644 index 0000000..4bc98ac --- /dev/null +++ b/docs/sample_blueprints/model_serving/offline-inference-infra/README.md @@ -0,0 +1,171 @@ +# Offline Inference Blueprint - Infra (SGLang + vLLM) + +#### Run offline LLM inference benchmarks using SGLang or vLLM backends with automated performance tracking and MLflow logging. + +This blueprint provides a configurable framework to run **offline LLM inference benchmarks** using either the SGLang or vLLM backends. It is designed for cloud GPU environments and supports automated performance benchmarking with MLflow logging. + +This blueprint enables you to: + +- Run inference locally on GPU nodes using pre-loaded models +- Benchmark token throughput, latency, and request performance +- Push results to MLflow for comparison and analysis + +--- + +## Pre-Filled Samples + +| Feature Showcase | Title | Description | Blueprint File | +| ---------------------------------------------------------------------------------------------------------- | ------------------------------------ | ---------------------------------------------------------------------------- | ---------------------------------------------------------------- | +| Benchmark LLM performance using SGLang backend with offline inference for accurate performance measurement | Offline inference with LLaMA 3 | Benchmarks Meta-Llama-3.1-8B model using SGLang on VM.GPU.A10.2 with 2 GPUs. | [offline_deployment_sglang.json](offline_deployment_sglang.json) | +| Benchmark LLM performance using vLLM backend with offline inference for token throughput analysis | Offline inference with LLAMA 3- vLLM | Benchmarks Meta-Llama-3.1-8B model using vLLM on VM.GPU.A10.2 with 2 GPUs. | [offline_deployment_vllm.json](offline_deployment_vllm.json) | + +You can access these pre-filled samples from the OCI AI Blueprint portal. 
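+Once a sample has been deployed, the benchmark job can also be followed from the cluster side. The commands below are a minimal `kubectl` sketch; the pod name is a placeholder, and the exact pod name and namespace depend on your installation.
+
+```bash
+# Illustrative only: locate the benchmark pod and stream its logs.
+kubectl get pods
+kubectl logs -f <offline-inference-pod-name>
+```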
+ +--- + +## When to Use Offline Inference + +Offline inference is ideal for: + +- Accurate performance benchmarking (no API or network bottlenecks) +- Comparing GPU hardware performance (A10, A100, H100, MI300X) +- Evaluating backend frameworks like vLLM and SGLang + +--- + +## Supported Backends + +| Backend | Description | +| ------- | ------------------------------------------------------------------- | +| sglang | Fast multi-modal LLM backend with optimized throughput | +| vllm | Token streaming inference engine for LLMs with speculative decoding | + +--- + +## Running the Benchmark + +To run the benchmark you will need: + +- Model checkpoints pre-downloaded and stored in Object Storage. +- A PAR for the Object Storage bucket where the models are saved, with list, read, and write permissions. +- A bucket to save the outputs. This does not take a PAR, so it should be a bucket in the same tenancy as your OCI AI Blueprints stack. +- A config `.yaml` file with all the parameters required to run the benchmark (for example `input_len`, `output_len`, and GPU memory utilization). +- A deployment `.json` to deploy your blueprint. + +Sample deployment and config files are linked below. + +This blueprint supports benchmark execution via a job-mode recipe using a YAML config file. The recipe mounts a model and config file from Object Storage, runs offline inference, and logs metrics. + +**Note:** Make sure your output Object Storage bucket is in the same tenancy as your stack. + +## Sample Blueprints + +- [Sample Blueprint (Job Mode for Offline SGLang Inference)](offline_deployment_sglang.json) +- [Sample Blueprint (Job Mode for Offline vLLM Inference)](offline_deployment_vllm.json) +- [Sample Config File - SGLang](offline_sglang_example.yaml) +- [Sample Config File - vLLM](offline_vllm_example.yaml) + +--- + +## Metrics Logged + +- `requests_per_second` +- `input_tokens_per_second` +- `output_tokens_per_second` +- `total_tokens_per_second` +- `elapsed_time` +- `total_input_tokens` +- `total_output_tokens` + +If a dataset is provided: + +- `accuracy` + +### Top-level Deployment Keys + +| Key | Description | +| ------------------- | ---------------------------------------------------------------------------- | +| `recipe_id` | Identifier of the recipe to run; here, it's an offline SGLang benchmark job. | +| `recipe_mode` | Specifies this is a `job`, meaning it runs to completion and exits. | +| `deployment_name` | Human-readable name for the job. | +| `recipe_image_uri` | Docker image containing the benchmark code and dependencies. | +| `recipe_node_shape` | Shape of the VM or GPU node to run the job (e.g., VM.GPU.A10.2). | + +### Input Object Storage + +| Key | Description | +| ---------------------- | ---------------------------------------------------------------------------- | +| `input_object_storage` | List of inputs to mount from Object Storage. | +| `par` | Pre-Authenticated Request (PAR) link to a bucket/folder. | +| `mount_location` | Files are mounted to this path inside the container. | +| `volume_size_in_gbs` | Size of the mount volume. | +| `include` | Only these files/folders from the bucket are mounted (e.g., model + config). | + +### Output Object Storage + +| Key | Description | +| ----------------------- | ------------------------------------------------------- | +| `output_object_storage` | Where to store outputs like benchmark logs or results. | +| `bucket_name` | Name of the output bucket in OCI Object Storage. 
| +| `mount_location` | Mount point inside container where outputs are written. | +| `volume_size_in_gbs` | Size of this volume in GBs. | + +### Runtime & Infra Settings + +| Key | Description | +| ---------------------------------------------- | ------------------------------------------------------------- | +| `recipe_container_command_args` | Path to the YAML config that defines benchmark parameters. | +| `recipe_replica_count` | Number of job replicas to run (usually 1 for inference). | +| `recipe_container_port` | Port (optional for offline mode; required if API is exposed). | +| `recipe_nvidia_gpu_count` | Number of GPUs allocated to this job. | +| `recipe_node_pool_size` | Number of nodes in the pool (1 means 1 VM). | +| `recipe_node_boot_volume_size_in_gbs` | Disk size for OS + dependencies. | +| `recipe_ephemeral_storage_size` | Local scratch space in GBs. | +| `recipe_shared_memory_volume_size_limit_in_mb` | Shared memory (used by some inference engines). | + +--- + +## **Sample Config File (`example_sglang.yaml`)** + +This file is consumed by the container during execution to configure the benchmark run. + +### Inference Setup + +| Key | Description | +| ------------------- | ----------------------------------------------------------------- | +| `benchmark_type` | Set to `offline` to indicate local execution with no HTTP server. | +| `offline_backend` | Backend engine to use (`sglang` or `vllm`). | +| `model_path` | Path to the model directory (already mounted via Object Storage). | +| `tokenizer_path` | Path to the tokenizer (usually same as model path). | +| `trust_remote_code` | Enables loading models that require custom code (Hugging Face). | +| `conv_template` | Prompt formatting template to use (e.g., `llama-2`). | + +### Benchmark Parameters + +| Key | Description | +| ---------------- | ---------------------------------------------------------------------- | +| `input_len` | Number of tokens in the input prompt. | +| `output_len` | Number of tokens to generate. | +| `num_prompts` | Number of total prompts to run (e.g., 64 prompts x 128 output tokens). | +| `max_seq_len` | Max sequence length supported by the model (e.g., 4096). | +| `max_batch_size` | Max batch size per inference run (depends on GPU memory). | +| `dtype` | Precision (e.g., float16, bfloat16, auto). | + +### Sampling Settings + +| Key | Description | +| ------------- | --------------------------------------------------------------- | +| `temperature` | Controls randomness in generation (lower = more deterministic). | +| `top_p` | Top-p sampling for diversity (0.9 keeps most probable tokens). | + +### MLflow Logging + +| Key | Description | +| ----------------- | -------------------------------------------- | +| `mlflow_uri` | MLflow server to log performance metrics. | +| `experiment_name` | Experiment name to group runs in MLflow UI. | +| `run_name` | Custom name to identify this particular run. | + +### Output + +| Key | Description | +| ------------------- | -------------------------------------------------------------- | +| `save_metrics_path` | Path inside the container where metrics will be saved as JSON. 
| diff --git a/docs/sample_blueprints/model_serving/offline-inference-infra/offline_deployment_sglang.json b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_deployment_sglang.json new file mode 100644 index 0000000..e3b988a --- /dev/null +++ b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_deployment_sglang.json @@ -0,0 +1,36 @@ +{ + "recipe_id": "offline_inference_sglang", + "recipe_mode": "job", + "deployment_name": "Offline Inference Benchmark", + "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v4", + "recipe_node_shape": "VM.GPU.A10.2", + "input_object_storage": [ + { + "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/0T99iRADcM08aVpumM6smqMIcnIJTFtV2D8ZIIWidUP9eL8GSRyDMxOb9Va9rmRc/n/iduyx1qnmway/b/mymodels/o/", + "mount_location": "/models", + "volume_size_in_gbs": 500, + "include": [ + "new_example_sglang.yaml", + "NousResearch/Meta-Llama-3.1-8B" + ] + } + ], + "output_object_storage": [ + { + "bucket_name": "inference_output", + "mount_location": "/mlcommons_output", + "volume_size_in_gbs": 200 + } + ], + "recipe_container_command_args": [ + "/models/new_example_sglang.yaml" + ], + "recipe_replica_count": 1, + "recipe_container_port": "8000", + "recipe_nvidia_gpu_count": 2, + "recipe_node_pool_size": 1, + "recipe_node_boot_volume_size_in_gbs": 200, + "recipe_ephemeral_storage_size": 100, + "recipe_shared_memory_volume_size_limit_in_mb": 200 + } + \ No newline at end of file diff --git a/docs/sample_blueprints/model_serving/offline-inference-infra/offline_deployment_vllm.json b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_deployment_vllm.json new file mode 100644 index 0000000..e920f38 --- /dev/null +++ b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_deployment_vllm.json @@ -0,0 +1,36 @@ +{ + "recipe_id": "offline_inference_vllm", + "recipe_mode": "job", + "deployment_name": "Offline Inference Benchmark vllm", + "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v4", + "recipe_node_shape": "VM.GPU.A10.2", + "input_object_storage": [ + { + "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/0T99iRADcM08aVpumM6smqMIcnIJTFtV2D8ZIIWidUP9eL8GSRyDMxOb9Va9rmRc/n/iduyx1qnmway/b/mymodels/o/", + "mount_location": "/models", + "volume_size_in_gbs": 500, + "include": [ + "new_example_sglang.yaml", + "NousResearch/Meta-Llama-3.1-8B" + ] + } + ], + "output_object_storage": [ + { + "bucket_name": "inference_output", + "mount_location": "/mlcommons_output", + "volume_size_in_gbs": 200 + } + ], + "recipe_container_command_args": [ + "/models/offline_vllm_example.yaml" + ], + "recipe_replica_count": 1, + "recipe_container_port": "8000", + "recipe_nvidia_gpu_count": 2, + "recipe_node_pool_size": 1, + "recipe_node_boot_volume_size_in_gbs": 200, + "recipe_ephemeral_storage_size": 100, + "recipe_shared_memory_volume_size_limit_in_mb": 200 + } + \ No newline at end of file diff --git a/docs/sample_blueprints/model_serving/offline-inference-infra/offline_sglang_example.yaml b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_sglang_example.yaml new file mode 100644 index 0000000..a1ccf27 --- /dev/null +++ b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_sglang_example.yaml @@ -0,0 +1,24 @@ +benchmark_type: offline +offline_backend: sglang + +model_path: /models/NousResearch/Meta-Llama-3.1-8B +tokenizer_path: /models/NousResearch/Meta-Llama-3.1-8B +trust_remote_code: true 
+conv_template: llama-2 + +input_len: 128 +output_len: 128 +num_prompts: 64 +max_seq_len: 4096 +max_batch_size: 8 +dtype: auto +temperature: 0.7 +top_p: 0.9 + +mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000 +experiment_name: "sglang-bench-doc-test-new" +run_name: "llama3-8b-sglang-test" + + +save_metrics_path: /benchmarking_output/benchmark_output_llama3_sglang.json + diff --git a/docs/sample_blueprints/model_serving/offline-inference-infra/offline_vllm_example.yaml b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_vllm_example.yaml new file mode 100644 index 0000000..7734c14 --- /dev/null +++ b/docs/sample_blueprints/model_serving/offline-inference-infra/offline_vllm_example.yaml @@ -0,0 +1,29 @@ +benchmark_type: offline +model: /models/NousResearch/Meta-Llama-3.1-8B +tokenizer: /models/NousResearch/Meta-Llama-3.1-8B + +input_len: 12 +output_len: 12 +num_prompts: 2 +seed: 42 +tensor_parallel_size: 8 + +# vLLM-specific +#quantization: awq +dtype: half +gpu_memory_utilization: 0.99 +num_scheduler_steps: 10 +device: cuda +enforce_eager: true +kv_cache_dtype: auto +enable_prefix_caching: true +distributed_executor_backend: mp + +# Output +#output_json: ./128_128.json + +# MLflow +mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000 +experiment_name: test-bm-suite-doc +run_name: llama3-vllm-test +save_metrics_path: /mlcommons_output/benchmark_output_llama3_vllm.json diff --git a/docs/sample_blueprints/model_serving/online-inference-infra/README.md b/docs/sample_blueprints/model_serving/online-inference-infra/README.md new file mode 100644 index 0000000..84be75d --- /dev/null +++ b/docs/sample_blueprints/model_serving/online-inference-infra/README.md @@ -0,0 +1,58 @@ +# Online Inference Blueprint (LLMPerf) + +#### Benchmark online inference performance of large language models using LLMPerf standardized benchmarking tool. + +This blueprint benchmarks **online inference performance** of large language models using **LLMPerf**, a standardized benchmarking tool. It is designed to evaluate LLM APIs served via platforms such as OpenAI-compatible interfaces, including self-hosted LLM inference endpoints. + +This blueprint helps: + +- Simulate real-time request load on a running model server +- Measure end-to-end latency, throughput, and completion performance +- Push results to MLflow for visibility and tracking + +--- + +## Pre-Filled Samples + +| Feature Showcase | Title | Description | Blueprint File | +| --------------------------------------------------------------------------------------------------- | ----------------------------------------- | --------------------------------------------------------------------------- | ------------------------------------------------ | +| Benchmark live LLM API endpoints using LLMPerf to measure real-time performance and latency metrics | Online inference on LLaMA 3 using LLMPerf | Benchmark of meta/llama3-8b-instruct via a local OpenAI-compatible endpoint | [online_deployment.json](online_deployment.json) | + +These can be accessed directly from the OCI AI Blueprint portal. + +--- + +## Prerequisites + +Before running this blueprint: + +- You **must have an inference server already running**, compatible with the OpenAI API format. +- Ensure the endpoint and model name match what’s defined in the config. 
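+
+Before posting the benchmark job, it can help to confirm the endpoint is reachable and serves the expected model. Here is a quick sanity check with `curl` — the endpoint URL is a placeholder, and the API key and model name should be the `llm_api_key` and `model` values from your config:
+
+```bash
+# List the models the server exposes (the model named in your config should appear here)
+curl -k -H "Authorization: Bearer dummy-key" \
+  https://<your-endpoint>/v1/models
+
+# Send one small completion request to verify end-to-end inference works
+curl -k -H "Authorization: Bearer dummy-key" -H "Content-Type: application/json" \
+  -d '{"model": "meta/llama3-8b-instruct", "prompt": "Hello", "max_tokens": 8}' \
+  https://<your-endpoint>/v1/completions
+```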
+ +--- + +## Supported Scenarios + +| Use Case | Description | +| --------------------- | ------------------------------------------------------- | +| Local LLM APIs | Benchmark your own self-hosted models (e.g., vLLM) | +| Remote OpenAI API | Benchmark OpenAI deployments for throughput analysis | +| Multi-model endpoints | Test latency/throughput across different configurations | + +--- + +## Sample Blueprints + +[Sample Blueprint (Job Mode for Online Benchmarking)](online_inference_job.json) +[Sample Config File ](example_online.yaml) + +--- + +## Metrics Logged + +- `output_tokens_per_second` +- `requests_per_minute` +- `overall_output_throughput` +- All raw metrics from the `_summary.json` output of LLMPerf + +--- diff --git a/docs/sample_blueprints/model_serving/online-inference-infra/example_online.yaml b/docs/sample_blueprints/model_serving/online-inference-infra/example_online.yaml new file mode 100644 index 0000000..ea06d10 --- /dev/null +++ b/docs/sample_blueprints/model_serving/online-inference-infra/example_online.yaml @@ -0,0 +1,18 @@ +benchmark_type: online + +model: meta/llama3-8b-instruct +input_len: 64 +output_len: 32 +max_requests: 5 +timeout: 300 +num_concurrent: 1 +results_dir: /workspace/results_on +llm_api: openai +llm_api_key: dummy-key +llm_api_base: http://localhost:8001/v1 + +experiment_name: local-bench +run_name: llama3-test +mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000 +llmperf_path: /opt/llmperf-src +metadata: test=localhost \ No newline at end of file diff --git a/docs/sample_blueprints/model_serving/online-inference-infra/llama3_public_online.yaml b/docs/sample_blueprints/model_serving/online-inference-infra/llama3_public_online.yaml new file mode 100644 index 0000000..967b5c8 --- /dev/null +++ b/docs/sample_blueprints/model_serving/online-inference-infra/llama3_public_online.yaml @@ -0,0 +1,17 @@ +benchmark_type: online +model: /models/NousResearch/Meta-Llama-3.1-8B-Instruct # Updated model path +input_len: 64 +output_len: 32 +max_requests: 5 +timeout: 300 +num_concurrent: 1 +results_dir: /online_output +llm_api: openai +llm_api_key: dummy-key +llm_api_base: https://llama8bobjvllm.129-80-16-111.nip.io/v1 # Updated to HTTPS +experiment_name: local-bench +run_name: llama3-test +mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000 +llmperf_path: /opt/llmperf-src +metadata: test=public-endpoint +save_metrics_path: /online_output/benchmark_output_llama3_online_public.json \ No newline at end of file diff --git a/docs/sample_blueprints/model_serving/online-inference-infra/online_deployment.json b/docs/sample_blueprints/model_serving/online-inference-infra/online_deployment.json new file mode 100644 index 0000000..daeca81 --- /dev/null +++ b/docs/sample_blueprints/model_serving/online-inference-infra/online_deployment.json @@ -0,0 +1,35 @@ +{ + "recipe_id": "online_infernece_llmperf", + "recipe_mode": "job", + "deployment_name": "a1", + "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v4", + "recipe_node_shape": "VM.Standard.E4.Flex", + "recipe_node_pool_size": 1, + "recipe_flex_shape_ocpu_count": 32, + "recipe_flex_shape_memory_size_in_gbs": 256, + "recipe_node_boot_volume_size_in_gbs": 200, + "recipe_ephemeral_storage_size": 150, + "input_object_storage": [ + { + "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/0T99iRADcM08aVpumM6smqMIcnIJTFtV2D8ZIIWidUP9eL8GSRyDMxOb9Va9rmRc/n/iduyx1qnmway/b/mymodels/o/", + "mount_location": "/models", + "volume_size_in_gbs": 500, + "include": [ + 
"llama3_public_online.yaml" + ] + } + ], + "output_object_storage": [ + { + "bucket_name": "inference_output", + "mount_location": "/online_output", + "volume_size_in_gbs": 200 + } + ], + "recipe_container_command_args": [ + "/models/llama3_public_online.yaml" + ], + "recipe_replica_count": 1, + "recipe_container_port": "5678" + } + \ No newline at end of file diff --git a/docs/sample_blueprints/model_serving/online-inference-infra/online_inference_job.json b/docs/sample_blueprints/model_serving/online-inference-infra/online_inference_job.json new file mode 100644 index 0000000..8522fb7 --- /dev/null +++ b/docs/sample_blueprints/model_serving/online-inference-infra/online_inference_job.json @@ -0,0 +1,21 @@ +{ + "recipe_id": "online_inference_benchmark", + "recipe_mode": "job", + "deployment_name": "Online Inference Benchmark", + "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v2", + "recipe_node_shape": "VM.GPU.A10.2", + "input_object_storage": [ + { + "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/Z2q73uuLCAxCbGXJ99CIeTxnCTNipsE-1xHE9HYfCz0RBYPTcCbqi9KHViUEH-Wq/n/iduyx1qnmway/b/mymodels/o/", + "mount_location": "/models", + "volume_size_in_gbs": 100, + "include": ["example_online.yaml"] + } + ], + "recipe_container_command_args": ["/models/example_online.yaml"], + "recipe_replica_count": 1, + "recipe_container_port": "8000", + "recipe_node_pool_size": 1, + "recipe_node_boot_volume_size_in_gbs": 200, + "recipe_ephemeral_storage_size": 100 +} diff --git a/docs/sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/README.md b/docs/sample_blueprints/other/exisiting_cluster_installation/README.md similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/README.md rename to docs/sample_blueprints/other/exisiting_cluster_installation/README.md diff --git a/docs/sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json b/docs/sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json rename to docs/sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json diff --git a/docs/sample_blueprints/workload_blueprints/llama-stack/README.md b/docs/sample_blueprints/other/llama-stack/README.md similarity index 98% rename from docs/sample_blueprints/workload_blueprints/llama-stack/README.md rename to docs/sample_blueprints/other/llama-stack/README.md index d2d6460..ab6336b 100644 --- a/docs/sample_blueprints/workload_blueprints/llama-stack/README.md +++ b/docs/sample_blueprints/other/llama-stack/README.md @@ -74,7 +74,7 @@ Llama Stack has many different use cases and are thoroughly detailed here, in th 1. How can I configure the vLLM pre-filled sample (e.g. I want to deploy a different model with vLLM; a custom model)? -- Any vLLM inference server and model that is compatible with vLLM will work with the Llama Stack implementation. Follow our [llm_inference_with_vllm blueprint](../llm_inference_with_vllm/README.md) for more details on setting up vLLM. +- Any vLLM inference server and model that is compatible with vLLM will work with the Llama Stack implementation. Follow our [llm_inference_with_vllm blueprint](../../model_serving/llm_inference_with_vllm/README.md) for more details on setting up vLLM. 2. 
Can I use a different inference engine than vLLM? diff --git a/docs/sample_blueprints/platform_feature_blueprints/deployment_groups/llama_stack_basic.json b/docs/sample_blueprints/other/llama-stack/llama_stack_basic.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/deployment_groups/llama_stack_basic.json rename to docs/sample_blueprints/other/llama-stack/llama_stack_basic.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/model_storage/README.md b/docs/sample_blueprints/other/model_storage/README.md similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/model_storage/README.md rename to docs/sample_blueprints/other/model_storage/README.md diff --git a/docs/sample_blueprints/platform_feature_blueprints/model_storage/download_closed_hf_model_to_object_storage.json b/docs/sample_blueprints/other/model_storage/download_closed_hf_model_to_object_storage.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/model_storage/download_closed_hf_model_to_object_storage.json rename to docs/sample_blueprints/other/model_storage/download_closed_hf_model_to_object_storage.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/model_storage/download_open_hf_model_to_object_storage.json b/docs/sample_blueprints/other/model_storage/download_open_hf_model_to_object_storage.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/model_storage/download_open_hf_model_to_object_storage.json rename to docs/sample_blueprints/other/model_storage/download_open_hf_model_to_object_storage.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/README.md b/docs/sample_blueprints/other/using_rdma_enabled_node_pools/README.md similarity index 99% rename from docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/README.md rename to docs/sample_blueprints/other/using_rdma_enabled_node_pools/README.md index 6dd6da5..5cd3a78 100644 --- a/docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/README.md +++ b/docs/sample_blueprints/other/using_rdma_enabled_node_pools/README.md @@ -88,7 +88,7 @@ One of the images in the table below must be imported into your tenancy in the c Once the image has been imported, it is now possible to deploy a shared node pool with RDMA connectivity with AI blueprints. -In addition to the parameters described in [the shared node pool doc](../shared_node_pools/README.md#without-selector), the following additional parameters are required: +In addition to the parameters described in [the shared node pool doc](../../platform_features/shared_node_pools/README.md#without-selector), the following additional parameters are required: - `"recipe_availability_domain": ""` -> full availability domain name where you have capacity for nodes. Examples: `"TrcQ:AP-MELBOURNE-1-AD-1"`, `"TrcQ:EU-FRANKFURT-1-AD-3"`. 
These can generally be found in the console via Hamburger (top left) -> Governance & Administration -> Tenancy Management -> Limits, Quotas and Usage diff --git a/docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/rdma_distributed_inference.json b/docs/sample_blueprints/other/using_rdma_enabled_node_pools/rdma_distributed_inference.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/rdma_distributed_inference.json rename to docs/sample_blueprints/other/using_rdma_enabled_node_pools/rdma_distributed_inference.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/rdma_shared_node_pool.json b/docs/sample_blueprints/other/using_rdma_enabled_node_pools/rdma_shared_node_pool.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/rdma_shared_node_pool.json rename to docs/sample_blueprints/other/using_rdma_enabled_node_pools/rdma_shared_node_pool.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/rdma_update_nodes.json b/docs/sample_blueprints/other/using_rdma_enabled_node_pools/rdma_update_nodes.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/using_rdma_enabled_node_pools/rdma_update_nodes.json rename to docs/sample_blueprints/other/using_rdma_enabled_node_pools/rdma_update_nodes.json diff --git a/docs/sample_blueprints/other/whisper_transcription/README.md b/docs/sample_blueprints/other/whisper_transcription/README.md new file mode 100644 index 0000000..4ae11d7 --- /dev/null +++ b/docs/sample_blueprints/other/whisper_transcription/README.md @@ -0,0 +1,146 @@ +# Whisper Transcription API + +#### Transcription + Summarization + Diarization Pipeline (FastAPI-powered) + +This blueprint provides a complete solution for running **audio/video transcription**, **speaker diarization**, and **summarization** via a RESTful API. It integrates [Faster-Whisper](https://github.com/guillaumekln/faster-whisper) for efficient transcription, [pyannote.audio](https://github.com/pyannote/pyannote-audio) for diarization, and Hugging Face instruction-tuned LLMs (e.g., Mistral-7B) for summarization. It supports multi-GPU acceleration, real-time streaming logs, and JSON/text output formats. + +--- + +## Pre-Filled Samples + +Below are pre-configured blueprints for deploying Whisper transcription using different GPU configurations on Oracle Cloud Infrastructure. 
+
+| Feature Showcase                                                      | Title              | Description                                                  | Blueprint File                                                      |
+| --------------------------------------------------------------------- | ------------------ | ------------------------------------------------------------ | -------------------------------------------------------------------- |
+| Deploy Whisper transcription on A10 GPU for real-time speech-to-text  | A10 Transcription  | Real-time audio transcription with Whisper on BM.GPU.A10.8   | [whisper-transcription-A10.json](whisper-transcription-A10.json)     |
+| Deploy Whisper transcription on A100 GPU for high-speed processing    | A100 Transcription | High-performance Whisper transcription using BM.GPU.A100.8   | [whisper-transcription-A100.json](whisper-transcription-A100.json)   |
+| Deploy Whisper transcription on H100 GPU for next-gen AI workloads    | H100 Transcription | Ultra-fast Whisper transcription on BM.GPU.H100.8            | [whisper-transcription-H100.json](whisper-transcription-H100.json)   |
+
+---
+
+## In-Depth Feature Overview
+
+| Capability           | Description                                                                                     |
+| -------------------- | ----------------------------------------------------------------------------------------------- |
+| Transcription        | Fast, multi-GPU inference with Faster-Whisper                                                   |
+| Summarization        | Uses Mistral-7B (or other HF models) to create summaries of long transcripts                    |
+| Speaker Diarization  | Global speaker labeling via pyannote.audio                                                      |
+| Denoising            | Hybrid removal of background noise using Demucs and noisereduce                                 |
+| Real-Time Streaming  | Logs stream live via HTTP if enabled                                                            |
+| Format Compatibility | Supports `.mp3`, `.wav`, `.flac`, `.aac`, `.m4a`, `.mp4`, `.webm`, `.mov`, `.mkv`, `.avi`, etc. |
+
+---
+
+## Deployment on OCI Blueprint
+
+### Sample Recipe (Service Mode)
+
+Please see this JSON file as an example: [whisper-transcription-A10.json](whisper-transcription-A10.json)
+
+### Endpoint
+
+```
+POST https://.nip.io/transcribe
+```
+
+**Example:**
+`https://whisper-transcription-a10-6666.130-162-199-33.nip.io/transcribe`
+
+---
+
+## API Parameters
+
+| Name               | Type   | Description                                                                                      |
+| ------------------ | ------ | ------------------------------------------------------------------------------------------------ |
+| `audio_url`        | string | URL to audio file in OCI Object Storage (requires PAR)                                           |
+| `model`            | string | Whisper model to use: `base`, `medium`, `large`, `turbo`, etc.                                   |
+| `summary`          | bool   | Whether to generate a summary (default: false). Requires `hf_token` if model path not provided   |
+| `speaker`          | bool   | Whether to run diarization (default: false). Requires `hf_token`                                 |
+| `max_speakers`     | int    | (Optional) Maximum number of speakers expected for diarization                                   |
+| `denoise`          | bool   | Whether to apply noise reduction                                                                  |
+| `streaming`        | bool   | Enables real-time logs via /stream_log endpoint                                                  |
+| `hf_token`         | string | Hugging Face access token (required for diarization or HF-hosted summarizers)                    |
+| `prop_decrease`    | float  | (Optional) Controls level of noise suppression. Range: 0.0–1.0 (default: 0.7)                    |
+| `summarized_model` | string | (Optional) Path or HF model ID for summarizer. Default: `mistralai/Mistral-7B-Instruct-v0.1`     |
+| `ground_truth`     | string | (Optional) Path to reference transcript file to compute WER                                      |
+
+---
+
+## Example cURL Command
+
+```bash
+curl -k -N -L -X POST https://.nip.io/transcribe \
+  -F "audio_url=" \
+  -F "model=turbo" \
+  -F "summary=true" \
+  -F "speaker=true" \
+  -F "streaming=true" \
+  -F "denoise=false" \
+  -F "hf_token=hf_xxxxxxx" \
+  -F "max_speakers=2"
+```
+
+---
+
+## Output Files
+
+Each processed audio generates the following:
+
+- `*.txt` – Human-readable transcript with speaker turns and timestamps
+- `*.json` – Full structured metadata: transcript, summary, diarization
+- `*.log` – Detailed processing log (useful for debugging or auditing)
+
+---
+
+## Streaming Logs
+
+If `streaming=true`, the response will contain a log filename:
+
+```json
+{
+  "meta": "logfile_name",
+  "logfile": "transcription_log_remote_audio_.log"
+}
+```
+
+To stream logs in real-time:
+
+```bash
+curl -N https://.nip.io/stream_log/
+```
+
+---
+
+## Hugging Face Access
+
+To enable diarization, accept model terms at:
+https://huggingface.co/pyannote/segmentation
+
+Generate token at:
+https://huggingface.co/settings/tokens
+
+---
+
+## Dependencies
+
+| Package              | Purpose                        |
+| -------------------- | ------------------------------ |
+| `faster-whisper`     | Core transcription engine      |
+| `transformers`       | Summarization via Hugging Face |
+| `pyannote.audio`     | Speaker diarization            |
+| `pydub`, `librosa`   | Audio chunking and processing  |
+| `demucs`             | Vocal separation / denoising   |
+| `fastapi`, `uvicorn` | REST API server                |
+| `jiwer`              | WER evaluation                 |
+
+---
+
+## Final Notes
+
+- The Whisper model is GPU-cached per thread for performance.
+- For more information about this project, please review the [docs](docs).
+- Please check out the [examples](examples) folder for more tests.
+- Diarization runs globally, not chunk-by-chunk.
+- Denoising is optional but improves quality on noisy files.
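+
+---
+
+## Putting It Together
+
+Putting the `/transcribe` and `/stream_log` calls above together, here is a small end-to-end sketch. The endpoint host and PAR URL are placeholders, it assumes `jq` is installed, and it assumes the `/transcribe` response returns the log filename promptly when `streaming=true` — if it does not, run the second command from a separate terminal:
+
+```bash
+# Submit a transcription job with live log streaming enabled and capture the JSON response
+RESPONSE=$(curl -k -s -L -X POST https://<your-endpoint>.nip.io/transcribe \
+  -F "audio_url=<PAR-URL-to-audio-file>" \
+  -F "model=turbo" \
+  -F "streaming=true")
+
+# Pull the log filename out of the response, then follow the log in real time
+LOGFILE=$(echo "$RESPONSE" | jq -r '.logfile')
+curl -k -N "https://<your-endpoint>.nip.io/stream_log/${LOGFILE}"
+```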
diff --git a/docs/whisper_transcription/docs/Whisper_Architecture.pdf b/docs/sample_blueprints/other/whisper_transcription/docs/Whisper_Architecture.pdf similarity index 100% rename from docs/whisper_transcription/docs/Whisper_Architecture.pdf rename to docs/sample_blueprints/other/whisper_transcription/docs/Whisper_Architecture.pdf diff --git a/docs/whisper_transcription/examples/test1/test.wav b/docs/sample_blueprints/other/whisper_transcription/examples/test1/test.wav similarity index 100% rename from docs/whisper_transcription/examples/test1/test.wav rename to docs/sample_blueprints/other/whisper_transcription/examples/test1/test.wav diff --git a/docs/whisper_transcription/examples/test1/test_all_transcripts_20250601_201349.txt b/docs/sample_blueprints/other/whisper_transcription/examples/test1/test_all_transcripts_20250601_201349.txt similarity index 100% rename from docs/whisper_transcription/examples/test1/test_all_transcripts_20250601_201349.txt rename to docs/sample_blueprints/other/whisper_transcription/examples/test1/test_all_transcripts_20250601_201349.txt diff --git a/docs/whisper_transcription/examples/test1/transcription_log_20250601_201340.log b/docs/sample_blueprints/other/whisper_transcription/examples/test1/transcription_log_20250601_201340.log similarity index 100% rename from docs/whisper_transcription/examples/test1/transcription_log_20250601_201340.log rename to docs/sample_blueprints/other/whisper_transcription/examples/test1/transcription_log_20250601_201340.log diff --git a/docs/whisper_transcription/examples/test2/transcription_log_20250601_203611.log b/docs/sample_blueprints/other/whisper_transcription/examples/test2/transcription_log_20250601_203611.log similarity index 100% rename from docs/whisper_transcription/examples/test2/transcription_log_20250601_203611.log rename to docs/sample_blueprints/other/whisper_transcription/examples/test2/transcription_log_20250601_203611.log diff --git a/docs/whisper_transcription/examples/test2/video1591686795.mp4 b/docs/sample_blueprints/other/whisper_transcription/examples/test2/video1591686795.mp4 similarity index 100% rename from docs/whisper_transcription/examples/test2/video1591686795.mp4 rename to docs/sample_blueprints/other/whisper_transcription/examples/test2/video1591686795.mp4 diff --git a/docs/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.json b/docs/sample_blueprints/other/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.json similarity index 100% rename from docs/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.json rename to docs/sample_blueprints/other/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.json diff --git a/docs/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.txt b/docs/sample_blueprints/other/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.txt similarity index 100% rename from docs/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.txt rename to docs/sample_blueprints/other/whisper_transcription/examples/test2/video1591686795_all_transcripts_20250601_203730.txt diff --git a/docs/whisper_transcription/examples/test3/audio1788670787.m4a b/docs/sample_blueprints/other/whisper_transcription/examples/test3/audio1788670787.m4a similarity index 100% rename from docs/whisper_transcription/examples/test3/audio1788670787.m4a rename to 
docs/sample_blueprints/other/whisper_transcription/examples/test3/audio1788670787.m4a diff --git a/docs/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.json b/docs/sample_blueprints/other/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.json similarity index 100% rename from docs/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.json rename to docs/sample_blueprints/other/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.json diff --git a/docs/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.txt b/docs/sample_blueprints/other/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.txt similarity index 100% rename from docs/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.txt rename to docs/sample_blueprints/other/whisper_transcription/examples/test3/audio1788670787_all_transcripts_20250601_191710.txt diff --git a/docs/whisper_transcription/examples/test3/transcription_log_20250601_191325.log b/docs/sample_blueprints/other/whisper_transcription/examples/test3/transcription_log_20250601_191325.log similarity index 100% rename from docs/whisper_transcription/examples/test3/transcription_log_20250601_191325.log rename to docs/sample_blueprints/other/whisper_transcription/examples/test3/transcription_log_20250601_191325.log diff --git a/docs/whisper_transcription/whisper-transcription-A10.json b/docs/sample_blueprints/other/whisper_transcription/whisper-transcription-A10.json similarity index 100% rename from docs/whisper_transcription/whisper-transcription-A10.json rename to docs/sample_blueprints/other/whisper_transcription/whisper-transcription-A10.json diff --git a/docs/whisper_transcription/whisper-transcription-A100.json b/docs/sample_blueprints/other/whisper_transcription/whisper-transcription-A100.json similarity index 100% rename from docs/whisper_transcription/whisper-transcription-A100.json rename to docs/sample_blueprints/other/whisper_transcription/whisper-transcription-A100.json diff --git a/docs/whisper_transcription/whisper-transcription-H100.json b/docs/sample_blueprints/other/whisper_transcription/whisper-transcription-H100.json similarity index 100% rename from docs/whisper_transcription/whisper-transcription-H100.json rename to docs/sample_blueprints/other/whisper_transcription/whisper-transcription-H100.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/deployment_groups/README.md b/docs/sample_blueprints/platform_features/deployment_groups/README.md similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/deployment_groups/README.md rename to docs/sample_blueprints/platform_features/deployment_groups/README.md diff --git a/docs/sample_blueprints/workload_blueprints/llama-stack/llama_stack_basic.json b/docs/sample_blueprints/platform_features/deployment_groups/llama_stack_basic.json similarity index 100% rename from docs/sample_blueprints/workload_blueprints/llama-stack/llama_stack_basic.json rename to docs/sample_blueprints/platform_features/deployment_groups/llama_stack_basic.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md b/docs/sample_blueprints/platform_features/shared_node_pools/README.md similarity index 97% rename from docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md rename to 
docs/sample_blueprints/platform_features/shared_node_pools/README.md index 4630cda..1a59758 100644 --- a/docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md +++ b/docs/sample_blueprints/platform_features/shared_node_pools/README.md @@ -23,7 +23,7 @@ Additional required fields: See [this recipe](./shared_node_pool_B200_BM.json) as an example for these parameters. -[This document section](../using_rdma_enabled_node_pools/README.md#import-a-custom-image) describes now to import a custom image and provides links to import custom images for various shapes. +[This document section](../../other/using_rdma_enabled_node_pools/README.md) describes now to import a custom image and provides links to import custom images for various shapes. ## Pre-Filled Samples diff --git a/docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/shared_node_pool_A10_BM.json b/docs/sample_blueprints/platform_features/shared_node_pools/shared_node_pool_A10_BM.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/shared_node_pool_A10_BM.json rename to docs/sample_blueprints/platform_features/shared_node_pools/shared_node_pool_A10_BM.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/shared_node_pool_A10_VM.json b/docs/sample_blueprints/platform_features/shared_node_pools/shared_node_pool_A10_VM.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/shared_node_pool_A10_VM.json rename to docs/sample_blueprints/platform_features/shared_node_pools/shared_node_pool_A10_VM.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/shared_node_pool_B200_BM.json b/docs/sample_blueprints/platform_features/shared_node_pools/shared_node_pool_B200_BM.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/shared_node_pool_B200_BM.json rename to docs/sample_blueprints/platform_features/shared_node_pools/shared_node_pool_B200_BM.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json b/docs/sample_blueprints/platform_features/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json rename to docs/sample_blueprints/platform_features/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/startup_liveness_readiness_probes/README.md b/docs/sample_blueprints/platform_features/startup_liveness_readiness_probes/README.md similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/startup_liveness_readiness_probes/README.md rename to docs/sample_blueprints/platform_features/startup_liveness_readiness_probes/README.md diff --git a/docs/sample_blueprints/platform_feature_blueprints/startup_liveness_readiness_probes/autoscale_with_fss.json b/docs/sample_blueprints/platform_features/startup_liveness_readiness_probes/autoscale_with_fss.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/startup_liveness_readiness_probes/autoscale_with_fss.json rename to docs/sample_blueprints/platform_features/startup_liveness_readiness_probes/autoscale_with_fss.json diff --git 
a/docs/sample_blueprints/platform_feature_blueprints/teams/README.md b/docs/sample_blueprints/platform_features/teams/README.md similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/teams/README.md rename to docs/sample_blueprints/platform_features/teams/README.md diff --git a/docs/sample_blueprints/platform_feature_blueprints/teams/create_job_with_team.json b/docs/sample_blueprints/platform_features/teams/create_job_with_team.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/teams/create_job_with_team.json rename to docs/sample_blueprints/platform_features/teams/create_job_with_team.json diff --git a/docs/sample_blueprints/platform_feature_blueprints/teams/create_team.json b/docs/sample_blueprints/platform_features/teams/create_team.json similarity index 100% rename from docs/sample_blueprints/platform_feature_blueprints/teams/create_team.json rename to docs/sample_blueprints/platform_features/teams/create_team.json diff --git a/docs/whisper_transcription/README.md b/docs/whisper_transcription/README.md deleted file mode 100644 index 99fa310..0000000 --- a/docs/whisper_transcription/README.md +++ /dev/null @@ -1,134 +0,0 @@ -# Whisper Transcription API - -### Transcription + Summarization + Diarization Pipeline (FastAPI-powered) - -This blueprint provides a complete solution for running **audio/video transcription**, **speaker diarization**, and **summarization** via a RESTful API. It integrates [Faster-Whisper](https://github.com/guillaumekln/faster-whisper) for efficient transcription, [pyannote.audio](https://github.com/pyannote/pyannote-audio) for diarization, and Hugging Face instruction-tuned LLMs (e.g., Mistral-7B) for summarization. It supports multi-GPU acceleration, real-time streaming logs, and JSON/text output formats. - ---- -## Pre-Filled Samples - -Below are pre-configured blueprints for deploying Whisper transcription using different GPU configurations on Oracle Cloud Infrastructure. - -| Feature Showcase Title | Description | Blueprint File | -|----------------------------------------------------------------------|-----------------------------------------------------------------------|-----------------------------------| -| Deploy Whisper transcription on A10 GPU for real-time speech-to-text | Real-time audio transcription with Whisper on BM.GPU.A10.8 | [whisper-transcription-A10.json](whisper-transcription-A10.json) | -| Deploy Whisper transcription on A100 GPU for high-speed processing | High-performance Whisper transcription using BM.GPU.A100.8 | [whisper-transcription-A100.json](whisper-transcription-A100.json) | -| Deploy Whisper transcription on H100 GPU for next-gen AI workloads | Ultra-fast Whisper transcription with Whisper on BM.GPU.H100.8 | [whisper-transcription-H100.json](whisper-transcription-H100.json) | - -## Key Features - -| Capability | Description | -|------------------------|-----------------------------------------------------------------------------------------------| -| Transcription | Fast, multi-GPU inference with Faster-Whisper | -| Summarization | Uses Mistral-7B (or other HF models) to create summaries of long transcripts | -| Speaker Diarization | Global speaker labeling via pyannote.audio | -| Denoising | Hybrid removal of background noise using Demucs and noisereduce | -| Real-Time Streaming | Logs stream live via HTTP if enabled | -| Format Compatibility | Supports `.mp3`, `.wav`, `.flac`, `.aac`, `.m4a`, `.mp4`, `.webm`, `.mov`, `.mkv`, `.avi`, etc. 
| - ---- - -## Deployment on OCI Blueprint - -### Sample Recipe (Service Mode) -please look at this json file as an example [whisper-transcription-A10.json](whisper-transcription-A10.json) - -### Endpoint -``` -POST https://.nip.io/transcribe -``` -**Example:** -`https://whisper-transcription-a10-6666.130-162-199-33.nip.io/transcribe` - ---- - -## API Parameters - -| Name | Type | Description | -|-------------------|-----------|-----------------------------------------------------------------------------------------------------------------------| -| `audio_url` | string | URL to audio file in OCI Object Storage (requires PAR) | -| `model` | string | Whisper model to use: `base`, `medium`, `large`, `turbo`, etc. | -| `summary` | bool | Whether to generate a summary (default: false). Requires `hf_token` if model path not provided | -| `speaker` | bool | Whether to run diarization (default: false). Requires `hf_token` | -| `max_speakers` | int | (Optional) Maximum number of speakers expected for diarization | -| `denoise` | bool | Whether to apply noise reduction | -| `streaming` | bool | Enables real-time logs via /stream_log endpoint | -| `hf_token` | string | Hugging Face access token (required for diarization or HF-hosted summarizers) | -| `prop_decrease` | float | (Optional) Controls level of noise suppression. Range: 0.0–1.0 (default: 0.7) | -| `summarized_model`| string | (Optional) Path or HF model ID for summarizer. Default: `mistralai/Mistral-7B-Instruct-v0.1` | -| `ground_truth` | string | (Optional) Path to reference transcript file to compute WER | - ---- - -## Example cURL Command -```bash -curl -k -N -L -X POST https://.nip.io/transcribe \ - -F "audio_url=" \ - -F "model=turbo" \ - -F "summary=true" \ - -F "speaker=true" \ - -F "streaming=true" \ - -F "denoise=false" \ - -F "hf_token=hf_xxxxxxx" \ - -F "max_speakers=2" -``` - ---- - -## Output Files - -Each processed audio generates the following: - -- `*.txt` – Human-readable transcript with speaker turns and timestamps -- `*.json` – Full structured metadata: transcript, summary, diarization -- `*.log` – Detailed processing log (useful for debugging or auditing) - ---- - -## Streaming Logs - -If `streaming=true`, the response will contain a log filename: -```json -{ - "meta": "logfile_name", - "logfile": "transcription_log_remote_audio_.log" -} -``` -To stream logs in real-time: -```bash -curl -N https://.nip.io/stream_log/ -``` - ---- - -## Hugging Face Access - -To enable diarization, accept model terms at: -https://huggingface.co/pyannote/segmentation - -Generate token at: -https://huggingface.co/settings/tokens - ---- - -## Dependencies - -| Package | Purpose | -|---------------------|----------------------------------| -| `faster-whisper` | Core transcription engine | -| `transformers` | Summarization via Hugging Face | -| `pyannote.audio` | Speaker diarization | -| `pydub`, `librosa` | Audio chunking and processing | -| `demucs` | Vocal separation / denoising | -| `fastapi`, `uvicorn`| REST API server | -| `jiwer` | WER evaluation | - ---- - -## Final Notes - -- Whisper model is GPU-cached per thread for performance. -- For more information about this project please review the [docs](docs) -- Please check out the [examples](examples) folder for more tests. -- Diarization runs globally, not chunk-by-chunk. -- Denoising is optional but improves quality on noisy files.