diff --git a/README.md b/README.md index ebe41ced..de2594d4 100644 --- a/README.md +++ b/README.md @@ -168,6 +168,7 @@ Practical deployment and model usage guides for Nemotron models. |-------|----------|--------------|-----------| | [**Nemotron 3 Super 120B A12B**](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) | Production deployments needing strong reasoning | 1M context, in NVFP4 single B200, RAG & tool calling | [Cookbooks](./usage-cookbook/Nemotron-3-Super) | | [**Nemotron 3 Nano 30B A3B**](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) | Resource-constrained environments | 1M context, sparse MoE hybrid Mamba-2, controllable reasoning | [Cookbooks](./usage-cookbook/Nemotron-3-Nano) | +| [**Llama-3.1-Nemotron-Nano-8B-v1**](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) | Small-footprint OCI deployments | Validated on private OKE in Phoenix with `vLLM`, OCI Bastion service, tool calling, and OpenAI-compatible `/v1` inference; provides a reproducible OCI path comparable to common AWS GPU/Kubernetes deployment patterns | [Cookbooks](./usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1) | | [**NVIDIA-Nemotron-Nano-12B-v2-VL**](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL) | Document intelligence and video understanding | 12B VLM, video reasoning, Efficient Video Sampling | [Cookbooks](./usage-cookbook/Nemotron-Nano2-VL/) | | [**Llama-3.1-Nemotron-Safety-Guard-8B-v3**](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3) | Multilingual content moderation | 9 languages, 23 safety categories | [Cookbooks](./usage-cookbook/Llama-3.1-Nemotron-Safety-Guard-V3/) | | **Nemotron-Parse** | Document parsing for RAG and AI agents | Table extraction, semantic segmentation | [Cookbooks](./usage-cookbook/Nemotron-Parse-v1.1/) | diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/README.md b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/README.md new file mode 100644 index 00000000..866a1b3c 
--- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/README.md @@ -0,0 +1,288 @@ +# Llama-3.1-Nemotron-Nano-8B-v1 on OCI OKE (Private Phoenix Deployment) + +This cookbook documents a validated private deployment of +`nvidia/Llama-3.1-Nemotron-Nano-8B-v1` on **Oracle Cloud Infrastructure (OCI)** using: + +- `us-phoenix-1` +- a **private** Oracle Kubernetes Engine (OKE) cluster +- a single `VM.GPU.A10.1` worker +- `vLLM` with an OpenAI-compatible `/v1` endpoint + +This guide is intentionally **private-only**: + +- no public Kubernetes API endpoint +- no public worker-node IPs +- no public inference endpoint + +Access is handled through **OCI Bastion** and local port forwarding. + +Note: the Terraform sample in this cookbook provisions the **OCI Bastion +service** for reproducible private access. It does **not** create a public +bastion host VM. + +This gives Nemotron users a reproducible Oracle Cloud deployment path that +leans into OCI's strengths for enterprise workloads: private OKE control +planes, managed Bastion access, and a clean separation between infrastructure +provisioning and model serving. + +## Why this configuration + +This setup keeps the footprint to a single GPU while preserving tool calling, +structured output, and streaming support. + +For teams evaluating cloud options for Nemotron, this sample shows that OCI can +offer a practical and well-contained production shape: private networking, +managed access, and a validated GPU-backed serving path in Phoenix.
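Structured output is one of the capabilities validated below. A minimal client-side sketch of a JSON-mode request body for the OpenAI-compatible `/v1/chat/completions` endpoint; the `response_format` field assumes vLLM's JSON-mode support, so confirm it against the vLLM version you deploy:

```python
import json

# Hypothetical helper: build a JSON-mode chat-completions request body.
# The "response_format" field assumes vLLM's OpenAI-compatible JSON-mode
# support; verify against your deployed vLLM version.
def build_structured_request(prompt: str) -> str:
    payload = {
        "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
        "max_tokens": 256,
    }
    return json.dumps(payload)

body = build_structured_request('Reply with {"status": "NEMOTRON_OK"} as JSON.')
```

POST this body to the local `/v1/chat/completions` endpoint once the port-forward described later in this guide is active.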
+ +Validated capabilities on this deployment: + +- chat completion +- structured output +- tool calling +- streaming +- async/concurrent requests +- OpenAI-compatible model discovery via `/v1/models` + +## Tested environment + +- Region: `us-phoenix-1` +- Kubernetes: OKE private cluster +- GPU shape: `VM.GPU.A10.1` +- Model: `nvidia/Llama-3.1-Nemotron-Nano-8B-v1` +- Serving stack: `vLLM` +- Inference API: OpenAI-compatible `/v1` + +## Architecture + +1. Create a **private** OKE cluster in Phoenix. +2. Create a CPU node pool and a GPU node pool. +3. Use **OCI Bastion** to reach the cluster API locally. +4. Deploy Nemotron with the checked-in `vLLM` values file. +5. Keep the inference service internal and validate it through a local + port-forward. + +## Prerequisites + +- OCI tenancy with Phoenix capacity for `VM.GPU.A10.1` +- OKE permissions +- OCI Bastion permissions +- `kubectl` +- `helm` +- access to pull the Nemotron model from Hugging Face or an equivalent model + artifact source accepted by your environment + +## Deployment notes + +This cookbook assumes a private OKE cluster. Keep these constraints: + +- disable the public Kubernetes control-plane endpoint +- do not attach public IPs to worker nodes +- do not expose the model through a public load balancer + +The known-good serving values are in +[`vllm_oke_phoenix_private_values.yaml`](./vllm_oke_phoenix_private_values.yaml). + +Terraform for the private Phoenix OKE infrastructure is available in +[`terraform/`](./terraform/). 
+ +That Terraform path was validated end to end in Phoenix through: + +- VCN and private subnets +- private OKE control plane +- OCI Bastion service +- CPU node pool +- GPU node pool on `VM.GPU.A10.1` + +Important settings for this single-A10 deployment: + +- `maxModelLen: 4096` +- `gpuMemoryUtilization: 0.95` +- `enableTool: true` +- `toolCallParser: llama3_json` +- `chatTemplate: /vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja` + +These settings were required to make the model stable on a single A10 while +preserving tool-calling behavior. + +## Example install flow + +Deploy the serving stack with the `vLLM Production Stack` Helm chart using the +checked-in values file: + +```bash +# Assumes the vLLM Production Stack chart repo is registered as `vllm`: +# helm repo add vllm https://vllm-project.github.io/production-stack +helm upgrade --install vllm vllm/vllm-stack \ + -n default \ + -f usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml +``` + +Use Bastion plus the private cluster endpoint for cluster access. Then +port-forward the router service locally: + +```bash +kubectl -n default port-forward svc/vllm-router-service 8080:80 +``` + +At that point, the local validation endpoint is: + +```text +http://127.0.0.1:8080/v1 +``` + +## Validation + +Health check: + +```bash +curl -s http://127.0.0.1:8080/health +``` + +Model discovery: + +```bash +curl -s http://127.0.0.1:8080/v1/models +``` + +Chat completion: + +```bash +curl -s http://127.0.0.1:8080/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1", + "messages": [{"role": "user", "content": "Reply with NEMOTRON_OK"}] + }' +``` + +## Tool-calling smoke test + +```bash +curl -s http://127.0.0.1:8080/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1", + "messages": [{"role": "user", "content": "What time is it in UTC?"}], + "tools": [{ + "type": "function", + "function": { + "name": "get_utc_time", + "description": "Return the current UTC time", + "parameters": { + "type": "object", +
"properties": {}, + "required": [] + } + } + }] + }' +``` + +Expected behavior: the model returns a tool call with `finish_reason` set to +`tool_calls`. + +## Query via OCI Bastion + +For this private deployment, query the cluster and model through the **OCI +Bastion service** plus local forwarding. + +Export the Terraform outputs: + +```bash +export BASTION_ID="" +export PRIVATE_API_HOST="" +export REGION="us-phoenix-1" +export OCI_CLI_PROFILE="API_KEY_AUTH" +``` + +Create a Bastion port-forwarding session to the private OKE API: + +```bash +oci bastion session create-port-forwarding \ + --bastion-id "$BASTION_ID" \ + --ssh-public-key-file ~/.ssh/id_ed25519.pub \ + --key-type PUB \ + --target-port 6443 \ + --target-private-ip "$PRIVATE_API_HOST" \ + --display-name nemotron-oke-api \ + --session-ttl 10800 \ + --region "$REGION" \ + --profile "$OCI_CLI_PROFILE" +``` + +Inspect the created session and copy the SSH command OCI returns: + +```bash +oci bastion session get \ + --session-id "" \ + --region "$REGION" \ + --profile "$OCI_CLI_PROFILE" +``` + +Run the returned SSH command so that the private Kubernetes API is reachable on +local port `6443`, then query the cluster: + +```bash +kubectl get nodes +kubectl -n default get pods +``` + +Port-forward the Nemotron router service: + +```bash +kubectl -n default port-forward svc/vllm-router-service 8080:80 +``` + +At that point, the private model is queryable locally without exposing a public +inference endpoint: + +```bash +curl -s http://127.0.0.1:8080/v1/models +``` + +```bash +curl -s http://127.0.0.1:8080/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1", + "messages": [{"role": "user", "content": "Reply with NEMOTRON_OK"}] + }' +``` + +## Operational notes + +- Phoenix provided a workable path for this deployment when Chicago GPU capacity + was not available. 
+- A single A10 is enough for the validated setup, but it requires conservative + context sizing. +- Private access plus local forwarding keeps the control plane and inference + path off the public internet. + +## Troubleshooting + +### The model pod starts but never becomes ready + +Reduce context pressure and ensure the `vLLM` values include: + +- `maxModelLen: 4096` +- `gpuMemoryUtilization: 0.95` + +### Tool calling does not work + +Make sure all of these are set: + +- `enableTool: true` +- `toolCallParser: llama3_json` +- `chatTemplate: /vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja` + +### `kubectl` cannot reach the cluster + +This guide assumes a **private** OKE cluster. Re-establish the Bastion tunnel +before using `kubectl`. + +### The endpoint is reachable but `/v1/models` is empty or wrong + +Confirm the deployment is serving: + +- `nvidia/Llama-3.1-Nemotron-Nano-8B-v1` + +and that the router service is forwarding to the Nemotron backend pods. diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.gitignore b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.gitignore new file mode 100644 index 00000000..1a22d40b --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.gitignore @@ -0,0 +1,5 @@ +.terraform/ +terraform.tfvars +terraform.tfstate +terraform.tfstate.* +tfplan diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.terraform.lock.hcl b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.terraform.lock.hcl new file mode 100644 index 00000000..a539a586 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.terraform.lock.hcl @@ -0,0 +1,145 @@ +# This file is maintained automatically by "terraform init". +# Manual edits may be lost in future updates. 
+ +provider "registry.terraform.io/hashicorp/cloudinit" { + version = "2.3.7" + constraints = ">= 2.2.0" + hashes = [ + "h1:M9TpQxKAE/hyOwytdX9MUNZw30HoD/OXqYIug5fkqH8=", + "zh:06f1c54e919425c3139f8aeb8fcf9bceca7e560d48c9f0c1e3bb0a8ad9d9da1e", + "zh:0e1e4cf6fd98b019e764c28586a386dc136129fef50af8c7165a067e7e4a31d5", + "zh:1871f4337c7c57287d4d67396f633d224b8938708b772abfc664d1f80bd67edd", + "zh:2b9269d91b742a71b2248439d5e9824f0447e6d261bfb86a8a88528609b136d1", + "zh:3d8ae039af21426072c66d6a59a467d51f2d9189b8198616888c1b7fc42addc7", + "zh:3ef4e2db5bcf3e2d915921adced43929214e0946a6fb11793085d9a48995ae01", + "zh:42ae54381147437c83cbb8790cc68935d71b6357728a154109d3220b1beb4dc9", + "zh:4496b362605ae4cbc9ef7995d102351e2fe311897586ffc7a4a262ccca0c782a", + "zh:652a2401257a12706d32842f66dac05a735693abcb3e6517d6b5e2573729ba13", + "zh:7406c30806f5979eaed5f50c548eced2ea18ea121e01801d2f0d4d87a04f6a14", + "zh:7848429fd5a5bcf35f6fee8487df0fb64b09ec071330f3ff240c0343fe2a5224", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + ] +} + +provider "registry.terraform.io/hashicorp/helm" { + version = "3.1.1" + constraints = ">= 3.0.1" + hashes = [ + "h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=", + "zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275", + "zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a", + "zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29", + "zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104", + "zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990", + "zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34", + "zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8", + "zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1", + "zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b", + "zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903", + 
"zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4", + "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c", + ] +} + +provider "registry.terraform.io/hashicorp/http" { + version = "3.5.0" + constraints = ">= 3.2.1" + hashes = [ + "h1:dl73+8wzQR++HFGoJgDqY3mj3pm14HUuH/CekVyOj5s=", + "zh:047c5b4920751b13425efe0d011b3a23a3be97d02d9c0e3c60985521c9c456b7", + "zh:157866f700470207561f6d032d344916b82268ecd0cf8174fb11c0674c8d0736", + "zh:1973eb9383b0d83dd4fd5e662f0f16de837d072b64a6b7cd703410d730499476", + "zh:212f833a4e6d020840672f6f88273d62a564f44acb0c857b5961cdb3bbc14c90", + "zh:2c8034bc039fffaa1d4965ca02a8c6d57301e5fa9fff4773e684b46e3f78e76a", + "zh:5df353fc5b2dd31577def9cc1a4ebf0c9a9c2699d223c6b02087a3089c74a1c6", + "zh:672083810d4185076c81b16ad13d1224b9e6ea7f4850951d2ab8d30fa6e41f08", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:7b4200f18abdbe39904b03537e1a78f21ebafe60f1c861a44387d314fda69da6", + "zh:843feacacd86baed820f81a6c9f7bd32cf302db3d7a0f39e87976ebc7a7cc2ee", + "zh:a9ea5096ab91aab260b22e4251c05f08dad2ed77e43e5e4fadcdfd87f2c78926", + "zh:d02b288922811739059e90184c7f76d45d07d3a77cc48d0b15fd3db14e928623", + ] +} + +provider "registry.terraform.io/hashicorp/null" { + version = "3.2.4" + constraints = ">= 3.2.1" + hashes = [ + "h1:L5V05xwp/Gto1leRryuesxjMfgZwjb7oool4WS1UEFQ=", + "zh:59f6b52ab4ff35739647f9509ee6d93d7c032985d9f8c6237d1f8a59471bbbe2", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:795c897119ff082133150121d39ff26cb5f89a730a2c8c26f3a9c1abf81a9c43", + "zh:7b9c7b16f118fbc2b05a983817b8ce2f86df125857966ad356353baf4bff5c0a", + "zh:85e33ab43e0e1726e5f97a874b8e24820b6565ff8076523cc2922ba671492991", + "zh:9d32ac3619cfc93eb3c4f423492a8e0f79db05fec58e449dee9b2d5873d5f69f", + "zh:9e15c3c9dd8e0d1e3731841d44c34571b6c97f5b95e8296a45318b94e5287a6e", + "zh:b4c2ab35d1b7696c30b64bf2c0f3a62329107bd1a9121ce70683dec58af19615", + 
"zh:c43723e8cc65bcdf5e0c92581dcbbdcbdcf18b8d2037406a5f2033b1e22de442", + "zh:ceb5495d9c31bfb299d246ab333f08c7fb0d67a4f82681fbf47f2a21c3e11ab5", + "zh:e171026b3659305c558d9804062762d168f50ba02b88b231d20ec99578a6233f", + "zh:ed0fe2acdb61330b01841fa790be00ec6beaac91d41f311fb8254f74eb6a711f", + ] +} + +provider "registry.terraform.io/hashicorp/random" { + version = "3.8.1" + constraints = ">= 3.4.3" + hashes = [ + "h1:u8AKlWVDTH5r9YLSeswoVEjiY72Rt4/ch7U+61ZDkiQ=", + "zh:08dd03b918c7b55713026037c5400c48af5b9f468f483463321bd18e17b907b4", + "zh:0eee654a5542dc1d41920bbf2419032d6f0d5625b03bd81339e5b33394a3e0ae", + "zh:229665ddf060aa0ed315597908483eee5b818a17d09b6417a0f52fd9405c4f57", + "zh:2469d2e48f28076254a2a3fc327f184914566d9e40c5780b8d96ebf7205f8bc0", + "zh:37d7eb334d9561f335e748280f5535a384a88675af9a9eac439d4cfd663bcb66", + "zh:741101426a2f2c52dee37122f0f4a2f2d6af6d852cb1db634480a86398fa3511", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:a902473f08ef8df62cfe6116bd6c157070a93f66622384300de235a533e9d4a9", + "zh:b85c511a23e57a2147355932b3b6dce2a11e856b941165793a0c3d7578d94d05", + "zh:c5172226d18eaac95b1daac80172287b69d4ce32750c82ad77fa0768be4ea4b8", + "zh:dab4434dba34aad569b0bc243c2d3f3ff86dd7740def373f2a49816bd2ff819b", + "zh:f49fd62aa8c5525a5c17abd51e27ca5e213881d58882fd42fec4a545b53c9699", + ] +} + +provider "registry.terraform.io/hashicorp/time" { + version = "0.13.1" + constraints = ">= 0.9.1" + hashes = [ + "h1:ZT5ppCNIModqk3iOkVt5my8b8yBHmDpl663JtXAIRqM=", + "zh:02cb9aab1002f0f2a94a4f85acec8893297dc75915f7404c165983f720a54b74", + "zh:04429b2b31a492d19e5ecf999b116d396dac0b24bba0d0fb19ecaefe193fdb8f", + "zh:26f8e51bb7c275c404ba6028c1b530312066009194db721a8427a7bc5cdbc83a", + "zh:772ff8dbdbef968651ab3ae76d04afd355c32f8a868d03244db3f8496e462690", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:898db5d2b6bd6ca5457dccb52eedbc7c5b1a71e4a4658381bcbb38cedbbda328", + 
"zh:8de913bf09a3fa7bedc29fec18c47c571d0c7a3d0644322c46f3aa648cf30cd8", + "zh:9402102c86a87bdfe7e501ffbb9c685c32bbcefcfcf897fd7d53df414c36877b", + "zh:b18b9bb1726bb8cfbefc0a29cf3657c82578001f514bcf4c079839b6776c47f0", + "zh:b9d31fdc4faecb909d7c5ce41d2479dd0536862a963df434be4b16e8e4edc94d", + "zh:c951e9f39cca3446c060bd63933ebb89cedde9523904813973fbc3d11863ba75", + "zh:e5b773c0d07e962291be0e9b413c7a22c044b8c7b58c76e8aa91d1659990dfb5", + ] +} + +provider "registry.terraform.io/oracle/oci" { + version = "8.5.0" + constraints = ">= 4.67.3, >= 7.30.0" + hashes = [ + "h1:YGSTTLRk0vpD4P0dJFt2lZ2XphT2skF9AxBGCkM04z4=", + "zh:0289ba575d3749068fc12fdbfa3f44b9780b21a23315eb2ca5bcf73065cc4fe7", + "zh:1152fd8451c2b74d87594fda1aa69e6a3f772189b902a592e91fcc57dfe3c48f", + "zh:3e4b1a2e345263e48d6be4d6d01fd5976b09af585e4a9314d318ab216304b8f1", + "zh:6b88ebb0ed7de80e324124511251561072c8a5f1ae222aa588063a1652ff72e8", + "zh:8ef61c735f19e1be9abeeb79debbeacd91e5996b4be5719d61323244e19ebe3d", + "zh:8fcdc6701173b59d78f076f8ce4ce01ef127bf5bf65323340e23c0b14da02f9d", + "zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425", + "zh:a03e6f788876b7408d811eb21056986e15c46876983637e7e5e645fff28d0587", + "zh:b1149065247943c0937359e0f2ed5fdce9c2a588e32e90b9c13be64f709f8121", + "zh:b375612ef300e7f53797552521d3ec10f3d9465ccbe6d96519314e32d6611c93", + "zh:daf49947168641d170f59907b2592f020ab17f5443e8f5a96174219112d51fe2", + "zh:e9649887105493b311cbaf180ba635186e1a4c3b5fe7e26ea9bfd06a52aa76f3", + "zh:f593bb15d46c5c998401fea9cc3fdf7950b81a53632ecb1bea8d2cc41971ccca", + "zh:f7f1f4d0c5922bd0403b989ebed168577164dbfc45181b2e19dcb888e1fc9df7", + "zh:fafce2b47e3227dc8068db4f2bf223c4a4b8fefe39f50aeced467eed1bd901e3", + ] +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/README.md b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/README.md new file mode 100644 index 00000000..9d0fe219 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/README.md @@ 
-0,0 +1,97 @@ +# Terraform: Private OCI OKE for Llama-3.1-Nemotron-Nano-8B-v1 + +This Terraform example provisions the **private-only** OCI infrastructure for +the validated Phoenix deployment described in the parent cookbook. + +It is intended to give Nemotron users a reproducible OCI path for NVIDIA model +serving that highlights Oracle Cloud's operational strengths: private OKE, +managed Bastion access, and a clean infrastructure-as-code path for GPU-backed +Nemotron deployments. + +It creates: + +- a VCN +- a **private** OKE cluster +- a private CPU node pool +- a private GPU node pool targeting `VM.GPU.A10.1` +- an **OCI Bastion service** resource for private access + +It does **not** create: + +- a public Kubernetes API endpoint +- public worker-node IPs +- a public bastion host +- a public inference endpoint + +## Bastion note + +This sample provisions the **OCI Bastion service** so that private-cluster +access is reproducible from Terraform. + +That is intentionally different from creating a public bastion VM: + +- no public bastion compute instance is created +- no worker node receives a public IP +- the Kubernetes API remains private + +If your environment already manages private-cluster access through a separate +operator workflow, you can remove the `oci_bastion_bastion` resource and keep +the rest of the sample unchanged. 
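Once `terraform apply` completes, the Bastion session from the parent cookbook can be rendered directly from the Terraform outputs. A convenience sketch; the flags mirror the `oci bastion session create-port-forwarding` invocation documented there, while the key path and TTL are assumptions to adjust:

```python
# Hypothetical helper: assemble the Bastion port-forwarding command from
# Terraform outputs (bastion OCID and private API host). The SSH key path
# and session TTL are assumptions; adjust them for your environment.
def bastion_forward_cmd(bastion_id: str, api_host: str,
                        region: str = "us-phoenix-1",
                        profile: str = "API_KEY_AUTH") -> str:
    parts = [
        "oci bastion session create-port-forwarding",
        f"--bastion-id {bastion_id}",
        "--ssh-public-key-file ~/.ssh/id_ed25519.pub",
        "--key-type PUB",
        "--target-port 6443",
        f"--target-private-ip {api_host}",
        "--display-name nemotron-oke-api",
        "--session-ttl 10800",
        f"--region {region}",
        f"--profile {profile}",
    ]
    # Join with backslash-newline continuations for readable shell output.
    return " \\\n  ".join(parts)

cmd = bastion_forward_cmd("ocid1.bastion.oc1.phx.example", "10.0.0.10")
```

Feed `terraform output`'s `oci_bastion_id` and `apiserver_private_host` values into the two arguments.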
+ +## Module choice + +This wrapper intentionally uses Oracle's official OKE Terraform module: + +- `oracle-terraform-modules/oke/oci` + +The Nemotron-specific layer in this directory adds: + +- the Phoenix defaults +- the no-public-IP constraints +- the A10-focused worker pool defaults +- the OCI Bastion service resource required for private access + +## Files + +- [`main.tf`](./main.tf) - private OKE cluster, worker pools, OCI Bastion +- [`variables.tf`](./variables.tf) - deployment inputs +- [`outputs.tf`](./outputs.tf) - useful IDs and private endpoint information +- [`terraform.tfvars.example`](./terraform.tfvars.example) - starting point + +## Usage + +```bash +cp terraform.tfvars.example terraform.tfvars +terraform init +terraform plan +terraform apply +``` + +The validated live run completed successfully in `us-phoenix-1`, including: + +- private OKE cluster creation +- OCI Bastion service creation +- CPU node pool creation +- GPU node pool creation on `VM.GPU.A10.1` in `PHX-AD-2` + +After the infrastructure is ready: + +1. create an OCI Bastion session to reach the private cluster +2. deploy the model with: + - [`../vllm_oke_phoenix_private_values.yaml`](../vllm_oke_phoenix_private_values.yaml) +3. validate: + - `/health` + - `/v1/models` + - chat completion + - tool calling + - streaming + +## Notes + +- The validated live deployment used `us-phoenix-1`. +- The validated GPU pool used Phoenix `AD-2`, exposed as `gpu_placement_ads`. +- The Bastion resource here is the OCI managed Bastion service, not a public + bastion VM. +- `ssh_public_key_path` must point to an actual OpenSSH public key file; the + wrapper reads the file contents with Terraform's `file()` function before + passing it to OKE. 
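The validation list above includes tool calling; once the model is reachable through the port-forward, responses can be checked client-side. A minimal sketch assuming the standard OpenAI-compatible response shape (the sample payload below is illustrative, not captured output):

```python
# Extract tool calls from an OpenAI-compatible chat-completions response.
# Returns an empty list unless the model actually stopped to call a tool.
def extract_tool_calls(response: dict) -> list:
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    return choice["message"].get("tool_calls", [])

# Illustrative response shape only; a real deployment returns this from
# /v1/chat/completions when the tool-calling smoke test succeeds.
sample = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {"tool_calls": [{
            "type": "function",
            "function": {"name": "get_utc_time", "arguments": "{}"},
        }]},
    }]
}
calls = extract_tool_calls(sample)
```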
diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/main.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/main.tf new file mode 100644 index 00000000..e9b07078 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/main.tf @@ -0,0 +1,112 @@ +provider "oci" { + config_file_profile = var.config_file_profile + tenancy_ocid = var.tenancy_ocid + region = var.region +} + +locals { + common_tags = merge(var.freeform_tags, { + model = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1" + deployment = "private-oke" + region = var.region + }) +} + +module "oke" { + source = "oracle-terraform-modules/oke/oci" + version = "5.4.1" + + providers = { + oci.home = oci + } + + tenancy_id = var.tenancy_ocid + compartment_id = var.compartment_ocid + region = var.region + + cluster_name = var.cluster_name + kubernetes_version = var.kubernetes_version + cluster_type = "enhanced" + cni_type = "flannel" + pods_cidr = var.pods_cidr + services_cidr = var.services_cidr + vcn_cidrs = var.vcn_cidrs + ssh_public_key = file(var.ssh_public_key_path) + output_detail = true + create_vcn = true + create_bastion = false + create_operator = false + control_plane_is_public = false + assign_public_ip_to_control_plane = false + worker_is_public = false + allow_worker_internet_access = true + allow_pod_internet_access = true + allow_worker_ssh_access = false + preferred_load_balancer = "internal" + load_balancers = "internal" + freeform_tags = { all = local.common_tags } + + subnets = { + cp = { + create = "always" + newbits = 13 + netnum = 2 + } + workers = { + create = "always" + newbits = 2 + netnum = 1 + } + pods = { + create = "always" + newbits = 2 + netnum = 2 + } + int_lb = { + create = "always" + newbits = 11 + netnum = 16 + } + pub_lb = { + create = "never" + } + bastion = { + create = "never" + } + operator = { + create = "never" + } + } + + worker_pool_mode = "node-pool" + worker_pool_size = 1 + worker_pools = { + cpu = { + size = var.cpu_pool_size + shape = 
var.cpu_shape + ocpus = var.cpu_ocpus + memory = var.cpu_memory_gbs + boot_volume_size = 100 + assign_public_ip = false + create = true + } + gpu = { + size = var.gpu_pool_size + shape = var.gpu_shape + boot_volume_size = var.gpu_boot_volume_size + assign_public_ip = false + create = true + placement_ads = var.gpu_placement_ads + } + } +} + +resource "oci_bastion_bastion" "oci_bastion" { + compartment_id = var.compartment_ocid + bastion_type = "STANDARD" + target_subnet_id = module.oke.worker_subnet_id + client_cidr_block_allow_list = var.bastion_client_cidrs + max_session_ttl_in_seconds = 10800 + name = "${var.cluster_name}-bastion" + freeform_tags = local.common_tags +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/outputs.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/outputs.tf new file mode 100644 index 00000000..c39a82ee --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/outputs.tf @@ -0,0 +1,34 @@ +output "cluster_id" { + description = "OKE cluster OCID." + value = module.oke.cluster_id +} + +output "cluster_endpoints" { + description = "Cluster endpoints; private endpoint should be used." + value = module.oke.cluster_endpoints +} + +output "apiserver_private_host" { + description = "Private control-plane host." + value = module.oke.apiserver_private_host +} + +output "vcn_id" { + description = "VCN used by the Nemotron deployment." + value = module.oke.vcn_id +} + +output "control_plane_subnet_id" { + description = "Private control-plane subnet." + value = module.oke.control_plane_subnet_id +} + +output "worker_subnet_id" { + description = "Private worker subnet." + value = module.oke.worker_subnet_id +} + +output "oci_bastion_id" { + description = "OCI Bastion service OCID for creating private sessions." 
+ value = oci_bastion_bastion.oci_bastion.id +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/terraform.tfvars.example b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/terraform.tfvars.example new file mode 100644 index 00000000..9a2bab0c --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/terraform.tfvars.example @@ -0,0 +1,12 @@ +tenancy_ocid = "ocid1.tenancy.oc1..exampleuniqueID" +compartment_ocid = "ocid1.compartment.oc1..exampleuniqueID" +config_file_profile = "API_KEY_AUTH" +region = "us-phoenix-1" +cluster_name = "nemotron-phx-private" +ssh_public_key_path = "~/.ssh/id_ed25519.pub" + +# Restrict Bastion session creation to your current client egress CIDR. +bastion_client_cidrs = ["203.0.113.10/32"] + +# The validated deployment used Phoenix AD-2 for the A10 node pool. +gpu_placement_ads = [2] diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/variables.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/variables.tf new file mode 100644 index 00000000..165cabf5 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/variables.tf @@ -0,0 +1,115 @@ +variable "tenancy_ocid" { + description = "OCI tenancy OCID." + type = string +} + +variable "compartment_ocid" { + description = "Compartment where the OKE cluster and Bastion service will be created." + type = string +} + +variable "region" { + description = "OCI region for the deployment." + type = string + default = "us-phoenix-1" +} + +variable "config_file_profile" { + description = "OCI CLI config profile name." + type = string + default = "DEFAULT" +} + +variable "cluster_name" { + description = "Name prefix for the private Nemotron OKE deployment." + type = string + default = "nemotron-oci-phx" +} + +variable "ssh_public_key_path" { + description = "Path to the OpenSSH public key file used for private worker access." 
+ type = string +} + +variable "vcn_cidrs" { + description = "VCN CIDR blocks for the deployment." + type = list(string) + default = ["10.0.0.0/16"] +} + +variable "pods_cidr" { + description = "Kubernetes pods CIDR." + type = string + default = "10.244.0.0/16" +} + +variable "services_cidr" { + description = "Kubernetes services CIDR." + type = string + default = "10.96.0.0/16" +} + +variable "kubernetes_version" { + description = "OKE Kubernetes version." + type = string + default = "v1.33.1" +} + +variable "cpu_pool_size" { + description = "Number of CPU worker nodes." + type = number + default = 1 +} + +variable "cpu_shape" { + description = "Shape for the CPU worker pool." + type = string + default = "VM.Standard.E5.Flex" +} + +variable "cpu_ocpus" { + description = "OCPUs for each CPU worker if using a flex shape." + type = number + default = 2 +} + +variable "cpu_memory_gbs" { + description = "Memory in GB for each CPU worker if using a flex shape." + type = number + default = 16 +} + +variable "gpu_pool_size" { + description = "Number of GPU worker nodes." + type = number + default = 1 +} + +variable "gpu_shape" { + description = "Shape for the GPU worker pool." + type = string + default = "VM.GPU.A10.1" +} + +variable "gpu_boot_volume_size" { + description = "Boot volume size for GPU workers." + type = number + default = 200 +} + +variable "gpu_placement_ads" { + description = "Availability domains to target for the GPU node pool. Phoenix AD-2 is `[2]`." + type = list(number) + default = [2] +} + +variable "bastion_client_cidrs" { + description = "CIDR blocks allowed to create OCI Bastion sessions." + type = list(string) +} + +variable "freeform_tags" { + description = "Optional freeform tags." 
+ type = map(string) + default = {} +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/versions.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/versions.tf new file mode 100644 index 00000000..1c9c0264 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/versions.tf @@ -0,0 +1,10 @@ +terraform { + required_version = ">= 1.5.0" + + required_providers { + oci = { + source = "oracle/oci" + version = ">= 7.30.0" + } + } +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml new file mode 100644 index 00000000..076bcb83 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml @@ -0,0 +1,30 @@ +# Validated private OCI OKE deployment values for +# nvidia/Llama-3.1-Nemotron-Nano-8B-v1 on a single VM.GPU.A10.1 node. + +servingEngineSpec: + runtimeClassName: "" + modelSpec: + - name: "llama31-nemotron-nano-8b" + repository: "vllm/vllm-openai" + tag: "latest" + modelURL: "nvidia/Llama-3.1-Nemotron-Nano-8B-v1" + enableTool: true + toolCallParser: "llama3_json" + chatTemplate: "/vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja" + replicaCount: 1 + requestCPU: 4 + requestMemory: "24Gi" + requestGPU: 1 + pvcStorage: "120Gi" + pvcAccessMode: + - ReadWriteOnce + storageClass: "oci-block-storage-enc" + nodeSelector: + app: gpu + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" + vllmConfig: + maxModelLen: 4096 + gpuMemoryUtilization: 0.95 diff --git a/usage-cookbook/README.md b/usage-cookbook/README.md index f7d79b5c..001121f6 100644 --- a/usage-cookbook/README.md +++ b/usage-cookbook/README.md @@ -13,5 +13,4 @@ This directory contains cookbook-style guides showing how to deploy and use the - **SGLang Deployment** - Tutorials on serving and interacting with Nemotron via SGLang - **NIM Microservice** - Guide to 
deploying Nemotron as scalable, production-ready endpoints using NVIDIA Inference Microservices (NIM). - **Hugging Face Transformers** - Direct loading and inference of Nemotron models with Hugging Face Transformers - - +- **OCI OKE Private Deployment** - A private-only deployment guide, validated in `us-phoenix-1`, for `nvidia/Llama-3.1-Nemotron-Nano-8B-v1` using OKE, the OCI Bastion service, and `vLLM`; it provides a reproducible Oracle Cloud counterpart to common AWS GPU/Kubernetes deployment patterns.