1 change: 1 addition & 0 deletions README.md
@@ -168,6 +168,7 @@ Practical deployment and model usage guides for Nemotron models.
|-------|----------|--------------|-----------|
| [**Nemotron 3 Super 120B A12B**](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) | Production deployments needing strong reasoning | 1M context, in NVFP4 single B200, RAG & tool calling | [Cookbooks](./usage-cookbook/Nemotron-3-Super) |
| [**Nemotron 3 Nano 30B A3B**](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) | Resource-constrained environments | 1M context, sparse MoE hybrid Mamba-2, controllable reasoning | [Cookbooks](./usage-cookbook/Nemotron-3-Nano) |
| [**Llama-3.1-Nemotron-Nano-8B-v1**](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) | Small-footprint OCI deployments | Validated on private OKE in Phoenix with `vLLM`, OCI Bastion service, tool calling, and OpenAI-compatible `/v1` inference; provides a reproducible OCI path comparable to common AWS GPU/Kubernetes deployment patterns | [Cookbooks](./usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1) |
| [**NVIDIA-Nemotron-Nano-12B-v2-VL**](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL) | Document intelligence and video understanding | 12B VLM, video reasoning, Efficient Video Sampling | [Cookbooks](./usage-cookbook/Nemotron-Nano2-VL/) |
| [**Llama-3.1-Nemotron-Safety-Guard-8B-v3**](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3) | Multilingual content moderation | 9 languages, 23 safety categories | [Cookbooks](./usage-cookbook/Llama-3.1-Nemotron-Safety-Guard-V3/) |
| **Nemotron-Parse** | Document parsing for RAG and AI agents | Table extraction, semantic segmentation | [Cookbooks](./usage-cookbook/Nemotron-Parse-v1.1/) |
288 changes: 288 additions & 0 deletions usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/README.md
@@ -0,0 +1,288 @@
# Llama-3.1-Nemotron-Nano-8B-v1 on OCI OKE (Private Phoenix Deployment)

This cookbook documents a validated private deployment of
`nvidia/Llama-3.1-Nemotron-Nano-8B-v1` on **Oracle Cloud Infrastructure (OCI)** using:

- `us-phoenix-1`
- a **private** Oracle Kubernetes Engine (OKE) cluster
- a single `VM.GPU.A10.1` worker
- `vLLM` with an OpenAI-compatible `/v1` endpoint

This guide is intentionally **private-only**:

- no public Kubernetes API endpoint
- no public worker-node IPs
- no public inference endpoint

Access is handled through **OCI Bastion** and local port forwarding.

Note: the Terraform sample in this cookbook provisions the **OCI Bastion
service** for reproducible private access. It does **not** create a public
bastion host VM.

This gives Nemotron users a reproducible Oracle Cloud deployment path that
leans into OCI's strengths for enterprise workloads: private OKE control
planes, managed Bastion access, and a clean separation between infrastructure
provisioning and model serving.

## Why this configuration

The deployment fits on a single GPU while preserving tool calling, structured
output, and streaming support.

For teams evaluating cloud options for Nemotron, this sample shows that OCI can
offer a practical and well-contained production shape: private networking,
managed access, and a validated GPU-backed serving path in Phoenix.

Validated capabilities on this deployment:

- chat completion
- structured output
- tool calling
- streaming
- async/concurrent requests
- OpenAI-compatible model discovery via `/v1/models`

## Tested environment

- Region: `us-phoenix-1`
- Kubernetes: OKE private cluster
- GPU shape: `VM.GPU.A10.1`
- Model: `nvidia/Llama-3.1-Nemotron-Nano-8B-v1`
- Serving stack: `vLLM`
- Inference API: OpenAI-compatible `/v1`

## Architecture

1. Create a **private** OKE cluster in Phoenix.
2. Create a CPU node pool and a GPU node pool.
3. Use **OCI Bastion** to reach the cluster API locally.
4. Deploy Nemotron with the checked-in `vLLM` values file.
5. Keep the inference service internal and validate it through a local
port-forward.

## Prerequisites

- OCI tenancy with Phoenix capacity for `VM.GPU.A10.1`
- OKE permissions
- OCI Bastion permissions
- `kubectl`
- `helm`
- access to pull the Nemotron model from Hugging Face or an equivalent model
artifact source accepted by your environment

## Deployment notes

This cookbook assumes a private OKE cluster. Keep these constraints:

- disable the public Kubernetes control-plane endpoint
- do not attach public IPs to worker nodes
- do not expose the model through a public load balancer

The known-good serving values are in
[`vllm_oke_phoenix_private_values.yaml`](./vllm_oke_phoenix_private_values.yaml).

Terraform for the private Phoenix OKE infrastructure is available in
[`terraform/`](./terraform/).

That Terraform path was validated end to end in Phoenix, covering:

- VCN and private subnets
- private OKE control plane
- OCI Bastion service
- CPU node pool
- GPU node pool on `VM.GPU.A10.1`

Important settings for this single-A10 deployment:

- `maxModelLen: 4096`
- `gpuMemoryUtilization: 0.95`
- `enableTool: true`
- `toolCallParser: llama3_json`
- `chatTemplate: /vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja`

These settings were required to make the model stable on a single A10 while
preserving tool-calling behavior.
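
A quick pre-flight check can catch a values file that is missing one of these
settings before the chart is installed. The sketch below is not part of the
cookbook: `check_values` is a hypothetical helper name, and it only looks for
the literal `key: value` strings listed above. It does not parse YAML or check
where the keys are nested, so a value that is quoted differently in the file
would be reported as missing.

```bash
# Hypothetical pre-flight helper: verifies that a values file mentions every
# setting listed above. It matches literal text only; it does not parse YAML.
check_values() {
  values_file="$1"
  missing=0
  while IFS= read -r setting; do
    # -F: fixed string match, -q: quiet, --: end of options
    if ! grep -Fq -- "$setting" "$values_file"; then
      echo "missing: $setting" >&2
      missing=1
    fi
  done <<'EOF'
maxModelLen: 4096
gpuMemoryUtilization: 0.95
enableTool: true
toolCallParser: llama3_json
chatTemplate: /vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja
EOF
  return "$missing"
}
```

For example, `check_values ./vllm_oke_phoenix_private_values.yaml && echo "values look complete"`
prints the confirmation only when all five settings are present.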

## Example install flow

Deploy the serving stack with the `vLLM Production Stack` Helm chart using the
checked-in values file:

```bash
helm upgrade --install vllm <path-to-vllm-production-stack-helm-chart> \
  -n default \
  -f usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml
```

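Cluster access goes through OCI Bastion and the private cluster endpoint, so
the Bastion tunnel must be up before running `helm` or `kubectl`. Once the
chart is installed, port-forward the router service locally: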

```bash
kubectl -n default port-forward svc/vllm-router-service 8080:80
```

At that point, the local validation endpoint is:

```text
http://127.0.0.1:8080/v1
```

## Validation

Health check:

```bash
curl -s http://127.0.0.1:8080/health
```
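
Shortly after install the serving pod may still be loading the model, so a
single probe can fail even when the rollout is healthy. A minimal sketch of a
retry loop around the same health check follows. `wait_for_health` is a
hypothetical helper name, and the default attempt count and delay are
arbitrary assumptions to tune for your environment.

```bash
# Hypothetical helper: polls the health endpoint until it answers with a
# successful status or the attempts run out.
# $1 = URL, $2 = number of attempts, $3 = delay in seconds between attempts.
wait_for_health() {
  url="${1:-http://127.0.0.1:8080/health}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -o /dev/null discards the body.
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Calling `wait_for_health && echo "router is healthy"` with no arguments probes
the local port-forward shown above.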

Model discovery:

```bash
curl -s http://127.0.0.1:8080/v1/models
```

Chat completion:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
    "messages": [{"role": "user", "content": "Reply with NEMOTRON_OK"}]
  }'
```
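
Streaming is among the validated capabilities, and the same request can be
sent with streaming enabled. This is a minimal sketch, assuming the router
passes the standard OpenAI `stream` field through to `vLLM`. `chat_payload` is
a hypothetical helper name, and it does no JSON escaping, so it only suits
prompts without quotes or backslashes.

```bash
MODEL="nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
BASE_URL="${BASE_URL:-http://127.0.0.1:8080/v1}"

# Hypothetical helper: prints a chat-completions request body.
# $1 = prompt (no quotes or backslashes), $2 = true or false for streaming.
chat_payload() {
  printf '{"model": "%s", "stream": %s, "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$MODEL" "$2" "$1"
}

# Usage against the local port-forward. -N disables curl's output buffering so
# chunks are printed as they arrive; -d @- reads the request body from stdin.
#   chat_payload "Reply with NEMOTRON_OK" true |
#     curl -sN "$BASE_URL/chat/completions" -H 'Content-Type: application/json' -d @-
```

Passing `false` as the second argument produces the non-streaming form of the
same request.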

## Tool-calling smoke test

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
    "messages": [{"role": "user", "content": "What time is it in UTC?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_utc_time",
        "description": "Return the current UTC time",
        "parameters": {
          "type": "object",
          "properties": {},
          "required": []
        }
      }
    }]
  }'
```

Expected behavior: the model returns a tool call with `finish_reason` set to
`tool_calls`.
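
The expected behavior can be turned into a pass/fail check. The sketch below
is a plain-text match on the response rather than a JSON parser, and
`is_tool_call` is a hypothetical helper name.

```bash
# Hypothetical helper: reads a chat-completions JSON response on stdin and
# succeeds only if it reports finish_reason "tool_calls".
# This is a text match; it tolerates whitespace around the colon.
is_tool_call() {
  grep -Eq '"finish_reason"[[:space:]]*:[[:space:]]*"tool_calls"'
}
```

Piping the output of the smoke-test `curl` command into
`is_tool_call && echo "tool calling OK"` prints the confirmation only when the
model answered with a tool call.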

## Query via OCI Bastion

For this private deployment, query the cluster and model through the **OCI
Bastion service** plus local forwarding.

Export the Terraform outputs:

```bash
export BASTION_ID="<terraform output oci_bastion_id>"
export PRIVATE_API_HOST="<terraform output apiserver_private_host>"
export REGION="us-phoenix-1"
export OCI_CLI_PROFILE="API_KEY_AUTH"
```

Create a Bastion port-forwarding session to the private OKE API:

```bash
oci bastion session create-port-forwarding \
  --bastion-id "$BASTION_ID" \
  --ssh-public-key-file ~/.ssh/id_ed25519.pub \
  --key-type PUB \
  --target-port 6443 \
  --target-private-ip "$PRIVATE_API_HOST" \
  --display-name nemotron-oke-api \
  --session-ttl 10800 \
  --region "$REGION" \
  --profile "$OCI_CLI_PROFILE"
```

Inspect the created session and copy the SSH command OCI returns:

```bash
oci bastion session get \
  --session-id "<session_ocid>" \
  --region "$REGION" \
  --profile "$OCI_CLI_PROFILE"
```
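
The SSH command in the session details may contain placeholders rather than
real values. The sketch below assumes they are spelled `<privateKey>` and
`<localPort>`, so check the actual output before relying on it.
`fill_ssh_command` is a hypothetical helper that substitutes both. The private
key should be the counterpart of the public key passed when the session was
created (`~/.ssh/id_ed25519` for the example above).

```bash
# Hypothetical helper: fills the <privateKey> and <localPort> placeholders in
# the SSH command string copied from the session details.
# $1 = command template, $2 = private key path, $3 = local port.
# The key path must not contain '|' or '&', which sed would misinterpret here.
fill_ssh_command() {
  printf '%s\n' "$1" | sed -e "s|<privateKey>|$2|g" -e "s|<localPort>|$3|g"
}
```

For example, `fill_ssh_command "$SSH_TEMPLATE" "$HOME/.ssh/id_ed25519" 6443`
prints a command that forwards local port `6443`, where `SSH_TEMPLATE` is a
variable you set to the copied command.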

Run the returned SSH command so that the private Kubernetes API is reachable on
local port `6443`, then query the cluster:

```bash
kubectl get nodes
kubectl -n default get pods
```

Port-forward the Nemotron router service:

```bash
kubectl -n default port-forward svc/vllm-router-service 8080:80
```

At that point, the private model is queryable locally without exposing a public
inference endpoint:

```bash
curl -s http://127.0.0.1:8080/v1/models
```

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
    "messages": [{"role": "user", "content": "Reply with NEMOTRON_OK"}]
  }'
```

## Operational notes

- Phoenix provided a workable path for this deployment when Chicago GPU capacity
was not available.
- A single A10 is enough for the validated setup, but it requires conservative
context sizing.
- Private access plus local forwarding keeps the control plane and inference
path off the public internet.

## Troubleshooting

### The model pod starts but never becomes ready

Reduce context pressure and ensure the `vLLM` values include:

- `maxModelLen: 4096`
- `gpuMemoryUtilization: 0.95`

### Tool calling does not work

Make sure all of these are set:

- `enableTool: true`
- `toolCallParser: llama3_json`
- `chatTemplate: /vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja`

### `kubectl` cannot reach the cluster

This guide assumes a **private** OKE cluster. Re-establish the Bastion tunnel
before using `kubectl`.

### The endpoint is reachable but `/v1/models` is empty or wrong

Confirm the deployment is serving:

- `nvidia/Llama-3.1-Nemotron-Nano-8B-v1`

and that the router service is forwarding to the Nemotron backend pods.
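
The first half of that check can be scripted against the `/v1/models`
response. This is a minimal sketch that assumes the response lists each model
under an `"id"` field, as in the OpenAI-compatible format. `has_model` is a
hypothetical helper name, and it does a text match rather than parsing JSON.

```bash
# Hypothetical helper: reads a /v1/models JSON response on stdin and succeeds
# only if the given model id is listed.
# Whitespace is stripped first so both compact and pretty-printed JSON match.
has_model() {
  tr -d '[:space:]' | grep -Fq -- "\"id\":\"$1\""
}
```

For example,
`curl -s http://127.0.0.1:8080/v1/models | has_model nvidia/Llama-3.1-Nemotron-Nano-8B-v1`
exits non-zero when the model is missing from the list.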
@@ -0,0 +1,5 @@
.terraform/
terraform.tfvars
terraform.tfstate
terraform.tfstate.*
tfplan