Name	Name	Last commit message	Last commit date
parent directory ..
Deploying Custom model.md	Deploying Custom model.md
Llamastack_NVIDIA_E2E_Flow_RHOAI.ipynb	Llamastack_NVIDIA_E2E_Flow_RHOAI.ipynb
README.md	README.md
config.py	config.py

LlamaStack NVIDIA E2E Demo

This demo showcases an end-to-end workflow for fine-tuning, inference, and evaluation using NVIDIA NeMo Microservices and LlamaStack. It demonstrates how to integrate various AI capabilities through the LlamaStack API.

Overview

The demo is based on the official LlamaStack NVIDIA E2E Flow notebook and includes:

Model Fine-tuning: Customizing pre-trained models with domain-specific data
Inference: Running inference on base and customized models
Evaluation: Comparing model performance metrics before and after fine-tuning
Safety Checks: Implementing guardrails for content safety
Dataset Management: Uploading and managing training datasets

Prerequisites

Before deploying LlamaStack, ensure you have:

NeMo Microservices Platform running with the following components:
- NeMo Data Store (NDS)
- NeMo Entity Store
- NeMo Customizer
- NeMo Evaluator
- NeMo Guardrails
- NIM (NVIDIA Inference Microservice)
Hugging Face Token with access to required model repositories
NVIDIA NGC API Key for accessing NVIDIA services

Deployment

LlamaStack is deployed using the Helm chart as part of the nemo-instances chart. This ensures consistency with the deployment approach used for all other NeMo microservices.

Building the Image

The LlamaStack container image used in the deployment can be built using the following commands:

git clone https://github.com/meta-llama/llama-stack.git
cd llama-stack

podman build --platform=linux/amd64 \
  -f containers/Containerfile \
  --build-arg DISTRO_NAME=nvidia \
  --build-arg INSTALL_MODE=editable \
  --tag quay.io/ecosystem-appeng/llamastack-server-distribution:latest .

podman push quay.io/ecosystem-appeng/llamastack-server-distribution:latest

The image configuration in the Helm values (deploy/nemo-instances/values.yaml) references this image:

llamastack:
  image:
    repository: quay.io/ecosystem-appeng/llamastack-server-distribution
    tag: "latest"

Helm Deployment

LlamaStack is included in the nemo-instances Helm chart. To deploy or upgrade:

cd deploy/nemo-instances

# Deploy or upgrade with llamastack enabled
helm upgrade nemo-instances . \
  -n <namespace> \
  --set namespace.name=<namespace> \
  --set llamastack.enabled=true

The Helm chart will create:

ConfigMap: Contains the LlamaStack configuration defining providers for various APIs (inference, safety, eval, post_training, etc.)
Deployment: Deploys the LlamaStack container with environment variables pointing to your NeMo microservices
Service: Creates a ClusterIP service to expose LlamaStack internally on port 8321

Configuration

The deployment is configured via Helm values in deploy/nemo-instances/values.yaml. Key configuration includes:

NVIDIA_API_KEY: Your NGC API key (from secret)
NVIDIA_BASE_URL: NIM inference endpoint URL (automatically configured based on namespace)
NVIDIA_ENTITY_STORE_URL: NeMo Entity Store URL (automatically configured)
NVIDIA_DATASETS_URL: NeMo Data Store URL (automatically configured)
NVIDIA_CUSTOMIZER_URL: NeMo Customizer URL (automatically configured)
GUARDRAILS_SERVICE_URL: NeMo Guardrails URL (automatically configured)
NVIDIA_EVALUATOR_URL: NeMo Evaluator URL (automatically configured)

All service URLs are automatically configured based on the namespace setting, ensuring proper connectivity to NeMo microservices.

End-to-End Test

The Llama_Stack_NVIDIA_E2E_Flow.ipynb notebook provides a comprehensive end-to-end test of the LlamaStack API integration with NeMo microservices.

Test Workflow

Setup and Configuration:
- Configure URLs for all NeMo microservices in config.py
- Set Hugging Face token for dataset access
- Initialize LlamaStack client
Dataset Preparation:
- Upload sample SQuAD dataset for fine-tuning
- Prepare training, validation, and testing data splits
Model Registration:
- Register base model (meta/llama-3.2-1b-instruct) in Entity Store
- Configure model metadata and artifacts
Inference Testing:
- Test inference on the base model
- Verify model responses and performance
Model Customization:
- Create fine-tuning job using NeMo Customizer
- Monitor job progress and wait for completion
- Register customized model in NIM
Evaluation:
- Run evaluation on base model using sample datasets
- Evaluate customized model performance
- Compare metrics between base and fine-tuned models

Sample Data

The demo includes sample datasets:

sample_squad_data/: Stanford Question Answering Dataset (SQuAD) for fine-tuning
sample_content_safety_test_data/: Content safety test cases for guardrails evaluation
sample_squad_messages/: Message format datasets for chat-based evaluation

Key API Endpoints Tested

The notebook exercises these LlamaStack APIs:

Inference API: Model completions and chat
Post Training API: Model customization jobs
Eval API: Model evaluation and benchmarking
Safety API: Content safety and guardrails
Dataset IO API: Dataset upload and management

Expected Results

Base model BLEU score: ~3
Customized model BLEU score: ~5-15

Usage

After deployment, LlamaStack will be available at:

Internal: http://llamastack.<namespace>.svc.cluster.local:8321
From within the same namespace: http://llamastack:8321

You can interact with LlamaStack using:

from llama_stack_client import LlamaStackClient

# Update the URL to match your deployment
client = LlamaStackClient(base_url="http://llamastack.<namespace>.svc.cluster.local:8321")
# Run inference, evaluations, safety checks, etc.

Important: RHOAI InferenceService Limitation

E2E Workflow Compatibility

NeMo Microservices enables a complete workflow:

Evaluate the base model ✅
Fine-tune the model (using Customizer) ✅
Evaluate the fine-tuned model ❌ (Issue when using InferenceService)
Apply guardrails ✅

The Problem with RHOAI-Deployed Models

Step #3 (Evaluating Fine-Tuned Models) fails when using RHOAI InferenceService.

Why?

RHOAI deploys models as InferenceService (KServe/ModelMesh)
NeMo NIM deployed via NIMPipeline supports dynamic LoRA adapter loading
InferenceService cannot dynamically load LoRA adapters from Entity Store

What This Means:

✅ Fine-tuning works - Customizer trains LoRA adapters successfully with any base model
❌ Serving fine-tuned model fails - RHOAI InferenceService cannot load the trained adapter

Solution

For the complete E2E workflow, deploy models using NIMPipeline (not RHOAI InferenceService):

See Deploying Custom model.md for instructions
NIMPipeline supports dynamic LoRA loading via NIM_PEFT_SOURCE environment variable

Troubleshooting

Deployment Issues: Check pod logs with oc logs -f deployment/llamastack
Service Connectivity: Verify NeMo microservices are running and accessible
Model Loading: Ensure base models are available in NIM
API Keys: Confirm NGC and Hugging Face tokens are valid
Fine-Tuned Model Evaluation Fails: Verify model is deployed via NIMPipeline, not RHOAI InferenceService (see above)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

LlamaStack NVIDIA E2E Demo

Overview

Prerequisites

Deployment

Building the Image

Helm Deployment

Configuration

End-to-End Test

Test Workflow

Sample Data

Key API Endpoints Tested

Expected Results

Usage

Important: RHOAI InferenceService Limitation

E2E Workflow Compatibility

The Problem with RHOAI-Deployed Models

Solution

Troubleshooting

FilesExpand file tree

llamastack

Directory actions

More options

Directory actions

More options

Latest commit

History

llamastack

Folders and files

parent directory

README.md

LlamaStack NVIDIA E2E Demo

Overview

Prerequisites

Deployment

Building the Image

Helm Deployment

Configuration

End-to-End Test

Test Workflow

Sample Data

Key API Endpoints Tested

Expected Results

Usage

Important: RHOAI InferenceService Limitation

E2E Workflow Compatibility

The Problem with RHOAI-Deployed Models

Solution

Troubleshooting