Commit eab4d67

Add vGPU Sizing Advisor to community examples (#376)
Signed-off-by: chloecrozier <[email protected]>
1 parent be9acc9 commit eab4d67

File tree

154 files changed

+38698
-0
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

154 files changed

+38698
-0
lines changed
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
# Ignore git objects
.git/
.gitignore
.gitlab-ci.yml
.gitmodules
# Ignore temporary volumes
deploy/compose/volumes

# creating a docker image
.dockerignore

# Ignore any virtual environment configuration files
.env*
.venv/
env/
# Ignore python bytecode files
*.pyc
__pycache__/
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
data/dataset.zip filter=lfs diff=lfs merge=lfs -text

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# Python
.venv/
__pycache__/
*.pyc
*.pyo
*.pyd

# Docker & Compose
deploy/compose/volumes/
uploaded_files/

# Node.js
**/node_modules/

# Next.js
frontend/.next/
frontend/out/

# Build outputs
**/build/
**/dist/

# Helm
**/helm-charts/*.tgz

# Logs
*.log
logs/

# Environment files
.env
.env.local
.env*.local
.env.development.local
.env.test.local
.env.production.local

# IDE
.vscode/
.idea/
*.iml

# OS files
.DS_Store
Thumbs.db

# Temporary files
temp.txt
*.tmp
*.temp
*.swp
*.bak
*~

# Test outputs
test-results/
coverage/

# Generated configuration results
src/vgpu_configuration_results.json
Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
## Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

## [2.3.0] - 2025-10-20

This release focuses on local deployment improvements, enhanced workload differentiation, and an improved user experience with advanced configuration options.

### Added
- **Advanced Configuration Tabs**
  - Enhanced UI with additional configuration options
  - Info buttons and hover tooltips for parameter explanations
  - Contextual guidance to help users understand parameter meanings

- **Workload Safety Validations**
  - Token validation to prevent misconfigured deployments
  - GPU compatibility checks for local deployments
  - Protection against running jobs with incorrect configurations

- **Document Citation References**
  - Fixed ingestion document citation tracking
  - Improved reference accuracy in RAG responses

- **Enhanced Docker Cleanup** (a sketch of the underlying cleanup commands follows this list)
  - Automatic cleanup of stopped containers
  - Prunes unused volumes and networks
  - Optional Docker image and build cache cleanup
  - Improved disk space management

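As a rough sketch of what the enhanced cleanup can run under the hood, the snippet below drives the standard Docker CLI from Python. The function name and the choice to gate image pruning behind a flag are illustrative assumptions, not the project's actual script:

```python
import subprocess

def docker_cleanup(prune_images: bool = False) -> None:
    """Reclaim disk space: stopped containers, unused volumes and
    networks, and (optionally) unused images and build cache."""
    commands = [
        ["docker", "container", "prune", "-f"],  # stopped containers
        ["docker", "volume", "prune", "-f"],     # unused volumes
        ["docker", "network", "prune", "-f"],    # unused networks
    ]
    if prune_images:  # the "optional" cleanup described above
        commands += [
            ["docker", "image", "prune", "-af"],
            ["docker", "builder", "prune", "-af"],
        ]
    for cmd in commands:
        subprocess.run(cmd, check=True)
```
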
### Changed
- **Local Deployment Architecture**
  - Migrated to vLLM container-based deployment
  - Streamlined local inference setup

- **Calculator Intelligence**
  - GPU passthrough recommendations for workloads exceeding vGPU profile limits
  - Improved sizing suggestions for large-scale deployments

- **Workload Differentiation** (see the sketch after this section)
  - Enhanced RAG vs. inference workload calculations
  - Embedding vector storage considerations
  - Database overhead factoring for RAG workloads

- **SSH Removal**
  - Completely removed the SSH dependency
  - Simplified deployment workflow

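As an illustration of the RAG-vs-inference differentiation, a RAG workload adds an embedding-store term on top of the plain inference footprint. The function below is a minimal sketch; the default dimension, dtype, and overhead factor are assumptions for the example, not the advisor's actual constants:

```python
def rag_embedding_overhead_gib(
    num_chunks: int,
    embedding_dim: int = 1024,        # assumed embedder output size
    bytes_per_value: int = 4,         # float32 vectors
    db_overhead_factor: float = 1.5,  # assumed index/metadata overhead
) -> float:
    """Extra memory a RAG workload needs beyond pure inference:
    raw embedding vectors plus vector-database overhead."""
    raw_vectors = num_chunks * embedding_dim * bytes_per_value
    return raw_vectors * db_overhead_factor / 1024**3

# 1M chunks at dim 1024 -> ~3.8 GiB raw, ~5.7 GiB with DB overhead.
print(f"{rag_embedding_overhead_gib(1_000_000):.1f} GiB")
```
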
### Improved
- **User Interface**
  - Modernized UI components
  - Better visual feedback and status indicators
  - Improved configuration wizard flow

## [2.2.0] - 2025-10-13

This release focuses on the AI vWS Sizing Advisor, with enhanced deployment capabilities, an improved user experience, and zero external dependencies for SSH operations.

### Added
- **Dynamic HuggingFace Model Integration**
  - Dynamically populated model list from the HuggingFace API
  - Support for any HuggingFace model in vLLM deployment
  - Real-time model validation and availability checking

- **Adjustable Workload Calculation Parameters** (a worked sketch of the memory estimate follows this list)
  - Configurable overhead parameters for workload calculations
  - Dynamic GPU utilization settings based on vGPU profile
  - Customizable memory overhead and KV cache calculations
  - User-controllable performance vs. resource trade-offs

- **Backend Management Scripts**
  - New `restart_backend.sh` script for container management
  - Automated health checking and verification
  - Clean restart workflow with status reporting

- **Enhanced Debugging Output**
  - Clear, structured deployment logs
  - Real-time progress updates during vLLM deployment
  - SSH key generation path logging
  - Detailed error messages with automatic cleanup
  - Separate debug and deployment result views in the UI

- **Comprehensive GPU Performance Metrics**
  - GPU memory utilization reporting
  - Actual vs. estimated memory usage comparison
  - Real-time GPU saturation monitoring
  - Time-to-first-token (TTFT) measurements
  - Throughput and latency metrics
  - Inference test results with sample outputs

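To make the memory-overhead and KV-cache knobs concrete, here is a minimal sketch of the standard serving-memory estimate: model weights plus per-token KV cache, scaled by an overhead factor. Every name and default below is an illustrative assumption, not the advisor's actual formula:

```python
def estimate_serving_memory_gib(
    num_params_b: float,           # model size in billions of parameters
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_tokens: int,               # concurrent tokens held in KV cache
    bytes_per_param: int = 2,      # fp16/bf16 weights
    kv_bytes: int = 2,             # fp16 KV cache
    overhead_factor: float = 1.2,  # assumed runtime/activation overhead
) -> float:
    """Rough GPU memory estimate for serving an LLM."""
    weights = num_params_b * 1e9 * bytes_per_param
    # K and V per layer: 2 * num_kv_heads * head_dim values per token.
    kv_cache = 2 * num_layers * num_kv_heads * head_dim * kv_bytes * max_tokens
    return (weights + kv_cache) * overhead_factor / 1024**3

# Example: an 8B model (32 layers, 8 KV heads of dim 128) holding
# 16k tokens of KV cache -> roughly 20 GiB.
print(f"{estimate_serving_memory_gib(8, 32, 8, 128, 16_384):.1f} GiB")
```
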
### Changed
- **SSH Implementation (Zero External Dependencies)** (a minimal sketch of the key-setup pattern follows this section)
  - Removed the `paramiko` library (LGPL) dependency
  - Removed the `sshpass` (GPL) dependency
  - Implemented a pure-Python solution using the built-in `subprocess`, `tempfile`, and `os` modules
  - Auto-generates SSH keys (`vgpu_sizing_advisor`) on first use
  - Automatic SSH key copying to remote VMs using bash with `SSH_ASKPASS`
  - 100% Apache-compatible implementation

- **HuggingFace Token Management**
  - Clears cached tokens before authentication
  - Explicit `huggingface-cli logout` before login
  - Automatic token file cleanup (`~/.huggingface/token`, `~/.cache/huggingface/token`)
  - Immediate deployment failure on invalid tokens
  - Clean error messages without SSH warnings or tracebacks

- **UI/UX Improvements**
  - Updated configuration wizard with better flow
  - Dynamic status indicators (success/failure)
  - Prominent error display with red alert boxes
  - Hover tooltips for SSH key configuration
  - Separate tabs for deployment logs and debug output
  - Copy buttons for log export
  - Cleaner deployment result formatting

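A minimal sketch of the dependency-free pattern described above, assuming standard OpenSSH tooling: generate a dedicated key pair with `ssh-keygen`, then install it on the remote VM by feeding the password through `SSH_ASKPASS` rather than `sshpass`. Function names, flags, and paths here are illustrative; the project's actual implementation may differ:

```python
import os
import stat
import subprocess
import tempfile
from pathlib import Path

KEY_PATH = Path.home() / ".ssh" / "vgpu_sizing_advisor"  # key name from the notes above

def ensure_key() -> None:
    """Generate the key pair on first use, with 700/600 permissions."""
    KEY_PATH.parent.mkdir(mode=0o700, exist_ok=True)
    if not KEY_PATH.exists():
        subprocess.run(
            ["ssh-keygen", "-t", "ed25519", "-f", str(KEY_PATH), "-N", ""],
            check=True,
        )
        KEY_PATH.chmod(0o600)

def copy_key(user: str, host: str, password: str) -> None:
    """Install the public key on the remote VM without sshpass, by
    pointing SSH_ASKPASS at a throwaway script that prints the password."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(f'#!/bin/bash\necho "{password}"\n')
        askpass = f.name
    os.chmod(askpass, stat.S_IRWXU)  # owner-only read/write/execute
    env = dict(
        os.environ,
        SSH_ASKPASS=askpass,
        SSH_ASKPASS_REQUIRE="force",  # OpenSSH >= 8.4; older versions need setsid
        DISPLAY=":0",
    )
    try:
        subprocess.run(
            ["ssh-copy-id", "-i", f"{KEY_PATH}.pub",
             "-o", "StrictHostKeyChecking=accept-new", f"{user}@{host}"],
            env=env, check=True, stdin=subprocess.DEVNULL,
        )
    finally:
        os.unlink(askpass)  # never leave the password on disk
```
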
### Improved
- **Error Handling**
  - Structured error messages with context
  - Automatic error message cleanup (removes SSH warnings, tracebacks)
  - Better error propagation from backend to frontend
  - Explicit failure states in UI

- **Deployment Process**
  - Automatic SSH key setup on first connection
  - Faster subsequent deployments (key-based auth)
  - More reliable vLLM server startup detection
  - Better cleanup on deployment failure

### Technical Improvements
- Pure Python SSH implementation (no GPL dependencies)
- Apache 2.0 license compliance verified
- Cleaner repository structure
- Comprehensive .gitignore for production readiness
- Removed unnecessary notebooks and demo files

### Security
- SSH key-based authentication (more secure than passwords)
- Automatic key generation with proper permissions (700/600)

## [2.1.0] - 2025-05-13

This release reduces the overall GPU requirement for deploying the blueprint. It also improves performance and stability for both Docker- and Helm-based deployments.

### Added
- Added non-blocking async support to the upload documents API (a usage sketch follows this list):
  - Added a new field `blocking: bool` to control this behaviour from the client side. The default is `true`.
  - Added a new API `/status` to monitor the state or completion status of uploaded documents.
- The Helm chart is published on the NGC public registry.
- A Helm chart customization guide is now available for all optional features under [documentation](./README.md#available-customizations).
- Issues with very large file uploads have been fixed.
- Security enhancements and stability improvements.

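A hedged usage sketch of the non-blocking flow: submit an upload with `blocking: false`, then poll `/status` until ingestion completes. Only the `blocking` field and the `/status` path come from the notes above; the base URL, upload path, multipart layout, and the `task_id`/`state` response fields are assumptions for illustration:

```python
import time
import requests

BASE = "http://localhost:8082"  # assumed ingestor-server address

# Submit an upload without blocking on ingestion.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE}/documents",          # assumed upload path
        files={"documents": f},
        data={"blocking": "false"},   # field from the changelog
    )
resp.raise_for_status()
task_id = resp.json().get("task_id")  # assumed response field

# Poll the /status API until the ingestion task finishes.
while True:
    status = requests.get(f"{BASE}/status", params={"task_id": task_id}).json()
    if status.get("state") in ("FINISHED", "FAILED"):  # assumed states
        print(status)
        break
    time.sleep(5)
```
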
### Changed
- Overall GPU requirement reduced to 2xH100/3xA100.
- Changed the default LLM model to [llama-3_3-nemotron-super-49b-v1](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1). This reduces the GPUs needed to deploy the LLM model to 1xH100/2xA100.
- Changed the default GPU requirement for all other NIMs (ingestion and reranker NIMs) to 1xH100/1xA100.
- Changed the default chunk size to 512 in order to reduce the LLM context size and, in turn, reduce RAG server response latency.
- Exposed a config to split PDFs post-chunking, controlled using the `APP_NVINGEST_ENABLEPDFSPLITTER` environment variable in ingestor-server. The default value is `True`.
- Added batch-based ingestion, which can help manage the memory usage of `ingestor-server` more effectively (see the sketch after this list). Controlled using the `ENABLE_NV_INGEST_BATCH_MODE` and `NV_INGEST_FILES_PER_BATCH` variables; default values are `True` and `100`, respectively.
- Removed `extract_options` from the API level of `ingestor-server`.
- Resolved an issue during bulk ingestion where the ingestion job failed if ingestion of a single file failed.

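A schematic sketch of what batch-mode ingestion implies: the two environment variables and their defaults come from the entry above, while the file list and the batching helper are placeholders:

```python
import os

def iter_batches(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Variable names and defaults from the changelog entry above.
batch_mode = os.getenv("ENABLE_NV_INGEST_BATCH_MODE", "True") == "True"
per_batch = int(os.getenv("NV_INGEST_FILES_PER_BATCH", "100"))

files = [f"doc_{i}.pdf" for i in range(250)]  # placeholder file list
if batch_mode:
    for batch in iter_batches(files, per_batch):
        print(f"ingesting {len(batch)} files")  # e.g. ingest_batch(batch)
else:
    print(f"ingesting all {len(files)} files at once")
```
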
### Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, or `APP_RANKING_MODELNAME` environment variable values are changed.
- When uploading multiple files at the same time, there may be a timeout error `Error uploading documents: [Error: aborted] { code: 'ECONNRESET' }`. Developers are encouraged to use the APIs directly for bulk uploading instead of the sample rag-playground. The default upload timeout on the UI side is one hour.
- If a file upload fails, error messages may not be shown in the rag-playground user interface. Developers are encouraged to check the `ingestor-server` logs for details.

A detailed guide is available [here](./docs/migration_guide.md) to ease the developer experience when migrating from older versions.

## [2.0.0] - 2025-03-18

This release adds support for multimodal documents using [Nvidia Ingest](https://github.com/NVIDIA/nv-ingest), including support for parsing PDF, Word, and PowerPoint documents. It also significantly improves accuracy and performance by refactoring the APIs and architecture, and adds a new developer-friendly UI.

### Added
- Integration with Nvingest for the ingestion pipeline; the unstructured.io-based pipeline is now deprecated.
- OTEL-compatible [observability and telemetry support](./docs/observability.md).
- API refactoring; updated schemas [here](./docs/api_reference/).
- Support for runtime configuration of all common parameters.
- Multimodal citation support.
- New dedicated endpoints for creating collections, deleting collections, and reingesting documents.
- [New React + Node.js based UI](./frontend/) showcasing runtime configurations.
- Added optional features to improve accuracy and reliability of the pipeline, turned off by default. Best practices [here](./docs/accuracy_perf.md):
  - [Self-reflection support](./docs/self-reflection.md)
  - [NeMo Guardrails support](./docs/nemo-guardrails.md)
  - [Hybrid search support using Milvus](./docs/hybrid_search.md)
- [Brev dev](https://developer.nvidia.com/brev)-compatible [notebook](./notebooks/launchable.ipynb)
- Security enhancements and stability improvements

### Changed
- In **RAG v1.0.0**, a single server managed both **ingestion** and **retrieval/generation** APIs. In **RAG v2.0.0**, the architecture has evolved to use **two separate microservices**.
- [Helm charts](./deploy/helm/) are now modularized; separate Helm charts are provided for each distinct microservice.
- Default settings are configured to strike a balance between accuracy and performance.
- [The default flow uses on-prem models](./docs/quickstart.md#deploy-with-docker-compose), with an option to switch to API catalog endpoints for the Docker-based flow.
- [Query rewriting](./docs/query_rewriter.md) uses a smaller llama3.1-8b-instruct model and is turned off by default.
- Support for using conversation history during retrieval for low-latency multi-turn conversations.

### Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, or `APP_RANKING_MODELNAME` environment variable values are changed.
- The optional reflection, NeMo Guardrails, and image-captioning features are not available in Helm-based deployments.
- Uploading large files with a .txt extension may fail during ingestion; we recommend splitting such files into smaller parts to avoid this issue.

A detailed guide is available [here](./docs/migration_guide.md) to ease the developer experience when migrating from older versions.

## [1.0.0] - 2025-01-15

### Added

- First release.
