# Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

## [2.3.0] - 2025-10-20

This release focuses on local deployment improvements, enhanced workload differentiation, and an improved user experience with advanced configuration options.

### Added
- **Advanced Configuration Tabs**
  - Enhanced UI with additional configuration options
  - Info buttons and hover tooltips for parameter explanations
  - Contextual guidance to help users understand parameter meanings

- **Workload Safety Validations**
  - Token validation to prevent misconfigured deployments
  - GPU compatibility checks for local deployments
  - Protection against running jobs with incorrect configurations

- **Enhanced Docker Cleanup**
  - Automatic cleanup of stopped containers
  - Prunes unused volumes and networks
  - Optional Docker image and build cache cleanup
  - Improved disk space management (see the sketch below)
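
As a rough illustration, a cleanup pass like this can be driven from Python with the built-in `subprocess` module. The exact commands and flags the advisor runs may differ; `include_images` here mirrors the optional image/build-cache step:

```python
import subprocess

def docker_cleanup(include_images: bool = False) -> None:
    """Reclaim disk space from stopped containers, dangling volumes
    and networks; optionally prune images and the build cache too."""
    commands = [
        ["docker", "container", "prune", "-f"],
        ["docker", "volume", "prune", "-f"],
        ["docker", "network", "prune", "-f"],
    ]
    if include_images:  # the optional, more aggressive cleanup
        commands += [
            ["docker", "image", "prune", "-af"],
            ["docker", "builder", "prune", "-af"],
        ]
    for cmd in commands:
        subprocess.run(cmd, check=False)  # -f skips the confirmation prompt
```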

### Fixed
- **Document Citation References**
  - Fixed ingestion document citation tracking
  - Improved reference accuracy in RAG responses

### Changed
- **Local Deployment Architecture**
  - Migrated to vLLM container-based deployment
  - Streamlined local inference setup

- **Calculator Intelligence**
  - GPU passthrough recommendations for workloads exceeding vGPU profile limits
  - Improved sizing suggestions for large-scale deployments

- **Workload Differentiation**
  - Enhanced RAG vs. inference workload calculations
  - Embedding vector storage considerations
  - Database overhead factoring for RAG workloads (see the sketch after this list)
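
The sketch below shows one plausible shape for such a calculation. The function name, the fp32 vector size, and the 30% database overhead factor are illustrative assumptions, not the calculator's actual formulas:

```python
def estimate_memory_gb(model_params_b: float, kv_cache_gb: float,
                       workload: str = "inference",
                       num_chunks: int = 0, embed_dim: int = 1024,
                       db_overhead: float = 1.3) -> float:
    """Illustrative sizing: fp16 weights + KV cache for pure inference;
    RAG additionally budgets for embedding vectors and database overhead."""
    total = model_params_b * 2 + kv_cache_gb  # fp16 ~= 2 bytes/parameter
    if workload == "rag":
        # one fp32 embedding vector per indexed chunk, in GB...
        vectors_gb = num_chunks * embed_dim * 4 / 1024**3
        # ...inflated by an assumed vector-DB index/metadata factor
        total += vectors_gb * db_overhead
    return total

# e.g. a 49B model with a 6 GB KV cache serving RAG over 1M chunks:
# estimate_memory_gb(49, 6.0, "rag", num_chunks=1_000_000)  ->  ~109 GB
```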

### Removed
- **SSH Dependency**
  - Completely removed the SSH dependency
  - Simplified deployment workflow

### Improved
- **User Interface**
  - Modernized UI components
  - Better visual feedback and status indicators
  - Improved configuration wizard flow

## [2.2.0] - 2025-10-13

This release focuses on the AI vWS Sizing Advisor, with enhanced deployment capabilities, an improved user experience, and zero external dependencies for SSH operations.

### Added
- **Dynamic HuggingFace Model Integration**
  - Dynamically populated model list from the HuggingFace API
  - Support for any HuggingFace model in vLLM deployment
  - Real-time model validation and availability checking (see the first sketch after this list)

- **Adjustable Workload Calculation Parameters**
  - Configurable overhead parameters for workload calculations
  - Dynamic GPU utilization settings based on vGPU profile
  - Customizable memory overhead and KV cache calculations (see the second sketch after this list)
  - User-controllable performance vs. resource trade-offs

- **Backend Management Scripts**
  - New `restart_backend.sh` script for container management
  - Automated health checking and verification
  - Clean restart workflow with status reporting

- **Enhanced Debugging Output**
  - Clear, structured deployment logs
  - Real-time progress updates during vLLM deployment
  - SSH key generation path logging
  - Detailed error messages with automatic cleanup
  - Separate debug and deployment result views in the UI

- **Comprehensive GPU Performance Metrics**
  - GPU memory utilization reporting
  - Actual vs. estimated memory usage comparison
  - Real-time GPU saturation monitoring
  - Time-to-first-token (TTFT) measurements (see the third sketch after this list)
  - Throughput and latency metrics
  - Inference test results with sample outputs
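
A minimal sketch of how the model picker and validation can be built on the official `huggingface_hub` client; the project's actual implementation may query the HTTP API differently:

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

api = HfApi()

def search_models(query: str, limit: int = 20) -> list[str]:
    """Populate the model dropdown from the HuggingFace Hub."""
    return [m.id for m in api.list_models(search=query, limit=limit)]

def model_is_deployable(repo_id: str, token: str | None = None) -> bool:
    """Confirm the repo exists and is accessible before handing it to vLLM."""
    try:
        api.model_info(repo_id, token=token)
        return True
    except (GatedRepoError, RepositoryNotFoundError):
        return False
```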
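
The KV-cache term in the sizing math follows the standard transformer formula; the shape in the comment is Llama-3-8B-like and purely illustrative:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, dtype_bytes: int = 2) -> float:
    """2 (K and V) x layers x kv-heads x head-dim x tokens x bytes."""
    return (2 * layers * kv_heads * head_dim
            * seq_len * batch_size * dtype_bytes) / 1024**3

def usable_gb(profile_gb: float, gpu_memory_utilization: float = 0.9) -> float:
    """Fraction of the vGPU profile the server may claim (the same
    knob vLLM exposes as --gpu-memory-utilization)."""
    return profile_gb * gpu_memory_utilization

# e.g. 32 layers, 8 KV heads, head dim 128, 8K context, batch 4, fp16:
# kv_cache_gb(32, 8, 128, 8192, 4)  ->  4.0 GB
```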
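
TTFT can be measured against the vLLM server's OpenAI-compatible endpoint by streaming a completion and timing the first chunk. A sketch, where the URL and model name are placeholders:

```python
import time
import requests

def measure_ttft(base_url: str, model: str, prompt: str) -> float:
    """Seconds from request send to the first streamed token."""
    start = time.perf_counter()
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"model": model, "prompt": prompt,
              "max_tokens": 64, "stream": True},
        stream=True, timeout=120,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # first non-empty SSE line carries the first token
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")

# e.g. measure_ttft("http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "Hi")
```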

### Changed
- **SSH Implementation (Zero External Dependencies)**
  - Removed the `paramiko` library (LGPL) dependency
  - Removed the `sshpass` (GPL) dependency
  - Implemented a pure Python solution using the built-in `subprocess`, `tempfile`, and `os` modules (see the first sketch after this list)
  - Auto-generates SSH keys (`vgpu_sizing_advisor`) on first use
  - Automatic SSH key copying to remote VMs using bash with `SSH_ASKPASS`
  - 100% Apache-compatible implementation

- **HuggingFace Token Management**
  - Clear cached tokens before authentication
  - Explicit `huggingface-cli logout` before login
  - Automatic token file cleanup (`~/.huggingface/token`, `~/.cache/huggingface/token`)
  - Immediate deployment failure on invalid tokens (see the second sketch after this list)
  - Clean error messages without SSH warnings or tracebacks

- **UI/UX Improvements**
  - Updated configuration wizard with better flow
  - Dynamic status indicators (success/failure)
  - Prominent error display with red alert boxes
  - Hover tooltips for SSH key configuration
  - Separate tabs for deployment logs and debug output
  - Copy buttons for log export
  - Cleaner deployment result formatting
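
The core trick is that OpenSSH falls back to an `SSH_ASKPASS` helper when it has no controlling terminal, so a throwaway script can supply the password for the one-time `ssh-copy-id`. A condensed sketch, assuming passwords without shell metacharacters and OpenSSH 8.4+ for `SSH_ASKPASS_REQUIRE`:

```python
import os
import stat
import subprocess
import tempfile

def copy_key(user: str, host: str, password: str, pubkey: str) -> None:
    """Install a public key on a remote VM with no paramiko or sshpass."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(f'#!/bin/sh\necho "{password}"\n')
        askpass = f.name
    os.chmod(askpass, stat.S_IRWXU)  # owner-only: the script holds a secret
    env = dict(os.environ,
               SSH_ASKPASS=askpass,
               SSH_ASKPASS_REQUIRE="force",  # OpenSSH >= 8.4
               DISPLAY=":0")                 # older clients also need DISPLAY
    try:
        # start_new_session=True detaches the TTY, forcing the askpass path
        subprocess.run(["ssh-copy-id", "-i", pubkey,
                        "-o", "StrictHostKeyChecking=accept-new",
                        f"{user}@{host}"],
                       env=env, start_new_session=True, check=True)
    finally:
        os.remove(askpass)  # never leave the password on disk
```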
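
A sketch of the reset-then-login sequence; the CLI subcommands are standard `huggingface-cli`, while the error handling is simplified for illustration:

```python
import subprocess
from pathlib import Path

def reset_hf_auth(token: str) -> None:
    """Drop any cached credential so a stale token can't mask a bad one,
    then authenticate with the token the user actually supplied."""
    subprocess.run(["huggingface-cli", "logout"], check=False)
    for cached in (Path.home() / ".huggingface" / "token",
                   Path.home() / ".cache" / "huggingface" / "token"):
        cached.unlink(missing_ok=True)
    result = subprocess.run(["huggingface-cli", "login", "--token", token],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # fail the deployment immediately rather than mid-download
        raise ValueError("Invalid HuggingFace token")
```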

### Improved
- **Error Handling**
  - Structured error messages with context
  - Automatic error message cleanup (removes SSH warnings and tracebacks)
  - Better error propagation from backend to frontend
  - Explicit failure states in the UI

- **Deployment Process**
  - Automatic SSH key setup on first connection
  - Faster subsequent deployments (key-based auth)
  - More reliable vLLM server startup detection
  - Better cleanup on deployment failure

### Technical Improvements
- Pure Python SSH implementation (no GPL dependencies)
- Apache 2.0 license compliance verified
- Cleaner repository structure
- Comprehensive `.gitignore` for production readiness
- Removed unnecessary notebooks and demo files

### Security
- SSH key-based authentication (more secure than passwords)
- Automatic key generation with proper permissions (700 for `~/.ssh`, 600 for key files; see the sketch below)
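
A sketch of the first-use key setup. The `ed25519` key type is an assumption (the release notes don't specify one), and `ssh-keygen` itself already writes the private key as 600:

```python
import subprocess
from pathlib import Path

def ensure_keypair(name: str = "vgpu_sizing_advisor") -> Path:
    """Create ~/.ssh and the advisor keypair on first use, with the
    permissions sshd requires: 700 on the directory, 600 on the key."""
    ssh_dir = Path.home() / ".ssh"
    ssh_dir.mkdir(mode=0o700, exist_ok=True)
    key = ssh_dir / name
    if not key.exists():
        subprocess.run(["ssh-keygen", "-t", "ed25519", "-N", "",
                        "-f", str(key)], check=True)
        key.chmod(0o600)
    return key
```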

## [2.1.0] - 2025-05-13

This release reduces the overall GPU requirement for deploying the blueprint. It also improves performance and stability for both Docker- and Helm-based deployments.

### Added
- Added non-blocking async support to the document upload API
  - Added a new field `blocking: bool` to control this behaviour from the client side. The default is `true`
  - Added a new `/status` API to monitor the state and completion status of uploaded docs (see the sketch after this list)
- Helm chart is published on the NGC Public registry.
- A Helm chart customization guide is now available for all optional features under [documentation](./README.md#available-customizations).
- Issues with very large file uploads have been fixed.
- Security enhancements and stability improvements.
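
A sketch of the non-blocking flow from a client's perspective. Only the `blocking` field and the `/status` endpoint come from this release; the base URL, upload path, and response fields below are illustrative placeholders:

```python
import time
import requests

BASE = "http://localhost:8082"  # placeholder ingestor-server address

# submit documents without holding the connection open until completion
with open("report.pdf", "rb") as doc:
    resp = requests.post(f"{BASE}/documents",        # placeholder path
                         files={"documents": doc},
                         data={"blocking": "false"})
task_id = resp.json().get("task_id")                 # placeholder field

# poll the new /status API until ingestion settles
while True:
    state = requests.get(f"{BASE}/status",
                         params={"task_id": task_id}).json()
    if state.get("state") in ("FINISHED", "FAILED"): # placeholder states
        break
    time.sleep(5)
```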

### Changed
- Overall GPU requirement reduced to 2xH100/3xA100.
  - Changed the default LLM model to [llama-3_3-nemotron-super-49b-v1](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1). This reduces the GPUs needed to deploy the LLM model to 1xH100/2xA100.
  - Changed the default GPUs needed for all other NIMs (ingestion and reranker NIMs) to 1xH100/1xA100.
- Changed the default chunk size to 512 to reduce the LLM context size and, in turn, RAG server response latency.
- Exposed a config option to split PDFs post-chunking, controlled via the `APP_NVINGEST_ENABLEPDFSPLITTER` environment variable in the ingestor-server. The default value is `True`.
- Added batch-based ingestion, which helps manage the memory usage of the `ingestor-server` more effectively. Controlled via the `ENABLE_NV_INGEST_BATCH_MODE` and `NV_INGEST_FILES_PER_BATCH` variables; defaults are `True` and `100`, respectively (see the sketch after this list).
- Removed `extract_options` from the API level of the `ingestor-server`.
- Resolved an issue during bulk ingestion where the whole ingestion job failed if ingestion of a single file failed.
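
A sketch of how the two batching knobs plausibly interact; only the variable names and defaults come from the release notes, the surrounding logic is illustrative:

```python
import os
from typing import Iterator

def iter_ingestion_batches(files: list[str]) -> Iterator[list[str]]:
    """Yield files in bounded batches so ingestion memory stays flat."""
    batch_mode = os.getenv("ENABLE_NV_INGEST_BATCH_MODE", "True") == "True"
    per_batch = int(os.getenv("NV_INGEST_FILES_PER_BATCH", "100"))
    if not batch_mode:
        yield files  # single submission, original behaviour
        return
    for i in range(0, len(files), per_batch):
        yield files[i:i + per_batch]
```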

### Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, or `APP_RANKING_MODELNAME` environment variable values are changed.
- When uploading multiple files at the same time, a timeout error `Error uploading documents: [Error: aborted] { code: 'ECONNRESET' }` may occur. Developers are encouraged to use the APIs directly for bulk uploading instead of the sample rag-playground. The default upload timeout on the UI side is 1 hour.
- If a file upload fails, error messages may not be shown in the rag-playground user interface. Developers are encouraged to check the `ingestor-server` logs for details.

A detailed guide is available [here](./docs/migration_guide.md) to ease the developer experience when migrating from older versions.

## [2.0.0] - 2025-03-18

This release adds support for multimodal documents using [NVIDIA Ingest](https://github.com/NVIDIA/nv-ingest), including parsing of PDF, Word, and PowerPoint documents. It also significantly improves accuracy and performance by refactoring the APIs and architecture, and adds a new developer-friendly UI.

### Added
- Integration with NVIDIA Ingest for the ingestion pipeline; the unstructured.io-based pipeline is now deprecated.
- OTEL-compatible [observability and telemetry support](./docs/observability.md).
- API refactoring. Updated schemas are available [here](./docs/api_reference/).
  - Support for runtime configuration of all common parameters.
  - Multimodal citation support.
  - New dedicated endpoints for creating collections, deleting collections, and reingesting documents.
- [New React + Node.js based UI](./frontend/) showcasing runtime configurations.
- Optional features to improve the accuracy and reliability of the pipeline, turned off by default. Best practices are documented [here](./docs/accuracy_perf.md).
  - [Self-reflection support](./docs/self-reflection.md)
  - [NeMo Guardrails support](./docs/nemo-guardrails.md)
  - [Hybrid search support using Milvus](./docs/hybrid_search.md)
- [Brev dev](https://developer.nvidia.com/brev)-compatible [notebook](./notebooks/launchable.ipynb)
- Security enhancements and stability improvements

### Changed
- In **RAG v1.0.0**, a single server managed both the **ingestion** and **retrieval/generation** APIs. In **RAG v2.0.0**, the architecture has evolved to use **two separate microservices**.
- [Helm charts](./deploy/helm/) are now modularized; separate Helm charts are provided for each distinct microservice.
- Default settings are configured to balance accuracy and performance.
  - [The default flow uses on-prem models](./docs/quickstart.md#deploy-with-docker-compose), with the option to switch to API catalog endpoints for the Docker-based flow.
  - [Query rewriting](./docs/query_rewriter.md) uses a smaller llama3.1-8b-instruct model and is turned off by default.
  - Support for using conversation history during retrieval for low-latency multi-turn support.

### Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, or `APP_RANKING_MODELNAME` environment variable values are changed.
- The optional features self-reflection, NeMo Guardrails, and image captioning are not available in Helm-based deployments.
- Uploading large files with a .txt extension may fail during ingestion; we recommend splitting such files into smaller parts to avoid this issue.

A detailed guide is available [here](./docs/migration_guide.md) to ease the developer experience when migrating from older versions.

## [1.0.0] - 2025-01-15

### Added

- First release.