diff --git a/docs/features/rag/index.md b/docs/features/rag/index.md
index c389da7f6..d001af251 100644
--- a/docs/features/rag/index.md
+++ b/docs/features/rag/index.md
@@ -146,6 +146,23 @@ The RAG feature allows users to easily track the context of documents fed to LLM
 The togglable hybrid search sub-feature for our RAG embedding feature enhances RAG functionality via `BM25`, with re-ranking powered by `CrossEncoder`, and configurable relevance score thresholds. This provides a more precise and tailored RAG experience for your specific use case.
+## KV Cache Optimization (Performance Tip) 🚀
+
+For professional and high-performance use cases, especially when dealing with long documents or frequent follow-up questions, you can significantly improve response times by enabling **KV Cache Optimization**.
+
+### The Problem: Cache Invalidation
+By default, Open WebUI injects retrieved RAG context into the **user message**. As the conversation progresses, follow-up messages shift the position of this context in the chat history. For many LLM backends, including local engines (like Ollama, llama.cpp, and vLLM) and cloud / Model-as-a-Service providers (like OpenAI and Vertex AI), this shifting position invalidates the **KV (Key-Value) prefix cache** or **Prompt Cache**, forcing the model to re-process the entire context for every response. This leads to increased latency and potentially higher costs as the conversation grows.
+
+### The Solution: `RAG_SYSTEM_CONTEXT`
+You can fix this behavior by enabling the `RAG_SYSTEM_CONTEXT` environment variable.
+
+- **How it works**: When `RAG_SYSTEM_CONTEXT=True`, Open WebUI injects the RAG context into the **system message** instead of the user message.
+- **The Result**: Since the system message stays at the very beginning of the prompt and its position never changes, the provider can effectively cache the processed context. Follow-up questions then benefit from **near-instant responses** and **cost savings** because the "heavy lifting" (processing the large RAG context) is only done once.
+
+:::tip Recommended Configuration
+If you are using **Ollama**, **llama.cpp**, **OpenAI**, or **Vertex AI** and frequently "chat with your documents," set `RAG_SYSTEM_CONTEXT=True` in your environment to experience drastically faster follow-up responses!
+:::
+
 ## YouTube RAG Pipeline
 The dedicated RAG pipeline for summarizing YouTube videos via video URLs enables smooth interaction with video transcriptions directly. This innovative feature allows you to incorporate video content into your chats, further enriching your conversation experience.
diff --git a/docs/getting-started/env-configuration.mdx b/docs/getting-started/env-configuration.mdx
index 3649245ca..2391ee029 100644
--- a/docs/getting-started/env-configuration.mdx
+++ b/docs/getting-started/env-configuration.mdx
@@ -2896,6 +2896,12 @@ Strictly return in JSON format:
 - Description: Specifies whether to use the full context for RAG.
 - Persistence: This environment variable is a `PersistentConfig` variable.
+#### `RAG_SYSTEM_CONTEXT`
+
+- Type: `bool`
+- Default: `False`
+- Description: When enabled, injects RAG context into the **system message** instead of the user message. This is highly recommended for optimizing performance when using models that support **KV prefix caching** or **Prompt Caching**. This includes local engines (like Ollama, llama.cpp, or vLLM) as well as cloud / Model-as-a-Service providers (like OpenAI and Vertex AI). By placing the context in the system message, it remains at a stable position at the start of the conversation, allowing the cache to persist across multiple turns. When disabled (default), context is injected into the user message, which shifts position each turn and invalidates the cache.
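+
+  For example, a minimal sketch of enabling this in a Docker-based deployment (the ports, volume, and image tag below are illustrative; only the `-e RAG_SYSTEM_CONTEXT=True` flag is the relevant part):
+
+  ```bash
+  # Illustrative only - adapt the rest of the command to your own deployment.
+  docker run -d -p 3000:8080 \
+    -e RAG_SYSTEM_CONTEXT=True \
+    -v open-webui:/app/backend/data \
+    --name open-webui \
+    ghcr.io/open-webui/open-webui:main
+  ```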
+
 #### `ENABLE_RAG_LOCAL_WEB_FETCH`
 - Type: `bool`
diff --git a/docs/troubleshooting/connection-error.mdx b/docs/troubleshooting/connection-error.mdx
index 3f05066df..3d7928fbc 100644
--- a/docs/troubleshooting/connection-error.mdx
+++ b/docs/troubleshooting/connection-error.mdx
@@ -77,7 +77,10 @@ WebSocket support is required for Open WebUI v0.5.0 and later. If WebSockets are
 1. **Check your reverse proxy configuration** - Ensure `Upgrade` and `Connection` headers are properly set
 2. **Verify CORS settings** - WebSocket connections respect CORS policies
 3. **Check browser console** - Look for WebSocket connection errors
-4. **Test direct connection** - Try connecting directly to Open WebUI without the proxy to isolate the issue
+4. **Test direct connection** - Try connecting directly to Open WebUI without the proxy to isolate the issue.
+5. **Check for HTTP/2 WebSocket issues** - Some proxies (like HAProxy 3.x) enable HTTP/2 by default. If your proxy handles client connections via HTTP/2 but the backend/application doesn't properly support RFC 8441 (WebSockets over HTTP/2), the instance may appear to "freeze" or stop responding.
+   - **Fix for HAProxy**: Add `option h2-workaround-bogus-websocket-clients` to your configuration, or force the backend connection to use HTTP/1.1.
+   - **Fix for Nginx**: Use `proxy_http_version 1.1;` in your location block (most Open WebUI example configurations already include this).
 For multi-instance deployments, configure Redis for WebSocket management:
 ```bash
diff --git a/docs/troubleshooting/rag.mdx b/docs/troubleshooting/rag.mdx
index 1a98d0530..105d2541f 100644
--- a/docs/troubleshooting/rag.mdx
+++ b/docs/troubleshooting/rag.mdx
@@ -167,6 +167,18 @@ When using the **Markdown Header Splitter**, documents can sometimes be split in
 ---
+### 8. Slow Follow-up Responses (KV Cache Invalidation) 🐌
+
+If your initial response is fast but follow-up questions become increasingly slow, you are likely experiencing **KV cache invalidation**.
+
+**The Problem**: By default, Open WebUI injects RAG context into the **user message**. As the chat progresses, new messages shift the position of this context, forcing local engines (like Ollama, llama.cpp, or vLLM) and cloud providers (like OpenAI or Vertex AI) to re-process the entire context on every turn.
+
+✅ Solution:
+- Set the environment variable `RAG_SYSTEM_CONTEXT=True`.
+- This injects the RAG context into the **system message**, which stays at a fixed position at the start of the conversation.
+- This allows providers to effectively use **KV prefix caching** or **Prompt Caching**, resulting in nearly instant follow-up responses even with large documents (see the sketch below).
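+
+A rough sketch of why this helps, using an OpenAI-compatible request against a local Ollama server (the host, model name, and `<context>` wrapper are illustrative, not the exact payload Open WebUI sends):
+
+```bash
+# With RAG_SYSTEM_CONTEXT=True the large retrieved context lives in the system
+# message, so the prompt prefix is identical on every turn and the engine's
+# KV/prompt cache can be reused for follow-up questions.
+curl http://localhost:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama3.1",
+    "messages": [
+      {"role": "system", "content": "Use the following context.\n<context>...large retrieved chunks...</context>"},
+      {"role": "user", "content": "Summarize the report."},
+      {"role": "assistant", "content": "..."},
+      {"role": "user", "content": "What does section 3 say?"}
+    ]
+  }'
+```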
+
+---
 | Problem | Fix |
 |--------|------|
@@ -176,6 +188,7 @@ When using the **Markdown Header Splitter**, documents can sometimes be split in
 | 📉 Inaccurate retrieval | Switch to a better embedding model, then reindex |
 | ❌ Upload limits bypass | Use Folder uploads (with `FOLDER_MAX_FILE_COUNT`) but note that Knowledge Base limits are separate |
 | 🧩 Fragmented/Tiny Chunks | Increase **Chunk Min Size Target** to merge small sections |
+| 🐌 Slow follow-up responses | Enable `RAG_SYSTEM_CONTEXT=True` to fix KV cache invalidation |
 | Still confused? | Test with GPT-4o and compare outputs |
 ---
diff --git a/docs/tutorials/https/haproxy.md b/docs/tutorials/https/haproxy.md
index 8463374d5..7446965a3 100644
--- a/docs/tutorials/https/haproxy.md
+++ b/docs/tutorials/https/haproxy.md
@@ -117,6 +117,20 @@ backend owui_chat
     http-request add-header X-CLIENT-IP %[src]
     http-request set-header X-Forwarded-Proto https if { ssl_fc }
     server chat :3000
 ```
+
+## WebSocket and HTTP/2 Compatibility
+
+Starting with recent versions (including HAProxy 3.x), HAProxy may enable HTTP/2 by default. While HTTP/2 supports WebSockets (RFC 8441), some client or backend configurations may experience "freezes" or unresponsiveness when icons or data start loading via WebSockets over an H2 tunnel.
+
+If you experience these issues:
+1. **Force HTTP/1.1 for WebSockets**: Add `option h2-workaround-bogus-websocket-clients` to your `frontend` or `defaults` section. This prevents HAProxy from advertising RFC 8441 support to the client, forcing a fallback to the more stable HTTP/1.1 Upgrade mechanism.
+2. **Backend Version**: Ensure your backend connection is using HTTP/1.1 (the default for `mode http`).
+
+Example addition to your `defaults` or `frontend`:
+```shell
+defaults
+  # ... other settings
+  option h2-workaround-bogus-websocket-clients
+```
 You will see that we have ACL records (routers) for both Open WebUI and Let's Encrypt. To use WebSocket with OWUI, you need to have an SSL configured, and the easiest way to do that is to use Let's Encrypt.
diff --git a/docs/tutorials/https/nginx.md b/docs/tutorials/https/nginx.md
index 3de6d8eea..8989ea506 100644
--- a/docs/tutorials/https/nginx.md
+++ b/docs/tutorials/https/nginx.md
@@ -25,6 +25,17 @@ A very common and difficult-to-debug issue with WebSocket connections is a misco
 Failure to do so will cause WebSocket connections to fail, even if you have enabled "Websockets support" in Nginx Proxy Manager.
+### HTTP/2 and WebSockets
+
+If you enable **HTTP/2** on your Nginx server, ensure that your proxy configuration still uses **HTTP/1.1** for the connection to the Open WebUI backend. This is crucial because most WebUI features (like streaming and real-time updates) rely on WebSockets, which in many proxy environments are more stable when handled via the HTTP/1.1 `Upgrade` mechanism than over the newer RFC 8441 (WebSockets over H2).
+
+In your Nginx location block, always include:
+```nginx
+proxy_http_version 1.1;
+proxy_set_header Upgrade $http_upgrade;
+proxy_set_header Connection "upgrade";
+```
+
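+To quickly check whether your proxy passes the WebSocket upgrade through, you can send a raw upgrade request with `curl` (a rough sketch; the hostname is a placeholder and the `/ws/socket.io/` path may differ between Open WebUI versions):
+
+```bash
+# Expect "HTTP/1.1 101 Switching Protocols" if the upgrade is proxied correctly.
+curl -i -N --http1.1 --max-time 5 \
+  -H "Connection: Upgrade" \
+  -H "Upgrade: websocket" \
+  -H "Sec-WebSocket-Version: 13" \
+  -H "Sec-WebSocket-Key: aSBhbSBhIHdlYnNvY2tldA==" \
+  "https://chat.example.com/ws/socket.io/?EIO=4&transport=websocket"
+```
+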
 :::
 Choose the method that best fits your deployment needs.
diff --git a/docs/tutorials/tips/performance.md b/docs/tutorials/tips/performance.md
index fba1b8e95..007e11219 100644
--- a/docs/tutorials/tips/performance.md
+++ b/docs/tutorials/tips/performance.md
@@ -62,9 +62,17 @@ If you are using **OpenRouter** or any provider with hundreds/thousands of model
 Reuses the LLM-generated Web-Search search queries for RAG search within the same chat turn. This prevents redundant LLM calls when multiple retrieval features act on the same user prompt.
 - **Env Var**: `ENABLE_QUERIES_CACHE=True`
+  * *Note*: If enabled, the search query generated for Web Search is reused for RAG retrieval instead of generating a separate query, saving inference cost and API calls. For example, if the LLM generates "US News 2025" as the Web Search query, that same query is reused for RAG.
+
+#### KV Cache Optimization (RAG Performance)
+Drastically improves the speed of follow-up questions when chatting with large documents or knowledge bases.
+
+- **Env Var**: `RAG_SYSTEM_CONTEXT=True`
+- **Effect**: Injects RAG context into the **system message** instead of the user message.
+- **Why**: Many LLM engines (like Ollama, llama.cpp, or vLLM) and cloud providers (like OpenAI or Vertex AI) support **KV prefix caching** or **Prompt Caching**. System messages stay at the start of the conversation, while user messages shift position each turn. Moving the RAG context to the system message keeps the cached prefix valid, so the large context is processed once rather than on every turn, giving **near-instant follow-up responses**.
+
 ---
 ## 📦 Database Optimization
@@ -305,3 +313,4 @@ For detailed information on all available variables, see the [Environment Config
 | `AUDIO_STT_ENGINE` | [STT Engine](/getting-started/env-configuration#audio_stt_engine) |
 | `ENABLE_IMAGE_GENERATION` | [Image Generation](/getting-started/env-configuration#enable_image_generation) |
 | `ENABLE_AUTOCOMPLETE_GENERATION` | [Autocomplete](/getting-started/env-configuration#enable_autocomplete_generation) |
+| `RAG_SYSTEM_CONTEXT` | [RAG System Context](/getting-started/env-configuration#rag_system_context) |