# Venice API + Cursor IDE via LiteLLM Proxy

Venice models with 1M-token context windows (e.g. `claude-opus-4-6`, `claude-sonnet-4-6`) fail in Cursor because Cursor derives `max_tokens` from the context window and sends values that exceed Venice's output token limits.

Models with ≤200k context (e.g. `claude-opus-45`) work without a proxy.

LiteLLM sits between Cursor and Venice, clamping `max_tokens` to safe values.

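The clamping the proxy performs can be sketched as plain logic (illustrative names and a hand-written table; LiteLLM applies the equivalent internally from each model's configured limit):

```python
# Sketch of the proxy's behavior: cap the client's requested max_tokens
# at the model's real output limit before forwarding to Venice.
# The limits below mirror this setup's table; the dict is illustrative.

OUTPUT_LIMITS = {
    "claude-opus-4-6": 8192,
    "claude-sonnet-4-6": 8192,
    "openai-gpt-52": 16384,
}

def clamp_max_tokens(model: str, requested: int) -> int:
    """Return a max_tokens value the upstream API will accept."""
    limit = OUTPUT_LIMITS.get(model)
    return requested if limit is None else min(requested, limit)

# Cursor may request 232001 output tokens for a 1M-context model;
# the proxy forwards at most the model's limit.
print(clamp_max_tokens("claude-opus-4-6", 232001))  # 8192
```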
## Prerequisites

- `VENICE_API_KEY` exported in your shell (e.g. in `~/.zshrc`)
- Python 3.9+

## Setup

```bash
pip install 'litellm[proxy]'
```

## Start the proxy

```bash
litellm --config /path/to/litellm-config.yaml --port 8765
```

## Configure Cursor

1. Open **Settings > Models**
2. Add custom models by name: `claude-opus-4-6`, `openai-gpt-52`, etc.
3. Under **OpenAI API Key**, enter your Venice API key
4. Set **Override OpenAI Base URL** to `http://localhost:8765`

## Available models

| Model | Max Output Tokens | Notes |
|-------|-------------------|-------|
| `claude-opus-4-6` | 8192 | Anthropic's most capable reasoning model |
| `claude-opus-45` | 8192 | Works without proxy (198k context) |
| `claude-sonnet-4-6` | 8192 | Best speed/intelligence balance |
| `claude-sonnet-45` | 8192 | Works without proxy (198k context) |
| `openai-gpt-52` | 16384 | GPT-5.2 frontier model |
| `openai-gpt-52-codex` | 16384 | GPT-5.2 optimized for code |

## Adding models

Edit `litellm-config.yaml` following the existing pattern. Use Venice model IDs from their [models endpoint](https://docs.venice.ai/api-reference/endpoint/models/list).

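A minimal sketch of one entry, following LiteLLM's proxy config conventions — the `api_base` URL here is an assumption (verify against Venice's API docs), and the `model_info.max_tokens` value should match the table above:

```yaml
model_list:
  - model_name: claude-opus-4-6            # name Cursor will use
    litellm_params:
      model: openai/claude-opus-4-6        # route via LiteLLM's OpenAI-compatible provider
      api_base: https://api.venice.ai/api/v1   # assumed Venice base URL; check their docs
      api_key: os.environ/VENICE_API_KEY
    model_info:
      max_tokens: 8192                     # output cap LiteLLM enforces per request
```

Restart the proxy after editing the config so the new entry is picked up.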
## Why not use Venice directly?

Venice advertises `availableContextTokens: 1000000` for newer Claude/Gemini models. Cursor uses this to budget `max_tokens`, often requesting 200k+ output tokens. Venice rejects these with:

```
max_tokens: 232001 > 128000, which is the maximum allowed number of output tokens for claude-opus-4-6
```

The proxy prevents this by setting `model_info.max_tokens` per model, which LiteLLM uses to cap the `max_tokens` it forwards.
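The arithmetic behind the failure can be illustrated with the numbers from the error above (the prompt size is inferred, not from the source):

```python
# Cursor budgets output tokens from the advertised context window:
# roughly max_tokens = context_window - prompt_tokens.
context_window = 1_000_000      # Venice advertises availableContextTokens: 1000000
venice_output_limit = 128_000   # per the error message above

prompt_tokens = 767_999         # inferred value that yields the 232001 in the error
requested = context_window - prompt_tokens
print(requested)                        # 232001
print(requested > venice_output_limit)  # True -> Venice rejects the request
```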