Inline code autocomplete for VS Code, powered by your choice of LLM backend. Get ghost-text suggestions as you type — accept with Tab, dismiss with Escape.
Works with Ollama, Anthropic (Claude), vLLM, LM Studio, LiteLLM Gateway, or any server that speaks the OpenAI chat completions protocol.
Best for: privacy-conscious users, offline work, or trying out the extension for free.
- Install Ollama and pull a code model:

  ```shell
  ollama pull codellama:7b
  ```

- Open VS Code Settings (`Cmd+,` / `Ctrl+,`) and search for `typeAhead`:

  | Setting | Value |
  |---|---|
  | Backend | OpenAI Compatible |
  | Model | `codellama:7b` |
  | Api Base Url | `http://localhost:11434/v1` |

- Start typing in any file. Ghost text should appear after a brief pause.
Tip: Other good Ollama models for code completion: `deepseek-coder:6.7b`, `starcoder2:3b` (fast), `codellama:13b` (better quality).
Best for: highest quality completions using Claude models, when you have an Anthropic API key.
- Get an API key from console.anthropic.com
- Settings:

  | Setting | Value |
  |---|---|
  | Backend | Anthropic |
  | Model | `claude-haiku-4-5` (fast) or `claude-sonnet-4-6` (smarter) |
  | Api Key | Your Anthropic API key (starts with `sk-ant-`) |

  Leave Api Base Url empty; it defaults to `https://api.anthropic.com`.
Tip: If you leave Model empty, the extension uses the `ANTHROPIC_SMALL_FAST_MODEL` environment variable (if set), otherwise defaults to `claude-haiku-4-5`.
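Put together, the same setup in settings.json might look like the sketch below. Note two assumptions: the `"anthropic"` enum string mirrors the `"openai"` value used in this README's settings.json example (verify against the Backend dropdown in the settings UI), and the key is a placeholder you replace with your own.

```json
{
  "typeAhead.backend": "anthropic",
  "typeAhead.model": "claude-haiku-4-5",
  "typeAhead.apiKey": "sk-ant-YOUR-KEY-HERE"
}
```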
Best for: running larger models on powerful hardware, or using models not available in Ollama.
Both vLLM and LM Studio expose an OpenAI-compatible API.
vLLM:

```shell
vllm serve deepseek-ai/deepseek-coder-6.7b-instruct --port 8000
```

| Setting | Value |
|---|---|
| Backend | OpenAI Compatible |
| Model | deepseek-ai/deepseek-coder-6.7b-instruct |
| Api Base Url | http://localhost:8000/v1 |
LM Studio:
- Download a model in LM Studio and start the local server
- Use the model name shown in LM Studio's server tab
| Setting | Value |
|---|---|
| Backend | OpenAI Compatible |
| Model | (model name from LM Studio) |
| Api Base Url | http://localhost:1234/v1 |
Best for: organizations that route requests through a centralized LLM proxy, or when you need to use models from multiple providers through one endpoint.
LiteLLM is a proxy server that translates OpenAI-format requests to 100+ LLM providers.
| Setting | Value |
|---|---|
| Backend | LiteLLM Gateway |
| Model | Model name as configured in your LiteLLM proxy (e.g., gpt-4o-mini, claude-haiku) |
| Api Base Url | Your LiteLLM server URL (e.g., http://litellm.internal:4000/v1) |
| Api Key | Your LiteLLM API key (if required) |
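As a settings.json sketch using the example values from the table above (the `"litellm"` enum string is an assumption mirroring the README's `"openai"` example; verify against the Backend dropdown, and substitute your own proxy URL, model name, and key):

```json
{
  "typeAhead.backend": "litellm",
  "typeAhead.model": "gpt-4o-mini",
  "typeAhead.apiBaseUrl": "http://litellm.internal:4000/v1",
  "typeAhead.apiKey": "sk-your-litellm-key"
}
```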
The OpenAI Compatible backend works with any server that implements the /chat/completions endpoint in the OpenAI format. This includes:
- Ollama
- vLLM
- LM Studio
- LocalAI
- text-generation-webui (with OpenAI extension)
- llama.cpp server
- OpenAI API itself
- Azure OpenAI
- Any custom gateway
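For anyone wiring up a custom gateway: "OpenAI format" here means the standard chat completions request shape, POSTed to `{apiBaseUrl}/chat/completions`. A rough sketch of such a request body is shown below; the message contents are illustrative only (the actual prompt the extension builds is internal), but the field names are the protocol's standard ones, and the server is expected to reply with the completion in `choices[0].message.content`.

```json
{
  "model": "codellama:7b",
  "messages": [
    { "role": "system", "content": "Complete the user's code. Return only the continuation." },
    { "role": "user", "content": "def add(a, b):\n    retu" }
  ],
  "max_tokens": 128,
  "temperature": 0.2
}
```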
| Scenario | Backend | Why |
|---|---|---|
| Free, local, private | OpenAI Compatible + Ollama | No API key needed. Data stays on your machine. |
| Best quality completions | Anthropic | Claude models produce high-quality code completions. |
| Large models on GPU server | OpenAI Compatible + vLLM | vLLM is optimized for GPU inference throughput. |
| Quick local experimentation | OpenAI Compatible + LM Studio | GUI for downloading and running models. |
| Corporate/team setup | LiteLLM Gateway | Centralized proxy with auth, logging, rate limiting. |
| Custom internal LLM gateway | OpenAI Compatible | Works with any OpenAI-compatible endpoint. |
Open VS Code Settings (Cmd+, / Ctrl+,) and search for typeAhead.
| Setting | Type | Default | Description |
|---|---|---|---|
| `typeAhead.enabled` | boolean | `true` | Enable or disable the extension |
| `typeAhead.backend` | enum | `openai` | Backend: OpenAI Compatible, Anthropic, or LiteLLM Gateway |
| `typeAhead.model` | string | `""` | Model name (required for OpenAI/LiteLLM, optional for Anthropic) |
| `typeAhead.apiBaseUrl` | string | `""` | API base URL (required for OpenAI/LiteLLM, defaults to `https://api.anthropic.com` for Anthropic) |
| `typeAhead.apiKey` | string | `""` | Static API key. Leave empty for servers that need no auth (like local Ollama) |
| `typeAhead.apiKeyHelper` | string | `""` | Shell command that outputs an API key (overrides `apiKey`; see below) |
| `typeAhead.debounceMs` | number | `300` | Milliseconds to wait after you stop typing before requesting a completion |
| `typeAhead.contextLines` | number | `100` | Lines of code before and after the cursor to send as context |
| `typeAhead.excludePatterns` | string[] | `[]` | Glob patterns for files/folders where autocomplete is disabled (see below) |
| `typeAhead.customInstructions` | string | `""` | Custom instructions appended to the system prompt (see below) |
| `typeAhead.cacheSize` | number | `50` | Number of completions to cache. Set to 0 to disable caching |
You can also set these in your settings.json:

```json
{
  "typeAhead.backend": "openai",
  "typeAhead.model": "codellama:7b",
  "typeAhead.apiBaseUrl": "http://localhost:11434/v1"
}
```

Use `excludePatterns` to disable autocomplete in certain files, file types, or directories:
```json
{
  "typeAhead.excludePatterns": [
    "**/*.md",
    "**/*.json",
    "**/node_modules/**",
    "**/dist/**",
    ".env",
    "*.lock"
  ]
}
```

Supported patterns:
| Pattern | What it matches |
|---|---|
| `*.md` | Any file ending in `.md` |
| `**/*.json` | Any `.json` file in any directory |
| `**/node_modules/**` | Any file inside a `node_modules` folder |
| `.env` | A file named exactly `.env` |
| `src/secret.ts` | A specific file path |
| `**/Dockerfile` | A file named `Dockerfile` in any directory |
Add your own instructions that the LLM should follow when generating completions:
```json
{
  "typeAhead.customInstructions": "Always use TypeScript strict types. Prefer async/await over .then() chains. Use descriptive variable names."
}
```

In the VS Code settings UI, the `customInstructions` field supports multi-line text so you can write longer instructions.
These instructions are appended to the system prompt sent to the model with every completion request. Use them to enforce coding standards, style preferences, or project-specific conventions.
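If you edit settings.json directly instead, keep in mind that JSON strings cannot span lines, so longer instructions need `\n` escapes:

```json
{
  "typeAhead.customInstructions": "Always use TypeScript strict types.\nPrefer async/await over .then() chains.\nUse descriptive variable names."
}
```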
For environments where API keys are short-lived (corporate SSO, rotating tokens, etc.), you can configure a shell command that generates a fresh key. The extension runs this command:
- Once when VS Code opens (session start)
- Again automatically if the server returns a 401 or 403 error
Example: Using a custom CLI tool:

```json
{
  "typeAhead.apiKeyHelper": "my-company-cli get-api-token --service llm"
}
```

Example: Using environment-specific scripts:

```json
{
  "typeAhead.apiKeyHelper": "/path/to/get-llm-key.sh"
}
```

How it works:
- The extension runs your command and reads the API key from stdout
- The key is cached in memory for the session (not written to disk)
- If the server returns 401/403, the command is re-run to get a fresh key, and the request is retried
Priority: `apiKeyHelper` > `apiKey`. If both are set, the helper command wins.
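A helper can be any command or script; the only contract is that the key arrives on stdout. Below is a minimal sketch of a hypothetical `get-llm-key.sh` that reads an assumed `LLM_API_KEY` environment variable (with a dummy fallback so the sketch runs standalone); swap in whatever your token source is.

```shell
#!/bin/sh
# Print the API key and nothing else: the extension captures stdout verbatim,
# so banners or warnings must go to stderr, never stdout.
LLM_API_KEY="${LLM_API_KEY:-demo-key-for-testing}"  # fallback only for this sketch
if [ -z "$LLM_API_KEY" ]; then
  echo "error: no API key available" >&2
  exit 1
fi
printf '%s\n' "$LLM_API_KEY"
```

Exiting non-zero when no key is available is deliberate: a silent empty stdout would be cached as an empty key, while a failing exit surfaces the problem in the Output panel.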
Open the Command Palette (Cmd+Shift+P / Ctrl+Shift+P):
| Command | Description |
|---|---|
| Type Ahead: Toggle On/Off | Quickly enable or disable the extension |
The extension shows its status in the bottom-right of VS Code:
| Icon | Meaning |
|---|---|
| $(sparkle) Type Ahead | Ready — waiting for you to type |
| $(loading~spin) Type Ahead | Generating a completion |
| $(warning) Type Ahead | Error — click to toggle, check Output panel for details |
| $(circle-slash) Type Ahead | Disabled |
Click the status bar item to toggle the extension on/off.
- Check the status bar — is it showing "Type Ahead" or is it hidden?
- Open the Output panel (`Cmd+Shift+U`) and select Extension Host from the dropdown
- Look for log lines starting with `Type Ahead:` — they show the full request/response flow:

  ```
  Type Ahead: [auth] warming up API key at session start...
  Type Ahead: [auth] API key ready
  Type Ahead: [llm] POST http://localhost:11434/v1/chat/completions (model: codellama:7b)
  Type Ahead: [llm] auth: Bearer token set
  Type Ahead: [llm] response 200 in 342ms
  Type Ahead: [llm] completion: 28 chars
  ```
- Increase debounce: Set `debounceMs` to 500-1000ms for slow servers. This reduces unnecessary requests while you're still typing.
- Use a faster model: Smaller models respond faster. Try `starcoder2:3b` or `codellama:7b` instead of 13B+ models.
- Reduce context: Lower `contextLines` from 100 to 30-50. Less context = faster inference.
- The first completion is always slower because there's no cache. Subsequent completions at the same position are instant (cache hit).
- Your API key is invalid or expired
- If using `apiKeyHelper`, check that the command works: run it in your terminal and verify it outputs a key
- For Anthropic, make sure the key starts with `sk-ant-`
- The model name doesn't match what the server knows. Check:
  - Ollama: `ollama list` to see installed models
  - vLLM: check the model name you used in `vllm serve`
  - Anthropic: use `claude-haiku-4-5`, `claude-sonnet-4-6`, etc.
- You selected OpenAI Compatible or LiteLLM Gateway but didn't set a URL
- Set `apiBaseUrl` to your server's URL (e.g., `http://localhost:11434/v1` for Ollama)
- The server is not running or not reachable at the configured URL
- Check that your server is running: `curl http://localhost:11434/v1/models`
| Tip | Setting | Effect |
|---|---|---|
| Faster suggestions | `debounceMs: 150` | Triggers sooner after you stop typing (more API calls) |
| Less API usage | `debounceMs: 500` | Waits longer, fewer requests, saves tokens |
| Faster inference | `contextLines: 30` | Sends less code to the model |
| Better completions | `contextLines: 200` | More context = more accurate completions (slower) |
| Disable caching | `cacheSize: 0` | Every request goes to the server (useful for testing) |
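For example, a starting point tuned for a slow local server combines the "less API usage" and "faster inference" rows in settings.json (adjust the numbers to taste):

```json
{
  "typeAhead.debounceMs": 500,
  "typeAhead.contextLines": 30
}
```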
The extension works with all programming languages supported by VS Code. The model receives the file name and language identifier along with the surrounding code, so it can adapt its completions to the language you're working in.
- Local models (Ollama, vLLM, LM Studio): Your code never leaves your machine.
- Anthropic / LiteLLM / remote servers: Code context (up to `contextLines` lines around your cursor) is sent to the configured API endpoint. No data is stored by the extension itself.
- API keys: Stored in VS Code settings (on disk). For sensitive environments, use `apiKeyHelper` to generate keys dynamically — they are only held in memory.
- No telemetry: The extension does not collect or send any usage data.