Created: February 2, 2026
Author: Claude Opus 4.5 + Jeremy
Status: Ready for Deployment
Estimated Time: 30-45 minutes
This guide deploys a drop-in replacement containing:
- Razer AIKit - vLLM-powered local inference with 280K+ models
- Visionary Tool Server - 312+ MCP tools with new AIKit bridge (22 tools)
- ngrok - Remote access tunnel (same domain as before)
Your existing setup stays on ice as backup. Same port 8082, same ngrok domain.
| Before | After |
|---|---|
| 13 LLM API providers | 1 unified AIKit endpoint |
| $80-130/mo potential API costs | ~$25-30/mo (Claude API only) |
| 67 local models (Ollama + LMStudio) | 280,000+ HuggingFace models |
| No fine-tuning capability | Full LoRA/QLoRA/DPO fine-tuning |
| Standard inference | vLLM optimized (2-3x faster) |
- All 312 existing MCP tools work identically
- Your workflow (Claude/ChatGPT/Gemini → Tool Server) unchanged
- Same port 8082 - drop-in replacement
- Same ngrok domain - no client config changes
- ngrok tunnel for mobile/remote access
- Docker Desktop "click to run" simplicity
Before starting, verify these are installed/configured:
- Docker Desktop running with WSL2 backend
- NVIDIA Container Toolkit in WSL2 (for GPU passthrough)
- ngrok account with custom domain capability
- HuggingFace account (free, for model access)
# In PowerShell
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

Should show your RTX 4090. If not, see the Troubleshooting section.
wsl --list --verbose
# Should show at least one distro with VERSION 2

After deployment, your project will look like:
D:\DEV_PROJECTS\GitHub\Claude_Opus_ChatGPT_App_Project\
├── main.py # Entry point (unchanged)
├── docker-compose.yml # Original Sandbox A
├── docker-compose.aikit.yml # NEW: Sandbox B with AIKit
├── Dockerfile # Original
├── Dockerfile.aikit # NEW: AIKit-optimized
├── .env.master # API keys (unchanged)
├── RAZER_AIKIT_DEPLOYMENT.md # This guide
├── app/
│ ├── server.py
│ ├── config.py
│ ├── utils.py
│ └── tools/
│ ├── github.py # Existing (35 tools)
│ ├── discord.py # Existing
│ ├── razer_aikit.py # NEW: 22 AIKit tools
│ └── ... (60+ more modules)
└── D:\Visionary_Models\
├── aikit/ # NEW: AIKit model storage
└── aikit-cache/ # NEW: HuggingFace cache
# Create AIKit storage directories
New-Item -ItemType Directory -Force -Path "D:\Visionary_Models\aikit"
New-Item -ItemType Directory -Force -Path "D:\Visionary_Models\aikit-cache"

Copy these files to your project root (D:\DEV_PROJECTS\GitHub\Claude_Opus_ChatGPT_App_Project\):
- `docker-compose.aikit.yml` → project root
- `Dockerfile.aikit` → project root
- `app/tools/razer_aikit.py` → `app/tools/` directory
Edit app/server.py to import the new module. Add this line with the other tool imports:
# In app/server.py, add with other imports:
from app.tools import razer_aikit

Or, if using a dynamic import pattern, add it to the tools list:
# If you have a TOOL_MODULES list:
TOOL_MODULES = [
# ... existing modules ...
"razer_aikit", # ADD THIS LINE
]

If you want a separate ngrok domain for the new sandbox, update docker-compose.aikit.yml:
# In ngrok service, change domain:
command: >
http tool-server:8083
--domain=visionary-aikit-sandbox.ngrok.io # Your custom domain

Or use your existing domain by changing the port mapping.
Add to your .env.master:
# Add this line (get token from https://huggingface.co/settings/tokens)
HUGGINGFACE_API_KEY=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Navigate to project
cd D:\DEV_PROJECTS\GitHub\Claude_Opus_ChatGPT_App_Project
# Build the containers (first time takes ~5-10 minutes)
docker compose -f docker-compose.aikit.yml build
# Start the stack
docker compose -f docker-compose.aikit.yml up -d
# Check status
docker compose -f docker-compose.aikit.yml ps
# View logs
docker compose -f docker-compose.aikit.yml logs -f

# Check AIKit health
curl http://localhost:8000/health
# Check Tool Server health
curl http://localhost:8082/health
# Should return: {"status":"healthy","tools":334}
# (312 original + 22 new AIKit tools)

Via Tool Server MCP or direct API call:
# Test chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/deepseek-coder-1.3b-instruct",
"messages": [{"role": "user", "content": "Write a Python hello world"}],
"max_tokens": 100
}'

For mobile/remote access:
# Start with tunnel profile
docker compose -f docker-compose.aikit.yml --profile tunnel up -d ngrok
# Check tunnel
curl http://localhost:4040/api/tunnels

Option A: Docker Desktop UI
- Open Docker Desktop
- Find "visionary-aikit-stack"
- Click ▶️ Run
Option B: Command Line
docker compose -f docker-compose.aikit.yml up -d

Option A: Docker Desktop UI
- Find "visionary-aikit-stack"
- Click ⏹️ Stop
Option B: Command Line
docker compose -f docker-compose.aikit.yml down

# All services
docker compose -f docker-compose.aikit.yml logs -f
# Specific service
docker compose -f docker-compose.aikit.yml logs -f aikit
docker compose -f docker-compose.aikit.yml logs -f tool-server

The default model is deepseek-ai/deepseek-coder-1.3b-instruct (fast, small).
To use a different model, either:
- Per-request: specify the `model` parameter in API calls
- Default change: edit the `command` section in `docker-compose.aikit.yml`
Popular models to try:
- `Qwen/Qwen2.5-7B-Instruct` - Great all-rounder
- `microsoft/phi-4` - Strong reasoning
- `Qwen/Qwen2.5-Coder-32B-Instruct` - Best coding quality
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` - Chain-of-thought
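Per-request model selection is just a different `model` field in the request body. A minimal sketch, assuming the OpenAI-compatible API on localhost:8000; the task-to-model mapping below is illustrative, built from the suggestions above:

```python
import json
import urllib.request

# Illustrative task -> model routing, using the models suggested above.
TASK_MODELS = {
    "general": "Qwen/Qwen2.5-7B-Instruct",
    "reasoning": "microsoft/phi-4",
    "coding": "Qwen/Qwen2.5-Coder-32B-Instruct",
}

def chat_payload(task: str, prompt: str) -> dict:
    """Build an OpenAI-style request body, choosing the model per request."""
    model = TASK_MODELS.get(task, "deepseek-ai/deepseek-coder-1.3b-instruct")
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # See which models the server currently exposes, then route a request.
    with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
        print([m["id"] for m in json.load(resp)["data"]])
    out = post_json("http://localhost:8000/v1/chat/completions",
                    chat_payload("coding", "Write a binary search in Python"))
    print(out["choices"][0]["message"]["content"])
```

Note that a model named in a request must already be served (or pulled via `aikit_pull_model`) before the call will succeed.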
NOTHING CHANGES - same endpoints as before!
Use MCP connector URL:
http://localhost:8082/sse # Local
https://visionary-tool-server.ngrok.io/sse # Remote
Same MCP endpoint:
https://visionary-tool-server.ngrok.io/sse
Settings → Agent Skills → Add MCP Server:
Name: Visionary Tool Server
URL: http://localhost:8082/sse
Already configured? It just works. No changes needed.
import httpx
# Chat with local LLM
response = httpx.post(
"http://localhost:8000/v1/chat/completions",
json={
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
}
)
print(response.json())

| Tool | Description |
|---|---|
| `aikit_chat` | Chat completion (OpenAI-compatible) |
| `aikit_complete` | Text completion |
| `aikit_embed` | Generate embeddings |
| `aikit_code_assist` | Specialized code assistance |
| Tool | Description |
|---|---|
| `aikit_list_models` | List available models |
| `aikit_pull_model` | Download from HuggingFace |
| `aikit_model_info` | Get model metadata |
| `aikit_load_model` | Load model into memory |
| `aikit_unload_model` | Remove model from memory |
| Tool | Description |
|---|---|
| `aikit_finetune_start` | Start LoRA/QLoRA training |
| `aikit_finetune_status` | Check training progress |
| `aikit_finetune_stop` | Cancel training |
| `aikit_finetune_list` | List all jobs |
| `aikit_merge_adapter` | Merge adapter with base |
| Tool | Description |
|---|---|
| `aikit_health` | Server health check |
| `aikit_gpu_status` | GPU metrics |
| `aikit_cluster_status` | Ray cluster info |
| `aikit_benchmark` | Performance testing |
| Tool | Description |
|---|---|
| `aikit_quick_chat` | Simple one-turn chat |
| `aikit_recommend_model` | Get model suggestions |
Total: 22 new tools
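Many of these tools are thin wrappers over the OpenAI-compatible HTTP API. As an illustration only (not the tool's actual implementation), a call in the spirit of `aikit_embed` could be sketched as follows; the endpoint path follows the OpenAI API shape and the model name is an assumption:

```python
import json
import urllib.request

def embed_payload(texts: list[str], model: str) -> dict:
    """OpenAI-style /v1/embeddings request body."""
    return {"model": model, "input": texts}

if __name__ == "__main__":
    # Model name is illustrative; use an embedding model your instance serves.
    body = json.dumps(
        embed_payload(["hello world"], "BAAI/bge-small-en-v1.5")).encode()
    req = urllib.request.Request(
        "http://localhost:8000/v1/embeddings", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        vectors = [d["embedding"] for d in json.load(resp)["data"]]
    print(len(vectors), len(vectors[0]))  # count and dimensionality
```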
# Reinstall NVIDIA Container Toolkit in WSL2
wsl -d Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Check logs
docker compose -f docker-compose.aikit.yml logs aikit
# Common issues:
# - Out of GPU memory: Use smaller model
# - Model download failed: Check HuggingFace token
# - Port conflict: Change ports in docker-compose

# Find what's using the port
netstat -ano | findstr :8000
netstat -ano | findstr :8082
# Kill the process
taskkill /PID <pid> /F
# Or change ports in docker-compose.aikit.yml

Your RTX 4090 has 24GB VRAM. Max model sizes:
- 7B models: ~14GB VRAM (comfortable)
- 13B models: ~22GB VRAM (tight)
- 32B+ models: Requires quantization
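The arithmetic behind these limits: fp16 weights take roughly 2 bytes per parameter, so weights alone for a 7B model need about 14 GB, before the KV cache and activations claim more. A back-of-envelope helper (a sketch, not vLLM's actual memory accounting):

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights alone; KV cache/activations add several GB."""
    return params_billion * bytes_per_param

# fp16: 7B fits a 24 GB card, 32B does not.
print(weights_gb(7))        # 14.0
print(weights_gb(32))       # 64.0
# ~4-bit AWQ (~0.5 bytes/param) brings 32B back within reach.
print(weights_gb(32, 0.5))  # 16.0
```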
Use quantized versions:
# In docker-compose, change model to quantized version:
aikit run Qwen/Qwen2.5-32B-Instruct-AWQ --quantization awq

# Check if AIKit is running
docker compose -f docker-compose.aikit.yml ps
# Check network
docker network inspect aikit-network
# Verify internal DNS
docker compose -f docker-compose.aikit.yml exec tool-server curl http://aikit:8000/health

| Model Size | Tokens/sec | First Token Latency |
|---|---|---|
| 1-3B | 150-200 | <100ms |
| 7B | 80-120 | 200-500ms |
| 13B | 40-60 | 500-1000ms |
| 32B (quantized) | 20-40 | 1-2s |
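Numbers like these can be sanity-checked with a quick timing script. A sketch, assuming the OpenAI-compatible endpoint on localhost:8000 and the default model; results vary with prompt length and load:

```python
import json
import time
import urllib.request

def tokens_per_sec(completion_tokens: int, elapsed_s: float) -> float:
    """Whole-request throughput, first-token latency included."""
    return completion_tokens / elapsed_s

if __name__ == "__main__":
    body = json.dumps({
        "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
        "messages": [{"role": "user", "content": "Count from 1 to 100."}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    elapsed = time.perf_counter() - t0
    print(f"{tokens_per_sec(usage['completion_tokens'], elapsed):.0f} tok/s")
```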
| Component | RAM | VRAM |
|---|---|---|
| AIKit (7B model) | ~4GB | ~14GB |
| Tool Server | ~1GB | ~0 |
| Docker overhead | ~2GB | ~0 |
| Total | ~7GB | ~14GB |
If anything goes wrong, your original setup is untouched:
# Stop new stack
docker compose -f docker-compose.aikit.yml down
# Use original stack
docker compose up -d
# Or run directly
python main.py

Your Sandbox A on port 8082 works exactly as before.
After deployment, verify these work:
- `curl http://localhost:8000/health` returns healthy
- `curl http://localhost:8083/health` shows 334 tools
- AIKit chat completion works with test prompt
- Tool Server can call `aikit_chat` tool
- Existing GitHub tools still work
- ngrok tunnel accessible (if enabled)
- Claude.ai can connect via MCP
- GPU visible in container (`nvidia-smi`)
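The first few checks script easily. A minimal smoke test, using the ports from this guide and assuming the health bodies carry a `status` field as in the example response earlier:

```python
import json
import urllib.request

# Endpoints as used in this guide; adjust if you changed the port mappings.
CHECKS = {
    "AIKit health": "http://localhost:8000/health",
    "Tool Server health": "http://localhost:8082/health",
}

def fetch_json(url: str, timeout: float = 5.0) -> dict:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def run_checks(fetch=fetch_json) -> dict:
    """Return pass/fail per endpoint; `fetch` is injectable for testing."""
    results = {}
    for name, url in CHECKS.items():
        try:
            results[name] = fetch(url).get("status") == "healthy"
        except Exception:  # connection refused, timeout, bad JSON, ...
            results[name] = False
    return results

if __name__ == "__main__":
    for name, ok in run_checks().items():
        print("PASS" if ok else "FAIL", name)
```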
- Razer AIKit GitHub
- Razer AIKit Docs
- vLLM Documentation
- LlamaFactory (Fine-tuning)
- HuggingFace Models
- Ray Dashboard (when running)
Once verified, you have:
- 334 MCP tools (312 original + 22 AIKit)
- 280,000+ local LLM models via HuggingFace
- vLLM optimized inference (2-3x faster)
- Fine-tuning capability (LoRA, QLoRA, DPO)
- $50-100/month savings on API costs
- Complete data privacy - nothing leaves your machine
Your AI orchestration now runs through:
You → Claude/ChatGPT/Gemini → Visionary Tool Server → Razer AIKit → RTX 4090
Welcome to the future of local AI infrastructure! 🚀