
🚀 Razer AIKit + Visionary Tool Server Deployment Guide

Complete Integration for Local LLM Infrastructure

**Created:** February 2, 2026 · **Author:** Claude Opus 4.5 + Jeremy · **Status:** Ready for Deployment · **Estimated Time:** 30-45 minutes


📋 Executive Summary

This guide deploys a drop-in replacement containing:

  • Razer AIKit - vLLM-powered local inference with 280K+ models
  • Visionary Tool Server - 312+ MCP tools with new AIKit bridge (22 tools)
  • ngrok - Remote access tunnel (same domain as before)

Your existing setup stays on ice as backup. Same port 8082, same ngrok domain.

What Changes

| Before | After |
| --- | --- |
| 13 LLM API providers | 1 unified AIKit endpoint |
| $80-130/mo potential API costs | ~$25-30/mo (Claude API only) |
| 67 local models (Ollama + LMStudio) | 280,000+ HuggingFace models |
| No fine-tuning capability | Full LoRA/QLoRA/DPO fine-tuning |
| Standard inference | vLLM optimized (2-3x faster) |

What Stays the Same

  • All 312 existing MCP tools work identically
  • Your workflow (Claude/ChatGPT/Gemini → Tool Server) unchanged
  • Same port 8082 - drop-in replacement
  • Same ngrok domain - no client config changes
  • ngrok tunnel for mobile/remote access
  • Docker Desktop "click to run" simplicity

🔧 Prerequisites Checklist

Before starting, verify these are installed/configured:

Required

  • Docker Desktop running with WSL2 backend
  • NVIDIA Container Toolkit in WSL2 (for GPU passthrough)
  • ngrok account with custom domain capability
  • HuggingFace account (free, for model access)

Verify Docker GPU Support

# In PowerShell
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Should show your RTX 4090. If not, see Troubleshooting section.

Verify WSL2

wsl --list --verbose
# Should show at least one distro with VERSION 2

📁 File Structure

After deployment, your project will look like:

D:\DEV_PROJECTS\GitHub\Claude_Opus_ChatGPT_App_Project\
├── main.py                          # Entry point (unchanged)
├── docker-compose.yml               # Original Sandbox A
├── docker-compose.aikit.yml         # NEW: Sandbox B with AIKit
├── Dockerfile                        # Original
├── Dockerfile.aikit                  # NEW: AIKit-optimized
├── .env.master                       # API keys (unchanged)
├── RAZER_AIKIT_DEPLOYMENT.md        # This guide
├── app/
│   ├── server.py
│   ├── config.py
│   ├── utils.py
│   └── tools/
│       ├── github.py                # Existing (35 tools)
│       ├── discord.py               # Existing
│       ├── razer_aikit.py           # NEW: 22 AIKit tools
│       └── ... (60+ more modules)
└── D:\Visionary_Models\
    ├── aikit/                       # NEW: AIKit model storage
    └── aikit-cache/                 # NEW: HuggingFace cache

🚀 Deployment Steps

Step 1: Create Required Directories

# Create AIKit storage directories
New-Item -ItemType Directory -Force -Path "D:\Visionary_Models\aikit"
New-Item -ItemType Directory -Force -Path "D:\Visionary_Models\aikit-cache"

Step 2: Copy Deployment Files

Copy these files to your project root (D:\DEV_PROJECTS\GitHub\Claude_Opus_ChatGPT_App_Project\):

  1. docker-compose.aikit.yml → Project root
  2. Dockerfile.aikit → Project root
  3. app/tools/razer_aikit.py → app/tools/ directory

Step 3: Register AIKit Module in Server

Edit app/server.py to import the new module. Add this line with the other tool imports:

# In app/server.py, add with other imports:
from app.tools import razer_aikit

Or if using dynamic import pattern, add to the tools list:

# If you have a TOOL_MODULES list:
TOOL_MODULES = [
    # ... existing modules ...
    "razer_aikit",  # ADD THIS LINE
]
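
If the server uses the dynamic import pattern, the registration loop might look like the sketch below. This is a hypothetical helper, not code from the existing `app/server.py` — the `app.tools` package path and `TOOL_MODULES` list come from the layout above:

```python
import importlib

def load_tool_modules(package: str, module_names: list[str]) -> dict:
    """Import each tool module under the given package and return them keyed by name."""
    loaded = {}
    for name in module_names:
        # Resolves e.g. "app.tools.razer_aikit" at runtime; an empty
        # package means the names are already fully qualified.
        full = f"{package}.{name}" if package else name
        loaded[name] = importlib.import_module(full)
    return loaded

# In app/server.py this might be called as:
# TOOLS = load_tool_modules("app.tools", TOOL_MODULES)
```

Because `razer_aikit` is just another entry in the list, a failed import of one module can be caught and logged without taking down the other 60+ tool modules.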

Step 4: Configure ngrok Domain (Optional)

If you want a separate ngrok domain for the new sandbox, update docker-compose.aikit.yml:

# In ngrok service, change domain:
command: >
  http tool-server:8082
  --domain=visionary-aikit-sandbox.ngrok.io  # Your custom domain

Or use your existing domain by changing port mapping.

Step 5: Add HuggingFace Token to Environment

Add to your .env.master:

# Add this line (get token from https://huggingface.co/settings/tokens)
HUGGINGFACE_API_KEY=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx

Step 6: Build and Start the Stack

# Navigate to project
cd D:\DEV_PROJECTS\GitHub\Claude_Opus_ChatGPT_App_Project

# Build the containers (first time takes ~5-10 minutes)
docker compose -f docker-compose.aikit.yml build

# Start the stack
docker compose -f docker-compose.aikit.yml up -d

# Check status
docker compose -f docker-compose.aikit.yml ps

# View logs
docker compose -f docker-compose.aikit.yml logs -f

Step 7: Verify Deployment

# Check AIKit health
curl http://localhost:8000/health

# Check Tool Server health  
curl http://localhost:8082/health

# Should return: {"status":"healthy","tools":334}
# (312 original + 22 new AIKit tools)

Step 8: Test AIKit Tools

Via Tool Server MCP or direct API call:

# Test chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "messages": [{"role": "user", "content": "Write a Python hello world"}],
    "max_tokens": 100
  }'

Step 9: Start ngrok Tunnel (Optional)

For mobile/remote access:

# Start with tunnel profile
docker compose -f docker-compose.aikit.yml --profile tunnel up -d ngrok

# Check tunnel
curl http://localhost:4040/api/tunnels

🎮 Daily Usage

Starting the Stack

Option A: Docker Desktop UI

  1. Open Docker Desktop
  2. Find "visionary-aikit-stack"
  3. Click ▶️ Run

Option B: Command Line

docker compose -f docker-compose.aikit.yml up -d

Stopping the Stack

Option A: Docker Desktop UI

  1. Find "visionary-aikit-stack"
  2. Click ⏹️ Stop

Option B: Command Line

docker compose -f docker-compose.aikit.yml down

Viewing Logs

# All services
docker compose -f docker-compose.aikit.yml logs -f

# Specific service
docker compose -f docker-compose.aikit.yml logs -f aikit
docker compose -f docker-compose.aikit.yml logs -f tool-server

Switching Models

The default model is deepseek-ai/deepseek-coder-1.3b-instruct (fast, small).

To use a different model, either:

  1. Per-request: Specify model parameter in API calls
  2. Default change: Edit docker-compose.aikit.yml command section

Popular models to try:

  • Qwen/Qwen2.5-7B-Instruct - Great all-rounder
  • microsoft/phi-4 - Strong reasoning
  • Qwen/Qwen2.5-Coder-32B-Instruct - Best coding quality
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-7B - Chain-of-thought
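
The per-request option maps directly onto the OpenAI-compatible API: the `model` field in the request body selects which model serves that call. A minimal standard-library sketch (endpoint and model names are from the sections above; whether a given model is already downloaded and loaded depends on your AIKit configuration — the network call only runs when executed directly):

```python
import json
import urllib.request

def chat_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build an OpenAI-style chat-completion payload for the local endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def post_chat(payload: dict,
              url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to the AIKit endpoint and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Same call, different model -- only the "model" field changes per request
    fast = chat_request("deepseek-ai/deepseek-coder-1.3b-instruct", "Write a haiku")
    strong = chat_request("Qwen/Qwen2.5-7B-Instruct", "Write a haiku")
    print(post_chat(fast)["choices"][0]["message"]["content"])
```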

🔌 Connecting Clients

NOTHING CHANGES - same endpoints as before!

Claude.ai / Claude Mobile

Use MCP connector URL:

http://localhost:8082/sse        # Local
https://visionary-tool-server.ngrok.io/sse  # Remote

ChatGPT Desktop / Mobile

Same MCP endpoint:

https://visionary-tool-server.ngrok.io/sse

AnythingLLM Desktop & Mobile

Settings → Agent Skills → Add MCP Server:

Name: Visionary Tool Server
URL: http://localhost:8082/sse

Already configured? It just works. No changes needed.

Direct API Access

import httpx

# Chat with local LLM
response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=120.0,  # first request may trigger a model load; allow well beyond httpx's 5s default
)
print(response.json())

🛠️ New AIKit Tools Reference

Core Inference (4 tools)

| Tool | Description |
| --- | --- |
| `aikit_chat` | Chat completion (OpenAI-compatible) |
| `aikit_complete` | Text completion |
| `aikit_embed` | Generate embeddings |
| `aikit_code_assist` | Specialized code assistance |

Model Management (5 tools)

| Tool | Description |
| --- | --- |
| `aikit_list_models` | List available models |
| `aikit_pull_model` | Download from HuggingFace |
| `aikit_model_info` | Get model metadata |
| `aikit_load_model` | Load model into memory |
| `aikit_unload_model` | Remove model from memory |

Fine-Tuning (5 tools)

| Tool | Description |
| --- | --- |
| `aikit_finetune_start` | Start LoRA/QLoRA training |
| `aikit_finetune_status` | Check training progress |
| `aikit_finetune_stop` | Cancel training |
| `aikit_finetune_list` | List all jobs |
| `aikit_merge_adapter` | Merge adapter with base |

System & Monitoring (4 tools)

| Tool | Description |
| --- | --- |
| `aikit_health` | Server health check |
| `aikit_gpu_status` | GPU metrics |
| `aikit_cluster_status` | Ray cluster info |
| `aikit_benchmark` | Performance testing |

Convenience (2 tools)

| Tool | Description |
| --- | --- |
| `aikit_quick_chat` | Simple one-turn chat |
| `aikit_recommend_model` | Get model suggestions |

Total: 22 new tools


🔧 Troubleshooting

Docker GPU Not Working

# Reinstall NVIDIA Container Toolkit in WSL2
wsl -d Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

AIKit Container Won't Start

# Check logs
docker compose -f docker-compose.aikit.yml logs aikit

# Common issues:
# - Out of GPU memory: Use smaller model
# - Model download failed: Check HuggingFace token
# - Port conflict: Change ports in docker-compose

Port 8000/8082 Already in Use

# Find what's using the port
netstat -ano | findstr :8000
netstat -ano | findstr :8082

# Kill the process
taskkill /PID <pid> /F

# Or change ports in docker-compose.aikit.yml

Model Too Large for VRAM

Your RTX 4090 has 24GB VRAM. Max model sizes:

  • 7B models: ~14GB VRAM (comfortable)
  • 13B models: ~22GB VRAM (tight)
  • 32B+ models: Requires quantization
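
The sizes above follow a simple rule of thumb: weights alone need roughly parameter count × bytes per parameter, plus headroom for the KV cache and activations. A back-of-the-envelope helper (weights only, so treat it as a lower bound — the figures above include runtime overhead):

```python
def weight_vram_gb(params: float, bits: int = 16) -> float:
    """Approximate VRAM needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9

# 7B at fp16: 7e9 params * 2 bytes = 14 GB -- matches the "~14GB" figure above
print(weight_vram_gb(7e9))           # 14.0
# 32B at 4-bit (AWQ): 16 GB of weights -- why 32B+ needs quantization to fit in 24GB
print(weight_vram_gb(32e9, bits=4))  # 16.0
```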

Use quantized versions:

# In docker-compose, change model to quantized version:
aikit run Qwen/Qwen2.5-32B-Instruct-AWQ --quantization awq

Tool Server Can't Connect to AIKit

# Check if AIKit is running
docker compose -f docker-compose.aikit.yml ps

# Check network
docker network inspect aikit-network

# Verify internal DNS
docker compose -f docker-compose.aikit.yml exec tool-server curl http://aikit:8000/health

📊 Performance Expectations

Inference Speed (RTX 4090)

| Model Size | Tokens/sec | First Token Latency |
| --- | --- | --- |
| 1-3B | 150-200 | <100ms |
| 7B | 80-120 | 200-500ms |
| 13B | 40-60 | 500-1000ms |
| 32B (quantized) | 20-40 | 1-2s |

Memory Usage

| Component | RAM | VRAM |
| --- | --- | --- |
| AIKit (7B model) | ~4GB | ~14GB |
| Tool Server | ~1GB | ~0 |
| Docker overhead | ~2GB | ~0 |
| **Total** | **~7GB** | **~14GB** |

🔄 Rollback Plan

If anything goes wrong, your original setup is untouched:

# Stop new stack
docker compose -f docker-compose.aikit.yml down

# Use original stack
docker compose up -d

# Or run directly
python main.py

Your Sandbox A on port 8082 works exactly as before.


✅ Verification Checklist

After deployment, verify these work:

  • curl http://localhost:8000/health returns healthy
  • curl http://localhost:8082/health shows 334 tools
  • AIKit chat completion works with test prompt
  • Tool Server can call aikit_chat tool
  • Existing GitHub tools still work
  • ngrok tunnel accessible (if enabled)
  • Claude.ai can connect via MCP
  • GPU visible in container (nvidia-smi)


🎉 Success!

Once verified, you have:

  • 334 MCP tools (312 original + 22 AIKit)
  • 280,000+ local LLM models via HuggingFace
  • vLLM optimized inference (2-3x faster)
  • Fine-tuning capability (LoRA, QLoRA, DPO)
  • $50-100/month savings on API costs
  • Complete data privacy - nothing leaves your machine

Your AI orchestration now runs through:

You → Claude/ChatGPT/Gemini → Visionary Tool Server → Razer AIKit → RTX 4090

Welcome to the future of local AI infrastructure! 🚀