This prototype demonstrates speculative decoding across heterogeneous hardware:
- DGX/Server (verifier): Higher-precision model (Q8_0) for verification
- Mac M3 Ultra (drafter): Lower-precision model (Q4_K_M) for fast draft generation
Alternative approach (prefill/decode split): Some implementations split inference by having DGX do prefill and Mac do decode, streaming the KV cache between them. This requires:
- ❌ 10GbE network (expensive)
- ❌ Streaming 500MB-2GB KV cache per request
- ❌ Complex KV serialization/deserialization
- ❌ DGX sits idle during decode phase
- ✅ 4x mentioned speedup (However, this will be limited to the Mac's bottleneck)
Here is a diagram of this "split-pipe" architecture...
Our approach (speculative decoding): Both machines work continuously with minimal coordination:
- ✅ Works on standard network (1GbE/WiFi)
- ✅ Only 20-50 bytes transferred per iteration (token IDs only)
- ✅ No KV cache streaming needed
- ✅ Both machines fully utilized (90%+ utilization)
- ✅ 2x measured speedup with simple implementation
Does this approach require more VRAM on the DGX? No. Hybrid speculative decoding does not add any VRAM overhead on the DGX. The DGX holds the same full-precision model and single KV-cache it would use for normal inference. All draft generation and its KV-cache live on the Mac, so the DGX memory footprint remains unchanged. This is a major difference from split-pipeline approaches, which often require additional KV-cache storage and can increase VRAM usage by 500MB–2GB or more.
This new architecture solves the bottleneck...
Each machine maintains its own KV cache independently. Only small token IDs (~4 integers) are exchanged for verification. This eliminates the network bottleneck entirely.
Speculative decoding accelerates LLM inference by:
- Drafting multiple tokens quickly with a smaller/quantized model (Mac M3)
- Verifying drafts in parallel with a larger/precise model (DGX/Server)
- Accepting matching tokens and continuing generation
Benefits:
- 1.5-3x speedup with high acceptance rates (typically 40-80%)
- Minimal network overhead (~50 bytes per iteration vs 500MB+ for KV streaming)
- Full hardware utilization (both machines busy throughout)
- Python: 3.11+
- Hardware:
- Mac with Apple Silicon (M1/M2/M3/M4) for Metal acceleration
- Any machine with GPU for the server (or CPU if no GPU available)
Both client and server need the TinyLlama GGUF models:
# Create models directory
mkdir -p models
cd models
# Download 4-bit model (for Mac drafter - ~637 MB)
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# Download 8-bit model (for server verifier - ~1.1 GB)
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf
cd ..Note: Place the models in a models/ directory in both DGX_Spark/ and Mac_Studio/ folders (or use symlinks).
cd DGX_Spark
# Create virtual environment
python3 -m venv venv_dgx
source venv_dgx/bin/activate
# Install dependencies
pip install llama-cpp-python fastapi uvicorn
# For GPU support (optional, if you have CUDA):
# CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# Start the server
uvicorn main:app --host 127.0.0.1 --port 8000Expected output:
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000
Open a new terminal:
cd Mac_Studio
# Create virtual environment
python3 -m venv venv_mac
source venv_mac/bin/activate
# Install dependencies with Metal support
CMAKE_ARGS="-DLLAMA_METAL=on" pip install -r requirements.txt
# Run the drafter client
python draft_token_generator.py \
--dgx http://127.0.0.1:8000 \
--prompt "best sport in the world is" \
--draft_n 4 \
--max_tokens 80python draft_token_generator.py \
--dgx http://127.0.0.1:8000 \
--prompt "Your prompt here" \
--draft_n 4 \
--max_tokens 80--dgx: Server URL (default:http://127.0.0.1:8000)--model: Path to draft model GGUF file (default:tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf)--prompt: Input prompt text--draft_n: Number of tokens to draft per iteration (default: 4)--max_tokens: Maximum tokens to generate (default: 80)
Loading 4-bit model: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
✓ Model loaded
Prompt: best sport in the world is
✓ Prefill done (7 tokens)
Iter 1: 7/80 tokens
Current context preview: ...'best sport in the world is'
Generated text: ' cricket.\n'
Draft tokens: [259, 699, 3522, 29889, 13]
Draft text: ' cricket.\n'
✅ 5/5 accepted: ' cricket.\n'
Iter 2: 12/80 tokens
Current context preview: ...'best sport in the world is cricket.\n'
Generated text: '\n2. Cr'
Draft tokens: [29871, 13, 29906, 29889, 6781]
Draft text: ' \n2. Cr'
✅ 5/5 accepted: ' \n2. Cr'
Iter 3:...
============================================================
SPECULATIVE DECODING RESULTS
============================================================
📊 Generation Stats:
Prompt tokens: 7
Generated tokens: 75
Total tokens: 82
Iterations: 20
⚡ Performance:
Total time: 2.37s
Throughput: 31.6 tok/s
🎯 Speculative Decoding Efficiency:
Tokens drafted: 97
Tokens accepted: 75
Acceptance rate: 77.3%
Avg tokens/iteration: 3.75
🚀 Speedup Analysis:
Speculative: 31.6 tok/s
Est. Sequential: 15.8 tok/s
Speedup: 2.00x faster
============================================================
┌─────────────────┐ ┌──────────────────┐
│ Mac Ultra M3 | | |
| Client │ │ DGX/Server │
│ (Drafter) │ │ (Verifier) │
│ │ │ │
│ Q4_K_M Model │◄──────────────────►│ Q8_0 Model │
│ Fast Draft │ HTTP/JSON API │ High Precision │
│ Generation │ (~20-50 bytes) │ Verification │
└─────────────────┘ └──────────────────┘
│ │
│ 1. Generate 4 draft tokens │
│─────────────────────────────────────►│
│ draft_tokens: [259, 699, ...] │
│ │
│ 2. Verify drafts, return accepted │
│◄─────────────────────────────────────│
│ {accepted: 3, preds: [...]} │
│ │
│ 3. Continue from accepted tokens │
│─────────────────────────────────────►│
🔑 No KV Cache Streaming
- Each machine maintains its own KV cache independently
- KV cache never leaves its host machine
- Both sides regenerate KV from token IDs (current_tokens list)
- Network transfer: Only token IDs (~4 integers = 20 bytes per iteration)
⚡ Resource Utilization
Traditional Approach (Prefill/Decode Split):
DGX: ████░░░░░░░░░░░░░░ (20% - idle after prefill)
Mac: ░░░░████████████░░ (80% - doing all decode)
Our Approach (Speculative Decoding):
DGX: ████████████████░░ (90% - continuous verification)
Mac: ████████████████░░ (90% - continuous drafting)
📊 Network Efficiency
- Per iteration: ~20-50 bytes (token IDs only)
- Alternative approach: 500MB-2GB (full KV cache)
- ~10,000,000x less data transferred
Initialize the prompt and KV cache.
Request:
{
"prompt": "best sport in the world is"
}Response:
{
"prompt_len": 7,
"prompt_ids": [1, 1900, 7980, 297, 278, 3186, 338]
}Verify draft tokens against server predictions.
Request:
{
"draft_tokens": [259, 699, 3522, 29889, 13]
}Response:
{
"accepted_prefix_len": 5,
"preds": [259, 699, 3522, 29889, 13],
"total_tokens": 12
}-
Increase
--draft_n: More tokens per draft = higher potential speedup (but lower acceptance rate)- Optimal range: 4-8 tokens
- Too high: acceptance rate drops significantly
-
Model Selection:
- Use smaller quantization for drafter (Q4_K_M, Q4_0)
- Use higher precision for verifier (Q8_0, FP16)
- Similar model families = higher acceptance rates
-
Hardware Optimization:
- Mac: Ensure Metal acceleration (
CMAKE_ARGS="-DLLAMA_METAL=on") - Server: Use GPU acceleration for best performance
- Mac: Ensure Metal acceleration (
Solution: Download models (see Setup step 1) or update model path
Solution: Models may be too different. Try:
- Using same model family with different quantizations
- Reducing
--draft_nto 2-3 tokens
Solution: Verify Metal acceleration:
pip uninstall llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dirSolution: Ensure server model is initialized with logits_all=True (already in provided code)
Larger Models (Mistral, Llama 2) - for better spec decoding (you can then remove greedy generation as in current implementation)
# Download Mistral 7B (Q4_K_M for drafter, Q8_0 for server)
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q8_0.gguf
# Update model paths in code
python draft_token_generator.py --model mistral-7b-instruct-v0.2.Q4_K_M.gguf ...-
Network Agnostic: Works on standard 1GbE, WiFi, or even internet
- No need for expensive 10GbE infrastructure ($750-2600 savings)
- No special cabling or switches required
-
Simple Implementation:
- Only token IDs (integers) exchanged
- No complex KV cache serialization
- Standard HTTP REST API
-
Full Hardware Utilization:
- Both machines continuously processing
- DGX verifies while Mac drafts next batch
- No idle time waiting for large transfers
-
Flexible Model Selection:
- Can use completely different model architectures
- Mac: TinyLlama 4-bit, DGX: Mistral 8-bit (works!)
- KV cache format incompatibility is not an issue
-
Server-Side KV Caching: Keep KV cache on server between requests to avoid recomputation
# Instead of regenerating from scratch each time app.state.kv_cache = None # Reuse and extend existing cache
-
Batching: Process multiple concurrent requests with batch verification
-
Binary Protocol: Replace JSON with gRPC for lower serialization overhead
-
Connection Pooling: Reuse HTTP connections to reduce handshake overhead
-
Adaptive Draft Size: Dynamically adjust
draft_nbased on acceptance rates
MIT

