All notable changes to this project will be documented in this file.
Status: ✅ Production-ready, verified stable on Google Cloud Run
After extensive testing across multiple vLLM versions (v0.8.4 through v0.13.0), we selected v0.8.4 as the stable production version for Cloud Run deployment.
Version History and Findings:
- v0.8.4 (V0 Engine): ✅ SELECTED - Single monolithic process, proven stable
- v0.8.5: ❌ Caused CUDA OOM errors during sampler warmup
- v0.9.0 - v0.13.0 (V1 Engine): ❌ Silent failures in Cloud Run containers
- V1 engine uses multiprocess architecture (driver + EngineCore subprocess)
- When EngineCore subprocess fails, no error logs are produced in Cloud Run
- This makes debugging impossible and deployment unreliable
Key Technical Insight: Cloud Run containers run as a single process. vLLM v0.9.0+ introduced a V1 engine architecture with a separate subprocess (EngineCore), which fails silently when the subprocess crashes. The V0 engine (single process, v0.8.4) provides clear error logging and reliable startup behavior.
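On vLLM releases that ship both engines, the `VLLM_USE_V1` environment variable selects between them; setting it to `0` forces the single-process V0 engine. This is a hedged sketch (the variable's default and availability vary by release, and it has no effect once V0 is removed):

```shell
# Force the single-process V0 engine on vLLM releases that support both.
# On v0.8.4 this configuration already uses V0 by default.
export VLLM_USE_V1=0
```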
- GPU Memory Utilization: 80% (19.2GB of the 24GB L4 GPU)
- Reduced from 0.90 to 0.80 to prevent CUDA OOM during sampler warmup
- Provides safety margin for context loading and activations
- Max Concurrent Sequences: 64
- Optimized for stable batching without memory spikes
- Max Model Length: 8192 tokens
- Balanced between context window and memory constraints
- Cold Start Time: ~4-5 minutes
- Model download from GCS: ~2-3 minutes
- vLLM server initialization: ~2 minutes
- This is a realistic expectation for v0.8.4 on L4 GPU
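Taken together, the settings above correspond to a launch command along these lines (a sketch only: the model path, served name, and port are illustrative, and assume the weights were already downloaded from GCS):

```shell
# Illustrative startup.sh launch line for vLLM v0.8.4.
python -m vllm.entrypoints.openai.api_server \
  --model /models/qwen3-8b-awq \
  --quantization awq \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 64 \
  --max-model-len 8192 \
  --host 0.0.0.0 --port 8080
```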
| Metric | Value |
|---|---|
| TTFT (Time to First Token) | ~80ms (warm) |
| TPS (Tokens Per Second) | ~43 tokens/sec |
| Cost/1K Tokens | ~$0.005 (L4 GPU at $0.90/hour) |
| GPU Memory | 80% utilization (19.2GB) |
| Cold Start | ~4-5 minutes |
| Warm Response | <100ms latency |
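The Cost/1K Tokens figure can be sanity-checked from the other two numbers in the table (a back-of-the-envelope check that ignores idle time and batching effects):

```python
# L4 hourly price and measured throughput from the table above.
gpu_cost_per_hour = 0.90        # USD
tokens_per_second = 43

tokens_per_hour = tokens_per_second * 3600            # 154,800 tokens
cost_per_1k_tokens = gpu_cost_per_hour / (tokens_per_hour / 1000)
print(round(cost_per_1k_tokens, 4))                   # ~0.0058, i.e. ~$0.005-0.006
```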
- Model: Qwen3-8B-AWQ
- Quantization: 4-bit (75% size reduction)
- License: Apache 2.0
- Features: Thinking/reasoning mode enabled
- Capabilities: Advanced reasoning and multilingual support
Standard (Recommended for cost-efficiency):
- Model downloaded from GCS on first startup
- Cost: $0 when idle (min_instances=0), ~$0.90/hour when active
- Cold start: ~4-5 minutes
Baked Image (Alternative):
- Model pre-included in Docker image (~10GB)
- Same cold start time as standard (model loading is the bottleneck, not download)
- Only beneficial if you need multiple instances or very predictable startup
- Cost: Same as standard
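A scale-to-zero deployment of the standard option looks roughly like this (a sketch: the service name, image tag, and resource sizes are placeholders, and the GPU flags may require the beta track depending on your gcloud version):

```shell
# Illustrative deploy command; adjust names, region, and sizing to your project.
gcloud beta run deploy vllm-qwen3 \
  --image "asia-southeast1-docker.pkg.dev/$PROJECT/vllm/qwen3-awq:v0.8.4" \
  --region asia-southeast1 \
  --gpu 1 --gpu-type nvidia-l4 \
  --min-instances 0 --max-instances 1 \
  --no-cpu-throttling \
  --memory 32Gi --cpu 8
```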
- Dockerfile: Updated to vLLM v0.8.4 (was v0.13.0 in earlier attempts)
- docker/startup.sh: Added GPU memory optimization (0.80 utilization)
- docker/startup.sh: Added `--max-num-seqs 64` for stability
- docker/startup-baked.sh: Updated with same optimizations as standard deployment
- docker/Dockerfile.baked: Updated to v0.8.4
- scripts/deploy.sh: Extended startup probe timeout (240s initial delay) for 4-5 minute cold start
- README.md: Updated performance claims to realistic values (4-5 min cold start, not 30s)
- README.md: Removed Run:ai Model Streamer references (not used in final deployment)
- docs/DEPLOYMENT.md: Added comprehensive deployment notes explaining v0.8.4 choice
- docs/ARCHITECTURE.md: Updated component descriptions to reflect V0 engine
- Makefile: Updated deployment help text and startup probe settings
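The extended startup probe in scripts/deploy.sh maps onto a Cloud Run service spec roughly like this (a sketch: the `/health` path and port 8080 are assumptions about the container, the thresholds are illustrative, and Cloud Run caps probe timing values, so check the current limits):

```yaml
startupProbe:
  httpGet:
    path: /health        # vLLM's OpenAI-compatible server health endpoint
    port: 8080
  initialDelaySeconds: 240   # skip probing while the model downloads
  periodSeconds: 10
  failureThreshold: 3
```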
- Run:ai Model Streamer integration (not compatible with v0.8.4, marginal benefit)
- Earlier attempts at vLLM v0.10.0, v0.11.0, v0.12.0, v0.13.0 configurations
- ✅ Container builds successfully with v0.8.4
- ✅ API server starts and becomes healthy within startup probe timeout
- ✅ OpenAI-compatible endpoints respond correctly
- ✅ Model inference works with Qwen3 thinking/reasoning
- ✅ Stable under load with 64 concurrent sequences
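The endpoint check above boils down to POSTing a request shaped like this to `/v1/chat/completions` (a sketch of the OpenAI-compatible payload; the model name is an assumption and must match whatever name the server was launched with):

```python
import json

# Request body for POST /v1/chat/completions on the vLLM server.
payload = {
    "model": "qwen3-8b-awq",   # assumed served model name
    "messages": [
        {"role": "user", "content": "Explain AWQ quantization in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body[:40])
```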
Status: ❌ Failed - Silent container crashes
Issue: V1 engine multiprocess architecture incompatible with Cloud Run
Root Cause: Container startup probe timeout with no error logs
Learning: The latest vLLM versions are not suitable for Cloud Run's single-process container model
```
Local Machine                  Google Cloud Platform
+-------------+               +----------------------------+
| Application | ---HTTPS----> | Cloud Run (asia-southeast1)|
| OpenAI SDK  |               |  +----------------------+  |
+-------------+               |  | vLLM Server          |  |
                              |  | /v1/chat/completions |  |
                              |  +----------------------+  |
                              |  GPU: NVIDIA L4            |
                              |            |               |
                              |            v               |
                              |  +----------------------+  |
                              |  | GCS: Model weights   |  |
                              |  +----------------------+  |
                              +----------------------------+
```
Components:
- vLLM v0.8.4 (V0 Engine): Single-process, proven stable
- Cloud Run: Serverless GPU with scale-to-zero (min_instances=0)
- NVIDIA L4 GPU: 24GB VRAM, optimized for inference
- Qwen3-8B-AWQ: 4-bit quantized model with reasoning capabilities
- GCS: Model weight storage for on-demand downloading
| Configuration | Monthly Cost (Idle) | Monthly Cost (24/7) | Best For |
|---|---|---|---|
| min_instances=0 (Standard) | $0 | ~$648 | Development, light usage |
| min_instances=1 (Always Warm) | ~$648 | ~$648 | Production with SLA |
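The ~$648/month figure follows directly from the L4 hourly rate, assuming a 30-day month and excluding storage and egress:

```python
hourly_rate = 0.90                       # USD/hour, L4 GPU rate used above
monthly_24_7 = hourly_rate * 24 * 30     # one instance running continuously
print(round(monthly_24_7, 2))            # ~648
```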
Cost Comparison (inference cost, not including model download):
- Self-hosted L4: ~$0.005/1K tokens (when running), $0 when idle
- OpenAI GPT-5 Nano: ~$0.0002-0.0004/1K tokens
As the vLLM V1 engine matures and Cloud Run's container tooling improves:
- Monitor v0.14.0+ releases for V1 engine maturity
- Evaluate if multiprocess debugging tools become available
- Consider vSphere or container runtimes with better subprocess logging
- Continue to validate against v0.8.4 baseline
Until then, v0.8.4 remains the recommended stable version for Cloud Run deployment.
Last Updated: 2025-12-21
Deployment Status: ✅ Production-ready