All notable changes to this project will be documented in this file.
Status: ✅ Production-ready, verified stable on Google Cloud Run
After extensive testing across multiple vLLM versions (v0.8.4 through v0.13.0), we selected v0.8.4 as the stable production version for Cloud Run deployment.
Version History and Findings:
- v0.8.4 (V0 Engine): ✅ SELECTED - Single monolithic process, proven stable
- v0.8.5: ❌ Caused CUDA OOM errors during sampler warmup
- v0.9.0 - v0.13.0 (V1 Engine): ❌ Silent failures in Cloud Run containers
- V1 engine uses multiprocess architecture (driver + EngineCore subprocess)
- When EngineCore subprocess fails, no error logs are produced in Cloud Run
- This makes debugging impossible and deployment unreliable
Key Technical Insight: Cloud Run containers run as a single process. vLLM v0.9.0+ introduced a V1 engine architecture with a separate subprocess (EngineCore), which fails silently when the subprocess crashes. The V0 engine (single process, v0.8.4) provides clear error logging and reliable startup behavior.
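On vLLM releases that ship both engines, the `VLLM_USE_V1` environment variable selects between them; setting it to `0` forces the single-process V0 engine. This is a hedged sketch (the variable's default and availability vary by release, and it has no effect once V0 is removed):

```shell
# Force the single-process V0 engine on vLLM releases that support both.
# On v0.8.4 this configuration already uses V0 by default.
export VLLM_USE_V1=0
```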
- GPU Memory Utilization: 80% (19.2GB of the 24GB L4 GPU)
- Reduced from 0.90 to 0.80 to prevent CUDA OOM during sampler warmup
- Provides safety margin for context loading and activations
- Max Concurrent Sequences: 64
- Optimized for stable batching without memory spikes
- Max Model Length: 8192 tokens
- Balanced between context window and memory constraints
- Cold Start Time: ~4-5 minutes
- Model download from GCS: ~2-3 minutes
- vLLM server initialization: ~2 minutes
- This is a realistic expectation for v0.8.4 on L4 GPU
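Taken together, the settings above correspond to a launch command along these lines (a sketch only: the model path, served name, and port are illustrative, and assume the weights were already downloaded from GCS):

```shell
# Illustrative startup.sh launch line for vLLM v0.8.4.
python -m vllm.entrypoints.openai.api_server \
  --model /models/qwen3-8b-awq \
  --quantization awq \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 64 \
  --max-model-len 8192 \
  --host 0.0.0.0 --port 8080
```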
| Metric | Value |
|---|---|
| TTFT (Time to First Token) | ~80ms (warm) |
| TPS (Tokens Per Second) | ~43 tokens/sec |
| Cost/1K Tokens | ~$0.005 (L4 GPU at $0.90/hour) |
| GPU Memory | 80% utilization (19.2GB) |
| Cold Start | ~4-5 minutes |
| Warm Response | <100ms latency |
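The Cost/1K Tokens figure can be sanity-checked from the other two numbers in the table (a back-of-the-envelope check that ignores idle time and batching effects):

```python
# L4 hourly price and measured throughput from the table above.
gpu_cost_per_hour = 0.90        # USD
tokens_per_second = 43

tokens_per_hour = tokens_per_second * 3600            # 154,800 tokens
cost_per_1k_tokens = gpu_cost_per_hour / (tokens_per_hour / 1000)
print(round(cost_per_1k_tokens, 4))                   # ~0.0058, i.e. ~$0.005-0.006
```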
- Model: Qwen3-8B-AWQ
- Quantization: 4-bit (75% size reduction)
- License: Apache 2.0
- Features: Thinking/reasoning mode enabled
- Capabilities: Advanced reasoning and multilingual support
Standard (Recommended for cost-efficiency):
- Model downloaded from GCS on first startup
- Cost: $0 when idle (min_instances=0), ~$0.90/hour when active
- Cold start: ~4-5 minutes
Baked Image (Alternative):
- Model pre-included in Docker image (~10GB)
- Same cold start time as standard (model loading is the bottleneck, not download)
- Only beneficial if you need multiple instances or very predictable startup
- Cost: Same as standard
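A scale-to-zero deployment of the standard option looks roughly like this (a sketch: the service name, image tag, and resource sizes are placeholders, and the GPU flags may require the beta track depending on your gcloud version):

```shell
# Illustrative deploy command; adjust names, region, and sizing to your project.
gcloud beta run deploy vllm-qwen3 \
  --image "asia-southeast1-docker.pkg.dev/$PROJECT/vllm/qwen3-awq:v0.8.4" \
  --region asia-southeast1 \
  --gpu 1 --gpu-type nvidia-l4 \
  --min-instances 0 --max-instances 1 \
  --no-cpu-throttling \
  --memory 32Gi --cpu 8
```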
- Dockerfile: Updated to vLLM v0.8.4 (was v0.13.0 in earlier attempts)
- docker/startup.sh: Added GPU memory optimization (0.80 utilization)
- docker/startup.sh: Added `--max-num-seqs 64` for stability
- docker/startup-baked.sh: Updated with same optimizations as standard deployment
- docker/Dockerfile.baked: Updated to v0.8.4
- scripts/deploy.sh: Extended startup probe timeout (240s initial delay) for 4-5 minute cold start
- README.md: Updated performance claims to realistic values (4-5 min cold start, not 30s)
- README.md: Removed Run:ai Model Streamer references (not used in final deployment)
- docs/DEPLOYMENT.md: Added comprehensive deployment notes explaining v0.8.4 choice
- docs/ARCHITECTURE.md: Updated component descriptions to reflect V0 engine
- Makefile: Updated deployment help text and startup probe settings
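The extended startup probe in scripts/deploy.sh maps onto a Cloud Run service spec roughly like this (a sketch: the `/health` path and port 8080 are assumptions about the container, the thresholds are illustrative, and Cloud Run caps probe timing values, so check the current limits):

```yaml
startupProbe:
  httpGet:
    path: /health        # vLLM's OpenAI-compatible server health endpoint
    port: 8080
  initialDelaySeconds: 240   # skip probing while the model downloads
  periodSeconds: 10
  failureThreshold: 3
```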
- Run:ai Model Streamer integration (not compatible with v0.8.4, marginal benefit)
- Earlier attempts at vLLM v0.10.0, v0.11.0, v0.12.0, v0.13.0 configurations
- ✅ Container builds successfully with v0.8.4
- ✅ API server starts and becomes healthy within startup probe timeout
- ✅ OpenAI-compatible endpoints respond correctly
- ✅ Model inference works with Qwen3 thinking/reasoning
- ✅ Stable under load with 64 concurrent sequences
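The endpoint check above boils down to POSTing a request shaped like this to `/v1/chat/completions` (a sketch of the OpenAI-compatible payload; the model name is an assumption and must match whatever name the server was launched with):

```python
import json

# Request body for POST /v1/chat/completions on the vLLM server.
payload = {
    "model": "qwen3-8b-awq",   # assumed served model name
    "messages": [
        {"role": "user", "content": "Explain AWQ quantization in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body[:40])
```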
Status: ❌ Failed - Silent container crashes
Issue: V1 engine multiprocess architecture incompatible with Cloud Run
Root Cause: Container startup probe timeout with no error logs
Learning: The latest vLLM versions are not suitable for Cloud Run's single-process container model
```
Local Machine                  Google Cloud Platform
+-------------+               +----------------------------+
| Application | ---HTTPS----> | Cloud Run (asia-southeast1)|
| OpenAI SDK  |               |  +----------------------+  |
+-------------+               |  | vLLM Server          |  |
                              |  | /v1/chat/completions |  |
                              |  +----------------------+  |
                              |  GPU: NVIDIA L4            |
                              |            |               |
                              |            v               |
                              |  +----------------------+  |
                              |  | GCS: Model weights   |  |
                              |  +----------------------+  |
                              +----------------------------+
```
Components:
- vLLM v0.8.4 (V0 Engine): Single-process, proven stable
- Cloud Run: Serverless GPU with scale-to-zero (min_instances=0)
- NVIDIA L4 GPU: 24GB VRAM, optimized for inference
- Qwen3-8B-AWQ: 4-bit quantized model with reasoning capabilities
- GCS: Model weight storage for on-demand downloading
| Configuration | Monthly Cost (Idle) | Monthly Cost (24/7) | Best For |
|---|---|---|---|
| min_instances=0 (Standard) | $0 | ~$648 | Development, light usage |
| min_instances=1 (Always Warm) | ~$648 | ~$648 | Production with SLA |
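The ~$648/month figure follows directly from the L4 hourly rate, assuming a 30-day month and excluding storage and egress:

```python
hourly_rate = 0.90                       # USD/hour, L4 GPU rate used above
monthly_24_7 = hourly_rate * 24 * 30     # one instance running continuously
print(round(monthly_24_7, 2))            # ~648
```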
Cost Comparison (inference cost, not including model download):
- Self-hosted L4: ~$0.005/1K tokens (when running), $0 when idle
- OpenAI GPT-5 Nano: ~$0.0002-0.0004/1K tokens
As the vLLM V1 engine matures and Cloud Run's container tooling improves:
- Monitor v0.14.0+ releases for V1 engine maturity
- Evaluate if multiprocess debugging tools become available
- Consider vSphere or container runtimes with better subprocess logging
- Continue to validate against v0.8.4 baseline
Until then, v0.8.4 remains the recommended stable version for Cloud Run deployment.
Last Updated: 2025-12-21
Deployment Status: ✅ Production-ready