vLLM with Tensorized Models

vLLM can load weights serialized with Tensorizer directly, without first converting them back to another checkpoint format.

Five‑Minute Quickstart

bash examples/vllm/run_vllm_tensorized.sh s3://my-bucket/models/tiny-gpt2.tensors

The script launches a vLLM server with the given weights and sends a single smoke-test query to confirm that the model responds.
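For readers who want to repeat the smoke test by hand, here is a minimal sketch of such a query in Python. It is not taken from the script: it assumes the server exposes vLLM's OpenAI-compatible completions endpoint at http://localhost:8000, and the model name tiny-gpt2 and the helper names build_completion_request and smoke_test are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 8
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(
    base_url: str, model: str, prompt: str = "Hello", timeout: float = 30.0
) -> str:
    """Send one short completion request and return the generated text.

    Raises urllib.error.URLError if the server is unreachable, and
    KeyError or IndexError if the response lacks the expected fields.
    """
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Once the server is up, calling smoke_test("http://localhost:8000", "tiny-gpt2") should return a short string; adjust the base URL and model name to match your deployment.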

vllm serve --tensorizer reads weights from local disk, an HTTP(S) URL, or an S3 URI.
Environment variables such as VLLM_WORKER_GPU_MEMORY_UTILIZATION trade throughput against GPU memory usage.
The /metrics endpoint exposes Prometheus metrics, including time‑to‑first‑token and tokens/sec.
Scale out with KServe or plain Deployments using the Helm chart in helm/tensorizer-vllm.
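The Prometheus metrics mentioned above can be turned into the two headline numbers with a small amount of parsing. The sketch below reads the Prometheus text exposition format, derives a mean time-to-first-token from a histogram's _sum and _count series, and derives tokens/sec from two scrapes of a counter. The metric names vllm:time_to_first_token_seconds and vllm:generation_tokens_total are assumptions that may differ between vLLM versions, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional


def parse_metrics(text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and timestamps."""
    totals: Dict[str, float] = defaultdict(float)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            parts = line.split(None, 1)
            name = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
        fields = rest.split()
        if fields:
            totals[name] += float(fields[0])  # first field is the value
    return dict(totals)


def mean_ttft_seconds(
    metrics: Dict[str, float],
    prefix: str = "vllm:time_to_first_token_seconds",  # assumed metric name
) -> Optional[float]:
    """Mean time-to-first-token, or None if no requests were observed."""
    count = metrics.get(prefix + "_count", 0.0)
    if count == 0:
        return None
    return metrics.get(prefix + "_sum", 0.0) / count


def tokens_per_second(
    before: Dict[str, float],
    after: Dict[str, float],
    elapsed_seconds: float,
    name: str = "vllm:generation_tokens_total",  # assumed metric name
) -> float:
    """Generation rate between two scrapes taken elapsed_seconds apart."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (after.get(name, 0.0) - before.get(name, 0.0)) / elapsed_seconds
```

In practice the same figures are usually computed in Prometheus itself with rate() over these series; the Python version is mainly useful for quick checks against a single server.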

Refer to the vLLM documentation for advanced options.