candle-vllm can be used as a library in your own Rust projects to perform high-performance LLM inference without running a standalone server.
Add candle-vllm to your `Cargo.toml`. You must specify the Git repository (or a local path) and enable the necessary backend features.
```toml
[dependencies]
candle-vllm = { git = "https://github.com/EricLBuehler/candle-vllm.git", features = ["cuda"] }

# Or for local development:
# candle-vllm = { path = "../candle-vllm", features = ["cuda"] }
```

You must enable at least one backend feature unless you are running on CPU (which is slow for LLMs). Common features include:
- `cuda`: Enables CUDA support (requires the wrapping project to also configure CUDA).
- `metal`: Enables Metal support on macOS.
- `flashattn`: Enables the flash attention backend.
- `flashinfer`: Enables the FlashInfer backend on CUDA.
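For example, a macOS project would enable `metal` instead of `cuda`. This is a minimal sketch only; the right combination (such as adding `flashattn` on supported NVIDIA GPUs) depends on your hardware and the candle-vllm revision you pin:

```toml
# macOS: use the Metal backend instead of CUDA.
[dependencies]
candle-vllm = { git = "https://github.com/EricLBuehler/candle-vllm.git", features = ["metal"] }

# NVIDIA GPU with flash attention (assumes the two features can be combined):
# candle-vllm = { git = "https://github.com/EricLBuehler/candle-vllm.git", features = ["cuda", "flashattn"] }
```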
Here is a complete example of how to initialize the engine and generate text.
```rust
use candle_vllm::api::{EngineBuilder, ModelRepo};
use candle_vllm::openai::requests::{ChatCompletionRequest, Messages};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // 1. Configure the engine.
    // You can specify a model from the HF Hub, a local path, or a GGUF file.
    let builder = EngineBuilder::new(ModelRepo::ModelID(("Qwen/Qwen3-0.6B", None)))
        .with_kvcache_mem_gpu(1096) // KV cache budget in MB (~1 GB)
        .with_temperature(0.7)
        .with_top_p(0.95);

    // 2. Build the engine.
    // This initializes the model, weights, and background processing threads.
    let engine = builder.build_async().await?;

    // 3. Create a request.
    let request = ChatCompletionRequest {
        model: "default".to_string(), // the model name is an internal reference
        messages: Messages::Map(vec![
            std::collections::HashMap::from([
                ("role".to_string(), "user".to_string()),
                ("content".to_string(), "Tell me a joke about Rust programming.".to_string()),
            ])
        ]),
        max_tokens: Some(100),
        ..Default::default()
    };

    // 4. Generate the response.
    let response = engine.generate_request(request).await?;
    println!("Response: {:?}", response);

    // 5. Clean shutdown.
    engine.shutdown();
    Ok(())
}
```

The EngineBuilder provides a fluent API to configure:
- Model source: `ModelRepo::ModelID` (HF Hub), `ModelRepo::ModelPath` (local path), `ModelRepo::ModelFile` (GGUF) — see the sketch after this list.
- Compute resources: `.with_device_ids()`, `.with_kvcache_mem_gpu()`, `.with_kvcache_mem_cpu()`.
- Generation defaults: `.with_temperature()`, `.with_top_p()`, etc.
- Advanced: `.with_dtype()`, `.with_isq()` (quantization), `.without_flash_attn()`.
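As a sketch of how these options compose, here is a hypothetical configuration that loads a local GGUF file and pins the engine to a single GPU. The file path is made up, and the exact argument types of `ModelRepo::ModelFile`, `.with_device_ids()`, and the cache-size setters are assumptions inferred from the method names above; check the crate's documentation for the real signatures.

```rust
use candle_vllm::api::{EngineBuilder, ModelRepo};

// Hypothetical sketch: argument types below are assumed from the builder
// methods listed above, not confirmed against the crate's actual API.
async fn build_local_gguf_engine() -> anyhow::Result<()> {
    let builder = EngineBuilder::new(ModelRepo::ModelFile("models/model-q4_k_m.gguf".to_string()))
        .with_device_ids(vec![0])     // assumed: run on GPU 0 only
        .with_kvcache_mem_gpu(2048)   // KV cache budget on the GPU, in MB
        .with_kvcache_mem_cpu(4096)   // KV cache spill space in host memory, in MB
        .with_temperature(0.2)
        .with_top_p(0.9);

    let _engine = builder.build_async().await?;
    Ok(())
}
```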
If you are building a standalone binary that uses candle-vllm, make sure your Cargo.toml enables the required backend features correctly.
For example, if you use cuda, your Cargo.toml might look like:
```toml
[package]
name = "my-inference-app"
version = "0.1.0"
edition = "2021"

[dependencies]
candle-vllm = { git = "https://github.com/EricLBuehler/candle-vllm.git", features = ["cuda"] }
tokio = { version = "1.32", features = ["full"] }
anyhow = "1.0"
```

Then run with:
```bash
cargo run --features candle-vllm/cuda
```

or just `cargo run` if you already enabled the feature in your dependencies.
Run the built-in example:

```bash
cd candle-vllm
cargo run --release --features cuda --example simple_gen
```
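On macOS, swapping the backend feature should give the Metal equivalent; this assumes the `simple_gen` example also builds with the `metal` feature:

```bash
# Assumed Metal variant of the same example (macOS only).
cd candle-vllm
cargo run --release --features metal --example simple_gen
```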