RuvLLM CLI

Command-line interface for RuvLLM inference, optimized for Apple Silicon.

Installation

# From crates.io
cargo install ruvllm-cli

# From source (with Metal acceleration)
cargo install --path . --features metal

Commands

Download Models

Download models from HuggingFace Hub:

# Download Qwen with Q4K quantization (default)
ruvllm download qwen

# Download with specific quantization
ruvllm download qwen --quantization q8
ruvllm download mistral --quantization f16

# Force re-download
ruvllm download phi --force

# Download specific revision
ruvllm download llama --revision main

Model Aliases

Alias	Model ID
`qwen`	`Qwen/Qwen2.5-7B-Instruct`
`mistral`	`mistralai/Mistral-7B-Instruct-v0.3`
`phi`	`microsoft/Phi-3-medium-4k-instruct`
`llama`	`meta-llama/Meta-Llama-3.1-8B-Instruct`

Quantization Options

Option	Description	Memory Savings (vs. full precision)
`q4k`	4-bit quantization (default)	~87%
`q8`	8-bit quantization	~75%
`f16`	Half precision	~50%
`none`	Full precision	0%
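
Savings percentages depend on the baseline they are measured against. The sketch below works through the arithmetic for the weights alone. It assumes nominal bit widths (32 bits for full precision) and ignores per-block scales, the KV cache, and activations, so real figures will differ somewhat. `NOMINAL_BITS`, `weight_gib`, and `savings_pct` are illustrative names, not part of RuvLLM.

```python
# Rough weight-memory estimate. Bits per weight are nominal values
# (assumption): real quantized formats add per-block scales and metadata.
NOMINAL_BITS = {"none": 32, "f16": 16, "q8": 8, "q4k": 4}


def weight_gib(params_billions: float, option: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bits = params_billions * 1e9 * NOMINAL_BITS[option]
    return total_bits / 8 / 2**30


def savings_pct(option: str, baseline: str = "none") -> float:
    """Percent reduction in weight memory relative to `baseline`."""
    return 100 * (1 - NOMINAL_BITS[option] / NOMINAL_BITS[baseline])


if __name__ == "__main__":
    for option in ("q4k", "q8", "f16", "none"):
        print(f"{option:>4}: {weight_gib(7, option):5.1f} GiB, "
              f"{savings_pct(option):4.1f}% smaller than 32-bit")
```

Passing `baseline="f16"` shows how the same option yields a smaller percentage when compared against half precision instead of 32-bit weights.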

List Models

# List all available models
ruvllm list

# List only downloaded models
ruvllm list --downloaded

# Detailed listing with sizes
ruvllm list --long

Model Information

# Show model details
ruvllm info qwen

# Output includes:
# - Model architecture
# - Parameter count
# - Download status
# - Disk usage
# - Supported features

Interactive Chat

# Start chat with default settings
ruvllm chat qwen

# With custom system prompt
ruvllm chat qwen --system "You are a helpful coding assistant."

# Adjust generation parameters
ruvllm chat qwen --temperature 0.5 --max-tokens 1024

# Use specific quantization
ruvllm chat qwen --quantization q8

Chat Commands

During chat, use these commands:

Command	Description
`/help`	Show available commands
`/clear`	Clear conversation history
`/system <prompt>`	Change system prompt
`/temp <value>`	Change temperature
`/quit` or `/exit`	Exit chat

Start Server

Start an OpenAI-compatible inference server:

# Start with defaults
ruvllm serve qwen

# Custom host and port
ruvllm serve qwen --host 0.0.0.0 --port 8080

# Configure concurrency
ruvllm serve qwen --max-concurrent 8 --max-context 8192

API Endpoints

Endpoint	Method	Description
`/v1/chat/completions`	POST	Chat completions
`/v1/completions`	POST	Text completions
`/v1/models`	GET	List models
`/health`	GET	Health check
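
A script that launches the server usually needs to wait until it is ready before sending requests. The sketch below polls `/health` using only the Python standard library. It assumes that an HTTP 200 response means the server is ready, which the source does not state. `is_healthy` and `wait_until_healthy` are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200 (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(base_url: str, attempts: int = 30,
                       delay: float = 1.0, probe=is_healthy) -> bool:
    """Poll the health endpoint until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The `probe` parameter exists so the retry loop can be exercised without a running server. In normal use, call `wait_until_healthy("http://localhost:8080")` with the host and port the server was started with.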

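Example Request

This request assumes the server is listening on port 8080, as in the custom host and port command above.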

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 256
  }'
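
The same request can be sent from Python's standard library. This is a sketch: it assumes the server is reachable at `localhost:8080` and that the response follows the OpenAI chat format (`choices[0].message.content`), which the source implies but does not show. `build_chat_request` and `chat` are illustrative names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: server started with --port 8080


def build_chat_request(prompt: str, model: str = "qwen",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` sends the same payload as the curl command and prints the model's reply.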

Run Benchmarks

# Basic benchmark
ruvllm benchmark qwen

# Configure benchmark
ruvllm benchmark qwen \
  --warmup 5 \
  --iterations 20 \
  --prompt-length 256 \
  --gen-length 128

# Output formats
ruvllm benchmark qwen --format json
ruvllm benchmark qwen --format csv

Benchmark Metrics

Prefill Latency: Time to process input prompt
Decode Throughput: Tokens per second during generation
Time to First Token (TTFT): Latency before first output token
Memory Usage: Peak GPU/RAM consumption
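
The relationship between the timing metrics can be sketched from three timestamps per run. This is an illustrative calculation, not RuvLLM's benchmark code: the `RunTiming` layout and the convention of excluding the first token from decode throughput are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timestamps in seconds for one generation run (hypothetical layout)."""
    start: float           # request submitted
    first_token: float     # first output token produced
    end: float             # last output token produced
    generated_tokens: int  # total output tokens, including the first


def ttft_ms(run: RunTiming) -> float:
    """Time to first token: delay between submission and the first output token."""
    return (run.first_token - run.start) * 1000


def decode_tokens_per_s(run: RunTiming) -> float:
    """Decode throughput over the tokens produced after the first one."""
    decode_time = run.end - run.first_token
    if run.generated_tokens < 2 or decode_time <= 0:
        return 0.0
    return (run.generated_tokens - 1) / decode_time
```

Under this convention, time to first token covers prompt processing plus any queueing, while decode throughput measures only the steady-state generation that follows.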

Global Options

# Enable verbose logging
ruvllm --verbose <command>

# Disable colored output
ruvllm --no-color <command>

# Custom cache directory
ruvllm --cache-dir /path/to/cache <command>

# Or via environment variable
export RUVLLM_CACHE_DIR=/path/to/cache

Configuration

Cache Directory

Models are cached in:

macOS: ~/Library/Caches/ruvllm
Linux: ~/.cache/ruvllm
Windows: %LOCALAPPDATA%\ruvllm

Override with --cache-dir or RUVLLM_CACHE_DIR.
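
The lookup order can be sketched as follows. The precedence shown (flag first, then environment variable, then platform default) is an assumption based on common CLI conventions, since the source only says that both overrides exist. `resolve_cache_dir` is a hypothetical helper, not RuvLLM's code.

```python
import os
import sys
from pathlib import Path
from typing import Optional


def resolve_cache_dir(cli_value: Optional[str] = None,
                      env: Optional[dict] = None,
                      platform: str = sys.platform) -> Path:
    """Pick the model cache directory.

    Assumed precedence: --cache-dir, then RUVLLM_CACHE_DIR, then the
    platform default documented above.
    """
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value)
    if env.get("RUVLLM_CACHE_DIR"):
        return Path(env["RUVLLM_CACHE_DIR"])
    home = Path(env["HOME"]) if env.get("HOME") else Path.home()
    if platform == "darwin":
        return home / "Library" / "Caches" / "ruvllm"
    if platform.startswith("win"):
        local = env.get("LOCALAPPDATA")
        # Fallback path when LOCALAPPDATA is unset is an assumption.
        base = Path(local) if local else home / "AppData" / "Local"
        return base / "ruvllm"
    return home / ".cache" / "ruvllm"
```

Passing `env` and `platform` explicitly makes it easy to check what each operating system would resolve to without changing the real environment.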

Logging

Set the log level with the RUST_LOG environment variable:

RUST_LOG=debug ruvllm chat qwen
RUST_LOG=ruvllm=trace ruvllm serve qwen

Examples

Basic Workflow

# 1. Download a model
ruvllm download qwen

# 2. Verify it's downloaded
ruvllm list --downloaded

# 3. Start chatting
ruvllm chat qwen

Server Deployment

# Download model first
ruvllm download qwen --quantization q4k

# Start server with production settings
ruvllm serve qwen \
  --host 0.0.0.0 \
  --port 8080 \
  --max-concurrent 16 \
  --max-context 4096 \
  --quantization q4k

Performance Testing

# Run comprehensive benchmarks
ruvllm benchmark qwen \
  --warmup 10 \
  --iterations 50 \
  --prompt-length 512 \
  --gen-length 256 \
  --format json > benchmark_results.json

Troubleshooting

Out of Memory

# Use a smaller quantization (q4k is the default; this helps if you were running q8, f16, or none)
ruvllm chat qwen --quantization q4k

# Or reduce context length
ruvllm serve qwen --max-context 2048

Slow Download

# Re-run the same command to resume an interrupted download
ruvllm download qwen

# Force fresh download
ruvllm download qwen --force

Metal Issues (macOS)

Ensure Metal is available:

# Check Metal device
system_profiler SPDisplaysDataType | grep Metal

# Try with CPU fallback
RUVLLM_NO_METAL=1 ruvllm chat qwen

Feature Flags

Build with specific features:

# Metal acceleration (macOS)
cargo install ruvllm-cli --features metal

# CUDA acceleration (NVIDIA)
cargo install ruvllm-cli --features cuda

# Both (if available)
cargo install ruvllm-cli --features "metal,cuda"

License

Apache-2.0 / MIT dual license.
