which-llm

Stop guessing which LLM to use. Get data-driven model recommendations based on your task requirements, budget, and performance needs.

With 100+ LLMs available—each with different strengths, pricing, and capabilities—choosing the right one is overwhelming. which-llm queries real benchmark data and gives you actionable recommendations.

Note: This tool provides best-effort suggestions based on benchmark scores and capability metadata. It is not a substitute for proper evaluation on your specific use case. Benchmarks have known limitations and may not reflect real-world performance for your domain.

Quick Start

The easiest way to use which-llm is through the agent skill—your AI coding assistant (Cursor, Claude Code, Copilot, etc.) learns how to recommend models for you automatically.

1. Install the CLI

# macOS / Linux
brew tap richard-gyiko/tap
brew install which-llm

# Windows
scoop bucket add richard-gyiko https://github.com/richard-gyiko/scoop-bucket
scoop install which-llm

Other installation methods

Manual download from GitHub Releases:

# macOS (Apple Silicon)
curl -LO https://github.com/richard-gyiko/which-llm/releases/latest/download/which-llm-aarch64-apple-darwin.tar.gz
tar -xzf which-llm-aarch64-apple-darwin.tar.gz
sudo mv which-llm /usr/local/bin/

# macOS (Intel)
curl -LO https://github.com/richard-gyiko/which-llm/releases/latest/download/which-llm-x86_64-apple-darwin.tar.gz
tar -xzf which-llm-x86_64-apple-darwin.tar.gz
sudo mv which-llm /usr/local/bin/

# Linux
curl -LO https://github.com/richard-gyiko/which-llm/releases/latest/download/which-llm-x86_64-unknown-linux-gnu.tar.gz
tar -xzf which-llm-x86_64-unknown-linux-gnu.tar.gz
sudo mv which-llm /usr/local/bin/

From source (requires Rust):

cargo install --path .

2. Start Using It

No API key required! The CLI fetches pre-built benchmark data from GitHub Releases, updated daily.

# Refresh data (run once to populate cache)
which-llm refresh

# Query models using SQL
which-llm query "SELECT name, intelligence, coding, price FROM benchmarks LIMIT 10"

# List available tables
which-llm tables

# Check data source info
which-llm info

Optional: Configure API access for real-time data

For the freshest data (instead of daily snapshots), you can configure direct API access to Artificial Analysis:

  1. Create an account at artificialanalysis.ai/login
  2. Generate an API key
  3. Configure the CLI:
which-llm profile create default --api-key YOUR_API_KEY

Or set the ARTIFICIAL_ANALYSIS_API_KEY environment variable.

Then use the --use-api flag to fetch directly from the API:

which-llm refresh --use-api

3. Install the Skill

# Pick your AI coding tool
which-llm skill install cursor      # Cursor
which-llm skill install claude      # Claude Code
which-llm skill install opencode    # OpenCode
which-llm skill install codex       # Codex CLI
which-llm skill install windsurf    # Windsurf
which-llm skill install copilot     # GitHub Copilot
which-llm skill install antigravity # Antigravity

# Or install globally (available in all projects)
which-llm skill install cursor --global

Now just ask your AI assistant: "Which LLM should I use for [your task]?"

Tip: Most AI assistants will automatically load the skill when you mention "which llm" in your question—no need to explicitly tell it to load the skill.

Experimental: The skill is under active development. Recommendations and output format may change as we refine the task classification and benchmark interpretation.

The skill follows the Agent Skills open standard. See the full skill documentation for details on how it classifies tasks and selects models.

How It Works

The skill teaches your AI assistant to:

  1. Classify your task — Is it transformational (summarize, extract), analytical (compare, justify), tool-using (API calls), or agentic (planning, orchestration)?
  2. Derive requirements — Map task complexity to minimum benchmark thresholds
  3. Query real data — Filter models by capability (tool calling, context window, structured output)
  4. Recommend Primary + Fallback — Suggest a cost-optimized cascade (cheap model first, escalate when needed)
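Step 4's primary-plus-fallback pattern can be sketched as a tiny shell fragment. This is an illustration only: `call_primary` and `call_fallback` are hypothetical stand-ins for whatever API calls your application makes, not part of which-llm.

```shell
# Cost-optimized cascade sketch: try the cheap primary model first,
# escalate to the stronger fallback only when the primary fails.
# call_primary/call_fallback are hypothetical stand-ins for real API calls.
call_primary()  { return 1; }                 # pretend the cheap model came up empty
call_fallback() { echo "fallback answer"; }

answer=$(call_primary) || answer=$(call_fallback)
echo "$answer"
```

In a real system the escalation trigger would be an error, a confidence check, or a validation failure rather than a bare exit code.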

Data sources:

  • Artificial Analysis — Benchmark scores (intelligence, coding, math) and performance metrics (price, latency, throughput)
  • models.dev — Capability metadata (context window, tool calling, structured output, reasoning)

Example

You: "I'm building a feature that summarizes long Slack threads so users can catch up quickly. Threads can be pretty long, like 200+ messages. Which LLM should I use?"

Assistant (with skill loaded):

Task Classification

  • Skill Type: Transformational (summarization)
  • Key Constraints: Large context window (200+ messages)
  • Priority: Balanced (quality + cost + speed matter equally)

Recommendations

Primary: gpt-oss-20B ($0.10/M)

  • Intelligence: 24.7 | Context: 131k tokens | Throughput: 305 tps
  • Why: Excellent cost-capability ratio. 131k context easily handles 200+ messages.

Fallback: MiMo-V2-Flash ($0.15/M)

  • Intelligence: 30-39 | Context: 256k tokens
  • Use if: Primary struggles with nuanced summaries or threads exceed 300 messages

Cost Estimate

  • Cascade (70/30 split): $0.115/M tokens
  • Savings vs always using fallback: 23%
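The blended cascade price is plain weighted arithmetic; a quick sanity check of the numbers above (70% of traffic at the primary's $0.10/M, 30% at the fallback's $0.15/M):

```shell
# Blended cascade price: 70% primary + 30% fallback
awk 'BEGIN { printf "%.3f\n", 0.7*0.10 + 0.3*0.15 }'      # 0.115
# Savings vs. routing everything to the fallback
awk 'BEGIN { printf "%.0f%%\n", (1 - 0.115/0.15) * 100 }' # 23%
```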

Validation step: Before deploying, test both models on 5-10 representative Slack threads from your workspace.

View full transcript — shows the complete flow including CLI queries and scoring.

CLI Reference

For power users, scripting, or debugging, you can query the data directly.

SQL Queries (Primary Interface)

Use full SQL expressiveness on the cached benchmark data:

# Best coding models under $5/M (benchmarks table)
which-llm query "SELECT name, creator, coding, output_price FROM benchmarks WHERE coding > 40 AND output_price < 5 ORDER BY coding DESC"

# Models with tool calling and large context (models table)
which-llm query "SELECT model_name, provider_name, context_window, tool_call FROM models WHERE tool_call = true AND context_window > 100000"

# List available tables
which-llm tables

# Show schema for a specific table
which-llm tables benchmarks

Available tables and columns

Tables

| Table | Description | Source |
|---|---|---|
| benchmarks | LLM benchmark scores and pricing | Artificial Analysis |
| models | Capability metadata and provider info | models.dev |
| text_to_image | Text-to-image models | Artificial Analysis |
| image_editing | Image editing models | Artificial Analysis |
| text_to_speech | Text-to-speech models | Artificial Analysis |
| text_to_video | Text-to-video models | Artificial Analysis |
| image_to_video | Image-to-video models | Artificial Analysis |

Benchmarks Table (Artificial Analysis)

| Column | Type | Description |
|---|---|---|
| name | VARCHAR | Model name |
| creator | VARCHAR | Creator (OpenAI, Anthropic, etc.) |
| intelligence | DOUBLE | Intelligence index |
| coding | DOUBLE | Coding index |
| math | DOUBLE | Math index |
| input_price | DOUBLE | Price per 1M input tokens |
| output_price | DOUBLE | Price per 1M output tokens |
| tps | DOUBLE | Tokens per second |
| latency | DOUBLE | Time to first token (seconds) |

Models Table (models.dev)

| Column | Type | Description |
|---|---|---|
| model_name | VARCHAR | Model name |
| provider_name | VARCHAR | Provider (OpenAI, Anthropic, etc.) |
| context_window | BIGINT | Maximum context window (tokens) |
| tool_call | BOOLEAN | Supports function calling |
| structured_output | BOOLEAN | Supports JSON mode |
| reasoning | BOOLEAN | Chain-of-thought model |
| open_weights | BOOLEAN | Weights publicly available |

Note: The benchmarks and models tables are independent. Use SQL to join or correlate data between them based on model/provider names.
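A join along those lines might look like the sketch below. The exact-name match is optimistic: model names often differ slightly between Artificial Analysis and models.dev, so treat `lower()` equality as a starting point and loosen the join condition (e.g. with LIKE) as needed.

```sql
-- Hypothetical join: correlate benchmark scores with capability metadata.
SELECT b.name, b.coding, b.output_price, m.context_window
FROM benchmarks b
JOIN models m ON lower(m.model_name) = lower(b.name)
WHERE m.tool_call = true
ORDER BY b.coding DESC
LIMIT 5;
```

Run it by passing the statement to `which-llm query "..."` as in the examples above.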

Compare Models

Compare models side-by-side with highlighted winners:

# Compare two or more models
which-llm compare "gpt-5 (high)" "claude 4.5 sonnet" "gemini 2.5 pro"

# Show additional fields
which-llm compare "gpt-5" "claude-4.5" --verbose

# Output formats: --json, --csv, --table, --plain
which-llm compare "gpt-5" "claude-4.5" --json

The compare command uses fuzzy matching on model names and displays a transposed table with models as columns and metrics as rows. Winners for each metric are marked with *.

Calculate Token Costs

Estimate token costs with projections:

# Single model cost calculation
which-llm cost "gpt-5 (high)" --input 10k --output 5k

# Compare costs across models
which-llm cost "gpt-5" "claude 4.5" --input 1M --output 500k

# Daily/monthly projections with request volume
which-llm cost "gpt-5 (high)" --input 2k --output 1k --requests 1000 --period daily

# Supports token units: k (thousands), M (millions), B (billions)
which-llm cost "claude-4.5" --input 1.5M --output 750k

Other Commands

# Refresh data from sources
which-llm refresh

# View data source and attribution info
which-llm info

# Manage cache
which-llm cache status
which-llm cache clear

# Manage profiles (for API access)
which-llm profile list
which-llm profile create work --api-key KEY
which-llm profile default work

# Skill management
which-llm skill list
which-llm skill uninstall cursor

Attribution

This tool uses data from the Artificial Analysis API. Per the API terms, attribution is required for all use of the data.

Data Freshness

The CLI uses pre-built benchmark data hosted on GitHub Releases, updated daily via automated workflows. This means:

  • No API key required for basic usage
  • Data is typically less than 24 hours old
  • Use which-llm info to see when data was last updated
  • Use which-llm refresh to fetch fresh data from sources
  • Use which-llm refresh --use-api with an API key for real-time data

License

MIT
