The Open Salesforce AI Benchmark for Evaluating AI Coding Agents on Salesforce Development
Objective measurement. Real execution. Verified results.
Choose your path:
| I'm... | I Want To... | Go To... |
|---|---|---|
| New to SF-Bench | Understand what this is | What is SF-Bench? |
| New to Salesforce | Learn about Salesforce | What is Salesforce? |
| Company/Enterprise | Evaluate AI tools for my team | For Companies |
| Salesforce Developer | Test AI models on Salesforce | Quick Start |
| Researcher | Benchmark AI models | Evaluation Guide |
| SWE-bench User | Compare with SWE-bench | Comparison |
| Open Source Enthusiast | Contribute to SF-Bench | Contributing |
Before running SF-Bench, ensure you have:
| Requirement | Details | Where to Get |
|---|---|---|
| Python 3.10+ | Required runtime | python.org |
| Salesforce CLI | `sf` command-line tool | Salesforce CLI |
| DevHub Org | Salesforce org with scratch org allocation | Create DevHub |
| API Key | Provider-specific key (see below) | Provider dashboard |
| Provider | Environment Variable | Example Models | Get Key |
|---|---|---|---|
| RouteLLM | `ROUTELLM_API_KEY` | Grok 4.1, GPT-5, Claude Opus 4 | RouteLLM |
| OpenRouter | `OPENROUTER_API_KEY` | Claude Sonnet, GPT-4, Llama | OpenRouter |
| Google Gemini | `GOOGLE_API_KEY` | Gemini 2.5 Flash, Gemini Pro | Google AI Studio |
| Anthropic | `ANTHROPIC_API_KEY` | Claude 3.5 Sonnet, Claude Opus | Anthropic |
| OpenAI | `OPENAI_API_KEY` | GPT-4, GPT-3.5 | OpenAI |
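Before a run, a small sanity check can confirm that at least one key from the table above is set. The environment-variable names come from the table; the script itself is only an illustration, not part of SF-Bench:

```python
import os

# Environment-variable names from the provider table above; the mapping is
# taken from this README, not from SF-Bench's source code.
PROVIDER_ENV_VARS = {
    "RouteLLM": "ROUTELLM_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
    "Google Gemini": "GOOGLE_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
}

def configured_providers(env=os.environ):
    """Return the providers whose API key is present in the environment."""
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]

if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", found or "none - set at least one key")
```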
For Full Evaluation (12 tasks):
- Scratch Orgs: 1-5 orgs (depends on `--max-workers`)
  - Minimum: 1 org (sequential, `--max-workers 1`)
  - Recommended: 2-3 orgs (`--max-workers 2-3`)
  - Maximum: 5 orgs (`--max-workers 5`)
- Token Usage: ~100,000 tokens (~0.1M tokens)
  - Per task: ~8,000 tokens (input + output + context)
  - Full run: ~96,000 tokens
- Time: 1-2 hours (with functional validation)
- Cost: $0.10-$2 per evaluation (varies by model)
For Lite Evaluation (5 tasks):
- Scratch Orgs: 1-3 orgs
- Token Usage: ~40,000 tokens
- Time: ~10-15 minutes
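To sanity-check a budget before a run, the token figures above translate to cost with simple arithmetic. The per-million-token prices below are illustrative placeholders, not real provider rates; substitute your provider's current pricing:

```python
# Rough cost arithmetic for a full run, using the ~100,000-token figure above.
FULL_RUN_TOKENS = 100_000

def estimate_cost(total_tokens, price_per_million_tokens):
    """Blended input+output cost estimate in dollars."""
    return total_tokens / 1_000_000 * price_per_million_tokens

# PLACEHOLDER prices ($/1M tokens) - not quoted from any provider.
for label, price in [("cheap model ($1/M)", 1.0), ("premium model ($15/M)", 15.0)]:
    print(f"{label}: ${estimate_cost(FULL_RUN_TOKENS, price):.2f}")
```

At placeholder rates this lands in the $0.10-$1.50 range, consistent with the $0.10-$2 estimate above.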
```bash
# 1. Install
git clone https://github.com/yasarshaikh/SF-bench.git
cd SF-bench
pip install -e .

# 2. Set API key (example: RouteLLM for Grok 4)
export ROUTELLM_API_KEY="your-key"

# 3. Run evaluation
python scripts/evaluate.py --model grok-4.1-fast --tasks data/tasks/verified.json --functional
```

Full Quick Start Guide →
Results as of December 2025
| Rank | Model | Overall | Functional Score | LWC | Deploy | Apex | Flow |
|---|---|---|---|---|---|---|---|
| 🥇 | Claude Sonnet 4.5 | 41.67% | 6.0% | 100% | 100% | 100% | 0%* |
| 🥈 | Gemini 2.5 Flash | 25.0% | - | 100% | 100% | 0%* | 0%* |
* Flow tasks failed due to scratch org creation issues (being fixed)
View Full Leaderboard →
SF-Bench is an open, objective benchmark for measuring how well AI coding agents perform on Salesforce development tasks.
Generic benchmarks (HumanEval, SWE-bench) miss Salesforce-specific challenges:
| Challenge | What We Test |
|---|---|
| Multi-modal development | Apex, LWC (JavaScript), Flows (XML), Metadata |
| Platform execution | Real scratch orgs, not just syntax checks |
| Governor limits | CPU time, SOQL queries, heap size |
| Declarative tools | Flows, Lightning Pages, Permission Sets |
| Enterprise patterns | Triggers, batch jobs, integrations |
- ✅ Measure actual performance
- ✅ Report objective results
- ✅ Verify functional outcomes
- ❌ Don't predict what models "should" score
- ❌ Don't claim expected success rates
1. LOAD TASK → Read task from data/tasks/*.json
2. CLONE REPO → Clone specified GitHub repo
3. APPLY SOLUTION → Apply AI-generated patch
4. DEPLOY → Deploy to Salesforce scratch org
5. RUN TESTS → Execute unit tests
6. VERIFY OUTCOME → Check functional requirements
7. REPORT RESULT → PASS / FAIL / ERROR
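The seven steps above can be sketched as a single Python loop. This is a minimal illustration only; every helper passed in is a hypothetical stand-in, since the real orchestration lives in `sfbench/engine.py`:

```python
# Hypothetical sketch of the 7-step harness loop. The helpers (clone_repo,
# apply_patch, deploy, run_tests, verify) are illustrative stand-ins, not
# SF-Bench's actual internal API.
def run_task(task, clone_repo, apply_patch, deploy, run_tests, verify):
    """Return "PASS", "FAIL", or "ERROR" for one benchmark task."""
    try:
        repo = clone_repo(task["repo"])             # 2. CLONE REPO
        apply_patch(repo, task["solution"])         # 3. APPLY SOLUTION
        if not deploy(repo):                        # 4. DEPLOY to a scratch org
            return "FAIL"
        if not run_tests(repo):                     # 5. RUN TESTS
            return "FAIL"
        return "PASS" if verify(task) else "FAIL"   # 6. VERIFY OUTCOME
    except Exception:
        return "ERROR"                              # harness error, not a model failure
```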
CRITICAL: Binary Pass/Fail Methodology
- A task is PASSED only if the functional requirement is met
- Score breakdown (0-100) is diagnostic metadata only - helps identify where failures occurred
- This follows SWE-bench methodology: if functional requirement isn't met, task fails regardless of other checks
| Level | Weight | What We Check | Pass Criteria |
|---|---|---|---|
| Deployment | 10% | Solution deploys without errors | Required but not sufficient |
| Unit Tests | 20% | All tests pass, coverage β₯80% | Required but not sufficient |
| Functional | 50% | Business outcome achieved | REQUIRED - Gatekeeper |
| Bulk Operations | 10% | Handles 200+ records | Diagnostic only |
| No Manual Tweaks | 10% | Works in one shot | Diagnostic only |
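Under the weights above, the diagnostic score and the binary gate could be combined as in this sketch. The point values come from the table; the function itself is illustrative, not SF-Bench's actual scorer:

```python
# Weights (in points) from the validation-level table above.
WEIGHTS = {"deployment": 10, "unit_tests": 20, "functional": 50,
           "bulk_ops": 10, "no_manual_tweaks": 10}
REQUIRED = ("deployment", "unit_tests", "functional")  # "required" levels

def evaluate(checks):
    """checks: level -> bool. Returns (passed, diagnostic score out of 100)."""
    score = sum(points for level, points in WEIGHTS.items() if checks.get(level))
    passed = all(checks.get(level) for level in REQUIRED)  # binary pass/fail gate
    return passed, score
```

A run that deploys and passes tests but misses the functional outcome scores 30 diagnostically yet still fails, matching the SWE-bench-style gatekeeper rule.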
| Provider | Environment Variable | Example Model |
|---|---|---|
| OpenRouter | `OPENROUTER_API_KEY` | anthropic/claude-3.5-sonnet |
| RouteLLM | `ROUTELLM_API_KEY` | gemini-3-flash-preview |
| OpenAI | `OPENAI_API_KEY` | gpt-4-turbo |
| Anthropic | `ANTHROPIC_API_KEY` | claude-3-5-sonnet-20241022 |
| Google Gemini | `GOOGLE_API_KEY` | gemini-2.5-flash |
| Ollama (local) | None needed | codellama |
```bash
# With OpenRouter (recommended - access to 100+ models)
export OPENROUTER_API_KEY="your-key"
python scripts/evaluate.py --model anthropic/claude-3.5-sonnet --tasks data/tasks/verified.json

# With Gemini
export GOOGLE_API_KEY="your-key"
python scripts/evaluate.py --model gemini-2.5-flash --tasks data/tasks/verified.json

# With local Ollama
python scripts/evaluate.py --model codellama --provider ollama --tasks data/tasks/verified.json
```

SF-Bench includes 12 verified tasks across Salesforce development domains:
| Category | Tasks | Description |
|---|---|---|
| Apex | 2 | Triggers, Classes, Integrations |
| LWC | 2 | Lightning Components |
| Flow | 2 | Record-Triggered Flows, Invocable Actions |
| Lightning Pages | 1 | Dynamic Forms |
| Experience Cloud | 1 | Guest Access |
| Architecture | 4 | Full-stack Design |
- Lite (5 tasks): Quick validation in ~10 minutes - `data/tasks/lite.json`
- Verified (12 tasks): Full evaluation in ~1 hour - `data/tasks/verified.json`
- Realistic: Challenging scenarios - `data/tasks/realistic.json`
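Since each dataset is a JSON file of task objects, a quick tally can confirm what you loaded. This sketch assumes each task carries a `category` field; that schema detail is an assumption for illustration, not documented here:

```python
import json
from collections import Counter

# ASSUMPTION: each dataset file is a JSON array of task objects with a
# "category" field. The real SF-Bench task schema may differ.
def count_by_category(tasks):
    return Counter(task.get("category", "unknown") for task in tasks)

# In a real run: tasks = json.load(open("data/tasks/verified.json"))
sample = json.loads('[{"category": "Apex"}, {"category": "Apex"}, {"category": "LWC"}]')
print(count_by_category(sample))
```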
- Quick Start Guide - Get running in 5 minutes
- What is SF-Bench? - Complete overview
- What is Salesforce? - For beginners
- FAQ - Common questions
- For Companies - Business case & ROI
- For Salesforce Developers - Evaluation guide
- For Researchers - Methodology details
- SWE-bench Comparison - Benchmark comparison
- Validation Methodology - How we validate
- Benchmark Details - Technical specifications
- Full Leaderboard - Complete model rankings
- Result Schema - Result format
- Contributing Guide - How to contribute
- Task Guidelines - Creating new tasks
- Submitting Results
```
sf-bench/
├── sfbench/                  # Core framework
│   ├── engine.py             # Orchestration
│   ├── runners/              # Task runners (Apex, LWC, Flow, etc.)
│   ├── validators/           # Functional validation
│   └── utils/
│       └── ai_agent.py       # AI provider integrations
├── data/
│   ├── tasks/                # Task definitions
│   │   ├── verified.json     # Main benchmark (12 tasks)
│   │   ├── lite.json         # Quick validation (5 tasks)
│   │   └── realistic.json    # Challenging scenarios
│   └── test-scripts/         # Apex test scripts
├── scripts/
│   ├── evaluate.py           # Main evaluation script
│   └── leaderboard.py        # Generate leaderboard
└── docs/                     # Documentation
    ├── getting-started/      # Beginner guides
    ├── personas/             # Persona-specific content
    ├── evaluation/           # Evaluation guides
    └── reference/            # Technical reference
```
| Action | Link |
|---|---|
| Star the repo | GitHub |
| Submit results | Submit Results |
| Report bugs | Issues |
| Add tasks | Contributing |
| Discuss | Issues |
- Documentation: yasarshaikh.github.io/SF-bench
- GitHub: github.com/yasarshaikh/SF-bench
- Leaderboard: View Results
- Issues: Report Bugs
MIT License - see LICENSE for details.
- Inspired by SWE-bench methodology
- Built with Salesforce CLI
- Uses official Salesforce sample apps
⭐ Star us if you find SF-Bench useful!
Help us build the best Salesforce AI benchmark