The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.


# SF-Bench: The Salesforce AI Benchmark

**The Open Salesforce AI Benchmark for Evaluating AI Coding Agents on Salesforce Development**

*Objective measurement. Real execution. Verified results.*



## 🎯 I Am A...

Choose your path:

| πŸ‘€ I'm... | 🎯 I Want To... | ➑️ Go To... |
| --- | --- | --- |
| New to SF-Bench | Understand what this is | What is SF-Bench? |
| New to Salesforce | Learn about Salesforce | What is Salesforce? |
| Company/Enterprise | Evaluate AI tools for my team | For Companies |
| Salesforce Developer | Test AI models on Salesforce | Quick Start |
| Researcher | Benchmark AI models | Evaluation Guide |
| SWE-bench User | Compare with SWE-bench | Comparison |
| Open Source Enthusiast | Contribute to SF-Bench | Contributing |

## πŸ“‹ Prerequisites

Before running SF-Bench, ensure you have:

| Requirement | Details | Where to Get |
| --- | --- | --- |
| Python 3.10+ | Required runtime | python.org |
| Salesforce CLI | `sf` command-line tool | Salesforce CLI |
| DevHub Org | Salesforce org with scratch org allocation | Create DevHub |
| API Key | Provider-specific key (see below) | Provider dashboard |

### API Key Requirements by Provider

| Provider | Environment Variable | Example Models | Get Key |
| --- | --- | --- | --- |
| RouteLLM | `ROUTELLM_API_KEY` | Grok 4.1, GPT-5, Claude Opus 4 | RouteLLM |
| OpenRouter | `OPENROUTER_API_KEY` | Claude Sonnet, GPT-4, Llama | OpenRouter |
| Google Gemini | `GOOGLE_API_KEY` | Gemini 2.5 Flash, Gemini Pro | Google AI Studio |
| Anthropic | `ANTHROPIC_API_KEY` | Claude 3.5 Sonnet, Claude Opus | Anthropic |
| OpenAI | `OPENAI_API_KEY` | GPT-4, GPT-3.5 | OpenAI |

### Resource Requirements

**For Full Evaluation (12 tasks):**

- **Scratch Orgs:** 1-5 orgs (depends on `--max-workers`)
  - Minimum: 1 org (sequential, `--max-workers 1`)
  - Recommended: 2-3 orgs (`--max-workers 2-3`)
  - Maximum: 5 orgs (`--max-workers 5`)
- **Token Usage:** ~100,000 tokens
  - Per task: ~8,000 tokens (input + output + context)
  - Full run: ~96,000 tokens
- **Time:** 1-2 hours (with functional validation)
- **Cost:** $0.10-$2 per evaluation (varies by model)

**For Lite Evaluation (5 tasks):**

- **Scratch Orgs:** 1-3 orgs
- **Token Usage:** ~40,000 tokens
- **Time:** ~10-15 minutes
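The budget figures above can be sanity-checked with a quick back-of-the-envelope calculation. This is a sketch only: the per-million-token prices below are illustrative assumptions, not quotes from any provider.

```python
# Rough cost estimate for a full SF-Bench run (12 tasks).
# Token figures mirror the README's estimates; prices are illustrative.
TOKENS_PER_TASK = 8_000          # input + output + context, per the docs
NUM_TASKS = 12

total_tokens = TOKENS_PER_TASK * NUM_TASKS   # ~96,000 tokens

def estimate_cost(price_per_million: float) -> float:
    """Cost in USD for the full run at a given $/1M-token rate."""
    return total_tokens / 1_000_000 * price_per_million

# A cheap model (~$1/M tokens) vs. a premium model (~$20/M tokens):
print(f"{total_tokens} tokens")
print(f"cheap:   ${estimate_cost(1.0):.2f}")    # ~$0.10
print(f"premium: ${estimate_cost(20.0):.2f}")   # ~$1.92
```

At assumed rates of $1-$20 per million tokens, the full run lands in the $0.10-$2 range quoted above.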

## πŸš€ Quick Start (5 Minutes)

```bash
# 1. Install
git clone https://github.com/yasarshaikh/SF-bench.git
cd SF-bench
pip install -e .

# 2. Set API key (example: RouteLLM for Grok 4.1)
export ROUTELLM_API_KEY="your-key"

# 3. Run evaluation
python scripts/evaluate.py --model grok-4.1-fast --tasks data/tasks/verified.json --functional
```

πŸ“– Full Quick Start Guide β†’


πŸ† Current Leaderboard

Results as of December 2025

Rank Model Overall Functional Score LWC Deploy Apex Flow
πŸ₯‡ Claude Sonnet 4.5 41.67% 6.0% 100% 100% 100% 0%*
πŸ₯ˆ Gemini 2.5 Flash 25.0% - 100% 100% 0%* 0%*

* Flow tasks failed due to scratch org creation issues (being fixed)

πŸ“Š View Full Leaderboard β†’


## 🎯 What is SF-Bench?

SF-Bench is an open, objective benchmark for measuring how well AI coding agents perform on Salesforce development tasks.

### Why Salesforce-Specific?

Generic benchmarks (HumanEval, SWE-bench) miss Salesforce-specific challenges:

| Challenge | What We Test |
| --- | --- |
| Multi-modal development | Apex, LWC (JavaScript), Flows (XML), Metadata |
| Platform execution | Real scratch orgs, not just syntax checks |
| Governor limits | CPU time, SOQL queries, heap size |
| Declarative tools | Flows, Lightning Pages, Permission Sets |
| Enterprise patterns | Triggers, batch jobs, integrations |

### We Are Auditors, Not Predictors

- βœ… Measure actual performance
- βœ… Report objective results
- βœ… Verify functional outcomes
- ❌ Don't predict what models "should" score
- ❌ Don't claim expected success rates

## πŸ”§ How It Works

### Validation Pipeline

```
1. LOAD TASK         β†’ Read task from data/tasks/*.json
2. CLONE REPO        β†’ Clone specified GitHub repo
3. APPLY SOLUTION    β†’ Apply AI-generated patch
4. DEPLOY            β†’ Deploy to Salesforce scratch org
5. RUN TESTS         β†’ Execute unit tests
6. VERIFY OUTCOME    β†’ Check functional requirements
7. REPORT RESULT     β†’ PASS / FAIL / ERROR
```
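The steps above can be sketched as a simple orchestration loop. This is a minimal illustration, not the actual `sfbench/engine.py` implementation; the step functions here are hypothetical stand-ins for the real runners.

```python
from typing import Callable

def run_pipeline(task: dict, steps: list[tuple[str, Callable[[dict], bool]]]) -> str:
    """Run validation steps in order; stop at the first failure or error."""
    for name, step in steps:
        try:
            if not step(task):
                return f"FAIL at {name}"
        except Exception:
            return f"ERROR at {name}"
    return "PASS"

# Demo with dummy steps: deployment succeeds, but unit tests fail,
# so the pipeline short-circuits and reports the failing step.
steps = [
    ("DEPLOY", lambda t: True),
    ("RUN TESTS", lambda t: False),
    ("VERIFY OUTCOME", lambda t: True),
]
print(run_pipeline({}, steps))  # FAIL at RUN TESTS
```

Short-circuiting on the first failure mirrors the pipeline's ordering: a solution that never deploys is never functionally tested.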

### Validation Levels (Binary Pass/Fail with Diagnostic Scoring)

**CRITICAL: Binary Pass/Fail Methodology**

- A task is **PASSED** only if the functional requirement is met
- The score breakdown (0-100) is diagnostic metadata only; it helps identify where failures occurred
- This follows SWE-bench methodology: if the functional requirement isn't met, the task fails regardless of other checks

| Level | Weight | What We Check | Pass Criteria |
| --- | --- | --- | --- |
| Deployment | 10% | Solution deploys without errors | Required but not sufficient |
| Unit Tests | 20% | All tests pass, coverage β‰₯80% | Required but not sufficient |
| Functional | 50% | Business outcome achieved | **REQUIRED** - Gatekeeper |
| Bulk Operations | 10% | Handles 200+ records | Diagnostic only |
| No Manual Tweaks | 10% | Works in one shot | Diagnostic only |
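The table above reduces to a small amount of logic: the diagnostic score is a weighted sum of the checks, while pass/fail is gated solely on the functional check. A minimal sketch (not the framework's actual scoring code; the check names are illustrative keys):

```python
# Weights from the validation-levels table (sum to 100).
WEIGHTS = {
    "deployment": 10,
    "unit_tests": 20,
    "functional": 50,
    "bulk_ops": 10,
    "no_manual_tweaks": 10,
}

def evaluate(checks: dict[str, bool]) -> tuple[bool, int]:
    """Return (passed, diagnostic_score).

    The 0-100 score is metadata only; the functional check is the
    gatekeeper that decides binary pass/fail.
    """
    score = sum(w for name, w in WEIGHTS.items() if checks.get(name, False))
    passed = checks.get("functional", False)
    return passed, score

# A solution that deploys and passes unit tests but misses the business
# outcome still FAILS, even with a nonzero diagnostic score:
passed, score = evaluate({"deployment": True, "unit_tests": True, "functional": False})
print(passed, score)  # False 30
```

The nonzero score on a failed task is what makes the breakdown useful for diagnosis: it tells you the solution got as far as passing tests before missing the functional requirement.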

## πŸ§ͺ Testing Your Model

### Supported AI Providers

| Provider | Environment Variable | Example Model |
| --- | --- | --- |
| OpenRouter | `OPENROUTER_API_KEY` | anthropic/claude-3.5-sonnet |
| RouteLLM | `ROUTELLM_API_KEY` | gemini-3-flash-preview |
| OpenAI | `OPENAI_API_KEY` | gpt-4-turbo |
| Anthropic | `ANTHROPIC_API_KEY` | claude-3-5-sonnet-20241022 |
| Google Gemini | `GOOGLE_API_KEY` | gemini-2.5-flash |
| Ollama (local) | None needed | codellama |

### Quick Test

```bash
# With OpenRouter (recommended - access to 100+ models)
export OPENROUTER_API_KEY="your-key"
python scripts/evaluate.py --model anthropic/claude-3.5-sonnet --tasks data/tasks/verified.json

# With Gemini
export GOOGLE_API_KEY="your-key"
python scripts/evaluate.py --model gemini-2.5-flash --tasks data/tasks/verified.json

# With local Ollama
python scripts/evaluate.py --model codellama --provider ollama --tasks data/tasks/verified.json
```

## πŸ“Š Task Categories

SF-Bench includes 12 verified tasks across Salesforce development domains:

| Category | Tasks | Description | Lite Dataset |
| --- | --- | --- | --- |
| Apex | 2 | Triggers, Classes, Integrations | βœ… |
| LWC | 2 | Lightning Components | βœ… |
| Flow | 2 | Record-Triggered Flows, Invocable Actions | βœ… |
| Lightning Pages | 1 | Dynamic Forms | βœ… |
| Experience Cloud | 1 | Guest Access | ❌ |
| Architecture | 4 | Full-stack Design | βœ… |

### Datasets

- **Lite** (5 tasks): Quick validation in ~10 minutes - `data/tasks/lite.json`
- **Verified** (12 tasks): Full evaluation in ~1 hour - `data/tasks/verified.json`
- **Realistic**: Challenging scenarios - `data/tasks/realistic.json`
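Inspecting a dataset before a run is straightforward. A minimal sketch, assuming each dataset file holds a JSON array of task objects with a `category` field (a hypothetical schema for illustration; the real task format is defined in `data/tasks/`):

```python
import json
from collections import Counter

def summarize_tasks(path: str) -> Counter:
    """Count tasks per category in a dataset file.

    Assumes a JSON array of task objects, each carrying a 'category'
    key -- an illustrative schema, not a documented contract.
    """
    with open(path) as f:
        tasks = json.load(f)
    return Counter(task.get("category", "unknown") for task in tasks)

# e.g. summarize_tasks("data/tasks/lite.json")
```

A quick summary like this confirms you are pointing `--tasks` at the dataset you intend before spending scratch orgs and tokens.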

## πŸ“– Documentation

- Getting Started
- For Different Audiences
- Reference
- Contributing

πŸ“ Project Structure

sf-bench/
β”œβ”€β”€ sfbench/                  # Core framework
β”‚   β”œβ”€β”€ engine.py             # Orchestration
β”‚   β”œβ”€β”€ runners/              # Task runners (Apex, LWC, Flow, etc.)
β”‚   β”œβ”€β”€ validators/           # Functional validation
β”‚   └── utils/
β”‚       └── ai_agent.py       # AI provider integrations
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ tasks/                # Task definitions
β”‚   β”‚   β”œβ”€β”€ verified.json     # Main benchmark (12 tasks)
β”‚   β”‚   β”œβ”€β”€ lite.json         # Quick validation (5 tasks)
β”‚   β”‚   └── realistic.json   # Challenging scenarios
β”‚   └── test-scripts/         # Apex test scripts
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ evaluate.py           # Main evaluation script
β”‚   └── leaderboard.py       # Generate leaderboard
└── docs/                     # Documentation
    β”œβ”€β”€ getting-started/      # Beginner guides
    β”œβ”€β”€ personas/             # Persona-specific content
    β”œβ”€β”€ evaluation/           # Evaluation guides
    └── reference/            # Technical reference

## 🀝 Get Involved

| Action | Link |
| --- | --- |
| ⭐ Star the repo | GitHub |
| πŸ“Š Submit results | Submit Results |
| πŸ› Report bugs | Issues |
| βž• Add tasks | Contributing |
| πŸ’¬ Discuss | Issues |

## πŸ”— Links

## πŸ“œ License

MIT License - see LICENSE for details.


πŸ™ Acknowledgments



⭐ Star us if you find SF-Bench useful!
Help us build the best Salesforce AI benchmark
