The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.


# SF-Bench: The Salesforce AI Benchmark

**The Open Salesforce AI Benchmark for Evaluating AI Coding Agents on Salesforce Development**

*Objective measurement. Real execution. Verified results.*



## 🎯 I Am A...

Choose your path:

| πŸ‘€ I'm... | 🎯 I Want To... | ➑️ Go To... |
| --- | --- | --- |
| New to SF-Bench | Understand what this is | What is SF-Bench? |
| New to Salesforce | Learn about Salesforce | What is Salesforce? |
| Company/Enterprise | Evaluate AI tools for my team | For Companies |
| Salesforce Developer | Test AI models on Salesforce | Quick Start |
| Researcher | Benchmark AI models | Evaluation Guide |
| SWE-bench User | Compare with SWE-bench | Comparison |
| Open Source Enthusiast | Contribute to SF-Bench | Contributing |

## πŸ“‹ Prerequisites

Before running SF-Bench, ensure you have:

| Requirement | Details | Where to Get |
| --- | --- | --- |
| Python 3.10+ | Required runtime | python.org |
| Salesforce CLI | `sf` command-line tool | Salesforce CLI |
| DevHub Org | Salesforce org with scratch org allocation | Create DevHub |
| API Key | Provider-specific key (see below) | Provider dashboard |

### API Key Requirements by Provider

| Provider | Environment Variable | Example Models | Get Key |
| --- | --- | --- | --- |
| RouteLLM | `ROUTELLM_API_KEY` | Grok 4.1, GPT-5, Claude Opus 4 | RouteLLM |
| OpenRouter | `OPENROUTER_API_KEY` | Claude Sonnet, GPT-4, Llama | OpenRouter |
| Google Gemini | `GOOGLE_API_KEY` | Gemini 2.5 Flash, Gemini Pro | Google AI Studio |
| Anthropic | `ANTHROPIC_API_KEY` | Claude 3.5 Sonnet, Claude Opus | Anthropic |
| OpenAI | `OPENAI_API_KEY` | GPT-4, GPT-3.5 | OpenAI |

### Resource Requirements

**For Full Evaluation (12 tasks):**

- **Scratch Orgs:** 1-5 orgs (depends on `--max-workers`)
  - Minimum: 1 org (sequential, `--max-workers 1`)
  - Recommended: 2-3 orgs (`--max-workers 2-3`)
  - Maximum: 5 orgs (`--max-workers 5`)
- **Token Usage:** ~100,000 tokens
  - Per task: ~8,000 tokens (input + output + context)
  - Full run: ~96,000 tokens
- **Time:** 1-2 hours (with functional validation)
- **Cost:** $0.10-$2 per evaluation (varies by model)

**For Lite Evaluation (5 tasks):**

- **Scratch Orgs:** 1-3 orgs
- **Token Usage:** ~40,000 tokens
- **Time:** ~10-15 minutes
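The budget figures above can be sanity-checked with a quick back-of-the-envelope calculation. This is a sketch only: the per-million-token prices below are illustrative assumptions, not quotes from any provider.

```python
# Rough cost estimate for a full SF-Bench run (12 tasks).
# Token figures mirror the README's estimates; prices are illustrative.
TOKENS_PER_TASK = 8_000          # input + output + context, per the docs
NUM_TASKS = 12

total_tokens = TOKENS_PER_TASK * NUM_TASKS   # ~96,000 tokens

def estimate_cost(price_per_million: float) -> float:
    """Cost in USD for the full run at a given $/1M-token rate."""
    return total_tokens / 1_000_000 * price_per_million

# A cheap model (~$1/M tokens) vs. a premium model (~$20/M tokens):
print(f"{total_tokens} tokens")
print(f"cheap:   ${estimate_cost(1.0):.2f}")    # ~$0.10
print(f"premium: ${estimate_cost(20.0):.2f}")   # ~$1.92
```

At assumed rates of $1-$20 per million tokens, the full run lands in the $0.10-$2 range quoted above.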

## πŸš€ Quick Start (5 Minutes)

```bash
# 1. Install
git clone https://github.com/yasarshaikh/SF-bench.git
cd SF-bench
pip install -e .

# 2. Set API key (example: RouteLLM for Grok 4.1)
export ROUTELLM_API_KEY="your-key"

# 3. Run evaluation
python scripts/evaluate.py --model grok-4.1-fast --tasks data/tasks/verified.json --functional
```

πŸ“– Full Quick Start Guide β†’


πŸ† Current Leaderboard

Results as of December 2025

Rank Model Overall Functional Score LWC Deploy Apex Flow
πŸ₯‡ Claude Sonnet 4.5 41.67% 6.0% 100% 100% 100% 0%*
πŸ₯ˆ Gemini 2.5 Flash 25.0% - 100% 100% 0%* 0%*

* Flow tasks failed due to scratch org creation issues (being fixed)

πŸ“Š View Full Leaderboard β†’


## 🎯 What is SF-Bench?

SF-Bench is an open, objective benchmark for measuring how well AI coding agents perform on Salesforce development tasks.

### Why Salesforce-Specific?

Generic benchmarks (HumanEval, SWE-bench) miss Salesforce-specific challenges:

| Challenge | What We Test |
| --- | --- |
| Multi-modal development | Apex, LWC (JavaScript), Flows (XML), Metadata |
| Platform execution | Real scratch orgs, not just syntax checks |
| Governor limits | CPU time, SOQL queries, heap size |
| Declarative tools | Flows, Lightning Pages, Permission Sets |
| Enterprise patterns | Triggers, batch jobs, integrations |

### We Are Auditors, Not Predictors

- βœ… Measure actual performance
- βœ… Report objective results
- βœ… Verify functional outcomes
- ❌ Don't predict what models "should" score
- ❌ Don't claim expected success rates

## πŸ”§ How It Works

### Validation Pipeline

```
1. LOAD TASK         β†’ Read task from data/tasks/*.json
2. CLONE REPO        β†’ Clone specified GitHub repo
3. APPLY SOLUTION    β†’ Apply AI-generated patch
4. DEPLOY            β†’ Deploy to Salesforce scratch org
5. RUN TESTS         β†’ Execute unit tests
6. VERIFY OUTCOME    β†’ Check functional requirements
7. REPORT RESULT     β†’ PASS / FAIL / ERROR
```
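The steps above can be sketched as a simple orchestration loop. This is a minimal illustration, not the actual `sfbench/engine.py` implementation; the step functions here are hypothetical stand-ins for the real runners.

```python
from typing import Callable

def run_pipeline(task: dict, steps: list[tuple[str, Callable[[dict], bool]]]) -> str:
    """Run validation steps in order; stop at the first failure or error."""
    for name, step in steps:
        try:
            if not step(task):
                return f"FAIL at {name}"
        except Exception:
            return f"ERROR at {name}"
    return "PASS"

# Demo with dummy steps: deployment succeeds, but unit tests fail,
# so the pipeline short-circuits and reports the failing step.
steps = [
    ("DEPLOY", lambda t: True),
    ("RUN TESTS", lambda t: False),
    ("VERIFY OUTCOME", lambda t: True),
]
print(run_pipeline({}, steps))  # FAIL at RUN TESTS
```

Short-circuiting on the first failure mirrors the pipeline's ordering: a solution that never deploys is never functionally tested.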

### Validation Levels (Binary Pass/Fail with Diagnostic Scoring)

**CRITICAL: Binary Pass/Fail Methodology**

- A task is **PASSED** only if the functional requirement is met
- The score breakdown (0-100) is diagnostic metadata only; it helps identify where failures occurred
- This follows SWE-bench methodology: if the functional requirement isn't met, the task fails regardless of other checks

| Level | Weight | What We Check | Pass Criteria |
| --- | --- | --- | --- |
| Deployment | 10% | Solution deploys without errors | Required but not sufficient |
| Unit Tests | 20% | All tests pass, coverage β‰₯80% | Required but not sufficient |
| Functional | 50% | Business outcome achieved | **REQUIRED** - Gatekeeper |
| Bulk Operations | 10% | Handles 200+ records | Diagnostic only |
| No Manual Tweaks | 10% | Works in one shot | Diagnostic only |
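The table above reduces to a small amount of logic: the diagnostic score is a weighted sum of the checks, while pass/fail is gated solely on the functional check. A minimal sketch (not the framework's actual scoring code; the check names are illustrative keys):

```python
# Weights from the validation-levels table (sum to 100).
WEIGHTS = {
    "deployment": 10,
    "unit_tests": 20,
    "functional": 50,
    "bulk_ops": 10,
    "no_manual_tweaks": 10,
}

def evaluate(checks: dict[str, bool]) -> tuple[bool, int]:
    """Return (passed, diagnostic_score).

    The 0-100 score is metadata only; the functional check is the
    gatekeeper that decides binary pass/fail.
    """
    score = sum(w for name, w in WEIGHTS.items() if checks.get(name, False))
    passed = checks.get("functional", False)
    return passed, score

# A solution that deploys and passes unit tests but misses the business
# outcome still FAILS, even with a nonzero diagnostic score:
passed, score = evaluate({"deployment": True, "unit_tests": True, "functional": False})
print(passed, score)  # False 30
```

The nonzero score on a failed task is what makes the breakdown useful for diagnosis: it tells you the solution got as far as passing tests before missing the functional requirement.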

## πŸ§ͺ Testing Your Model

### Supported AI Providers

| Provider | Environment Variable | Example Model |
| --- | --- | --- |
| OpenRouter | `OPENROUTER_API_KEY` | anthropic/claude-3.5-sonnet |
| RouteLLM | `ROUTELLM_API_KEY` | gemini-3-flash-preview |
| OpenAI | `OPENAI_API_KEY` | gpt-4-turbo |
| Anthropic | `ANTHROPIC_API_KEY` | claude-3-5-sonnet-20241022 |
| Google Gemini | `GOOGLE_API_KEY` | gemini-2.5-flash |
| Ollama (local) | None needed | codellama |

### Quick Test

```bash
# With OpenRouter (recommended - access to 100+ models)
export OPENROUTER_API_KEY="your-key"
python scripts/evaluate.py --model anthropic/claude-3.5-sonnet --tasks data/tasks/verified.json

# With Gemini
export GOOGLE_API_KEY="your-key"
python scripts/evaluate.py --model gemini-2.5-flash --tasks data/tasks/verified.json

# With local Ollama
python scripts/evaluate.py --model codellama --provider ollama --tasks data/tasks/verified.json
```

## πŸ“Š Task Categories

SF-Bench includes 12 verified tasks across Salesforce development domains:

| Category | Tasks | Description | Lite Dataset |
| --- | --- | --- | --- |
| Apex | 2 | Triggers, Classes, Integrations | βœ… |
| LWC | 2 | Lightning Components | βœ… |
| Flow | 2 | Record-Triggered Flows, Invocable Actions | βœ… |
| Lightning Pages | 1 | Dynamic Forms | βœ… |
| Experience Cloud | 1 | Guest Access | ❌ |
| Architecture | 4 | Full-stack Design | βœ… |

### Datasets

- **Lite** (5 tasks): Quick validation in ~10 minutes - `data/tasks/lite.json`
- **Verified** (12 tasks): Full evaluation in ~1 hour - `data/tasks/verified.json`
- **Realistic**: Challenging scenarios - `data/tasks/realistic.json`
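Inspecting a dataset before a run is straightforward. A minimal sketch, assuming each dataset file holds a JSON array of task objects with a `category` field (a hypothetical schema for illustration; the real task format is defined in `data/tasks/`):

```python
import json
from collections import Counter

def summarize_tasks(path: str) -> Counter:
    """Count tasks per category in a dataset file.

    Assumes a JSON array of task objects, each carrying a 'category'
    key -- an illustrative schema, not a documented contract.
    """
    with open(path) as f:
        tasks = json.load(f)
    return Counter(task.get("category", "unknown") for task in tasks)

# e.g. summarize_tasks("data/tasks/lite.json")
```

A quick summary like this confirms you are pointing `--tasks` at the dataset you intend before spending scratch orgs and tokens.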

## πŸ“– Documentation

- Getting Started
- For Different Audiences
- Reference
- Contributing

πŸ“ Project Structure

sf-bench/
β”œβ”€β”€ sfbench/                  # Core framework
β”‚   β”œβ”€β”€ engine.py             # Orchestration
β”‚   β”œβ”€β”€ runners/              # Task runners (Apex, LWC, Flow, etc.)
β”‚   β”œβ”€β”€ validators/           # Functional validation
β”‚   └── utils/
β”‚       └── ai_agent.py       # AI provider integrations
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ tasks/                # Task definitions
β”‚   β”‚   β”œβ”€β”€ verified.json     # Main benchmark (12 tasks)
β”‚   β”‚   β”œβ”€β”€ lite.json         # Quick validation (5 tasks)
β”‚   β”‚   └── realistic.json   # Challenging scenarios
β”‚   └── test-scripts/         # Apex test scripts
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ evaluate.py           # Main evaluation script
β”‚   └── leaderboard.py       # Generate leaderboard
└── docs/                     # Documentation
    β”œβ”€β”€ getting-started/      # Beginner guides
    β”œβ”€β”€ personas/             # Persona-specific content
    β”œβ”€β”€ evaluation/           # Evaluation guides
    └── reference/            # Technical reference

## 🀝 Get Involved

| Action | Link |
| --- | --- |
| ⭐ Star the repo | GitHub |
| πŸ“Š Submit results | Submit Results |
| πŸ› Report bugs | Issues |
| βž• Add tasks | Contributing |
| πŸ’¬ Discuss | Issues |

## πŸ”— Links

## πŸ“œ License

MIT License - see LICENSE for details.


πŸ™ Acknowledgments



⭐ Star us if you find SF-Bench useful!
Help us build the best Salesforce AI benchmark
