---
layout: default
title: What is SF-Bench? - Complete Overview
description: Learn what SF-Bench is, how it works, and why it matters. Perfect for first-time visitors.
keywords: what is sf-bench, salesforce benchmark explained, ai benchmark overview
---
SF-Bench is the first comprehensive benchmark for evaluating AI coding agents on real-world Salesforce development tasks.
Existing AI benchmarks (like HumanEval and SWE-bench) test general programming but miss Salesforce-specific challenges:
❌ They don't test:
- Platform-specific constraints (governor limits)
- Multi-modal development (Apex + LWC + Flow)
- Real Salesforce execution (scratch orgs)
- Business logic validation
✅ SF-Bench does:
- Tests in real Salesforce environments
- Validates functional outcomes (not just syntax)
- Covers all Salesforce development types
- Measures production-ready code
What's inside:
- 12+ verified Salesforce development tasks
- Based on real-world scenarios
- From official Salesforce sample apps

What it does:
- Tests how well AI generates Salesforce code
- Measures functional correctness
- Reports objective results

What you get:
- Leaderboard of model performance
- Detailed breakdowns by task type
- Functional validation scores
```
1. Task Definition
   ↓
2. AI Generates Solution
   ↓
3. Deploy to Salesforce Scratch Org
   ↓
4. Run Unit Tests
   ↓
5. Verify Functional Outcome
   ↓
6. Score & Report
```
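The stages above run in sequence, and a failure early in the chain means later stages never execute (for example, code that doesn't deploy can't be tested). A minimal sketch of that flow, with stub stage functions standing in for SF-Bench's real internals:

```python
# Sketch of a sequential evaluation pipeline. Stage names mirror the steps
# above; the lambda stubs are illustrative placeholders, not SF-Bench's API.

def run_pipeline(stages):
    """Run (name, fn) stages in order; stop at the first failing stage."""
    results = {}
    for name, fn in stages:
        ok = fn()
        results[name] = ok
        if not ok:  # e.g. a failed deploy means unit tests never run
            break
    return results

# Stub stages for steps 2-5 (step 1, the task definition, is the input).
stages = [
    ("generate", lambda: True),
    ("deploy", lambda: True),
    ("unit_tests", lambda: True),
    ("functional", lambda: True),
]
print(run_pipeline(stages))
```

The short-circuit on failure is the key property: a solution is only scored on stages it actually reached.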
| Task Type | What It Tests |
|---|---|
| Apex | Backend code (triggers, classes) |
| LWC | Frontend components (JavaScript) |
| Flow | Visual automation |
| Lightning Pages | UI configuration |
| Experience Cloud | Public-facing sites |
| Architecture | Full-stack solutions |
| Component | Weight | What It Checks |
|---|---|---|
| Deployment | 10% | Code deploys successfully |
| Unit Tests | 20% | All tests pass, coverage ≥80% |
| Functional | 50% | Business outcome achieved |
| Bulk Operations | 10% | Handles 200+ records |
| No Manual Tweaks | 10% | Works in one shot |
Key: Functional validation (50%) ensures the solution actually works, not just compiles.
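The weights in the table combine into a single score per task. A hypothetical sketch of that calculation, using the weights above (the component names and the pass-fraction interface are illustrative, not SF-Bench's actual result schema):

```python
# Weighted scoring sketch using the weights from the table above.
WEIGHTS = {
    "deployment": 0.10,
    "unit_tests": 0.20,
    "functional": 0.50,
    "bulk_operations": 0.10,
    "no_manual_tweaks": 0.10,
}

def weighted_score(checks):
    """checks maps component name -> pass fraction in [0.0, 1.0]."""
    return sum(w * checks.get(name, 0.0) for name, w in WEIGHTS.items())

# A solution that deploys, passes its tests, and achieves the business
# outcome, but fails bulk handling and needed a manual tweak:
partial = weighted_score({"deployment": 1, "unit_tests": 1, "functional": 1})
print(round(partial, 2))  # → 0.8
```

Because functional validation carries half the weight, a solution that merely compiles and deploys tops out well below a passing score.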
For AI researchers:
- Benchmark model performance
- Compare different models
- Research AI capabilities

For companies:
- Evaluate AI tools for Salesforce development
- Choose the best AI coding assistant
- Measure ROI of AI tools

For developers:
- Understand AI capabilities
- Choose AI tools
- Learn best practices

For AI model providers:
- Test and improve models
- Showcase capabilities
- Competitive benchmarking
| Benchmark | Focus | SF-Bench |
|---|---|---|
| HumanEval | General Python programming | Salesforce-specific, real execution |
| SWE-bench | Open-source Python projects | Salesforce platform, enterprise focus |
| CodeXGLUE | Multiple languages, syntax-focused | Salesforce-only, functional validation |
Real-world testing:
- Tests actual Salesforce development
- Validates functional outcomes
- Production-ready code

Objective results:
- No predictions or claims
- Just facts and results
- Transparent methodology

Comprehensive coverage:
- All Salesforce development types
- Multiple difficulty levels
- Real-world scenarios

Open and free:
- Open source (MIT license)
- Free to use
- Community-driven
```shell
# 1. Install
git clone https://github.com/yasarshaikh/SF-bench.git
cd SF-bench
pip install -e .

# 2. Set API key
export OPENROUTER_API_KEY="your-key"

# 3. Run evaluation
python scripts/evaluate.py --model anthropic/claude-3.5-sonnet
```

Prerequisites:
- Python 3.10+
- Salesforce CLI
- DevHub org (free)
- AI model API key
- What is Salesforce? - If you're new to Salesforce
- Quick Start Guide - Get running in 5 minutes
- FAQ - Common questions
- For Companies - Business case and ROI
- Comparison with Competitors - Benchmark comparison
- Evaluation Guide - Complete guide
- Validation Methodology - How we validate
- Task Schema - Technical details
- Methodology - Detailed methodology
- Benchmark Details - Technical specifications
- Result Schema - Result format
See which models perform best: Leaderboard →
- ⭐ Star the repo
- 📊 Submit your model's results
- ➕ Contribute tasks
- 🐛 Report bugs
- 💬 Join discussions
Ready to start? Check out our Quick Start Guide!