---
layout: default
title: "Benchmark Details - SF-Bench Technical Specifications"
description: "Detailed technical specifications and evaluation results for SF-Bench. Task categories, difficulty breakdown, and verified repositories used in the benchmark."
keywords: "salesforce benchmark specifications, ai benchmark details, salesforce evaluation tasks, benchmark technical details"
---

| Segment | Description | GPT-4o | Claude 3.5 | Gemini 2.0 | Llama 3.3 |
|---|---|---|---|---|---|
| Apex | Triggers, Classes, Tests | -% | -% | -% | -% |
| LWC | Lightning Web Components | -% | -% | -% | -% |
| Flow | Screen Components, Invocable Actions | -% | -% | -% | -% |
| Lightning Pages | FlexiPages, Dynamic Forms | -% | -% | -% | -% |
| Page Layouts | Record Layouts, Compact Layouts | -% | -% | -% | -% |
| Experience Cloud | Sites, Communities | -% | -% | -% | -% |
| Architecture | Full-stack, System Design | -% | -% | -% | -% |
| Deployment | Metadata, Dependencies | -% | -% | -% | -% |
| Agentforce | Agent Scripts, Prompts | -% | -% | -% | -% |
| Overall | All Tasks | -% | -% | -% | -% |

| Difficulty | Total Tasks | Description |
|---|---|---|
| Easy | 2 | Basic configurations, simple fixes |
| Medium | 5 | Multi-step implementations, integrations |
| Hard | 4 | Complex components, advanced patterns |
| Expert | 1 | Full architecture, multi-layer solutions |

| Repository | Stars | Categories | Status |
|---|---|---|---|
| trailheadapps/apex-recipes | 1,059 | Apex | ✅ Active |
| trailheadapps/lwc-recipes | 2,805 | LWC | ✅ Active |
| trailheadapps/dreamhouse-lwc | 469 | LWC, Architecture | ✅ Active |
| trailheadapps/automation-components | 384 | Flow | ✅ Active |
| trailheadapps/ebikes-lwc | 830 | Experience Cloud | ✅ Active |
| trailheadapps/agent-script-recipes | 53 | Agentforce | ✅ Active |
| trailheadapps/coral-cloud | 138 | Data Cloud, AI | ✅ Active |

Each task is evaluated on multiple dimensions:

- Functional Correctness (40%)
  - Tests pass
  - Deployment succeeds
  - Expected behavior achieved
- Code Quality (30%)
  - No hardcoded values
  - Proper error handling
  - Follows Salesforce best practices
- Anti-Gaming Checks (20%)
  - No test-specific hacks
  - Solution addresses root cause
  - Maintainable code
- Documentation (10%)
  - Clear comments
  - README updates where applicable

The weighted score across these dimensions determines the result for each task:

- Pass: Score ≥ 80%
- Partial: 50% ≤ Score < 80%
- Fail: Score < 50%
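
The weighting and thresholds above can be sketched in a few lines of Python. This is only an illustration, not the logic of `scripts/evaluate.py`: the function names and dictionary keys are hypothetical, and it assumes each dimension is reported as a fraction between 0 and 1, which the rubric itself does not specify.

```python
# Weights (in percentage points) come from the rubric above.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "functional_correctness": 40,
    "code_quality": 30,
    "anti_gaming": 20,
    "documentation": 10,
}


def weighted_score(dimension_scores):
    """Combine per-dimension fractions (assumed to be in [0, 1]) into a 0-100 score."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= dimension_scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def classify(score):
    """Map a 0-100 score onto the Pass / Partial / Fail thresholds."""
    if score >= 80:
        return "Pass"
    if score >= 50:
        return "Partial"
    return "Fail"


# Example: fully correct and ungamed, half credit on code quality, no documentation.
example = {
    "functional_correctness": 1.0,
    "code_quality": 0.5,
    "anti_gaming": 1.0,
    "documentation": 0.0,
}
score = weighted_score(example)  # 40 + 15 + 20 + 0
print(score, classify(score))
```

Under these assumptions, a solution that passes every functional and anti-gaming check can still land in the Partial band if code quality and documentation are weak.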

Run SF-Bench on your model and submit results:

```bash
python scripts/evaluate.py --model <your-model> --tasks data/tasks/verified.json
```

Then submit your results to be added to the leaderboard.

Last updated: December 2025