It would be great to have a ClawForge skill that runs AgentBench-Live benchmarks.
AgentBench-Live tests AI agents on 10 real-world tasks across 5 domains (code, data, research, tool-use, multi-step). Current results show Claude Code with an average score of 0.74, versus 0.52 for Gemini CLI.
A skill could let users benchmark their own OpenClaw instance directly from the CLI.