BigCodeBench v0.2.1.post3
What's Changed
- Fix `calibration` setting in the code evaluation.
- Add `--no_execute` argument for code evaluation.
- Support concurrent API inference for `o1` and `deepseek-chat`.
- Fix API inference for Google Gemini.
- Add `--instruction_prefix` and `--response_prefix` arguments for code generation.
- Change `--id_range` input type.
- Add `--revision` argument for code generation.
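
For readers unfamiliar with the concurrent API inference mentioned above, the general idea can be sketched with Python's standard thread pool. This is only an illustration under stated assumptions, not BigCodeBench's actual implementation: `query_model` and `run_concurrently` are hypothetical names, and the stub returns a canned string instead of calling a real API.

```python
from concurrent.futures import ThreadPoolExecutor


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a blocking API call to a hosted model."""
    return f"response to: {prompt}"


def run_concurrently(prompts: list[str], max_workers: int = 4) -> list[str]:
    """Send all prompts in parallel and return responses in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even when calls finish out of order.
        return list(pool.map(query_model, prompts))


if __name__ == "__main__":
    for response in run_concurrently(["task 0", "task 1", "task 2"]):
        print(response)
```

Threads suit this workload because each request spends most of its time waiting on the network, so several requests can be in flight at once without extra processes.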
Evaluated LLMs (144 models)
- Qwen2.5-Coder-32B-Instruct
- grok-beta
- claude-3-5-haiku-20241022
Full Changelog: v0.2.0...v0.2.1.post2