BigCodeBench v0.2.1.post3

terryyz released this 10 Nov 08:49

8645863

What's Changed

Fix calibration setting in the code evaluation.
Add --no_execute argument for code evaluation.
Support concurrent API inference for o1 and deepseek-chat.
Fix API inference for Google Gemini.
Add --instruction_prefix and --response_prefix arguments for code generation.
Change --id_range input type.
Add --revision argument for code generation.
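
As a rough illustration of the concurrent API inference item above, here is a minimal sketch of issuing several requests in parallel. It is not BigCodeBench's implementation: `query_model` and `generate_concurrently` are hypothetical names, the stand-in function makes no network call, and the thread-pool approach is only an assumption about one common way to do this.

```python
from concurrent.futures import ThreadPoolExecutor


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a network call to a model API."""
    return f"response to: {prompt}"


def generate_concurrently(prompts, max_workers=4):
    """Send prompts in parallel and return responses in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the input iterable.
        return list(pool.map(query_model, prompts))


if __name__ == "__main__":
    for line in generate_concurrently(["task 0", "task 1", "task 2"]):
        print(line)
```

Threads suit this kind of workload because the time is spent waiting on remote responses rather than on local computation.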

Evaluated LLMs (144 models)

Qwen2.5-Coder-32B-Instruct
grok-beta
claude-3-5-haiku-20241022

Full Changelog: v0.2.0...v0.2.1.post2
