We welcome your contributions to the Leaderboard! This guide provides step-by-step instructions for adding a new model to the leaderboard.
The repository is organized as follows:
berkeley-function-call-leaderboard/
├── bfcl_eval/
| ├── constants/ # Global constants and configuration values
│ ├── eval_checker/ # Evaluation modules
│ │ ├── ast_eval/ # AST-based evaluation
│ │ ├── multi_turn_eval/ # Multi-turn evaluation
│ ├── model_handler/ # All model-specific handlers
│ │ ├── local_inference/ # Handlers for locally-hosted models
│ │ │ ├── base_oss_handler.py # Base handler for OSS models
│ │ │ ├── gemma.py # Example: Gemma models
│ │ │ ├── qwen.py # Example: Qwen models (Prompt mode)
│ │ │ ├── qwen_fc.py # Example: Qwen models (FC mode)
│ │ │ ├── deepseek_reasoning.py # Example: DeepSeek reasoning models (with reasoning trace)
│ │ │ ├── ...
│ │ ├── api_inference/ # Handlers for API-based models
│ │ │ ├── openai.py # Example: OpenAI models
│ │ │ ├── claude.py # Example: Claude models
│ │ │ ├── ...
│ │ ├── parser/ # Parsing utilities for Java/JavaScript
│ │ ├── base_handler.py # Base handler blueprint
│ ├── data/ # Datasets
│ ├── scripts/ # Helper scripts
├── result/ # Model responses
├── score/ # Evaluation results
To add a new model, focus primarily on the model_handler directory. You do not need to modify the parsing utilities in model_handler/parser or any other directories.
- Base Handler: Start by reviewing
bfcl_eval/model_handler/base_handler.py. All model handlers inherit from this base class. Theinference_single_turnandinference_multi_turnmethods defined there are helpful for understanding the model response generation pipeline. Thebase_handler.pycontains many useful details in the docstrings of each abstract method, so be sure to review them.- If your model is hosted locally, you should also look at
bfcl_eval/model_handler/local_inference/base_oss_handler.py.
- If your model is hosted locally, you should also look at
- Reference Handlers: Checkout some of the existing model handlers (such as
openai.py,claude.py, etc); you can likely reuse some of the existing code if your new model outputs in a similar format.- If your model is OpenAI-compatible, the
openai.pyhandler will be helpful (and you might be able to just use it as is). - If your model is locally hosted, the
llama_fc.pyhandler or thedeepseek_coder.pyhandler can be good starting points.
- If your model is OpenAI-compatible, the
We support models in two modes:
-
Function Calling (FC) Mode:
Models with native tool/function calling capabilities. For example, OpenAI GPT in FC mode uses thetoolssection as documented in the OpenAI function calling guide. -
Prompting Mode:
Models without native function calling capabilities rely on traditional prompt-based interactions, and we supply the function definitions in thesystem promptsection as opposed to a dedicatedtoolssection. Prompt mode also serve as an alternative approach for models that support FC mode but do not fully leverage its function calling ability (i.e., we only use its normal text generation capability).
For API-based models (such as OpenAI GPT), both FC and Prompting modes can be defined in the same handler. Methods related to FC mode end with _FC, while Prompting mode methods end with _prompting.
For locally-hosted models, we only implement prompting methods to maintain code readablity. If a locally-hosted model has both FC and Prompting modes, you will typically create two separate handlers (e.g., qwen_fc.py for FC mode and qwen.py for Prompting mode).
For API-based Models:
- Implement all the methods marked as "not implemented" under the
FC MethodsorPrompting Methodssections inbase_handler.py, depending on which mode(s) your model supports.
For Locally-Hosted Models:
- Implement the
_format_promptmethod in your handler. - Other methods from the
Prompting Methodssection inbase_oss_handler.pyare already implemented, but you may override them if necessary.
Common Requirements for All Handlers:
Regardless of mode or model type, you should implement the following methods to convert raw model response (output of _parse_query_response_xxx) into standard formats expected by the evaluation pipeline:
-
decode_ast
Converts the raw model response into a structured list of dictionaries, with each dictionary representing a function call:[{"func1": {"param1": "val1", "param2": "val2"}}, {"func2": {"param1": "val1"}}]This helps the evaluation pipeline understand the model’s intended function calls.
-
decode_execute
Converts the raw model response into a list of strings representing callable functions:["func1(param1=val1, param2=val2)", "func2(param1=val1)"]
-
Add a new entry in
bfcl_eval/constants/model_config.pyPopulate every field in the
ModelConfigdataclass:Field What to put in it model_nameModel name as used in the API or on Hugging Face. display_nameModel name as it should appear on the leaderboard. urlLink to the model’s documentation, homepage, or repo. orgCompany or organization that developed the model. licenseLicense under which the model is released. Proprietaryif it’s not open-source.model_handlerName of the handler class (e.g., OpenAIHandler,GeminiHandler). -
(Optional) Add pricing
If the model is billed by token usage, specify prices per million tokens:
input_price = 0.50 # USD per 1M input tokens output_price = 1.00 # USD per 1M output tokens
For free/open-source models, set both to
None. -
Set behavior flags
Flag When to set it to Trueis_fc_modelThe handler invokes the model in its function-calling mode instead of prompt-based mode. underscore_to_dotYour FC model rejects dots ( .) in function names; set this so the dots will auto-converts to underscores during evaluation. -
Update Supported Models
- Add your model to the list of supported models in
SUPPORTED_MODELS.md. Include the model name and type (FC or Prompt) in the table. - Add a new entry in
bfcl_eval/constants/supported_models.pyas well.
- Add your model to the list of supported models in
- Raise a Pull Request with your new Model Handler and the necessary updates to the model config.
- Ensure that the model you add is publicly accessible, either open-source or behind a publicly available API. While you may require authentication, billing, registration, or tokens, the general public should ultimately be able to access the endpoint.
- If your model is not publicly accessible, we would still welcome your contribution, but we unfortunately cannot include it in the public-facing leaderboard.
- Have questions or need help? Join the Discord and visit the
#leaderboardchannel. - Feel free to reach out if you have any questions, concerns, or would like guidance while adding your new model. We’re happy to assist!
Thank you for contributing to the Berkeley Function Calling Leaderboard! We look forward to seeing your model added to the community.