This repository contains the results of our finance benchmark evaluations using our Mafin2.5 system. These evaluations are based on the FinanceBench benchmark, introduced in the paper FinanceBench: A New Benchmark for Financial Question Answering.
Mafin2.5 is our latest financial AI model, built on Vectify's RAG 3.0 framework, which enhances performance through reinforcement learning (RL) with Monte Carlo Tree Search (MCTS). This approach improves multi-step reasoning, retrieval accuracy, and financial task automation.
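For readers unfamiliar with the technique, the sketch below shows a generic MCTS loop (UCT selection, expansion, simulation, backpropagation) over candidate retrieval/reasoning steps. It is purely illustrative and is not Mafin2.5's actual implementation; `candidate_actions` and `simulate_reward` are placeholder functions standing in for the real retrieval actions and reward signal.

```python
# Generic illustration of MCTS-style search over multi-step retrieval actions.
# NOT Mafin2.5's implementation -- a minimal sketch of the general technique,
# with placeholder candidate_actions() and simulate_reward() functions.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # e.g. the queries/passages chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0        # cumulative reward from rollouts

def candidate_actions(state):
    """Placeholder: enumerate possible next retrieval/reasoning steps."""
    return [f"{state}->a{i}" for i in range(3)]

def simulate_reward(state):
    """Placeholder: in practice, answer correctness or a learned reward model."""
    return random.random()

def uct_select(node, c=1.4):
    """Pick the child maximizing the UCT score."""
    return max(
        node.children,
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )

def mcts(root_state, iterations=100):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: descend through fully expanded nodes via UCT.
        node = root
        while node.children and len(node.children) == len(candidate_actions(node.state)):
            node = uct_select(node)
        # 2. Expansion: add one unexplored child.
        tried = {ch.state for ch in node.children}
        untried = [a for a in candidate_actions(node.state) if a not in tried]
        if untried:
            node.children.append(Node(untried[0], parent=node))
            node = node.children[-1]
        # 3. Simulation: estimate the value of the new state.
        reward = simulate_reward(node.state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step as the chosen action.
    return max(root.children, key=lambda ch: ch.visits).state

print(mcts("query"))
```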
Mafin1, our earlier model, focused on fine-tuning embedding models for financial AI. For a detailed explanation, refer to our technical report: Mafin: Enhancing Black-Box Embeddings with Model-Augmented Fine-Tuning. This foundational work laid the groundwork for embedding optimization and improved knowledge retrieval in financial applications.
FinanceBench is a pioneering test suite designed to evaluate the performance of large language models (LLMs) on open-book financial question answering (QA). It includes questions about publicly traded companies, each accompanied by corresponding answers and evidence strings. It has the following key features:
- Ecologically valid questions: Covers a diverse set of scenarios relevant to publicly traded companies.
- Model evaluation: Includes assessment of 16 state-of-the-art model configurations such as GPT-4.
- Limitations identified: Highlights the limitations of current LLMs for financial QA, including hallucinations and refusal to answer.
We follow a realistic and practical evaluation setup: all documents are stored in a single database, and Mafin2.5 is tested on the FinanceBench public set. This ensures that the model is evaluated under conditions that closely resemble real-world financial applications. For transparency, we have open-sourced our evaluation code (see Evaluation Code).
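For reference, the following minimal sketch shows what such a single-database evaluation loop can look like. The file name `financebench_open_source.jsonl`, the record keys, and the `rag_answer` / `is_correct` helpers are assumptions made for illustration only; the actual pipeline is in the linked Evaluation Code, and grading in practice involves more than exact string matching.

```python
# Minimal sketch of a single-database evaluation loop (illustrative only).
# Assumptions: questions live in a local JSONL file with "question"/"answer"
# fields, and rag_answer() wraps the RAG system under test, which queries one
# shared index containing every filing.
import json

def rag_answer(question: str) -> str:
    """Placeholder for the system under test."""
    raise NotImplementedError

def is_correct(predicted: str, gold: str) -> bool:
    """Naive placeholder grader; real grading uses an LLM judge and human review."""
    return predicted.strip().lower() == gold.strip().lower()

def evaluate(path: str = "financebench_open_source.jsonl") -> float:
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    correct = sum(is_correct(rag_answer(r["question"]), r["answer"]) for r in records)
    return 100.0 * correct / len(records)
```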
In cases where questions are ambiguous, have multiple valid answers, or are deemed invalid, we rely on expert human annotations to ensure fair and accurate evaluation. For more details, see Human Evaluation.
This figure showcases the progression of Mafin models, highlighting the significant improvement in accuracy from Mafin 1 to Mafin 2.5. The latest iteration, Mafin 2.5, achieves a remarkable accuracy of 98.7%, demonstrating major advancements in reasoning and retrieval capabilities.
As a RAG 3.0 model, Mafin 2.5 is capable of leveraging different base models while maintaining consistent high performance (98.7%). The above figure illustrates its effectiveness across ChatGPT 4o and Deepseek v3, indicating that its strong performance is independent of the underlying LLM. Notably, Deepseek v3 is a privately deployable model, offering an alternative for organizations requiring on-premise or self-hosted AI solutions.
| Method | Accuracy (%) | Full Benchmark? (Coverage) | Results Public? | Source |
|---|---|---|---|---|
| Mafin2.5 | 98.7 | Yes (100%) | Yes | link |
| Quantly | 94 | Yes (100%) | No | link |
| Fintool | 98 | No (66.7%) | No | link |
| ChatGPT 4o + Search | 31 | No (66.7%) | No | link |
| Perplexity | 45 | No (66.7%) | No | link |
This benchmark comparison demonstrates Mafin 2.5's superiority over competitors, achieving the highest accuracy (98.7%) while covering the full benchmark (100%). Unlike some competitors that only evaluate on partial benchmarks, Mafin 2.5 provides a comprehensive and rigorous assessment.
- Mafin 2.5 delivers a major improvement over previous versions, raising accuracy from 38.0% (Mafin 1) to 98.7%, reflecting strong advances in financial AI reasoning.
- Mafin 2.5 is highly adaptable across different base models, achieving identical high performance (98.7%) on both ChatGPT 4o (public cloud) and Deepseek v3 (privately deployable), making it flexible for various deployment needs.
- Mafin 2.5 outperforms market competitors while covering the full benchmark (100%), ensuring a more comprehensive and fair evaluation compared to models that only test on 66.7% of the dataset.
- **Errors and Ambiguities in Evaluation**
  The current benchmark may contain inconsistencies, ambiguities, or errors in ground-truth answers, which can lead to misleading performance evaluations. These issues must be addressed to ensure a fair and reliable assessment of AI capabilities. Establishing a more rigorous annotation and validation process is essential for improving benchmark accuracy.
- **Lack of Multi-Document Reasoning Tasks**
  The current benchmark primarily focuses on simple retrieval tasks based on a single document. However, real-world financial applications require more advanced reasoning capabilities, including multi-step retrieval across multiple documents. To improve the benchmark, we call for the inclusion of complex reasoning tasks that better reflect real-world decision-making and analysis (see the illustrative sketch below).
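To make the request concrete, here is a hypothetical example (not an actual FinanceBench record; all names and values are made up) of the kind of multi-document task entry we have in mind, where answering requires retrieving figures from two separate filings and combining them:

```python
# Hypothetical multi-document task entry (illustrative only; not a real
# FinanceBench record). Answering requires numbers from two different 10-Ks.
multi_doc_task = {
    "question": (
        "Which company reported the higher FY2022 operating margin, "
        "Company A or Company B, and by how many percentage points?"
    ),
    "required_documents": ["COMPANY_A_2022_10K", "COMPANY_B_2022_10K"],
    "reasoning_steps": [
        "Retrieve FY2022 operating income and revenue from each 10-K",
        "Compute each company's operating margin",
        "Compare the margins and report the difference",
    ],
    "answer": "Company A, by 3.2 percentage points",  # made-up illustrative value
}
```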
If you have questions about these results or want to try our model, email us at [email protected].