VascoSch92/bench-lab

Bench Lab

⚠️ Early Development Notice

This repository is a work in progress. Things may be incomplete, unstable, or subject to change.

Bench Lab is a framework for evaluating large language models (LLMs), agents, and RAG systems across well-known and custom benchmarks. The project provides a unified interface for benchmarking while offering statistical tools to analyze and improve system performance.

Usage Example

A simple example of the API:

import random

from benchlab.library.math_qa._benchmark import MathQABench


def mock_model(instance, s: str) -> str:
    random_answer = random.randint(1, 10)
    return f"The answer for question {instance.id} is {random_answer}. Or {s}"


def main():
    # init the benchmark
    benchmark = MathQABench(n_instance=5)
    # run your implementation on the benchmark
    execution = benchmark.run(mock_model, kwargs={"s": "I don't know"})
    evaluation = execution.evaluate()
    # finally, aggregate the results and print the benchmark report
    report = evaluation.report()
    report.summary()


if __name__ == "__main__":
    main()
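In the example above, the benchmark accepts any callable that takes a benchmark instance plus extra keyword arguments and returns a string answer. A minimal, self-contained sketch of that callable contract (here `FakeInstance` and `my_model` are hypothetical names for illustration, not part of the Bench Lab API):

```python
from dataclasses import dataclass


# Stand-in for a benchmark instance: mock_model above only relies
# on an `id` attribute being present.
@dataclass
class FakeInstance:
    id: int
    question: str


def my_model(instance: FakeInstance, s: str) -> str:
    # In a real setup this is where you would call your LLM, agent,
    # or RAG pipeline; here we return a deterministic answer.
    return f"The answer for question {instance.id} is 42. Or {s}"
```

Any function with this shape can be passed to `benchmark.run`, with its extra parameters supplied via the `kwargs` argument, as shown with `mock_model` above.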
