Anote's Model Leaderboard provides a way to benchmark and compare model performance. We have an API to:
- Add new datasets to the leaderboard across all task types.
- Add new model submissions to existing datasets.
The API is the backbone for a transparent, scalable, and community-driven benchmarking platform for AI models, supporting text classification, named entity recognition, document-level Q&A (chatbot), and line-level Q&A (prompting).
`POST /api/leaderboard/add_dataset`

Purpose: Add a new benchmark dataset to the leaderboard.

Input (JSON):

```json
{
  "name": "Financial Phrasebank - Classification Accuracy",
  "url": "https://huggingface.co/datasets/takala/financial_phrasebank",
  "task_type": "text_classification",
  "description": "A dataset for financial sentiment classification.",
  "models": [
    {
      "rank": 1,
      "model": "Gemini",
      "score": 0.95,
      "ci": "0.93 - 0.97",
      "updated": "Sep 2024"
    }
  ]
}
```

Response:

```json
{
  "status": "success",
  "message": "Dataset added to leaderboard.",
  "dataset_id": "uuid"
}
```

Purpose: List available CSV benchmark datasets in `frontend/public/benchmark_csvs`, with inferred task types and columns.
Response:

```json
{
  "success": true,
  "datasets": [
    { "filename": "Commonsense.csv", "task_type": "multiple_choice", "columns": [ ... ] },
    ...
  ]
}
```
`POST /public/run_csv_benchmarks`

Purpose: Evaluate one or more models across selected CSV datasets and return scores.

Input (JSON):

```json
{
  "models": [
    {"name": "gpt-4o", "provider": "openai", "model": "gpt-4o-mini"},
    {"name": "llama3", "provider": "ollama", "model": "llama3:8b"},
    {"name": "echo", "provider": "echo"}
  ],
  "datasets": ["Commonsense.csv", "Covid.csv"],  // optional subset; defaults to all
  "sample_size": 25                              // optional, per dataset
}
```

Response:

```json
{
  "success": true,
  "runs": [
    {
      "dataset": "Commonsense.csv",
      "task_type": "multiple_choice",
      "count": 25,
      "results": {
        "gpt-4o": {"metric": "accuracy", "score": 0.84},
        "llama3": {"metric": "accuracy", "score": 0.78}
      }
    },
    {
      "dataset": "Covid.csv",
      "task_type": "text_classification",
      "count": 25,
      "results": {
        "gpt-4o": {"metric": "accuracy", "score": 0.92}
      }
    }
  ]
}
```
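As a usage sketch, the endpoint can be driven with the `echo` provider, which needs no API keys and so works as a dry run. The base URL again assumes the local quickstart below.

```python
import requests

# Hedged sketch: run the echo model over one CSV via run_csv_benchmarks.
body = {
    "models": [{"name": "echo", "provider": "echo"}],
    "datasets": ["Commonsense.csv"],  # optional; omit to run all datasets
    "sample_size": 5,
}
resp = requests.post("http://localhost:5001/public/run_csv_benchmarks", json=body)
resp.raise_for_status()
for run in resp.json()["runs"]:
    for model_name, result in run["results"].items():
        print(run["dataset"], model_name, result["metric"], result["score"])
```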
Notes
- Supported tasks are detected from each CSV's headers: multiple_choice (accuracy), text_classification (accuracy), and qa (F1/EM). Files matching no task are skipped. (A hedged sketch of this detection appears after the provider list below.)
- Providers:
  - `openai`: uses `OPENAI_API_KEY` and optional `OPENAI_BASE_URL` for OpenAI-compatible endpoints.
  - `ollama`: uses `OLLAMA_BASE_URL` (default `http://localhost:11434`).
  - `echo`: returns a dummy output (useful for dry runs).
  - `py`: calls a Python function from `backend/models.py` (see below).
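For intuition, header-based task detection could look like the sketch below. This is hypothetical: the real logic lives in the backend, and the column names here are assumptions, not the actual schema.

```python
from typing import List, Optional

# Hypothetical sketch of header-based task detection; column names are assumed.
def infer_task_type(columns: List[str]) -> Optional[str]:
    cols = {c.strip().lower() for c in columns}
    if {"question", "answer"} <= cols:
        # Enumerated option columns suggest multiple choice; otherwise free-form QA.
        if any(c.startswith("choice") or c in {"a", "b", "c", "d"} for c in cols):
            return "multiple_choice"  # scored by accuracy
        return "qa"                   # scored by F1/EM
    if {"text", "label"} <= cols:
        return "text_classification"  # scored by accuracy
    return None                       # unsupported files are skipped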
Define model wrappers in `backend/models.py` that accept a prompt string and return a string response. Env vars supply API keys; if `python-dotenv` is installed, `.env` is loaded automatically. (A minimal wrapper sketch appears after the env var example below.)
- Functions: `zero_shot_gpt4o`, `zero_shot_gpt4o_mini`, `zero_shot_claude`, `zero_shot_gemini`, and optional local HF models.
- The default model set used when `POST /public/run_csv_benchmarks` is called without a `models` list comes from `list_models()`.
- Example `models` item for the Python function provider:

```json
{ "name": "gpt-4o", "provider": "py", "fn": "zero_shot_gpt4o" }
```
Env vars (`.env` example):

```bash
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
XAI_API_KEY=...
```
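To make the wrapper contract concrete, here is a minimal sketch of one such function. It assumes the official `openai` Python client; beyond the function name `zero_shot_gpt4o` and the env vars listed above, the details are illustrative.

```python
# backend/models.py -- minimal sketch of a prompt-in, string-out wrapper.
import os

try:
    from dotenv import load_dotenv
    load_dotenv()  # pick up OPENAI_API_KEY etc. from .env if present
except ImportError:
    pass  # python-dotenv not installed; rely on exported env vars

from openai import OpenAI

_client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url=os.environ.get("OPENAI_BASE_URL"),  # optional OpenAI-compatible endpoint
)

def zero_shot_gpt4o(prompt: str) -> str:
    """Accept a prompt string, return the model's text response."""
    resp = _client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```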
`POST /api/leaderboard/add_model`

Purpose: Add a new model submission to an existing dataset.

Input (JSON):

```json
{
  "dataset_name": "Financial Phrasebank - Classification Accuracy",
  "model": "Llama3",
  "rank": 4,
  "score": 0.92,
  "ci": "0.90 - 0.94",
  "updated": "Sep 2024"
}
```

Response:

```json
{
  "status": "success",
  "message": "Model added to dataset on leaderboard."
}
```

The API will support datasets across all current and future Anote task types:
- Text Classification
- Named Entity Recognition
- Document-Level Q&A (Chatbot)
- Line-Level Q&A (Prompting)
- (Extensible for multimodal tasks and multilingual datasets)
Below is a Flask implementation skeleton:
```python
from flask import Flask, request, jsonify
import uuid

app = Flask(__name__)

leaderboard_data = []  # in-memory store; replace with a DB in production


@app.route('/api/leaderboard/add_dataset', methods=['POST'])
def add_dataset():
    data = request.json
    dataset_id = str(uuid.uuid4())
    data['id'] = dataset_id
    leaderboard_data.append(data)
    return jsonify({
        "status": "success",
        "message": "Dataset added to leaderboard.",
        "dataset_id": dataset_id
    })


@app.route('/api/leaderboard/add_model', methods=['POST'])
def add_model():
    data = request.json
    for dataset in leaderboard_data:
        if dataset['name'] == data['dataset_name']:
            dataset.setdefault('models', []).append({
                "rank": data["rank"],
                "model": data["model"],
                "score": data["score"],
                "ci": data.get("ci"),  # optional confidence interval
                "updated": data["updated"]
            })
            # Return inside the loop so we stop at the first matching dataset.
            return jsonify({
                "status": "success",
                "message": "Model added to dataset on leaderboard."
            })
    return jsonify({"status": "error", "message": "Dataset not found."}), 404


if __name__ == '__main__':
    app.run(debug=True)
```

The API will integrate with:
- Leaderboard Page (https://leaderboard.anote.ai/)
- Submit to Leaderboard Page (https://leaderboard.anote.ai/submit)
This allows direct testing of Flask API calls from the UI to verify real-time table updates.
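Outside the UI, the same calls can be smoke-tested from a script. This is a sketch, assuming the skeleton above is running locally on port 5001 (as in the quickstart below).

```python
import requests

BASE = "http://localhost:5001"

# Hedged smoke test: a submission to a missing dataset should 404,
# then succeed once the dataset exists.
missing = requests.post(f"{BASE}/api/leaderboard/add_model", json={
    "dataset_name": "no-such-dataset", "model": "Llama3",
    "rank": 1, "score": 0.9, "ci": "0.88 - 0.92", "updated": "Sep 2024",
})
assert missing.status_code == 404

requests.post(f"{BASE}/api/leaderboard/add_dataset", json={
    "name": "Financial Phrasebank - Classification Accuracy",
    "task_type": "text_classification",
}).raise_for_status()

ok = requests.post(f"{BASE}/api/leaderboard/add_model", json={
    "dataset_name": "Financial Phrasebank - Classification Accuracy",
    "model": "Llama3", "rank": 4, "score": 0.92,
    "ci": "0.90 - 0.94", "updated": "Sep 2024",
})
print(ok.json()["message"])  # "Model added to dataset on leaderboard."
```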
Run the backend on port 5001 and seed example data, then start the frontend.
- Backend
  - Create a virtualenv (optional) and install deps:

    ```bash
    python -m venv .venv && source .venv/bin/activate
    pip install -r backend/requirements.txt
    ```

  - Start the API on port 5001:

    ```bash
    export PORT=5001 FLASK_ENV=development
    python backend/app.py
    ```

  - Sanity check: open `http://localhost:5001/` or `http://localhost:5001/health`.
- Seed demo data (in another terminal):

  ```bash
  export LEADERBOARD_API_BASE="http://localhost:5001"
  python backend/examples/seed_demo.py
  ```

  This seeds two demo submissions to the `flores_spanish_translation` dataset.
- Frontend
  - In `frontend/`, run `npm install`.
  - Ensure the frontend points to the backend (the default works):

    ```bash
    REACT_APP_API_BASE=http://localhost:5001 npm start
    ```

  - Open the Evaluations page to see demo scores populate.
Notes
- The demo uses an in-memory store by default (no DB needed).
- If you configure MySQL and load `backend/database/schema.sql`, the API will persist to the DB.
Example usage of the Python SDK:

```python
from backend.sdk.leaderboard_sdk import LeaderboardClient

client = LeaderboardClient(base_url="http://localhost:5001")

# Curated leaderboard entries (for the homepage tiles)
client.add_dataset(
    name="Financial Phrasebank - Classification Accuracy",
    task_type="text_classification",
    url="https://huggingface.co/datasets/takala/financial_phrasebank",
    description="A dataset for financial sentiment classification.",
)
client.add_model(
    dataset_name="Financial Phrasebank - Classification Accuracy",
    model="Llama3",
    rank=1,
    score=0.92,
    updated="Sep 2024",
)
print(client.list_datasets())

# Public evaluation flow (translation demo)
src = client.get_source_sentences(dataset_name="flores_spanish_translation", count=3)
sentence_ids = src["sentence_ids"]
model_results = src["source_sentences"]  # echo the sources back for a high BLEU in the demo
print(client.submit_model(
    benchmark_dataset_name="flores_spanish_translation",
    model_name="my-demo-model",
    model_results=model_results,
    sentence_ids=sentence_ids,
))
print(client.get_leaderboard())
```