WorkRB (pronounced "worker bee") is an open-source evaluation toolbox for benchmarking AI models in the work research domain. It provides a standardized, easy-to-use, community-driven framework that scales evaluation across a wide range of tasks, ontologies, and models.
- 🐝 One Buzzing Work Toolkit – Easily download & access ontologies, datasets, and baselines in a single toolkit
- 🧪 Extensive tasks – Evaluate models on job–skill matching, normalization, extraction, and similarity
- 🌍 Dynamic Multilinguality – Evaluate across languages driven by multilingual ontologies
- 🧠 Ready-to-go Baselines – Leverage the provided baseline models for comparison
- 🧩 Extensible design – Add your custom tasks and models with simple interfaces
import workrb
# 1. Initialize a model
model = workrb.models.BiEncoderModel("all-MiniLM-L6-v2")
# 2. Select (multilingual) tasks to evaluate
tasks = [
    workrb.tasks.ESCOJob2SkillRanking(split="val", languages=["en"]),
    workrb.tasks.ESCOSkillNormRanking(split="val", languages=["de", "fr"]),
]
# 3. Run benchmark & view results
results = workrb.evaluate(
    model,
    tasks,
    output_folder="results/my_model",
)
print(results)

Install WorkRB simply via pip:
pip install workrb

Requirements: Python 3.10+; see pyproject.toml for all dependencies.
This section covers common usage patterns.
Add your custom task or model by (1) inheriting from a predefined base class and implementing the abstract methods, and (2) adding it to the registry:
- Custom tasks: Inherit from `RankingTask`, `MultilabelClassificationTask`, ... and implement the abstract methods. Register via `@register_task()`.
- Custom models: Inherit from `ModelInterface` and implement the abstract methods. Register via `@register_model()`.
from workrb.tasks.abstract.ranking_base import RankingTask
from workrb.models.base import ModelInterface
from workrb.registry import register_task, register_model
@register_task()
class MyCustomTask(RankingTask):
    name: str = "MyCustomTask"
    ...

@register_model()
class MyCustomModel(ModelInterface):
    name: str = "MyCustomModel"
    ...

# Use your custom model and task:
model_results = workrb.evaluate(MyCustomModel(), [MyCustomTask()])

For detailed examples, see:
- examples/custom_task_example.py for a complete custom task implementation
- examples/custom_model_example.py for a complete custom model implementation
Feel free to make a PR to add your models & tasks to the official package! See CONTRIBUTING guidelines for details.
WorkRB automatically saves a result checkpoint after each task completes in a given language.
Automatic Resuming - Simply rerun with the same `output_folder`:
# Run 1: Gets interrupted partway through
tasks = [
    workrb.tasks.ESCOJob2SkillRanking(
        split="val",
        languages=["en"],
    )
]
results = workrb.evaluate(model, tasks, output_folder="results/my_model")
# Run 2: Automatically resumes from checkpoint
results = workrb.evaluate(model, tasks, output_folder="results/my_model")
# ✅ Skips completed tasks, continues from where it left off

Extending Benchmarks - Want to extend your results with additional tasks or languages? Add the new tasks or languages when resuming:
# Resume from previous & extend with new task and languages
tasks_extended = [
    workrb.tasks.ESCOJob2SkillRanking(  # Add de, fr
        split="val",
        languages=["en", "de", "fr"],
    ),
    workrb.tasks.ESCOSkillNormRanking(  # Add new task
        split="val",
        languages=["en"],
    ),
]
results = workrb.evaluate(model, tasks_extended, output_folder="results/my_model")
# ✅ Reuses English results, only evaluates new languages/tasks

⚠️ You cannot reduce scope when resuming; this is by design to avoid ambiguity. Finished tasks in the checkpoint should also be included in your WorkRB initialization. If you want to start fresh in the same output folder, use `force_restart=True`:
results = workrb.evaluate(model, tasks, output_folder="results/my_model", force_restart=True)

Results are automatically saved to your `output_folder`:
results/my_model/
├── checkpoint.json   # Incremental checkpoint (for resuming)
├── results.json      # Final results dump
└── config.yaml       # Final benchmark configuration dump
To load & parse results from a run:
results = workrb.load_results("results/my_model/results.json")
print(results)

Metrics: The main benchmark metrics `mean_benchmark/<metric>/mean` require 4 aggregation steps:
1. Macro-average languages per task (e.g. ESCOJob2SkillRanking): `mean_per_task/<task_name>/<metric>/mean`
2. Macro-average tasks per task group (e.g. Job2SkillRanking): `mean_per_task_group/<group>/<metric>/mean`
3. Macro-average task groups per task type (e.g. RankingTask, ClassificationTask): `mean_per_task_type/<type>/<metric>/mean`
4. Macro-average over task types: `mean_benchmark/<metric>/mean`
Per-language performance is also available: `mean_per_language/<lang>/<metric>/mean`.
Each aggregation provides 95% confidence intervals (replace `mean` with `ci_margin`).
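To make the nested macro-averaging concrete, here is a minimal, self-contained sketch. The toy scores and the normal-approximation CI formula (1.96 · std / √n) are illustrative assumptions, not WorkRB's internal implementation:

```python
# Illustrative sketch of nested macro-averaging (NOT WorkRB's internal code).
# Assumption: the 95% CI margin is approximated as 1.96 * std / sqrt(n) over
# the items averaged at each level; WorkRB's exact computation may differ.
from statistics import mean, stdev


def macro(values: list[float]) -> tuple[float, float]:
    """Return (macro-average, 95% CI margin) over a list of scores."""
    m = mean(values)
    ci = 1.96 * stdev(values) / len(values) ** 0.5 if len(values) > 1 else 0.0
    return m, ci


# Toy per-language scores for two tasks (metric values are made up)
per_language = {
    "ESCOJob2SkillRanking": {"en": 0.61, "de": 0.57, "fr": 0.58},
    "ESCOSkillNormRanking": {"en": 0.44, "de": 0.40},
}

# Step 1: macro-average languages per task -> mean_per_task/<task_name>/<metric>/mean
per_task = {name: macro(list(scores.values())) for name, scores in per_language.items()}

# Steps 2-4: macro-average the task-level means up to the benchmark level
benchmark_mean, benchmark_ci = macro([m for m, _ in per_task.values()])
print(f"toy benchmark mean = {benchmark_mean:.3f} ± {benchmark_ci:.3f}")
```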
# Benchmark returns a detailed Pydantic model
results: BenchmarkResults = workrb.evaluate(...)
# Calculate aggregated metrics
summary: dict[str, float] = results.get_summary_metrics()
# Show all results
print(summary)
print(results) # Equivalent: internally runs get_summary_metrics()
# Access metric via tag
lang_result = summary["mean_per_language/en/f1_macro/mean"]
lang_result_ci = summary["mean_per_language/en/f1_macro/ci_margin"]

| Task Name | Label Type | Dataset Size (English) | Languages |
|---|---|---|---|
| Ranking | |||
| Job to Skills | multi_label | 3039 queries x 13939 targets | 28 |
| Job Normalization | multi_class | 15463 queries x 2942 targets | 28 |
| Skill to Job | multi_label | 13492 queries x 3039 targets | 28 |
| Skill Extraction House | multi_label | 262 queries x 13891 targets | 28 |
| Skill Extraction Tech | multi_label | 338 queries x 13891 targets | 28 |
| Skill Similarity | multi_class | 900 queries x 2648 targets | 1 |
| ESCO Skill Normalization | multi_label | 72008 queries x 13939 targets | 28 |
| Classification | |||
| Job-Skill Classification | multi_label | 3039 samples, 13939 classes | 28 |
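The Languages column maps directly to each task's `languages` argument. A minimal sketch of a multilingual run, using only the two task classes shown earlier (the language codes and output folder are illustrative):

```python
import workrb

# Evaluate tasks in several of their supported languages.
# The language codes below are examples; pick any codes a task supports.
tasks = [
    workrb.tasks.ESCOJob2SkillRanking(split="val", languages=["en", "de", "fr", "nl"]),
    workrb.tasks.ESCOSkillNormRanking(split="val", languages=["en", "es"]),
]

model = workrb.models.BiEncoderModel("all-MiniLM-L6-v2")
results = workrb.evaluate(model, tasks, output_folder="results/multilingual_run")

# Per-language scores are available under mean_per_language/<lang>/<metric>/mean
print(results)
```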
| Model Name | Description | Fixed Classifier |
|---|---|---|
| BiEncoderModel | BiEncoder model using sentence-transformers for ranking and classification tasks. | ❌ |
| JobBERTModel | Job-normalization BiEncoder from TechWolf: https://huggingface.co/TechWolf/JobBERT-v2 | ❌ |
| RndESCOClassificationModel | Random baseline for multi-label classification with a random prediction head for ESCO. | ✅ |
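To compare baselines side by side, you can loop over model instances and give each run its own output folder. A minimal sketch, assuming two BiEncoderModel configurations with different sentence-transformers checkpoints (constructor arguments for JobBERTModel and RndESCOClassificationModel are not shown above, so they are omitted here):

```python
import workrb

# One shared task list so all baselines are evaluated on identical data.
tasks = [workrb.tasks.ESCOJob2SkillRanking(split="val", languages=["en"])]

baselines = {
    "minilm": workrb.models.BiEncoderModel("all-MiniLM-L6-v2"),
    "mpnet": workrb.models.BiEncoderModel("all-mpnet-base-v2"),
}

for name, model in baselines.items():
    # Separate output folders keep each run's checkpoint and results independent.
    results = workrb.evaluate(model, tasks, output_folder=f"results/{name}")
    print(name, results)
```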
Want to contribute new tasks, models, or metrics? Read our CONTRIBUTING.md guide for all details.
# Clone repository
git clone https://github.com/techwolf-ai/workrb.git && cd workrb
# Create and install a virtual environment
uv sync --all-extras
# Activate the virtual environment
source .venv/bin/activate
# Install the pre-commit hooks
pre-commit install --install-hooks

Developing details
- This project follows the Conventional Commits standard to automate Semantic Versioning and Keep A Changelog with Commitizen.
- Run `poe` from within the development environment to print a list of Poe the Poet tasks available to run on this project.
- Run `uv add {package}` from within the development environment to install a runtime dependency and add it to `pyproject.toml` and `uv.lock`. Add `--dev` to install a development dependency.
- Run `uv sync --upgrade` from within the development environment to upgrade all dependencies to the latest versions allowed by `pyproject.toml`. Add `--only-dev` to upgrade the development dependencies only.
- Run `cz bump` to bump the package's version, update the `CHANGELOG.md`, and create a git tag. Then push the changes and the git tag with `git push origin main --tags`.
WorkRB builds upon the unifying WorkBench benchmark; please consider citing:
@misc{delange2025unifiedworkembeddings,
title={Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker},
author={Matthias De Lange and Jens-Joris Decorte and Jeroen Van Hautte},
year={2025},
eprint={2511.07969},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.07969},
}

WorkRB has a community paper coming up!
WIP

Apache 2.0 License - see LICENSE for details.
