Hi!
Thanks for your wonderful work on MLE-Bench! We found it very insightful for evaluating the machine learning engineering capabilities of agents 🙌
We’d like to briefly introduce our project, DSLighting.
About DSLighting
DSLighting is a data science agent harness — an LLM-driven autonomous execution engine that turns task descriptions and datasets into iterative workflows that include:
- Code generation
- Execution
- Evaluation
- Refinement
It is designed to make it easy to build, run, and evaluate data science agents in a reproducible and extensible way.
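To make the shape of that loop concrete, here is a purely illustrative Python sketch; every function below is a trivial stand-in for the corresponding stage and is not part of DSLighting's actual API or internals.

# Purely illustrative stubs, not DSLighting's real implementation
def generate_code(task_description):            # code generation
    return "print('baseline solution')"

def execute(code):                               # execution
    return {"stdout": "ok", "error": None}

def evaluate(run_log):                           # evaluation
    return 1.0 if run_log["error"] is None else 0.0

def refine(code, run_log, score):                # refinement
    return code + "  # refined based on feedback"

def iterative_workflow(task_description, max_iterations=3):
    code = generate_code(task_description)
    best_code, best_score = code, float("-inf")
    for _ in range(max_iterations):
        run_log = execute(code)
        score = evaluate(run_log)
        if score > best_score:
            best_code, best_score = code, score
        code = refine(code, run_log, score)
    return best_code, best_score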
Support for MLE-Bench
We’ve recently added support for running MLE-Bench within DSLighting. With just a few lines of code, users can easily run the benchmark:
from dotenv import load_dotenv
load_dotenv()  # load API keys and other settings from a local .env file

from dslighting.api import DSBenchmark
from dslighting.core import ConfigBuilder

# Build a run configuration: which agent workflow to use and which model drives it
config = ConfigBuilder().build_config(
    workflow="aide",
    model="gpt-4o",
)

# Point DSLighting at a prepared MLE-Bench data directory and run the benchmark
benchmark = DSBenchmark("mlebench", data_dir="/path/to/mlebench")
result = benchmark.run(config=config)

print(result.results_path)   # where the run's results are written
print(result.metadata_path)  # where the run's metadata is written
Why this might be useful
- Minimal setup to run MLE-Bench
- Unified interface across multiple benchmarks
- Supports iterative agent workflows
- Easy to configure for different models and workflows (see the sketch just below)
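On the last point, switching the model or workflow is intended to be a config-only change. A small sketch, continuing from the environment setup in the snippet above (the model name "gpt-4o-mini" is a placeholder here, not a claim about which models DSLighting ships with):

from dslighting.api import DSBenchmark
from dslighting.core import ConfigBuilder

# Same MLE-Bench run, different configuration; swap in whatever workflow/model your setup supports
config = ConfigBuilder().build_config(
    workflow="aide",
    model="gpt-4o-mini",  # placeholder model name
)
result = DSBenchmark("mlebench", data_dir="/path/to/mlebench").run(config=config)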
Other supported benchmarks
DSLighting currently also supports:
- DACode (EMNLP 2024)
- DABench (ICML 2024)
- MoSciBench (ICLR 2026)
- ScienceAgentBench (ICLR 2025)
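Because the same interface is used for every benchmark, running one of these should only require a different benchmark name and data directory. A sketch reusing the config and imports from the MLE-Bench snippet above (the identifier "dabench" and the path are placeholders; the exact names may differ):

# "dabench" and the path below are placeholders for illustration only
benchmark = DSBenchmark("dabench", data_dir="/path/to/dabench")
result = benchmark.run(config=config)
print(result.results_path)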
We hope this can help make MLE-Bench easier to run and extend in agent-based workflows.
Happy to hear your thoughts, and we’d love to explore potential collaboration!
Thanks again for your great work 🙌