
Commit 219d6f5

jmatejcz authored

chore: resolving conflicts (#690)

Co-authored-by: Julia Jia <[email protected]>
Co-authored-by: Magdalena Kotynia <[email protected]>
Co-authored-by: Maciej Majek <[email protected]>
Co-authored-by: Pawel Kotowski <[email protected]>
Co-authored-by: Brian Tuan <[email protected]>

1 parent ab73ba7, commit 219d6f5


52 files changed: +1448, -882 lines

docs/simulation_and_benchmarking/rai_bench.md

Lines changed: 18 additions & 4 deletions
@@ -6,6 +6,7 @@ RAI Bench is a comprehensive package that both provides benchmarks with ready-to

 - [Manipulation O3DE Benchmark](#manipulation-o3de-benchmark)
 - [Tool Calling Agent Benchmark](#tool-calling-agent-benchmark)
+- [VLM Benchmark](#vlm-benchmark)

 ## Manipulation O3DE Benchmark

@@ -94,9 +95,9 @@ Evaluates agent performance independently from any simulation, based only on too
 The `SubTask` class is used to validate just one tool call. Following classes are available:

 - `CheckArgsToolCallSubTask` - verify if a certain tool was called with expected arguments
-- `CheckTopicFieldsToolCallSubTask` - verify if a message published to ROS 2topic was of proper type and included expected fields
-- `CheckServiceFieldsToolCallSubTask` - verify if a message published to ROS 2service was of proper type and included expected fields
-- `CheckActionFieldsToolCallSubTask` - verify if a message published to ROS 2action was of proper type and included expected fields
+- `CheckTopicFieldsToolCallSubTask` - verify if a message published to ROS2 topic was of proper type and included expected fields
+- `CheckServiceFieldsToolCallSubTask` - verify if a message published to ROS2 service was of proper type and included expected fields
+- `CheckActionFieldsToolCallSubTask` - verify if a message published to ROS2 action was of proper type and included expected fields

 ### Validator

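For orientation, here is a minimal sketch of how a subtask and a validator might be wired together. It is not part of this commit: `OrderedCallsValidator`, the tool name, and the keyword arguments are assumptions, and the import paths follow the module layout referenced later in this diff.

```python
# Illustrative sketch only. Everything except the CheckArgsToolCallSubTask class name
# is an assumption and may differ from the actual rai_bench API.
from rai_bench.tool_calling_agent.tasks.subtasks import CheckArgsToolCallSubTask
from rai_bench.tool_calling_agent.validators import OrderedCallsValidator  # hypothetical class name

# Pass only if the agent calls a (hypothetical) `get_ros2_image` tool with this topic.
image_call = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_image",                 # assumed keyword
    expected_args={"topic": "/camera/color/image_raw"},  # assumed keyword
)

# A validator groups one or more subtasks into a single pass/fail check for a task.
validator = OrderedCallsValidator(subtasks=[image_call])
```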
@@ -129,7 +130,6 @@ The ToolCallingAgentBenchmark class manages the execution of tasks and collects
 There are predefined Tasks available which are grouped by categories:

 - Basic - require retrieving info from certain topics
-- Spatial reasoning - questions about surroundings with images attached
 - Manipulation
 - Custom Interfaces - requires using messages with custom interfaces

@@ -164,3 +164,17 @@ class TaskArgs(BaseModel):
 - `GetROS2RGBCameraTask` has 1 required tool call and 1 optional. When `extra_tool_calls` set to 5, agent can correct himself couple times and still pass even with 7 tool calls. There can be 2 types of invalid tool calls, first when the tool is used incorrectly and agent receives an error - this allows him to correct himself easier. Second type is when tool is called properly but it is not the tool that should be called or it is called with wrong params. In this case agent won't get any error so it will be harder for him to correct, but BOTH of these cases are counted as `extra tool call`.

 If you want to know details about every task, visit `rai_bench/tool_calling_agent/tasks`
+
+## VLM Benchmark
+
+The VLM Benchmark is a benchmark for VLM models. It includes a set of tasks containing questions related to images and evaluates the performance of the agent that returns the answer in the structured format.
+
+### Running
+
+To run the benchmark:
+
+```bash
+cd rai
+source setup_shell.sh
+python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b --vendor ollama
+```

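The `extra_tool_calls` behaviour described above is easiest to see in a configuration object. A minimal sketch follows; it is illustrative only, reuses the `ToolCallingAgentBenchmarkConfig` fields visible in the tutorial diff below, and omits any other fields the class may require.

```python
# Illustrative sketch only; values mirror the tutorial diff below.
from rai_bench import ToolCallingAgentBenchmarkConfig

tool_conf = ToolCallingAgentBenchmarkConfig(
    extra_tool_calls=[0, 5],                    # 0 = strict run, 5 = up to 5 corrective calls still pass
    task_types=["basic", "custom_interfaces"],  # task categories to include
    N_shots=[0, 2],                             # examples placed in the system prompt
)
```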
docs/tutorials/benchmarking.md

Lines changed: 1 addition & 2 deletions
@@ -73,7 +73,6 @@ if __name__ == "__main__":
         extra_tool_calls=[0, 5],  # how many extra tool calls allowed to still pass
         task_types=[  # what types of tasks to include
             "basic",
-            "spatial_reasoning",
             "custom_interfaces",
         ],
         N_shots=[0, 2],  # examples in system prompt
@@ -95,7 +94,7 @@ if __name__ == "__main__":
     )
 ```

-Based on the example above the `Tool Calling` benchmark will run basic, spatial_reasoning and custom_interfaces tasks with every configuration of [extra_tool_calls x N_shots x prompt_detail] provided which will result in almost 500 tasks. Manipulation benchmark will run all specified task level once as there is no additional params. Reapeat is set to 1 in both configs so there will be no additional runs.
+Based on the example above the `Tool Calling` benchmark will run basic and custom_interfaces tasks with every configuration of [extra_tool_calls x N_shots x prompt_detail] provided which will result in almost 500 tasks. Manipulation benchmark will run all specified task level once as there is no additional params. Reapeat is set to 1 in both configs so there will be no additional runs.

 !!! note

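As a rough illustration of how that configuration matrix multiplies out (the actual task count and the `prompt_detail` values are not shown in this diff, so the numbers below are assumptions):

```python
# Illustrative arithmetic only; counts other than extra_tool_calls and N_shots are assumptions.
from itertools import product

extra_tool_calls = [0, 5]
n_shots = [0, 2]
prompt_detail = ["brief", "descriptive"]  # hypothetical values

configs = list(product(extra_tool_calls, n_shots, prompt_detail))
print(len(configs))  # 8 configurations; each selected task runs once per configuration when repeats is 1
```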
src/rai_bench/rai_bench/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -14,14 +14,14 @@
 from .test_models import (
     ManipulationO3DEBenchmarkConfig,
     ToolCallingAgentBenchmarkConfig,
-    test_dual_agents,
     test_models,
 )
 from .utils import (
     define_benchmark_logger,
     get_llm_for_benchmark,
     parse_manipulation_o3de_benchmark_args,
     parse_tool_calling_benchmark_args,
+    parse_vlm_benchmark_args,
 )

 __all__ = [
@@ -31,6 +31,6 @@
     "get_llm_for_benchmark",
     "parse_manipulation_o3de_benchmark_args",
     "parse_tool_calling_benchmark_args",
-    "test_dual_agents",
+    "parse_vlm_benchmark_args",
     "test_models",
 ]

src/rai_bench/rai_bench/agents.py

Lines changed: 0 additions & 123 deletions
This file was deleted.

src/rai_bench/rai_bench/docs/tool_calling_agent_benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -14,4 +14,4 @@ Implementations can be found:

 - Validators [Validators](../tool_calling_agent/validators.py)
 - Subtasks [Validators](../tool_calling_agent/tasks/subtasks.py)
-- Tasks, including basic, spatial, custom interfaces and manipulation [Tasks](../tool_calling_agent/tasks/)
+- Tasks, including basic, custom interfaces and manipulation [Tasks](../tool_calling_agent/tasks/)

src/rai_bench/rai_bench/examples/benchmarking_models.py

Lines changed: 3 additions & 8 deletions
@@ -20,7 +20,7 @@

 if __name__ == "__main__":
     # Define models you want to benchmark
-    model_names = ["qwen3:4b", "llama3.2:3b"]
+    model_names = ["qwen2.5:3b", "llama3.2:3b"]
     vendors = ["ollama", "ollama"]

     # Define benchmarks that will be used
@@ -36,7 +36,7 @@
         extra_tool_calls=[0, 5],  # how many extra tool calls allowed to still pass
         task_types=[  # what types of tasks to include
             "basic",
-            "spatial_reasoning",
+            "manipulation",
             "custom_interfaces",
         ],
         N_shots=[0, 2],  # examples in system prompt
@@ -48,11 +48,6 @@
     test_models(
         model_names=model_names,
         vendors=vendors,
-        benchmark_configs=[mani_conf, tool_conf],
+        benchmark_configs=[tool_conf, mani_conf],
         out_dir=out_dir,
-        # if you want to pass any additinal args to model
-        additional_model_args=[
-            {"reasoning": False},
-            {},
-        ],
     )

src/rai_bench/rai_bench/examples/dual_agent.py

Lines changed: 0 additions & 53 deletions
This file was deleted.
src/rai_bench/rai_bench/examples/vlm_benchmark.py (new file)

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# Copyright (C) 2025 Robotec.AI
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from pathlib import Path
+
+from rai_bench import (
+    define_benchmark_logger,
+    parse_vlm_benchmark_args,
+)
+from rai_bench.utils import get_llm_for_benchmark
+from rai_bench.vlm_benchmark import get_spatial_tasks, run_benchmark
+
+if __name__ == "__main__":
+    args = parse_vlm_benchmark_args()
+    experiment_dir = Path(args.out_dir)
+    experiment_dir.mkdir(parents=True, exist_ok=True)
+    bench_logger = define_benchmark_logger(out_dir=experiment_dir)
+    try:
+        tasks = get_spatial_tasks()
+        for task in tasks:
+            task.set_logger(bench_logger)
+
+        llm = get_llm_for_benchmark(
+            model_name=args.model_name,
+            vendor=args.vendor,
+        )
+        run_benchmark(
+            llm=llm,
+            out_dir=experiment_dir,
+            tasks=tasks,
+            bench_logger=bench_logger,
+        )
+    except Exception as e:
+        bench_logger.critical(
+            msg=f"Benchmark failed with error: {e}",
+            exc_info=True,
+        )

src/rai_bench/rai_bench/manipulation_o3de/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from .benchmark import run_benchmark, run_benchmark_dual_agent
+from .benchmark import run_benchmark
 from .predefined.scenarios import get_scenarios

-__all__ = ["get_scenarios", "run_benchmark", "run_benchmark_dual_agent"]
+__all__ = ["get_scenarios", "run_benchmark"]
