Runs too slow #18

@bhavishya-pohani

Description

  • I tried running the benchmark and it takes a long time: the job took 1458.1 seconds for 2 scenarios run 3 times each. See the logs below.
warning: `VIRTUAL_ENV=/Users/bhavs/Desktop/work/research/code-is-all-you-need/.venv` does not match the project environment path `.venv` and will be ignored; use `--active` to target the active environment instead
2025-10-21 14:01:48,298 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Dataset has 160 examples in split validation
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Limiting to 2 scenarios from HuggingFace dataset
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 14 workers
2025-10-21 14:01:51,283 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Loading scenario 1: scenario_universe_28_4sn4lc-??-??
2025-10-21 14:01:51,302 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Loading scenario 2: scenario_universe_23_5xzkat-??-??
2025-10-21 14:01:51,316 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Reached limit of 2 scenarios
Loading scenarios from HuggingFace: 100%|██████████████████████| 2/2 [00:03<00:00,  1.51s/it]
2025-10-21 14:04:10,223 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:04:10,522 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds



2025-10-21 14:06:26,120 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:06:26,417 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:06:28,597 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Initializing turns with judge trigger condition
2025-10-21 14:06:28,597 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Validation mode online
2025-10-21 14:06:28,597 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Scenario has 1 turns
2025-10-21 14:06:28,601 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Running with Agent default
2025-10-21 14:06:28,612 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:06:28,612 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:06:28,613 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Setting agent max_turns to 1
2025-10-21 14:07:19,449 - MainThread - INFO - httpx - [Scenario = scenario_universe_28_4sn4lc, Run = 1] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"



2025-10-21 14:11:05,540 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Pictures/Personal/Travels/Costa_Rica_2022.jpg: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: f274abba-d7c7-4c37-a0f5-e1eea2653700)')
2025-10-21 14:11:05,540 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Pictures/Personal/Travels/Thailand_2020.jpg: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: 3d05f54d-8492-4a33-9b85-ac6a9d457052)')
2025-10-21 14:12:38,662 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:12:38,984 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:14:49,862 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:14:50,113 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:15:13,046 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Initializing turns with judge trigger condition
2025-10-21 14:15:13,046 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Validation mode online
2025-10-21 14:15:13,046 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Scenario has 1 turns
2025-10-21 14:15:13,050 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Running with Agent default
2025-10-21 14:15:13,060 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:15:13,061 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:15:13,061 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Setting agent max_turns to 1
2025-10-21 14:16:00,794 - MainThread - INFO - httpx - [Scenario = scenario_universe_28_4sn4lc, Run = 3] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
  

2025-10-21 14:19:58,971 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Documents/news/daily_chronicle_news_1987_1995_758.txt: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: 9592b820-37c7-4be4-af8b-0574e30e7c70)')
2025-10-21 14:19:59,002 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Documents/wiki/wikipedia_63.txt: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: 75e7fabe-4561-4e7d-9e8f-4f6cb209b6af)')
2025-10-21 14:21:00,153 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:21:00,525 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds



2025-10-21 14:23:08,595 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_28_4sn4lc failed with exception: Process terminated unexpectedly:        
2025-10-21 14:23:08,602 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_28_4sn4lc failed with exception: Process terminated unexpectedly:        
2025-10-21 14:23:08,610 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_28_4sn4lc failed with exception: Process terminated unexpectedly:        
2025-10-21 14:23:08,618 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_23_5xzkat failed with exception: Process terminated unexpectedly:        
2025-10-21 14:23:09,052 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:23:09,319 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:23:16,384 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Initializing turns with judge trigger condition
2025-10-21 14:23:16,385 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Validation mode online
2025-10-21 14:23:16,385 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Scenario has 1 turns
2025-10-21 14:23:16,389 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Running with Agent default
2025-10-21 14:23:16,399 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:23:16,400 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:23:16,400 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Setting agent max_turns to 1
2025-10-21 14:23:43,144 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:23:50,381 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:23:54,989 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:23:58,368 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:01,534 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:04,392 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:07,927 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:13,422 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:16,741 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:19,835 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:22,907 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:25,835 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:30,840 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:34,968 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:45,206 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:45,211 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Max iterations reached - Stopping Agent: 1
2025-10-21 14:24:45,212 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Agent Output None
2025-10-21 14:24:45,212 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Validating...
2025-10-21 14:24:45,213 - MainThread - INFO - are.simulation.validation.event_judge - [Scenario = scenario_universe_23_5xzkat, Run = 2] Comparing AgentUserInterface__send_message_to_user to AgentUserInterface__send_message_to_user
2025-10-21 14:24:45,213 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Validation ScenarioValidationResult(success=True, exception=None, export_path=None, rationale='None', duration=None) EnvState=EnvironmentState.RUNNING
2025-10-21 14:24:45,242 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Trace exported to ./validation-sonnet-4-5/hf/scenario_universe_23_5xzkat_run_2_2fd833cc.json
2025-10-21 14:24:45,242 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] ✅ Result: ScenarioValidationResult(success=True, exception=None, export_path='./validation-sonnet-4-5/hf/scenario_universe_23_5xzkat_run_2_2fd833cc.json', rationale='None', duration=None)
2025-10-21 14:25:21,887 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Initializing turns with judge trigger condition
2025-10-21 14:25:21,888 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Validation mode online
2025-10-21 14:25:21,888 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Scenario has 1 turns
2025-10-21 14:25:21,892 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 3] Running with Agent default
2025-10-21 14:25:21,902 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 3] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:25:21,903 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 3] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:25:21,903 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 3] Setting agent max_turns to 1
2025-10-21 14:26:06,270 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 3] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:26:06,410 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_23_5xzkat failed with exception: Process terminated unexpectedly:        
Running Search scenarios: 100%|█████████████████| 6/6 [02:57<00:00, 29.65s/it, Success=16.7%]
2025-10-21 14:26:06,424 - MainThread - INFO - are.simulation.multi_scenario_runner - Exported benchmark result to ./validation-sonnet-4-5/output.jsonl
2025-10-21 14:26:06,424 - MainThread - INFO - are.simulation.benchmark.cli - Successfully completed config 'search'
2025-10-21 14:26:06,480 - MainThread - INFO - are.simulation.benchmark.cli - 

=== GAIA2 Validation Report ===
Model: claude-sonnet-4-5
Provider: anthropic


=== Search ===
  - Scenarios: 2 unique (6 total runs)
  - Success rate: 16.7% ± 16.7% (STD: 28.9%)
  - Pass@3: 1 scenarios (50.0%)
  - Pass^3: 0 scenarios (0.0%)
  - Average run duration: 88.9s (STD: 0.0s)

=== Global Summary ===
  - Scenarios: 2 unique (6 total runs)
  - Macro success rate: 16.7% ± 16.7% (STD: 28.9%)
  - Micro success rate: 16.7% ± 16.7% (STD: 28.9%)
  - Pass@3: 1 scenarios (50.0%)
  - Pass^3: 0 scenarios (0.0%)
  - Average run duration: 88.9s (STD: 0.0s)
  - Job duration: 1458.1 seconds

2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - JSON stats report saved to: ./validation-sonnet-4-5/benchmark_stats.json
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - Benchmark run summary:
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli -   Total configs attempted: 1
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli -   Successful configs: 1
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli -   Failed configs: 0
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - All Done.

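The summary figures in the report can be cross-checked against the per-run outcomes visible in the log. The sketch below is only an inference from the numbers: it assumes the report averages the success rate per run index, takes the sample standard deviation across the 3 runs, and reports the standard error after the ± sign. That aggregation is an assumption, not something confirmed from the benchmark's source code. The outcome lists are read off the log (only scenario_universe_23_5xzkat, Run 2 succeeded).

```python
from math import sqrt
from statistics import mean, stdev

# Outcomes read off the log above (True = success), ordered run 1, run 2, run 3.
outcomes = {
    "scenario_universe_28_4sn4lc": [False, False, False],
    "scenario_universe_23_5xzkat": [False, True, False],
}
n_runs = 3

# Success rate for each run index, averaged over scenarios (assumed aggregation).
per_run = [mean(float(runs[i]) for runs in outcomes.values()) for i in range(n_runs)]

rate = mean(per_run)        # overall success rate
std = stdev(per_run)        # sample standard deviation across runs
sem = std / sqrt(n_runs)    # standard error of the mean

# Pass@3: at least one run succeeded; Pass^3: every run succeeded.
pass_at_3 = sum(any(runs) for runs in outcomes.values())
pass_hat_3 = sum(all(runs) for runs in outcomes.values())

print(f"Success rate: {rate:.1%} ± {sem:.1%} (STD: {std:.1%})")
print(f"Pass@3: {pass_at_3} scenarios, Pass^3: {pass_hat_3} scenarios")
```

Under these assumptions the values match the report: 1 success out of 6 runs, with 5 of the 6 runs ending in "Process terminated unexpectedly".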

On top of that, I see SSL errors (DECRYPTION_FAILED_OR_BAD_RECORD_MAC) on requests to huggingface.co. Is that something we can resolve?
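To narrow down where the time goes, one option is to scan the log for the largest pauses between consecutive timestamps. The following is a minimal sketch, not part of the benchmark tooling: `largest_gaps` is a hypothetical helper, and it assumes every record starts with a timestamp in the `YYYY-MM-DD HH:MM:SS,mmm` form shown above. Because the workers run in parallel and share one log, the gaps are wall-clock pauses only and do not attribute time to a specific scenario.

```python
import re
from datetime import datetime

# Timestamp as it appears in the log records, e.g. "2025-10-21 14:01:48,298".
TS = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def largest_gaps(lines, top=5):
    """Return the `top` longest pauses as (seconds, line_before, line_after)."""
    stamped = []
    for line in lines:
        m = TS.search(line)  # search() tolerates leading whitespace
        if m:
            t = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            t = t.replace(microsecond=int(m.group(2)) * 1000)
            stamped.append((t, line.strip()))
    gaps = [
        ((b[0] - a[0]).total_seconds(), a[1], b[1])
        for a, b in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Two consecutive records copied from the log above.
sample = [
    "2025-10-21 14:01:51,316 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Reached limit of 2 scenarios",
    '2025-10-21 14:04:10,223 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"',
]

for seconds, before, after in largest_gaps(sample, top=1):
    print(f"{seconds:.1f}s pause after: {before[:80]}")
```

On the log above, the first such pause is roughly 139 seconds between "Reached limit of 2 scenarios" and the first litellm price-table request, and similar multi-minute pauses recur before each "Scenario duration overridden" warning.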
