Description:
- I tried running the benchmark and it takes a long time to run: 2 scenarios at 3 runs each took 1458.1 seconds (about 24 minutes). See the logs below.
warning: `VIRTUAL_ENV=/Users/bhavs/Desktop/work/research/code-is-all-you-need/.venv` does not match the project environment path `.venv` and will be ignored; use `--active` to target the active environment instead
2025-10-21 14:01:48,298 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Dataset has 160 examples in split validation
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Limiting to 2 scenarios from HuggingFace dataset
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-21 14:01:48,299 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 14 workers
2025-10-21 14:01:51,283 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Loading scenario 1: scenario_universe_28_4sn4lc-??-??
2025-10-21 14:01:51,302 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Loading scenario 2: scenario_universe_23_5xzkat-??-??
2025-10-21 14:01:51,316 - MainThread - INFO - are.simulation.benchmark.huggingface_loader - Reached limit of 2 scenarios
Loading scenarios from HuggingFace: 100%|██████████████████████| 2/2 [00:03<00:00, 1.51s/it]
2025-10-21 14:04:10,223 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:04:10,522 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:06:26,120 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:06:26,417 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:06:28,597 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Initializing turns with judge trigger condition
2025-10-21 14:06:28,597 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Validation mode online
2025-10-21 14:06:28,597 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Scenario has 1 turns
2025-10-21 14:06:28,601 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Running with Agent default
2025-10-21 14:06:28,612 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:06:28,612 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:06:28,613 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Setting agent max_turns to 1
2025-10-21 14:07:19,449 - MainThread - INFO - httpx - [Scenario = scenario_universe_28_4sn4lc, Run = 1] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:11:05,540 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Pictures/Personal/Travels/Costa_Rica_2022.jpg: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: f274abba-d7c7-4c37-a0f5-e1eea2653700)')
2025-10-21 14:11:05,540 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Pictures/Personal/Travels/Thailand_2020.jpg: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: 3d05f54d-8492-4a33-9b85-ac6a9d457052)')
2025-10-21 14:12:38,662 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:12:38,984 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:14:49,862 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:14:50,113 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:15:13,046 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Initializing turns with judge trigger condition
2025-10-21 14:15:13,046 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Validation mode online
2025-10-21 14:15:13,046 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_28_4sn4lc]: Scenario has 1 turns
2025-10-21 14:15:13,050 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Running with Agent default
2025-10-21 14:15:13,060 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:15:13,061 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:15:13,061 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 3] Setting agent max_turns to 1
2025-10-21 14:16:00,794 - MainThread - INFO - httpx - [Scenario = scenario_universe_28_4sn4lc, Run = 3] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:19:58,971 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Documents/news/daily_chronicle_news_1987_1995_758.txt: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: 9592b820-37c7-4be4-af8b-0574e30e7c70)')
2025-10-21 14:19:59,002 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Documents/wiki/wikipedia_63.txt: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/datasets/meta-agents-research-environments/gaia2_filesystem/paths-info/main (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2648)')))"), '(Request ID: 75e7fabe-4561-4e7d-9e8f-4f6cb209b6af)')
2025-10-21 14:21:00,153 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:21:00,525 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:23:08,595 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_28_4sn4lc failed with exception: Process terminated unexpectedly:
2025-10-21 14:23:08,602 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_28_4sn4lc failed with exception: Process terminated unexpectedly:
2025-10-21 14:23:08,610 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_28_4sn4lc failed with exception: Process terminated unexpectedly:
2025-10-21 14:23:08,618 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_23_5xzkat failed with exception: Process terminated unexpectedly:
2025-10-21 14:23:09,052 - MainThread - INFO - httpx - HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-10-21 14:23:09,319 - MainThread - WARNING - are.simulation.scenarios.scenario_imported_from_json.utils - Scenario duration overridden to 1800 instead of 1000.0 seconds
2025-10-21 14:23:16,384 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Initializing turns with judge trigger condition
2025-10-21 14:23:16,385 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Validation mode online
2025-10-21 14:23:16,385 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Scenario has 1 turns
2025-10-21 14:23:16,389 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Running with Agent default
2025-10-21 14:23:16,399 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:23:16,400 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:23:16,400 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Setting agent max_turns to 1
2025-10-21 14:23:43,144 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:23:50,381 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:23:54,989 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:23:58,368 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:01,534 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:04,392 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:07,927 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:13,422 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:16,741 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:19,835 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:22,907 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:25,835 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:30,840 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:34,968 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:45,206 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 2] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:24:45,211 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 2] Max iterations reached - Stopping Agent: 1
2025-10-21 14:24:45,212 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Agent Output None
2025-10-21 14:24:45,212 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Validating...
2025-10-21 14:24:45,213 - MainThread - INFO - are.simulation.validation.event_judge - [Scenario = scenario_universe_23_5xzkat, Run = 2] Comparing AgentUserInterface__send_message_to_user to AgentUserInterface__send_message_to_user
2025-10-21 14:24:45,213 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Validation ScenarioValidationResult(success=True, exception=None, export_path=None, rationale='None', duration=None) EnvState=EnvironmentState.RUNNING
2025-10-21 14:24:45,242 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] Trace exported to ./validation-sonnet-4-5/hf/scenario_universe_23_5xzkat_run_2_2fd833cc.json
2025-10-21 14:24:45,242 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 2] ✅ Result: ScenarioValidationResult(success=True, exception=None, export_path='./validation-sonnet-4-5/hf/scenario_universe_23_5xzkat_run_2_2fd833cc.json', rationale='None', duration=None)
2025-10-21 14:25:21,887 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Initializing turns with judge trigger condition
2025-10-21 14:25:21,888 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Validation mode online
2025-10-21 14:25:21,888 - MainThread - INFO - are.simulation.scenarios.scenario_imported_from_json.utils - [scenario_universe_23_5xzkat]: Scenario has 1 turns
2025-10-21 14:25:21,892 - MainThread - INFO - are.simulation.scenario_runner - [Scenario = scenario_universe_23_5xzkat, Run = 3] Running with Agent default
2025-10-21 14:25:21,902 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 3] Setting wait_for_user_response to False in AgentUserInterface
2025-10-21 14:25:21,903 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 3] Removing tools {'AgentUserInterface__get_last_message_from_user', 'AgentUserInterface__get_last_unread_messages', 'AgentUserInterface__get_last_message_from_agent', 'AgentUserInterface__get_all_messages'} from app_tools
2025-10-21 14:25:21,903 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_23_5xzkat, Run = 3] Setting agent max_turns to 1
2025-10-21 14:26:06,270 - MainThread - INFO - httpx - [Scenario = scenario_universe_23_5xzkat, Run = 3] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2025-10-21 14:26:06,410 - MainThread - ERROR - are.simulation.multi_scenario_runner - Scenario scenario_universe_23_5xzkat failed with exception: Process terminated unexpectedly:
Running Search scenarios: 100%|█████████████████| 6/6 [02:57<00:00, 29.65s/it, Success=16.7%]
2025-10-21 14:26:06,424 - MainThread - INFO - are.simulation.multi_scenario_runner - Exported benchmark result to ./validation-sonnet-4-5/output.jsonl
2025-10-21 14:26:06,424 - MainThread - INFO - are.simulation.benchmark.cli - Successfully completed config 'search'
2025-10-21 14:26:06,480 - MainThread - INFO - are.simulation.benchmark.cli -
=== GAIA2 Validation Report ===
Model: claude-sonnet-4-5
Provider: anthropic
=== Search ===
- Scenarios: 2 unique (6 total runs)
- Success rate: 16.7% ± 16.7% (STD: 28.9%)
- Pass@3: 1 scenarios (50.0%)
- Pass^3: 0 scenarios (0.0%)
- Average run duration: 88.9s (STD: 0.0s)
=== Global Summary ===
- Scenarios: 2 unique (6 total runs)
- Macro success rate: 16.7% ± 16.7% (STD: 28.9%)
- Micro success rate: 16.7% ± 16.7% (STD: 28.9%)
- Pass@3: 1 scenarios (50.0%)
- Pass^3: 0 scenarios (0.0%)
- Average run duration: 88.9s (STD: 0.0s)
- Job duration: 1458.1 seconds
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - JSON stats report saved to: ./validation-sonnet-4-5/benchmark_stats.json
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - Benchmark run summary:
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - Total configs attempted: 1
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - Successful configs: 1
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - Failed configs: 0
2025-10-21 14:26:06,495 - MainThread - INFO - are.simulation.benchmark.cli - All Done.
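The summary figures in the report above can be cross-checked by hand. The sketch below is an assumption about how the statistics might be aggregated, not the benchmark's confirmed implementation: it computes the success rate of each of the 3 runs across both scenarios, then takes the mean, the sample standard deviation, and the standard error over those 3 values. The outcome table is read off the log (only run 2 of scenario_universe_23_5xzkat succeeded), and the function and key names are hypothetical.

```python
import math
import statistics

# Outcomes read off the log: scenario -> success flag for runs 1, 2, 3.
outcomes = {
    "scenario_universe_28_4sn4lc": [False, False, False],
    "scenario_universe_23_5xzkat": [False, True, False],
}


def summarize(outcomes):
    """Aggregate per-run success rates (assumed aggregation, see lead-in)."""
    # Transpose so each tuple holds the results of one run index
    # across all scenarios.
    runs = list(zip(*outcomes.values()))
    per_run_rate = [sum(run) / len(run) for run in runs]

    mean = statistics.mean(per_run_rate)
    std = statistics.stdev(per_run_rate)      # sample standard deviation
    sem = std / math.sqrt(len(per_run_rate))  # standard error of the mean

    return {
        "mean": mean,
        "std": std,
        "sem": sem,
        # Pass@k: scenarios with at least one successful run.
        "pass_at_k": sum(any(flags) for flags in outcomes.values()),
        # Pass^k: scenarios where every run succeeded.
        "pass_all_k": sum(all(flags) for flags in outcomes.values()),
    }


if __name__ == "__main__":
    s = summarize(outcomes)
    print(f"Success rate: {s['mean']:.1%} ± {s['sem']:.1%} (STD: {s['std']:.1%})")
    print(f"Pass@3: {s['pass_at_k']} scenarios, Pass^3: {s['pass_all_k']} scenarios")
```

Under that assumption the per-run rates are 0.0, 0.5 and 0.0, which reproduces the reported 16.7% ± 16.7% (STD: 28.9%), Pass@3 of 1 and Pass^3 of 0. In other words, the 16.7% success rate reflects five of the six runs ending in `Process terminated unexpectedly`.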
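On top of that, I see SSL errors (`[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC]`) whenever the file system tries to lazy load stats from huggingface.co, and five of the six runs failed with `Process terminated unexpectedly`. Is that something we can resolve?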
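To help narrow down both problems, the log can be analysed offline. The sketch below is a hypothetical helper, not part of the benchmark tooling: it parses the timestamp prefix of each log line, lists the largest gaps between consecutive timestamped lines, and counts the lines that mention `SSLError`. The sample is four consecutive lines from the log above, with the long exception text abridged.

```python
import re
from datetime import datetime

# Matches the "YYYY-MM-DD HH:MM:SS,mmm - " prefix used by the log above.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - ")


def parse_ts(line):
    """Return the line's timestamp, or None for lines without one
    (progress bars, report lines)."""
    m = TS.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")


def largest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, line) pairs, largest gap first.
    Each gap is the time elapsed before `line` was logged."""
    stamped = [(parse_ts(line), line) for line in lines]
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    gaps = [
        ((t1 - t0).total_seconds(), line)
        for (t0, _), (t1, line) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


def count_ssl_errors(lines):
    """Count log lines that mention an SSLError."""
    return sum("SSLError" in line for line in lines)


# Four consecutive lines from the log; exception payloads abridged.
SAMPLE = [
    "2025-10-21 14:06:28,613 - MainThread - WARNING - are.simulation.agents.default_agent.are_simulation_main - [Scenario = scenario_universe_28_4sn4lc, Run = 1] Setting agent max_turns to 1",
    '2025-10-21 14:07:19,449 - MainThread - INFO - httpx - [Scenario = scenario_universe_28_4sn4lc, Run = 1] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"',
    "2025-10-21 14:11:05,540 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Pictures/Personal/Travels/Costa_Rica_2022.jpg: (MaxRetryError(... SSLError ...))",
    "2025-10-21 14:11:05,540 - MainThread - WARNING - are.simulation.apps.utils.fallback_file_system - Failed to lazy load stats for /Pictures/Personal/Travels/Thailand_2020.jpg: (MaxRetryError(... SSLError ...))",
]

if __name__ == "__main__":
    for gap, line in largest_gaps(SAMPLE):
        print(f"{gap:8.1f}s before: {line[:100]}")
    print("SSL error lines:", count_ssl_errors(SAMPLE))
```

On these four lines the largest gap is about 226 seconds and ends at the first SSL warning, which suggests the time is spent retrying the huggingface.co request before it gives up. Running the same functions over the full log file would show whether the other multi-minute gaps follow the same pattern.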