Description
Problem
When LLM API calls fail with transient errors (HTTP 529), the evaluator records them as
runtime_error failure cases with score 0%. These organisms are not actually bad -- they
just didn't get a chance to run. They accumulate in the population and reduce the
effectiveness of sample_parents() weighted sampling.
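To illustrate the dilution effect, here is a minimal sketch of score-weighted parent sampling. This is assumed behavior, not the repository's actual `sample_parents()` implementation: the organism tuples, scores, and epsilon weighting are all hypothetical, chosen only to show that false 0% organisms occupy population slots while contributing essentially nothing to sampling.

```python
import random

# Hypothetical population of (id, score) pairs. Organisms "c" and "d"
# scored 0% only because their LLM calls failed with HTTP 529.
population = [("a", 0.9), ("b", 0.8), ("c", 0.0), ("d", 0.0)]

def sample_parents(pop, k=2):
    """Illustrative score-weighted sampling (assumption, not the repo's code)."""
    ids, scores = zip(*pop)
    # Small epsilon keeps zero-score organisms technically samplable,
    # but in practice they are almost never selected -- half the
    # population above is dead weight.
    weights = [s + 1e-6 for s in scores]
    return random.choices(ids, weights=weights, k=k)

random.seed(0)
print(sample_parents(population))
```

Under this model the false 0% organisms shrink the effective population: they consume slots that could hold genuinely evaluated organisms, without ever being chosen as parents.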
Reproduction
Run a problem with claude-haiku-4-5-20251001 and --num_iterations 3; some LLM calls fail with HTTP 529:
```python
CodingAgentFailureCase(
    failure_type='runtime_error',
    error_message="LLM call failed: Error code: 529 - {'type': 'error', "
                  "'error': {'type': 'overloaded_error', 'message': 'Overloaded'}}"
)
```
Data
| Iteration | Without Retry (avg) | With Retry (avg) | 0% organisms from API failures (without → with retry) |
|---|---|---|---|
| 0 | 60% | 60% | 0 → 0 |
| 1 | 84% | 92% | 0 → 0 |
| 2 | 70% | 96% | 0 → 0 |
| 3 | 48% | 96% | 4 → 0 |
Current state
arc_agi.py already handles this with tenacity retry (4-10 attempts with exponential
backoff). The other problems (parrot, multiplication_verifier, circle_packing) have
no retry and are vulnerable to this issue.
In my testing, adding tenacity retry (3 attempts, exponential backoff, max 30s wait)
eliminated all false 0% organisms caused by API failures. Would this be the right
approach?
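For concreteness, a minimal stdlib sketch of the proposed retry behavior (3 attempts, exponential backoff capped at 30s). In practice the shared problems would use tenacity as arc_agi.py already does; the decorator, exception class, and `flaky_llm_call` below are hypothetical stand-ins written only to show the backoff logic.

```python
import time

class TransientAPIError(Exception):
    """Stand-in for the SDK's 529/overloaded error type (hypothetical)."""

def retry_transient(max_attempts=3, base=1.0, max_wait=30.0):
    """Retry on TransientAPIError with capped exponential backoff,
    mirroring tenacity's stop_after_attempt + wait_exponential."""
    def deco(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TransientAPIError:
                    if attempt == max_attempts:
                        raise  # exhausted: surface as a real failure
                    # base * 2^(attempt-1), capped at max_wait
                    time.sleep(min(base * 2 ** (attempt - 1), max_wait))
        return wrapper
    return deco

@retry_transient(max_attempts=3, base=0.01)  # tiny base for the demo
def flaky_llm_call(_state={"calls": 0}):
    """Hypothetical call that fails twice with 529, then succeeds."""
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise TransientAPIError("Error code: 529 - Overloaded")
    return "ok"

print(flaky_llm_call())  # succeeds on the third attempt -> "ok"
```

The key design point is the final `raise`: once attempts are exhausted, the error still propagates, so genuine outages are recorded as failures rather than silently swallowed. Only transient overload blips are absorbed.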