Transient API errors (529) pollute population with false 0% organisms #8

@Ray0907

Problem

When LLM API calls fail with transient errors (HTTP 529), the evaluator records them as
runtime_error failure cases with score 0%. These organisms are not actually bad -- they
just didn't get a chance to run. They accumulate in the population and reduce the
effectiveness of sample_parents() weighted sampling.

Reproduction

Running a problem with claude-haiku-4-5-20251001 and --num_iterations 3 produces failure cases like this one when the API returns HTTP 529:

CodingAgentFailureCase(
    failure_type='runtime_error',
    error_message="LLM call failed: Error code: 529 - {'type': 'error',
        'error': {'type': 'overloaded_error', 'message': 'Overloaded'}}"
)

Data

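| Iteration | Without retry (avg) | With retry (avg) | 0% organisms from API failure (without → with retry) |
|-----------|---------------------|------------------|-------------------------------------------------------|
| 0         | 60%                 | 60%              | 0 → 0                                                 |
| 1         | 84%                 | 92%              | 0 → 0                                                 |
| 2         | 70%                 | 96%              | 0 → 0                                                 |
| 3         | 48%                 | 96%              | 4 → 0                                                 |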

Current state

arc_agi.py already handles this with tenacity retry (4-10 attempts with exponential
backoff). The other problems (parrot, multiplication_verifier, circle_packing) have
no retry and are vulnerable to this issue.

In my testing, adding tenacity retry (3 attempts, exponential backoff, max 30s wait)
eliminated all false 0% organisms caused by API failures. Would this be the right
approach?
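
For discussion, here is a minimal, dependency-free sketch of the retry policy described above (3 attempts, exponential backoff, waits capped at 30 s). It is not the code from arc_agi.py and does not use tenacity itself. `TransientAPIError` and `call_with_retry` are hypothetical names, and it assumes the client raises an exception that exposes the HTTP status code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical stand-in for the client's HTTP error (e.g. 529 overloaded)."""

    def __init__(self, status_code: int, message: str = "Overloaded"):
        super().__init__(f"Error code: {status_code} - {message}")
        self.status_code = status_code


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_wait: float = 1.0,
    max_wait: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only on HTTP 529 with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientAPIError as exc:
            # Re-raise non-529 errors, or 529 once the attempt budget is spent,
            # so only genuine failures reach the evaluator.
            if exc.status_code != 529 or attempt == max_attempts:
                raise
            # Waits grow as 1s, 2s, 4s, ... and never exceed max_wait.
            sleep(min(max_wait, base_wait * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

With this shape, a call that still fails after the last attempt propagates its exception, so the evaluator can decide whether to record a failure case or skip the organism. The `sleep` parameter exists only so the backoff can be stubbed out in tests.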
