
score_async(action="append") fails with LookupError when the original eval's metrics come from an external package, even though only the new scorer's metrics need to be computed. #3238

@darkness8i8

Description

The problem is in score.py (line ~269):

log_metrics = metrics_from_log_header(log)  # tries to recreate ALL old metrics

This unconditionally recreates every metric from the log header, including the original ones. When action="append", only the new scorer's metrics are needed, not the old ones. If any old metric can't be found in the registry, the entire re-score fails - even though those old metrics have already been computed and their results are in the log.
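The failure mode can be shown with a toy sketch (this is illustrative only, not inspect_ai code): a registry lookup that raises LookupError for an unknown qualified name, and a header-metrics function that recreates all metrics at once, so one unresolvable name sinks the whole call.

```python
# Toy registry standing in for inspect_ai's registry; names are hypothetical.
REGISTRY = {"accuracy": lambda scores: sum(scores) / len(scores)}

def create_metric(name):
    """Mimics registry_create(): raises LookupError for unknown names."""
    if name not in REGISTRY:
        raise LookupError(f"{name} was not found in the registry")
    return REGISTRY[name]

def metrics_from_header(names):
    # Recreating ALL header metrics fails if any single one is
    # unresolvable, even though its computed value already exists
    # in the stored log results.
    return [create_metric(name) for name in names]

try:
    metrics_from_header(["accuracy", "inspect_evals/overall_mean"])
except LookupError as err:
    print(err)  # inspect_evals/overall_mean was not found in the registry
```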

Bug: score_async(action='append') fails if original metrics aren't in registry.

When appending new scores to an eval log, score_async attempts to recreate
ALL metrics from the log header via registry_create(). If the original eval
used metrics from an external package (e.g. inspect_evals), and that package
isn't importable in the current environment in the same way (e.g. loaded via
sys.path instead of pip install), the qualified metric names
(e.g. 'inspect_evals/overall_mean') can't be resolved, and the entire
re-scoring operation fails with LookupError — even though those metrics
have already been computed and their results exist in the log.

Expected: action='append' should only require metrics for the NEW scorers.
Actual: It requires ALL metrics (old + new) to be recreatable.

Affected code: inspect_ai/_eval/score.py, line ~269, metrics_from_log_header()

import anyio
from inspect_ai._eval.score import score_async
from inspect_ai._util.platform import platform_init
from inspect_ai.log._recorders import create_recorder_for_location
from inspect_ai.scorer import model_graded_qa

async def repro():
    platform_init()
    # Any .eval file scored with metrics from an external package
    recorder = create_recorder_for_location("some_eval.eval", ".")
    log = await recorder.read_log("some_eval.eval")

    # Append a new scorer — fails even though model_graded_qa has its own metrics
    scored = await score_async(
        log=log,
        scorers=[model_graded_qa()],
        action="append",
    )
    # LookupError: inspect_evals/overall_mean was not found in the registry

anyio.run(repro)
Suggested fix (in score_async): wrap metrics_from_log_header() in a try/except when action="append", since the old metrics are already computed and stored in log.results — they don't need to be recreated.
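A minimal sketch of that approach, again with a toy registry and hypothetical helper names rather than the actual score.py code: recreate each header metric individually, and when action="append", skip any metric that can't be resolved instead of failing the whole re-score.

```python
# Hedged sketch of the suggested fix; REGISTRY and create_metric() stand in
# for inspect_ai's registry machinery and are not the real API.
REGISTRY = {"accuracy": lambda scores: sum(scores) / len(scores)}

def create_metric(name):
    if name not in REGISTRY:
        raise LookupError(f"{name} was not found in the registry")
    return REGISTRY[name]

def metrics_from_header(names, *, action="overwrite"):
    metrics = {}
    for name in names:
        try:
            metrics[name] = create_metric(name)
        except LookupError:
            if action == "append":
                # Old metric was already computed; its value is stored in
                # the log results, so it need not be recreated here.
                continue
            raise
    return metrics

# action="append" tolerates the unresolvable external metric:
appended = metrics_from_header(
    ["accuracy", "inspect_evals/overall_mean"], action="append"
)
```

This keeps the current strict behavior for a full re-score (action="overwrite" still raises) while letting append-only scoring proceed with just the resolvable metrics.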
