
score_async(action="append") fails with LookupError when the original eval's metrics come from an external package, even though only the new scorer's metrics need to be computed. #3238

@darkness8i8

Description

The problem is in score.py (line ~269):

log_metrics = metrics_from_log_header(log)  # tries to recreate ALL old metrics

This unconditionally recreates every metric from the log header, including the original ones. When action="append", only the new scorer's metrics are needed, not the old ones. If any old metric can't be found in the registry, the entire re-score fails - even though those old metrics have already been computed and their results are in the log.
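The failure mode can be shown with a toy sketch (this is illustrative only, not inspect_ai code): a registry lookup that raises LookupError for an unknown qualified name, and a header-metrics function that recreates all metrics at once, so one unresolvable name sinks the whole call.

```python
# Toy registry standing in for inspect_ai's registry; names are hypothetical.
REGISTRY = {"accuracy": lambda scores: sum(scores) / len(scores)}

def create_metric(name):
    """Mimics registry_create(): raises LookupError for unknown names."""
    if name not in REGISTRY:
        raise LookupError(f"{name} was not found in the registry")
    return REGISTRY[name]

def metrics_from_header(names):
    # Recreating ALL header metrics fails if any single one is
    # unresolvable, even though its computed value already exists
    # in the stored log results.
    return [create_metric(name) for name in names]

try:
    metrics_from_header(["accuracy", "inspect_evals/overall_mean"])
except LookupError as err:
    print(err)  # inspect_evals/overall_mean was not found in the registry
```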

Bug: score_async(action='append') fails if original metrics aren't in registry.

When appending new scores to an eval log, score_async attempts to recreate
ALL metrics from the log header via registry_create(). If the original eval
used metrics from an external package (e.g. inspect_evals), and that package
isn't importable in the current environment in the same way (e.g. loaded via
sys.path instead of pip install), the qualified metric names
(e.g. 'inspect_evals/overall_mean') can't be resolved, and the entire
re-scoring operation fails with LookupError — even though those metrics
have already been computed and their results exist in the log.

Expected: action='append' should only require metrics for the NEW scorers.
Actual: It requires ALL metrics (old + new) to be recreatable.

Affected code: inspect_ai/_eval/score.py, line ~269, metrics_from_log_header()

import anyio
from inspect_ai._eval.score import score_async
from inspect_ai._util.platform import platform_init
from inspect_ai.log._recorders import create_recorder_for_location
from inspect_ai.scorer import model_graded_qa

async def repro():
    platform_init()
    # Any .eval file scored with metrics from an external package
    recorder = create_recorder_for_location("some_eval.eval", ".")
    log = await recorder.read_log("some_eval.eval")

    # Append a new scorer — fails even though model_graded_qa has its own metrics
    scored = await score_async(
        log=log,
        scorers=[model_graded_qa()],
        action="append",
    )
    # LookupError: inspect_evals/overall_mean was not found in the registry

anyio.run(repro)
Suggested fix (in score_async): wrap metrics_from_log_header() in a try/except when action="append", since the old metrics are already computed and stored in log.results — they don't need to be recreated.
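A minimal sketch of that approach, again with a toy registry and hypothetical helper names rather than the actual score.py code: recreate each header metric individually, and when action="append", skip any metric that can't be resolved instead of failing the whole re-score.

```python
# Hedged sketch of the suggested fix; REGISTRY and create_metric() stand in
# for inspect_ai's registry machinery and are not the real API.
REGISTRY = {"accuracy": lambda scores: sum(scores) / len(scores)}

def create_metric(name):
    if name not in REGISTRY:
        raise LookupError(f"{name} was not found in the registry")
    return REGISTRY[name]

def metrics_from_header(names, *, action="overwrite"):
    metrics = {}
    for name in names:
        try:
            metrics[name] = create_metric(name)
        except LookupError:
            if action == "append":
                # Old metric was already computed; its value is stored in
                # the log results, so it need not be recreated here.
                continue
            raise
    return metrics

# action="append" tolerates the unresolvable external metric:
appended = metrics_from_header(
    ["accuracy", "inspect_evals/overall_mean"], action="append"
)
```

This keeps the current strict behavior for a full re-score (action="overwrite" still raises) while letting append-only scoring proceed with just the resolvable metrics.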
