Description
The problem is in score.py:269:
log_metrics = metrics_from_log_header(log) # tries to recreate ALL old metrics
This unconditionally recreates every metric named in the log header, including those from the original scorers. When action="append", only the new scorer's metrics are needed, not the old ones. If any old metric can't be found in the registry, the entire re-score fails, even though those old metrics have already been computed and their results are in the log.
Bug: score_async(action='append') fails if original metrics aren't in registry.
When appending new scores to an eval log, score_async attempts to recreate
ALL metrics from the log header via registry_create(). If the original eval
used metrics from an external package (e.g. inspect_evals), and that package
isn't importable in the current environment in the same way (e.g. loaded via
sys.path instead of pip install), the qualified metric names
(e.g. 'inspect_evals/overall_mean') can't be resolved, and the entire
re-scoring operation fails with LookupError — even though those metrics
have already been computed and their results exist in the log.
Expected: action='append' should only require metrics for the NEW scorers.
Actual: It requires ALL metrics (old + new) to be recreatable.
Affected code: inspect_ai/_eval/score.py, line ~269, metrics_from_log_header()
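The failure mode can be illustrated without inspect_ai. The sketch below is a toy model, not the library's code: `REGISTRY`, `create_metric`, and `recreate_all` are hypothetical stand-ins for the registry, registry_create(), and metrics_from_log_header(). It shows how a single unresolvable name aborts the whole recreation step.

```python
# Toy model of the failure; every name here is a hypothetical stand-in,
# not an inspect_ai API.
REGISTRY = {"accuracy": lambda scores: sum(scores) / len(scores)}


def create_metric(name):
    """Stand-in for a registry lookup that raises on unknown names."""
    try:
        return REGISTRY[name]
    except KeyError:
        raise LookupError(f"{name} was not found in the registry") from None


def recreate_all(metric_names):
    """Stand-in for recreating every metric named in a log header."""
    return [create_metric(name) for name in metric_names]


# Metrics recorded by the original eval: one resolvable, one not.
header_metrics = ["accuracy", "inspect_evals/overall_mean"]

try:
    recreate_all(header_metrics)
    failure = None
except LookupError as err:
    # One missing name is enough to lose the resolvable metrics too.
    failure = str(err)

print(failure)
```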
import anyio

from inspect_ai._eval.score import score_async
from inspect_ai._util.platform import platform_init
from inspect_ai.log._recorders import create_recorder_for_location
from inspect_ai.scorer import model_graded_qa


async def repro():
    platform_init()

    # Any .eval file scored with metrics from an external package
    recorder = create_recorder_for_location("some_eval.eval", ".")
    log = await recorder.read_log("some_eval.eval")

    # Append a new scorer; this fails even though model_graded_qa has its own metrics
    scored = await score_async(
        log=log,
        scorers=[model_graded_qa()],
        action="append",
    )
    # LookupError: inspect_evals/overall_mean was not found in the registry


anyio.run(repro)
Suggested fix (in score_async): wrap metrics_from_log_header() in a try/except when action="append", since the old metrics are already computed and stored in log.results — they don't need to be recreated.
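A minimal sketch of that guard, assuming the lookup failure surfaces as LookupError as in the repro. `load_header_metrics` and `header_metrics_for_action` are hypothetical helpers standing in for metrics_from_log_header() and its call site in score_async; the real signatures may differ. Only the append path tolerates the error, so other actions still fail loudly.

```python
def load_header_metrics(names, registry):
    """Hypothetical stand-in for metrics_from_log_header()."""
    missing = [n for n in names if n not in registry]
    if missing:
        raise LookupError(f"{missing[0]} was not found in the registry")
    return [registry[n] for n in names]


def header_metrics_for_action(names, registry, action):
    """Recreate header metrics, tolerating failures only when appending."""
    try:
        return load_header_metrics(names, registry)
    except LookupError:
        if action == "append":
            # Old results are already stored in the log, so skip recreation.
            return None
        raise


registry = {"accuracy": object()}
names = ["accuracy", "inspect_evals/overall_mean"]
appended = header_metrics_for_action(names, registry, "append")
```

Whether `None` or an empty list is the right fallback depends on how score_async consumes the value, which this sketch does not model.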