harness: align graph runs with structured metrics artifacts#1803
jioffe502 wants to merge 9 commits into NVIDIA:main
Conversation
Replace stdout metric parsing in harness run execution with structured metrics derived from graph pipeline runtime summary artifacts so results are more stable and reviewer-friendly. Preserve existing artifact shape while adding run-mode/auto-tuning config coverage and updating harness tests for batch-first structured sourcing. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Delete the legacy harness stream-metric parser and its tests now that harness metrics are sourced from runtime summary artifacts. Keep PTY child-process output streaming for user job-progress visibility while slimming dead parsing code. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
- harden latest_commit fallback for gitdir and packed-refs environments - expose resolved auto-heuristic graph flags in runtime and harness artifacts - trim effective_tuning payload while preserving key runtime context
Greptile Summary

This PR replaces fragile stdout-scraping metric collection with structured metrics derived from graph pipeline runtime summary artifacts.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/harness/artifacts.py | Adds worktree-friendly git fallback (_resolve_git_dir, _read_packed_ref, _read_head_commit); five bare except Exception blocks silently discard errors without logging. |
| nemo_retriever/src/nemo_retriever/harness/run.py | Core rework: graph_pipeline subprocess, structured *.runtime.summary.json metrics, _resolve_effective_tuning with undocumented pdf_split_batch_size asymmetry when auto_tuning=True. |
| nemo_retriever/src/nemo_retriever/harness/config.py | Adds run_mode and auto_tuning fields to HarnessConfig with validation against VALID_RUN_MODES and env-var overrides; clean and correct. |
| nemo_retriever/src/nemo_retriever/utils/ray_resource_hueristics.py | Adds _coerce_count helper using float() before int() to handle fractional Ray resource values; minimal, targeted fix. |
| nemo_retriever/harness/test_configs.yaml | Adds run_mode: batch and auto_tuning: true as active defaults; maps directly to new HarnessConfig fields. |
| nemo_retriever/tests/test_harness_config.py | New tests for run_mode env-var override, invalid-mode rejection, and auto_tuning CLI override; adequate coverage of new config surface. |
| nemo_retriever/tests/test_harness_run.py | Updated to cover graph_pipeline command construction, structured metrics, git-metadata fallback, and auto_tuning zeroing behavior; reflects new behavior correctly. |
| nemo_retriever/tests/test_resource_heuristics.py | New test for fractional Ray resource coercion (CPU: 32.0, GPU: 1.0); directly exercises the _coerce_count fix. |
| nemo_retriever/harness/HANDOFF.md | Trimmed to runbook-style ops notes with a key-files table; no code impact. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Op as Operator
    participant H as Harness
    participant GP as graph_pipeline subprocess
    participant SJ as run_id.runtime.summary.json
    participant RJ as results.json
    Op->>H: harness run --dataset jp20
    H->>H: load_harness_config()
    Note over H: YAML + env overrides + CLI → HarnessConfig
    Note over H: run_mode=batch, auto_tuning=true
    H->>H: _build_command() calls _resolve_effective_tuning()
    Note over H: auto_tuning=True → zeros most tuning params
    H->>GP: python -m nemo_retriever.examples.graph_pipeline
    Note over GP: --run-mode batch
    Note over GP: --pdf-extract-tasks 0 (auto)
    Note over GP: --runtime-metrics-dir runtime_metrics/
    GP->>SJ: write run_id.runtime.summary.json
    Note over SJ: {processed_pages, ingestion_only_secs,
    Note over SJ: evaluation_metrics: {recall@5: 0.91}}
    GP-->>H: exit code
    H->>SJ: _read_json_if_exists(runtime_summary_path)
    SJ-->>H: runtime_summary dict (or None)
    H->>H: _build_structured_metrics_payload(runtime_summary)
    H->>H: _evaluate_run_outcome(recall_metrics, evaluation_metrics)
    H->>RJ: write_json(results.json)
    Note over RJ: {success, metrics, summary_metrics,
    Note over RJ: runtime_summary, test_config.tuning}
    RJ-->>Op: artifact
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/artifacts.py
Line: 36-39
Comment:
**Silent exception swallowing without logging**
Five bare `except Exception:` blocks in the new git-metadata fallback path (lines 38, 63, 78, 89–90, 106) all return `None`/`pass` without logging any context. When `last_commit()` silently falls back to `"unknown"`, a misconfigured worktree path or permission error becomes completely invisible. The `no-bare-except` rule requires that non-boundary catch sites at minimum log with context before swallowing.
```suggestion
except Exception as exc: # noqa: BLE001
logger.debug("_resolve_git_dir: failed to read %s: %s", dot_git, exc)
return None
```
Apply the same `logger.debug(...)` pattern at the other four catch sites (`_read_packed_ref` line 63, `_read_head_commit` lines 78 and 89–90, and `last_commit` line 106).
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 275
Comment:
**`pdf_split_batch_size` not zeroed when `auto_tuning=True`**
Every tuning field in `_resolve_effective_tuning` is zeroed when `use_auto=True` to signal `graph_pipeline` to auto-tune — except `pdf_split_batch_size`, which always forwards the configured value. The asymmetry is undocumented, making it ambiguous whether this is intentional (the pipeline ignores `0` for this param) or an accidental omission. Add a comment if intentional:
```suggestion
# pdf_split_batch_size is always forwarded explicitly; graph_pipeline does not
# implement auto-tuning for this parameter and the configured default is safe.
"pdf_split_batch_size": cfg.pdf_split_batch_size,
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (5). Last reviewed commit: "docs(harness): slim HANDOFF for ops; res..."
```python
resolved_graph_flags = runtime_resolved_tuning.get("resolved_graph_flags")
if isinstance(resolved_graph_flags, dict):
    final_graph_flags = dict(resolved_graph_flags)
    effective_tuning["graph_pipeline_flags"] = final_graph_flags
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 522-525
Comment:
**`*_cpus_per_actor` keys silently dropped from `graph_pipeline_flags` on runtime override**
When `runtime_resolved_tuning` provides a `resolved_graph_flags` dict, `final_graph_flags` is completely replaced. `resolved_graph_flags` (built in `_resolve_runtime_tuning_summary`) includes `nemotron_parse_*` keys but does not include `page_elements_cpus_per_actor`, `ocr_cpus_per_actor`, or `embed_cpus_per_actor`. Those three fields are present in `requested_graph_flags` from `_resolve_effective_tuning` and are actually passed to the subprocess, but disappear from the `effective_tuning.graph_pipeline_flags` artifact after the replacement. Merging instead of replacing preserves them:
```suggestion
if isinstance(resolved_graph_flags, dict):
final_graph_flags = {**final_graph_flags, **dict(resolved_graph_flags)}
effective_tuning["graph_pipeline_flags"] = final_graph_flags
```
How can I resolve this? If you propose a fix, please make it concise.

- add missing trailing newline to CONTRIBUTING.md
- add missing trailing newline to release notes markdown
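The merge-versus-replace distinction behind this fix is plain dict semantics. In this self-contained example, the three `*_cpus_per_actor` keys come from the review comment; `nemotron_parse_batch_size` stands in for the `nemotron_parse_*` keys it mentions, and all values are illustrative:

```python
# Requested flags from _resolve_effective_tuning: includes the CPU-per-actor
# keys that are actually passed to the subprocess.
requested_graph_flags = {
    "page_elements_cpus_per_actor": 2,
    "ocr_cpus_per_actor": 4,
    "embed_cpus_per_actor": 1,
    "nemotron_parse_batch_size": 8,
}
# Runtime-resolved flags from the summary: only nemotron_parse_* keys.
resolved_graph_flags = {"nemotron_parse_batch_size": 16}

# Replacing drops the *_cpus_per_actor keys from the artifact:
replaced = dict(resolved_graph_flags)
assert "ocr_cpus_per_actor" not in replaced

# Merging keeps them while letting runtime-resolved values win:
merged = {**requested_graph_flags, **resolved_graph_flags}
assert merged["ocr_cpus_per_actor"] == 4
assert merged["nemotron_parse_batch_size"] == 16
```

Because later entries in `{**a, **b}` override earlier ones, the runtime summary still wins for any key it does resolve, while keys it never reports survive from the requested flags.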
Resolve conflicts:
- run.py: omit tuning CLI flags when use_heuristics; otherwise pass effective_tuning (auto_tuning zeros vs configured values).
- graph_pipeline.py: keep runtime tuning summary helpers and _count_input_units from main.
Runtime summaries may only return a subset of tuning keys; preserve requested effective values (e.g. auto-tuning CPU sentinels) when overlaying resolved_graph_flags.
…_tuning Remove harness results.json effective_tuning and runtime summary resolved_tuning so this PR stays focused on runtime-summary metrics (vs stdout). Tuning CLI construction unchanged; follow-up can reintroduce tuning capture with a single clear model.
…list Trim graph-refactor migration, git workflow, and obsolete batch/parser notes. Keep operator map, CLI, artifacts, and a short results.json contract. Add maintainer backlog and harness change checklist for agents.
TL;DR
Aligns the retriever harness with the graph pipeline and makes run artifacts easier to trust: metrics come from structured `*.runtime.summary.json` files, not stdout scraping. The harness invokes `graph_pipeline` with the same controls as before for run mode and CLI wiring.

Description

- Builds `graph_pipeline` subprocess commands from harness config (including `run_mode` and resource flags as determined by preset, YAML, env, and `use_heuristics`).
- Keeps `results.json` / session layout compatible with existing consumers.
- Hardens `latest_commit` resolution (gitdir / packed-refs friendly) in harness artifacts helpers.
- `test_config.tuning` continues to record the merged harness configuration used for the run.
- `metrics` and `summary_metrics` contain only performance and evaluation fields from the runtime summary.
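A downstream consumer can sanity-check that artifact contract with a few lines. The top-level field names come from this PR's description and sequence diagram; treating them as the complete required set is an assumption, and `check_results_contract` is a hypothetical helper, not part of the harness:

```python
import json
from pathlib import Path

def check_results_contract(path: Path) -> list[str]:
    """Return top-level fields missing from a harness results.json."""
    results = json.loads(path.read_text())
    # Fields named in the PR: overall outcome, structured metrics, and the
    # raw runtime summary the metrics were sourced from.
    expected = ("success", "metrics", "summary_metrics", "runtime_summary")
    return [key for key in expected if key not in results]
```

An empty return means the artifact carries every expected section; anything else lists what a consumer would find missing.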