feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation by memadi-nv · Pull Request #158 · NVIDIA-NeMo/Anonymizer

memadi-nv · 2026-05-14T19:34:53Z

Summary

Adds four LLM-as-judge evaluation metrics for the replace workflow, plus the wiring and display surface to render them per record. All judges run as non-critical post-replacement steps (try/except, defaults on failure) so they never block the pipeline.
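The non-critical behavior described above can be sketched as a small wrapper. This is an illustration only, not the PR's code: `run_non_critical`, the dict-based row, and the two toy judges are hypothetical names. The `None` default for the verdict is an assumption, chosen to match the "Unavailable (judge failed)" state in the Display section.

```python
from typing import Any, Callable

Row = dict[str, Any]


def run_non_critical(judge: Callable[[Row], Row], row: Row, defaults: Row) -> Row:
    """Run one judge on one row; on any failure, fall back to the defaults."""
    try:
        return {**row, **judge(row)}
    except Exception:
        # A judge failure must never stop the replace pipeline.
        return {**row, **defaults}


def passing_judge(row: Row) -> Row:
    # Hypothetical judge that finds nothing wrong.
    return {"detection_valid": True, "detection_invalid_entities": []}


def failing_judge(row: Row) -> Row:
    # Hypothetical judge whose model returned unparseable output.
    raise ValueError("malformed JSON from model")


# Assumed defaults: None marks the verdict as unavailable.
DEFAULTS: Row = {"detection_valid": None, "detection_invalid_entities": []}

row = {"text": "Alice lives in Paris"}
ok = run_non_critical(passing_judge, row, DEFAULTS)
failed = run_non_critical(failing_judge, row, DEFAULTS)
```

In both cases the original row content survives, so later pipeline steps see the same columns whether or not the judge succeeded.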

What's new

Detection Validity judge — runs for all replace strategies. Asks "is each detected (value, label) a valid PII detection given the original text?" Output: detection_valid: bool|None + detection_invalid_entities: list[{value, label, reasoning}].
Type Fidelity judge — Substitute only. Asks "is each synthetic the same entity class and format as its label expects?" Treats labels as buckets (sibling categories pass). Output: type_fidelity_valid + type_fidelity_invalid_replacements.
Attribute Fidelity judge — Substitute only. Narrowed to two checks: gender of name (first_name/last_name/user_name) and age bucket (age/date_of_birth). Output: attribute_fidelity_valid + attribute_fidelity_invalid_entities.
Relational Consistency judge — Substitute only. Cross-entity coherence: geographic, temporal, identity (incl. name↔pronoun), organizational, role/employment, demographic, communication. Output: relational_consistency_valid + relational_consistency_invalid_relations.

Model & config
New ReplaceModelSelection fields: detection_judge, type_fidelity_judge, attribute_fidelity_judge, relational_consistency_judge.
Default replace.yaml updated. relational_consistency_judge defaulted to gpt-oss-120b (non-thinking model — reliable structured output); the others default to nemotron-30b-thinking.

Display
display_record(...) for Substitute now renders four sections plus the existing Replacement Map.
Three-state verdict (replaces yes/no): Satisfied (100%), Partially Satisfied (>0% & <100%), Not Satisfied (0%), plus a separate Unavailable state when the judge failed. Each section shows (success_rate: M/N) and a collapsible drilldown of failing/evaluated entities.
Fixed _normalize_replacement_map to handle numpy-wrapped (parquet round-trip) shapes via EntityReplacementMapSchema.model_validate — restores the Replacement Map table and entity highlighting in the Replaced section.

Tests
~1,200 lines added across four new test files (test_detection_judge.py, test_type_fidelity_judge.py, test_attribute_fidelity_judge.py, test_relational_consistency_judge.py) covering prompt structure, schemas, flatteners, and per-row workflow paths.
Existing fixtures updated for the new required ReplaceModelSelection fields.

Notes
All four judge prompts include an explicit <output_format> clause requesting raw JSON with no markdown fences, to defend against thinking-model output drift.
RelationalConsistencyJudgmentSchema.entities is list[str] (flat) rather than nested objects, to reduce JSON parse errors from thinking models.

Type of Change

Testing

make test passes locally
make check passes locally (format + lint + typecheck + lock-check)
Added/updated tests for changes

Documentation

If docs changed: make docs-build passes locally

Related Issues

Closes #98

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps · 2026-05-14T19:38:55Z

Greptile Summary

This PR adds four LLM-as-judge evaluation metrics to the replace workflow — Detection Validity, Type Fidelity, Attribute Fidelity, and Relational Consistency — along with the display surface to render them per record. All judges run as non-critical post-replacement steps (wrapped in try/except with safe defaults) and produce both internal raw-output columns and user-facing boolean/list columns.

Four new judge workflows each follow the same pattern: flatten the replacement map, skip empty-entity rows, invoke the LLM via LLMStructuredColumnConfig, then flatten the structured output to (valid: bool|None, invalid: list).
Breaking config change: four new required fields added to ReplaceModelSelection with no defaults — existing replace.yaml files that omit them will fail validation on load.
Display additions in display.py render tri-state verdicts with collapsible drilldowns; _normalize_replacement_map is also fixed to handle numpy-wrapped parquet shapes.
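The shared judge pattern summarized above (skip empty rows, call the model, flatten to `(valid, invalid)`) might look roughly like the following sketch. The names `flatten_judgment`, `judge_row`, `call_llm`, and the `judgments` / `is_valid` keys are hypothetical; the real workflows go through `LLMStructuredColumnConfig`, which is not reproduced here. Reporting skipped rows as vacuously valid is also an assumption.

```python
from typing import Any, Callable, Optional

Judgment = dict[str, Any]


def flatten_judgment(raw: Optional[dict[str, Any]]) -> tuple[Optional[bool], list[Judgment]]:
    """Flatten structured judge output to (valid, invalid_entries)."""
    if raw is None:
        # The judge call failed: verdict unavailable, nothing to list.
        return None, []
    invalid = [
        {"value": j["value"], "label": j["label"], "reasoning": j.get("reasoning", "")}
        for j in raw.get("judgments", [])
        if not j.get("is_valid", False)
    ]
    return len(invalid) == 0, invalid


def judge_row(
    entities: list[tuple[str, str]],
    call_llm: Callable[[list[tuple[str, str]]], dict[str, Any]],
) -> tuple[Optional[bool], list[Judgment]]:
    """Skip rows with no entities, otherwise call the model and flatten."""
    if not entities:
        return True, []  # assumption: an empty row passes without a model call
    try:
        raw = call_llm(entities)
    except Exception:
        raw = None
    return flatten_judgment(raw)
```

Keeping the flattening step separate from the model call makes it testable without a live model, which fits the PR's stated test coverage of "flatteners".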

Confidence Score: 3/5

The core judge logic and display code are well-implemented and defensive, but the four new model fields in ReplaceModelSelection are required with no defaults, which will break any existing deployment that loads a custom replace.yaml without them.

The four judge workflows are clean and handle failures gracefully. However, adding four required (non-optional) fields to ReplaceModelSelection is a config-level breaking change: any caller or YAML config file that omits these new fields will fail on load, even for Annotate/Redact/Hash strategies where the judge model aliases are never actually used. This needs to be resolved before the change is safe to roll out to existing environments.

src/anonymizer/config/models.py requires attention for the required-field breaking change; src/anonymizer/interface/anonymizer.py for the judge-column visibility gap in the user-facing dataframe.

Important Files Changed

Filename	Overview
src/anonymizer/config/models.py	Adds 4 required fields to ReplaceModelSelection — a breaking change for any user with a custom replace.yaml that omits these new judge model names.
src/anonymizer/engine/replace/replace_runner.py	Wires up four judge workflows sequentially as non-critical try/except steps; shared mutable failed_records list is correct.
src/anonymizer/engine/replace/detection_judge.py	New file implementing detection validity judge with clear schema, prompt, and passthrough logic for empty-entity rows.
src/anonymizer/engine/replace/type_fidelity_judge.py	New file implementing type-fidelity judge with detailed per-label format rules; handles replacement map deserialization correctly.
src/anonymizer/engine/replace/relational_consistency_judge.py	New file implementing cross-entity relational consistency judge; skips rows with fewer than 2 replacements.
src/anonymizer/engine/replace/attribute_fidelity_judge.py	New file implementing attribute-fidelity judge narrowed to gender-of-name and age-bucket checks only.
src/anonymizer/interface/display.py	Adds four judge verdict sections with tri-state badges; _verdict_badge ignores valid boolean for non-None cases.
src/anonymizer/interface/anonymizer.py	Wires up all four judge workflows; judge columns only available in trace_dataframe, not in user-facing result.dataframe.
src/anonymizer/engine/constants.py	Adds 12 new column constants for the four judge outputs, clearly annotated as internal vs user-facing.
src/anonymizer/config/default_model_configs/replace.yaml	Adds four new model aliases; type_fidelity_judge also defaults to gpt-oss-120b, inconsistent with PR description.

Sequence Diagram

sequenceDiagram
    participant RW as ReplacementWorkflow.run()
    participant LLM as LlmReplaceWorkflow
    participant DJ as DetectionJudgeWorkflow
    participant TF as TypeFidelityJudgeWorkflow
    participant RC as RelationalConsistencyJudgeWorkflow
    participant AF as AttributeFidelityJudgeWorkflow

    RW->>LLM: generate_map_only() [Substitute only]
    LLM-->>RW: map_result (dataframe + failed_records)
    RW->>RW: apply_replacement_map()

    RW->>DJ: evaluate(df) [all strategies]
    note over DJ: skip rows with no entities
    DJ-->>RW: COL_DETECTION_VALID, COL_DETECTION_INVALID_ENTITIES

    alt is_substitute
        RW->>TF: evaluate(df)
        TF-->>RW: COL_TYPE_FIDELITY_VALID
        RW->>RC: evaluate(df)
        RC-->>RW: COL_RELATIONAL_CONSISTENCY_VALID
        RW->>AF: evaluate(df)
        AF-->>RW: COL_ATTRIBUTE_FIDELITY_VALID
    end

    RW-->>RW: ReplacementResult(dataframe, failed_records)

Comments Outside Diff (1)

src/anonymizer/interface/anonymizer.py, lines 391-428

Judge evaluation columns excluded from result.dataframe

The _build_user_dataframe allowed column set for replace mode does not include any of the four new judge columns. Users who call result.dataframe will not see judge results — they only appear in result.trace_dataframe. If programmatic access to judge scores is intended, these columns should be added; if display-only is by design, a docstring note would help.
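One way to picture the first option, exposing judge columns programmatically, is an allow-list that can be widened on request. This is a sketch over plain dict rows rather than the project's dataframe code. `build_user_row`, `USER_COLUMNS`, and the `include_judges` flag are hypothetical; the judge column names are taken from the PR description.

```python
from typing import Any

# Hypothetical base allow-list for replace mode.
USER_COLUMNS = {"text", "replaced_text"}

# Judge output names as listed in the PR description.
JUDGE_COLUMNS = {
    "detection_valid",
    "detection_invalid_entities",
    "type_fidelity_valid",
    "type_fidelity_invalid_replacements",
    "attribute_fidelity_valid",
    "attribute_fidelity_invalid_entities",
    "relational_consistency_valid",
    "relational_consistency_invalid_relations",
}


def build_user_row(trace_row: dict[str, Any], include_judges: bool = False) -> dict[str, Any]:
    """Keep only allowed columns; optionally widen the set with judge outputs."""
    allowed = USER_COLUMNS | (JUDGE_COLUMNS if include_judges else set())
    return {key: value for key, value in trace_row.items() if key in allowed}


trace_row = {
    "text": "Alice lives in Paris",
    "replaced_text": "Maria lives in Lyon",
    "detection_valid": True,
    "_internal_raw_judge_output": "{...}",  # hypothetical internal column
}
default_view = build_user_row(trace_row)
judged_view = build_user_row(trace_row, include_judges=True)
```

Internal raw-output columns stay hidden in both views; only the user-facing verdict columns become visible when the flag is set.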

Reviews (1): Last reviewed commit: "change judge models for sparce error"

greptile-apps · 2026-05-14T19:38:59Z

    """Model aliases for the replacement pipeline."""

    replacement_generator: str
+    detection_judge: str
+    type_fidelity_judge: str
+    relational_consistency_judge: str
+    attribute_fidelity_judge: str


 class RewriteModelSelection(BaseModel):


Breaking change: new required judge model fields

All four new fields (detection_judge, type_fidelity_judge, relational_consistency_judge, attribute_fidelity_judge) are added as required str fields with no defaults. Any user who has a custom replace.yaml that does not include these fields will immediately get a Pydantic validation error when the config is loaded — even if they never use the Substitute strategy. Consider adding sensible default model strings or making them Optional[str] = None with the runner handling None gracefully, so existing configs continue to work without modification.
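The optional-field suggestion could be sketched as follows. To keep the snippet dependency-free it uses a stdlib dataclass as a stand-in for the Pydantic model, so it shows the shape of the idea rather than actual validation behavior. `ReplaceModelSelectionSketch` and `enabled_judges` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

JUDGE_FIELDS = (
    "detection_judge",
    "type_fidelity_judge",
    "relational_consistency_judge",
    "attribute_fidelity_judge",
)


@dataclass
class ReplaceModelSelectionSketch:
    """Stand-in for ReplaceModelSelection with optional judge aliases."""

    replacement_generator: str
    detection_judge: Optional[str] = None
    type_fidelity_judge: Optional[str] = None
    relational_consistency_judge: Optional[str] = None
    attribute_fidelity_judge: Optional[str] = None


def enabled_judges(selection: ReplaceModelSelectionSketch) -> list[str]:
    """Return the judges the runner should execute; unset aliases are skipped."""
    return [name for name in JUDGE_FIELDS if getattr(selection, name) is not None]


# A config written before this PR still constructs without error.
legacy = ReplaceModelSelectionSketch(replacement_generator="some-model")
partial = ReplaceModelSelectionSketch(
    replacement_generator="some-model",
    detection_judge="nemotron-30b-thinking",
)
```

With this shape, a runner would iterate over `enabled_judges(...)` instead of assuming all four aliases exist, so older configs skip judging rather than fail on load.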

greptile-apps · 2026-05-14T19:39:03Z

+def _verdict_badge(valid: object, correct: int, total: int) -> tuple[str, str]:
+    """Return (badge_html, rate_html) for the tri-state verdict.
+
+    - ``valid is None``        -> Unavailable (gray, no rate).
+    - ``total == 0``           -> Satisfied (green, no rate).
+    - ``correct == total``     -> Satisfied (green, with rate).
+    - ``correct == 0``         -> Not Satisfied (red, with rate).
+    - otherwise                -> Partially Satisfied (amber, with rate).
+    """
+    if valid is None:
+        return "<span style='color:#a3a3a3;font-weight:600'>Unavailable</span>", ""
+    if total == 0:
+        return "<span style='color:#22c55e;font-weight:600'>Satisfied</span>", ""
+    if correct >= total:
+        verdict, color = "Satisfied", "#22c55e"
+    elif correct == 0:
+        verdict, color = "Not Satisfied", "#ef4444"
+    else:
+        verdict, color = "Partially Satisfied", "#f59e0b"
+    badge = f"<span style='color:{color};font-weight:600'>{verdict}</span>"
+    rate_html = f" (success_rate: {correct}/{total})"
+    return badge, rate_html


_verdict_badge ignores valid boolean for non-None case

When valid=False but invalid_entries=[] (inconsistent LLM response), correct == total so the badge renders "Satisfied" in green — contradicting the valid=False signal. Consider also checking valid is False when the computed count says all-pass to avoid a misleading fully-green verdict.
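A possible shape for the suggested guard is sketched below, returning only the verdict label rather than HTML. `verdict_label` is a hypothetical helper. Mapping the contradictory case (`valid is False` with all-pass counts) to "Unavailable" is one option among several, chosen here so an inconsistent judge response is never shown as green.

```python
from typing import Optional


def verdict_label(valid: Optional[bool], correct: int, total: int) -> str:
    """Return the verdict label, guarding against contradictory judge output."""
    if valid is None:
        return "Unavailable"
    all_pass = total == 0 or correct >= total
    if all_pass:
        # Counts say everything passed; only trust that if the judge agrees.
        return "Unavailable" if valid is False else "Satisfied"
    if correct == 0:
        return "Not Satisfied"
    return "Partially Satisfied"
```

For example, `verdict_label(False, 3, 3)` yields "Unavailable" instead of "Satisfied", while consistent inputs keep the behavior documented in the `_verdict_badge` docstring.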

greptile-apps · 2026-05-14T19:39:04Z

+  detection_judge: nemotron-30b-thinking
+  type_fidelity_judge: gpt-oss-120b
+  relational_consistency_judge: gpt-oss-120b
+  attribute_fidelity_judge: nemotron-30b-thinking


PR description inconsistency with default config

The PR description states only relational_consistency_judge uses gpt-oss-120b, but type_fidelity_judge is also set to gpt-oss-120b here. Minor documentation inaccuracy.

memadi-nv added 9 commits May 11, 2026 17:27

add entity detection validation

856d9b7

Signed-off-by: memadi <memadi@nvidia.com>

add type fidelity metric

8a8636e

Signed-off-by: memadi <memadi@nvidia.com>

add relational consistency metric

6c3ae51

Signed-off-by: memadi <memadi@nvidia.com>

add attribute fidelity metric

7946920

Signed-off-by: memadi <memadi@nvidia.com>

update prompts

8776b53

Signed-off-by: memadi <memadi@nvidia.com>

disp;ay replacement map

dfad43b

Signed-off-by: memadi <memadi@nvidia.com>

update metric display

96c297a

Signed-off-by: memadi <memadi@nvidia.com>

more specific prompt

186d311

Signed-off-by: memadi <memadi@nvidia.com>

change judge models for sparce error

63592ad

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested a review from a team as a code owner May 14, 2026 19:34

memadi-nv changed the title ~~feature : Add LLM-as-a-judge for REPLACE evaluation~~ feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation May 14, 2026

memadi-nv marked this pull request as draft May 14, 2026 19:37

greptile-apps Bot reviewed May 14, 2026

memadi-nv wants to merge 9 commits into main from memadi/feature/evaluate-replace
