
feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation #158

Draft
memadi-nv wants to merge 9 commits into main from memadi/feature/evaluate-replace


@memadi-nv
Contributor

Summary

Adds four LLM-as-judge evaluation metrics for the replace workflow, plus the wiring and display surface to render them per record. All judges run as non-critical post-replacement steps (try/except, defaults on failure) so they never block the pipeline.
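The "non-critical step" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual runner code: `run_non_critical` and `failing_judge` are hypothetical names, and rows are modeled as plain dicts rather than dataframe rows. Only the column names `detection_valid` and `detection_invalid_entities` come from the description.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def run_non_critical(
    step: Callable[[dict[str, Any]], dict[str, Any]],
    row: dict[str, Any],
    defaults: dict[str, Any],
) -> dict[str, Any]:
    """Run a judge step; on any failure, return the row with default columns."""
    try:
        return step(row)
    except Exception:
        # Judges are evaluation-only, so a failure is logged and swallowed.
        logger.exception("judge step failed; falling back to defaults")
        return {**row, **defaults}


def failing_judge(row: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical judge that always fails, to exercise the fallback path."""
    raise RuntimeError("model endpoint unavailable")


row = {"text": "Alice lives in Paris."}
defaults = {"detection_valid": None, "detection_invalid_entities": []}
result = run_non_critical(failing_judge, row, defaults)
```

With this shape, a failed judge leaves the original columns intact and yields `None` for the verdict, which is what the display layer later renders as Unavailable.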

What's new

  • Detection Validity judge — runs for all replace strategies. Asks "is each detected (value, label) a valid PII detection given the original text?" Output: detection_valid: bool|None + detection_invalid_entities: list[{value, label, reasoning}].

  • Type Fidelity judge — Substitute only. Asks "is each synthetic the same entity class and format as its label expects?" Treats labels as buckets (sibling categories pass). Output: type_fidelity_valid + type_fidelity_invalid_replacements.

  • Attribute Fidelity judge — Substitute only. Narrowed to two checks: gender of name (first_name/last_name/user_name) and age bucket (age/date_of_birth). Output: attribute_fidelity_valid + attribute_fidelity_invalid_entities.

  • Relational Consistency judge — Substitute only. Cross-entity coherence: geographic, temporal, identity (incl. name↔pronoun), organizational, role/employment, demographic, communication. Output: relational_consistency_valid + relational_consistency_invalid_relations.

Model & config

  • New ReplaceModelSelection fields: detection_judge, type_fidelity_judge, attribute_fidelity_judge, relational_consistency_judge.

  • Default replace.yaml updated. relational_consistency_judge defaulted to gpt-oss-120b (non-thinking model — reliable structured output); the others default to nemotron-30b-thinking.

Display

  • display_record(...) for Substitute now renders four sections plus the existing Replacement Map.

  • Three-state verdict (replaces yes/no): Satisfied (100%), Partially Satisfied (>0% and <100%), Not Satisfied (0%), plus an Unavailable fallback when the judge fails. Each section shows (success_rate: M/N) and a collapsible drilldown of failing/evaluated entities.

  • Fixed _normalize_replacement_map to handle numpy-wrapped (parquet round-trip) shapes via EntityReplacementMapSchema.model_validate — restores the Replacement Map table and entity highlighting in the Replaced section.

Tests

  • ~1,200 lines added across four new test files (test_detection_judge.py, test_type_fidelity_judge.py, test_attribute_fidelity_judge.py, test_relational_consistency_judge.py) covering prompt structure, schemas, flatteners, and per-row workflow paths.

  • Existing fixtures updated for the new required ReplaceModelSelection fields.

Notes

  • All four judge prompts include an explicit <output_format> clause requesting raw JSON with no markdown fences, to defend against thinking-model output drift.

  • RelationalConsistencyJudgmentSchema.entities is list[str] (flat) rather than nested objects, to reduce JSON parse errors from thinking models.
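To illustrate why a flat `entities: list[str]` plus a "raw JSON, no fences" instruction helps, here is a hedged sketch of a tolerant parser. It is not part of the PR: `parse_relational_judgment` is a hypothetical helper, the `reasoning` key in the usage is assumed for illustration, and the PR itself relies on the prompt clause rather than on post-hoc fence stripping.

```python
import json
from typing import Any

FENCE = "`" * 3  # built at runtime so the example holds no literal fence


def parse_relational_judgment(raw: str) -> dict[str, Any]:
    """Parse judge output, tolerating a stray markdown fence around the JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    entities = data.get("entities", [])
    # The flat shape is easy to validate: every item must be a plain string.
    if not isinstance(entities, list) or not all(isinstance(e, str) for e in entities):
        raise ValueError("entities must be a flat list of strings")
    return data


fenced = FENCE + 'json\n{"entities": ["Paris", "Berlin"], "reasoning": "x"}\n' + FENCE
parsed = parse_relational_judgment(fenced)
```

A nested payload such as `{"entities": [{"value": "Paris"}]}` is rejected by the same check, which is the kind of drift the flat schema is meant to make less likely.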

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally
  • make check passes locally (format + lint + typecheck + lock-check)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally

Related Issues

Closes #98

memadi-nv added 9 commits May 11, 2026 17:27
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from a team as a code owner May 14, 2026 19:34
@memadi-nv memadi-nv changed the title from "feature : Add LLM-as-a-judge for REPLACE evaluation" to "feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation" May 14, 2026
@memadi-nv memadi-nv marked this pull request as draft May 14, 2026 19:37
@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR adds four LLM-as-judge evaluation metrics to the replace workflow — Detection Validity, Type Fidelity, Attribute Fidelity, and Relational Consistency — along with the display surface to render them per record. All judges run as non-critical post-replacement steps (wrapped in try/except with safe defaults) and produce both internal raw-output columns and user-facing boolean/list columns.

  • Four new judge workflows each follow the same pattern: flatten the replacement map, skip empty-entity rows, invoke the LLM via LLMStructuredColumnConfig, then flatten the structured output to (valid: bool|None, invalid: list).
  • Breaking config change: four new required fields added to ReplaceModelSelection with no defaults — existing replace.yaml files that omit them will fail validation on load.
  • Display additions in display.py render tri-state verdicts with collapsible drilldowns; _normalize_replacement_map is also fixed to handle numpy-wrapped parquet shapes.
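The shared judge pattern summarized above (skip empty-entity rows, then flatten the structured output to a verdict plus a list of failures) might look something like this. It is a sketch under stated assumptions: `flatten_judgment`, the `judgments` key, and the `is_valid` flag are hypothetical, and treating an empty-entity row as `True` is a guess based on the display logic, not something the PR confirms.

```python
from typing import Any, Optional


def flatten_judgment(
    entities: list[dict[str, Any]],
    raw_output: Optional[dict[str, Any]],
) -> tuple[Optional[bool], list[dict[str, Any]]]:
    """Collapse a structured judge response into (valid, invalid_entries)."""
    if not entities:
        # Nothing to judge: pass the row through without calling the LLM.
        return True, []
    if raw_output is None:
        # The judge failed or returned nothing parseable.
        return None, []
    invalid = [
        j for j in raw_output.get("judgments", []) if not j.get("is_valid", False)
    ]
    return len(invalid) == 0, invalid


entities = [
    {"value": "Alice", "label": "first_name"},
    {"value": "blue", "label": "city"},
]
raw = {
    "judgments": [
        {"value": "Alice", "label": "first_name", "is_valid": True},
        {"value": "blue", "label": "city", "is_valid": False, "reasoning": "not a city"},
    ]
}
valid, invalid = flatten_judgment(entities, raw)
```

Keeping `None` distinct from `False` is what lets the display layer tell "the judge failed" apart from "the judge found problems".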

Confidence Score: 3/5

The core judge logic and display code are well-implemented and defensive, but the four new model fields in ReplaceModelSelection are required with no defaults, which will break any existing deployment that loads a custom replace.yaml without them.

The four judge workflows are clean and handle failures gracefully. However, adding four required (non-optional) fields to ReplaceModelSelection is a config-level breaking change: any caller or YAML config file that omits these new fields will fail on load, even for Annotate/Redact/Hash strategies where the judge model aliases are never actually used. This needs to be resolved before the change is safe to roll out to existing environments.

src/anonymizer/config/models.py requires attention for the required-field breaking change; src/anonymizer/interface/anonymizer.py for the judge-column visibility gap in the user-facing dataframe.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/anonymizer/config/models.py | Adds 4 required fields to ReplaceModelSelection, a breaking change for any user with a custom replace.yaml that omits these new judge model names. |
| src/anonymizer/engine/replace/replace_runner.py | Wires up four judge workflows sequentially as non-critical try/except steps; shared mutable failed_records list is correct. |
| src/anonymizer/engine/replace/detection_judge.py | New file implementing detection validity judge with clear schema, prompt, and passthrough logic for empty-entity rows. |
| src/anonymizer/engine/replace/type_fidelity_judge.py | New file implementing type-fidelity judge with detailed per-label format rules; handles replacement map deserialization correctly. |
| src/anonymizer/engine/replace/relational_consistency_judge.py | New file implementing cross-entity relational consistency judge; skips rows with fewer than 2 replacements. |
| src/anonymizer/engine/replace/attribute_fidelity_judge.py | New file implementing attribute-fidelity judge narrowed to gender-of-name and age-bucket checks only. |
| src/anonymizer/interface/display.py | Adds four judge verdict sections with tri-state badges; _verdict_badge ignores valid boolean for non-None cases. |
| src/anonymizer/interface/anonymizer.py | Wires up all four judge workflows; judge columns only available in trace_dataframe, not in user-facing result.dataframe. |
| src/anonymizer/engine/constants.py | Adds 12 new column constants for the four judge outputs, clearly annotated as internal vs user-facing. |
| src/anonymizer/config/default_model_configs/replace.yaml | Adds four new model aliases; type_fidelity_judge also defaults to gpt-oss-120b, inconsistent with PR description. |

Sequence Diagram

sequenceDiagram
    participant RW as ReplacementWorkflow.run()
    participant LLM as LlmReplaceWorkflow
    participant DJ as DetectionJudgeWorkflow
    participant TF as TypeFidelityJudgeWorkflow
    participant RC as RelationalConsistencyJudgeWorkflow
    participant AF as AttributeFidelityJudgeWorkflow

    RW->>LLM: generate_map_only() [Substitute only]
    LLM-->>RW: map_result (dataframe + failed_records)
    RW->>RW: apply_replacement_map()

    RW->>DJ: evaluate(df) [all strategies]
    note over DJ: skip rows with no entities
    DJ-->>RW: COL_DETECTION_VALID, COL_DETECTION_INVALID_ENTITIES

    alt is_substitute
        RW->>TF: evaluate(df)
        TF-->>RW: COL_TYPE_FIDELITY_VALID
        RW->>RC: evaluate(df)
        RC-->>RW: COL_RELATIONAL_CONSISTENCY_VALID
        RW->>AF: evaluate(df)
        AF-->>RW: COL_ATTRIBUTE_FIDELITY_VALID
    end

    RW-->>RW: ReplacementResult(dataframe, failed_records)
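
The gating shown in the diagram can be restated in a few lines of Python. This is only a sketch of the ordering: `judges_to_run` is a hypothetical helper, and the short judge names are labels chosen for illustration rather than identifiers from the codebase.

```python
def judges_to_run(is_substitute: bool) -> list[str]:
    """Return judge names in the order the diagram shows them being invoked."""
    judges = ["detection"]  # runs for all replace strategies
    if is_substitute:
        # The remaining three judges need synthetic replacements to inspect.
        judges += ["type_fidelity", "relational_consistency", "attribute_fidelity"]
    return judges


for_redact = judges_to_run(is_substitute=False)
for_substitute = judges_to_run(is_substitute=True)
```

Non-Substitute strategies therefore only pay for one judge call per row, while Substitute pays for four.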

Comments Outside Diff (1)

  1. src/anonymizer/interface/anonymizer.py, lines 391-428

    P2 Judge evaluation columns excluded from result.dataframe

    The _build_user_dataframe allowed column set for replace mode does not include any of the four new judge columns. Users who call result.dataframe will not see judge results — they only appear in result.trace_dataframe. If programmatic access to judge scores is intended, these columns should be added; if display-only is by design, a docstring note would help.

Reviews (1): Last reviewed commit: "change judge models for sparce error"

Comment on lines 78 to 87
    """Model aliases for the replacement pipeline."""

    replacement_generator: str
    detection_judge: str
    type_fidelity_judge: str
    relational_consistency_judge: str
    attribute_fidelity_judge: str


class RewriteModelSelection(BaseModel):

P1 Breaking change: new required judge model fields

All four new fields (detection_judge, type_fidelity_judge, relational_consistency_judge, attribute_fidelity_judge) are added as required str fields with no defaults. Any user who has a custom replace.yaml that does not include these fields will immediately get a Pydantic validation error when the config is loaded — even if they never use the Substitute strategy. Consider adding sensible default model strings or making them Optional[str] = None with the runner handling None gracefully, so existing configs continue to work without modification.
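
One way the `Optional[str] = None` suggestion could play out is sketched below. To stay dependency-free it uses a stdlib dataclass as a stand-in for the Pydantic model, so it does not exercise Pydantic validation itself. `ReplaceModelSelectionSketch`, `configured_judges`, and the alias `some-generator-alias` are hypothetical; only the field names come from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplaceModelSelectionSketch:
    """Stand-in for the Pydantic model; judge aliases are optional here."""

    replacement_generator: str
    detection_judge: Optional[str] = None
    type_fidelity_judge: Optional[str] = None
    relational_consistency_judge: Optional[str] = None
    attribute_fidelity_judge: Optional[str] = None


def configured_judges(selection: ReplaceModelSelectionSketch) -> dict[str, str]:
    """Return only the judges that have a model alias; the rest are skipped."""
    names = (
        "detection_judge",
        "type_fidelity_judge",
        "relational_consistency_judge",
        "attribute_fidelity_judge",
    )
    return {n: getattr(selection, n) for n in names if getattr(selection, n) is not None}


# A config that predates the judge fields still constructs without error.
legacy = ReplaceModelSelectionSketch(replacement_generator="some-generator-alias")
partial = ReplaceModelSelectionSketch(
    replacement_generator="some-generator-alias",
    detection_judge="nemotron-30b-thinking",
)
```

The trade-off is that a missing alias silently disables a judge, so the runner would probably want to log which judges were skipped.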

Comment on lines +478 to +499
def _verdict_badge(valid: object, correct: int, total: int) -> tuple[str, str]:
    """Return (badge_html, rate_html) for the tri-state verdict.

    - ``valid is None`` -> Unavailable (gray, no rate).
    - ``total == 0`` -> Satisfied (green, no rate).
    - ``correct == total`` -> Satisfied (green, with rate).
    - ``correct == 0`` -> Not Satisfied (red, with rate).
    - otherwise -> Partially Satisfied (amber, with rate).
    """
    if valid is None:
        return "<span style='color:#a3a3a3;font-weight:600'>Unavailable</span>", ""
    if total == 0:
        return "<span style='color:#22c55e;font-weight:600'>Satisfied</span>", ""
    if correct >= total:
        verdict, color = "Satisfied", "#22c55e"
    elif correct == 0:
        verdict, color = "Not Satisfied", "#ef4444"
    else:
        verdict, color = "Partially Satisfied", "#f59e0b"
    badge = f"<span style='color:{color};font-weight:600'>{verdict}</span>"
    rate_html = f" (success_rate: {correct}/{total})"
    return badge, rate_html

P2 _verdict_badge ignores valid boolean for non-None case

When valid=False but invalid_entries=[] (inconsistent LLM response), correct == total so the badge renders "Satisfied" in green — contradicting the valid=False signal. Consider also checking valid is False when the computed count says all-pass to avoid a misleading fully-green verdict.
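
A possible shape for that extra check is sketched below, reduced to the verdict label so it stays independent of the HTML rendering. `verdict_label` is a hypothetical helper, and mapping the inconsistent case to "Not Satisfied" is an assumption; reporting it as "Unavailable" would be an equally defensible choice.

```python
from typing import Optional


def verdict_label(valid: Optional[bool], correct: int, total: int) -> str:
    """Tri-state verdict that also honors an explicit valid=False."""
    if valid is None:
        return "Unavailable"
    if valid is False and correct >= total:
        # Judge said "invalid" but listed no failing entries: do not show green.
        return "Not Satisfied"
    if total == 0 or correct >= total:
        return "Satisfied"
    if correct == 0:
        return "Not Satisfied"
    return "Partially Satisfied"


inconsistent = verdict_label(False, 3, 3)
consistent = verdict_label(True, 3, 3)
```

The counts still drive the partial and failing cases; the boolean only overrides the all-pass branch, which is the one place the two signals can disagree in a misleading way.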

Comment on lines +6 to +9
detection_judge: nemotron-30b-thinking
type_fidelity_judge: gpt-oss-120b
relational_consistency_judge: gpt-oss-120b
attribute_fidelity_judge: nemotron-30b-thinking

P2 PR description inconsistency with default config

The PR description states only relational_consistency_judge uses gpt-oss-120b, but type_fidelity_judge is also set to gpt-oss-120b here. Minor documentation inaccuracy.


Development

Successfully merging this pull request may close these issues.

feat: LLM-as-a-judge for REPLACE evaluation
