feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation #158
memadi-nv wants to merge 9 commits into
Signed-off-by: memadi <memadi@nvidia.com>
Greptile Summary
This PR adds four LLM-as-judge evaluation metrics to the replace workflow (Detection Validity, Type Fidelity, Attribute Fidelity, and Relational Consistency), along with the display surface to render them per record. All judges run as non-critical post-replacement steps (wrapped in try/except with safe defaults) and produce both internal raw-output columns and user-facing boolean/list columns.
Confidence Score: 3/5
The core judge logic and display code are well-implemented and defensive, but the four new model fields in ReplaceModelSelection are required with no defaults, which will break any existing deployment that loads a custom replace.yaml without them.
The four judge workflows are clean and handle failures gracefully. However, adding four required (non-optional) fields to ReplaceModelSelection is a config-level breaking change: any caller or YAML config file that omits these new fields will fail on load, even for Annotate/Redact/Hash strategies where the judge model aliases are never actually used. This needs to be resolved before the change is safe to roll out to existing environments.
src/anonymizer/config/models.py requires attention for the required-field breaking change; src/anonymizer/interface/anonymizer.py for the judge-column visibility gap in the user-facing dataframe.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant RW as ReplacementWorkflow.run()
participant LLM as LlmReplaceWorkflow
participant DJ as DetectionJudgeWorkflow
participant TF as TypeFidelityJudgeWorkflow
participant RC as RelationalConsistencyJudgeWorkflow
participant AF as AttributeFidelityJudgeWorkflow
RW->>LLM: generate_map_only() [Substitute only]
LLM-->>RW: map_result (dataframe + failed_records)
RW->>RW: apply_replacement_map()
RW->>DJ: evaluate(df) [all strategies]
note over DJ: skip rows with no entities
DJ-->>RW: COL_DETECTION_VALID, COL_DETECTION_INVALID_ENTITIES
alt is_substitute
RW->>TF: evaluate(df)
TF-->>RW: COL_TYPE_FIDELITY_VALID
RW->>RC: evaluate(df)
RC-->>RW: COL_RELATIONAL_CONSISTENCY_VALID
RW->>AF: evaluate(df)
AF-->>RW: COL_ATTRIBUTE_FIDELITY_VALID
end
RW-->>RW: ReplacementResult(dataframe, failed_records)
    """Model aliases for the replacement pipeline."""

    replacement_generator: str
    detection_judge: str
    type_fidelity_judge: str
    relational_consistency_judge: str
    attribute_fidelity_judge: str


class RewriteModelSelection(BaseModel):
Breaking change: new required judge model fields
All four new fields (detection_judge, type_fidelity_judge, relational_consistency_judge, attribute_fidelity_judge) are added as required str fields with no defaults. Any user who has a custom replace.yaml that does not include these fields will immediately get a Pydantic validation error when the config is loaded — even if they never use the Substitute strategy. Consider adding sensible default model strings or making them Optional[str] = None with the runner handling None gracefully, so existing configs continue to work without modification.
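A minimal sketch of the suggestion above, assuming the judge aliases become optional with `None` defaults and the runner skips any judge that is not configured. To stay dependency-free it uses a stdlib dataclass as a stand-in for the Pydantic model; `ReplaceModelSelectionSketch`, `configured_judges`, and the alias strings are hypothetical names for illustration, not code from this PR.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ReplaceModelSelectionSketch:
    """Stand-in for ReplaceModelSelection with optional judge aliases (assumption)."""

    replacement_generator: str
    detection_judge: Optional[str] = None
    type_fidelity_judge: Optional[str] = None
    relational_consistency_judge: Optional[str] = None
    attribute_fidelity_judge: Optional[str] = None


def configured_judges(selection: ReplaceModelSelectionSketch) -> dict:
    """Return only the judge aliases that are set; a runner would skip the rest."""
    return {
        f.name: getattr(selection, f.name)
        for f in fields(selection)
        if f.name.endswith("_judge") and getattr(selection, f.name) is not None
    }


# A config written before the judges existed still loads and runs no judges.
legacy = ReplaceModelSelectionSketch(replacement_generator="example-generator")

# A config that opts in to a single judge enables only that one.
partial = ReplaceModelSelectionSketch(
    replacement_generator="example-generator",
    detection_judge="example-judge",
)
```

With this shape, Annotate/Redact/Hash users would not need to touch their existing `replace.yaml`, at the cost of the runner having to treat a missing alias as "judge disabled".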
def _verdict_badge(valid: object, correct: int, total: int) -> tuple[str, str]:
    """Return (badge_html, rate_html) for the tri-state verdict.

    - ``valid is None`` -> Unavailable (gray, no rate).
    - ``total == 0`` -> Satisfied (green, no rate).
    - ``correct == total`` -> Satisfied (green, with rate).
    - ``correct == 0`` -> Not Satisfied (red, with rate).
    - otherwise -> Partially Satisfied (amber, with rate).
    """
    if valid is None:
        return "<span style='color:#a3a3a3;font-weight:600'>Unavailable</span>", ""
    if total == 0:
        return "<span style='color:#22c55e;font-weight:600'>Satisfied</span>", ""
    if correct >= total:
        verdict, color = "Satisfied", "#22c55e"
    elif correct == 0:
        verdict, color = "Not Satisfied", "#ef4444"
    else:
        verdict, color = "Partially Satisfied", "#f59e0b"
    badge = f"<span style='color:{color};font-weight:600'>{verdict}</span>"
    rate_html = f" (success_rate: {correct}/{total})"
    return badge, rate_html
_verdict_badge ignores valid boolean for non-None case
When valid=False but invalid_entries=[] (inconsistent LLM response), correct == total so the badge renders "Satisfied" in green — contradicting the valid=False signal. Consider also checking valid is False when the computed count says all-pass to avoid a misleading fully-green verdict.
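One way to act on this comment is sketched below. It assumes that an inconsistent response (`valid is False` while the counts say everything passed) should render as "Not Satisfied" without a rate, since the counts cannot be trusted in that case. That choice, and the name `verdict_badge_strict`, are assumptions for illustration; rendering "Unavailable" instead would be equally defensible.

```python
def _span(text: str, color: str) -> str:
    """Build the colored badge markup used by the display code."""
    return f"<span style='color:{color};font-weight:600'>{text}</span>"


def verdict_badge_strict(valid: object, correct: int, total: int) -> tuple:
    """Hypothetical variant of _verdict_badge that honors an explicit valid=False."""
    if valid is None:
        return _span("Unavailable", "#a3a3a3"), ""
    if valid is False and correct >= total:
        # The judge said "invalid" but listed no failing entries: do not show
        # a green verdict, and omit the rate because the counts are unreliable.
        return _span("Not Satisfied", "#ef4444"), ""
    if total == 0:
        return _span("Satisfied", "#22c55e"), ""
    if correct >= total:
        verdict, color = "Satisfied", "#22c55e"
    elif correct == 0:
        verdict, color = "Not Satisfied", "#ef4444"
    else:
        verdict, color = "Partially Satisfied", "#f59e0b"
    return _span(verdict, color), f" (success_rate: {correct}/{total})"
```

The extra branch is placed before the `total == 0` check so that `valid=False` with no evaluated entities is also kept from rendering green.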
detection_judge: nemotron-30b-thinking
type_fidelity_judge: gpt-oss-120b
relational_consistency_judge: gpt-oss-120b
attribute_fidelity_judge: nemotron-30b-thinking
Summary
Adds four LLM-as-judge evaluation metrics for the replace workflow, plus the wiring and display surface to render them per record. All judges run as non-critical post-replacement steps (try/except, defaults on failure) so they never block the pipeline.
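The non-critical pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `run_judge_safely` is a hypothetical helper, records are modeled as plain dicts rather than dataframe rows, and the default values are assumed to be `None` for the verdict and an empty list for the failing entries.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)

Record = Dict[str, Any]


def run_judge_safely(
    name: str,
    judge: Callable[[Record], Record],
    record: Record,
    defaults: Record,
) -> Record:
    """Run one judge; on any failure, log it and fall back to the defaults."""
    try:
        return {**record, **judge(record)}
    except Exception:
        # Judges are evaluation-only, so a failure must not abort the pipeline.
        logger.warning("judge %s failed; using defaults", name, exc_info=True)
        return {**record, **defaults}


def failing_judge(record: Record) -> Record:
    """Simulates a judge whose model call or output parsing raises."""
    raise RuntimeError("model returned unparseable output")


defaults = {"detection_valid": None, "detection_invalid_entities": []}
result = run_judge_safely("detection", failing_judge, {"text": "hello"}, defaults)
```

Here `result` keeps the original `text` field and carries `detection_valid=None`, which is the state the display layer would render as "Unavailable".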
What's new
Detection Validity judge — runs for all replace strategies. Asks "is each detected (value, label) a valid PII detection given the original text?" Output: detection_valid: bool|None + detection_invalid_entities: list[{value, label, reasoning}].
Type Fidelity judge — Substitute only. Asks "is each synthetic the same entity class and format as its label expects?" Treats labels as buckets (sibling categories pass). Output: type_fidelity_valid + type_fidelity_invalid_replacements.
Attribute Fidelity judge — Substitute only. Narrowed to two checks: gender of name (first_name/last_name/user_name) and age bucket (age/date_of_birth). Output: attribute_fidelity_valid + attribute_fidelity_invalid_entities.
Relational Consistency judge — Substitute only. Cross-entity coherence: geographic, temporal, identity (incl. name↔pronoun), organizational, role/employment, demographic, communication. Output: relational_consistency_valid + relational_consistency_invalid_relations.
Model & config
New ReplaceModelSelection fields: detection_judge, type_fidelity_judge, attribute_fidelity_judge, relational_consistency_judge.
Default replace.yaml updated. relational_consistency_judge defaulted to gpt-oss-120b (non-thinking model — reliable structured output); the others default to nemotron-30b-thinking.
Display
display_record(...) for Substitute now renders four sections plus the existing Replacement Map.
Three-state verdict (replaces yes/no): Satisfied (100%), Partially Satisfied (>0% & <100%), Not Satisfied (0%), Unavailable (judge failed). Each section shows (success_rate: M/N) and a collapsible drilldown of failing/evaluated entities.
Fixed _normalize_replacement_map to handle numpy-wrapped (parquet round-trip) shapes via EntityReplacementMapSchema.model_validate — restores the Replacement Map table and entity highlighting in the Replaced section.
Tests
~1,200 lines added across four new test files (test_detection_judge.py, test_type_fidelity_judge.py, test_attribute_fidelity_judge.py, test_relational_consistency_judge.py) covering prompt structure, schemas, flatteners, and per-row workflow paths.
Existing fixtures updated for the new required ReplaceModelSelection fields.
Notes
All four judge prompts include an explicit <output_format> clause requesting raw JSON with no markdown fences, to defend against thinking-model output drift.
RelationalConsistencyJudgmentSchema.entities is list[str] (flat) rather than nested objects, to reduce JSON parse errors from thinking models.
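Even with an explicit output-format clause, a model may still wrap its JSON in a markdown fence. The sketch below shows one defensive way to parse such output; `parse_judge_json` is a hypothetical helper and not part of this PR, and returning `None` on failure is an assumption chosen to match the "Unavailable" state.

```python
import json
from typing import Any, Dict, Optional

FENCE = "`" * 3  # markdown code-fence marker, built this way to avoid a literal fence


def parse_judge_json(raw: str) -> Optional[Dict[str, Any]]:
    """Parse a judge response as a JSON object, tolerating a surrounding fence."""
    text = raw.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        # Drop an optional language tag (for example "json") on the first line.
        first, sep, rest = text.partition("\n")
        if sep and not first.strip().startswith(("{", "[")):
            text = rest
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Judges are expected to return an object; anything else counts as a failure.
    return parsed if isinstance(parsed, dict) else None
```

A caller would treat a `None` return the same way as a raised exception: fall back to the default columns rather than failing the record.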
Type of Change
Testing
make test passes locally
make check passes locally (format + lint + typecheck + lock-check)
Documentation
make docs-build passes locally
Related Issues
Closes #98