
[RFC] Experiment Integrity: Content Based Drift Detection #402

@iprithv


This proposal aims to protect users from making incorrect data-driven decisions. By identifying when experiment results are compared against shifted baselines, this RFC adds the Integrity Guardrails needed to make the Search Relevance Workbench a truly reliable tool for search engineering.

Executive Summary

Today, OpenSearch Search Relevance experiments track inputs (QuerySet, SearchConfiguration, JudgmentList) by ID only. Because these entities are mutable, an experiment’s results can become misleading if the underlying data is modified after execution.

This proposal introduces the Experiment Input Signature. By capturing stable content hashes of all inputs at the moment of execution, we provide a safety mechanism that alerts users when data drift has rendered their historical experiment conclusions incomparable with the current state.


The Problem: Silent Metric Invalidation

Search relevance work is iterative. Currently, the system lacks "Version Awareness" for mutable inputs. In Information Retrieval theory, two experiments are only comparable if they are evaluated over identical input distributions.

  • The Baseline Shift: A user runs an experiment, then modifies the QuerySet (adding or removing queries). When they run a second experiment to compare results, the NDCG/Precision delta is no longer mathematically valid because the query distribution has changed.
  • The Result: The system provides no warning of this drift, leading users to make production decisions based on incomparable data.

Proposed Solution: Input Signatures

1. Experiment Input Signature

We will introduce an ExperimentInputSignature object within the Experiment document to store "fingerprints" of the inputs.
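As an illustration, the stored object might look like the following (the field names and hash values shown here are hypothetical; the final schema is an implementation detail):

```json
{
  "experiment_input_signature": {
    "query_set_hash": "sha256:3f1a…",
    "judgment_list_hash": "sha256:9bc2…",
    "search_configuration_hash": "sha256:77de…",
    "computed_at": "2024-05-01T12:00:00Z"
  }
}
```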

  • Canonical Serialization: To ensure stable hashes, we will use Jackson’s ObjectWriter with SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS. This ensures that semantic changes are captured while ignoring irrelevant JSON formatting or key ordering.
  • SHA-256 Hashing: The system will generate a signature after entities are loaded and before evaluation begins. It covers:
  • QuerySet: Sorted query text and metadata.
  • JudgmentList: All ratings keyed by query ID.
  • SearchConfiguration: The normalized search request body (excluding transient runtime fields such as size, from, or profile). Normalization removes runtime parameters but preserves ranking-affecting parameters.
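The canonical-serialization idea above can be sketched in a few lines of Python (the plugin itself would use Jackson's ObjectWriter with SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS; the TRANSIENT_FIELDS list here is illustrative, mirroring the size/from/profile examples):

```python
import hashlib
import json

# Runtime-only parameters excluded from the SearchConfiguration hash
# (illustrative list based on the examples above).
TRANSIENT_FIELDS = {"size", "from", "profile"}

def normalize(body: dict) -> dict:
    """Drop transient runtime fields; keep ranking-affecting parameters."""
    return {k: v for k, v in body.items() if k not in TRANSIENT_FIELDS}

def signature(obj) -> str:
    """Canonical JSON (sorted keys, no insignificant whitespace) -> SHA-256."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key ordering and transient fields do not change the hash; semantic changes do.
a = signature(normalize({"size": 10, "query": {"match": {"title": "shoes"}}}))
b = signature(normalize({"query": {"match": {"title": "shoes"}}, "size": 50}))
c = signature(normalize({"query": {"match": {"title": "boots"}}}))
assert a == b  # formatting/runtime differences ignored
assert a != c  # ranking-affecting change detected
```

The key property is that the hash is a function of semantics, not serialization order, so two logically identical configurations always produce the same fingerprint.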

2. Drift Detection & Validation

Validation logic will compare the Stored Signature (historical) against the Live Signature (current state).

  • Validation API: GET /_plugins/search_relevance/experiments/{id}/validate

```json
{
  "status": "DRIFTED",
  "drifted_inputs": ["query_set"],
  "message": "QuerySet has changed since execution"
}
```

For experiments created before this feature, the API returns "status": "UNAVAILABLE".

  • Non-Blocking Behavior: Validation does not block experiment execution; it serves to annotate the trustworthiness of results.
  • Performance Safeguard: Signatures are computed once per run and cached. This ensures the fingerprint reflects exactly what was tested without adding overhead to core search threads.
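The comparison itself is straightforward. A Python sketch of the validation flow (the DRIFTED and UNAVAILABLE statuses mirror the API above; the VALID status, the field names, and the function shape are assumptions for illustration):

```python
import hashlib
import json

def signature(obj) -> str:
    """Canonical JSON (sorted keys) hashed with SHA-256."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def validate(experiment: dict, live_inputs: dict) -> dict:
    """Compare stored signatures against live ones; never blocks execution."""
    stored = experiment.get("input_signature")
    if stored is None:
        # Experiment predates the signature feature.
        return {"status": "UNAVAILABLE"}
    drifted = [name for name, live_obj in live_inputs.items()
               if signature(live_obj) != stored.get(name)]
    if drifted:
        return {"status": "DRIFTED", "drifted_inputs": drifted}
    return {"status": "VALID"}

# Usage: a QuerySet mutated after execution is flagged as drifted.
qs = {"queries": ["red shoes"]}
exp = {"input_signature": {"query_set": signature(qs)}}
ok = validate(exp, {"query_set": qs})          # status VALID
qs["queries"].append("blue boots")
bad = validate(exp, {"query_set": qs})         # status DRIFTED
```

Because the stored signature is a fixed-size hash, validation is a constant-time string comparison per input, independent of QuerySet or JudgmentList size.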

Implementation Phases

Phase 1: Signature Metadata (Safety MVP)

  • Update the Experiment model to persist input signatures.
  • Implement the canonical hashing utility using standard Jackson configurations.
  • Goal: Ensure every new experiment has an immutable "receipt" of its specific inputs.

Phase 2: UI Transparency & Alerts

  • Drift Warning: In the Search Relevance Workbench UI, experiments with modified inputs will display a Drift Warning icon.
  • Tooltip: "Warning: The QuerySet or Judgments for this run have changed. Results may not be comparable with the current data state."

Why This is Low Risk

  1. Minimal Schema Impact: Signature data is stored within the existing Experiment index. No new indices or complex migrations are required.
  2. Backward Compatibility: Legacy experiments will simply return a "Signature Unavailable" status, ensuring zero breakage for existing users.
  3. Correctness First: This addresses a fundamental requirement of any evaluation system: ensuring the data being compared is actually comparable.

Success Metrics

  • Data Trust: Users can immediately identify if an experiment result is "stale" or "drifted."
  • Auditability: Provides a provable link between a specific Search Configuration and its evaluation metrics.
  • Zero Regression: No measurable impact on the performance of standard search evaluation workflows.

Final Implementation Note

This RFC focuses strictly on Detection and Transparency. It does not introduce complex auto-versioning or copy-on-write logic, keeping the implementation simple, maintainable, and focused on immediate user safety.
