
[RFC] Experiment Integrity: Content Based Drift Detection #402

@iprithv


This proposal aims to protect users from making incorrect data-driven decisions. By identifying when experiment results are compared against shifted baselines, this RFC adds the Integrity Guardrails needed to make the Search Relevance Workbench a truly reliable tool for search engineering.

Executive Summary

Today, OpenSearch Search Relevance experiments track inputs (QuerySet, SearchConfiguration, JudgmentList) by ID only. Because these entities are mutable, an experiment’s results can become misleading if the underlying data is modified after execution.

This proposal introduces the Experiment Input Signature. By capturing stable content hashes of all inputs at the moment of execution, we provide a safety mechanism that alerts users when data drift has rendered their historical experiment conclusions incomparable with the current state.


The Problem: Silent Metric Invalidation

Search relevance work is iterative. Currently, the system lacks "Version Awareness" for mutable inputs. In Information Retrieval theory, two experiments are only comparable if they are evaluated over identical input distributions.

  • The Baseline Shift: A user runs an experiment, then modifies the QuerySet (adding or removing queries). When they run a second experiment to compare results, the NDCG/Precision delta is no longer mathematically valid because the query distribution has changed.
  • The Result: The system provides no warning of this drift, leading users to make production decisions based on incomparable data.

Proposed Solution: Input Signatures

1. Experiment Input Signature

We will introduce an ExperimentInputSignature object within the Experiment document to store "fingerprints" of the inputs.
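As an illustration, the stored object might look like the following (the field names and hash values shown here are hypothetical; the final schema is an implementation detail):

```json
{
  "experiment_input_signature": {
    "query_set_hash": "sha256:3f1a…",
    "judgment_list_hash": "sha256:9bc2…",
    "search_configuration_hash": "sha256:77de…",
    "computed_at": "2024-05-01T12:00:00Z"
  }
}
```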

  • Canonical Serialization: To ensure stable hashes, we will use Jackson’s ObjectWriter with SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS. This ensures that semantic changes are captured while ignoring irrelevant JSON formatting or key ordering.
  • SHA-256 Hashing: The system will generate a signature after entities are loaded and before evaluation begins. It covers:
  • QuerySet: Sorted query text and metadata.
  • JudgmentList: All ratings keyed by query ID.
  • SearchConfiguration: The normalized search request body (excluding transient runtime fields such as size, from, or profile). Normalization removes runtime parameters but preserves ranking-affecting parameters.
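The canonical-serialization idea above can be sketched in a few lines of Python (the plugin itself would use Jackson's ObjectWriter with SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS; the TRANSIENT_FIELDS list here is illustrative, mirroring the size/from/profile examples):

```python
import hashlib
import json

# Runtime-only parameters excluded from the SearchConfiguration hash
# (illustrative list based on the examples above).
TRANSIENT_FIELDS = {"size", "from", "profile"}

def normalize(body: dict) -> dict:
    """Drop transient runtime fields; keep ranking-affecting parameters."""
    return {k: v for k, v in body.items() if k not in TRANSIENT_FIELDS}

def signature(obj) -> str:
    """Canonical JSON (sorted keys, no insignificant whitespace) -> SHA-256."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key ordering and transient fields do not change the hash; semantic changes do.
a = signature(normalize({"size": 10, "query": {"match": {"title": "shoes"}}}))
b = signature(normalize({"query": {"match": {"title": "shoes"}}, "size": 50}))
c = signature(normalize({"query": {"match": {"title": "boots"}}}))
assert a == b  # formatting/runtime differences ignored
assert a != c  # ranking-affecting change detected
```

The key property is that the hash is a function of semantics, not serialization order, so two logically identical configurations always produce the same fingerprint.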

2. Drift Detection & Validation

Validation logic will compare the Stored Signature (historical) against the Live Signature (current state).

  • Validation API: GET /_plugins/search_relevance/experiments/{id}/validate

```json
{
  "status": "DRIFTED",
  "drifted_inputs": ["query_set"],
  "message": "QuerySet has changed since execution"
}
```

For experiments created before this feature, the API returns "status": "UNAVAILABLE".

  • Non-Blocking Behavior: Validation does not block experiment execution; it serves to annotate the trustworthiness of results.
  • Performance Safeguard: Signatures are computed once per run and cached. This ensures the fingerprint reflects exactly what was tested without adding overhead to core search threads.
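The comparison itself is straightforward. A Python sketch of the validation flow (the DRIFTED and UNAVAILABLE statuses mirror the API above; the VALID status, the field names, and the function shape are assumptions for illustration):

```python
import hashlib
import json

def signature(obj) -> str:
    """Canonical JSON (sorted keys) hashed with SHA-256."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def validate(experiment: dict, live_inputs: dict) -> dict:
    """Compare stored signatures against live ones; never blocks execution."""
    stored = experiment.get("input_signature")
    if stored is None:
        # Experiment predates the signature feature.
        return {"status": "UNAVAILABLE"}
    drifted = [name for name, live_obj in live_inputs.items()
               if signature(live_obj) != stored.get(name)]
    if drifted:
        return {"status": "DRIFTED", "drifted_inputs": drifted}
    return {"status": "VALID"}

# Usage: a QuerySet mutated after execution is flagged as drifted.
qs = {"queries": ["red shoes"]}
exp = {"input_signature": {"query_set": signature(qs)}}
ok = validate(exp, {"query_set": qs})          # status VALID
qs["queries"].append("blue boots")
bad = validate(exp, {"query_set": qs})         # status DRIFTED
```

Because the stored signature is a fixed-size hash, validation is a constant-time string comparison per input, independent of QuerySet or JudgmentList size.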

Implementation Phases

Phase 1: Signature Metadata (Safety MVP)

  • Update the Experiment model to persist input signatures.
  • Implement the canonical hashing utility using standard Jackson configurations.
  • Goal: Ensure every new experiment has an immutable "receipt" of its specific inputs.

Phase 2: UI Transparency & Alerts

  • Drift Warning: In the Search Relevance Workbench UI, experiments with modified inputs will display a Drift Warning icon.
  • Tooltip: "Warning: The QuerySet or Judgments for this run have changed. Results may not be comparable with the current data state."

Why This is Low Risk

  1. Minimal Schema Impact: Signature data is stored within the existing Experiment index. No new indices or complex migrations are required.
  2. Backward Compatibility: Legacy experiments will simply return a "Signature Unavailable" status, ensuring zero breakage for existing users.
  3. Correctness First: This addresses a fundamental requirement of any evaluation system: ensuring the data being compared is actually comparable.

Success Metrics

  • Data Trust: Users can immediately identify if an experiment result is "stale" or "drifted."
  • Auditability: Provides a provable link between a specific Search Configuration and its evaluation metrics.
  • Zero Regression: No measurable impact on the performance of standard search evaluation workflows.

Final Implementation Note

This RFC focuses strictly on Detection and Transparency. It does not introduce complex auto-versioning or copy-on-write logic, keeping the implementation simple, maintainable, and focused on immediate user safety.
