This proposal is aimed at protecting users from making incorrect data-driven decisions. By identifying when experiment results are compared against shifted baselines, this RFC adds the integrity guardrails needed to make the Search Relevance Workbench a truly reliable tool for search engineering.
Executive Summary
Today, OpenSearch Search Relevance experiments track inputs (QuerySet, SearchConfiguration, JudgmentList) by ID only. Because these entities are mutable, an experiment’s results can become misleading if the underlying data is modified after execution.
This proposal introduces the Experiment Input Signature. By capturing stable content hashes of all inputs at the moment of execution, we provide a safety mechanism that alerts users when data drift has rendered their historical experiment conclusions incomparable with the current state.
The Problem: Silent Metric Invalidation
Search relevance work is iterative. Currently, the system lacks "Version Awareness" for mutable inputs. In Information Retrieval theory, two experiments are only comparable if they are evaluated over identical input distributions.
- The Baseline Shift: A user runs an experiment, then modifies the QuerySet (adding or removing queries). When they run a second experiment to compare results, the NDCG/Precision delta is no longer mathematically valid because the query distribution has changed.
- The Result: The system provides no warning of this drift, leading users to make production decisions based on incomparable data.
Proposed Solution: Input Signatures
1. Experiment Input Signature
We will introduce an ExperimentInputSignature object within the Experiment document to store "fingerprints" of the inputs.
- Canonical Serialization: To ensure stable hashes, we will use Jackson’s ObjectWriter with SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS. This ensures that semantic changes are captured while irrelevant JSON formatting or key ordering is ignored.
- SHA-256 Hashing: The system will generate a signature after entities are loaded and before evaluation begins. It covers:
- QuerySet: Sorted query text and metadata.
- JudgmentList: All ratings keyed by query ID.
- SearchConfiguration: The normalized search request body (excluding transient runtime fields like size, from, or profile). Normalization removes runtime parameters but preserves ranking-affecting parameters.
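The canonicalize-then-hash step described above can be sketched as follows. This is a dependency-free illustration, not the proposed implementation: the real plugin would serialize full entities with Jackson's ObjectWriter and ORDER_MAP_ENTRIES_BY_KEYS, whereas here a TreeMap over a flat string map stands in for that ordering, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class InputSignature {

    // Transient runtime fields excluded from the SearchConfiguration hash,
    // per the normalization rule above.
    private static final Set<String> RUNTIME_FIELDS = Set.of("size", "from", "profile");

    // Sort keys (TreeMap stands in for Jackson's ORDER_MAP_ENTRIES_BY_KEYS)
    // and drop runtime-only fields, then emit a stable key=value form.
    static String canonicalize(Map<String, String> input) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(input).entrySet()) {
            if (RUNTIME_FIELDS.contains(e.getKey())) {
                continue; // normalization: ignore parameters that do not affect ranking
            }
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // SHA-256 over the canonical form, hex-encoded.
    static String sha256Hex(String canonical) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(canonical.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        Map<String, String> a = Map.of("query", "laptop", "size", "10");
        Map<String, String> b = Map.of("size", "50", "query", "laptop");
        // Same ranking-affecting content, different runtime params -> equal hashes.
        System.out.println(sha256Hex(canonicalize(a)).equals(sha256Hex(canonicalize(b))));
        // -> true
    }
}
```

The key property is that two inputs with the same semantic content always hash to the same value, regardless of key order or runtime-only parameters.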
2. Drift Detection & Validation
Validation logic will compare the Stored Signature (historical) against the Live Signature (current state).
- Validation API:
GET /_plugins/search_relevance/experiments/{id}/validate
{
"status": "DRIFTED",
"drifted_inputs": ["query_set"],
"message": "QuerySet has changed since execution"
}
For experiments created before this feature, the API returns "status": "UNAVAILABLE".
- Non-Blocking Behavior: Validation does not block experiment execution; it annotates the trustworthiness of the results.
- Performance Safeguard: Signatures are computed once per run and cached. This ensures the fingerprint reflects exactly what was tested without adding overhead to core search threads.
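The drift check behind the validate endpoint reduces to a per-input hash comparison. The following is a hypothetical sketch, assuming the signatures are held as input-name-to-hash maps; the DriftValidator class, its method signature, and the map-based return shape are illustrative, not the final API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DriftValidator {

    // Compare stored (at-execution) hashes against live (recomputed) hashes.
    // Statuses mirror the RFC: VALID, DRIFTED, and UNAVAILABLE for
    // experiments created before signatures existed.
    public static Map<String, Object> validate(Map<String, String> stored,
                                               Map<String, String> live) {
        if (stored == null || stored.isEmpty()) {
            return Map.of("status", "UNAVAILABLE", "drifted_inputs", List.of());
        }
        List<String> drifted = new ArrayList<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            // A missing or changed live hash both count as drift.
            if (!e.getValue().equals(live.get(e.getKey()))) {
                drifted.add(e.getKey());
            }
        }
        String status = drifted.isEmpty() ? "VALID" : "DRIFTED";
        return Map.of("status", status, "drifted_inputs", drifted);
    }
}
```

Because the check only compares pre-computed hashes, it stays cheap enough to run on demand without touching search threads, matching the performance safeguard above.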
Implementation Phases
Phase 1: Signature Metadata (Safety MVP)
- Update the Experiment model to persist input signatures.
- Implement the canonical hashing utility using standard Jackson configurations.
- Goal: Ensure every new experiment has an immutable "receipt" of its specific inputs.
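The "receipt" persisted on the Experiment document could take roughly this shape. A minimal sketch, assuming one hash per input plus a timestamp; the record name and field names are illustrative, not the final schema.

```java
// Hypothetical signature object stored alongside the Experiment document.
// Field names are placeholders; the actual mapping is defined in Phase 1.
public record ExperimentInputSignature(
        String querySetHash,            // SHA-256 of the canonical QuerySet
        String judgmentListHash,        // SHA-256 of the canonical JudgmentList
        String searchConfigurationHash, // SHA-256 of the normalized request body
        long computedAtMillis) {        // when the signature was captured
}
```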
Phase 2: UI Transparency & Alerts
- Drift Warning: In the Search Relevance Workbench UI, experiments with modified inputs will display a Drift Warning icon.
- Tooltip: "Warning: The QuerySet or Judgments for this run have changed. Results may not be comparable with the current data state."
Why This is Low Risk
- Minimal Schema Impact: Signature data is stored within the existing Experiment index. No new indices or complex migrations are required.
- Backward Compatibility: Legacy experiments will simply return a "Signature Unavailable" status, ensuring zero breakage for existing users.
- Correctness First: This addresses a fundamental requirement of any evaluation system: ensuring the data being compared is actually comparable.
Success Metrics
- Data Trust: Users can immediately identify if an experiment result is "stale" or "drifted."
- Auditability: Provides a provable link between a specific Search Configuration and its evaluation metrics.
- Zero Regression: No measurable impact on the performance of standard search evaluation workflows.
Final Implementation Note
This RFC focuses strictly on Detection and Transparency. It does not introduce complex auto-versioning or copy-on-write logic, keeping the implementation simple, maintainable, and focused on immediate user safety.