Project: Domain-Specific RAG Evaluation & MLOps Platform
Version: 1.1.0
Branch: feat/graph-context-relevance
Last updated: 2026-03-27
- Overview
- Graph Context Relevance (GCR) — Composite Score
- Component: Entity Overlap ($S_e$)
- Component: Structural Connectivity ($S_c$)
- Component: Hub Noise Penalty ($P_h$)
- Complexity Analysis
- Weight Calibration
- Edge-Case Behaviour
- Two-Tier Entity Highlighter
- Output Contract
- Related Files
## Overview

The Graph Context Relevance (GCR) metric evaluates the topological quality of a retrieved subgraph relative to a question/answer pair. Unlike cosine-similarity-based metrics, which treat retrieved context as an unordered bag of embeddings, GCR measures whether the retrieval forms a coherent, connected neighbourhood in the knowledge graph.
GCR is:
- 100% offline — no LLM calls, no embedding model calls at evaluation time.
- Deterministic — identical inputs always produce identical outputs.
- O(N + E) — scales linearly with graph size on the first call; cached thereafter.
The evaluator is implemented in `eval-pipeline/src/evaluation/graph_context_relevance.py` and reads from any backend that satisfies the GraphStore Protocol (`eval-pipeline/src/utils/graph_store.py`).
## Graph Context Relevance (GCR) — Composite Score

| Symbol | Name | Range | Weight |
|---|---|---|---|
| $S_e$ | Entity Overlap | $[0, 1]$ | $\alpha$ |
| $S_c$ | Structural Connectivity | $[0, 1]$ | $\beta$ |
| $P_h$ | Hub Noise Penalty | $[0, 1]$ | $\gamma$ |

$$\mathrm{GCR} = \mathrm{clip}\big(\alpha \cdot S_e + \beta \cdot S_c - \gamma \cdot P_h,\ 0,\ 1\big)$$

The clip operation keeps the composite score in $[0, 1]$.

Maximum achievable score (when $S_e = S_c = 1$ and $P_h = 0$) is $\alpha + \beta$.

Under the default weights the ceiling is intentionally below 1.0, leaving headroom for domain-calibrated weight configurations where $\alpha + \beta$ is raised.
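The clipped composite combination can be sketched as a small pure function. This is a minimal illustration, not the pipeline's implementation; the example weights in the comment are taken from the custom-weights example later in this document, not the configured defaults.

```python
def gcr_score(s_e: float, s_c: float, p_h: float,
              alpha: float, beta: float, gamma: float) -> float:
    """Composite GCR: clip(alpha*S_e + beta*S_c - gamma*P_h, 0, 1)."""
    raw = alpha * s_e + beta * s_c - gamma * p_h
    return max(0.0, min(1.0, raw))

# E.g. with weights (0.5, 0.3, 0.2): a perfect retrieval with no hubs
# caps out at alpha + beta = 0.8, and an all-hub retrieval with no
# overlap or connectivity clips to 0.0 rather than going negative.
```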
## Component: Entity Overlap ($S_e$)

Definition: Mean Jaccard similarity between the joint question+answer token set and each retrieved node's token set.

$$S_e = \frac{1}{|R|} \sum_{n \in R} \mathrm{Jaccard}(Q, T_n)$$

where:

- $Q = \text{tokens}(\text{question} \cup \text{expected\_answer})$ — lowercase alphanumeric tokens
- $T_n = \text{tokens}(\text{keyphrases}_n \cup \text{entities}_n \cup \text{content}_n)$
- $\mathrm{Jaccard}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$, with $\mathrm{Jaccard}(\emptyset, \cdot) = 0$

Tokenisation rule: `re.findall(r"[a-z0-9_\u4e00-\u9fff]+", text.lower())` — captures ASCII alphanumeric, underscore, and CJK unified ideographs (U+4E00–U+9FFF), supporting Chinese/English mixed-language corpora.

Complexity: $O(\lvert R \rvert \cdot \bar{t})$, where $\bar{t}$ is the average token-set size per node.

Semantics: high $S_e$ means the retrieved nodes share vocabulary with the question/answer pair; $S_e = 0$ means no lexical overlap at all.
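A self-contained sketch of $S_e$ under the tokenisation rule above. It assumes each node's keyphrase/entity/content fields are pre-joined into one string per node; the real evaluator reads them from node properties.

```python
import re


def tokens(text: str) -> set[str]:
    # Tokenisation rule from the spec: ASCII alphanumerics, underscore,
    # and CJK unified ideographs, lowercased.
    return set(re.findall(r"[a-z0-9_\u4e00-\u9fff]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    # Jaccard(empty, anything) is defined as 0.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def entity_overlap(question: str, expected_answer: str,
                   node_texts: list[str]) -> float:
    # node_texts: one joined string per retrieved node (keyphrases,
    # entities, and content concatenated).
    if not node_texts:
        return 0.0
    q = tokens(question) | tokens(expected_answer)
    return sum(jaccard(q, tokens(t)) for t in node_texts) / len(node_texts)
```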
## Component: Structural Connectivity ($S_c$)

Definition: Fraction of retrieved nodes in the largest connected component of the induced subgraph.

$$S_c = \frac{|\mathrm{LCC}(G[R])|}{|R|}$$

where:

- $G[R]$ is the undirected subgraph induced by the retrieved node set $R$
- $\mathrm{LCC}(\cdot)$ is the largest connected component (by node count)
- Only edges whose both endpoints are in $R$ are included (orphaned edges are excluded)

Special cases:

| Condition | $S_c$ | Rationale |
|---|---|---|
| $R = \emptyset$ | 0.0 | Empty retrieval |
| $\lvert R \rvert = 1$ | 1.0 | Single node is trivially connected |
| No edges in $G[R]$ | $1 / \lvert R \rvert$ | Worst case: each node its own component |
| $G[R]$ connected | 1.0 | All nodes in one component — perfect structural coherence |

Complexity: $O(\lvert R \rvert + E_R)$ via `networkx.connected_components`.

Semantics: high $S_c$ means the retrieval forms one coherent neighbourhood rather than scattered fragments.
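The evaluator itself uses `networkx`; this dependency-free union-find sketch illustrates the same induced-subgraph LCC fraction, including the edge filter that drops orphaned edges.

```python
from collections import Counter


def structural_connectivity(retrieved: list[str],
                            edges: list[tuple[str, str]]) -> float:
    # retrieved: validated node hashes; edges: (u, v) pairs from the full graph.
    if not retrieved:
        return 0.0
    parent = {n: n for n in retrieved}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        # Induced subgraph: keep an edge only if BOTH endpoints were retrieved.
        if u in parent and v in parent:
            parent[find(u)] = find(v)

    component_sizes = Counter(find(n) for n in retrieved)
    return max(component_sizes.values()) / len(retrieved)
```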
## Component: Hub Noise Penalty ($P_h$)

Definition: Fraction of retrieved nodes classified as degree hubs in the full graph.

$$P_h = \frac{\left|\{\, n \in R : \deg_G(n) > \mu_{\deg} + 2\sigma_{\deg} \,\}\right|}{|R|}$$

where:

- $\mu_{\deg}$ and $\sigma_{\deg}$ are the mean and sample standard deviation of all node degrees in the full graph $G$
- The $\mu + 2\sigma$ threshold corresponds to the 97.7th percentile under a normal degree distribution

Guard conditions:

| Condition | $P_h$ | Rationale |
|---|---|---|
| $R = \emptyset$ | 0.0 | Empty retrieval |
| Full graph has fewer than 2 nodes | 0.0 | Cannot compute meaningful statistics |
| $\sigma_{\deg} = 0$ | 0.0 | Regular graph — no hub structure |
| Otherwise | hub fraction | Penalty proportional to hub fraction |

Complexity: $O(N)$ for the degree statistics (one pass over all degrees), then $O(\lvert R \rvert)$ to flag hubs.

Semantics: high $P_h$ means the retrieval leans on high-degree "hub" nodes that connect to everything and therefore carry little discriminative signal.
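A minimal sketch of the penalty, assuming a precomputed map from node hash to full-graph degree (the evaluator derives degrees from the cached graph). The guard conditions above appear as early returns.

```python
from statistics import mean, stdev


def hub_noise_penalty(degrees: dict[str, int], retrieved: list[str]) -> float:
    # degrees: node -> degree in the FULL graph (not the induced subgraph).
    if not retrieved or len(degrees) < 2:
        return 0.0  # empty retrieval, or too few nodes for sample stdev
    mu, sigma = mean(degrees.values()), stdev(degrees.values())
    if sigma == 0:
        return 0.0  # regular graph: no hub structure
    threshold = mu + 2 * sigma
    hubs = [n for n in retrieved if degrees.get(n, 0) > threshold]
    return len(hubs) / len(retrieved)
```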
## Complexity Analysis

| Phase | Operation | Complexity | Notes |
|---|---|---|---|
| Graph build (first call) | Load N nodes + E edges from SQLite | $O(N + E)$ | Cached; subsequent calls skip the build |
| Graph build | `nx.Graph.add_node` / `add_edge` | $O(1)$ amortised | Dict-of-dicts; each op is $O(1)$ on average |
| Entity overlap | Jaccard per node | $O(\lvert R \rvert \cdot \bar{t})$ | $\bar{t}$ = average token-set size per node |
| Connectivity | `nx.subgraph` + `connected_components` | $O(\lvert R \rvert + E_R)$ | Subgraph is a view, no copy |
| Hub detection | `statistics.mean` + `stdev` | $O(N)$ | One pass over all degrees |
| Hub filtering | Scan retrieved nodes | $O(\lvert R \rvert)$ | Compare against pre-computed threshold |
| Total per evaluation | — | $O(N + E)$ on the first call | Graph build amortised after first call |

Amortised complexity (after graph and hub threshold are cached): $O(\lvert R \rvert + E_R)$ per evaluation.
## Weight Calibration

Default weights (see `gcr_weights` in `config/pipeline_config.yaml`) encode three principles:

- $\alpha = \beta$: Semantic relevance and structural coherence are treated as co-equal. Neither is sufficient alone.
- $\gamma < \alpha, \beta$: Hub noise is a correction signal, not a primary scorer.
- $\alpha + \beta - \gamma = 0.6$: Even under maximum hub penalty ($P_h = 1$), an otherwise perfect retrieval still scores 0.6, so the score range remains meaningful.
Custom weights are accepted at construction time and echoed in the evaluation contract:
```python
from src.evaluation.graph_context_relevance import GraphContextRelevanceEvaluator
from src.utils.graph_store import SQLiteGraphStore

store = SQLiteGraphStore("outputs/my_run/kg.db")
evaluator = GraphContextRelevanceEvaluator(store, alpha=0.5, beta=0.3, gamma=0.2)

result = evaluator.evaluate(
    question="What defects appear on the steel surface?",
    expected_answer="Scratches and pits are detected.",
    retrieved_node_hashes=["abc123...", "def456..."],
)
print(result["score"])     # float in [0.0, 1.0]
print(result["contract"])  # full diagnostic breakdown
```

Domain calibration procedure:

- Collect human-labeled "good retrieval" / "bad retrieval" pairs from the domain corpus.
- Grid-search $(\alpha, \beta, \gamma)$ subject to $\alpha, \beta, \gamma \geq 0$ and $\alpha + \beta > \gamma$.
- Maximise rank correlation (Kendall $\tau$) between GCR scores and human judgments.
- Update the `gcr_weights` section of `pipeline_config.yaml` with the calibrated values.
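The grid-search and rank-correlation steps can be sketched as follows. The `kendall_tau` and `calibrate` helpers are hypothetical illustrations (a real run might use `scipy.stats.kendalltau`); only the weight constraint comes from the procedure above.

```python
from itertools import product


def kendall_tau(xs: list[float], ys: list[float]) -> float:
    # Naive O(n^2) Kendall rank correlation; fine for small labelled sets.
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    pairs = n * (n - 1) / 2
    return (concordant - discordant) / pairs if pairs else 0.0


def calibrate(components: list[tuple[float, float, float]],
              human: list[float], step: float = 0.1):
    # components: per-example (S_e, S_c, P_h); human: human relevance labels.
    best_weights, best_tau = None, -2.0
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    for alpha, beta, gamma in product(grid, repeat=3):
        if alpha + beta <= gamma:  # constraint from the calibration procedure
            continue
        scores = [max(0.0, min(1.0, alpha * e + beta * c - gamma * p))
                  for e, c, p in components]
        tau = kendall_tau(scores, human)
        if tau > best_tau:
            best_weights, best_tau = (alpha, beta, gamma), tau
    return best_weights, best_tau
```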
## Edge-Case Behaviour

| Input | $S_e$ | $S_c$ | $P_h$ | GCR | Notes |
|---|---|---|---|---|---|
| Empty retrieved set | 0.0 | 0.0 | 0.0 | 0.0 | Short-circuited before graph build |
| Single valid node | Jaccard score | 1.0 | 0 or 1 | varies | Single-node connectivity = 1 |
| All nodes irrelevant (tokens disjoint) | 0.0 | varies | varies | depends on $\beta S_c - \gamma P_h$ | |
| Node hashes not in store | — | — | — | — | Filtered; only valid hashes scored: `valid = [h for h in retrieved if h in full_graph]` |
| NaN content in node properties | 0.0 | — | — | — | Graceful; `_node_tokens` returns empty set. Tested in `test_nan_content_node_handled_gracefully` |
| Zero-variance degree distribution | — | — | 0.0 | — | No hub penalty (regular graphs) |
## Two-Tier Entity Highlighter

The QA Debugger in the Insights Portal automatically highlights shared entities between the question and retrieved context passages. Two strategies are applied in priority order:
### Tier 1: Explicit entity lists

Checks the `extra` payload of each QA item for explicit entity lists under any of these keys:

`entities` | `extracted_entities` | `entity_list` | `named_entities`

If found, those strings are used directly as highlight terms. This tier is exact: it uses the entities extracted by the evaluation pipeline's NER step.
### Tier 2: Answer-token fallback

When no explicit entities are available, the highlighter tokenises the LLM-generated answer and uses the unique content words (length ≥ 4 chars) as highlight terms. Terms are:

- De-duplicated
- Sorted longest-first (greedy matching priority)
- Capped at 12 terms
```typescript
// Scoring heuristic — from insights-portal/src/utils/textHighlighter.ts
const words = answer
  .split(/[\s,.;:!?、。,!?·\-–—/\\()[\]{}"「」『』【】《》〈〉]+/)
  .filter((w) => w.length >= 4)
return [...new Set(words)].sort((a, b) => b.length - a.length).slice(0, 12)
```

All highlight markup uses a hardcoded `<mark class="hl-entity">$1</mark>` / `<mark class="hl-overlap">$1</mark>` template. The term strings are regex-escaped before insertion. No user-controlled HTML is injected; the input data comes from the user's own locally loaded evaluation CSV files.
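For illustration, the two-tier selection logic can be restated in Python. The `highlight_terms` helper is hypothetical; the real implementation is the TypeScript in `textHighlighter.ts`, which uses the full CJK-aware separator set rather than the simplified one shown here.

```python
import re

# Keys checked, in order, for Tier-1 explicit entity lists.
ENTITY_KEYS = ("entities", "extracted_entities", "entity_list", "named_entities")


def highlight_terms(extra: dict, answer: str, cap: int = 12) -> list[str]:
    # Tier 1: explicit entity lists produced by the pipeline's NER step.
    for key in ENTITY_KEYS:
        values = extra.get(key)
        if values:
            return [str(v) for v in values]
    # Tier 2: fallback on unique content words (>= 4 chars) of the answer,
    # longest-first so greedy matching prefers the most specific terms.
    words = re.split(r"[\s,.;:!?]+", answer)  # simplified separator set
    unique = {w for w in words if len(w) >= 4}
    return sorted(unique, key=len, reverse=True)[:cap]
```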
## Output Contract

Every `evaluate()` call returns a dict with the following guaranteed structure:

```python
{
    "score": float,  # Composite GCR in [0.0, 1.0]
    "contract": {
        "backend": "graph_context_relevance",
        "entity_overlap": float,           # S_e in [0.0, 1.0]
        "structural_connectivity": float,  # S_c in [0.0, 1.0]
        "hub_noise_penalty": float,        # P_h in [0.0, 1.0]
        "hub_nodes": list[str],            # hashes of flagged hub nodes
        "largest_component_size": int,
        "retrieved_count": int,            # len(valid hashes)
        "alpha": float,
        "beta": float,
        "gamma": float,
    },
}
```

All float values are rounded to 6 decimal places.
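A consumer-side structural check of this contract might look like the following. The `validate_contract` helper is hypothetical; it only asserts the fields the contract guarantees.

```python
def validate_contract(result: dict) -> None:
    # Minimal consumer-side check of the guaranteed contract fields.
    assert isinstance(result["score"], float) and 0.0 <= result["score"] <= 1.0
    contract = result["contract"]
    assert contract["backend"] == "graph_context_relevance"
    for key in ("entity_overlap", "structural_connectivity",
                "hub_noise_penalty", "alpha", "beta", "gamma"):
        assert isinstance(contract[key], float), key
    assert isinstance(contract["hub_nodes"], list)
    assert isinstance(contract["largest_component_size"], int)
    assert isinstance(contract["retrieved_count"], int)
```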
## Related Files

| File | Role |
|---|---|
| `eval-pipeline/src/evaluation/graph_context_relevance.py` | GCR evaluator implementation |
| `eval-pipeline/src/utils/graph_store.py` | GraphStore protocol + SQLiteGraphStore backend |
| `eval-pipeline/tests/test_graph_context_relevance.py` | 5-fixture TDD test suite |
| `eval-pipeline/tests/test_graph_store.py` | GraphStore unit tests |
| `insights-portal/src/utils/textHighlighter.ts` | Two-tier entity highlighter |
| `insights-portal/src/core/insights/engine.ts` | Deterministic rule-based insights engine |
| `config/pipeline_config.yaml` | `gcr_weights` and threshold configuration |