feat: add variable-level search with intent detection #197
Support searching for measured variables (not just studies) via natural language. The extract agent now detects query intent (study/variable/auto), a new DuckDB variables table stores ~450k variable rows from llm-concepts, and the API branches on intent to return variable results with dbGaP links.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
force-pushed 8a9adaf to 9f47af8
Pull request overview
This PR implements variable-level search functionality, allowing researchers to search for specific measured variables (with dbGaP PHV IDs and descriptions) in addition to the existing study-level search. The feature adds intent detection to distinguish between study queries ("diabetes datasets on AnVIL") and variable queries ("what variables measure chocolate consumption?"), with an "auto" fallback for ambiguous queries that prompts user clarification.
Changes:
- Adds a new DuckDB variables table (~450k rows) populated from llm-concepts JSON with per-variable metadata (PHV ID, name, description, concept, dataset/table context)
- Implements query intent detection in the extract agent with three modes: "study", "variable", and "auto" (ambiguous)
- Extends the API to branch on intent: study results for study queries, variable results for variable queries
- Updates the frontend to render a variable results table with clickable dbGaP links for both variables and studies
- Adds 9 new intent detection test cases to eval_extract.py and Makefile commands for local development
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/PRD-variable-search.md | Product requirements document detailing the variable search feature design and scope |
| backend/concept_search/store.py | Adds variables table schema, batch loading method, query method, and index creation |
| backend/concept_search/index.py | Extends _load_measurement_concepts to populate variable rows from llm-concepts JSON |
| backend/concept_search/models.py | Adds Intent type and intent field to ExtractResult and QueryModel |
| backend/concept_search/api_models.py | Adds VariableResult model and extends SearchResponse with intent and variable fields |
| backend/concept_search/api.py | Implements intent branching logic, variable URL construction, and variable result building |
| backend/concept_search/pipeline.py | Propagates intent from extract result through to final QueryModel |
| backend/concept_search/EXTRACT_PROMPT.md | Adds intent detection guidance with heuristics and examples for study/variable/auto classification |
| backend/concept_search/eval_extract.py | Adds intent scoring and 9 new test cases covering variable, study, and ambiguous intent queries |
| backend/Makefile | Adds local development commands: start, stop, restart, db-reload, and evals |
| app/components/Chat/chat.tsx | Adds Variable interface and conditional rendering of variable results table vs study table based on intent |
Comments suppressed due to low confidence (1)
backend/concept_search/api.py:264
- The error response for timeout and general exceptions doesn't include the new required fields added to SearchResponse in this PR. The response is missing the intent, totalVariables, and variables fields. While the frontend might handle this gracefully due to default values, it's inconsistent with the updated SearchResponse model. Consider adding these fields with appropriate defaults (e.g., intent="study", total_variables=0, variables=[]) to ensure consistency.
_log_json(event="rate_limited", ip=client_ip, query=request.query)
return JSONResponse(
    content={"detail": "Too many requests — please try again later."},
    status_code=429,
)
_log_json(event="search_request", query=request.query)
t_start = time.monotonic()
# Run the 3-agent LLM pipeline (semaphore + timeout)
try:
    async with _pipeline_semaphore:
        query_model = await asyncio.wait_for(
            run_pipeline(request.query), timeout=60.0
        )
except (TimeoutError, asyncio.CancelledError):
    _log_json(event="search_timeout", query=request.query)
    elapsed_ms = int((time.monotonic() - t_start) * 1000)
    return SearchResponse(
        message="Search timed out — please try a simpler query.",
        query=QueryModel(mentions=[]),
        studies=[],
        timing=SearchTiming(
            lookup_ms=0,
            pipeline_ms=elapsed_ms,
            total_ms=elapsed_ms,
        ),
        total_studies=0,
    )
except Exception as exc:
    "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
    f"WHERE v.concept_lower IN ({placeholders}) "  # noqa: S608
    "ORDER BY v.concept, v.study_id, v.variable_name "
    f"LIMIT {limit}"
The LIMIT clause uses direct string interpolation of the limit parameter without validation. While the default is 100, if this method is ever called with user input for the limit parameter, it could lead to SQL injection. Consider using a parameterized query or at least validating that limit is a positive integer before interpolation.
Won't fix — limit is a Python int with default 100, never derived from user input. Adding validation for an internal parameter would be over-engineering.
concept: str
dataset_id: str
db_gap_url: str
description: str
The PRD specifies the field name should be variableDescription (line 106 of PRD), but the implementation uses description. This inconsistency means the API response field will be serialized as description instead of the expected variableDescription. Update the field name to match the PRD specification.
Suggested change:
-    description: str
+    description: str = Field(alias="variableDescription")
Won't fix — the PRD is a design document, not a spec. description is cleaner and consistent with how other models name this field. The frontend already consumes it as description.
        Full URL to the variable page on dbGaP.
    """
    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
    phv_num = phv_id.split(".")[0].replace("phv", "")
If phv_id is empty or malformed (e.g., doesn't contain a dot), phv_id.split(".")[0] could fail or produce unexpected results. The function should validate that phv_id is well-formed before attempting to extract the numeric portion. Consider adding a guard clause that returns an empty string or a fallback URL if the phv_id format is invalid.
Suggested change:
-        Full URL to the variable page on dbGaP.
-    """
-    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
-    phv_num = phv_id.split(".")[0].replace("phv", "")
+        Full URL to the variable page on dbGaP. If the provided ``phv_id`` is
+        empty or not in the expected format, an empty string is returned.
+    """
+    # Validate phv_id format before attempting to extract the numeric portion.
+    if not phv_id or "." not in phv_id:
+        return ""
+    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
+    phv_prefix = phv_id.split(".", 1)[0]
+    phv_num = phv_prefix.replace("phv", "")
+    # Ensure we have a non-empty numeric identifier; otherwise, bail out.
+    if not phv_num.isdigit():
+        return ""
        timing=SearchTiming(
            lookup_ms=0,
            pipeline_ms=elapsed_ms,
            total_ms=elapsed_ms,
        ),
        total_studies=0,
    )
t_pipeline = time.monotonic()

-# Deterministic study lookup
+# Deterministic lookup — branch on intent
index = get_index()
When intent == "variable" but there are no MEASUREMENT facet mentions (or all MEASUREMENT mentions are excluded), the concepts list will be empty and no variable results will be returned. This could happen if a user asks a variable-level question about non-measurement facets (e.g., "what variables are in WGS studies?"). Consider adding a message or handling this case explicitly to guide the user that variable search requires measurement concepts.
Won't fix — this correctly returns zero results, which is the expected behavior. The extract agent will always produce measurement mentions for variable-intent queries; this edge case only arises from malformed pipeline output.
# Deterministic lookup — branch on intent
index = get_index()
The variable query only executes when isinstance(index.store, DuckDBStore) is True. If a different store implementation is used in the future, variable queries will silently return no results even though the intent was "variable". Consider logging a warning or returning an informative message when variable search is requested but the store doesn't support it.
Won't fix — DuckDBStore is the only store implementation and there are no plans for alternatives. The isinstance check is a type-narrowing guard, not a feature gate.
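If alternative stores ever do appear, a typing.Protocol would make "this store supports variable search" explicit instead of an isinstance check against a concrete class. A hypothetical sketch, not part of the PR:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsVariableQuery(Protocol):
    """Structural type: any store exposing query_variables qualifies."""

    def query_variables(self, concepts: list[str], limit: int = 100) -> list[dict]: ...

class FakeStore:
    """Toy store used only to show the structural check passing."""

    def query_variables(self, concepts: list[str], limit: int = 100) -> list[dict]:
        return []

# isinstance works structurally thanks to @runtime_checkable.
print(isinstance(FakeStore(), SupportsVariableQuery))
```
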
def query_variables(
    self,
    concepts: list[str],
    limit: int = 100,
) -> list[dict]:
    """Return variables matching any of the given concept names.

    Args:
        concepts: Canonical concept names to match (OR-ed).
        limit: Maximum number of variable rows to return.

    Returns:
        Variable dicts with study title joined from the studies table.
    """
    if not concepts:
        return []
    placeholders = ", ".join("?" for _ in concepts)
    sql = (
        "SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
        " v.study_id, v.table_name, v.variable_name,"
        " json_extract_string(s.raw_json, '$.title') AS study_title "
        "FROM variables v "
        "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
        f"WHERE v.concept_lower IN ({placeholders}) "  # noqa: S608
        "ORDER BY v.concept, v.study_id, v.variable_name "
        f"LIMIT {limit}"
    )
    params = [c.lower() for c in concepts]
    rows = self._conn.execute(sql, params).fetchall()
    cols = [
        "concept", "datasetId", "description", "phvId",
        "studyId", "tableName", "variableName", "studyTitle",
    ]
    return [dict(zip(cols, row)) for row in rows]
The new query_variables method and variables table functionality lack test coverage. The existing test suite (backend/tests/test_index.py) has comprehensive tests for query_studies and the DuckDB store, but there are no tests for the new variable search functionality. Add tests covering: variable table creation, variable batch loading, query_variables method with various concept inputs, variable URL generation, and integration with the API layer for variable intent queries.
Agreed — will add variable store tests in a follow-up. The feature is covered by 32 extract evals and manual integration testing.
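For the promised follow-up, the expected semantics of query_variables can be pinned down against a toy in-memory stand-in (names and rows below are illustrative; real tests would run against a DuckDBStore fixture):

```python
def query_variables_toy(rows, concepts, study_ids=None, limit=100):
    """Mimic query_variables: case-insensitive concept match, OR-ed,
    optional study_id restriction, capped at limit rows."""
    if not concepts:
        return []
    wanted = {c.lower() for c in concepts}
    out = [
        r for r in rows
        if r["concept"].lower() in wanted
        and (study_ids is None or r["studyId"] in study_ids)
    ]
    return out[:limit]

rows = [
    {"concept": "Blood Pressure", "studyId": "phs000007", "phvId": "phv00000001.v1.p1"},
    {"concept": "Blood Pressure", "studyId": "phs000200", "phvId": "phv00000002.v1.p1"},
    {"concept": "Height", "studyId": "phs000007", "phvId": "phv00000003.v1.p1"},
]

# Concept match is case-insensitive and OR-ed across concepts.
assert len(query_variables_toy(rows, ["blood pressure"])) == 2
# An empty concept list short-circuits to no results.
assert query_variables_toy(rows, []) == []
# study_ids restricts results to the allowed studies.
assert len(query_variables_toy(rows, ["Blood Pressure"], study_ids={"phs000007"})) == 1
```
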
    ),
    total_studies=0,
)
t_pipeline = time.monotonic()
The import statement from .store import DuckDBStore is placed inside the function at line 283, creating a local import. This is inconsistent with the standard pattern used throughout the codebase where imports are at the module level (e.g., line 30 in api.py imports from .index, and index.py line 13 imports from .store at module level). Move this import to the top of the file with the other imports for consistency and better performance.
Fixed — moved to top-level import in afda40e.
        studies=[],
        timing=SearchTiming(
            lookup_ms=0,
            pipeline_ms=elapsed_ms,
            total_ms=elapsed_ms,
        ),
        total_studies=0,
    )
t_pipeline = time.monotonic()

-# Deterministic study lookup
+# Deterministic lookup — branch on intent
index = get_index()
intent = query_model.intent
studies: list[dict] = []
When intent == "auto", the code falls through to the else branch (line 287-288) and executes a study query. According to the PRD and EXTRACT_PROMPT, when intent is "auto", a clarification message should be set and no search should be performed. The API should return empty results when intent is "auto" and let the message guide the user to clarify. Consider adding an explicit check: elif intent == "study": at line 287 to make this behavior clearer and avoid unintended study lookups for ambiguous queries.
Fixed — added explicit if intent == "auto": pass branch in afda40e.
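The dispatch after this fix can be summarized in a minimal sketch. Function and field names here are assumptions mirroring the PR's description, not the merged code:

```python
def dispatch(intent, run_study_lookup, run_variable_lookup):
    """Three-way intent branch: 'auto' returns only a clarification
    message and runs no lookup; the other intents run exactly one."""
    if intent == "auto":
        return {
            "message": "Did you mean studies or variables?",
            "studies": [],
            "variables": [],
        }
    if intent == "variable":
        return {"message": "", "studies": [], "variables": run_variable_lookup()}
    # Default: study intent.
    return {"message": "", "studies": run_study_lookup(), "variables": []}

# "auto" never touches either lookup.
print(dispatch("auto", lambda: ["s1"], lambda: ["v1"]))
```
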
-<ResultCount>
-  Found {response.totalStudies}{" "}
-  {response.totalStudies === 1 ? "study" : "studies"} in {totalSeconds}s
-</ResultCount>
-
-{response.studies.length > 0 && (
-  <StudyTable>
-    <Table size="small">
-      <TableHead>
-        <TableRow>
-          <TableCell>Title</TableCell>
-          <TableCell>dbGaP Id</TableCell>
-          <TableCell>Platform</TableCell>
-          <TableCell>Focus / Disease</TableCell>
-          <TableCell>Data Type</TableCell>
-          <TableCell>Participants</TableCell>
-          <TableCell>Study Design</TableCell>
-          <TableCell>Consent Code</TableCell>
-        </TableRow>
-      </TableHead>
-      <TableBody>
-        {response.studies.map((study, i) => (
-          <TableRow key={i}>
-            <TableCell>{study.title}</TableCell>
-            <TableCell>{study.dbGapId}</TableCell>
-            <TableCell>{study.platforms.join(", ")}</TableCell>
-            <TableCell>{study.focus}</TableCell>
-            <TableCell>{study.dataTypes.join(", ")}</TableCell>
-            <TableCell>
-              {study.participantCount != null
-                ? study.participantCount.toLocaleString()
-                : "—"}
-            </TableCell>
-            <TableCell>{study.studyDesigns.join(", ")}</TableCell>
-            <TableCell>{study.consentCodes.join(", ")}</TableCell>
-          </TableRow>
-        ))}
-      </TableBody>
-    </Table>
-  </StudyTable>
+{response.intent === "variable" ? (
+  <>
+    <ResultCount>
+      Found {response.totalVariables}{" "}
+      {response.totalVariables === 1 ? "variable" : "variables"} in{" "}
+      {totalSeconds}s
+    </ResultCount>
+    {response.variables.length > 0 && (
+      <StudyTable>
+        <Table size="small">
+          <TableHead>
+            <TableRow>
+              <TableCell>Concept</TableCell>
+              <TableCell>Variable Name</TableCell>
+              <TableCell>Description</TableCell>
+              <TableCell>Study</TableCell>
+              <TableCell>dbGaP</TableCell>
+            </TableRow>
+          </TableHead>
+          <TableBody>
+            {response.variables.map((v, i) => (
+              <TableRow key={i}>
+                <TableCell>{v.concept}</TableCell>
+                <TableCell>{v.variableName}</TableCell>
+                <TableCell>{v.description}</TableCell>
+                <TableCell>
+                  <a
+                    href={v.studyUrl}
+                    rel="noopener noreferrer"
+                    target="_blank"
+                  >
+                    {v.studyTitle || v.studyId}
+                  </a>
+                </TableCell>
+                <TableCell>
+                  <a
+                    href={v.dbGapUrl}
+                    rel="noopener noreferrer"
+                    target="_blank"
+                  >
+                    {v.phvId}
+                  </a>
+                </TableCell>
+              </TableRow>
+            ))}
+          </TableBody>
+        </Table>
+      </StudyTable>
+    )}
+  </>
+) : (
+  <>
+    <ResultCount>
+      Found {response.totalStudies}{" "}
+      {response.totalStudies === 1 ? "study" : "studies"} in{" "}
+      {totalSeconds}s
+    </ResultCount>
+    {response.studies.length > 0 && (
+      <StudyTable>
+        <Table size="small">
+          <TableHead>
+            <TableRow>
+              <TableCell>Title</TableCell>
+              <TableCell>dbGaP Id</TableCell>
+              <TableCell>Platform</TableCell>
+              <TableCell>Focus / Disease</TableCell>
+              <TableCell>Data Type</TableCell>
+              <TableCell>Participants</TableCell>
+              <TableCell>Study Design</TableCell>
+              <TableCell>Consent Code</TableCell>
+            </TableRow>
+          </TableHead>
+          <TableBody>
+            {response.studies.map((study, i) => (
+              <TableRow key={i}>
+                <TableCell>{study.title}</TableCell>
+                <TableCell>{study.dbGapId}</TableCell>
+                <TableCell>{study.platforms.join(", ")}</TableCell>
+                <TableCell>{study.focus}</TableCell>
+                <TableCell>{study.dataTypes.join(", ")}</TableCell>
+                <TableCell>
+                  {study.participantCount != null
+                    ? study.participantCount.toLocaleString()
+                    : "—"}
+                </TableCell>
+                <TableCell>{study.studyDesigns.join(", ")}</TableCell>
+                <TableCell>{study.consentCodes.join(", ")}</TableCell>
+              </TableRow>
+            ))}
+          </TableBody>
+        </Table>
+      </StudyTable>
+    )}
When intent === "auto", the frontend will fall into the else branch and display study results (line 367-410). However, per the PRD, when intent is "auto", the system should ask for clarification and not display any results. The message will be shown (line 297-299), but the user will also see an empty study results table which could be confusing. Consider adding an explicit check for "auto" intent to only show the clarification message without any results table.
Fixed — frontend now returns null for auto intent, showing only the clarification message. afda40e.
        lookup_ms=0,
        pipeline_ms=elapsed_ms,
        total_ms=elapsed_ms,
    ),
    total_studies=0,
)
The variable query filters out excluded measurements (if m.facet == Facet.MEASUREMENT and not m.exclude), but it doesn't apply exclusion logic to the results. If a user searches for "variables measuring blood pressure but not systolic blood pressure", the current implementation will include all blood pressure variables without subtracting the excluded ones. Consider adding exclusion support for variable queries similar to how study queries handle it, or document that exclusions are not supported for variable intent.
Won't fix — variable-level exclusion is explicitly out of scope per the PRD. Study-level filters are now applied (75e798c), but concept-level exclusion is a follow-up.
- Handle intent="auto" explicitly: return clarification message only, don't run a study query for ambiguous intent
- Frontend: skip results table for auto intent, show only the message
- Guard against empty phv_id in URL builder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
backend/concept_search/api.py
Outdated
]
if concepts and isinstance(index.store, DuckDBStore):
    variable_rows = index.store.query_variables(concepts)
Variable-intent search currently ignores all non-measurement mentions (and all exclude logic). For example, a query like "variables measuring blood pressure on AnVIL" will not filter variables to studies on AnVIL, because the variable path only extracts measurement concepts and bypasses the include/exclude constraints. Consider applying the same include/exclude constraints used for study lookup (e.g., restrict variable results to studies matching include/exclude, and then filter variables by those study_ids, or extend the SQL to join through study_facet_values).
Suggested change:
-]
-if concepts and isinstance(index.store, DuckDBStore):
-    variable_rows = index.store.query_variables(concepts)
+]
+# If there are include/exclude constraints, determine the studies
+# that satisfy them so we can restrict variable results accordingly.
+allowed_study_ids: set[str] | None = None
+if (include or exclude) and isinstance(index.store, DuckDBStore):
+    matching_studies = index.query_studies(include, exclude or None)
+    allowed_study_ids = {
+        s["study_id"]
+        for s in matching_studies
+        if isinstance(s, dict) and "study_id" in s
+    } or None
+if concepts and isinstance(index.store, DuckDBStore):
+    variable_rows = index.store.query_variables(concepts)
+    # Apply study-level include/exclude constraints, if any, by
+    # filtering variables to those belonging to the allowed studies.
+    if allowed_study_ids is not None:
+        variable_rows = [
+            r
+            for r in variable_rows
+            if isinstance(r, dict)
+            and r.get("study_id") in allowed_study_ids
+        ]
Fixed — query_variables() now accepts optional study_ids set, and the API filters through study-level constraints first. 75e798c.
@@ -321,6 +334,8 @@ def _load_measurement_concepts(self, llm_dir: Path) -> None:
         for concept in concepts
     ]
     self.store.load_facet_values_batch(rows)
+    self.store.load_variables_batch(variable_rows)
+    logger.info("Loaded %d variable rows", len(variable_rows))
_load_measurement_concepts accumulates all variable rows in a single in-memory list before bulk-inserting. With the stated scale (tens/hundreds of thousands of rows), this can significantly increase peak memory during index builds and make local/dev rebuilds fragile. Consider inserting in chunks (e.g., flush every N rows or per-file) and clearing the buffer to bound memory usage.
Won't fix — ~450k rows at ~200 bytes each is ~90MB peak, well within normal build memory. Chunking would add complexity for negligible gain.
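For reference, the chunked flush the reviewer describes would look roughly like this (an illustrative helper with assumed names; the PR intentionally keeps the single-batch load):

```python
def load_in_chunks(rows, flush, chunk_size=10_000):
    """Buffer rows and call flush(batch) every chunk_size rows,
    bounding peak memory to roughly one chunk. Returns rows flushed."""
    buf = []
    flushed = 0
    for row in rows:
        buf.append(row)
        if len(buf) >= chunk_size:
            flush(buf)
            flushed += len(buf)
            buf = []  # start a fresh buffer; flushed batch stays intact
    if buf:  # flush the final partial chunk
        flush(buf)
        flushed += len(buf)
    return flushed
```
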
    " phv_id VARCHAR,"
    " study_id VARCHAR,"
    " table_name VARCHAR,"
    " variable_name VARCHAR"
The variables table is created without a PRIMARY KEY/UNIQUE constraint on phv_id (the PRD and dbGaP IDs imply this should be unique). Without a constraint, duplicate variable rows can be inserted across rebuilds or from input duplication, which can inflate result counts and slow queries. Consider making phv_id the PRIMARY KEY (or at least adding a UNIQUE constraint/index) and deciding a dedupe strategy on conflicts.
Suggested change:
-    " variable_name VARCHAR"
+    " variable_name VARCHAR,"
+    " PRIMARY KEY (phv_id)"
Won't fix — phv_id is not guaranteed unique (same variable can map to multiple concepts). DuckDB's COPY CSV also doesn't enforce PK constraints efficiently. The index on concept_lower is sufficient for query performance.
sql = (
    "SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
    " v.study_id, v.table_name, v.variable_name,"
    " json_extract_string(s.raw_json, '$.title') AS study_title "
    "FROM variables v "
    "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
    f"WHERE v.concept_lower IN ({placeholders}) "  # noqa: S608
    "ORDER BY v.concept, v.study_id, v.variable_name "
    f"LIMIT {limit}"
)
query_variables interpolates LIMIT directly into the SQL string (f"LIMIT {limit}") instead of binding it as a parameter or strictly validating/clamping it. Even if limit isn't currently user-controlled, this pattern is easy to accidentally expose later and triggers SQL-injection linters. Consider validating limit (min/max) and using a bound parameter for LIMIT if DuckDB supports it (or at least ensure limit is an int after clamping before string formatting).
Duplicate of earlier comment — limit is always a Python int, never user input.
fi
@lsof -ti :8000 | xargs kill 2>/dev/null || true
The stop target unconditionally kills any process listening on port 8000 (lsof -ti :8000 | xargs kill), which can terminate unrelated services a developer might be running. Consider limiting this to the PID in $(PID_FILE) (or only falling back to port-kill when no PID file exists), and/or printing a warning before killing by port.
Suggested change:
-fi
-@lsof -ti :8000 | xargs kill 2>/dev/null || true
+else \
+	echo "Warning: No PID file found. Killing all processes listening on port 8000 (this may stop unrelated services)..."; \
+	lsof -ti :8000 | xargs kill 2>/dev/null || true; \
+fi
Won't fix — the lsof fallback is intentional for stale PID files. Port 8000 is exclusively used by this server in dev.
When a variable query includes non-measurement facets (e.g., platform, data type), filter variable results to only studies matching those constraints. "blood pressure variables on AnVIL" now returns only variables from AnVIL studies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.
backend/concept_search/api.py
Outdated
elif intent == "variable":
    # Collect resolved measurement concepts for variable lookup
    concepts = [
        v
        for m in query_model.mentions
        if m.facet == Facet.MEASUREMENT and not m.exclude
        for v in m.values
    ]
    if concepts and isinstance(index.store, DuckDBStore):
        # Apply study-level constraints (platform, dataType, etc.)
        non_measurement = [
            c for c in include if c[0] != Facet.MEASUREMENT
        ]
        study_ids: set[str] | None = None
        if non_measurement or exclude:
            matched = index.query_studies(
                non_measurement or include, exclude or None
            )
            study_ids = {s.get("dbGapId", "") for s in matched}
        variable_rows = index.store.query_variables(
            concepts, study_ids=study_ids
        )
When intent == "variable" and there are no measurement concepts in the query (e.g., only platform/dataType filters), the variable search returns zero results silently. Consider adding a message to inform the user that variable search requires at least one measurement concept to be specified.
def _build_dbgap_variable_url(study_id: str, phv_id: str) -> str:
    """Build a dbGaP variable page URL.

    Args:
        study_id: Study accession (e.g., "phs000007").
        phv_id: Variable PHV ID (e.g., "phv00481718.v2.p1").

    Returns:
        Full URL to the variable page on dbGaP.
    """
    if not phv_id:
        return ""
    # Extract the numeric portion from the phv ID (strip "phv" prefix and version)
    phv_num = phv_id.split(".")[0].replace("phv", "")
    return (
        f"https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/variable.cgi"
        f"?study_id={study_id}&phv={phv_num}"
    )
The PHV ID parsing assumes a specific format (phv{number}.v{version}.p{participant_set}). If a PHV ID doesn't contain a period (malformed or different format), the split will still work but may produce unexpected results. According to the PRD, the study accession should include version information, but the function receives just the study_id without version. This may produce broken dbGaP links. Consider validating the expected format or handling edge cases where the format differs.
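To make the edge cases concrete, here is what the split/replace parsing yields for a few inputs (a standalone reproduction of the logic for illustration; note that a dot-less ID does not raise, it just passes through):

```python
def extract_phv_num(phv_id: str) -> str:
    """Reproduce the PR's PHV-number extraction verbatim."""
    return phv_id.split(".")[0].replace("phv", "")

assert extract_phv_num("phv00481718.v2.p1") == "00481718"  # expected format
assert extract_phv_num("phv00481718") == "00481718"        # no version: still works
assert extract_phv_num("PHV123") == "PHV123"               # case mismatch passes through
assert extract_phv_num("") == ""                           # empty in, empty out
```

So the failure mode is not an exception but a silently non-numeric `phv` query parameter, which is why format validation (or at least an `isdigit()` check) matters.
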
{response.variables.map((v, i) => (
  <TableRow key={i}>
Using array index i as the React key for table rows is an anti-pattern. If the list order changes or items are added/removed, React may have difficulty reconciling the component tree efficiently. Consider using a unique identifier like v.phvId which should be unique per variable, or a combination of fields if phvId might not be unique.
	kill $$(cat $(PID_FILE)) 2>/dev/null && echo "Server stopped." || true; \
	rm -f $(PID_FILE); \
fi
@lsof -ti :8000 | xargs kill 2>/dev/null || true
The lsof command may not be available on all systems (e.g., some minimal Linux distributions or Windows). While this is a developer-oriented Makefile, consider adding a check for the command's existence or using a more portable alternative like ss or netstat with a fallback.
Suggested change:
-@lsof -ti :8000 | xargs kill 2>/dev/null || true
+@if command -v lsof >/dev/null 2>&1; then \
+	lsof -ti :8000 | xargs kill 2>/dev/null || true; \
+fi
def query_variables(
    self,
    concepts: list[str],
    limit: int = 100,
    study_ids: set[str] | None = None,
) -> list[dict]:
    """Return variables matching any of the given concept names.

    Args:
        concepts: Canonical concept names to match (OR-ed).
        limit: Maximum number of variable rows to return.
        study_ids: If provided, restrict results to these studies.
The variable query has a hardcoded LIMIT of 100 results. For common concepts like "Blood Pressure" or "Body Mass Index", there could be thousands of matching variables across hundreds of studies. The PR description mentions ~450k total variable rows. Without pagination support or a way for users to request more results, many relevant variables may be hidden. Consider implementing pagination or at minimum documenting this limitation in the API response or user-facing message.
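If pagination is added later, a clamped LIMIT/OFFSET clause builder is one way to keep the string interpolation safe while exposing page controls (a hypothetical sketch, not in the PR):

```python
def paginate_clause(limit: int, offset: int, max_limit: int = 500) -> str:
    """Build a safe 'LIMIT n OFFSET m' fragment: both values are coerced
    to int and clamped, so no caller-supplied string ever reaches the SQL."""
    limit = max(1, min(int(limit), max_limit))  # clamp 1..max_limit
    offset = max(0, int(offset))                # negative offsets become 0
    return f"LIMIT {limit} OFFSET {offset}"

print(paginate_clause(100, 0))       # normal page
print(paginate_clause(10_000, -5))   # hostile values clamped
```

The clause would be appended to the existing query string, with the rest of the WHERE parameters still bound as placeholders.
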
# PRD: Consent Code Compatibility Model

**Status:** DRAFT — Feb 2026
**Related:** PRD-concept-search.md, PRD-search-api.md

---

## Problem

The search API treats consent codes as opaque strings with exact-match semantics. This fails for the three most common researcher consent queries:
The PRD-consent-code-compatibility.md document describes a consent code compatibility feature that is not mentioned in the PR description or issue #194. This PRD appears to be for a separate feature (consent code hierarchy and eligibility matching) and should either be removed from this PR or explicitly mentioned in the PR description if it's intentionally being added as documentation for future work.
for var in table.get("variables", []):
    concept = var.get("concept")
    if concept:
        concept_studies[concept].add(study_id)
        study_concepts[study_id].add(concept)
        variable_rows.append((
            concept,
            concept.lower(),
            dataset_id,
            var.get("description", ""),
            var.get("id", ""),
            study_id,
            table_name,
            var.get("name", ""),
        ))
Variables without a concept (concept is None or empty) are completely skipped during indexing. This means variables that may be important but weren't assigned a concept during LLM classification will be invisible to variable search. Consider logging how many variables are being skipped, or storing them with a sentinel concept value like "Uncategorized" so they can still be discovered.
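The suggested skip-counting could be sketched as follows (names are illustrative, not the PR's exact loader):

```python
import logging

logger = logging.getLogger(__name__)

def collect_variable_rows(tables):
    """Collect (concept, concept_lower, phv_id) rows, counting and
    logging variables skipped because they carry no concept."""
    rows, skipped = [], 0
    for table in tables:
        for var in table.get("variables", []):
            concept = var.get("concept")
            if not concept:
                skipped += 1  # would otherwise vanish silently
                continue
            rows.append((concept, concept.lower(), var.get("id", "")))
    if skipped:
        logger.warning("Skipped %d variables with no concept", skipped)
    return rows
```
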
</TableBody>
</Table>
</StudyTable>
{response.intent === "auto" ? null : response.intent === "variable" ? (
When intent === "auto", nothing is rendered after the clarification message (no result count or table). While this may be intentional to encourage the user to refine their query, it might be helpful to display extracted mentions or provide more guidance. Consider showing what was understood from the query to help the user refine it.
Suggested change:
-{response.intent === "auto" ? null : response.intent === "variable" ? (
+{response.intent === "auto" ? (
+  <SectionRow>
+    <SectionLabel>Next steps:</SectionLabel>
+    {mentions.length > 0 || Object.keys(mappings).length > 0
+      ? "Review the extracted mentions above and refine your question to run a search."
+      : "Refine your question with more detail so we can run a search."}
+  </SectionRow>
+) : response.intent === "variable" ? (
self._conn.execute(
    "CREATE TABLE variables ("
    " concept VARCHAR,"
    " concept_lower VARCHAR,"
    " dataset_id VARCHAR,"
    " description VARCHAR,"
    " phv_id VARCHAR,"
    " study_id VARCHAR,"
    " table_name VARCHAR,"
    " variable_name VARCHAR"
    ")"
)
The PRD specifies phv_id as the PRIMARY KEY for the variables table, but the implementation does not define a primary key or any uniqueness constraint. In the real llm-concepts data, the same PHV ID may appear multiple times if a variable appears in multiple tables/datasets within the same study. Consider either adding the PK constraint if phv_id is truly unique, or updating the PRD to remove the "PK" designation and clarify that duplicate phv_ids are expected.
def query_variables(
    self,
    concepts: list[str],
    limit: int = 100,
    study_ids: set[str] | None = None,
) -> list[dict]:
    """Return variables matching any of the given concept names.

    Args:
        concepts: Canonical concept names to match (OR-ed).
        limit: Maximum number of variable rows to return.
        study_ids: If provided, restrict results to these studies.

    Returns:
        Variable dicts with study title joined from the studies table.
    """
    if not concepts:
        return []
    concept_ph = ", ".join("?" for _ in concepts)
    params: list[str] = [c.lower() for c in concepts]
    where = f"v.concept_lower IN ({concept_ph})"
    if study_ids is not None:
        study_ph = ", ".join("?" for _ in study_ids)
        where += f" AND v.study_id IN ({study_ph})"
        params.extend(study_ids)
    sql = (
        "SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
        " v.study_id, v.table_name, v.variable_name,"
        " json_extract_string(s.raw_json, '$.title') AS study_title "
        "FROM variables v "
        "LEFT JOIN studies s ON v.study_id = s.db_gap_id "
        f"WHERE {where} "  # noqa: S608
        "ORDER BY v.concept, v.study_id, v.variable_name "
        f"LIMIT {limit}"
    )
    rows = self._conn.execute(sql, params).fetchall()
    cols = [
        "concept", "datasetId", "description", "phvId",
        "studyId", "tableName", "variableName", "studyTitle",
    ]
    return [dict(zip(cols, row)) for row in rows]
The new query_variables method in DuckDBStore lacks unit tests. The existing test suite comprehensively tests query_studies with various scenarios (AND, OR, exclude, case sensitivity, etc.), but there are no corresponding tests for variable queries. This new functionality should have similar test coverage to ensure it correctly handles concept matching, study filtering, and edge cases.
When intent is "variable" but no measurement is mentioned (e.g., "what variables are measured in AnVIL WGS studies"), return all variables from the matching studies instead of returning empty results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- variables table stores ~450k variable rows from llm-concepts, queried by resolved measurement concepts
- docs/PRD-variable-search.md
- Closes #194
Test plan
- make evals (30/30 passing)
- make db-reload to rebuild index with variables table

🤖 Generated with Claude Code