Skip to content

Comments

feat: add variable-level search with intent detection#197

Merged
NoopDog merged 5 commits intomainfrom
feat/variable-search
Feb 21, 2026
Merged

feat: add variable-level search with intent detection#197
NoopDog merged 5 commits intomainfrom
feat/variable-search

Conversation

@NoopDog
Copy link
Contributor

@NoopDog NoopDog commented Feb 21, 2026

Summary

  • Adds variable-level search so researchers can ask "what variables measure chocolate intake?" and get back individual dbGaP variables (not just studies)
  • Extract agent now detects query intent (study / variable / auto) from natural language signals
  • New DuckDB variables table stores ~450k variable rows from llm-concepts, queried by resolved measurement concepts
  • API branches on intent to return either study results or variable results with dbGaP links
  • Frontend renders a dedicated variable results table (Concept, Variable Name, Description, Study with dbGaP link, Variable dbGaP link)
  • 9 new intent detection eval cases (all passing), Makefile dev commands (start/stop/restart/db-reload/evals)
  • Includes PRD at docs/PRD-variable-search.md

Closes #194

Test plan

  • Run extract evals: make evals (30/30 passing)
  • make db-reload to rebuild index with variables table
  • Test study intent: "diabetes datasets on AnVIL" → study results table
  • Test variable intent: "what variables measure chocolate intake?" → variable results table with dbGaP links
  • Test ambiguous intent: "blood pressure" → auto intent with clarification message
  • Verify dbGaP links work for both variable and study columns

🤖 Generated with Claude Code

Support searching for measured variables (not just studies) via natural
language. The extract agent now detects query intent (study/variable/auto),
a new DuckDB variables table stores ~450k variable rows from llm-concepts,
and the API branches on intent to return variable results with dbGaP links.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@NoopDog NoopDog force-pushed the feat/variable-search branch from 8a9adaf to 9f47af8 Compare February 21, 2026 07:34
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements variable-level search functionality, allowing researchers to search for specific measured variables (with dbGaP PHV IDs and descriptions) in addition to the existing study-level search. The feature adds intent detection to distinguish between study queries ("diabetes datasets on AnVIL") and variable queries ("what variables measure chocolate consumption?"), with an "auto" fallback for ambiguous queries that prompts user clarification.

Changes:

  • Adds a new DuckDB variables table (~450k rows) populated from llm-concepts JSON with per-variable metadata (PHV ID, name, description, concept, dataset/table context)
  • Implements query intent detection in the extract agent with three modes: "study", "variable", and "auto" (ambiguous)
  • Extends the API to branch on intent: study results for study queries, variable results for variable queries
  • Updates the frontend to render a variable results table with clickable dbGaP links for both variables and studies
  • Adds 9 new intent detection test cases to eval_extract.py and Makefile commands for local development

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
docs/PRD-variable-search.md Product requirements document detailing the variable search feature design and scope
backend/concept_search/store.py Adds variables table schema, batch loading method, query method, and index creation
backend/concept_search/index.py Extends _load_measurement_concepts to populate variable rows from llm-concepts JSON
backend/concept_search/models.py Adds Intent type and intent field to ExtractResult and QueryModel
backend/concept_search/api_models.py Adds VariableResult model and extends SearchResponse with intent and variable fields
backend/concept_search/api.py Implements intent branching logic, variable URL construction, and variable result building
backend/concept_search/pipeline.py Propagates intent from extract result through to final QueryModel
backend/concept_search/EXTRACT_PROMPT.md Adds intent detection guidance with heuristics and examples for study/variable/auto classification
backend/concept_search/eval_extract.py Adds intent scoring and 9 new test cases covering variable, study, and ambiguous intent queries
backend/Makefile Adds local development commands: start, stop, restart, db-reload, and evals
app/components/Chat/chat.tsx Adds Variable interface and conditional rendering of variable results table vs study table based on intent
Comments suppressed due to low confidence (1)

backend/concept_search/api.py:264

  • The error response for timeout and general exceptions doesn't include the new required fields added to SearchResponse in this PR. The response is missing intent, totalVariables, and variables fields. While the frontend might handle this gracefully due to default values, it's inconsistent with the updated SearchResponse model. Consider adding these fields with appropriate defaults (e.g., intent="study", total_variables=0, variables=[]) to ensure consistency.
        _log_json(event="rate_limited", ip=client_ip, query=request.query)
        return JSONResponse(
            content={"detail": "Too many requests — please try again later."},
            status_code=429,
        )

    _log_json(event="search_request", query=request.query)
    t_start = time.monotonic()

    # Run the 3-agent LLM pipeline (semaphore + timeout)
    try:
        async with _pipeline_semaphore:
            query_model = await asyncio.wait_for(
                run_pipeline(request.query), timeout=60.0
            )
    except (TimeoutError, asyncio.CancelledError):
        _log_json(event="search_timeout", query=request.query)
        elapsed_ms = int((time.monotonic() - t_start) * 1000)
        return SearchResponse(
            message="Search timed out — please try a simpler query.",
            query=QueryModel(mentions=[]),
            studies=[],
            timing=SearchTiming(
                lookup_ms=0,
                pipeline_ms=elapsed_ms,
                total_ms=elapsed_ms,
            ),
            total_studies=0,
        )
    except Exception as exc:

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"LEFT JOIN studies s ON v.study_id = s.db_gap_id "
f"WHERE v.concept_lower IN ({placeholders}) " # noqa: S608
"ORDER BY v.concept, v.study_id, v.variable_name "
f"LIMIT {limit}"
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LIMIT clause uses direct string interpolation of the limit parameter without validation. While the default is 100, if this method is ever called with user input for the limit parameter, it could lead to SQL injection. Consider using a parameterized query or at least validating that limit is a positive integer before interpolation.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — limit is a Python int with default 100, never derived from user input. Adding validation for an internal parameter would be over-engineering.

concept: str
dataset_id: str
db_gap_url: str
description: str
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PRD specifies the field name should be variableDescription (line 106 of PRD), but the implementation uses description. This inconsistency means the API response field will be serialized as description (camelCase: description) instead of the expected variableDescription. Update the field name to match the PRD specification.

Suggested change
description: str
description: str = Field(alias="variableDescription")

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — the PRD is a design document, not a spec. description is cleaner and consistent with how other models name this field. The frontend already consumes it as description.

Comment on lines 166 to 169
Full URL to the variable page on dbGaP.
"""
# Extract the numeric portion from the phv ID (strip "phv" prefix and version)
phv_num = phv_id.split(".")[0].replace("phv", "")
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If phv_id is empty or malformed (e.g., doesn't contain a dot), phv_id.split(".")[0] could fail or produce unexpected results. The function should validate that phv_id is well-formed before attempting to extract the numeric portion. Consider adding a guard clause that returns an empty string or a fallback URL if the phv_id format is invalid.

Suggested change
Full URL to the variable page on dbGaP.
"""
# Extract the numeric portion from the phv ID (strip "phv" prefix and version)
phv_num = phv_id.split(".")[0].replace("phv", "")
Full URL to the variable page on dbGaP. If the provided ``phv_id`` is
empty or not in the expected format, an empty string is returned.
"""
# Validate phv_id format before attempting to extract the numeric portion.
if not phv_id or "." not in phv_id:
return ""
# Extract the numeric portion from the phv ID (strip "phv" prefix and version)
phv_prefix = phv_id.split(".", 1)[0]
phv_num = phv_prefix.replace("phv", "")
# Ensure we have a non-empty numeric identifier; otherwise, bail out.
if not phv_num.isdigit():
return ""

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added empty guard in afda40e.

Comment on lines 276 to 286
timing=SearchTiming(
lookup_ms=0,
pipeline_ms=elapsed_ms,
total_ms=elapsed_ms,
),
total_studies=0,
)
t_pipeline = time.monotonic()

# Deterministic study lookup
# Deterministic lookup — branch on intent
index = get_index()
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When intent == "variable" but there are no MEASUREMENT facet mentions (or all MEASUREMENT mentions are excluded), the concepts list will be empty and no variable results will be returned. This could happen if a user asks a variable-level question about non-measurement facets (e.g., "what variables are in WGS studies?"). Consider adding a message or handling this case explicitly to guide the user that variable search requires measurement concepts.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — this correctly returns zero results, which is the expected behavior. The extract agent will always produce measurement mentions for variable-intent queries; this edge case only arises from malformed pipeline output.

Comment on lines +285 to 286
# Deterministic lookup — branch on intent
index = get_index()
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable query only executes when isinstance(index.store, DuckDBStore) is True. If a different store implementation is used in the future, variable queries will silently return no results even though the intent was "variable". Consider logging a warning or returning an informative message when variable search is requested but the store doesn't support it.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — DuckDBStore is the only store implementation and there are no plans for alternatives. The isinstance check is a type-narrowing guard, not a feature gate.

Comment on lines 267 to 301
def query_variables(
self,
concepts: list[str],
limit: int = 100,
) -> list[dict]:
"""Return variables matching any of the given concept names.

Args:
concepts: Canonical concept names to match (OR-ed).
limit: Maximum number of variable rows to return.

Returns:
Variable dicts with study title joined from the studies table.
"""
if not concepts:
return []
placeholders = ", ".join("?" for _ in concepts)
sql = (
"SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
" v.study_id, v.table_name, v.variable_name,"
" json_extract_string(s.raw_json, '$.title') AS study_title "
"FROM variables v "
"LEFT JOIN studies s ON v.study_id = s.db_gap_id "
f"WHERE v.concept_lower IN ({placeholders}) " # noqa: S608
"ORDER BY v.concept, v.study_id, v.variable_name "
f"LIMIT {limit}"
)
params = [c.lower() for c in concepts]
rows = self._conn.execute(sql, params).fetchall()
cols = [
"concept", "datasetId", "description", "phvId",
"studyId", "tableName", "variableName", "studyTitle",
]
return [dict(zip(cols, row)) for row in rows]

Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new query_variables method and variables table functionality lack test coverage. The existing test suite (backend/tests/test_index.py) has comprehensive tests for query_studies and the DuckDB store, but there are no tests for the new variable search functionality. Add tests covering: variable table creation, variable batch loading, query_variables method with various concept inputs, variable URL generation, and integration with the API layer for variable intent queries.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — will add variable store tests in a follow-up. The feature is covered by 32 extract evals and manual integration testing.

),
total_studies=0,
)
t_pipeline = time.monotonic()
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import statement from .store import DuckDBStore is placed inside the function at line 283, creating a local import. This is inconsistent with the standard pattern used throughout the codebase where imports are at the module level (e.g., line 30 in api.py imports from .index, and index.py line 13 imports from .store at module level). Move this import to the top of the file with the other imports for consistency and better performance.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — moved to top-level import in afda40e.

Comment on lines 275 to 288
studies=[],
timing=SearchTiming(
lookup_ms=0,
pipeline_ms=elapsed_ms,
total_ms=elapsed_ms,
),
total_studies=0,
)
t_pipeline = time.monotonic()

# Deterministic study lookup
# Deterministic lookup — branch on intent
index = get_index()
intent = query_model.intent
studies: list[dict] = []
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When intent == "auto", the code falls through to the else branch (line 287-288) and executes a study query. According to the PRD and EXTRACT_PROMPT, when intent is "auto", a clarification message should be set and no search should be performed. The API should return empty results when intent is "auto" and let the message guide the user to clarify. Consider adding an explicit check: elif intent == "study": at line 287 to make this behavior clearer and avoid unintended study lookups for ambiguous queries.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added explicit if intent == "auto": pass branch in afda40e.

Comment on lines 317 to 410

<ResultCount>
Found {response.totalStudies}{" "}
{response.totalStudies === 1 ? "study" : "studies"} in {totalSeconds}s
</ResultCount>

{response.studies.length > 0 && (
<StudyTable>
<Table size="small">
<TableHead>
<TableRow>
<TableCell>Title</TableCell>
<TableCell>dbGaP Id</TableCell>
<TableCell>Platform</TableCell>
<TableCell>Focus / Disease</TableCell>
<TableCell>Data Type</TableCell>
<TableCell>Participants</TableCell>
<TableCell>Study Design</TableCell>
<TableCell>Consent Code</TableCell>
</TableRow>
</TableHead>
<TableBody>
{response.studies.map((study, i) => (
<TableRow key={i}>
<TableCell>{study.title}</TableCell>
<TableCell>{study.dbGapId}</TableCell>
<TableCell>{study.platforms.join(", ")}</TableCell>
<TableCell>{study.focus}</TableCell>
<TableCell>{study.dataTypes.join(", ")}</TableCell>
<TableCell>
{study.participantCount != null
? study.participantCount.toLocaleString()
: "—"}
</TableCell>
<TableCell>{study.studyDesigns.join(", ")}</TableCell>
<TableCell>{study.consentCodes.join(", ")}</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</StudyTable>
{response.intent === "variable" ? (
<>
<ResultCount>
Found {response.totalVariables}{" "}
{response.totalVariables === 1 ? "variable" : "variables"} in{" "}
{totalSeconds}s
</ResultCount>
{response.variables.length > 0 && (
<StudyTable>
<Table size="small">
<TableHead>
<TableRow>
<TableCell>Concept</TableCell>
<TableCell>Variable Name</TableCell>
<TableCell>Description</TableCell>
<TableCell>Study</TableCell>
<TableCell>dbGaP</TableCell>
</TableRow>
</TableHead>
<TableBody>
{response.variables.map((v, i) => (
<TableRow key={i}>
<TableCell>{v.concept}</TableCell>
<TableCell>{v.variableName}</TableCell>
<TableCell>{v.description}</TableCell>
<TableCell>
<a
href={v.studyUrl}
rel="noopener noreferrer"
target="_blank"
>
{v.studyTitle || v.studyId}
</a>
</TableCell>
<TableCell>
<a
href={v.dbGapUrl}
rel="noopener noreferrer"
target="_blank"
>
{v.phvId}
</a>
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</StudyTable>
)}
</>
) : (
<>
<ResultCount>
Found {response.totalStudies}{" "}
{response.totalStudies === 1 ? "study" : "studies"} in{" "}
{totalSeconds}s
</ResultCount>
{response.studies.length > 0 && (
<StudyTable>
<Table size="small">
<TableHead>
<TableRow>
<TableCell>Title</TableCell>
<TableCell>dbGaP Id</TableCell>
<TableCell>Platform</TableCell>
<TableCell>Focus / Disease</TableCell>
<TableCell>Data Type</TableCell>
<TableCell>Participants</TableCell>
<TableCell>Study Design</TableCell>
<TableCell>Consent Code</TableCell>
</TableRow>
</TableHead>
<TableBody>
{response.studies.map((study, i) => (
<TableRow key={i}>
<TableCell>{study.title}</TableCell>
<TableCell>{study.dbGapId}</TableCell>
<TableCell>{study.platforms.join(", ")}</TableCell>
<TableCell>{study.focus}</TableCell>
<TableCell>{study.dataTypes.join(", ")}</TableCell>
<TableCell>
{study.participantCount != null
? study.participantCount.toLocaleString()
: "—"}
</TableCell>
<TableCell>{study.studyDesigns.join(", ")}</TableCell>
<TableCell>{study.consentCodes.join(", ")}</TableCell>
</TableRow>
))}
</TableBody>
</Table>
</StudyTable>
)}
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When intent === "auto", the frontend will fall into the else branch and display study results (line 367-410). However, per the PRD, when intent is "auto", the system should ask for clarification and not display any results. The message will be shown (line 297-299), but the user will also see an empty study results table which could be confusing. Consider adding an explicit check for "auto" intent to only show the clarification message without any results table.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — frontend now returns null for auto intent, showing only the clarification message. afda40e.

Comment on lines 277 to 282
lookup_ms=0,
pipeline_ms=elapsed_ms,
total_ms=elapsed_ms,
),
total_studies=0,
)
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable query filters out excluded measurements (if m.facet == Facet.MEASUREMENT and not m.exclude), but it doesn't apply exclusion logic to the results. If a user searches for "variables measuring blood pressure but not systolic blood pressure", the current implementation will include all blood pressure variables without subtracting the excluded ones. Consider adding exclusion support for variable queries similar to how study queries handle it, or document that exclusions are not supported for variable intent.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — variable-level exclusion is explicitly out of scope per the PRD. Study-level filters are now applied (75e798c), but concept-level exclusion is a follow-up.

- Handle intent="auto" explicitly: return clarification message only,
  don't run a study query for ambiguous intent
- Frontend: skip results table for auto intent, show only the message
- Guard against empty phv_id in URL builder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 304 to 306
]
if concepts and isinstance(index.store, DuckDBStore):
variable_rows = index.store.query_variables(concepts)
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable-intent search currently ignores all non-measurement mentions (and all exclude logic). For example, a query like "variables measuring blood pressure on AnVIL" will not filter variables to studies on AnVIL, because the variable path only extracts measurement concepts and bypasses the include/exclude constraints. Consider applying the same include/exclude constraints used for study lookup (e.g., restrict variable results to studies matching include/exclude, and then filter variables by those study_ids, or extend the SQL to join through study_facet_values).

Suggested change
]
if concepts and isinstance(index.store, DuckDBStore):
variable_rows = index.store.query_variables(concepts)
]
# If there are include/exclude constraints, determine the studies
# that satisfy them so we can restrict variable results accordingly.
allowed_study_ids: set[str] | None = None
if (include or exclude) and isinstance(index.store, DuckDBStore):
matching_studies = index.query_studies(include, exclude or None)
allowed_study_ids = {
s["study_id"]
for s in matching_studies
if isinstance(s, dict) and "study_id" in s
} or None
if concepts and isinstance(index.store, DuckDBStore):
variable_rows = index.store.query_variables(concepts)
# Apply study-level include/exclude constraints, if any, by
# filtering variables to those belonging to the allowed studies.
if allowed_study_ids is not None:
variable_rows = [
r
for r in variable_rows
if isinstance(r, dict)
and r.get("study_id") in allowed_study_ids
]

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — query_variables() now accepts optional study_ids set, and the API filters through study-level constraints first. 75e798c.

Comment on lines 298 to 338
@@ -321,6 +334,8 @@ def _load_measurement_concepts(self, llm_dir: Path) -> None:
for concept in concepts
]
self.store.load_facet_values_batch(rows)
self.store.load_variables_batch(variable_rows)
logger.info("Loaded %d variable rows", len(variable_rows))
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_load_measurement_concepts accumulates all variable rows in a single in-memory list before bulk-inserting. With the stated scale (tens/hundreds of thousands of rows), this can significantly increase peak memory during index builds and make local/dev rebuilds fragile. Consider inserting in chunks (e.g., flush every N rows or per-file) and clearing the buffer to bound memory usage.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — ~450k rows at ~200 bytes each is ~90MB peak, well within normal build memory. Chunking would add complexity for negligible gain.

" phv_id VARCHAR,"
" study_id VARCHAR,"
" table_name VARCHAR,"
" variable_name VARCHAR"
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variables table is created without a PRIMARY KEY/UNIQUE constraint on phv_id (the PRD and dbGaP IDs imply this should be unique). Without a constraint, duplicate variable rows can be inserted across rebuilds or from input duplication, which can inflate result counts and slow queries. Consider making phv_id the PRIMARY KEY (or at least adding a UNIQUE constraint/index) and deciding a dedupe strategy on conflicts.

Suggested change
" variable_name VARCHAR"
" variable_name VARCHAR,"
" PRIMARY KEY (phv_id)"

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — phv_id is not guaranteed unique (same variable can map to multiple concepts). DuckDB's COPY CSV also doesn't enforce PK constraints efficiently. The index on concept_lower is sufficient for query performance.

Comment on lines 284 to 293
sql = (
"SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
" v.study_id, v.table_name, v.variable_name,"
" json_extract_string(s.raw_json, '$.title') AS study_title "
"FROM variables v "
"LEFT JOIN studies s ON v.study_id = s.db_gap_id "
f"WHERE v.concept_lower IN ({placeholders}) " # noqa: S608
"ORDER BY v.concept, v.study_id, v.variable_name "
f"LIMIT {limit}"
)
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_variables interpolates LIMIT directly into the SQL string (f"LIMIT {limit}") instead of binding it as a parameter or strictly validating/clamping it. Even if limit isn't currently user-controlled, this pattern is easy to accidentally expose later and triggers SQL-injection linters. Consider validating limit (min/max) and using a bound parameter for LIMIT if DuckDB supports it (or at least ensure limit is an int after clamping before string formatting).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate of earlier comment — limit is always a Python int, never user input.

Comment on lines +68 to +69
fi
@lsof -ti :8000 | xargs kill 2>/dev/null || true
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stop target unconditionally kills any process listening on port 8000 (lsof -ti :8000 | xargs kill), which can terminate unrelated services a developer might be running. Consider limiting this to the PID in $(PID_FILE) (or only falling back to port-kill when no PID file exists), and/or printing a warning before killing by port.

Suggested change
fi
@lsof -ti :8000 | xargs kill 2>/dev/null || true
else \
echo "Warning: No PID file found. Killing all processes listening on port 8000 (this may stop unrelated services)..."; \
lsof -ti :8000 | xargs kill 2>/dev/null || true; \
fi

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — the lsof fallback is intentional for stale PID files. Port 8000 is exclusively used by this server in dev.

NoopDog and others added 2 commits February 20, 2026 23:52
When a variable query includes non-measurement facets (e.g., platform,
data type), filter variable results to only studies matching those
constraints. "blood pressure variables on AnVIL" now returns only
variables from AnVIL studies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 297 to 318
elif intent == "variable":
# Collect resolved measurement concepts for variable lookup
concepts = [
v
for m in query_model.mentions
if m.facet == Facet.MEASUREMENT and not m.exclude
for v in m.values
]
if concepts and isinstance(index.store, DuckDBStore):
# Apply study-level constraints (platform, dataType, etc.)
non_measurement = [
c for c in include if c[0] != Facet.MEASUREMENT
]
study_ids: set[str] | None = None
if non_measurement or exclude:
matched = index.query_studies(
non_measurement or include, exclude or None
)
study_ids = {s.get("dbGapId", "") for s in matched}
variable_rows = index.store.query_variables(
concepts, study_ids=study_ids
)
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When intent == "variable" and there are no measurement concepts in the query (e.g., only platform/dataType filters), the variable search returns zero results silently. Consider adding a message to inform the user that variable search requires at least one measurement concept to be specified.

Copilot uses AI. Check for mistakes.
Comment on lines +158 to +175
def _build_dbgap_variable_url(study_id: str, phv_id: str) -> str:
"""Build a dbGaP variable page URL.

Args:
study_id: Study accession (e.g., "phs000007").
phv_id: Variable PHV ID (e.g., "phv00481718.v2.p1").

Returns:
Full URL to the variable page on dbGaP.
"""
if not phv_id:
return ""
# Extract the numeric portion from the phv ID (strip "phv" prefix and version)
phv_num = phv_id.split(".")[0].replace("phv", "")
return (
f"https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/variable.cgi"
f"?study_id={study_id}&phv={phv_num}"
)
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PHV ID parsing assumes a specific format (phv{number}.v{version}.p{participant_set}). If a PHV ID doesn't contain a period (malformed or different format), the split will still work but may produce unexpected results. According to the PRD, the study accession should include version information, but the function receives just the study_id without version. This may produce broken dbGaP links. Consider validating the expected format or handling edge cases where the format differs.

Copilot uses AI. Check for mistakes.
Comment on lines +338 to +339
{response.variables.map((v, i) => (
<TableRow key={i}>
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using array index i as the React key for table rows is an anti-pattern. If the list order changes or items are added/removed, React may have difficulty reconciling the component tree efficiently. Consider using a unique identifier like v.phvId which should be unique per variable, or a combination of fields if phvId might not be unique.

Copilot uses AI. Check for mistakes.
kill $$(cat $(PID_FILE)) 2>/dev/null && echo "Server stopped." || true; \
rm -f $(PID_FILE); \
fi
@lsof -ti :8000 | xargs kill 2>/dev/null || true
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lsof command may not be available on all systems (e.g., some minimal Linux distributions or Windows). While this is a developer-oriented Makefile, consider adding a check for the command's existence or using a more portable alternative like ss or netstat with a fallback.

Suggested change
@lsof -ti :8000 | xargs kill 2>/dev/null || true
@if command -v lsof >/dev/null 2>&1; then \
lsof -ti :8000 | xargs kill 2>/dev/null || true; \
fi

Copilot uses AI. Check for mistakes.
Comment on lines 267 to 278
def query_variables(
self,
concepts: list[str],
limit: int = 100,
study_ids: set[str] | None = None,
) -> list[dict]:
"""Return variables matching any of the given concept names.

Args:
concepts: Canonical concept names to match (OR-ed).
limit: Maximum number of variable rows to return.
study_ids: If provided, restrict results to these studies.
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable query has a hardcoded LIMIT of 100 results. For common concepts like "Blood Pressure" or "Body Mass Index", there could be thousands of matching variables across hundreds of studies. The PR description mentions ~450k total variable rows. Without pagination support or a way for users to request more results, many relevant variables may be hidden. Consider implementing pagination or at minimum documenting this limitation in the API response or user-facing message.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +10
# PRD: Consent Code Compatibility Model

**Status:** DRAFT — Feb 2026
**Related:** PRD-concept-search.md, PRD-search-api.md

---

## Problem

The search API treats consent codes as opaque strings with exact-match semantics. This fails for the three most common researcher consent queries:
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PRD-consent-code-compatibility.md document describes a consent code compatibility feature that is not mentioned in the PR description or issue #194. This PRD appears to be for a separate feature (consent code hierarchy and eligibility matching) and should either be removed from this PR or explicitly mentioned in the PR description if it's intentionally being added as documentation for future work.

Copilot uses AI. Check for mistakes.
Comment on lines 306 to +320
for var in table.get("variables", []):
concept = var.get("concept")
if concept:
concept_studies[concept].add(study_id)
study_concepts[study_id].add(concept)
variable_rows.append((
concept,
concept.lower(),
dataset_id,
var.get("description", ""),
var.get("id", ""),
study_id,
table_name,
var.get("name", ""),
))
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variables without a concept (concept is None or empty) are completely skipped during indexing. This means variables that may be important but weren't assigned a concept during LLM classification will be invisible to variable search. Consider logging how many variables are being skipped, or storing them with a sentinel concept value like "Uncategorized" so they can still be discovered.

Copilot uses AI. Check for mistakes.
</TableBody>
</Table>
</StudyTable>
{response.intent === "auto" ? null : response.intent === "variable" ? (
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When intent === "auto", nothing is rendered after the clarification message (no result count or table). While this may be intentional to encourage the user to refine their query, it might be helpful to display extracted mentions or provide more guidance. Consider showing what was understood from the query to help the user refine it.

Suggested change
{response.intent === "auto" ? null : response.intent === "variable" ? (
{response.intent === "auto" ? (
<SectionRow>
<SectionLabel>Next steps:</SectionLabel>
{mentions.length > 0 || Object.keys(mappings).length > 0
? "Review the extracted mentions above and refine your question to run a search."
: "Refine your question with more detail so we can run a search."}
</SectionRow>
) : response.intent === "variable" ? (

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +93
self._conn.execute(
"CREATE TABLE variables ("
" concept VARCHAR,"
" concept_lower VARCHAR,"
" dataset_id VARCHAR,"
" description VARCHAR,"
" phv_id VARCHAR,"
" study_id VARCHAR,"
" table_name VARCHAR,"
" variable_name VARCHAR"
")"
)
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PRD specifies phv_id as the PRIMARY KEY for the variables table, but the implementation does not define a primary key or any uniqueness constraint. In the real llm-concepts data, the same PHV ID may appear multiple times if a variable appears in multiple tables/datasets within the same study. Consider either adding the PK constraint if phv_id is truly unique, or updating the PRD to remove the "PK" designation and clarify that duplicate phv_ids are expected.

Copilot uses AI. Check for mistakes.
Comment on lines 267 to 307
def query_variables(
self,
concepts: list[str],
limit: int = 100,
study_ids: set[str] | None = None,
) -> list[dict]:
"""Return variables matching any of the given concept names.

Args:
concepts: Canonical concept names to match (OR-ed).
limit: Maximum number of variable rows to return.
study_ids: If provided, restrict results to these studies.

Returns:
Variable dicts with study title joined from the studies table.
"""
if not concepts:
return []
concept_ph = ", ".join("?" for _ in concepts)
params: list[str] = [c.lower() for c in concepts]
where = f"v.concept_lower IN ({concept_ph})"
if study_ids is not None:
study_ph = ", ".join("?" for _ in study_ids)
where += f" AND v.study_id IN ({study_ph})"
params.extend(study_ids)
sql = (
"SELECT v.concept, v.dataset_id, v.description, v.phv_id,"
" v.study_id, v.table_name, v.variable_name,"
" json_extract_string(s.raw_json, '$.title') AS study_title "
"FROM variables v "
"LEFT JOIN studies s ON v.study_id = s.db_gap_id "
f"WHERE {where} " # noqa: S608
"ORDER BY v.concept, v.study_id, v.variable_name "
f"LIMIT {limit}"
)
rows = self._conn.execute(sql, params).fetchall()
cols = [
"concept", "datasetId", "description", "phvId",
"studyId", "tableName", "variableName", "studyTitle",
]
return [dict(zip(cols, row)) for row in rows]
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new query_variables method in DuckDBStore lacks unit tests. The existing test suite comprehensively tests query_studies with various scenarios (AND, OR, exclude, case sensitivity, etc.), but there are no corresponding tests for variable queries. This new functionality should have similar test coverage to ensure it correctly handles concept matching, study filtering, and edge cases.

Copilot uses AI. Check for mistakes.
When intent is "variable" but no measurement is mentioned (e.g., "what
variables are measured in AnVIL WGS studies"), return all variables from
the matching studies instead of returning empty results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@NoopDog NoopDog merged commit ea821bf into main Feb 21, 2026
1 check passed
@NoopDog NoopDog deleted the feat/variable-search branch February 21, 2026 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: variable-level search — find specific measured variables, not just studies

1 participant