feat!: Add multi-dataset query support (resolves #1933). #1992

Open
junhaoliao wants to merge 17 commits into y-scope:main from junhaoliao:multi-dataset

Conversation

@junhaoliao
Member

@junhaoliao junhaoliao commented Feb 14, 2026

Description

Add multi-dataset query support, allowing users to search across multiple datasets in a single query.
This is a breaking change — the dataset field in the query job config is replaced with datasets: list[str] (Python) / Vec<String> (Rust) / Type.Array(Type.String()) (TypeScript) across all query entry points (Web UI, API server, CLI).
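
To illustrate the shape of the change, here is a minimal Python sketch of the new field based on the description above; it is not the exact code in job_config.py, and the surrounding fields are omitted.

from pydantic import BaseModel


class QueryJobConfig(BaseModel):
    """Illustrative only; mirrors the datasets field described in this PR."""

    # Before this PR: dataset: str | None (a single dataset).
    # After this PR: one or more dataset names to query in a single job.
    datasets: list[str]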

Impact Assessment

  • What is affected: All query entry points (Web UI, Rust API server, CLI), the query scheduler, query workers, C++ clp-s binary output handler, and result rendering in the Web UI.
  • Why the change is needed: Users need to search across multiple datasets simultaneously without submitting separate queries for each dataset. This enables cross-dataset log correlation and analysis workflows.
  • Implications: This is a breaking change. The dataset field is removed from the query job config in favour of datasets. Existing API clients, stored queries, and CLI scripts that use --dataset must be updated to use --datasets. The scheduler enforces a configurable max_datasets_per_query limit to prevent resource exhaustion.

Key changes by layer:

  1. Job config schema — QueryJobConfig.dataset → QueryJobConfig.datasets in Python, Rust, and
    TypeScript.

    • components/job-orchestration/job_orchestration/scheduler/job_config.py
    • components/clp-py-utils/clp_py_utils/clp_config.py
    • components/clp-rust-utils/src/job_config/search.rs
    • components/webui/common/src/schemas/search.ts
    • Code owners: @hoophalab, @LinZhihao-723
  2. Configuration — Add max_datasets_per_query (default 10, null for unlimited) to
    QueryScheduler config in clp-config.yaml.

    • components/clp-py-utils/clp_py_utils/clp_config.py
    • components/package-template/src/etc/clp-config.template.text.yaml
    • components/package-template/src/etc/clp-config.template.json.yaml
    • Code owners: @sitaowang1998
  3. Scheduler fan-out — query_scheduler.py loops over all requested datasets, fetches archives
    from each dataset's metadata table, merge-sorts by begin_timestamp, validates dataset existence
    and count limits, and passes dataset as an explicit Celery task argument so each worker knows
    which dataset it is querying (a simplified sketch follows after this list).

  4. C++ core results attribution — Add --dataset CLI flag to clp-s and include a dataset
    field in each BSON result document written to the results cache, enabling per-result dataset
    attribution.

    • components/core/src/clp_s/CommandLineArguments.cpp
    • components/core/src/clp_s/CommandLineArguments.hpp
    • components/core/src/clp_s/OutputHandlerImpl.cpp
    • components/core/src/clp_s/OutputHandlerImpl.hpp
    • components/core/src/clp_s/archive_constants.hpp
    • components/core/src/clp_s/clp-s.cpp
    • Code owners: @gibber9809
  5. Worker execution — fs_search_task.py and extract_stream_task.py accept dataset as an
    explicit Celery task parameter and pass it to the clp-s binary via --dataset.

    • components/job-orchestration/job_orchestration/executor/query/fs_search_task.py
    • components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py
    • Code owners: @gibber9809, @sitaowang1998
  6. Web UI client & Web UI server — The dataset selector changes from single-select to multi-select (selectDataset: string | null → selectDatasets: string[]). LogViewerLink reads dataset from each search result document instead of the global cached state. The "All Time" range uses UNION ALL across multiple datasets. The multi-select dropdown uses responsive tag collapsing and defaults to the "default" dataset when the selection is emptied. Updated Fastify routes to accept a datasets array.

    • components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.module.css
    • components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/Native/SearchButton/SubmitButton/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/Presto/Guided/presto-guided-search-requests.ts
    • components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts
    • components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/utils.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/Message/LogViewerLink.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/Message/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/SearchResultsVirtualTable/typings.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTimeline/Native/NativeResultsTimeline.tsx
    • components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchState/index.tsx
    • components/webui/server/src/routes/api/search/index.ts
    • components/webui/common/src/schemas/search.ts
    • Code owners: @hoophalab, @davemarco
  7. API server — Updated routes to accept a datasets array.

  8. CLI tools — --dataset → --datasets with nargs="+". validate_dataset_exists →
    validate_datasets_exist to validate all datasets before submission.

    • components/clp-package-utils/clp_package_utils/scripts/search.py
    • components/clp-package-utils/clp_package_utils/scripts/native/search.py
    • components/clp-package-utils/clp_package_utils/scripts/native/utils.py
    • Code owners: @gibber9809, @hoophalab
  9. Docs — Updated API server guide, quick-start guide, and regenerated OpenAPI spec to reflect
    datasets (plural).

    • docs/src/_static/generated/api-server-openapi.json
    • docs/src/user-docs/guides-using-the-api-server.md
    • docs/src/user-docs/quick-start/clp-json.md
    • Code owners: @hoophalab, @kirkrodrigues
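
The following is the simplified Python sketch of the scheduler fan-out referenced in item 3. It is not the actual query_scheduler.py code: the helper names and signatures are assumptions, and the real scheduler also applies timestamp filters when building the WHERE clause.

def build_archives_query(table_prefix: str, datasets: list[str], where_clause: str) -> str:
    """Builds one UNION ALL query over each dataset's archives metadata table."""
    # Dataset names are validated against the existing datasets before this point.
    union_parts = [
        f"SELECT id AS archive_id, end_timestamp, '{ds}' AS dataset"
        f" FROM {table_prefix}{ds}_archives{where_clause}"
        for ds in datasets
    ]
    return " UNION ALL ".join(union_parts) + " ORDER BY end_timestamp DESC"


def check_dataset_limit(datasets: list[str], max_datasets_per_query: int | None) -> None:
    """Enforces the configurable per-query dataset limit (None means unlimited)."""
    if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
        raise ValueError(
            f"{len(datasets)} datasets requested, but the limit is {max_datasets_per_query}."
        )

Each row returned by the UNION ALL carries its originating dataset name, which is what lets the scheduler pass dataset as an explicit argument to each Celery task.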

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

0. Build CLP package

Task: Verify the project builds successfully with all changes (C++, Rust, Python, TypeScript).

Command:

task

Output:

...
#30 exporting manifest list sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863
#30 exporting manifest list sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863 0.0s done
#30 naming to moby-dangling@sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863 done
#30 unpacking to moby-dangling@sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863
#30 unpacking to moby-dangling@sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863 2.2s done
#30 DONE 7.7s
task: [package] echo '0.9.1-dev' > '/home/junhao/workspace/5-clp/build/clp-package/VERSION'

Build completed successfully — all C++ targets, Rust binaries, Python packages, and TypeScript
client compiled without errors.

1. Start CLP and compress sample data

Task: Start CLP and run a compression job to populate the default dataset with data for
subsequent search tests.

Commands:

cd build/clp-package
./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl

Output:

...
2026-02-14T07:33:48.292 INFO [controller] Started CLP.

2026-02-14T07:33:53.291 INFO [compress] Compression job 1 submitted.
2026-02-14T07:33:55.295 INFO [compress] Compressed 392.84MB into 9.94MB (39.53x). Speed: 204.49MB/s.
2026-02-14T07:33:55.795 INFO [compress] Compression finished.
2026-02-14T07:33:55.796 INFO [compress] Compressed 392.84MB into 9.94MB (39.53x). Speed: 182.59MB/s.

All containers started and passed health checks. Compression succeeded.

2. Search with --datasets flag (single dataset)

Task: Verify the new --datasets CLI flag works for a single dataset with count aggregation.

Command:

./sbin/search.sh "*" --datasets default --count

Output:

tags: [] count: 1000000

Explanation: The search correctly targets the default dataset using the new --datasets flag
and returns the expected 1,000,000 log events.

3. Search without --datasets flag (defaults to default)

Task: Verify that omitting the --datasets flag defaults to querying the default dataset.

Command:

./sbin/search.sh "*" --count

Output:

tags: [] count: 1000000

Explanation: Without --datasets, the CLI defaults to ["default"], producing the same result
as the explicit dataset selection in step 2.
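
For reference, a minimal argparse sketch of a repeatable --datasets flag that falls back to ["default"] when omitted (illustrative only; the flag wiring in the real search.py scripts may differ):

import argparse

parser = argparse.ArgumentParser(description="Submit a search job.")
parser.add_argument("query", help="The query string.")
parser.add_argument(
    "--datasets",
    nargs="+",
    default=["default"],
    help="Dataset(s) to search; defaults to the `default` dataset.",
)
parser.add_argument("--count", action="store_true", help="Only report the match count.")

args = parser.parse_args(["*", "--count"])
print(args.datasets)  # ['default'], matching the behaviour shown above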

4. Search with non-existent dataset (validation)

Task: Verify that specifying a non-existent dataset produces a clear error and exits with a
non-zero status code.

Command:

./sbin/search.sh "*" --datasets nonexistent --count

Output:

2026-02-14T07:34:22.140 ERROR [search] Dataset `nonexistent` doesn't exist.
2026-02-14T07:34:22.253 ERROR [search] Search failed.

Explanation: The validate_datasets_exist function correctly rejects the non-existent dataset
before submitting the query job, preventing wasted scheduler/worker resources.
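
A Python sketch of the kind of check validate_datasets_exist performs (the signature here is a simplification; the real helper in native/utils.py reads existing dataset names from the CLP metadata database):

def validate_datasets_exist(requested: list[str], existing: set[str]) -> None:
    """Rejects unknown datasets before the query job is submitted."""
    for dataset in requested:
        if dataset not in existing:
            raise ValueError(f"Dataset `{dataset}` doesn't exist.")


# Mirrors the failure shown in this validation step:
try:
    validate_datasets_exist(["nonexistent"], {"default"})
except ValueError as err:
    print(f"ERROR [search] {err}")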

5. Web UI: Dataset multi-select rendering

Task: Verify the dataset selector renders correctly in single-select and multi-select states without
vertical overflow.

Method: Opened http://localhost:8080/search and tested the dataset selector.

Single dataset (1 selected):

  • The selected tag (anything ×) renders in full — no collapsed + 1 ... indicator.
  • The selector stays at its natural content width.

Multiple datasets (3 selected):

  • The selector expands to a fixed 200px width.
  • Shows the first tag that fits (anything ×) with a + 2 ... indicator for the remaining.
  • All content stays in a single row — no vertical overflow.

Explanation: maxTagCount="responsive" is conditionally applied only when more than one dataset is
selected. With a single dataset, responsive collapsing is disabled so the tag always renders in full.
When multiple datasets are selected, the container expands to 200px (via the selectContainerExpanded
CSS class) giving rc-overflow a known width to measure against for responsive tag collapsing.

6. Stop CLP

Command:

./sbin/stop-clp.sh

Output:

...
2026-02-14T07:34:48.274 INFO [controller] Stopped CLP.

Summary by CodeRabbit

  • New Features

    • Search now supports selecting and querying multiple datasets at once.
    • Added configurable per-query dataset limit (default 10).
  • UI

    • Dataset selector updated to multi-select with improved layout, fallback and selection limits.
    • Search submission, results timeline and related controls adapted for multi-dataset workflows.
  • API & CLI

    • API request payloads use a datasets array instead of a single dataset.
    • CLI accepts repeated --datasets entries for scoping queries.
  • Documentation

    • Examples and quick-start updated to show datasets array usage.

@coderabbitai
Contributor

coderabbitai bot commented Feb 14, 2026

Walkthrough

Replaces single dataset (string/null) with datasets (array) across APIs, job models, scheduler, CLP tooling, core results path, executor tasks, and Web UI; adds per-query dataset limit (config + UI setting) and threads dataset metadata through results and CLI invocations.

Changes

  • API & Job Models — components/api-server/src/client.rs, components/api-server/src/routes.rs, components/clp-rust-utils/src/job_config/search.rs, components/webui/common/src/schemas/search.ts
    Replaced single dataset with plural datasets (Option<Vec> / Type.Array(Type.String())); updated conversions, OpenAPI example, and schema shapes.
  • Scheduler & Orchestration — components/job-orchestration/.../scheduler/job_config.py, components/job-orchestration/.../scheduler/query/query_scheduler.py
    QueryJobConfig now has `datasets: list[str]
  • Python CLP Utilities & CLI — components/clp-package-utils/clp_package_utils/scripts/native/search.py, .../scripts/search.py, .../scripts/native/utils.py, components/clp-package-utils/clp_package_utils/controller.py
    CLI --dataset → --datasets (append); function signatures updated to accept `datasets: list[str]
  • CLP Core & Results Cache — components/core/src/clp_s/CommandLineArguments.hpp, .../CommandLineArguments.cpp, components/core/src/clp_s/OutputHandlerImpl.hpp, .../OutputHandlerImpl.cpp, components/core/src/clp_s/archive_constants.hpp, components/core/src/clp_s/clp-s.cpp
    Added CLI dataset option and member; extended ResultsCacheOutputHandler/QueryResult to include dataset, persist cDataset in documents, and updated constructor/signatures and call sites to pass dataset.
  • Executor Tasks — components/job-orchestration/.../executor/query/extract_stream_task.py, .../fs_search_task.py
    Threaded optional dataset through command/env builders; CLP_S command builders accept/use dataset (fallback to first dataset when needed); appended --dataset flag when provided.
  • Web UI - State, Settings & Config — components/webui/client/src/pages/SearchPage/SearchState/index.tsx, components/webui/client/public/settings.json, components/webui/client/src/config/index.ts, components/webui/client/src/settings.ts, components/package-template/.../clp-config.template.*
    Replaced singular store fields with selectedDatasets: string[] and queriedDatasets: string[]; added MaxDatasetsPerQuery setting and exported SETTINGS_MAX_DATASETS_PER_QUERY; added max_datasets_per_query in config templates.
  • Web UI - Dataset Selection & Styling — components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx, .../index.tsx, .../index.module.css
    Converted dataset selector to multi-select, added fallback logic, conditional expanded styling and responsive tag handling.
  • Web UI - Submission, Metrics, Time Range, Results — multiple files under components/webui/client/src/pages/SearchPage/... (SubmitButton, Presto guided, QuerySpeed utils, TimeRangeInput, TimestampKeySelect, NativeResultsTimeline, SearchResults components, etc.)
    Updated components and utilities to accept/use arrays of datasets (use first dataset where single-dataset semantics required), updated SQL builders to UNION across datasets, threaded dataset into result rendering and LogViewerLink, and adjusted hooks/queries to use selectedDatasets/queriedDatasets.
  • WebUI Server & Stream File Manager / Routes — components/webui/server/src/plugins/app/StreamFileManager.ts, components/webui/server/src/routes/api/search/index.ts
    submitAndWaitForExtractStreamJob and route handlers now use datasets array in jobConfig and request payload wiring.
  • Docs & Quick Start — docs/src/user-docs/guides-using-the-api-server.md, docs/src/user-docs/quick-start/clp-json.md
    Examples and prose updated to show datasets: ["default"] and document multi-dataset CLI usage and UI label pluralization.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant WebUI as Web UI (React store)
    participant API as API Server
    participant Scheduler as Query Scheduler
    participant Executor as Task Executor
    participant CLP_S as Search Engine (CLP_S)

    User->>WebUI: select one or more datasets
    WebUI->>WebUI: store selectedDatasets[]
    User->>WebUI: submit query (datasets[])
    WebUI->>API: POST /query with datasets[]
    API->>Scheduler: create/dispatch QueryJob(datasets[])
    Scheduler->>Scheduler: validate datasets & enforce max_datasets_per_query
    Scheduler->>Scheduler: discover archives per dataset -> archives[{archive_id,dataset},...]
    Scheduler->>Executor: create tasks with archive dicts
    Executor->>CLP_S: run search/extract per archive with --dataset <dataset>
    CLP_S->>Executor: return results (include dataset)
    Executor->>API: store results in results cache (includes dataset)
    API->>WebUI: return results (contains dataset)
    WebUI->>User: render results (dataset-aware)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 51.43%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title clearly and specifically describes the main change: adding multi-dataset query support, which is the primary objective of this PR.


Comment on lines +608 to 612
)(
"dataset",
po::value<std::string>(&m_dataset)->value_name("DATASET"),
"The dataset name to include in each result document"
);
Member Author

The dataset name needs to be included in the query results before dumping them into the results cache. I know this feels like a hack. A more future-proof interface might be to accept a JSON object string via --extra-result-metadata and merge such objects with the results before dumping them into the results cache. What do you think?

Member Author

Alternatively, we can add multi-dataset support to the stream extraction flow as well. Then we would just pass an array of datasets in the stream extraction config, avoiding touching the clp-s binary.

I think this may be more aligned with the future plan of using a single table for all datasets (though maybe I don't have a correct understanding of how we are going to refactor the datasets feature).

@junhaoliao junhaoliao marked this pull request as ready for review February 17, 2026 20:00
@junhaoliao junhaoliao requested review from a team and gibber9809 as code owners February 17, 2026 20:00
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx (1)

99-100: ⚠️ Potential issue | 🟡 Minor

Coding guideline: prefer false === isQueryReady over !isQueryReady.

Proposed fix
-                disabled={!isQueryReady ||
+                disabled={false === isQueryReady ||
                     searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}

As per coding guidelines: **/*.{tsx}: "Prefer false === <expression> rather than !<expression>."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`
around lines 99 - 100, In GuidedRunButton update the disabled conditional to
avoid using negation: replace the `!isQueryReady` check with `false ===
isQueryReady` so the prop check reads `disabled={false === isQueryReady ||
searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}` (keep the existing
searchUiState comparison unchanged); this targets the isQueryReady boolean usage
in the GuidedRunButton component to comply with the coding guideline.
components/api-server/src/client.rs (1)

27-56: ⚠️ Potential issue | 🟡 Minor

Consider validating that datasets is not an empty vector.

A client can submit "datasets": [], which would pass the is_none() check on Line 131 and propagate an empty dataset list to the scheduler. If the scheduler doesn't guard against this, it could cause unexpected behaviour. Consider adding a validation or normalising empty vectors to None:

🛡️ Suggested guard in submit_query
     let mut search_job_config: SearchJobConfig = query_config.into();
-    if search_job_config.datasets.is_none() {
+    if search_job_config.datasets.as_ref().is_none_or(|d| d.is_empty()) {
         search_job_config.datasets = match self.config.package.storage_engine {
             StorageEngine::Clp => None,
             StorageEngine::ClpS => Some(vec!["default".to_owned()]),
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/api-server/src/client.rs` around lines 27 - 56, QueryConfig allows
"datasets": [] which should be treated as None; update the code that handles
incoming QueryConfig (e.g., in submit_query) to normalise or validate the
datasets field by checking QueryConfig::datasets and if Some(vec) &&
vec.is_empty() either set it to None before further processing or return a 4xx
validation error; reference the QueryConfig struct and the submit_query handler
so the empty-vector case is handled early and you never propagate an empty
Vec<String> to the scheduler.
components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts (1)

28-48: ⚠️ Potential issue | 🟡 Minor

Ensure jobId uses parameterized query or add format validation.

jobId originates from the backend API response (trusted internal source), mitigating direct injection risk. However, it's typed as string (not numeric) and is interpolated directly into SQL without frontend validation. Adopt parameterized queries or validate that jobId matches an expected format before interpolation to follow SQL injection prevention best practices.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`
around lines 28 - 48, The SQL string builder in utils.ts interpolates jobId
directly into the query (see archivesSubquery and the template that uses
${jobId}), which risks injection because jobId is a string; fix it by switching
to a parameterized query or validating/coercing jobId to a strict numeric format
before interpolation—either accept a numeric jobId argument
(Number.parseInt/Number and reject NaN) or replace the inline ${jobId} use with
a parameter placeholder and pass jobId as a bound parameter from the caller
(update the function that returns this query and its consumers to support
parameter binding).
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

504-537: 🧹 Nitpick | 🔵 Trivial

Consider using zip instead of index-based iteration.

The for i in range(len(archives)) pattern at lines 524 and 537 could be cleaner with zip(archives, task_ids).

Suggested refactor
     if QueryJobType.SEARCH_OR_AGGREGATION == job_type:
         return celery.group(
             search.s(
                 job_id=job.id,
-                archive_id=archives[i]["archive_id"],
-                task_id=task_ids[i],
+                archive_id=archive["archive_id"],
+                task_id=task_id,
                 job_config=job_config,
-                dataset=archives[i].get("dataset"),
+                dataset=archive.get("dataset"),
                 clp_metadata_db_conn_params=clp_metadata_db_conn_params,
                 results_cache_uri=results_cache_uri,
             )
-            for i in range(len(archives))
+            for archive, task_id in zip(archives, task_ids)
         )

Apply the same pattern to the EXTRACT_JSON/EXTRACT_IR branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 504 - 537, The code in get_task_group_for_job uses index-based
iteration (for i in range(len(archives))) to pair archives and task_ids when
building celery.group for QueryJobType.SEARCH_OR_AGGREGATION and
QueryJobType.EXTRACT_JSON/EXTRACT_IR; replace those loops with zip(archives,
task_ids) to iterate pairs directly and update the generator expressions that
call search.s(...) and extract_stream.s(...) to use the unpacked archive and
task_id instead of archives[i] and task_ids[i]; ensure you reference job.id,
job_config, archive.get("dataset"), clp_metadata_db_conn_params and
results_cache_uri unchanged while switching both branches to the zip-based
pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-py-utils/clp_py_utils/clp_config.py`:
- Line 435: The config field max_datasets_per_query (in clp_config.py) lacks
documentation for None semantics—add a brief inline comment next to its
declaration explaining that setting this field to None (or null in YAML/JSON)
means "unlimited" datasets per query; keep the comment concise and on the same
line or immediately above the declaration so users editing config files
immediately see that None/null = unlimited.

In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`:
- Around line 99-103: The code currently allows dataset to remain None if
extract_json_config.datasets is empty, which leads to invalid paths when
building s3_object_key and local paths; update the start of the logic in
extract_stream_task (the block using dataset, extract_json_config.datasets,
StorageType.S3 and s3_config.key_prefix) to validate dataset after resolving
from extract_json_config.datasets and, if still None, return an error (or raise
an exception) early with a clear message instead of proceeding to build
f"{s3_config.key_prefix}{dataset}/{archive_id}" or using get_directory() /
dataset; ensure any callers handle the early error accordingly.

In
`@components/job-orchestration/job_orchestration/executor/query/fs_search_task.py`:
- Around line 122-125: The CLP_S branch passes dataset (which can be None) into
_make_core_clp_s_command_and_env_vars which expects a str and will break
(f-string will include "None" and Path / None raises TypeError); fix by adding a
guard in _make_command_and_env_vars before the StorageEngine.CLP_S branch to
validate dataset is not None (raise a ValueError with a clear message) or
require/convert it to a string, or change _make_core_clp_s_command_and_env_vars
to accept Optional[str] and handle None safely; reference StorageEngine.CLP_S,
_make_command_and_env_vars, and _make_core_clp_s_command_and_env_vars when
applying the guard or signature change.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 178-180: The code silently selects only the first dataset via
extraction_dataset = self.__job_config.datasets[0] which can confuse reviewers;
add a concise inline comment next to that line explaining that extraction jobs
intentionally use only the first dataset (single-archive behavior) and that
multi-dataset context is not applicable here, referencing __job_config.datasets,
extraction_dataset, and the subsequent archive_exists check to make the
rationale clear.
- Around line 417-427: The current code directly interpolates dataset names into
the SELECT string (in the loop building union_parts), which risks SQL injection;
instead build the SELECT with a parameter placeholder for the dataset value and
accumulate a params list to supply to the DB driver. Concretely, keep using
get_archives_table_name(table_prefix, ds) for the table name (since table names
cannot be parameterized), but change the appended string to use a parameter
placeholder (e.g. "%s" or "?" matching the DB adapter) like "SELECT id AS
archive_id, end_timestamp, ? AS dataset FROM {table}{where_clause}", and append
the ds value to a params list for each union part; after joining union_parts
into query, return/execute the query together with the flattened params list so
the dataset values are passed as bound parameters rather than interpolated.
- Around line 736-738: The call to get_archives_for_search with datasets can
pass None (causing a TypeError in get_archives_for_search's for ds in datasets),
so before calling get_archives_for_search (where archives_for_search is
assigned) add a guard: if datasets is None, populate datasets with all available
datasets (e.g., call the existing helper that lists datasets such as
get_all_datasets/db helper or query the DB via db_conn/table_prefix to return a
list of dataset names), or explicitly set datasets = [] if the intended
semantics are “no datasets”; then call get_archives_for_search(db_conn,
table_prefix, search_config, archive_end_ts_lower_bound, datasets) so the
function always receives an iterable list.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 55-63: The useEffect that checks isSuccess && datasets.length ===
0 keeps calling updateDatasets(getFallbackDatasets()) even when
getFallbackDatasets() returns a new empty array reference, causing an infinite
re-render loop; modify the effect in DatasetSelect.tsx to early-return if data
is an empty array (data.length === 0) or only call updateDatasets when the
fallback actually changes (e.g., fallback has length > 0 or differs from
datasets by length/content) so that updateDatasets is not invoked with a new
empty array reference repeatedly; specifically guard the useEffect (and the
related effect that calls updateDatasets([])) to avoid setting an
identical-empty selection by comparing lengths or contents before calling
updateDatasets.
- Around line 43-52: In getFallbackDatasets, remove the unnecessary type
assertion on available[0]; since available is derived as string[] (const
available = data || []) and you already check length > 0, replace the returned
array [available[0] as string] with [available[0]] to avoid the redundant cast
(references: function getFallbackDatasets, variable available, constant
CLP_DEFAULT_DATASET_NAME).

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Native/SearchButton/SubmitButton/index.tsx`:
- Around line 107-112: Update the inline comment above the isNoDatasetsAndClpS
computation to reference "datasets" plural and state that clp-s requires at
least one dataset; locate the comment surrounding the isNoDatasetsAndClpS
constant (which checks selectedDatasets.length and SETTINGS_STORAGE_ENGINE
against CLP_STORAGE_ENGINES.CLP_S) and change the phrasing to something like
"Submit button must be disabled if there are no datasets since clp-s requires at
least one dataset for queries."

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/Guided/presto-guided-search-requests.ts`:
- Line 74: The conditional guarding the destructured variable `from` uses an
unnecessary typeof check; replace the `if ("undefined" === typeof from)` with a
direct undefined comparison (e.g., `if (from === undefined)`) in
presto-guided-search-requests.ts so the code is clearer and still correctly
handles `string | undefined` for `from`.
- Around line 71-76: The code currently destructures selectedDatasets to only
use the first entry (const [from] = selectedDatasets) which silently ignores
additional selections; update all guided-Presto entry points (the usages in
presto-guided-search-requests.ts, GuidedRunButton, TimestampKeySelect and
buildPrestoGuidedQueries) to detect when
useSearchStore.getState().selectedDatasets.length > 1 and handle it explicitly:
either prevent guided mode (disable the run button and input controls) or
surface a prominent, user-visible warning/toast/banner that multiple datasets
are not supported in guided mode and block execution until the user reduces the
selection to one; do not silently proceed by using only selectedDatasets[0].
Ensure the same single-check-and-block logic and a clear error/warning message
is implemented wherever selectedDatasets is currently destructured.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`:
- Around line 37-40: The button currently reads selectedDatasets via
useSearchStore.getState() which is non-reactive so `from` (and consequently
`isQueryReady`) won't update on selection changes; replace the non-reactive call
with a reactive selector such as const selectedDatasets = useSearchStore(state
=> state.selectedDatasets) and compute `from` from that reactive value (keeping
the `from` name and downstream references intact) so the component re-renders
when selection changes.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`:
- Around line 15-26: The SQL-building code interpolates datasetNames directly
into table identifiers (see variable archivesSubquery and
settings.SqlDbClpTablePrefix) which is an injection risk; fix by introducing a
single helper like escapeSqlIdentifier(name) that either validates names against
a safe whitelist regex (e.g. only [A-Za-z0-9_\\-] and throw on invalid) or
properly quotes/escapes identifiers (wrap with backticks and escape any
backticks inside), and call that helper wherever datasetNames are used to build
table names (both in this file's archivesSubquery logic and in
TimeRangeInput/sql.ts); do not use string interpolation of raw datasetNames
directly and document that identifiers are validated/escaped before
concatenation.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx`:
- Around line 20-26: The code duplicates the logic that derives the first
selected dataset (`0 < selectedDatasets.length ? selectedDatasets[0] : null`)
across components (seen in this file and useTimestampKeyInit/index.tsx); extract
a reusable selector/helper (e.g., getFirstSelectedDataset or
selectFirstSelectedDataset) and use it with useSearchStore in both places
instead of repeating the ternary. Update usages that reference selectedDatasets
(the hook useSearchStore and components using dataset) to call the new selector
so the derivation is centralized and DRY.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts`:
- Around line 34-44: The function buildClpsTimeRangeSql currently interpolates
datasetNames directly into SQL identifiers (via settings.SqlDbClpTablePrefix +
name + _ + SqlTableSuffix.ARCHIVES), which risks SQL injection if
client-supplied names are malicious; add a client-side defensive check in
buildClpsTimeRangeSql to validate each name against /^\w+$/ (or equivalent
alphanumeric/underscore-only regex) and either throw an error or omit invalid
entries before building unionParts, so only validated safe identifiers are
concatenated into the SQL string; ensure the validation references
buildClpsTimeRangeSql, CLP_ARCHIVES_TABLE_COLUMN_NAMES,
settings.SqlDbClpTablePrefix, and SqlTableSuffix.ARCHIVES so reviewers can
locate the change.

In `@components/webui/common/src/schemas/search.ts`:
- Line 16: The datasets array in the search schema currently allows empty
arrays; update the schema definition for the datasets property to require at
least one entry by adding a minItems: 1 constraint to the Type.Array call (i.e.,
change datasets: Type.Array(Type.String()) to include { minItems: 1 }); this
aligns with the existing pattern used elsewhere (e.g., compression schema) and
ensures validation rejects empty dataset lists.

---

Outside diff comments:
In `@components/api-server/src/client.rs`:
- Around line 27-56: QueryConfig allows "datasets": [] which should be treated
as None; update the code that handles incoming QueryConfig (e.g., in
submit_query) to normalise or validate the datasets field by checking
QueryConfig::datasets and if Some(vec) && vec.is_empty() either set it to None
before further processing or return a 4xx validation error; reference the
QueryConfig struct and the submit_query handler so the empty-vector case is
handled early and you never propagate an empty Vec<String> to the scheduler.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 504-537: The code in get_task_group_for_job uses index-based
iteration (for i in range(len(archives))) to pair archives and task_ids when
building celery.group for QueryJobType.SEARCH_OR_AGGREGATION and
QueryJobType.EXTRACT_JSON/EXTRACT_IR; replace those loops with zip(archives,
task_ids) to iterate pairs directly and update the generator expressions that
call search.s(...) and extract_stream.s(...) to use the unpacked archive and
task_id instead of archives[i] and task_ids[i]; ensure you reference job.id,
job_config, archive.get("dataset"), clp_metadata_db_conn_params and
results_cache_uri unchanged while switching both branches to the zip-based
pattern.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`:
- Around line 99-100: In GuidedRunButton update the disabled conditional to
avoid using negation: replace the `!isQueryReady` check with `false ===
isQueryReady` so the prop check reads `disabled={false === isQueryReady ||
searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}` (keep the existing
searchUiState comparison unchanged); this targets the isQueryReady boolean usage
in the GuidedRunButton component to comply with the coding guideline.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`:
- Around line 28-48: The SQL string builder in utils.ts interpolates jobId
directly into the query (see archivesSubquery and the template that uses
${jobId}), which risks injection because jobId is a string; fix it by switching
to a parameterized query or validating/coercing jobId to a strict numeric format
before interpolation—either accept a numeric jobId argument
(Number.parseInt/Number and reject NaN) or replace the inline ${jobId} use with
a parameter placeholder and pass jobId as a bound parameter from the caller
(update the function that returns this query and its consumers to support
parameter binding).

host: DomainStr = "localhost"
port: Port = DEFAULT_PORT
jobs_poll_delay: PositiveFloat = 0.1 # seconds
max_datasets_per_query: PositiveInt | None = 10
Contributor

🧹 Nitpick | 🔵 Trivial

Add an inline comment documenting the None semantics.

Users editing config files need to know that setting this field to null means "unlimited." A brief comment would help:

📝 Suggested documentation
-    max_datasets_per_query: PositiveInt | None = 10
+    max_datasets_per_query: PositiveInt | None = 10  # None means unlimited
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
max_datasets_per_query: PositiveInt | None = 10
max_datasets_per_query: PositiveInt | None = 10 # None means unlimited
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-py-utils/clp_py_utils/clp_config.py` at line 435, The config
field max_datasets_per_query (in clp_config.py) lacks documentation for None
semantics—add a brief inline comment next to its declaration explaining that
setting this field to None (or null in YAML/JSON) means "unlimited" datasets per
query; keep the comment concise and on the same line or immediately above the
declaration so users editing config files immediately see that None/null =
unlimited.

Comment on lines 99 to 103
if dataset is None:
    dataset = extract_json_config.datasets[0] if extract_json_config.datasets else None
if StorageType.S3 == storage_type:
    s3_config = worker_config.archive_output.storage.s3_config
    s3_object_key = f"{s3_config.key_prefix}{dataset}/{archive_id}"
Contributor

⚠️ Potential issue | 🔴 Critical

dataset may be None when used in path construction, producing invalid paths.

If the dataset parameter is None and extract_json_config.datasets is empty or None, dataset stays None. It is then interpolated into the f-string at line 103 (f"{s3_config.key_prefix}{dataset}/{archive_id}") and used at line 122 (get_directory() / dataset), producing paths containing the literal string "None".

Consider adding an early return with an error when dataset resolves to None:

Proposed fix
     if dataset is None:
         dataset = extract_json_config.datasets[0] if extract_json_config.datasets else None
+    if dataset is None:
+        logger.error("No dataset specified for JSON extraction")
+        return None, None
     if StorageType.S3 == storage_type:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`
around lines 99 - 103, The code currently allows dataset to remain None if
extract_json_config.datasets is empty, which leads to invalid paths when
building s3_object_key and local paths; update the start of the logic in
extract_stream_task (the block using dataset, extract_json_config.datasets,
StorageType.S3 and s3_config.key_prefix) to validate dataset after resolving
from extract_json_config.datasets and, if still None, return an error (or raise
an exception) early with a clear message instead of proceeding to build
f"{s3_config.key_prefix}{dataset}/{archive_id}" or using get_directory() /
dataset; ensure any callers handle the early error accordingly.

Comment on lines 122 to 125
elif StorageEngine.CLP_S == storage_engine:
command, env_vars = _make_core_clp_s_command_and_env_vars(
clp_home, worker_config, archive_id, search_config
clp_home, worker_config, archive_id, search_config, dataset
)
Contributor

⚠️ Potential issue | 🔴 Critical

dataset can be None here, but _make_core_clp_s_command_and_env_vars requires str.

_make_command_and_env_vars accepts dataset: str | None = None (line 114), but passes it directly to _make_core_clp_s_command_and_env_vars (line 69) which declares dataset: str. If dataset is None at this call site:

  • Line 77: the f-string would embed the literal "None" in the S3 key.
  • Line 95: Path / None would raise a TypeError.

Add a guard before the CLP_S branch, or make dataset required when the storage engine is CLP_S.

🐛 Proposed fix
     elif StorageEngine.CLP_S == storage_engine:
+        if dataset is None:
+            logger.error("dataset is required for the CLP_S storage engine")
+            return None, None
         command, env_vars = _make_core_clp_s_command_and_env_vars(
             clp_home, worker_config, archive_id, search_config, dataset
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif StorageEngine.CLP_S == storage_engine:
command, env_vars = _make_core_clp_s_command_and_env_vars(
clp_home, worker_config, archive_id, search_config
clp_home, worker_config, archive_id, search_config, dataset
)
elif StorageEngine.CLP_S == storage_engine:
if dataset is None:
logger.error("dataset is required for the CLP_S storage engine")
return None, None
command, env_vars = _make_core_clp_s_command_and_env_vars(
clp_home, worker_config, archive_id, search_config, dataset
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/executor/query/fs_search_task.py`
around lines 122 - 125, The CLP_S branch passes dataset (which can be None) into
_make_core_clp_s_command_and_env_vars which expects a str and will break
(f-string will include "None" and Path / None raises TypeError); fix by adding a
guard in _make_command_and_env_vars before the StorageEngine.CLP_S branch to
validate dataset is not None (raise a ValueError with a clear message) or
require/convert it to a string, or change _make_core_clp_s_command_and_env_vars
to accept Optional[str] and handle None safely; reference StorageEngine.CLP_S,
_make_command_and_env_vars, and _make_core_clp_s_command_and_env_vars when
applying the guard or signature change.

Comment on lines 178 to 180
Contributor

🧹 Nitpick | 🔵 Trivial

Silently using datasets[0] loses multi-dataset context for extraction jobs.

extraction_dataset takes only the first element. This is likely intentional for extraction (single archive), but worth a brief inline comment explaining why only the first dataset is used here.

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 180-180: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 178 - 180, The code silently selects only the first dataset via
extraction_dataset = self.__job_config.datasets[0] which can confuse reviewers;
add a concise inline comment next to that line explaining that extraction jobs
intentionally use only the first dataset (single-archive behavior) and that
multi-dataset context is not applicable here, referencing __job_config.datasets,
extraction_dataset, and the subsequent archive_exists check to make the
rationale clear.

Comment on lines +417 to +427
where_clause = ""
if len(filter_clauses) > 0:
query += " WHERE " + " AND ".join(filter_clauses)
query += " ORDER BY end_timestamp DESC"
where_clause = " WHERE " + " AND ".join(filter_clauses)

union_parts = []
for ds in datasets:
table = get_archives_table_name(table_prefix, ds)
union_parts.append(
f"SELECT id AS archive_id, end_timestamp, '{ds}' AS dataset FROM {table}{where_clause}"
)
query = " UNION ALL ".join(union_parts) + " ORDER BY end_timestamp DESC"
Contributor

⚠️ Potential issue | 🟡 Minor

SQL injection: user-supplied dataset names interpolated directly into query strings.

The dataset name ds is string-interpolated into the SQL query at line 425. While datasets are validated against existing_datasets at lines 706–722 before reaching this point (which mitigates active exploitation), the validation is skipped when datasets is None. If the None issue above is resolved by defaulting to fetched datasets, this path would be safe. Still, consider using parameterized queries for the '{ds}' value to defend in depth.

Note: The table name interpolation via get_archives_table_name follows existing codebase patterns and can't easily be parameterized.

🧰 Tools
🪛 Ruff (0.15.1)

[error] 425-425: Possible SQL injection vector through string-based query construction

(S608)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 417 - 427, The current code directly interpolates dataset names
into the SELECT string (in the loop building union_parts), which risks SQL
injection; instead build the SELECT with a parameter placeholder for the dataset
value and accumulate a params list to supply to the DB driver. Concretely, keep
using get_archives_table_name(table_prefix, ds) for the table name (since table
names cannot be parameterized), but change the appended string to use a
parameter placeholder (e.g. "%s" or "?" matching the DB adapter) like "SELECT id
AS archive_id, end_timestamp, ? AS dataset FROM {table}{where_clause}", and
append the ds value to a params list for each union part; after joining
union_parts into query, return/execute the query together with the flattened
params list so the dataset values are passed as bound parameters rather than
interpolated.

Comment on lines +37 to +40
const {selectedDatasets} = useSearchStore.getState();
const from = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
Contributor

⚠️ Potential issue | 🔴 Critical

Bug: getState() makes selectedDatasets non-reactive — button won't reflect dataset changes.

useSearchStore.getState() captures state once and does not subscribe to updates. Since from (derived from selectedDatasets) feeds into isQueryReady, the button's disabled state and tooltip will not update when the user changes the dataset selection.

Use a reactive selector instead:

🐛 Proposed fix
-    const {selectedDatasets} = useSearchStore.getState();
-    const from = 0 < selectedDatasets.length ?
-        selectedDatasets[0] :
+    const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
+    const from = 0 < selectedDatasets.length ?
+        selectedDatasets[0] :
         null;

Based on learnings: "use useStore.getState().method for callbacks since the output is not reactive and doesn't need state as a dependency in the hook, and use useStore((state) => state.property) with proper selectors for reactive components that need to re-render when state changes."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const {selectedDatasets} = useSearchStore.getState();
const from = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
const from = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`
around lines 37 - 40, The button currently reads selectedDatasets via
useSearchStore.getState() which is non-reactive so `from` (and consequently
`isQueryReady`) won't update on selection changes; replace the non-reactive call
with a reactive selector such as const selectedDatasets = useSearchStore(state
=> state.selectedDatasets) and compute `from` from that reactive value (keeping
the `from` name and downstream references intact) so the component re-renders
when selection changes.

Comment on lines +15 to +26
if (0 === datasetNames.length) {
    archivesSubquery = "SELECT id, uncompressed_size" +
        ` FROM ${settings.SqlDbClpArchivesTableName}`;
} else if (1 === datasetNames.length) {
    archivesSubquery = "SELECT id, uncompressed_size" +
        ` FROM ${settings.SqlDbClpTablePrefix}${datasetNames[0]}_archives`;
} else {
    archivesSubquery = datasetNames
        .map((name) => "SELECT id, uncompressed_size" +
            ` FROM ${settings.SqlDbClpTablePrefix}${name}_archives`)
        .join(" UNION ALL ");
}
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for datasetNames definition and usage to understand the data flow
rg -n 'datasetNames' components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/ --type=ts -B 2 -A 2

Repository: y-scope/clp

Length of output: 3833


🏁 Script executed:

# Look at the complete file to understand context
head -80 components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts

Repository: y-scope/clp

Length of output: 2159


🏁 Script executed:

# Search for any dataset name validation patterns in the webui client
rg -n 'dataset.*validat|sanitiz|allowlist|whitelist|escape|quote' components/webui/client/src/ --type=ts --type=tsx -i | head -30

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Find where datasetNames parameter comes from by checking callers
rg -n 'buildQuerySpeedSql' components/webui/client/src/ --type=ts --type=tsx -B 2 -A 2

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Check if there's any dataset validation in the store or data layer
rg -n 'selectedDatasets|queriedDatasets' components/webui/client/src/ --type=ts --type=tsx | head -20

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Find imports/calls to fetchQuerySpeed
rg -n 'fetchQuerySpeed' components/webui/client/src/ --type=ts | head -20

Repository: y-scope/clp

Length of output: 693


🏁 Script executed:

# Search for where this module is imported
rg -n 'from.*QuerySpeed' components/webui/client/src/ --type=ts | head -20

Repository: y-scope/clp

Length of output: 178


🏁 Script executed:

# Check the api/sql module to understand querySql
find components/webui/client/src -name "sql.ts" -o -name "sql.tsx" | head -5

Repository: y-scope/clp

Length of output: 427


🏁 Script executed:

# Look for dataset selection/filtering logic in the search page
find components/webui/client/src/pages/SearchPage -name "*.ts" | head -10

Repository: y-scope/clp

Length of output: 1007


🏁 Script executed:

# Check the QuerySpeed index.tsx to see where queriedDatasets comes from
cat -n components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/index.tsx

Repository: y-scope/clp

Length of output: 1984


🏁 Script executed:

# Look at the sql.ts in api folder to understand querySql
find components/webui/client/src/api -name "sql.ts" | head -5

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check for dataset validation in the search state/store
find components/webui/client/src/pages/SearchPage -name "*state*" -o -name "*store*" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for where dataset names are defined/sourced in the search page
rg -n 'queriedDatasets' components/webui/client/src/pages/SearchPage --type=ts | head -20

Repository: y-scope/clp

Length of output: 849


🏁 Script executed:

# Check the SearchState index.tsx to see how queriedDatasets is set
cat -n components/webui/client/src/pages/SearchPage/SearchState/index.tsx | head -150

Repository: y-scope/clp

Length of output: 5396


🏁 Script executed:

# Look for the api/sql module to understand if querySql does any sanitization
find components/webui/client/src -path "*/api/*sql.ts" -type f

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check the Dataset controls to see how datasets are selected/validated
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/sql.ts

Repository: y-scope/clp

Length of output: 1121


🏁 Script executed:

# Find the api/sql.ts file
find components/webui/client/src -path "*/api/sql.ts" -type f

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check how querySql is implemented
cat -n components/webui/client/src/api/sql.ts | head -50

Repository: y-scope/clp

Length of output: 125


🏁 Script executed:

# Look for any examples of dataset name sanitization or validation elsewhere in the codebase
rg -n 'dataset.*\[a-zA-Z0-9_-\]|sanitiz.*dataset|escape.*name' components/webui/client/src --type=ts -i | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Find where sql.ts or querySql is defined
find components/webui/client/src -name "*.ts" -o -name "*.tsx" | xargs grep -l "export.*querySql" | head -5

Repository: y-scope/clp

Length of output: 99


🏁 Script executed:

# Search for querySql definition
rg -n 'querySql.*=' components/webui/client/src --type=ts | head -20

Repository: y-scope/clp

Length of output: 155


🏁 Script executed:

# Check the api directory structure
find components/webui/client/src/api -type f | head -20

Repository: y-scope/clp

Length of output: 644


🏁 Script executed:

# Check the querySql implementation
cat -n components/webui/client/src/api/sql/index.ts

Repository: y-scope/clp

Length of output: 417


🏁 Script executed:

# Look for where the SQL query is executed - check if there's any backend validation
rg -n 'querySql|dataset.*validation|sql.*validate' components/webui/client/src/api/sql/ --type=ts -A 5 | head -50

Repository: y-scope/clp

Length of output: 440


🏁 Script executed:

# Find the backend endpoint that handles /api/archive-metadata/sql
find components -name "*.cpp" -o -name "*.hpp" -o -name "*.py" | xargs grep -l "archive-metadata/sql" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for SQL query handling in the backend
rg -n 'archive-metadata.*sql|querySql' components --type=cpp --type=py | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for where dataset names might be validated in the backend
rg -n 'dataset.*validat|sanitiz.*dataset' components --type=cpp --type=py -i | head -20

Repository: y-scope/clp

Length of output: 391


🏁 Script executed:

# Check the backend handler for the SQL endpoint to see if validation happens there
find components -path "*/archive-metadata*" -name "*.cpp" -o -name "*.hpp" -o -name "*.py" 2>/dev/null | head -20

Repository: y-scope/clp

Length of output: 1197


🏁 Script executed:

# Look for the REST API endpoint handler
rg -n 'archive.metadata.*sql|/api/archive-metadata' components --type=cpp --type=py | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search more broadly for SQL execution in the backend
rg -n 'execute.*query|query.*execute|executeSql' components --type=py -A 3 | head -40

Repository: y-scope/clp

Length of output: 3569


🏁 Script executed:

# Look for where the REST endpoint /api/archive-metadata/sql is implemented
find components -type f \( -name "*.py" -o -name "*.cpp" -o -name "*.hpp" \) | xargs grep -l "archive.metadata" | head -10

Repository: y-scope/clp

Length of output: 544


🏁 Script executed:

# Search in the job orchestration for where SQL is actually executed for metadata queries
rg -n 'presto|trino|execute.*sql' components/job-orchestration --type=py | head -30

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check if there are any examples of dataset name escaping/quoting in the codebase
rg -n 'quote.*name|escape.*name|f\"\"\"|backtick|identifier' components --type=py --type=ts -i | head -20

Repository: y-scope/clp

Length of output: 199


🏁 Script executed:

# Search for the backend endpoint that handles SQL queries
find components -type f -name "*.py" | xargs grep -l "archive-metadata" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for REST API/Flask routes
rg -n '@app.route|@api.route|router.post' components --type=py | grep -i sql | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for where presto queries are built with dataset names
rg -n 'FROM.*dataset|table.*prefix|tableName' components --type=py | head -30

Repository: y-scope/clp

Length of output: 3961


🏁 Script executed:

# Look at the clp_metadata_db_utils.py to see how table names are constructed
cat -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py | head -150

Repository: y-scope/clp

Length of output: 5784


🏁 Script executed:

# Search for the REST endpoint handler that processes the SQL query
find components -type f -name "*.py" | xargs grep -l "queryString" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for the webui backend/API handler
find components/webui -type f -name "*.py" | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for where datasets are imported/created to see if names are validated
rg -n 'add_dataset|dataset_name.*=' components/clp-py-utils components/job-orchestration --type=py -B 2 -A 2 | head -50

Repository: y-scope/clp

Length of output: 1729


🏁 Script executed:

# Check the webui backend more carefully - look for app.py or main server file
find components/webui -type f -name "*.py" -o -name "app.py" -o -name "server.py" | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for where datasets are fetched/selected in the UI
rg -n 'fetchDatasetNames|selectedDatasets' components/webui/client/src/pages/SearchPage --type=ts | head -20

Repository: y-scope/clp

Length of output: 2910


🏁 Script executed:

# Check the Dataset component to see how datasets are selected
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/ 2>/dev/null | head -100

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Find the Dataset component file
find components/webui/client/src/pages/SearchPage/SearchControls/Dataset -type f -name "*.tsx" -o -name "*.ts"

Repository: y-scope/clp

Length of output: 293


🏁 Script executed:

# Look at the DatasetSelect component
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx

Repository: y-scope/clp

Length of output: 4078


🏁 Script executed:

# Look at the Dataset index component
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.tsx

Repository: y-scope/clp

Length of output: 1281


🏁 Script executed:

# Look for any sanitization or validation in the search request building
rg -n 'buildClpsTimeRangeSql|buildQuerySpeedSql' components/webui/client/src --type=ts -A 10 | head -60

Repository: y-scope/clp

Length of output: 5617


🏁 Script executed:

# Check if there are any other similar SQL building patterns in the codebase
rg -n 'FROM.*\$\{.*prefix\}\$\{.*name\}' components/webui/client/src --type=ts | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for any existing SQL escaping utilities in the codebase
rg -n 'escape|quote|sanitize' components/webui/client/src --type=ts | head -20

Repository: y-scope/clp

Length of output: 2669


SQL injection risk: dataset names interpolated directly into table names without escaping.

datasetNames are concatenated directly into SQL table names (lines 20, 24) without any quoting or escaping around the dataset name variable. While dataset names originate from the database, they are stored with no character restrictions (VARCHAR(255)), so the generated SQL is vulnerable to injection if a dataset name contains special characters (e.g., backticks, quotes, or whitespace). This same pattern is repeated in TimeRangeInput/sql.ts.

Consider quoting or escaping dataset names before interpolation, or use parameterized queries if the backend supports them.
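
As a rough illustration only, here is a minimal TypeScript sketch of such a helper; escapeSqlIdentifier is a hypothetical name and MySQL-style backtick quoting is an assumption, not existing code in this repository:

/**
 * Hypothetical helper: quotes a SQL identifier for MySQL-style SQL by wrapping
 * it in backticks and doubling any backticks inside the name.
 */
const escapeSqlIdentifier = (name: string): string => {
    if (0 === name.length) {
        throw new Error("Identifier must not be empty");
    }
    return `\`${name.replace(/`/g, "``")}\``;
};

// Illustrative usage when building a table reference (surrounding names assumed):
// `FROM ${escapeSqlIdentifier(`${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`)}`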

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`
around lines 15 - 26, The SQL-building code interpolates datasetNames directly
into table identifiers (see variable archivesSubquery and
settings.SqlDbClpTablePrefix) which is an injection risk; fix by introducing a
single helper like escapeSqlIdentifier(name) that either validates names against
a safe whitelist regex (e.g. only [A-Za-z0-9_\\-] and throw on invalid) or
properly quotes/escapes identifiers (wrap with backticks and escape any
backticks inside), and call that helper wherever datasetNames are used to build
table names (both in this file's archivesSubquery logic and in
TimeRangeInput/sql.ts); do not use string interpolation of raw datasetNames
directly and document that identifiers are validated/escaped before
concatenation.

Comment on lines +20 to +26
const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
const timestampKey = usePrestoSearchState((state) => state.timestampKey);
const updateTimestampKey = usePrestoSearchState((state) => state.updateTimestampKey);
const searchUiState = useSearchStore((state) => state.searchUiState);
const dataset = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
Contributor

🧹 Nitpick | 🔵 Trivial

Duplicated dataset derivation logic across components.

The pattern 0 < selectedDatasets.length ? selectedDatasets[0] : null is repeated identically in useTimestampKeyInit/index.tsx (lines 19–21) and here (lines 24–26). Consider extracting a small helper or selector to keep this DRY, especially if more components need the same derivation.
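
A minimal sketch of what such a shared selector could look like, assuming the store exposes selectedDatasets: string[] (the selector name is illustrative):

type SearchState = {selectedDatasets: string[]};

// Hypothetical shared selector: first selected dataset, or null when none.
const selectFirstSelectedDataset = (state: SearchState): string | null =>
    state.selectedDatasets[0] ?? null;

// Illustrative usage in a component:
// const dataset = useSearchStore(selectFirstSelectedDataset);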

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx`
around lines 20 - 26, The code duplicates the logic that derives the first
selected dataset (`0 < selectedDatasets.length ? selectedDatasets[0] : null`)
across components (seen in this file and useTimestampKeyInit/index.tsx); extract
a reusable selector/helper (e.g., getFirstSelectedDataset or
selectFirstSelectedDataset) and use it with useSearchStore in both places
instead of repeating the ternary. Update usages that reference selectedDatasets
(the hook useSearchStore and components using dataset) to call the new selector
so the derivation is centralized and DRY.

Comment on lines +34 to 44
const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
const unionParts = datasetNames.map((name) => `SELECT
MIN(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.BEGIN_TIMESTAMP}) AS begin_timestamp,
MAX(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.END_TIMESTAMP}) AS end_timestamp
-FROM ${settings.SqlDbClpTablePrefix}${datasetName}_${SqlTableSuffix.ARCHIVES}`;
+FROM ${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`);

return `SELECT
MIN(begin_timestamp) AS begin_timestamp,
MAX(end_timestamp) AS end_timestamp
FROM (${unionParts.join("\nUNION ALL\n")}) AS combined`;
};
Contributor

⚠️ Potential issue | 🟡 Minor

SQL injection risk: dataset names are interpolated directly into SQL strings.

Dataset names from datasetNames are concatenated into the SQL query without sanitisation or parameterisation. While dataset names are validated server-side to contain only \w+ characters, a compromised or manipulated client could submit arbitrary strings. Consider adding a client-side guard (e.g., a regex check matching ^\w+$) before interpolating names into SQL:

🛡️ Suggested defensive guard
 const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
+    const VALID_NAME = /^\w+$/;
+    for (const name of datasetNames) {
+        if (false === VALID_NAME.test(name)) {
+            throw new Error(`Invalid dataset name: ${name}`);
+        }
+    }
     const unionParts = datasetNames.map((name) => `SELECT
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
const unionParts = datasetNames.map((name) => `SELECT
MIN(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.BEGIN_TIMESTAMP}) AS begin_timestamp,
MAX(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.END_TIMESTAMP}) AS end_timestamp
FROM ${settings.SqlDbClpTablePrefix}${datasetName}_${SqlTableSuffix.ARCHIVES}`;
FROM ${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`);
return `SELECT
MIN(begin_timestamp) AS begin_timestamp,
MAX(end_timestamp) AS end_timestamp
FROM (${unionParts.join("\nUNION ALL\n")}) AS combined`;
};
const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
const VALID_NAME = /^\w+$/;
for (const name of datasetNames) {
if (false === VALID_NAME.test(name)) {
throw new Error(`Invalid dataset name: ${name}`);
}
}
const unionParts = datasetNames.map((name) => `SELECT
MIN(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.BEGIN_TIMESTAMP}) AS begin_timestamp,
MAX(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.END_TIMESTAMP}) AS end_timestamp
FROM ${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`);
return `SELECT
MIN(begin_timestamp) AS begin_timestamp,
MAX(end_timestamp) AS end_timestamp
FROM (${unionParts.join("\nUNION ALL\n")}) AS combined`;
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts`
around lines 34 - 44, The function buildClpsTimeRangeSql currently interpolates
datasetNames directly into SQL identifiers (via settings.SqlDbClpTablePrefix +
name + _ + SqlTableSuffix.ARCHIVES), which risks SQL injection if
client-supplied names are malicious; add a client-side defensive check in
buildClpsTimeRangeSql to validate each name against /^\w+$/ (or equivalent
alphanumeric/underscore-only regex) and either throw an error or omit invalid
entries before building unionParts, so only validated safe identifiers are
concatenated into the SQL string; ensure the validation references
buildClpsTimeRangeSql, CLP_ARCHIVES_TABLE_COLUMN_NAMES,
settings.SqlDbClpTablePrefix, and SqlTableSuffix.ARCHIVES so reviewers can
locate the change.

// otherwise, `{dataset: null}` gets converted to `{dataset: ""}`.
-dataset: Type.Union([Type.Null(),
-    Type.String()]),
+datasets: Type.Array(Type.String()),
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for datasets validation in server-side route handlers
rg -n --type=ts -C5 'datasets' components/webui/server/src/routes/

Repository: y-scope/clp

Length of output: 1945


🏁 Script executed:

cat -n components/webui/common/src/schemas/search.ts

Repository: y-scope/clp

Length of output: 1259


🏁 Script executed:

#!/bin/bash
# Search for scheduler validation or error handling related to datasets
rg -n --type=ts -i 'scheduler|validation' components/webui/server/src/routes/api/search/index.ts | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Search for any array length validation patterns in the codebase
rg -n 'minItems|length.*===.*0|\.length\s*>' components/webui/common/src/schemas/ --type=ts

Repository: y-scope/clp

Length of output: 166


🏁 Script executed:

#!/bin/bash
# Check if there are request validators or middleware that might validate datasets
rg -n --type=ts 'datasets.*length|datasets.*empty' components/webui/server/ -C3

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Search for tests related to query job creation to understand expected behavior
find components/webui -name "*.test.ts" -o -name "*.spec.ts" | xargs rg -l "datasets" 2>/dev/null | head -5

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Look for any comments or documentation about datasets validation
rg -n "datasets" components/webui/common/src/schemas/ -B2 -A2 --type=ts

Repository: y-scope/clp

Length of output: 452


Add minItems constraint to enforce non-empty datasets arrays.

The schema allows empty arrays [], which would be semantically invalid for a search query. The codebase establishes a pattern of constraining arrays with minItems: 1 (as seen in components/webui/common/src/schemas/compression.ts). Align this field with that pattern:

datasets: Type.Array(Type.String(), { minItems: 1 }),

No downstream validation was found to reject empty arrays at the route handler or scheduler level.
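
A small sketch of how the constraint behaves under TypeBox validation, assuming a simplified schema containing only the datasets field (Value.Check is TypeBox's standard validator):

import {Type} from "@sinclair/typebox";
import {Value} from "@sinclair/typebox/value";

// Simplified stand-in for the search request schema; only `datasets` is shown.
const searchRequestSchema = Type.Object({
    datasets: Type.Array(Type.String(), {minItems: 1}),
});

Value.Check(searchRequestSchema, {datasets: []}); // false (empty array rejected)
Value.Check(searchRequestSchema, {datasets: ["default"]}); // true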

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/webui/common/src/schemas/search.ts` at line 16, The datasets array
in the search schema currently allows empty arrays; update the schema definition
for the datasets property to require at least one entry by adding a minItems: 1
constraint to the Type.Array call (i.e., change datasets:
Type.Array(Type.String()) to include { minItems: 1 }); this aligns with the
existing pattern used elsewhere (e.g., compression schema) and ensures
validation rejects empty dataset lists.

Contributor

@sitaowang1998 sitaowang1998 left a comment

I have reviewed the package side. Should we also check whether the number of datasets exceeds max_datasets_per_query in the search script?

Comment on lines 91 to 93
Contributor

Shall we report all datasets that do not exist in the metadata database?

Comment on lines 99 to 100
Contributor

If using the first dataset is the intended behavior for backward compatibility, add a comment to avoid confusion.


# NOTE: This assumes we never delete a dataset.
missing = set(datasets) - existing_datasets
if missing:
Contributor

Prefer an explicit check, even in Python.

Suggested change
if missing:
if len(missing) > 0:

if missing:
existing_datasets.update(fetch_existing_datasets(db_cursor, table_prefix))
missing = set(datasets) - existing_datasets
if missing:
Contributor

Same as above.

Suggested change
if missing:
if len(missing) > 0:

-if dataset not in existing_datasets:
-    logger.error(f"Dataset `{dataset}` doesn't exist.")
+datasets = QueryJobConfig.model_validate(job_config).datasets
+if datasets is not None:
Contributor

This if branch is pretty large; should we extract it into a separate function?

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`:
- Around line 305-308: Replace the f-string logging call so it uses the logger's
lazy formatting instead of pre-evaluating the string: change the logger.error
call that currently interpolates len(datasets) and max_datasets_per_query into a
parameterized log message (referencing variables datasets and
max_datasets_per_query and the logger.error invocation) to pass the values as
arguments to logger.error; keep the same message text and preserve the
subsequent return -1 behavior.
- Around line 300-313: Replace the current exception logging in the dataset
existence check so the full traceback is logged: in the block where
validate_datasets_exist(database_config, datasets) is called (handling
parsed_args.dataset and datasets), change the logger.error(e) call to
logger.exception(e) to mirror the earlier handler and preserve the stack trace
for failures in validate_datasets_exist.

In `@components/clp-package-utils/clp_package_utils/scripts/search.py`:
- Around line 116-128: Replace the boolean positional argument in the call to
clp_config.database.get_clp_connection_params_and_type(True) with a keyword
argument (e.g., get_clp_connection_params_and_type(include_type=True)) to
improve readability, and change the exception logging inside the except block
from logger.error(e) to logger.exception(e) so the traceback is preserved; these
changes should be applied in the block that handles StorageEngine.CLP_S and uses
validate_dataset_name for each dataset.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 27-29: The selector is pulling updateSelectedDatasets as a
reactive function which then gets used as a stable callback; change usage so you
don't subscribe to the function: stop using useSearchStore((state) =>
state.updateSelectedDatasets) and instead call
useSearchStore.getState().updateSelectedDatasets(...) wherever updateDatasets is
invoked (e.g., in DatasetSelect component), and remove updateDatasets from any
useEffect dependency arrays so the hook doesn't depend on a stable function
reference; keep using selectors (useSearchStore((s) => s.selectedDatasets) and
useSearchStore((s) => s.searchUiState)) for reactive state only.

---

Duplicate comments:
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 56-64: The effect using isSuccess, data, datasets,
getFallbackDatasets and updateDatasets currently pushes a new empty array
reference and can loop; change the guard in the useEffect (the block referencing
isSuccess && 0 === datasets.length) to only call
updateDatasets(getFallbackDatasets()) when getFallbackDatasets() actually
returns a non-empty array (e.g., check getFallbackDatasets().length > 0) or when
the fallback differs in length from datasets to avoid storing a new empty array
reference; apply the same guard to the other effect that performs
initial-selection (the one around lines 76–87) so both effects only update
Zustand when there are real dataset items to set.
- Line 52: Remove the redundant type assertion on available[0] in DatasetSelect
(the variable available is already string[] and the preceding length guard
ensures it's defined); replace the expression "[available[0] as string]" with
just "[available[0]]" wherever it appears (e.g., in the component
return/initialization) so the code relies on the correct inferred type and
eliminates the unnecessary "as string" cast.

Comment on lines +300 to 313
datasets = parsed_args.dataset
if datasets is not None:
max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query
if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
logger.error(
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
try:
-validate_dataset_exists(database_config, dataset)
+validate_datasets_exist(database_config, datasets)
except Exception as e:
logger.error(e)
return -1
Contributor

⚠️ Potential issue | 🟡 Minor

logger.error(e) drops the traceback — prefer logger.exception(e) for consistency.

The handler at lines 309–313 silently discards the stack trace. This is inconsistent with the existing handler earlier in the same file (lines 295–297) that uses logger.exception. Use logger.exception(e) so failures in validate_datasets_exist are fully observable.

🔧 Proposed fix
     try:
         validate_datasets_exist(database_config, datasets)
     except Exception as e:
-        logger.error(e)
+        logger.exception(e)
         return -1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
datasets = parsed_args.dataset
if datasets is not None:
max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query
if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
logger.error(
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
try:
validate_dataset_exists(database_config, dataset)
validate_datasets_exist(database_config, datasets)
except Exception as e:
logger.error(e)
return -1
datasets = parsed_args.dataset
if datasets is not None:
max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query
if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
logger.error(
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
try:
validate_datasets_exist(database_config, datasets)
except Exception as e:
logger.exception(e)
return -1
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 305-306: Logging statement uses f-string

(G004)


[warning] 311-311: Do not catch blind exception: Exception

(BLE001)


[warning] 312-312: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`
around lines 300 - 313, Replace the current exception logging in the dataset
existence check so the full traceback is logged: in the block where
validate_datasets_exist(database_config, datasets) is called (handling
parsed_args.dataset and datasets), change the logger.error(e) call to
logger.exception(e) to mirror the earlier handler and preserve the stack trace
for failures in validate_datasets_exist.

Comment on lines +305 to +308
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
Contributor

🧹 Nitpick | 🔵 Trivial

Optional: avoid f-strings in logging calls (Ruff G004).

logger.error(f"Number of datasets ({len(datasets)}) exceeds ...") pre-evaluates the format string unconditionally. While the performance impact at ERROR level is negligible, this pattern is flagged by Ruff G004 and is inconsistent with lazy-evaluation style.

♻️ Suggested refactor
-            logger.error(
-                f"Number of datasets ({len(datasets)}) exceeds"
-                f" max_datasets_per_query={max_datasets_per_query}."
-            )
+            logger.error(
+                "Number of datasets (%d) exceeds max_datasets_per_query=%d.",
+                len(datasets),
+                max_datasets_per_query,
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
logger.error(
"Number of datasets (%d) exceeds max_datasets_per_query=%d.",
len(datasets),
max_datasets_per_query,
)
return -1
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 305-306: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`
around lines 305 - 308, Replace the f-string logging call so it uses the
logger's lazy formatting instead of pre-evaluating the string: change the
logger.error call that currently interpolates len(datasets) and
max_datasets_per_query into a parameterized log message (referencing variables
datasets and max_datasets_per_query and the logger.error invocation) to pass the
values as arguments to logger.error; keep the same message text and preserve the
subsequent return -1 behavior.

Comment on lines +116 to 128
datasets = parsed_args.dataset
if StorageEngine.CLP_S == storage_engine:
-dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
+datasets = [CLP_DEFAULT_DATASET_NAME] if datasets is None else datasets
try:
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+for ds in datasets:
+    validate_dataset_name(clp_db_connection_params["table_prefix"], ds)
except Exception as e:
logger.error(e)
return -1
-elif dataset is not None:
+elif datasets is not None:
logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.")
return -1
Contributor

⚠️ Potential issue | 🟡 Minor

Two minor issues in the exception handling block.

  1. Line 120 — Boolean positional argument reduces call-site readability. Use the keyword form.
  2. Lines 123–124: logger.error(e) only logs the message, not the traceback. This is inconsistent with the existing handler at lines 103–105 that uses logger.exception. Prefer logger.exception(e) so the stack trace is preserved.
🔧 Proposed fix
-            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(
+                disable_localhost_socket_connection=True
+            )
             for ds in datasets:
                 validate_dataset_name(clp_db_connection_params["table_prefix"], ds)
         except Exception as e:
-            logger.error(e)
+            logger.exception(e)
             return -1
🧰 Tools
🪛 Ruff (0.15.1)

[error] 120-120: Boolean positional value in function call

(FBT003)


[warning] 123-123: Do not catch blind exception: Exception

(BLE001)


[warning] 124-124: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


[warning] 127-127: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/search.py` around
lines 116 - 128, Replace the boolean positional argument in the call to
clp_config.database.get_clp_connection_params_and_type(True) with a keyword
argument (e.g., get_clp_connection_params_and_type(include_type=True)) to
improve readability, and change the exception logging inside the except block
from logger.error(e) to logger.exception(e) so the traceback is preserved; these
changes should be applied in the block that handles StorageEngine.CLP_S and uses
validate_dataset_name for each dataset.

Comment on lines +27 to +29
const datasets = useSearchStore((state) => state.selectedDatasets);
const searchUiState = useSearchStore((state) => state.searchUiState);
const updateDatasets = useSearchStore((state) => state.updateSelectedDatasets);
Contributor

🧹 Nitpick | 🔵 Trivial

Use useSearchStore.getState() for the updateDatasets callback.

Per the team's Zustand convention, store methods used as callbacks (not for reactive rendering) should be accessed via useStore.getState().method rather than through a selector. This avoids listing a stable function reference as a useEffect dependency and aligns with the project's existing patterns.

Proposed fix
     const datasets = useSearchStore((state) => state.selectedDatasets);
     const searchUiState = useSearchStore((state) => state.searchUiState);
-    const updateDatasets = useSearchStore((state) => state.updateSelectedDatasets);

Then replace all updateDatasets(...) calls (and remove it from useEffect dependency arrays) with useSearchStore.getState().updateSelectedDatasets(...).

Based on learnings: "use useStore.getState().method for callbacks since the output is not reactive and doesn't need state as a dependency in the hook, and use useStore((state) => state.property) with proper selectors for reactive components that need to re-render when state changes."
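
For illustration, a self-contained sketch of the convention with a stand-in store; the store shape, component, and fallback value here are assumptions, not the actual DatasetSelect code:

import {useEffect} from "react";
import {create} from "zustand";

// Minimal stand-in for the real search store.
interface SearchState {
    selectedDatasets: string[];
    updateSelectedDatasets: (datasets: string[]) => void;
}

const useSearchStore = create<SearchState>((set) => ({
    selectedDatasets: [],
    updateSelectedDatasets: (datasets) => set({selectedDatasets: datasets}),
}));

const DatasetPicker = () => {
    // Reactive read: the component re-renders when the selection changes.
    const datasets = useSearchStore((state) => state.selectedDatasets);

    useEffect(() => {
        if (0 === datasets.length) {
            // Imperative update via getState(): no function reference has to be
            // listed in the dependency array.
            useSearchStore.getState().updateSelectedDatasets(["default"]);
        }
    }, [datasets]);

    return null;
};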

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`
around lines 27 - 29, The selector is pulling updateSelectedDatasets as a
reactive function which then gets used as a stable callback; change usage so you
don't subscribe to the function: stop using useSearchStore((state) =>
state.updateSelectedDatasets) and instead call
useSearchStore.getState().updateSelectedDatasets(...) wherever updateDatasets is
invoked (e.g., in DatasetSelect component), and remove updateDatasets from any
useEffect dependency arrays so the hook doesn't depend on a stable function
reference; keep using selectors (useSearchStore((s) => s.selectedDatasets) and
useSearchStore((s) => s.searchUiState)) for reactive state only.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/utils.py (1)

75-93: ⚠️ Potential issue | 🔴 Critical

Update decompress.py and archive_manager.py to use the renamed function.

The old validate_dataset_exists function has been removed and replaced with validate_datasets_exist. Two files still reference the old function and will fail on import:

  • decompress.py (lines 39, 151)
  • archive_manager.py (lines 31, 203)

Both files must be updated to import validate_datasets_exist and pass datasets as a list (e.g., validate_datasets_exist(config, [dataset]) instead of validate_dataset_exists(config, dataset)).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`
around lines 75 - 93, Replace the old validate_dataset_exists usage with the new
validate_datasets_exist: update the imports in the modules that still call the
old function to import validate_datasets_exist, and change call sites that
currently pass a single string (e.g., validate_dataset_exists(config, dataset))
to pass a list/iterable (e.g., validate_datasets_exist(config, [dataset]) or
validate_datasets_exist(config, datasets_list)); ensure you reference the
function name validate_datasets_exist and update both call sites that previously
used validate_dataset_exists so they pass a list of dataset names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`:
- Around line 92-93: Replace the non-idiomatic check and raw list error with a
clearer, friendlier exception: change `if len(missing) > 0:` to `if missing:`,
and raise either a dedicated exception (e.g., define `class
DatasetNotFoundError(ValueError)` and raise that) or at minimum a shorter
ValueError using a joined string for readability like `raise
ValueError(f"Datasets not found: {', '.join(missing)}")`; refer to the existing
`missing` variable and the current `ValueError` site to locate and update the
code.

---

Outside diff comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`:
- Around line 75-93: Replace the old validate_dataset_exists usage with the new
validate_datasets_exist: update the imports in the modules that still call the
old function to import validate_datasets_exist, and change call sites that
currently pass a single string (e.g., validate_dataset_exists(config, dataset))
to pass a list/iterable (e.g., validate_datasets_exist(config, [dataset]) or
validate_datasets_exist(config, datasets_list)); ensure you reference the
function name validate_datasets_exist and update both call sites that previously
used validate_dataset_exists so they pass a list of dataset names.

---

Duplicate comments:
In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`:
- Around line 99-100: The code uses extract_json_config.datasets[0] guarded only
by "datasets is not None", which still raises IndexError for an empty list;
update the logic around the dataset assignment in extract_stream_task.py (the
block that sets dataset = extract_json_config.datasets[0] if
extract_json_config.datasets is not None else None) to ensure the list is
non-empty before indexing—e.g., check "if extract_json_config.datasets and
len(extract_json_config.datasets) > 0" or use a safe extractor like
"next(iter(extract_json_config.datasets), None)" so dataset falls back to None
for empty lists.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 736-738: When datasets is None before calling
get_archives_for_search, set datasets to the full list of existing dataset IDs
instead of passing None; update the code around where archives_for_search is
computed so that if datasets is None you call the project/dataset discovery
helper (e.g., a function that lists all datasets using db_conn/table_prefix or
an existing helper like list_all_datasets/get_all_dataset_ids) and assign that
list to datasets, then call get_archives_for_search(db_conn, table_prefix,
search_config, archive_end_ts_lower_bound, datasets) so the for ds in ... loop
in get_archives_for_search never receives None.
- Around line 178-179: The current guard uses "is not None" but still allows an
empty list and can cause IndexError when accessing
self.__job_config.datasets[0]; update the logic around extraction_dataset in
query_scheduler.py so you only index when datasets is truthy (e.g., check if
self.__job_config.datasets before taking [0]) and pass None to archive_exists
when datasets is empty or missing; change the assignment of extraction_dataset
and the subsequent archive_exists call to use this falsy check so empty lists
are handled defensively.
- Around line 417-427: The SQL currently injects the dataset name directly into
the query via f"'{ds}'" in the loop (see variables union_parts, ds,
get_archives_table_name, where_clause, query); change this to use a parameter
placeholder for the dataset literal while continuing to use
get_archives_table_name(table_prefix, ds) for the table identifier (table names
remain non-parameterized). Concretely, build each union part with a placeholder
(e.g., %s or the DB driver's placeholder) instead of embedding '{ds}', append
the corresponding ds value to a params list for each union part, and when
executing the final " UNION ALL ... ORDER BY end_timestamp DESC" query pass the
accumulated params so dataset values are bound safely. Ensure any existing
filter_clauses parameters are merged into the same params list in the right
order before executing.

Comment on lines +92 to +93
if len(missing) > 0:
raise ValueError(f"Dataset(s) {missing} don't exist.")
Contributor

🧹 Nitpick | 🔵 Trivial

Simplify truthiness check and improve error message readability.

Three minor issues on these two lines:

  1. len(missing) > 0 — prefer the idiomatic if missing:.
  2. Ruff TRY003: the long message is passed inline to ValueError rather than encapsulated in a custom exception (or at least a short message). Consider using a dedicated exception class or a shorter message.
  3. {missing} renders the raw Python set repr (e.g., {'ds1', 'ds2'}); joining the names is more user-friendly.
♻️ Proposed refactor
-        if len(missing) > 0:
-            raise ValueError(f"Dataset(s) {missing} don't exist.")
+        if missing:
+            raise ValueError(f"Dataset(s) don't exist: {', '.join(missing)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if len(missing) > 0:
raise ValueError(f"Dataset(s) {missing} don't exist.")
if missing:
raise ValueError(f"Dataset(s) don't exist: {', '.join(missing)}")
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 93-93: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`
around lines 92 - 93, Replace the non-idiomatic check and raw list error with a
clearer, friendlier exception: change `if len(missing) > 0:` to `if missing:`,
and raise either a dedicated exception (e.g., define `class
DatasetNotFoundError(ValueError)` and raise that) or at minimum a shorter
ValueError using a joined string for readability like `raise
ValueError(f"Datasets not found: {', '.join(missing)}")`; refer to the existing
`missing` variable and the current `ValueError` site to locate and update the
code.
