feat!: Add multi-dataset query support (resolves #1933). #1992

Open
junhaoliao wants to merge 17 commits into y-scope:main from junhaoliao:multi-dataset

Conversation

@junhaoliao
Member

@junhaoliao junhaoliao commented Feb 14, 2026

Description

Add multi-dataset query support, allowing users to search across multiple datasets in a single query.
This is a breaking change — the dataset field in the query job config is replaced with datasets: list[str] (Python) / Vec<String> (Rust) / Type.Array(Type.String()) (TypeScript) across all query entry points (Web UI, API server, CLI).
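
To illustrate the shape of the change, here is a minimal Python sketch of the new field based on the description above; it is not the exact code in job_config.py, and the surrounding fields are omitted.

from pydantic import BaseModel


class QueryJobConfig(BaseModel):
    """Illustrative only; mirrors the datasets field described in this PR."""

    # Before this PR: dataset: str | None (a single dataset).
    # After this PR: one or more dataset names to query in a single job.
    datasets: list[str]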

Impact Assessment

  • What is affected: All query entry points (Web UI, Rust API server, CLI), the query scheduler, query workers, C++ clp-s binary output handler, and result rendering in the Web UI.
  • Why the change is needed: Users need to search across multiple datasets simultaneously without submitting separate queries for each dataset. This enables cross-dataset log correlation and analysis workflows.
  • Implications: This is a breaking change. The dataset field is removed from the query job config in favour of datasets. Existing API clients, stored queries, and CLI scripts that use --dataset must be updated to use --datasets. The scheduler enforces a configurable max_datasets_per_query limit to prevent resource exhaustion.

Key changes by layer:

  1. Job config schema — QueryJobConfig.dataset → QueryJobConfig.datasets in Python, Rust, and
    TypeScript.

    • components/job-orchestration/job_orchestration/scheduler/job_config.py
    • components/clp-py-utils/clp_py_utils/clp_config.py
    • components/clp-rust-utils/src/job_config/search.rs
    • components/webui/common/src/schemas/search.ts
    • Code owners: @hoophalab, @LinZhihao-723
  2. Configuration — Add max_datasets_per_query (default 10, null for unlimited) to
    QueryScheduler config in clp-config.yaml.

    • components/clp-py-utils/clp_py_utils/clp_config.py
    • components/package-template/src/etc/clp-config.template.text.yaml
    • components/package-template/src/etc/clp-config.template.json.yaml
    • Code owners: @sitaowang1998
  3. Scheduler fan-out — query_scheduler.py loops over all requested datasets, fetches archives
    from each dataset's metadata table, merge-sorts by begin_timestamp, validates dataset existence
    and count limits, and passes dataset as an explicit Celery task argument so each worker knows
    which dataset it is querying (a simplified sketch follows after this list).

  4. C++ core results attribution — Add --dataset CLI flag to clp-s and include a dataset
    field in each BSON result document written to the results cache, enabling per-result dataset
    attribution.

    • components/core/src/clp_s/CommandLineArguments.cpp
    • components/core/src/clp_s/CommandLineArguments.hpp
    • components/core/src/clp_s/OutputHandlerImpl.cpp
    • components/core/src/clp_s/OutputHandlerImpl.hpp
    • components/core/src/clp_s/archive_constants.hpp
    • components/core/src/clp_s/clp-s.cpp
    • Code owners: @gibber9809
  5. Worker execution — fs_search_task.py and extract_stream_task.py accept dataset as an
    explicit Celery task parameter and pass it to the clp-s binary via --dataset.

    • components/job-orchestration/job_orchestration/executor/query/fs_search_task.py
    • components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py
    • Code owners: @gibber9809, @sitaowang1998
  6. Web UI client & Web UI server — The dataset selector changes from single-select to multi-select (selectDataset: string | null → selectDatasets: string[]). LogViewerLink reads dataset from each search result document instead of the global cached state. The "All Time" range uses UNION ALL across multiple datasets. The multi-select dropdown uses responsive tag collapsing and defaults to the "default" dataset when the selection is emptied. Updated Fastify routes to accept a datasets array.

    • components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.module.css
    • components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/Native/SearchButton/SubmitButton/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/Presto/Guided/presto-guided-search-requests.ts
    • components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts
    • components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/utils.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/Message/LogViewerLink.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/Message/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/SearchResultsVirtualTable/typings.tsx
    • components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTimeline/Native/NativeResultsTimeline.tsx
    • components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/index.tsx
    • components/webui/client/src/pages/SearchPage/SearchState/index.tsx
    • components/webui/server/src/routes/api/search/index.ts
    • components/webui/common/src/schemas/search.ts
    • Code owners: @hoophalab, @davemarco
  7. API server — Updated routes to accept a datasets array.

  8. CLI tools — --dataset → --datasets with nargs="+". validate_dataset_exists →
    validate_datasets_exist to validate all datasets before submission.

    • components/clp-package-utils/clp_package_utils/scripts/search.py
    • components/clp-package-utils/clp_package_utils/scripts/native/search.py
    • components/clp-package-utils/clp_package_utils/scripts/native/utils.py
    • Code owners: @gibber9809, @hoophalab
  9. Docs — Updated API server guide, quick-start guide, and regenerated OpenAPI spec to reflect
    datasets (plural).

    • docs/src/_static/generated/api-server-openapi.json
    • docs/src/user-docs/guides-using-the-api-server.md
    • docs/src/user-docs/quick-start/clp-json.md
    • Code owners: @hoophalab, @kirkrodrigues
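
The following is the simplified Python sketch of the scheduler fan-out referenced in item 3. It is not the actual query_scheduler.py code: the helper names and signatures are assumptions, and the real scheduler also applies timestamp filters when building the WHERE clause.

def build_archives_query(table_prefix: str, datasets: list[str], where_clause: str) -> str:
    """Builds one UNION ALL query over each dataset's archives metadata table."""
    # Dataset names are validated against the existing datasets before this point.
    union_parts = [
        f"SELECT id AS archive_id, end_timestamp, '{ds}' AS dataset"
        f" FROM {table_prefix}{ds}_archives{where_clause}"
        for ds in datasets
    ]
    return " UNION ALL ".join(union_parts) + " ORDER BY end_timestamp DESC"


def check_dataset_limit(datasets: list[str], max_datasets_per_query: int | None) -> None:
    """Enforces the configurable per-query dataset limit (None means unlimited)."""
    if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
        raise ValueError(
            f"{len(datasets)} datasets requested, but the limit is {max_datasets_per_query}."
        )

Each row returned by the UNION ALL carries its originating dataset name, which is what lets the scheduler pass dataset as an explicit argument to each Celery task.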

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

0. Build CLP package

Task: Verify the project builds successfully with all changes (C++, Rust, Python, TypeScript).

Command:

task

Output:

...
#30 exporting manifest list sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863
#30 exporting manifest list sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863 0.0s done
#30 naming to moby-dangling@sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863 done
#30 unpacking to moby-dangling@sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863
#30 unpacking to moby-dangling@sha256:6559f02cc7654b3e2366ce77a0c8d0057b54c3e7d2765549bc4e3409632cd863 2.2s done
#30 DONE 7.7s
task: [package] echo '0.9.1-dev' > '/home/junhao/workspace/5-clp/build/clp-package/VERSION'

Build completed successfully — all C++ targets, Rust binaries, Python packages, and TypeScript
client compiled without errors.

1. Start CLP and compress sample data

Task: Start CLP and run a compression job to populate the default dataset with data for
subsequent search tests.

Commands:

cd build/clp-package
./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl

Output:

...
2026-02-14T07:33:48.292 INFO [controller] Started CLP.

2026-02-14T07:33:53.291 INFO [compress] Compression job 1 submitted.
2026-02-14T07:33:55.295 INFO [compress] Compressed 392.84MB into 9.94MB (39.53x). Speed: 204.49MB/s.
2026-02-14T07:33:55.795 INFO [compress] Compression finished.
2026-02-14T07:33:55.796 INFO [compress] Compressed 392.84MB into 9.94MB (39.53x). Speed: 182.59MB/s.

All containers started and passed health checks. Compression succeeded.

2. Search with --datasets flag (single dataset)

Task: Verify the new --datasets CLI flag works for a single dataset with count aggregation.

Command:

./sbin/search.sh "*" --datasets default --count

Output:

tags: [] count: 1000000

Explanation: The search correctly targets the default dataset using the new --datasets flag
and returns the expected 1,000,000 log events.

3. Search without --datasets flag (defaults to default)

Task: Verify that omitting the --datasets flag defaults to querying the default dataset.

Command:

./sbin/search.sh "*" --count

Output:

tags: [] count: 1000000

Explanation: Without --datasets, the CLI defaults to ["default"], producing the same result
as the explicit dataset selection in step 2.
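
For reference, a minimal argparse sketch of a repeatable --datasets flag that falls back to ["default"] when omitted (illustrative only; the flag wiring in the real search.py scripts may differ):

import argparse

parser = argparse.ArgumentParser(description="Submit a search job.")
parser.add_argument("query", help="The query string.")
parser.add_argument(
    "--datasets",
    nargs="+",
    default=["default"],
    help="Dataset(s) to search; defaults to the `default` dataset.",
)
parser.add_argument("--count", action="store_true", help="Only report the match count.")

args = parser.parse_args(["*", "--count"])
print(args.datasets)  # ['default'], matching the behaviour shown above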

4. Search with non-existent dataset (validation)

Task: Verify that specifying a non-existent dataset produces a clear error and exits with a
non-zero status code.

Command:

./sbin/search.sh "*" --datasets nonexistent --count

Output:

2026-02-14T07:34:22.140 ERROR [search] Dataset `nonexistent` doesn't exist.
2026-02-14T07:34:22.253 ERROR [search] Search failed.

Explanation: The validate_datasets_exist function correctly rejects the non-existent dataset
before submitting the query job, preventing wasted scheduler/worker resources.
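
A Python sketch of the kind of check validate_datasets_exist performs (the signature here is a simplification; the real helper in native/utils.py reads existing dataset names from the CLP metadata database):

def validate_datasets_exist(requested: list[str], existing: set[str]) -> None:
    """Rejects unknown datasets before the query job is submitted."""
    for dataset in requested:
        if dataset not in existing:
            raise ValueError(f"Dataset `{dataset}` doesn't exist.")


# Mirrors the failure shown in this validation step:
try:
    validate_datasets_exist(["nonexistent"], {"default"})
except ValueError as err:
    print(f"ERROR [search] {err}")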

5. Web UI: Dataset multi-select rendering

Task: Verify the dataset selector renders correctly in single-select and multi-select states without
vertical overflow.

Method: Opened http://localhost:8080/search and tested the dataset selector.

Single dataset (1 selected):

  • The selected tag (anything ×) renders in full — no collapsed + 1 ... indicator.
  • The selector stays at its natural content width.

Multiple datasets (3 selected):

  • The selector expands to a fixed 200px width.
  • Shows the first tag that fits (anything ×) with a + 2 ... indicator for the remaining.
  • All content stays in a single row — no vertical overflow.

Explanation: maxTagCount="responsive" is conditionally applied only when more than one dataset is
selected. With a single dataset, responsive collapsing is disabled so the tag always renders in full.
When multiple datasets are selected, the container expands to 200px (via the selectContainerExpanded
CSS class) giving rc-overflow a known width to measure against for responsive tag collapsing.

6. Stop CLP

Command:

./sbin/stop-clp.sh

Output:

...
2026-02-14T07:34:48.274 INFO [controller] Stopped CLP.

Summary by CodeRabbit

  • New Features

    • Search now supports selecting and querying multiple datasets at once.
    • Added configurable per-query dataset limit (default 10).
  • UI

    • Dataset selector updated to multi-select with improved layout, fallback and selection limits.
    • Search submission, results timeline and related controls adapted for multi-dataset workflows.
  • API & CLI

    • API request payloads use a datasets array instead of a single dataset.
    • CLI accepts repeated --datasets entries for scoping queries.
  • Documentation

    • Examples and quick-start updated to show datasets array usage.

@coderabbitai
Contributor

coderabbitai bot commented Feb 14, 2026

Walkthrough

Replaces single dataset (string/null) with datasets (array) across APIs, job models, scheduler, CLP tooling, core results path, executor tasks, and Web UI; adds per-query dataset limit (config + UI setting) and threads dataset metadata through results and CLI invocations.

Changes

  • API & Job Models — components/api-server/src/client.rs, components/api-server/src/routes.rs, components/clp-rust-utils/src/job_config/search.rs, components/webui/common/src/schemas/search.ts
    Replaced single dataset with plural datasets (Option<Vec> / Type.Array(Type.String())); updated conversions, OpenAPI example, and schema shapes.
  • Scheduler & Orchestration — components/job-orchestration/.../scheduler/job_config.py, components/job-orchestration/.../scheduler/query/query_scheduler.py
    QueryJobConfig now has `datasets: list[str]
  • Python CLP Utilities & CLI — components/clp-package-utils/clp_package_utils/scripts/native/search.py, .../scripts/search.py, .../scripts/native/utils.py, components/clp-package-utils/clp_package_utils/controller.py
    CLI --dataset → --datasets (append); function signatures updated to accept `datasets: list[str]
  • CLP Core & Results Cache — components/core/src/clp_s/CommandLineArguments.hpp, .../CommandLineArguments.cpp, components/core/src/clp_s/OutputHandlerImpl.hpp, .../OutputHandlerImpl.cpp, components/core/src/clp_s/archive_constants.hpp, components/core/src/clp_s/clp-s.cpp
    Added CLI dataset option and member; extended ResultsCacheOutputHandler/QueryResult to include dataset, persist cDataset in documents, and updated constructor/signatures and call sites to pass dataset.
  • Executor Tasks — components/job-orchestration/.../executor/query/extract_stream_task.py, .../fs_search_task.py
    Threaded optional dataset through command/env builders; CLP_S command builders accept/use dataset (fallback to first dataset when needed); appended --dataset flag when provided.
  • Web UI - State, Settings & Config — components/webui/client/src/pages/SearchPage/SearchState/index.tsx, components/webui/client/public/settings.json, components/webui/client/src/config/index.ts, components/webui/client/src/settings.ts, components/package-template/.../clp-config.template.*
    Replaced singular store fields with selectedDatasets: string[] and queriedDatasets: string[]; added MaxDatasetsPerQuery setting and exported SETTINGS_MAX_DATASETS_PER_QUERY; added max_datasets_per_query in config templates.
  • Web UI - Dataset Selection & Styling — components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx, .../index.tsx, .../index.module.css
    Converted dataset selector to multi-select, added fallback logic, conditional expanded styling and responsive tag handling.
  • Web UI - Submission, Metrics, Time Range, Results — multiple files under components/webui/client/src/pages/SearchPage/... (SubmitButton, Presto guided, QuerySpeed utils, TimeRangeInput, TimestampKeySelect, NativeResultsTimeline, SearchResults components, etc.)
    Updated components and utilities to accept/use arrays of datasets (use first dataset where single-dataset semantics required), updated SQL builders to UNION across datasets, threaded dataset into result rendering and LogViewerLink, and adjusted hooks/queries to use selectedDatasets/queriedDatasets.
  • WebUI Server & Stream File Manager / Routes — components/webui/server/src/plugins/app/StreamFileManager.ts, components/webui/server/src/routes/api/search/index.ts
    submitAndWaitForExtractStreamJob and route handlers now use datasets array in jobConfig and request payload wiring.
  • Docs & Quick Start — docs/src/user-docs/guides-using-the-api-server.md, docs/src/user-docs/quick-start/clp-json.md
    Examples and prose updated to show datasets: ["default"] and document multi-dataset CLI usage and UI label pluralization.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant WebUI as Web UI (React store)
    participant API as API Server
    participant Scheduler as Query Scheduler
    participant Executor as Task Executor
    participant CLP_S as Search Engine (CLP_S)

    User->>WebUI: select one or more datasets
    WebUI->>WebUI: store selectedDatasets[]
    User->>WebUI: submit query (datasets[])
    WebUI->>API: POST /query with datasets[]
    API->>Scheduler: create/dispatch QueryJob(datasets[])
    Scheduler->>Scheduler: validate datasets & enforce max_datasets_per_query
    Scheduler->>Scheduler: discover archives per dataset -> archives[{archive_id,dataset},...]
    Scheduler->>Executor: create tasks with archive dicts
    Executor->>CLP_S: run search/extract per archive with --dataset <dataset>
    CLP_S->>Executor: return results (include dataset)
    Executor->>API: store results in results cache (includes dataset)
    API->>WebUI: return results (contains dataset)
    WebUI->>User: render results (dataset-aware)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 51.43%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title clearly and specifically describes the main change: adding multi-dataset query support, which is the primary objective of this PR.


Comment on lines +608 to 612
)(
"dataset",
po::value<std::string>(&m_dataset)->value_name("DATASET"),
"The dataset name to include in each result document"
);
Member Author

The dataset name needs to be included in the query results before dumping them into the results cache. I know this feels like a hack. A more future-proof interface might be to accept a JSON object string via --extra-result-metadata and merge such objects with the results before dumping them into the results cache. What do you think?

Member Author

Alternatively, we can add multi-dataset support to the stream extraction flow as well. Then we would just pass an array of datasets in the stream extraction config, avoiding touching the clp-s binary.

I think this may be more aligned with the future plan of using a single table for all datasets (though maybe I don't have a correct understanding of how we are going to refactor the datasets feature).

@junhaoliao junhaoliao marked this pull request as ready for review February 17, 2026 20:00
@junhaoliao junhaoliao requested review from a team and gibber9809 as code owners February 17, 2026 20:00
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx (1)

99-100: ⚠️ Potential issue | 🟡 Minor

Coding guideline: prefer false === isQueryReady over !isQueryReady.

Proposed fix
-                disabled={!isQueryReady ||
+                disabled={false === isQueryReady ||
                     searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}

As per coding guidelines: **/*.{tsx}: "Prefer false === <expression> rather than !<expression>."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`
around lines 99 - 100, In GuidedRunButton update the disabled conditional to
avoid using negation: replace the `!isQueryReady` check with `false ===
isQueryReady` so the prop check reads `disabled={false === isQueryReady ||
searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}` (keep the existing
searchUiState comparison unchanged); this targets the isQueryReady boolean usage
in the GuidedRunButton component to comply with the coding guideline.
components/api-server/src/client.rs (1)

27-56: ⚠️ Potential issue | 🟡 Minor

Consider validating that datasets is not an empty vector.

A client can submit "datasets": [], which would pass the is_none() check on Line 131 and propagate an empty dataset list to the scheduler. If the scheduler doesn't guard against this, it could cause unexpected behaviour. Consider adding a validation or normalising empty vectors to None:

🛡️ Suggested guard in submit_query
     let mut search_job_config: SearchJobConfig = query_config.into();
-    if search_job_config.datasets.is_none() {
+    if search_job_config.datasets.as_ref().is_none_or(|d| d.is_empty()) {
         search_job_config.datasets = match self.config.package.storage_engine {
             StorageEngine::Clp => None,
             StorageEngine::ClpS => Some(vec!["default".to_owned()]),
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/api-server/src/client.rs` around lines 27 - 56, QueryConfig allows
"datasets": [] which should be treated as None; update the code that handles
incoming QueryConfig (e.g., in submit_query) to normalise or validate the
datasets field by checking QueryConfig::datasets and if Some(vec) &&
vec.is_empty() either set it to None before further processing or return a 4xx
validation error; reference the QueryConfig struct and the submit_query handler
so the empty-vector case is handled early and you never propagate an empty
Vec<String> to the scheduler.
components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts (1)

28-48: ⚠️ Potential issue | 🟡 Minor

Ensure jobId uses parameterized query or add format validation.

jobId originates from the backend API response (trusted internal source), mitigating direct injection risk. However, it's typed as string (not numeric) and is interpolated directly into SQL without frontend validation. Adopt parameterized queries or validate that jobId matches an expected format before interpolation to follow SQL injection prevention best practices.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`
around lines 28 - 48, The SQL string builder in utils.ts interpolates jobId
directly into the query (see archivesSubquery and the template that uses
${jobId}), which risks injection because jobId is a string; fix it by switching
to a parameterized query or validating/coercing jobId to a strict numeric format
before interpolation—either accept a numeric jobId argument
(Number.parseInt/Number and reject NaN) or replace the inline ${jobId} use with
a parameter placeholder and pass jobId as a bound parameter from the caller
(update the function that returns this query and its consumers to support
parameter binding).
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

504-537: 🧹 Nitpick | 🔵 Trivial

Consider using zip instead of index-based iteration.

The for i in range(len(archives)) pattern at lines 524 and 537 could be cleaner with zip(archives, task_ids).

Suggested refactor
     if QueryJobType.SEARCH_OR_AGGREGATION == job_type:
         return celery.group(
             search.s(
                 job_id=job.id,
-                archive_id=archives[i]["archive_id"],
-                task_id=task_ids[i],
+                archive_id=archive["archive_id"],
+                task_id=task_id,
                 job_config=job_config,
-                dataset=archives[i].get("dataset"),
+                dataset=archive.get("dataset"),
                 clp_metadata_db_conn_params=clp_metadata_db_conn_params,
                 results_cache_uri=results_cache_uri,
             )
-            for i in range(len(archives))
+            for archive, task_id in zip(archives, task_ids)
         )

Apply the same pattern to the EXTRACT_JSON/EXTRACT_IR branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 504 - 537, The code in get_task_group_for_job uses index-based
iteration (for i in range(len(archives))) to pair archives and task_ids when
building celery.group for QueryJobType.SEARCH_OR_AGGREGATION and
QueryJobType.EXTRACT_JSON/EXTRACT_IR; replace those loops with zip(archives,
task_ids) to iterate pairs directly and update the generator expressions that
call search.s(...) and extract_stream.s(...) to use the unpacked archive and
task_id instead of archives[i] and task_ids[i]; ensure you reference job.id,
job_config, archive.get("dataset"), clp_metadata_db_conn_params and
results_cache_uri unchanged while switching both branches to the zip-based
pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-py-utils/clp_py_utils/clp_config.py`:
- Line 435: The config field max_datasets_per_query (in clp_config.py) lacks
documentation for None semantics—add a brief inline comment next to its
declaration explaining that setting this field to None (or null in YAML/JSON)
means "unlimited" datasets per query; keep the comment concise and on the same
line or immediately above the declaration so users editing config files
immediately see that None/null = unlimited.

In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`:
- Around line 99-103: The code currently allows dataset to remain None if
extract_json_config.datasets is empty, which leads to invalid paths when
building s3_object_key and local paths; update the start of the logic in
extract_stream_task (the block using dataset, extract_json_config.datasets,
StorageType.S3 and s3_config.key_prefix) to validate dataset after resolving
from extract_json_config.datasets and, if still None, return an error (or raise
an exception) early with a clear message instead of proceeding to build
f"{s3_config.key_prefix}{dataset}/{archive_id}" or using get_directory() /
dataset; ensure any callers handle the early error accordingly.

In
`@components/job-orchestration/job_orchestration/executor/query/fs_search_task.py`:
- Around line 122-125: The CLP_S branch passes dataset (which can be None) into
_make_core_clp_s_command_and_env_vars which expects a str and will break
(f-string will include "None" and Path / None raises TypeError); fix by adding a
guard in _make_command_and_env_vars before the StorageEngine.CLP_S branch to
validate dataset is not None (raise a ValueError with a clear message) or
require/convert it to a string, or change _make_core_clp_s_command_and_env_vars
to accept Optional[str] and handle None safely; reference StorageEngine.CLP_S,
_make_command_and_env_vars, and _make_core_clp_s_command_and_env_vars when
applying the guard or signature change.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 178-180: The code silently selects only the first dataset via
extraction_dataset = self.__job_config.datasets[0] which can confuse reviewers;
add a concise inline comment next to that line explaining that extraction jobs
intentionally use only the first dataset (single-archive behavior) and that
multi-dataset context is not applicable here, referencing __job_config.datasets,
extraction_dataset, and the subsequent archive_exists check to make the
rationale clear.
- Around line 417-427: The current code directly interpolates dataset names into
the SELECT string (in the loop building union_parts), which risks SQL injection;
instead build the SELECT with a parameter placeholder for the dataset value and
accumulate a params list to supply to the DB driver. Concretely, keep using
get_archives_table_name(table_prefix, ds) for the table name (since table names
cannot be parameterized), but change the appended string to use a parameter
placeholder (e.g. "%s" or "?" matching the DB adapter) like "SELECT id AS
archive_id, end_timestamp, ? AS dataset FROM {table}{where_clause}", and append
the ds value to a params list for each union part; after joining union_parts
into query, return/execute the query together with the flattened params list so
the dataset values are passed as bound parameters rather than interpolated.
- Around line 736-738: The call to get_archives_for_search with datasets can
pass None (causing a TypeError in get_archives_for_search's for ds in datasets),
so before calling get_archives_for_search (where archives_for_search is
assigned) add a guard: if datasets is None, populate datasets with all available
datasets (e.g., call the existing helper that lists datasets such as
get_all_datasets/db helper or query the DB via db_conn/table_prefix to return a
list of dataset names), or explicitly set datasets = [] if the intended
semantics are “no datasets”; then call get_archives_for_search(db_conn,
table_prefix, search_config, archive_end_ts_lower_bound, datasets) so the
function always receives an iterable list.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 55-63: The useEffect that checks isSuccess && datasets.length ===
0 keeps calling updateDatasets(getFallbackDatasets()) even when
getFallbackDatasets() returns a new empty array reference, causing an infinite
re-render loop; modify the effect in DatasetSelect.tsx to early-return if data
is an empty array (data.length === 0) or only call updateDatasets when the
fallback actually changes (e.g., fallback has length > 0 or differs from
datasets by length/content) so that updateDatasets is not invoked with a new
empty array reference repeatedly; specifically guard the useEffect (and the
related effect that calls updateDatasets([])) to avoid setting an
identical-empty selection by comparing lengths or contents before calling
updateDatasets.
- Around line 43-52: In getFallbackDatasets, remove the unnecessary type
assertion on available[0]; since available is derived as string[] (const
available = data || []) and you already check length > 0, replace the returned
array [available[0] as string] with [available[0]] to avoid the redundant cast
(references: function getFallbackDatasets, variable available, constant
CLP_DEFAULT_DATASET_NAME).

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Native/SearchButton/SubmitButton/index.tsx`:
- Around line 107-112: Update the inline comment above the isNoDatasetsAndClpS
computation to reference "datasets" plural and state that clp-s requires at
least one dataset; locate the comment surrounding the isNoDatasetsAndClpS
constant (which checks selectedDatasets.length and SETTINGS_STORAGE_ENGINE
against CLP_STORAGE_ENGINES.CLP_S) and change the phrasing to something like
"Submit button must be disabled if there are no datasets since clp-s requires at
least one dataset for queries."

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/Guided/presto-guided-search-requests.ts`:
- Line 74: The conditional guarding the destructured variable `from` uses an
unnecessary typeof check; replace the `if ("undefined" === typeof from)` with a
direct undefined comparison (e.g., `if (from === undefined)`) in
presto-guided-search-requests.ts so the code is clearer and still correctly
handles `string | undefined` for `from`.
- Around line 71-76: The code currently destructures selectedDatasets to only
use the first entry (const [from] = selectedDatasets) which silently ignores
additional selections; update all guided-Presto entry points (the usages in
presto-guided-search-requests.ts, GuidedRunButton, TimestampKeySelect and
buildPrestoGuidedQueries) to detect when
useSearchStore.getState().selectedDatasets.length > 1 and handle it explicitly:
either prevent guided mode (disable the run button and input controls) or
surface a prominent, user-visible warning/toast/banner that multiple datasets
are not supported in guided mode and block execution until the user reduces the
selection to one; do not silently proceed by using only selectedDatasets[0].
Ensure the same single-check-and-block logic and a clear error/warning message
is implemented wherever selectedDatasets is currently destructured.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`:
- Around line 37-40: The button currently reads selectedDatasets via
useSearchStore.getState() which is non-reactive so `from` (and consequently
`isQueryReady`) won't update on selection changes; replace the non-reactive call
with a reactive selector such as const selectedDatasets = useSearchStore(state
=> state.selectedDatasets) and compute `from` from that reactive value (keeping
the `from` name and downstream references intact) so the component re-renders
when selection changes.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`:
- Around line 15-26: The SQL-building code interpolates datasetNames directly
into table identifiers (see variable archivesSubquery and
settings.SqlDbClpTablePrefix) which is an injection risk; fix by introducing a
single helper like escapeSqlIdentifier(name) that either validates names against
a safe whitelist regex (e.g. only [A-Za-z0-9_\\-] and throw on invalid) or
properly quotes/escapes identifiers (wrap with backticks and escape any
backticks inside), and call that helper wherever datasetNames are used to build
table names (both in this file's archivesSubquery logic and in
TimeRangeInput/sql.ts); do not use string interpolation of raw datasetNames
directly and document that identifiers are validated/escaped before
concatenation.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx`:
- Around line 20-26: The code duplicates the logic that derives the first
selected dataset (`0 < selectedDatasets.length ? selectedDatasets[0] : null`)
across components (seen in this file and useTimestampKeyInit/index.tsx); extract
a reusable selector/helper (e.g., getFirstSelectedDataset or
selectFirstSelectedDataset) and use it with useSearchStore in both places
instead of repeating the ternary. Update usages that reference selectedDatasets
(the hook useSearchStore and components using dataset) to call the new selector
so the derivation is centralized and DRY.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts`:
- Around line 34-44: The function buildClpsTimeRangeSql currently interpolates
datasetNames directly into SQL identifiers (via settings.SqlDbClpTablePrefix +
name + _ + SqlTableSuffix.ARCHIVES), which risks SQL injection if
client-supplied names are malicious; add a client-side defensive check in
buildClpsTimeRangeSql to validate each name against /^\w+$/ (or equivalent
alphanumeric/underscore-only regex) and either throw an error or omit invalid
entries before building unionParts, so only validated safe identifiers are
concatenated into the SQL string; ensure the validation references
buildClpsTimeRangeSql, CLP_ARCHIVES_TABLE_COLUMN_NAMES,
settings.SqlDbClpTablePrefix, and SqlTableSuffix.ARCHIVES so reviewers can
locate the change.

In `@components/webui/common/src/schemas/search.ts`:
- Line 16: The datasets array in the search schema currently allows empty
arrays; update the schema definition for the datasets property to require at
least one entry by adding a minItems: 1 constraint to the Type.Array call (i.e.,
change datasets: Type.Array(Type.String()) to include { minItems: 1 }); this
aligns with the existing pattern used elsewhere (e.g., compression schema) and
ensures validation rejects empty dataset lists.

---

Outside diff comments:
In `@components/api-server/src/client.rs`:
- Around line 27-56: QueryConfig allows "datasets": [] which should be treated
as None; update the code that handles incoming QueryConfig (e.g., in
submit_query) to normalise or validate the datasets field by checking
QueryConfig::datasets and if Some(vec) && vec.is_empty() either set it to None
before further processing or return a 4xx validation error; reference the
QueryConfig struct and the submit_query handler so the empty-vector case is
handled early and you never propagate an empty Vec<String> to the scheduler.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 504-537: The code in get_task_group_for_job uses index-based
iteration (for i in range(len(archives))) to pair archives and task_ids when
building celery.group for QueryJobType.SEARCH_OR_AGGREGATION and
QueryJobType.EXTRACT_JSON/EXTRACT_IR; replace those loops with zip(archives,
task_ids) to iterate pairs directly and update the generator expressions that
call search.s(...) and extract_stream.s(...) to use the unpacked archive and
task_id instead of archives[i] and task_ids[i]; ensure you reference job.id,
job_config, archive.get("dataset"), clp_metadata_db_conn_params and
results_cache_uri unchanged while switching both branches to the zip-based
pattern.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`:
- Around line 99-100: In GuidedRunButton update the disabled conditional to
avoid using negation: replace the `!isQueryReady` check with `false ===
isQueryReady` so the prop check reads `disabled={false === isQueryReady ||
searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}` (keep the existing
searchUiState comparison unchanged); this targets the isQueryReady boolean usage
in the GuidedRunButton component to comply with the coding guideline.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`:
- Around line 28-48: The SQL string builder in utils.ts interpolates jobId
directly into the query (see archivesSubquery and the template that uses
${jobId}), which risks injection because jobId is a string; fix it by switching
to a parameterized query or validating/coercing jobId to a strict numeric format
before interpolation—either accept a numeric jobId argument
(Number.parseInt/Number and reject NaN) or replace the inline ${jobId} use with
a parameter placeholder and pass jobId as a bound parameter from the caller
(update the function that returns this query and its consumers to support
parameter binding).

host: DomainStr = "localhost"
port: Port = DEFAULT_PORT
jobs_poll_delay: PositiveFloat = 0.1 # seconds
max_datasets_per_query: PositiveInt | None = 10
Contributor

🧹 Nitpick | 🔵 Trivial

Add an inline comment documenting the None semantics.

Users editing config files need to know that setting this field to null means "unlimited." A brief comment would help:

📝 Suggested documentation
-    max_datasets_per_query: PositiveInt | None = 10
+    max_datasets_per_query: PositiveInt | None = 10  # None means unlimited
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
max_datasets_per_query: PositiveInt | None = 10
max_datasets_per_query: PositiveInt | None = 10 # None means unlimited
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-py-utils/clp_py_utils/clp_config.py` at line 435, The config
field max_datasets_per_query (in clp_config.py) lacks documentation for None
semantics—add a brief inline comment next to its declaration explaining that
setting this field to None (or null in YAML/JSON) means "unlimited" datasets per
query; keep the comment concise and on the same line or immediately above the
declaration so users editing config files immediately see that None/null =
unlimited.

Comment on lines 99 to 103
if dataset is None:
    dataset = extract_json_config.datasets[0] if extract_json_config.datasets else None
if StorageType.S3 == storage_type:
    s3_config = worker_config.archive_output.storage.s3_config
    s3_object_key = f"{s3_config.key_prefix}{dataset}/{archive_id}"
Contributor

⚠️ Potential issue | 🔴 Critical

dataset may be None when used in path construction, producing invalid paths.

If the dataset parameter is None and extract_json_config.datasets is empty or None, dataset stays None. It is then interpolated into the f-string at line 103 (f"{s3_config.key_prefix}{dataset}/{archive_id}") and used at line 122 (get_directory() / dataset), producing paths containing the literal string "None".

Consider adding an early return with an error when dataset resolves to None:

Proposed fix
     if dataset is None:
         dataset = extract_json_config.datasets[0] if extract_json_config.datasets else None
+    if dataset is None:
+        logger.error("No dataset specified for JSON extraction")
+        return None, None
     if StorageType.S3 == storage_type:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`
around lines 99 - 103, The code currently allows dataset to remain None if
extract_json_config.datasets is empty, which leads to invalid paths when
building s3_object_key and local paths; update the start of the logic in
extract_stream_task (the block using dataset, extract_json_config.datasets,
StorageType.S3 and s3_config.key_prefix) to validate dataset after resolving
from extract_json_config.datasets and, if still None, return an error (or raise
an exception) early with a clear message instead of proceeding to build
f"{s3_config.key_prefix}{dataset}/{archive_id}" or using get_directory() /
dataset; ensure any callers handle the early error accordingly.

Comment on lines 122 to 125
elif StorageEngine.CLP_S == storage_engine:
command, env_vars = _make_core_clp_s_command_and_env_vars(
clp_home, worker_config, archive_id, search_config
clp_home, worker_config, archive_id, search_config, dataset
)
Contributor

⚠️ Potential issue | 🔴 Critical

dataset can be None here, but _make_core_clp_s_command_and_env_vars requires str.

_make_command_and_env_vars accepts dataset: str | None = None (line 114), but passes it directly to _make_core_clp_s_command_and_env_vars (line 69) which declares dataset: str. If dataset is None at this call site:

  • Line 77: the f-string would embed the literal "None" in the S3 key.
  • Line 95: Path / None would raise a TypeError.

Add a guard before the CLP_S branch, or make dataset required when the storage engine is CLP_S.

🐛 Proposed fix
     elif StorageEngine.CLP_S == storage_engine:
+        if dataset is None:
+            logger.error("dataset is required for the CLP_S storage engine")
+            return None, None
         command, env_vars = _make_core_clp_s_command_and_env_vars(
             clp_home, worker_config, archive_id, search_config, dataset
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif StorageEngine.CLP_S == storage_engine:
command, env_vars = _make_core_clp_s_command_and_env_vars(
clp_home, worker_config, archive_id, search_config
clp_home, worker_config, archive_id, search_config, dataset
)
elif StorageEngine.CLP_S == storage_engine:
if dataset is None:
logger.error("dataset is required for the CLP_S storage engine")
return None, None
command, env_vars = _make_core_clp_s_command_and_env_vars(
clp_home, worker_config, archive_id, search_config, dataset
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/executor/query/fs_search_task.py`
around lines 122 - 125, The CLP_S branch passes dataset (which can be None) into
_make_core_clp_s_command_and_env_vars which expects a str and will break
(f-string will include "None" and Path / None raises TypeError); fix by adding a
guard in _make_command_and_env_vars before the StorageEngine.CLP_S branch to
validate dataset is not None (raise a ValueError with a clear message) or
require/convert it to a string, or change _make_core_clp_s_command_and_env_vars
to accept Optional[str] and handle None safely; reference StorageEngine.CLP_S,
_make_command_and_env_vars, and _make_core_clp_s_command_and_env_vars when
applying the guard or signature change.

Comment on lines 178 to 180
Contributor

🧹 Nitpick | 🔵 Trivial

Silently using datasets[0] loses multi-dataset context for extraction jobs.

extraction_dataset takes only the first element. This is likely intentional for extraction (single archive), but worth a brief inline comment explaining why only the first dataset is used here.

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 180-180: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 178 - 180, The code silently selects only the first dataset via
extraction_dataset = self.__job_config.datasets[0] which can confuse reviewers;
add a concise inline comment next to that line explaining that extraction jobs
intentionally use only the first dataset (single-archive behavior) and that
multi-dataset context is not applicable here, referencing __job_config.datasets,
extraction_dataset, and the subsequent archive_exists check to make the
rationale clear.

Comment on lines +417 to +427
where_clause = ""
if len(filter_clauses) > 0:
query += " WHERE " + " AND ".join(filter_clauses)
query += " ORDER BY end_timestamp DESC"
where_clause = " WHERE " + " AND ".join(filter_clauses)

union_parts = []
for ds in datasets:
table = get_archives_table_name(table_prefix, ds)
union_parts.append(
f"SELECT id AS archive_id, end_timestamp, '{ds}' AS dataset FROM {table}{where_clause}"
)
query = " UNION ALL ".join(union_parts) + " ORDER BY end_timestamp DESC"
Contributor

⚠️ Potential issue | 🟡 Minor

SQL injection: user-supplied dataset names interpolated directly into query strings.

The dataset name ds is string-interpolated into the SQL query at line 425. While datasets are validated against existing_datasets at lines 706–722 before reaching this point (which mitigates active exploitation), the validation is skipped when datasets is None. If the None issue above is resolved by defaulting to fetched datasets, this path would be safe. Still, consider using parameterized queries for the '{ds}' value to defend in depth.

Note: The table name interpolation via get_archives_table_name follows existing codebase patterns and can't easily be parameterized.

🧰 Tools
🪛 Ruff (0.15.1)

[error] 425-425: Possible SQL injection vector through string-based query construction

(S608)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 417 - 427, The current code directly interpolates dataset names
into the SELECT string (in the loop building union_parts), which risks SQL
injection; instead build the SELECT with a parameter placeholder for the dataset
value and accumulate a params list to supply to the DB driver. Concretely, keep
using get_archives_table_name(table_prefix, ds) for the table name (since table
names cannot be parameterized), but change the appended string to use a
parameter placeholder (e.g. "%s" or "?" matching the DB adapter) like "SELECT id
AS archive_id, end_timestamp, ? AS dataset FROM {table}{where_clause}", and
append the ds value to a params list for each union part; after joining
union_parts into query, return/execute the query together with the flattened
params list so the dataset values are passed as bound parameters rather than
interpolated.

Comment on lines +37 to +40
const {selectedDatasets} = useSearchStore.getState();
const from = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
Contributor

⚠️ Potential issue | 🔴 Critical

Bug: getState() makes selectedDatasets non-reactive — button won't reflect dataset changes.

useSearchStore.getState() captures state once and does not subscribe to updates. Since from (derived from selectedDatasets) feeds into isQueryReady, the button's disabled state and tooltip will not update when the user changes the dataset selection.

Use a reactive selector instead:

🐛 Proposed fix
-    const {selectedDatasets} = useSearchStore.getState();
-    const from = 0 < selectedDatasets.length ?
-        selectedDatasets[0] :
+    const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
+    const from = 0 < selectedDatasets.length ?
+        selectedDatasets[0] :
         null;

Based on learnings: "use useStore.getState().method for callbacks since the output is not reactive and doesn't need state as a dependency in the hook, and use useStore((state) => state.property) with proper selectors for reactive components that need to re-render when state changes."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const {selectedDatasets} = useSearchStore.getState();
const from = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
const from = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`
around lines 37 - 40, The button currently reads selectedDatasets via
useSearchStore.getState() which is non-reactive so `from` (and consequently
`isQueryReady`) won't update on selection changes; replace the non-reactive call
with a reactive selector such as const selectedDatasets = useSearchStore(state
=> state.selectedDatasets) and compute `from` from that reactive value (keeping
the `from` name and downstream references intact) so the component re-renders
when selection changes.

Comment on lines +15 to +26
if (0 === datasetNames.length) {
    archivesSubquery = "SELECT id, uncompressed_size" +
        ` FROM ${settings.SqlDbClpArchivesTableName}`;
} else if (1 === datasetNames.length) {
    archivesSubquery = "SELECT id, uncompressed_size" +
        ` FROM ${settings.SqlDbClpTablePrefix}${datasetNames[0]}_archives`;
} else {
    archivesSubquery = datasetNames
        .map((name) => "SELECT id, uncompressed_size" +
            ` FROM ${settings.SqlDbClpTablePrefix}${name}_archives`)
        .join(" UNION ALL ");
}
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for datasetNames definition and usage to understand the data flow
rg -n 'datasetNames' components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/ --type=ts -B 2 -A 2

Repository: y-scope/clp

Length of output: 3833


🏁 Script executed:

# Look at the complete file to understand context
head -80 components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts

Repository: y-scope/clp

Length of output: 2159


🏁 Script executed:

# Search for any dataset name validation patterns in the webui client
rg -n 'dataset.*validat|sanitiz|allowlist|whitelist|escape|quote' components/webui/client/src/ --type=ts --type=tsx -i | head -30

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Find where datasetNames parameter comes from by checking callers
rg -n 'buildQuerySpeedSql' components/webui/client/src/ --type=ts --type=tsx -B 2 -A 2

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Check if there's any dataset validation in the store or data layer
rg -n 'selectedDatasets|queriedDatasets' components/webui/client/src/ --type=ts --type=tsx | head -20

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Find imports/calls to fetchQuerySpeed
rg -n 'fetchQuerySpeed' components/webui/client/src/ --type=ts | head -20

Repository: y-scope/clp

Length of output: 693


🏁 Script executed:

# Search for where this module is imported
rg -n 'from.*QuerySpeed' components/webui/client/src/ --type=ts | head -20

Repository: y-scope/clp

Length of output: 178


🏁 Script executed:

# Check the api/sql module to understand querySql
find components/webui/client/src -name "sql.ts" -o -name "sql.tsx" | head -5

Repository: y-scope/clp

Length of output: 427


🏁 Script executed:

# Look for dataset selection/filtering logic in the search page
find components/webui/client/src/pages/SearchPage -name "*.ts" | head -10

Repository: y-scope/clp

Length of output: 1007


🏁 Script executed:

# Check the QuerySpeed index.tsx to see where queriedDatasets comes from
cat -n components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/index.tsx

Repository: y-scope/clp

Length of output: 1984


🏁 Script executed:

# Look at the sql.ts in api folder to understand querySql
find components/webui/client/src/api -name "sql.ts" | head -5

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check for dataset validation in the search state/store
find components/webui/client/src/pages/SearchPage -name "*state*" -o -name "*store*" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for where dataset names are defined/sourced in the search page
rg -n 'queriedDatasets' components/webui/client/src/pages/SearchPage --type=ts | head -20

Repository: y-scope/clp

Length of output: 849


🏁 Script executed:

# Check the SearchState index.tsx to see how queriedDatasets is set
cat -n components/webui/client/src/pages/SearchPage/SearchState/index.tsx | head -150

Repository: y-scope/clp

Length of output: 5396


🏁 Script executed:

# Look for the api/sql module to understand if querySql does any sanitization
find components/webui/client/src -path "*/api/*sql.ts" -type f

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check the Dataset controls to see how datasets are selected/validated
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/sql.ts

Repository: y-scope/clp

Length of output: 1121


🏁 Script executed:

# Find the api/sql.ts file
find components/webui/client/src -path "*/api/sql.ts" -type f

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check how querySql is implemented
cat -n components/webui/client/src/api/sql.ts | head -50

Repository: y-scope/clp

Length of output: 125


🏁 Script executed:

# Look for any examples of dataset name sanitization or validation elsewhere in the codebase
rg -n 'dataset.*\[a-zA-Z0-9_-\]|sanitiz.*dataset|escape.*name' components/webui/client/src --type=ts -i | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Find where sql.ts or querySql is defined
find components/webui/client/src -name "*.ts" -o -name "*.tsx" | xargs grep -l "export.*querySql" | head -5

Repository: y-scope/clp

Length of output: 99


🏁 Script executed:

# Search for querySql definition
rg -n 'querySql.*=' components/webui/client/src --type=ts | head -20

Repository: y-scope/clp

Length of output: 155


🏁 Script executed:

# Check the api directory structure
find components/webui/client/src/api -type f | head -20

Repository: y-scope/clp

Length of output: 644


🏁 Script executed:

# Check the querySql implementation
cat -n components/webui/client/src/api/sql/index.ts

Repository: y-scope/clp

Length of output: 417


🏁 Script executed:

# Look for where the SQL query is executed - check if there's any backend validation
rg -n 'querySql|dataset.*validation|sql.*validate' components/webui/client/src/api/sql/ --type=ts -A 5 | head -50

Repository: y-scope/clp

Length of output: 440


🏁 Script executed:

# Find the backend endpoint that handles /api/archive-metadata/sql
find components -name "*.cpp" -o -name "*.hpp" -o -name "*.py" | xargs grep -l "archive-metadata/sql" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for SQL query handling in the backend
rg -n 'archive-metadata.*sql|querySql' components --type=cpp --type=py | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for where dataset names might be validated in the backend
rg -n 'dataset.*validat|sanitiz.*dataset' components --type=cpp --type=py -i | head -20

Repository: y-scope/clp

Length of output: 391


🏁 Script executed:

# Check the backend handler for the SQL endpoint to see if validation happens there
find components -path "*/archive-metadata*" -name "*.cpp" -o -name "*.hpp" -o -name "*.py" 2>/dev/null | head -20

Repository: y-scope/clp

Length of output: 1197


🏁 Script executed:

# Look for the REST API endpoint handler
rg -n 'archive.metadata.*sql|/api/archive-metadata' components --type=cpp --type=py | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search more broadly for SQL execution in the backend
rg -n 'execute.*query|query.*execute|executeSql' components --type=py -A 3 | head -40

Repository: y-scope/clp

Length of output: 3569


🏁 Script executed:

# Look for where the REST endpoint /api/archive-metadata/sql is implemented
find components -type f \( -name "*.py" -o -name "*.cpp" -o -name "*.hpp" \) | xargs grep -l "archive.metadata" | head -10

Repository: y-scope/clp

Length of output: 544


🏁 Script executed:

# Search in the job orchestration for where SQL is actually executed for metadata queries
rg -n 'presto|trino|execute.*sql' components/job-orchestration --type=py | head -30

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Check if there are any examples of dataset name escaping/quoting in the codebase
rg -n 'quote.*name|escape.*name|f\"\"\"|backtick|identifier' components --type=py --type=ts -i | head -20

Repository: y-scope/clp

Length of output: 199


🏁 Script executed:

# Search for the backend endpoint that handles SQL queries
find components -type f -name "*.py" | xargs grep -l "archive-metadata" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for REST API/Flask routes
rg -n '@app.route|@api.route|router.post' components --type=py | grep -i sql | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Search for where presto queries are built with dataset names
rg -n 'FROM.*dataset|table.*prefix|tableName' components --type=py | head -30

Repository: y-scope/clp

Length of output: 3961


🏁 Script executed:

# Look at the clp_metadata_db_utils.py to see how table names are constructed
cat -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py | head -150

Repository: y-scope/clp

Length of output: 5784


🏁 Script executed:

# Search for the REST endpoint handler that processes the SQL query
find components -type f -name "*.py" | xargs grep -l "queryString" | head -10

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for the webui backend/API handler
find components/webui -type f -name "*.py" | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for where datasets are imported/created to see if names are validated
rg -n 'add_dataset|dataset_name.*=' components/clp-py-utils components/job-orchestration --type=py -B 2 -A 2 | head -50

Repository: y-scope/clp

Length of output: 1729


🏁 Script executed:

# Check the webui backend more carefully - look for app.py or main server file
find components/webui -type f -name "*.py" -o -name "app.py" -o -name "server.py" | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for where datasets are fetched/selected in the UI
rg -n 'fetchDatasetNames|selectedDatasets' components/webui/client/src/pages/SearchPage --type=ts | head -20

Repository: y-scope/clp

Length of output: 2910


🏁 Script executed:

# Check the Dataset component to see how datasets are selected
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/ 2>/dev/null | head -100

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Find the Dataset component file
find components/webui/client/src/pages/SearchPage/SearchControls/Dataset -type f -name "*.tsx" -o -name "*.ts"

Repository: y-scope/clp

Length of output: 293


🏁 Script executed:

# Look at the DatasetSelect component
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx

Repository: y-scope/clp

Length of output: 4078


🏁 Script executed:

# Look at the Dataset index component
cat -n components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.tsx

Repository: y-scope/clp

Length of output: 1281


🏁 Script executed:

# Look for any sanitization or validation in the search request building
rg -n 'buildClpsTimeRangeSql|buildQuerySpeedSql' components/webui/client/src --type=ts -A 10 | head -60

Repository: y-scope/clp

Length of output: 5617


🏁 Script executed:

# Check if there are any other similar SQL building patterns in the codebase
rg -n 'FROM.*\$\{.*prefix\}\$\{.*name\}' components/webui/client/src --type=ts | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Look for any existing SQL escaping utilities in the codebase
rg -n 'escape|quote|sanitize' components/webui/client/src --type=ts | head -20

Repository: y-scope/clp

Length of output: 2669


SQL injection risk: dataset names interpolated directly into table names without escaping.

datasetNames are concatenated directly into SQL table names (lines 20, 24) without any quoting or escaping around the dataset name variable. While dataset names originate from the database, they are stored with no character restrictions (VARCHAR(255)), so the generated SQL is vulnerable to injection if a dataset name contains special characters (e.g., backticks, quotes, or whitespace). This same pattern is repeated in TimeRangeInput/sql.ts.

Consider quoting or escaping dataset names before interpolation, or use parameterized queries if the backend supports them.
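
As a rough illustration only, here is a minimal TypeScript sketch of such a helper; escapeSqlIdentifier is a hypothetical name and MySQL-style backtick quoting is an assumption, not existing code in this repository:

/**
 * Hypothetical helper: quotes a SQL identifier for MySQL-style SQL by wrapping
 * it in backticks and doubling any backticks inside the name.
 */
const escapeSqlIdentifier = (name: string): string => {
    if (0 === name.length) {
        throw new Error("Identifier must not be empty");
    }
    return `\`${name.replace(/`/g, "``")}\``;
};

// Illustrative usage when building a table reference (surrounding names assumed):
// `FROM ${escapeSqlIdentifier(`${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`)}`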

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`
around lines 15 - 26, The SQL-building code interpolates datasetNames directly
into table identifiers (see variable archivesSubquery and
settings.SqlDbClpTablePrefix) which is an injection risk; fix by introducing a
single helper like escapeSqlIdentifier(name) that either validates names against
a safe whitelist regex (e.g. only [A-Za-z0-9_\\-] and throw on invalid) or
properly quotes/escapes identifiers (wrap with backticks and escape any
backticks inside), and call that helper wherever datasetNames are used to build
table names (both in this file's archivesSubquery logic and in
TimeRangeInput/sql.ts); do not use string interpolation of raw datasetNames
directly and document that identifiers are validated/escaped before
concatenation.

Comment on lines +20 to +26
const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
const timestampKey = usePrestoSearchState((state) => state.timestampKey);
const updateTimestampKey = usePrestoSearchState((state) => state.updateTimestampKey);
const searchUiState = useSearchStore((state) => state.searchUiState);
const dataset = 0 < selectedDatasets.length ?
selectedDatasets[0] :
null;
Contributor

🧹 Nitpick | 🔵 Trivial

Duplicated dataset derivation logic across components.

The pattern 0 < selectedDatasets.length ? selectedDatasets[0] : null is repeated identically in useTimestampKeyInit/index.tsx (lines 19–21) and here (lines 24–26). Consider extracting a small helper or selector to keep this DRY, especially if more components need the same derivation.
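
A minimal sketch of what such a shared selector could look like, assuming the store exposes selectedDatasets: string[] (the selector name is illustrative):

type SearchState = {selectedDatasets: string[]};

// Hypothetical shared selector: first selected dataset, or null when none.
const selectFirstSelectedDataset = (state: SearchState): string | null =>
    state.selectedDatasets[0] ?? null;

// Illustrative usage in a component:
// const dataset = useSearchStore(selectFirstSelectedDataset);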

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx`
around lines 20 - 26, The code duplicates the logic that derives the first
selected dataset (`0 < selectedDatasets.length ? selectedDatasets[0] : null`)
across components (seen in this file and useTimestampKeyInit/index.tsx); extract
a reusable selector/helper (e.g., getFirstSelectedDataset or
selectFirstSelectedDataset) and use it with useSearchStore in both places
instead of repeating the ternary. Update usages that reference selectedDatasets
(the hook useSearchStore and components using dataset) to call the new selector
so the derivation is centralized and DRY.

Comment on lines +34 to 44
const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
const unionParts = datasetNames.map((name) => `SELECT
MIN(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.BEGIN_TIMESTAMP}) AS begin_timestamp,
MAX(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.END_TIMESTAMP}) AS end_timestamp
-FROM ${settings.SqlDbClpTablePrefix}${datasetName}_${SqlTableSuffix.ARCHIVES}`;
+FROM ${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`);

return `SELECT
MIN(begin_timestamp) AS begin_timestamp,
MAX(end_timestamp) AS end_timestamp
FROM (${unionParts.join("\nUNION ALL\n")}) AS combined`;
};
Contributor

⚠️ Potential issue | 🟡 Minor

SQL injection risk: dataset names are interpolated directly into SQL strings.

Dataset names from datasetNames are concatenated into the SQL query without sanitisation or parameterisation. While dataset names are validated server-side to contain only \w+ characters, a compromised or manipulated client could submit arbitrary strings. Consider adding a client-side guard (e.g., a regex check matching ^\w+$) before interpolating names into SQL:

🛡️ Suggested defensive guard
 const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
+    const VALID_NAME = /^\w+$/;
+    for (const name of datasetNames) {
+        if (false === VALID_NAME.test(name)) {
+            throw new Error(`Invalid dataset name: ${name}`);
+        }
+    }
     const unionParts = datasetNames.map((name) => `SELECT
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
const unionParts = datasetNames.map((name) => `SELECT
MIN(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.BEGIN_TIMESTAMP}) AS begin_timestamp,
MAX(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.END_TIMESTAMP}) AS end_timestamp
FROM ${settings.SqlDbClpTablePrefix}${datasetName}_${SqlTableSuffix.ARCHIVES}`;
FROM ${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`);
return `SELECT
MIN(begin_timestamp) AS begin_timestamp,
MAX(end_timestamp) AS end_timestamp
FROM (${unionParts.join("\nUNION ALL\n")}) AS combined`;
};
const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
const VALID_NAME = /^\w+$/;
for (const name of datasetNames) {
if (false === VALID_NAME.test(name)) {
throw new Error(`Invalid dataset name: ${name}`);
}
}
const unionParts = datasetNames.map((name) => `SELECT
MIN(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.BEGIN_TIMESTAMP}) AS begin_timestamp,
MAX(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.END_TIMESTAMP}) AS end_timestamp
FROM ${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`);
return `SELECT
MIN(begin_timestamp) AS begin_timestamp,
MAX(end_timestamp) AS end_timestamp
FROM (${unionParts.join("\nUNION ALL\n")}) AS combined`;
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts`
around lines 34 - 44, The function buildClpsTimeRangeSql currently interpolates
datasetNames directly into SQL identifiers (via settings.SqlDbClpTablePrefix +
name + _ + SqlTableSuffix.ARCHIVES), which risks SQL injection if
client-supplied names are malicious; add a client-side defensive check in
buildClpsTimeRangeSql to validate each name against /^\w+$/ (or equivalent
alphanumeric/underscore-only regex) and either throw an error or omit invalid
entries before building unionParts, so only validated safe identifiers are
concatenated into the SQL string; ensure the validation references
buildClpsTimeRangeSql, CLP_ARCHIVES_TABLE_COLUMN_NAMES,
settings.SqlDbClpTablePrefix, and SqlTableSuffix.ARCHIVES so reviewers can
locate the change.

// otherwise, `{dataset: null}` gets converted to `{dataset: ""}`.
-dataset: Type.Union([Type.Null(),
-    Type.String()]),
+datasets: Type.Array(Type.String()),
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for datasets validation in server-side route handlers
rg -n --type=ts -C5 'datasets' components/webui/server/src/routes/

Repository: y-scope/clp

Length of output: 1945


🏁 Script executed:

cat -n components/webui/common/src/schemas/search.ts

Repository: y-scope/clp

Length of output: 1259


🏁 Script executed:

#!/bin/bash
# Search for scheduler validation or error handling related to datasets
rg -n --type=ts -i 'scheduler|validation' components/webui/server/src/routes/api/search/index.ts | head -20

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Search for any array length validation patterns in the codebase
rg -n 'minItems|length.*===.*0|\.length\s*>' components/webui/common/src/schemas/ --type=ts

Repository: y-scope/clp

Length of output: 166


🏁 Script executed:

#!/bin/bash
# Check if there are request validators or middleware that might validate datasets
rg -n --type=ts 'datasets.*length|datasets.*empty' components/webui/server/ -C3

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Search for tests related to query job creation to understand expected behavior
find components/webui -name "*.test.ts" -o -name "*.spec.ts" | xargs rg -l "datasets" 2>/dev/null | head -5

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Look for any comments or documentation about datasets validation
rg -n "datasets" components/webui/common/src/schemas/ -B2 -A2 --type=ts

Repository: y-scope/clp

Length of output: 452


Add minItems constraint to enforce non-empty datasets arrays.

The schema allows empty arrays [], which would be semantically invalid for a search query. The codebase establishes a pattern of constraining arrays with minItems: 1 (as seen in components/webui/common/src/schemas/compression.ts). Align this field with that pattern:

datasets: Type.Array(Type.String(), { minItems: 1 }),

No downstream validation was found to reject empty arrays at the route handler or scheduler level.
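
A small sketch of how the constraint behaves under TypeBox validation, assuming a simplified schema containing only the datasets field (Value.Check is TypeBox's standard validator):

import {Type} from "@sinclair/typebox";
import {Value} from "@sinclair/typebox/value";

// Simplified stand-in for the search request schema; only `datasets` is shown.
const searchRequestSchema = Type.Object({
    datasets: Type.Array(Type.String(), {minItems: 1}),
});

Value.Check(searchRequestSchema, {datasets: []}); // false (empty array rejected)
Value.Check(searchRequestSchema, {datasets: ["default"]}); // true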

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/webui/common/src/schemas/search.ts` at line 16, The datasets array
in the search schema currently allows empty arrays; update the schema definition
for the datasets property to require at least one entry by adding a minItems: 1
constraint to the Type.Array call (i.e., change datasets:
Type.Array(Type.String()) to include { minItems: 1 }); this aligns with the
existing pattern used elsewhere (e.g., compression schema) and ensures
validation rejects empty dataset lists.

Contributor

@sitaowang1998 sitaowang1998 left a comment

I have reviewed the package side. Should we also check whether the number of datasets exceeds max_datasets_per_query in the search script?

Comment on lines 91 to 93
Contributor

Shall we report all datasets that do not exist in the metadata database?

Comment on lines 99 to 100
Contributor

If using the first dataset is the intended behavior for backward compatibility, add a comment to avoid confusion.


# NOTE: This assumes we never delete a dataset.
missing = set(datasets) - existing_datasets
if missing:
Contributor

Prefer an explicit check, even in Python.

Suggested change
if missing:
if len(missing) > 0:

if missing:
existing_datasets.update(fetch_existing_datasets(db_cursor, table_prefix))
missing = set(datasets) - existing_datasets
if missing:
Contributor

Same as above.

Suggested change
if missing:
if len(missing) > 0:

-if dataset not in existing_datasets:
-    logger.error(f"Dataset `{dataset}` doesn't exist.")
+datasets = QueryJobConfig.model_validate(job_config).datasets
+if datasets is not None:
Contributor

This if branch is pretty large; should we extract it into a separate function?

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`:
- Around line 305-308: Replace the f-string logging call so it uses the logger's
lazy formatting instead of pre-evaluating the string: change the logger.error
call that currently interpolates len(datasets) and max_datasets_per_query into a
parameterized log message (referencing variables datasets and
max_datasets_per_query and the logger.error invocation) to pass the values as
arguments to logger.error; keep the same message text and preserve the
subsequent return -1 behavior.
- Around line 300-313: Replace the current exception logging in the dataset
existence check so the full traceback is logged: in the block where
validate_datasets_exist(database_config, datasets) is called (handling
parsed_args.dataset and datasets), change the logger.error(e) call to
logger.exception(e) to mirror the earlier handler and preserve the stack trace
for failures in validate_datasets_exist.

In `@components/clp-package-utils/clp_package_utils/scripts/search.py`:
- Around line 116-128: Replace the boolean positional argument in the call to
clp_config.database.get_clp_connection_params_and_type(True) with a keyword
argument (e.g., get_clp_connection_params_and_type(include_type=True)) to
improve readability, and change the exception logging inside the except block
from logger.error(e) to logger.exception(e) so the traceback is preserved; these
changes should be applied in the block that handles StorageEngine.CLP_S and uses
validate_dataset_name for each dataset.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 27-29: The selector is pulling updateSelectedDatasets as a
reactive function which then gets used as a stable callback; change usage so you
don't subscribe to the function: stop using useSearchStore((state) =>
state.updateSelectedDatasets) and instead call
useSearchStore.getState().updateSelectedDatasets(...) wherever updateDatasets is
invoked (e.g., in DatasetSelect component), and remove updateDatasets from any
useEffect dependency arrays so the hook doesn't depend on a stable function
reference; keep using selectors (useSearchStore((s) => s.selectedDatasets) and
useSearchStore((s) => s.searchUiState)) for reactive state only.

---

Duplicate comments:
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 56-64: The effect using isSuccess, data, datasets,
getFallbackDatasets and updateDatasets currently pushes a new empty array
reference and can loop; change the guard in the useEffect (the block referencing
isSuccess && 0 === datasets.length) to only call
updateDatasets(getFallbackDatasets()) when getFallbackDatasets() actually
returns a non-empty array (e.g., check getFallbackDatasets().length > 0) or when
the fallback differs in length from datasets to avoid storing a new empty array
reference; apply the same guard to the other effect that performs
initial-selection (the one around lines 76–87) so both effects only update
Zustand when there are real dataset items to set.
- Line 52: Remove the redundant type assertion on available[0] in DatasetSelect
(the variable available is already string[] and the preceding length guard
ensures it's defined); replace the expression "[available[0] as string]" with
just "[available[0]]" wherever it appears (e.g., in the component
return/initialization) so the code relies on the correct inferred type and
eliminates the unnecessary "as string" cast.

Comment on lines +300 to 313
datasets = parsed_args.dataset
if datasets is not None:
max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query
if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
logger.error(
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
try:
-validate_dataset_exists(database_config, dataset)
+validate_datasets_exist(database_config, datasets)
except Exception as e:
logger.error(e)
return -1
Contributor

⚠️ Potential issue | 🟡 Minor

logger.error(e) drops the traceback — prefer logger.exception(e) for consistency.

The handler at lines 309–313 silently discards the stack trace. This is inconsistent with the existing handler earlier in the same file (lines 295–297) that uses logger.exception. Use logger.exception(e) so failures in validate_datasets_exist are fully observable.

🔧 Proposed fix
     try:
         validate_datasets_exist(database_config, datasets)
     except Exception as e:
-        logger.error(e)
+        logger.exception(e)
         return -1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
datasets = parsed_args.dataset
if datasets is not None:
max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query
if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
logger.error(
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
try:
validate_dataset_exists(database_config, dataset)
validate_datasets_exist(database_config, datasets)
except Exception as e:
logger.error(e)
return -1
datasets = parsed_args.dataset
if datasets is not None:
max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query
if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
logger.error(
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
try:
validate_datasets_exist(database_config, datasets)
except Exception as e:
logger.exception(e)
return -1
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 305-306: Logging statement uses f-string

(G004)


[warning] 311-311: Do not catch blind exception: Exception

(BLE001)


[warning] 312-312: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`
around lines 300 - 313, Replace the current exception logging in the dataset
existence check so the full traceback is logged: in the block where
validate_datasets_exist(database_config, datasets) is called (handling
parsed_args.dataset and datasets), change the logger.error(e) call to
logger.exception(e) to mirror the earlier handler and preserve the stack trace
for failures in validate_datasets_exist.

Comment on lines +305 to +308
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
Contributor

🧹 Nitpick | 🔵 Trivial

Optional: avoid f-strings in logging calls (Ruff G004).

logger.error(f"Number of datasets ({len(datasets)}) exceeds ...") pre-evaluates the format string unconditionally. While the performance impact at ERROR level is negligible, this pattern is flagged by Ruff G004 and is inconsistent with lazy-evaluation style.

♻️ Suggested refactor
-            logger.error(
-                f"Number of datasets ({len(datasets)}) exceeds"
-                f" max_datasets_per_query={max_datasets_per_query}."
-            )
+            logger.error(
+                "Number of datasets (%d) exceeds max_datasets_per_query=%d.",
+                len(datasets),
+                max_datasets_per_query,
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
f"Number of datasets ({len(datasets)}) exceeds"
f" max_datasets_per_query={max_datasets_per_query}."
)
return -1
logger.error(
"Number of datasets (%d) exceeds max_datasets_per_query=%d.",
len(datasets),
max_datasets_per_query,
)
return -1
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 305-306: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`
around lines 305 - 308, Replace the f-string logging call so it uses the
logger's lazy formatting instead of pre-evaluating the string: change the
logger.error call that currently interpolates len(datasets) and
max_datasets_per_query into a parameterized log message (referencing variables
datasets and max_datasets_per_query and the logger.error invocation) to pass the
values as arguments to logger.error; keep the same message text and preserve the
subsequent return -1 behavior.

Comment on lines +116 to 128
datasets = parsed_args.dataset
if StorageEngine.CLP_S == storage_engine:
-dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
+datasets = [CLP_DEFAULT_DATASET_NAME] if datasets is None else datasets
try:
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+for ds in datasets:
+    validate_dataset_name(clp_db_connection_params["table_prefix"], ds)
except Exception as e:
logger.error(e)
return -1
-elif dataset is not None:
+elif datasets is not None:
logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.")
return -1
Contributor

⚠️ Potential issue | 🟡 Minor

Two minor issues in the exception handling block.

  1. Line 120 — Boolean positional argument reduces call-site readability. Use the keyword form.
  2. Lines 123–124: logger.error(e) only logs the message, not the traceback. This is inconsistent with the existing handler at lines 103–105 that uses logger.exception. Prefer logger.exception(e) so the stack trace is preserved.
🔧 Proposed fix
-            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(
+                disable_localhost_socket_connection=True
+            )
             for ds in datasets:
                 validate_dataset_name(clp_db_connection_params["table_prefix"], ds)
         except Exception as e:
-            logger.error(e)
+            logger.exception(e)
             return -1
🧰 Tools
🪛 Ruff (0.15.1)

[error] 120-120: Boolean positional value in function call

(FBT003)


[warning] 123-123: Do not catch blind exception: Exception

(BLE001)


[warning] 124-124: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


[warning] 127-127: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/search.py` around
lines 116 - 128, Replace the boolean positional argument in the call to
clp_config.database.get_clp_connection_params_and_type(True) with a keyword
argument (e.g., get_clp_connection_params_and_type(include_type=True)) to
improve readability, and change the exception logging inside the except block
from logger.error(e) to logger.exception(e) so the traceback is preserved; these
changes should be applied in the block that handles StorageEngine.CLP_S and uses
validate_dataset_name for each dataset.

Comment on lines +27 to +29
const datasets = useSearchStore((state) => state.selectedDatasets);
const searchUiState = useSearchStore((state) => state.searchUiState);
const updateDatasets = useSearchStore((state) => state.updateSelectedDatasets);
Contributor

🧹 Nitpick | 🔵 Trivial

Use useSearchStore.getState() for the updateDatasets callback.

Per the team's Zustand convention, store methods used as callbacks (not for reactive rendering) should be accessed via useStore.getState().method rather than through a selector. This avoids listing a stable function reference as a useEffect dependency and aligns with the project's existing patterns.

Proposed fix
     const datasets = useSearchStore((state) => state.selectedDatasets);
     const searchUiState = useSearchStore((state) => state.searchUiState);
-    const updateDatasets = useSearchStore((state) => state.updateSelectedDatasets);

Then replace all updateDatasets(...) calls (and remove it from useEffect dependency arrays) with useSearchStore.getState().updateSelectedDatasets(...).

Based on learnings: "use useStore.getState().method for callbacks since the output is not reactive and doesn't need state as a dependency in the hook, and use useStore((state) => state.property) with proper selectors for reactive components that need to re-render when state changes."
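
For illustration, a self-contained sketch of the convention with a stand-in store; the store shape, component, and fallback value here are assumptions, not the actual DatasetSelect code:

import {useEffect} from "react";
import {create} from "zustand";

// Minimal stand-in for the real search store.
interface SearchState {
    selectedDatasets: string[];
    updateSelectedDatasets: (datasets: string[]) => void;
}

const useSearchStore = create<SearchState>((set) => ({
    selectedDatasets: [],
    updateSelectedDatasets: (datasets) => set({selectedDatasets: datasets}),
}));

const DatasetPicker = () => {
    // Reactive read: the component re-renders when the selection changes.
    const datasets = useSearchStore((state) => state.selectedDatasets);

    useEffect(() => {
        if (0 === datasets.length) {
            // Imperative update via getState(): no function reference has to be
            // listed in the dependency array.
            useSearchStore.getState().updateSelectedDatasets(["default"]);
        }
    }, [datasets]);

    return null;
};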

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`
around lines 27 - 29, The selector is pulling updateSelectedDatasets as a
reactive function which then gets used as a stable callback; change usage so you
don't subscribe to the function: stop using useSearchStore((state) =>
state.updateSelectedDatasets) and instead call
useSearchStore.getState().updateSelectedDatasets(...) wherever updateDatasets is
invoked (e.g., in DatasetSelect component), and remove updateDatasets from any
useEffect dependency arrays so the hook doesn't depend on a stable function
reference; keep using selectors (useSearchStore((s) => s.selectedDatasets) and
useSearchStore((s) => s.searchUiState)) for reactive state only.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/utils.py (1)

75-93: ⚠️ Potential issue | 🔴 Critical

Update decompress.py and archive_manager.py to use the renamed function.

The old validate_dataset_exists function has been removed and replaced with validate_datasets_exist. Two files still reference the old function and will fail on import:

  • decompress.py (lines 39, 151)
  • archive_manager.py (lines 31, 203)

Both files must be updated to import validate_datasets_exist and pass datasets as a list (e.g., validate_datasets_exist(config, [dataset]) instead of validate_dataset_exists(config, dataset)).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`
around lines 75 - 93, Replace the old validate_dataset_exists usage with the new
validate_datasets_exist: update the imports in the modules that still call the
old function to import validate_datasets_exist, and change call sites that
currently pass a single string (e.g., validate_dataset_exists(config, dataset))
to pass a list/iterable (e.g., validate_datasets_exist(config, [dataset]) or
validate_datasets_exist(config, datasets_list)); ensure you reference the
function name validate_datasets_exist and update both call sites that previously
used validate_dataset_exists so they pass a list of dataset names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`:
- Around line 92-93: Replace the non-idiomatic check and raw list error with a
clearer, friendlier exception: change `if len(missing) > 0:` to `if missing:`,
and raise either a dedicated exception (e.g., define `class
DatasetNotFoundError(ValueError)` and raise that) or at minimum a shorter
ValueError using a joined string for readability like `raise
ValueError(f"Datasets not found: {', '.join(missing)}")`; refer to the existing
`missing` variable and the current `ValueError` site to locate and update the
code.

---

Outside diff comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`:
- Around line 75-93: Replace the old validate_dataset_exists usage with the new
validate_datasets_exist: update the imports in the modules that still call the
old function to import validate_datasets_exist, and change call sites that
currently pass a single string (e.g., validate_dataset_exists(config, dataset))
to pass a list/iterable (e.g., validate_datasets_exist(config, [dataset]) or
validate_datasets_exist(config, datasets_list)); ensure you reference the
function name validate_datasets_exist and update both call sites that previously
used validate_dataset_exists so they pass a list of dataset names.

---

Duplicate comments:
In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`:
- Around line 99-100: The code uses extract_json_config.datasets[0] guarded only
by "datasets is not None", which still raises IndexError for an empty list;
update the logic around the dataset assignment in extract_stream_task.py (the
block that sets dataset = extract_json_config.datasets[0] if
extract_json_config.datasets is not None else None) to ensure the list is
non-empty before indexing—e.g., check "if extract_json_config.datasets and
len(extract_json_config.datasets) > 0" or use a safe extractor like
"next(iter(extract_json_config.datasets), None)" so dataset falls back to None
for empty lists.

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 736-738: When datasets is None before calling
get_archives_for_search, set datasets to the full list of existing dataset IDs
instead of passing None; update the code around where archives_for_search is
computed so that if datasets is None you call the project/dataset discovery
helper (e.g., a function that lists all datasets using db_conn/table_prefix or
an existing helper like list_all_datasets/get_all_dataset_ids) and assign that
list to datasets, then call get_archives_for_search(db_conn, table_prefix,
search_config, archive_end_ts_lower_bound, datasets) so the for ds in ... loop
in get_archives_for_search never receives None.
- Around line 178-179: The current guard uses "is not None" but still allows an
empty list and can cause IndexError when accessing
self.__job_config.datasets[0]; update the logic around extraction_dataset in
query_scheduler.py so you only index when datasets is truthy (e.g., check if
self.__job_config.datasets before taking [0]) and pass None to archive_exists
when datasets is empty or missing; change the assignment of extraction_dataset
and the subsequent archive_exists call to use this falsy check so empty lists
are handled defensively.
- Around line 417-427: The SQL currently injects the dataset name directly into
the query via f"'{ds}'" in the loop (see variables union_parts, ds,
get_archives_table_name, where_clause, query); change this to use a parameter
placeholder for the dataset literal while continuing to use
get_archives_table_name(table_prefix, ds) for the table identifier (table names
remain non-parameterized). Concretely, build each union part with a placeholder
(e.g., %s or the DB driver's placeholder) instead of embedding '{ds}', append
the corresponding ds value to a params list for each union part, and when
executing the final " UNION ALL ... ORDER BY end_timestamp DESC" query pass the
accumulated params so dataset values are bound safely. Ensure any existing
filter_clauses parameters are merged into the same params list in the right
order before executing.

Comment on lines +92 to +93
if len(missing) > 0:
raise ValueError(f"Dataset(s) {missing} don't exist.")
Contributor

🧹 Nitpick | 🔵 Trivial

Simplify truthiness check and improve error message readability.

Three minor issues on these two lines:

  1. len(missing) > 0 — prefer the idiomatic if missing:.
  2. Ruff TRY003: the long message is passed inline to ValueError rather than encapsulated in a custom exception (or at least a short message). Consider using a dedicated exception class or a shorter message.
  3. {missing} renders the raw Python set repr (e.g., {'ds1', 'ds2'}); joining the names is more user-friendly.
♻️ Proposed refactor
-        if len(missing) > 0:
-            raise ValueError(f"Dataset(s) {missing} don't exist.")
+        if missing:
+            raise ValueError(f"Dataset(s) don't exist: {', '.join(missing)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if len(missing) > 0:
raise ValueError(f"Dataset(s) {missing} don't exist.")
if missing:
raise ValueError(f"Dataset(s) don't exist: {', '.join(missing)}")
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 93-93: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`
around lines 92 - 93, Replace the non-idiomatic check and raw list error with a
clearer, friendlier exception: change `if len(missing) > 0:` to `if missing:`,
and raise either a dedicated exception (e.g., define `class
DatasetNotFoundError(ValueError)` and raise that) or at minimum a shorter
ValueError using a joined string for readability like `raise
ValueError(f"Datasets not found: {', '.join(missing)}")`; refer to the existing
`missing` variable and the current `ValueError` site to locate and update the
code.
