feat!: Add multi-dataset query support (resolves #1933). #1992
junhaoliao wants to merge 17 commits into y-scope:main from …
Conversation
Walkthrough
Replaces the single-dataset query flow with multi-dataset support.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant WebUI as Web UI (React store)
    participant API as API Server
    participant Scheduler as Query Scheduler
    participant Executor as Task Executor
    participant CLP_S as Search Engine (CLP_S)

    User->>WebUI: select one or more datasets
    WebUI->>WebUI: store selectedDatasets[]
    User->>WebUI: submit query (datasets[])
    WebUI->>API: POST /query with datasets[]
    API->>Scheduler: create/dispatch QueryJob(datasets[])
    Scheduler->>Scheduler: validate datasets & enforce max_datasets_per_query
    Scheduler->>Scheduler: discover archives per dataset -> archives[{archive_id,dataset},...]
    Scheduler->>Executor: create tasks with archive dicts
    Executor->>CLP_S: run search/extract per archive with --dataset <dataset>
    CLP_S->>Executor: return results (include dataset)
    Executor->>API: store results in results cache (includes dataset)
    API->>WebUI: return results (contains dataset)
    WebUI->>User: render results (dataset-aware)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
…ename cachedDatasets to queriedDatasets; fix query speed calculation
```cpp
)(
    "dataset",
    po::value<std::string>(&m_dataset)->value_name("DATASET"),
    "The dataset name to include in each result document"
);
```
The dataset name needs to be included in the query results before dumping them into the results cache. I know this feels like a hack. A more future-proof interface might be to accept a JSON object string via `--extra-result-metadata` and merge such objects with the results before dumping them into the results cache. What do you think?
Alternatively, we can add multi-dataset support to the stream extraction flow as well. Then we just pass an array of datasets in the stream extraction config, avoiding touching the clp-s binary.
I think this may be more aligned with the future plan of using a single table for all datasets (maybe I haven't had a correct understanding of how we are going to refactor the datasets feature, though).
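If the config-level alternative above were pursued, a rough sketch of the shape it might take on the webui/common schema side could look like the following; the schema name and fields here are assumptions for illustration, not the existing extraction schema.

```ts
import {Type, type Static} from "@sinclair/typebox";

// Hypothetical sketch: a stream-extraction config that carries a list of datasets,
// mirroring the search schema's move from `dataset` to `datasets`.
export const StreamExtractionConfigSchemaSketch = Type.Object({
    archiveId: Type.String(),

    // One or more datasets; the scheduler would resolve which dataset the target
    // archive belongs to, so the clp-s binary itself would not need a new flag.
    datasets: Type.Array(Type.String(), {minItems: 1}),
});

export type StreamExtractionConfigSketch = Static<typeof StreamExtractionConfigSchemaSketch>;
```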
…k`; improve clarity in `DatasetSelect` component
…ileManager and extract_stream_task
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx (1)
99-100: ⚠️ Potential issue | 🟡 Minor
Coding guideline: prefer `false === isQueryReady` over `!isQueryReady`.
Proposed fix
```diff
-                disabled={!isQueryReady ||
+                disabled={false === isQueryReady ||
                     searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}
```
As per coding guidelines:
`**/*.{tsx}`: "Prefer `false === <expression>` rather than `!<expression>`."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx` around lines 99 - 100, In GuidedRunButton update the disabled conditional to avoid using negation: replace the `!isQueryReady` check with `false === isQueryReady` so the prop check reads `disabled={false === isQueryReady || searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}` (keep the existing searchUiState comparison unchanged); this targets the isQueryReady boolean usage in the GuidedRunButton component to comply with the coding guideline.
components/api-server/src/client.rs (1)
27-56: ⚠️ Potential issue | 🟡 Minor
Consider validating that `datasets` is not an empty vector.
A client can submit `"datasets": []`, which would pass the `is_none()` check on Line 131 and propagate an empty dataset list to the scheduler. If the scheduler doesn't guard against this, it could cause unexpected behaviour. Consider adding a validation or normalising empty vectors to `None`:
🛡️ Suggested guard in submit_query
```diff
     let mut search_job_config: SearchJobConfig = query_config.into();
-    if search_job_config.datasets.is_none() {
+    if search_job_config.datasets.as_ref().is_none_or(|d| d.is_empty()) {
         search_job_config.datasets = match self.config.package.storage_engine {
             StorageEngine::Clp => None,
             StorageEngine::ClpS => Some(vec!["default".to_owned()]),
         }
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/api-server/src/client.rs` around lines 27 - 56, QueryConfig allows "datasets": [] which should be treated as None; update the code that handles incoming QueryConfig (e.g., in submit_query) to normalise or validate the datasets field by checking QueryConfig::datasets and if Some(vec) && vec.is_empty() either set it to None before further processing or return a 4xx validation error; reference the QueryConfig struct and the submit_query handler so the empty-vector case is handled early and you never propagate an empty Vec<String> to the scheduler.
components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts (1)
28-48: ⚠️ Potential issue | 🟡 Minor
Ensure `jobId` uses a parameterized query or add format validation.
`jobId` originates from the backend API response (a trusted internal source), mitigating direct injection risk. However, it's typed as `string` (not numeric) and is interpolated directly into SQL without frontend validation. Adopt parameterized queries or validate that `jobId` matches an expected format before interpolation to follow SQL injection prevention best practices.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts` around lines 28 - 48, The SQL string builder in utils.ts interpolates jobId directly into the query (see archivesSubquery and the template that uses ${jobId}), which risks injection because jobId is a string; fix it by switching to a parameterized query or validating/coercing jobId to a strict numeric format before interpolation—either accept a numeric jobId argument (Number.parseInt/Number and reject NaN) or replace the inline ${jobId} use with a parameter placeholder and pass jobId as a bound parameter from the caller (update the function that returns this query and its consumers to support parameter binding).
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)
504-537: 🧹 Nitpick | 🔵 Trivial
Consider using `zip` instead of index-based iteration.
The `for i in range(len(archives))` pattern at lines 524 and 537 could be cleaner with `zip(archives, task_ids)`.
Suggested refactor
```diff
 if QueryJobType.SEARCH_OR_AGGREGATION == job_type:
     return celery.group(
         search.s(
             job_id=job.id,
-            archive_id=archives[i]["archive_id"],
-            task_id=task_ids[i],
+            archive_id=archive["archive_id"],
+            task_id=task_id,
             job_config=job_config,
-            dataset=archives[i].get("dataset"),
+            dataset=archive.get("dataset"),
             clp_metadata_db_conn_params=clp_metadata_db_conn_params,
             results_cache_uri=results_cache_uri,
         )
-        for i in range(len(archives))
+        for archive, task_id in zip(archives, task_ids)
     )
```
Apply the same pattern to the `EXTRACT_JSON`/`EXTRACT_IR` branch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` around lines 504 - 537, The code in get_task_group_for_job uses index-based iteration (for i in range(len(archives))) to pair archives and task_ids when building celery.group for QueryJobType.SEARCH_OR_AGGREGATION and QueryJobType.EXTRACT_JSON/EXTRACT_IR; replace those loops with zip(archives, task_ids) to iterate pairs directly and update the generator expressions that call search.s(...) and extract_stream.s(...) to use the unpacked archive and task_id instead of archives[i] and task_ids[i]; ensure you reference job.id, job_config, archive.get("dataset"), clp_metadata_db_conn_params and results_cache_uri unchanged while switching both branches to the zip-based pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/clp-py-utils/clp_py_utils/clp_config.py`:
- Line 435: The config field max_datasets_per_query (in clp_config.py) lacks
documentation for None semantics—add a brief inline comment next to its
declaration explaining that setting this field to None (or null in YAML/JSON)
means "unlimited" datasets per query; keep the comment concise and on the same
line or immediately above the declaration so users editing config files
immediately see that None/null = unlimited.
In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`:
- Around line 99-103: The code currently allows dataset to remain None if
extract_json_config.datasets is empty, which leads to invalid paths when
building s3_object_key and local paths; update the start of the logic in
extract_stream_task (the block using dataset, extract_json_config.datasets,
StorageType.S3 and s3_config.key_prefix) to validate dataset after resolving
from extract_json_config.datasets and, if still None, return an error (or raise
an exception) early with a clear message instead of proceeding to build
f"{s3_config.key_prefix}{dataset}/{archive_id}" or using get_directory() /
dataset; ensure any callers handle the early error accordingly.
In
`@components/job-orchestration/job_orchestration/executor/query/fs_search_task.py`:
- Around line 122-125: The CLP_S branch passes dataset (which can be None) into
_make_core_clp_s_command_and_env_vars which expects a str and will break
(f-string will include "None" and Path / None raises TypeError); fix by adding a
guard in _make_command_and_env_vars before the StorageEngine.CLP_S branch to
validate dataset is not None (raise a ValueError with a clear message) or
require/convert it to a string, or change _make_core_clp_s_command_and_env_vars
to accept Optional[str] and handle None safely; reference StorageEngine.CLP_S,
_make_command_and_env_vars, and _make_core_clp_s_command_and_env_vars when
applying the guard or signature change.
In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 178-180: The code silently selects only the first dataset via
extraction_dataset = self.__job_config.datasets[0] which can confuse reviewers;
add a concise inline comment next to that line explaining that extraction jobs
intentionally use only the first dataset (single-archive behavior) and that
multi-dataset context is not applicable here, referencing __job_config.datasets,
extraction_dataset, and the subsequent archive_exists check to make the
rationale clear.
- Around line 417-427: The current code directly interpolates dataset names into
the SELECT string (in the loop building union_parts), which risks SQL injection;
instead build the SELECT with a parameter placeholder for the dataset value and
accumulate a params list to supply to the DB driver. Concretely, keep using
get_archives_table_name(table_prefix, ds) for the table name (since table names
cannot be parameterized), but change the appended string to use a parameter
placeholder (e.g. "%s" or "?" matching the DB adapter) like "SELECT id AS
archive_id, end_timestamp, ? AS dataset FROM {table}{where_clause}", and append
the ds value to a params list for each union part; after joining union_parts
into query, return/execute the query together with the flattened params list so
the dataset values are passed as bound parameters rather than interpolated.
- Around line 736-738: The call to get_archives_for_search with datasets can
pass None (causing a TypeError in get_archives_for_search's for ds in datasets),
so before calling get_archives_for_search (where archives_for_search is
assigned) add a guard: if datasets is None, populate datasets with all available
datasets (e.g., call the existing helper that lists datasets such as
get_all_datasets/db helper or query the DB via db_conn/table_prefix to return a
list of dataset names), or explicitly set datasets = [] if the intended
semantics are “no datasets”; then call get_archives_for_search(db_conn,
table_prefix, search_config, archive_end_ts_lower_bound, datasets) so the
function always receives an iterable list.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 55-63: The useEffect that checks isSuccess && datasets.length ===
0 keeps calling updateDatasets(getFallbackDatasets()) even when
getFallbackDatasets() returns a new empty array reference, causing an infinite
re-render loop; modify the effect in DatasetSelect.tsx to early-return if data
is an empty array (data.length === 0) or only call updateDatasets when the
fallback actually changes (e.g., fallback has length > 0 or differs from
datasets by length/content) so that updateDatasets is not invoked with a new
empty array reference repeatedly; specifically guard the useEffect (and the
related effect that calls updateDatasets([])) to avoid setting an
identical-empty selection by comparing lengths or contents before calling
updateDatasets.
- Around line 43-52: In getFallbackDatasets, remove the unnecessary type
assertion on available[0]; since available is derived as string[] (const
available = data || []) and you already check length > 0, replace the returned
array [available[0] as string] with [available[0]] to avoid the redundant cast
(references: function getFallbackDatasets, variable available, constant
CLP_DEFAULT_DATASET_NAME).
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Native/SearchButton/SubmitButton/index.tsx`:
- Around line 107-112: Update the inline comment above the isNoDatasetsAndClpS
computation to reference "datasets" plural and state that clp-s requires at
least one dataset; locate the comment surrounding the isNoDatasetsAndClpS
constant (which checks selectedDatasets.length and SETTINGS_STORAGE_ENGINE
against CLP_STORAGE_ENGINES.CLP_S) and change the phrasing to something like
"Submit button must be disabled if there are no datasets since clp-s requires at
least one dataset for queries."
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/Guided/presto-guided-search-requests.ts`:
- Line 74: The conditional guarding the destructured variable `from` uses an
unnecessary typeof check; replace the `if ("undefined" === typeof from)` with a
direct undefined comparison (e.g., `if (from === undefined)`) in
presto-guided-search-requests.ts so the code is clearer and still correctly
handles `string | undefined` for `from`.
- Around line 71-76: The code currently destructures selectedDatasets to only
use the first entry (const [from] = selectedDatasets) which silently ignores
additional selections; update all guided-Presto entry points (the usages in
presto-guided-search-requests.ts, GuidedRunButton, TimestampKeySelect and
buildPrestoGuidedQueries) to detect when
useSearchStore.getState().selectedDatasets.length > 1 and handle it explicitly:
either prevent guided mode (disable the run button and input controls) or
surface a prominent, user-visible warning/toast/banner that multiple datasets
are not supported in guided mode and block execution until the user reduces the
selection to one; do not silently proceed by using only selectedDatasets[0].
Ensure the same single-check-and-block logic and a clear error/warning message
is implemented wherever selectedDatasets is currently destructured.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`:
- Around line 37-40: The button currently reads selectedDatasets via
useSearchStore.getState() which is non-reactive so `from` (and consequently
`isQueryReady`) won't update on selection changes; replace the non-reactive call
with a reactive selector such as const selectedDatasets = useSearchStore(state
=> state.selectedDatasets) and compute `from` from that reactive value (keeping
the `from` name and downstream references intact) so the component re-renders
when selection changes.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`:
- Around line 15-26: The SQL-building code interpolates datasetNames directly
into table identifiers (see variable archivesSubquery and
settings.SqlDbClpTablePrefix) which is an injection risk; fix by introducing a
single helper like escapeSqlIdentifier(name) that either validates names against
a safe whitelist regex (e.g. only [A-Za-z0-9_\\-] and throw on invalid) or
properly quotes/escapes identifiers (wrap with backticks and escape any
backticks inside), and call that helper wherever datasetNames are used to build
table names (both in this file's archivesSubquery logic and in
TimeRangeInput/sql.ts); do not use string interpolation of raw datasetNames
directly and document that identifiers are validated/escaped before
concatenation.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx`:
- Around line 20-26: The code duplicates the logic that derives the first
selected dataset (`0 < selectedDatasets.length ? selectedDatasets[0] : null`)
across components (seen in this file and useTimestampKeyInit/index.tsx); extract
a reusable selector/helper (e.g., getFirstSelectedDataset or
selectFirstSelectedDataset) and use it with useSearchStore in both places
instead of repeating the ternary. Update usages that reference selectedDatasets
(the hook useSearchStore and components using dataset) to call the new selector
so the derivation is centralized and DRY.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts`:
- Around line 34-44: The function buildClpsTimeRangeSql currently interpolates
datasetNames directly into SQL identifiers (via settings.SqlDbClpTablePrefix +
name + _ + SqlTableSuffix.ARCHIVES), which risks SQL injection if
client-supplied names are malicious; add a client-side defensive check in
buildClpsTimeRangeSql to validate each name against /^\w+$/ (or equivalent
alphanumeric/underscore-only regex) and either throw an error or omit invalid
entries before building unionParts, so only validated safe identifiers are
concatenated into the SQL string; ensure the validation references
buildClpsTimeRangeSql, CLP_ARCHIVES_TABLE_COLUMN_NAMES,
settings.SqlDbClpTablePrefix, and SqlTableSuffix.ARCHIVES so reviewers can
locate the change.
In `@components/webui/common/src/schemas/search.ts`:
- Line 16: The datasets array in the search schema currently allows empty
arrays; update the schema definition for the datasets property to require at
least one entry by adding a minItems: 1 constraint to the Type.Array call (i.e.,
change datasets: Type.Array(Type.String()) to include { minItems: 1 }); this
aligns with the existing pattern used elsewhere (e.g., compression schema) and
ensures validation rejects empty dataset lists.
---
Outside diff comments:
In `@components/api-server/src/client.rs`:
- Around line 27-56: QueryConfig allows "datasets": [] which should be treated
as None; update the code that handles incoming QueryConfig (e.g., in
submit_query) to normalise or validate the datasets field by checking
QueryConfig::datasets and if Some(vec) && vec.is_empty() either set it to None
before further processing or return a 4xx validation error; reference the
QueryConfig struct and the submit_query handler so the empty-vector case is
handled early and you never propagate an empty Vec<String> to the scheduler.
In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 504-537: The code in get_task_group_for_job uses index-based
iteration (for i in range(len(archives))) to pair archives and task_ids when
building celery.group for QueryJobType.SEARCH_OR_AGGREGATION and
QueryJobType.EXTRACT_JSON/EXTRACT_IR; replace those loops with zip(archives,
task_ids) to iterate pairs directly and update the generator expressions that
call search.s(...) and extract_stream.s(...) to use the unpacked archive and
task_id instead of archives[i] and task_ids[i]; ensure you reference job.id,
job_config, archive.get("dataset"), clp_metadata_db_conn_params and
results_cache_uri unchanged while switching both branches to the zip-based
pattern.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`:
- Around line 99-100: In GuidedRunButton update the disabled conditional to
avoid using negation: replace the `!isQueryReady` check with `false ===
isQueryReady` so the prop check reads `disabled={false === isQueryReady ||
searchUiState === SEARCH_UI_STATE.QUERY_ID_PENDING}` (keep the existing
searchUiState comparison unchanged); this targets the isQueryReady boolean usage
in the GuidedRunButton component to comply with the coding guideline.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`:
- Around line 28-48: The SQL string builder in utils.ts interpolates jobId
directly into the query (see archivesSubquery and the template that uses
${jobId}), which risks injection because jobId is a string; fix it by switching
to a parameterized query or validating/coercing jobId to a strict numeric format
before interpolation—either accept a numeric jobId argument
(Number.parseInt/Number and reject NaN) or replace the inline ${jobId} use with
a parameter placeholder and pass jobId as a bound parameter from the caller
(update the function that returns this query and its consumers to support
parameter binding).
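For the guided-Presto multi-dataset prompt above (`presto-guided-search-requests.ts` and related components), one possible shape of the single-dataset guard is sketched below; the helper name and message wording are assumptions.

```ts
// Sketch only: return a user-visible reason to block guided mode, or null when it may proceed.
const getGuidedModeBlockReason = (selectedDatasets: string[]): string | null => {
    if (1 < selectedDatasets.length) {
        return "Guided mode supports a single dataset; reduce the selection to one.";
    }

    return null;
};

// Usage (illustrative): disable the run button and surface the reason instead of
// silently falling back to selectedDatasets[0].
// const blockReason = getGuidedModeBlockReason(selectedDatasets);
// <Button disabled={null !== blockReason} title={blockReason ?? ""}>Run</Button>
```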
```python
host: DomainStr = "localhost"
port: Port = DEFAULT_PORT
jobs_poll_delay: PositiveFloat = 0.1  # seconds
max_datasets_per_query: PositiveInt | None = 10
```
🧹 Nitpick | 🔵 Trivial
Add an inline comment documenting the None semantics.
Users editing config files need to know that setting this field to null means "unlimited." A brief comment would help:
📝 Suggested documentation
```diff
-    max_datasets_per_query: PositiveInt | None = 10
+    max_datasets_per_query: PositiveInt | None = 10  # None means unlimited
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/clp-py-utils/clp_py_utils/clp_config.py` at line 435, The config
field max_datasets_per_query (in clp_config.py) lacks documentation for None
semantics—add a brief inline comment next to its declaration explaining that
setting this field to None (or null in YAML/JSON) means "unlimited" datasets per
query; keep the comment concise and on the same line or immediately above the
declaration so users editing config files immediately see that None/null =
unlimited.
```python
if dataset is None:
    dataset = extract_json_config.datasets[0] if extract_json_config.datasets else None
if StorageType.S3 == storage_type:
    s3_config = worker_config.archive_output.storage.s3_config
    s3_object_key = f"{s3_config.key_prefix}{dataset}/{archive_id}"
```
dataset may be None when used in path construction, producing invalid paths.
If the dataset parameter is None and extract_json_config.datasets is empty or None, dataset stays None. It is then interpolated into the f-string at line 103 (f"{s3_config.key_prefix}{dataset}/{archive_id}") and used at line 122 (get_directory() / dataset), producing paths containing the literal string "None".
Consider adding an early return with an error when dataset resolves to None:
Proposed fix
if dataset is None:
dataset = extract_json_config.datasets[0] if extract_json_config.datasets else None
+ if dataset is None:
+ logger.error("No dataset specified for JSON extraction")
+ return None, None
    if StorageType.S3 == storage_type:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`
around lines 99 - 103, The code currently allows dataset to remain None if
extract_json_config.datasets is empty, which leads to invalid paths when
building s3_object_key and local paths; update the start of the logic in
extract_stream_task (the block using dataset, extract_json_config.datasets,
StorageType.S3 and s3_config.key_prefix) to validate dataset after resolving
from extract_json_config.datasets and, if still None, return an error (or raise
an exception) early with a clear message instead of proceeding to build
f"{s3_config.key_prefix}{dataset}/{archive_id}" or using get_directory() /
dataset; ensure any callers handle the early error accordingly.
```diff
 elif StorageEngine.CLP_S == storage_engine:
     command, env_vars = _make_core_clp_s_command_and_env_vars(
-        clp_home, worker_config, archive_id, search_config
+        clp_home, worker_config, archive_id, search_config, dataset
     )
```
dataset can be None here, but _make_core_clp_s_command_and_env_vars requires str.
_make_command_and_env_vars accepts dataset: str | None = None (line 114), but passes it directly to _make_core_clp_s_command_and_env_vars (line 69) which declares dataset: str. If dataset is None at this call site:
- Line 77: the f-string would embed the literal `"None"` in the S3 key.
- Line 95: `Path / None` would raise a `TypeError`.
Add a guard before the CLP_S branch, or make dataset required when the storage engine is CLP_S.
🐛 Proposed fix
elif StorageEngine.CLP_S == storage_engine:
+ if dataset is None:
+ logger.error("dataset is required for the CLP_S storage engine")
+ return None, None
command, env_vars = _make_core_clp_s_command_and_env_vars(
clp_home, worker_config, archive_id, search_config, dataset
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/job-orchestration/job_orchestration/executor/query/fs_search_task.py`
around lines 122 - 125, The CLP_S branch passes dataset (which can be None) into
_make_core_clp_s_command_and_env_vars which expects a str and will break
(f-string will include "None" and Path / None raises TypeError); fix by adding a
guard in _make_command_and_env_vars before the StorageEngine.CLP_S branch to
validate dataset is not None (raise a ValueError with a clear message) or
require/convert it to a string, or change _make_core_clp_s_command_and_env_vars
to accept Optional[str] and handle None safely; reference StorageEngine.CLP_S,
_make_command_and_env_vars, and _make_core_clp_s_command_and_env_vars when
applying the guard or signature change.
🧹 Nitpick | 🔵 Trivial
Silently using datasets[0] loses multi-dataset context for extraction jobs.
extraction_dataset takes only the first element. This is likely intentional for extraction (single archive), but worth a brief inline comment explaining why only the first dataset is used here.
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 180-180: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 178 - 180, The code silently selects only the first dataset via
extraction_dataset = self.__job_config.datasets[0] which can confuse reviewers;
add a concise inline comment next to that line explaining that extraction jobs
intentionally use only the first dataset (single-archive behavior) and that
multi-dataset context is not applicable here, referencing __job_config.datasets,
extraction_dataset, and the subsequent archive_exists check to make the
rationale clear.
```diff
+where_clause = ""
 if len(filter_clauses) > 0:
-    query += " WHERE " + " AND ".join(filter_clauses)
-    query += " ORDER BY end_timestamp DESC"
+    where_clause = " WHERE " + " AND ".join(filter_clauses)

+union_parts = []
+for ds in datasets:
+    table = get_archives_table_name(table_prefix, ds)
+    union_parts.append(
+        f"SELECT id AS archive_id, end_timestamp, '{ds}' AS dataset FROM {table}{where_clause}"
+    )
+query = " UNION ALL ".join(union_parts) + " ORDER BY end_timestamp DESC"
```
SQL injection: user-supplied dataset names interpolated directly into query strings.
The dataset name ds is string-interpolated into the SQL query at line 425. While datasets are validated against existing_datasets at lines 706–722 before reaching this point (which mitigates active exploitation), the validation is skipped when datasets is None. If the None issue above is resolved by defaulting to fetched datasets, this path would be safe. Still, consider using parameterized queries for the '{ds}' value to defend in depth.
Note: The table name interpolation via get_archives_table_name follows existing codebase patterns and can't easily be parameterized.
🧰 Tools
🪛 Ruff (0.15.1)
[error] 425-425: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 417 - 427, The current code directly interpolates dataset names
into the SELECT string (in the loop building union_parts), which risks SQL
injection; instead build the SELECT with a parameter placeholder for the dataset
value and accumulate a params list to supply to the DB driver. Concretely, keep
using get_archives_table_name(table_prefix, ds) for the table name (since table
names cannot be parameterized), but change the appended string to use a
parameter placeholder (e.g. "%s" or "?" matching the DB adapter) like "SELECT id
AS archive_id, end_timestamp, ? AS dataset FROM {table}{where_clause}", and
append the ds value to a params list for each union part; after joining
union_parts into query, return/execute the query together with the flattened
params list so the dataset values are passed as bound parameters rather than
interpolated.
```ts
const {selectedDatasets} = useSearchStore.getState();
const from = 0 < selectedDatasets.length ?
    selectedDatasets[0] :
    null;
```
Bug: getState() makes selectedDatasets non-reactive — button won't reflect dataset changes.
useSearchStore.getState() captures state once and does not subscribe to updates. Since from (derived from selectedDatasets) feeds into isQueryReady, the button's disabled state and tooltip will not update when the user changes the dataset selection.
Use a reactive selector instead:
🐛 Proposed fix
- const {selectedDatasets} = useSearchStore.getState();
- const from = 0 < selectedDatasets.length ?
- selectedDatasets[0] :
+ const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
+ const from = 0 < selectedDatasets.length ?
+ selectedDatasets[0] :
     null;
Based on learnings: "use useStore.getState().method for callbacks since the output is not reactive and doesn't need state as a dependency in the hook, and use useStore((state) => state.property) with proper selectors for reactive components that need to re-render when state changes."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx`
around lines 37 - 40, The button currently reads selectedDatasets via
useSearchStore.getState() which is non-reactive so `from` (and consequently
`isQueryReady`) won't update on selection changes; replace the non-reactive call
with a reactive selector such as const selectedDatasets = useSearchStore(state
=> state.selectedDatasets) and compute `from` from that reactive value (keeping
the `from` name and downstream references intact) so the component re-renders
when selection changes.
```ts
if (0 === datasetNames.length) {
    archivesSubquery = "SELECT id, uncompressed_size" +
        ` FROM ${settings.SqlDbClpArchivesTableName}`;
} else if (1 === datasetNames.length) {
    archivesSubquery = "SELECT id, uncompressed_size" +
        ` FROM ${settings.SqlDbClpTablePrefix}${datasetNames[0]}_archives`;
} else {
    archivesSubquery = datasetNames
        .map((name) => "SELECT id, uncompressed_size" +
            ` FROM ${settings.SqlDbClpTablePrefix}${name}_archives`)
        .join(" UNION ALL ");
}
```
🧩 Analysis chain (collapsed): repository searches traced `datasetNames` from the search store's `queriedDatasets` through `buildQuerySpeedSql` and the `querySql` helper (which posts raw SQL to the archive-metadata endpoint), inspected `DatasetSelect.tsx`, `SearchState/index.tsx`, `Dataset/sql.ts`, and `clp_metadata_db_utils.py`, and found no sanitisation, quoting, or identifier validation applied to dataset names along that path.
SQL injection risk: dataset names interpolated directly into table names without escaping.
datasetNames are concatenated directly into SQL table names (lines 20, 24) without any quoting or escaping around the dataset name variable. While dataset names originate from the database, they are stored with no character restrictions (VARCHAR(255)), making them vulnerable to SQL injection if a dataset name contains special characters (e.g., backticks, underscores, quotes). This same pattern is repeated in TimeRangeInput/sql.ts.
Consider quoting or escaping dataset names before interpolation, or use parameterized queries if the backend supports them.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/utils.ts`
around lines 15 - 26, The SQL-building code interpolates datasetNames directly
into table identifiers (see variable archivesSubquery and
settings.SqlDbClpTablePrefix) which is an injection risk; fix by introducing a
single helper like escapeSqlIdentifier(name) that either validates names against
a safe whitelist regex (e.g. only [A-Za-z0-9_\\-] and throw on invalid) or
properly quotes/escapes identifiers (wrap with backticks and escape any
backticks inside), and call that helper wherever datasetNames are used to build
table names (both in this file's archivesSubquery logic and in
TimeRangeInput/sql.ts); do not use string interpolation of raw datasetNames
directly and document that identifiers are validated/escaped before
concatenation.
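A minimal sketch of the identifier guard described above might look like this; the helper name comes from the review prompt, and the regex mirrors the server-side `\w+` rule mentioned in the comment (this variant takes the validate-only path rather than quoting).

```ts
// Sketch only: validate a dataset name before it is concatenated into a table identifier.
const escapeSqlIdentifier = (name: string): string => {
    // Allow only word characters, matching the server-side dataset-name rule.
    if (false === /^\w+$/.test(name)) {
        throw new Error(`Invalid dataset name: ${name}`);
    }

    return name;
};

// Usage (illustrative), in the subquery builder and in TimeRangeInput/sql.ts:
// ` FROM ${settings.SqlDbClpTablePrefix}${escapeSqlIdentifier(name)}_archives`
```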
```ts
const selectedDatasets = useSearchStore((state) => state.selectedDatasets);
const timestampKey = usePrestoSearchState((state) => state.timestampKey);
const updateTimestampKey = usePrestoSearchState((state) => state.updateTimestampKey);
const searchUiState = useSearchStore((state) => state.searchUiState);
const dataset = 0 < selectedDatasets.length ?
    selectedDatasets[0] :
    null;
```
🧹 Nitpick | 🔵 Trivial
Duplicated dataset derivation logic across components.
The pattern 0 < selectedDatasets.length ? selectedDatasets[0] : null is repeated identically in useTimestampKeyInit/index.tsx (lines 19–21) and here (lines 24–26). Consider extracting a small helper or selector to keep this DRY, especially if more components need the same derivation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx`
around lines 20 - 26, The code duplicates the logic that derives the first
selected dataset (`0 < selectedDatasets.length ? selectedDatasets[0] : null`)
across components (seen in this file and useTimestampKeyInit/index.tsx); extract
a reusable selector/helper (e.g., getFirstSelectedDataset or
selectFirstSelectedDataset) and use it with useSearchStore in both places
instead of repeating the ternary. Update usages that reference selectedDatasets
(the hook useSearchStore and components using dataset) to call the new selector
so the derivation is centralized and DRY.
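A small sketch of the shared selector suggested above; the selector name and its file placement are assumptions.

```ts
// Sketch only: one selector so components derive the first selected dataset the same
// way instead of repeating the ternary.
interface HasSelectedDatasets {
    selectedDatasets: string[];
}

const selectFirstSelectedDataset = (state: HasSelectedDatasets): string | null => (
    state.selectedDatasets[0] ?? null
);

// Usage (illustrative):
// const dataset = useSearchStore(selectFirstSelectedDataset);
```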
```diff
 const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
     const unionParts = datasetNames.map((name) => `SELECT
         MIN(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.BEGIN_TIMESTAMP}) AS begin_timestamp,
         MAX(${CLP_ARCHIVES_TABLE_COLUMN_NAMES.END_TIMESTAMP}) AS end_timestamp
-        FROM ${settings.SqlDbClpTablePrefix}${datasetName}_${SqlTableSuffix.ARCHIVES}`;
+        FROM ${settings.SqlDbClpTablePrefix}${name}_${SqlTableSuffix.ARCHIVES}`);

     return `SELECT
         MIN(begin_timestamp) AS begin_timestamp,
         MAX(end_timestamp) AS end_timestamp
         FROM (${unionParts.join("\nUNION ALL\n")}) AS combined`;
 };
```
SQL injection risk: dataset names are interpolated directly into SQL strings.
Dataset names from datasetNames are concatenated into the SQL query without sanitisation or parameterisation. While dataset names are validated server-side to contain only \w+ characters, a compromised or manipulated client could submit arbitrary strings. Consider adding a client-side guard (e.g., a regex check matching ^\w+$) before interpolating names into SQL:
🛡️ Suggested defensive guard
const buildClpsTimeRangeSql = (datasetNames: string[]): string => {
+ const VALID_NAME = /^\w+$/;
+ for (const name of datasetNames) {
+ if (false === VALID_NAME.test(name)) {
+ throw new Error(`Invalid dataset name: ${name}`);
+ }
+ }
     const unionParts = datasetNames.map((name) => `SELECT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts`
around lines 34 - 44, The function buildClpsTimeRangeSql currently interpolates
datasetNames directly into SQL identifiers (via settings.SqlDbClpTablePrefix +
name + _ + SqlTableSuffix.ARCHIVES), which risks SQL injection if
client-supplied names are malicious; add a client-side defensive check in
buildClpsTimeRangeSql to validate each name against /^\w+$/ (or equivalent
alphanumeric/underscore-only regex) and either throw an error or omit invalid
entries before building unionParts, so only validated safe identifiers are
concatenated into the SQL string; ensure the validation references
buildClpsTimeRangeSql, CLP_ARCHIVES_TABLE_COLUMN_NAMES,
settings.SqlDbClpTablePrefix, and SqlTableSuffix.ARCHIVES so reviewers can
locate the change.
```diff
-    // otherwise, `{dataset: null}` gets converted to `{dataset: ""}`.
-    dataset: Type.Union([Type.Null(),
-        Type.String()]),
+    datasets: Type.Array(Type.String()),
```
🧩 Analysis chain (collapsed): searches of the server route handlers and `components/webui/common/src/schemas/` found no `minItems` constraint or empty-array check on `datasets` at the route-handler or scheduler level, while the compression schema already constrains its arrays with `minItems: 1`.
Add minItems constraint to enforce non-empty datasets arrays.
The schema allows empty arrays [], which would be semantically invalid for a search query. The codebase establishes a pattern of constraining arrays with minItems: 1 (as seen in components/webui/common/src/schemas/compression.ts). Align this field with that pattern:
datasets: Type.Array(Type.String(), { minItems: 1 }),
No downstream validation was found to reject empty arrays at the route handler or scheduler level.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/webui/common/src/schemas/search.ts` at line 16, The datasets array
in the search schema currently allows empty arrays; update the schema definition
for the datasets property to require at least one entry by adding a minItems: 1
constraint to the Type.Array call (i.e., change datasets:
Type.Array(Type.String()) to include { minItems: 1 }); this aligns with the
existing pattern used elsewhere (e.g., compression schema) and ensures
validation rejects empty dataset lists.
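As a quick illustration of what `minItems` buys at validation time, checked here with TypeBox's `Value.Check` (an assumption about tooling, not a detail confirmed in this PR):

```ts
import {Type} from "@sinclair/typebox";
import {Value} from "@sinclair/typebox/value";

// Sketch only: with minItems, an empty dataset list fails validation.
const DatasetsSchema = Type.Array(Type.String(), {minItems: 1});

console.log(Value.Check(DatasetsSchema, ["default"])); // true
console.log(Value.Check(DatasetsSchema, []));          // false: violates minItems
```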
sitaowang1998
left a comment
I have reviewed the package side. Should we also check whether the number of datasets exceeds max_datasets_per_query in the search script?
Shall we report all datasets that do not exist in the metadata database?
If using the first dataset is the intended behavior for backward compatibility, add a comment to avoid confusion.
```python
# NOTE: This assumes we never delete a dataset.
missing = set(datasets) - existing_datasets
if missing:
```
Prefer to use an explicit check, even in Python.
```diff
-if missing:
+if len(missing) > 0:
```
```python
if missing:
    existing_datasets.update(fetch_existing_datasets(db_cursor, table_prefix))
    missing = set(datasets) - existing_datasets
    if missing:
```
Same as above.
```diff
-if missing:
+if len(missing) > 0:
```
```diff
-if dataset not in existing_datasets:
-    logger.error(f"Dataset `{dataset}` doesn't exist.")
+datasets = QueryJobConfig.model_validate(job_config).datasets
+if datasets is not None:
```
This `if` branch is pretty large; should we extract it into a separate function?
…d enforce validation in UI and backend
…ncy and support multiple flag usage
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`:
- Around line 305-308: Replace the f-string logging call so it uses the logger's
lazy formatting instead of pre-evaluating the string: change the logger.error
call that currently interpolates len(datasets) and max_datasets_per_query into a
parameterized log message (referencing variables datasets and
max_datasets_per_query and the logger.error invocation) to pass the values as
arguments to logger.error; keep the same message text and preserve the
subsequent return -1 behavior.
- Around line 300-313: Replace the current exception logging in the dataset
existence check so the full traceback is logged: in the block where
validate_datasets_exist(database_config, datasets) is called (handling
parsed_args.dataset and datasets), change the logger.error(e) call to
logger.exception(e) to mirror the earlier handler and preserve the stack trace
for failures in validate_datasets_exist.
In `@components/clp-package-utils/clp_package_utils/scripts/search.py`:
- Around line 116-128: Replace the boolean positional argument in the call to
clp_config.database.get_clp_connection_params_and_type(True) with a keyword
argument (e.g., get_clp_connection_params_and_type(include_type=True)) to
improve readability, and change the exception logging inside the except block
from logger.error(e) to logger.exception(e) so the traceback is preserved; these
changes should be applied in the block that handles StorageEngine.CLP_S and uses
validate_dataset_name for each dataset.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 27-29: The selector is pulling updateSelectedDatasets as a
reactive function which then gets used as a stable callback; change usage so you
don't subscribe to the function: stop using useSearchStore((state) =>
state.updateSelectedDatasets) and instead call
useSearchStore.getState().updateSelectedDatasets(...) wherever updateDatasets is
invoked (e.g., in DatasetSelect component), and remove updateDatasets from any
useEffect dependency arrays so the hook doesn't depend on a stable function
reference; keep using selectors (useSearchStore((s) => s.selectedDatasets) and
useSearchStore((s) => s.searchUiState)) for reactive state only.
---
Duplicate comments:
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`:
- Around line 56-64: The effect using isSuccess, data, datasets,
getFallbackDatasets and updateDatasets currently pushes a new empty array
reference and can loop; change the guard in the useEffect (the block referencing
isSuccess && 0 === datasets.length) to only call
updateDatasets(getFallbackDatasets()) when getFallbackDatasets() actually
returns a non-empty array (e.g., check getFallbackDatasets().length > 0) or when
the fallback differs in length from datasets to avoid storing a new empty array
reference; apply the same guard to the other effect that performs
initial-selection (the one around lines 76–87) so both effects only update
Zustand when there are real dataset items to set.
- Line 52: Remove the redundant type assertion on available[0] in DatasetSelect
(the variable available is already string[] and the preceding length guard
ensures it's defined); replace the expression "[available[0] as string]" with
just "[available[0]]" wherever it appears (e.g., in the component
return/initialization) so the code relies on the correct inferred type and
eliminates the unnecessary "as string" cast.
| datasets = parsed_args.dataset | ||
| if datasets is not None: | ||
| max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query | ||
| if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query: | ||
| logger.error( | ||
| f"Number of datasets ({len(datasets)}) exceeds" | ||
| f" max_datasets_per_query={max_datasets_per_query}." | ||
| ) | ||
| return -1 | ||
| try: | ||
| validate_dataset_exists(database_config, dataset) | ||
| validate_datasets_exist(database_config, datasets) | ||
| except Exception as e: | ||
| logger.error(e) | ||
| return -1 |
There was a problem hiding this comment.
logger.error(e) drops the traceback — prefer logger.exception(e) for consistency.
The handler at lines 309–313 silently discards the stack trace. This is inconsistent with the existing handler earlier in the same file (lines 295–297) that uses logger.exception. Use logger.exception(e) so failures in validate_datasets_exist are fully observable.
🔧 Proposed fix
try:
validate_datasets_exist(database_config, datasets)
except Exception as e:
- logger.error(e)
+ logger.exception(e)
return -1
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| datasets = parsed_args.dataset | |
| if datasets is not None: | |
| max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query | |
| if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query: | |
| logger.error( | |
| f"Number of datasets ({len(datasets)}) exceeds" | |
| f" max_datasets_per_query={max_datasets_per_query}." | |
| ) | |
| return -1 | |
| try: | |
| validate_dataset_exists(database_config, dataset) | |
| validate_datasets_exist(database_config, datasets) | |
| except Exception as e: | |
| logger.error(e) | |
| return -1 | |
| datasets = parsed_args.dataset | |
| if datasets is not None: | |
| max_datasets_per_query = clp_config.query_scheduler.max_datasets_per_query | |
| if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query: | |
| logger.error( | |
| f"Number of datasets ({len(datasets)}) exceeds" | |
| f" max_datasets_per_query={max_datasets_per_query}." | |
| ) | |
| return -1 | |
| try: | |
| validate_datasets_exist(database_config, datasets) | |
| except Exception as e: | |
| logger.exception(e) | |
| return -1 |
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 305-306: Logging statement uses f-string
(G004)
[warning] 311-311: Do not catch blind exception: Exception
(BLE001)
[warning] 312-312: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`
around lines 300 - 313, Replace the current exception logging in the dataset
existence check so the full traceback is logged: in the block where
validate_datasets_exist(database_config, datasets) is called (handling
parsed_args.dataset and datasets), change the logger.error(e) call to
logger.exception(e) to mirror the earlier handler and preserve the stack trace
for failures in validate_datasets_exist.
| f"Number of datasets ({len(datasets)}) exceeds" | ||
| f" max_datasets_per_query={max_datasets_per_query}." | ||
| ) | ||
| return -1 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Optional: avoid f-strings in logging calls (Ruff G004).
logger.error(f"Number of datasets ({len(datasets)}) exceeds ...") pre-evaluates the format string unconditionally. While the performance impact at ERROR level is negligible, this pattern is flagged by Ruff G004 and is inconsistent with lazy-evaluation style.
♻️ Suggested refactor
- logger.error(
- f"Number of datasets ({len(datasets)}) exceeds"
- f" max_datasets_per_query={max_datasets_per_query}."
- )
+ logger.error(
+ "Number of datasets (%d) exceeds max_datasets_per_query=%d.",
+ len(datasets),
+ max_datasets_per_query,
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| f"Number of datasets ({len(datasets)}) exceeds" | |
| f" max_datasets_per_query={max_datasets_per_query}." | |
| ) | |
| return -1 | |
| logger.error( | |
| "Number of datasets (%d) exceeds max_datasets_per_query=%d.", | |
| len(datasets), | |
| max_datasets_per_query, | |
| ) | |
| return -1 |
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 305-306: Logging statement uses f-string
(G004)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/clp-package-utils/clp_package_utils/scripts/native/search.py`
around lines 305 - 308, Replace the f-string logging call so it uses the
logger's lazy formatting instead of pre-evaluating the string: change the
logger.error call that currently interpolates len(datasets) and
max_datasets_per_query into a parameterized log message (referencing variables
datasets and max_datasets_per_query and the logger.error invocation) to pass the
values as arguments to logger.error; keep the same message text and preserve the
subsequent return -1 behavior.
| datasets = parsed_args.dataset | ||
| if StorageEngine.CLP_S == storage_engine: | ||
| dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset | ||
| datasets = [CLP_DEFAULT_DATASET_NAME] if datasets is None else datasets | ||
| try: | ||
| clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) | ||
| validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) | ||
| for ds in datasets: | ||
| validate_dataset_name(clp_db_connection_params["table_prefix"], ds) | ||
| except Exception as e: | ||
| logger.error(e) | ||
| return -1 | ||
| elif dataset is not None: | ||
| elif datasets is not None: | ||
| logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.") | ||
| return -1 |
There was a problem hiding this comment.
Two minor issues in the exception handling block.
- Line 120: Boolean positional argument reduces call-site readability. Use the keyword form.
- Lines 123–124: logger.error(e) only logs the message, not the traceback. This is inconsistent with the existing handler at lines 103–105 that uses logger.exception. Prefer logger.exception(e) so the stack trace is preserved.
🔧 Proposed fix
- clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+ clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(
+ disable_localhost_socket_connection=True
+ )
for ds in datasets:
validate_dataset_name(clp_db_connection_params["table_prefix"], ds)
except Exception as e:
- logger.error(e)
+ logger.exception(e)
return -1
🧰 Tools
🪛 Ruff (0.15.1)
[error] 120-120: Boolean positional value in function call
(FBT003)
[warning] 123-123: Do not catch blind exception: Exception
(BLE001)
[warning] 124-124: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
[warning] 127-127: Logging statement uses f-string
(G004)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/clp-package-utils/clp_package_utils/scripts/search.py` around
lines 116 - 128, Replace the boolean positional argument in the call to
clp_config.database.get_clp_connection_params_and_type(True) with a keyword
argument (e.g., get_clp_connection_params_and_type(include_type=True)) to
improve readability, and change the exception logging inside the except block
from logger.error(e) to logger.exception(e) so the traceback is preserved; these
changes should be applied in the block that handles StorageEngine.CLP_S and uses
validate_dataset_name for each dataset.
| const datasets = useSearchStore((state) => state.selectedDatasets); | ||
| const searchUiState = useSearchStore((state) => state.searchUiState); | ||
| const updateDatasets = useSearchStore((state) => state.updateSelectedDatasets); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Use useSearchStore.getState() for the updateDatasets callback.
Per the team's Zustand convention, store methods used as callbacks (not for reactive rendering) should be accessed via useStore.getState().method rather than through a selector. This avoids listing a stable function reference as a useEffect dependency and aligns with the project's existing patterns.
Proposed fix
const datasets = useSearchStore((state) => state.selectedDatasets);
const searchUiState = useSearchStore((state) => state.searchUiState);
- const updateDatasets = useSearchStore((state) => state.updateSelectedDatasets);
Then replace all updateDatasets(...) calls (and remove it from useEffect dependency arrays) with useSearchStore.getState().updateSelectedDatasets(...).
Based on learnings: "use useStore.getState().method for callbacks since the output is not reactive and doesn't need state as a dependency in the hook, and use useStore((state) => state.property) with proper selectors for reactive components that need to re-render when state changes."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx`
around lines 27 - 29, The selector is pulling updateSelectedDatasets as a
reactive function which then gets used as a stable callback; change usage so you
don't subscribe to the function: stop using useSearchStore((state) =>
state.updateSelectedDatasets) and instead call
useSearchStore.getState().updateSelectedDatasets(...) wherever updateDatasets is
invoked (e.g., in DatasetSelect component), and remove updateDatasets from any
useEffect dependency arrays so the hook doesn't depend on a stable function
reference; keep using selectors (useSearchStore((s) => s.selectedDatasets) and
useSearchStore((s) => s.searchUiState)) for reactive state only.
…validation consistency
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/utils.py (1)
75-93: ⚠️ Potential issue | 🔴 Critical
Update decompress.py and archive_manager.py to use the renamed function.
The old validate_dataset_exists function has been removed and replaced with validate_datasets_exist. Two files still reference the old function and will fail on import:
decompress.py (lines 39, 151)
archive_manager.py (lines 31, 203)
Both files must be updated to import validate_datasets_exist and pass datasets as a list (e.g., validate_datasets_exist(config, [dataset]) instead of validate_datasets_exist(config, dataset)).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py` around lines 75 - 93, Replace the old validate_dataset_exists usage with the new validate_datasets_exist: update the imports in the modules that still call the old function to import validate_datasets_exist, and change call sites that currently pass a single string (e.g., validate_dataset_exists(config, dataset)) to pass a list/iterable (e.g., validate_datasets_exist(config, [dataset]) or validate_datasets_exist(config, datasets_list)); ensure you reference the function name validate_datasets_exist and update both call sites that previously used validate_dataset_exists so they pass a list of dataset names.
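A minimal sketch of the call-site migration described above; the module path is inferred from the file layout and the one-element list wrapper is the only behavioural change:

```python
# Hypothetical call site in decompress.py / archive_manager.py after the rename.
# Module path inferred from components/clp-package-utils/clp_package_utils/scripts/native/utils.py.
from clp_package_utils.scripts.native.utils import validate_datasets_exist


def check_dataset(database_config, dataset: str) -> None:
    # The renamed helper validates a list, so a single dataset is wrapped in a one-element list.
    validate_datasets_exist(database_config, [dataset])
```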
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`:
- Around line 92-93: Replace the non-idiomatic check and raw list error with a
clearer, friendlier exception: change `if len(missing) > 0:` to `if missing:`,
and raise either a dedicated exception (e.g., define `class
DatasetNotFoundError(ValueError)` and raise that) or at minimum a shorter
ValueError using a joined string for readability like `raise
ValueError(f"Datasets not found: {', '.join(missing)}")`; refer to the existing
`missing` variable and the current `ValueError` site to locate and update the
code.
---
Outside diff comments:
In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`:
- Around line 75-93: Replace the old validate_dataset_exists usage with the new
validate_datasets_exist: update the imports in the modules that still call the
old function to import validate_datasets_exist, and change call sites that
currently pass a single string (e.g., validate_dataset_exists(config, dataset))
to pass a list/iterable (e.g., validate_datasets_exist(config, [dataset]) or
validate_datasets_exist(config, datasets_list)); ensure you reference the
function name validate_datasets_exist and update both call sites that previously
used validate_dataset_exists so they pass a list of dataset names.
---
Duplicate comments:
In
`@components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`:
- Around line 99-100: The code uses extract_json_config.datasets[0] guarded only
by "datasets is not None", which still raises IndexError for an empty list;
update the logic around the dataset assignment in extract_stream_task.py (the
block that sets dataset = extract_json_config.datasets[0] if
extract_json_config.datasets is not None else None) to ensure the list is
non-empty before indexing—e.g., check "if extract_json_config.datasets and
len(extract_json_config.datasets) > 0" or use a safe extractor like
"next(iter(extract_json_config.datasets), None)" so dataset falls back to None
for empty lists.
In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 736-738: When datasets is None before calling
get_archives_for_search, set datasets to the full list of existing dataset IDs
instead of passing None; update the code around where archives_for_search is
computed so that if datasets is None you call the project/dataset discovery
helper (e.g., a function that lists all datasets using db_conn/table_prefix or
an existing helper like list_all_datasets/get_all_dataset_ids) and assign that
list to datasets, then call get_archives_for_search(db_conn, table_prefix,
search_config, archive_end_ts_lower_bound, datasets) so the for ds in ... loop
in get_archives_for_search never receives None.
- Around line 178-179: The current guard uses "is not None" but still allows an
empty list and can cause IndexError when accessing
self.__job_config.datasets[0]; update the logic around extraction_dataset in
query_scheduler.py so you only index when datasets is truthy (e.g., check if
self.__job_config.datasets before taking [0]) and pass None to archive_exists
when datasets is empty or missing; change the assignment of extraction_dataset
and the subsequent archive_exists call to use this falsy check so empty lists
are handled defensively.
- Around line 417-427: The SQL currently injects the dataset name directly into
the query via f"'{ds}'" in the loop (see variables union_parts, ds,
get_archives_table_name, where_clause, query); change this to use a parameter
placeholder for the dataset literal while continuing to use
get_archives_table_name(table_prefix, ds) for the table identifier (table names
remain non-parameterized). Concretely, build each union part with a placeholder
(e.g., %s or the DB driver's placeholder) instead of embedding '{ds}', append
the corresponding ds value to a params list for each union part, and when
executing the final " UNION ALL ... ORDER BY end_timestamp DESC" query pass the
accumulated params so dataset values are bound safely. Ensure any existing
filter_clauses parameters are merged into the same params list in the right
order before executing.
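A minimal sketch of the parameterized construction described in that prompt, assuming a %s-style DB-API driver. get_archives_table_name and the selected columns are illustrative stand-ins for the scheduler's actual helpers:

```python
from typing import List, Sequence, Tuple


def get_archives_table_name(table_prefix: str, dataset: str) -> str:
    # Illustrative only; the real helper lives in the scheduler module.
    return f"{table_prefix}{dataset}_archives"


def build_archives_query(
    table_prefix: str,
    datasets: Sequence[str],
    where_clause: str,
    filter_params: Sequence[object],
) -> Tuple[str, List[object]]:
    """Builds a UNION ALL query in which dataset literals are bound as parameters."""
    union_parts: List[str] = []
    params: List[object] = []
    for ds in datasets:
        # Table names cannot be parameterized; the dataset literal in the projection can.
        table = get_archives_table_name(table_prefix, ds)
        union_parts.append(
            f"SELECT id, begin_timestamp, end_timestamp, %s AS dataset FROM {table} {where_clause}"
        )
        params.append(ds)
        # Any parameters used by `where_clause` must be appended once per UNION branch, in order.
        params.extend(filter_params)
    query = " UNION ALL ".join(union_parts) + " ORDER BY end_timestamp DESC"
    return query, params
```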
| if len(missing) > 0: | ||
| raise ValueError(f"Dataset(s) {missing} don't exist.") |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Simplify truthiness check and improve error message readability.
Three minor issues on these two lines:
- len(missing) > 0: prefer the idiomatic if missing:.
- Ruff TRY003: the long message is passed inline to ValueError rather than encapsulated in a custom exception (or at least a short message). Consider using a dedicated exception class or a shorter message.
- {missing} renders the raw Python list repr (e.g., ['ds1', 'ds2']); joining the names is more user-friendly.
♻️ Proposed refactor
- if len(missing) > 0:
- raise ValueError(f"Dataset(s) {missing} don't exist.")
+ if missing:
+ raise ValueError(f"Dataset(s) don't exist: {', '.join(missing)}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if len(missing) > 0: | |
| raise ValueError(f"Dataset(s) {missing} don't exist.") | |
| if missing: | |
| raise ValueError(f"Dataset(s) don't exist: {', '.join(missing)}") |
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 93-93: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/clp-package-utils/clp_package_utils/scripts/native/utils.py`
around lines 92 - 93, Replace the non-idiomatic check and raw list error with a
clearer, friendlier exception: change `if len(missing) > 0:` to `if missing:`,
and raise either a dedicated exception (e.g., define `class
DatasetNotFoundError(ValueError)` and raise that) or at minimum a shorter
ValueError using a joined string for readability like `raise
ValueError(f"Datasets not found: {', '.join(missing)}")`; refer to the existing
`missing` variable and the current `ValueError` site to locate and update the
code.
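A minimal sketch combining the two suggestions above (a dedicated exception plus a joined, human-readable message). DatasetNotFoundError and ensure_datasets_exist are illustrative names, not the package's actual API:

```python
from typing import Iterable, Set


class DatasetNotFoundError(ValueError):
    """Raised when one or more requested datasets are absent from the metadata database."""


def ensure_datasets_exist(requested: Iterable[str], existing: Set[str]) -> None:
    # Idiomatic truthiness check instead of `len(missing) > 0`, per the nitpick above.
    missing = set(requested) - existing
    if missing:
        raise DatasetNotFoundError(f"Datasets not found: {', '.join(sorted(missing))}")
```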
Description
Add multi-dataset query support, allowing users to search across multiple datasets in a single query.
This is a breaking change: the dataset field in the query job config is replaced with datasets: list[str] (Python) / Vec<String> (Rust) / Type.Array(Type.String()) (TypeScript) across all query entry points (Web UI, API server, CLI).
Impact Assessment
Affected areas include the query entry points, the query scheduler, the clp-s binary output handler, and result rendering in the Web UI. The dataset field is removed from the query job config in favour of datasets. Existing API clients, stored queries, and CLI scripts that use --dataset must be updated to use --datasets. The scheduler enforces a configurable max_datasets_per_query limit to prevent resource exhaustion.
Key changes by layer:
Job config schema: QueryJobConfig.dataset → QueryJobConfig.datasets in Python, Rust, and TypeScript.
components/job-orchestration/job_orchestration/scheduler/job_config.py
components/clp-py-utils/clp_py_utils/clp_config.py
components/clp-rust-utils/src/job_config/search.rs
components/webui/common/src/schemas/search.ts
Configuration: Add max_datasets_per_query (default 10, null for unlimited) to the QueryScheduler config in clp-config.yaml.
components/clp-py-utils/clp_py_utils/clp_config.py
components/package-template/src/etc/clp-config.template.text.yaml
components/package-template/src/etc/clp-config.template.json.yaml
Scheduler fan-out: query_scheduler.py loops over all requested datasets, fetches archives from each dataset's metadata table, merge-sorts them by begin_timestamp, validates dataset existence and count limits, and passes dataset as an explicit Celery task argument so each worker knows which dataset it is querying. A sketch of this fan-out appears after this list.
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
C++ core results attribution: Add a --dataset CLI flag to clp-s and include a dataset field in each BSON result document written to the results cache, enabling per-result dataset attribution.
components/core/src/clp_s/CommandLineArguments.cpp
components/core/src/clp_s/CommandLineArguments.hpp
components/core/src/clp_s/OutputHandlerImpl.cpp
components/core/src/clp_s/OutputHandlerImpl.hpp
components/core/src/clp_s/archive_constants.hpp
components/core/src/clp_s/clp-s.cpp
Worker execution: fs_search_task.py and extract_stream_task.py accept dataset as an explicit Celery task parameter and pass it to the clp-s binary via --dataset.
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py
Web UI client & Web UI server: The dataset selector changes from single-select to multi-select (selectDataset: string | null → selectDatasets: string[]). LogViewerLink reads dataset from each search result document instead of the global cached state. The "All Time" range uses UNION ALL across multiple datasets. The multi-select dropdown uses responsive tag collapsing and defaults to the "default" dataset when the selection is emptied. Fastify routes are updated to accept a datasets array.
components/webui/client/src/pages/SearchPage/SearchControls/Dataset/DatasetSelect.tsx
components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.module.css
components/webui/client/src/pages/SearchPage/SearchControls/Dataset/index.tsx
components/webui/client/src/pages/SearchPage/SearchControls/Native/SearchButton/SubmitButton/index.tsx
components/webui/client/src/pages/SearchPage/SearchControls/Presto/Guided/presto-guided-search-requests.ts
components/webui/client/src/pages/SearchPage/SearchControls/Presto/SqlSearchButton/RunButton/GuidedRunButton.tsx
components/webui/client/src/pages/SearchPage/SearchControls/QueryStatus/QuerySpeed/index.tsx
components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/Presto/TimeRangeFooter/TimestampKeySelect/index.tsx
components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/sql.ts
components/webui/client/src/pages/SearchPage/SearchControls/TimeRangeInput/utils.tsx
components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/Message/LogViewerLink.tsx
components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/Message/index.tsx
components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTable/Native/SearchResultsVirtualTable/typings.tsx
components/webui/client/src/pages/SearchPage/SearchResults/SearchResultsTimeline/Native/NativeResultsTimeline.tsx
components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/index.tsx
components/webui/client/src/pages/SearchPage/SearchState/index.tsx
components/webui/server/src/routes/api/search/index.ts
components/webui/common/src/schemas/search.ts
API server: Updated routes to accept a datasets array.
components/api-server/src/client.rs
components/api-server/src/routes.rs
CLI tools: --dataset → --datasets with nargs="+". validate_dataset_exists → validate_datasets_exist to validate all datasets before submission.
components/clp-package-utils/clp_package_utils/scripts/search.py
components/clp-package-utils/clp_package_utils/scripts/native/search.py
components/clp-package-utils/clp_package_utils/scripts/native/utils.py
Docs: Updated the API server guide, the quick-start guide, and the regenerated OpenAPI spec to reflect datasets (plural).
docs/src/_static/generated/api-server-openapi.json
docs/src/user-docs/guides-using-the-api-server.md
docs/src/user-docs/quick-start/clp-json.md
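A minimal sketch of the fan-out described under "Scheduler fan-out" above, assuming illustrative helpers; fetch_archives_for_dataset and the Archive tuple are placeholders for the scheduler's actual metadata queries, not its real API:

```python
import heapq
from typing import List, NamedTuple, Optional


class Archive(NamedTuple):
    begin_timestamp: int
    archive_id: str
    dataset: str


def fetch_archives_for_dataset(dataset: str) -> List[Archive]:
    """Placeholder for the per-dataset metadata-table query (assumed sorted by begin_timestamp)."""
    raise NotImplementedError


def collect_archives(datasets: List[str], max_datasets_per_query: Optional[int]) -> List[Archive]:
    """Fans out across the requested datasets and merges the per-dataset archive lists."""
    if max_datasets_per_query is not None and len(datasets) > max_datasets_per_query:
        raise ValueError("Too many datasets requested.")
    per_dataset = [fetch_archives_for_dataset(ds) for ds in datasets]
    # Each Archive keeps its dataset so the worker can pass it to clp-s via --dataset.
    return list(heapq.merge(*per_dataset, key=lambda a: a.begin_timestamp))
```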
Checklist
breaking change.
Validation performed
0. Build CLP package
Task: Verify the project builds successfully with all changes (C++, Rust, Python, TypeScript).
Command:
Output:
Build completed successfully — all C++ targets, Rust binaries, Python packages, and TypeScript
client compiled without errors.
1. Start CLP and compress sample data
Task: Start CLP and run a compression job to populate the default dataset with data for subsequent search tests.
Commands:
Output:
All containers started and passed health checks. Compression succeeded.
2. Search with --datasets flag (single dataset)
Task: Verify the new --datasets CLI flag works for a single dataset with count aggregation.
Command:
./sbin/search.sh "*" --datasets default --count
Output:
Explanation: The search correctly targets the default dataset using the new --datasets flag and returns the expected 1,000,000 log events.
3. Search without --datasets flag (defaults to default)
Task: Verify that omitting the --datasets flag defaults to querying the default dataset.
Command:
./sbin/search.sh "*" --count
Output:
Explanation: Without --datasets, the CLI defaults to ["default"], producing the same result as the explicit dataset selection in step 2.
4. Search with non-existent dataset (validation)
Task: Verify that specifying a non-existent dataset produces a clear error and exits with a
non-zero status code.
Command:
./sbin/search.sh "*" --datasets nonexistent --count
Output:
Explanation: The validate_datasets_exist function correctly rejects the non-existent dataset before submitting the query job, preventing wasted scheduler/worker resources.
5. Web UI: Dataset multi-select rendering
Task: Verify the dataset selector renders correctly in single-select and multi-select states without
vertical overflow.
Method: Opened http://localhost:8080/search and tested the dataset selector.
Single dataset (1 selected): the tag (anything ×) renders in full, with no collapsed + 1 ... indicator.
Multiple datasets (3 selected): one tag (anything ×) is shown, with a + 2 ... indicator for the remaining.
Explanation: maxTagCount="responsive" is conditionally applied only when more than one dataset is selected. With a single dataset, responsive collapsing is disabled so the tag always renders in full. When multiple datasets are selected, the container expands to 200px (via the selectContainerExpanded CSS class), giving rc-overflow a known width to measure against for responsive tag collapsing.
6. Stop CLP
Command:
Output:
Summary by CodeRabbit
New Features
UI
API & CLI
Documentation