feat!: Add multi-dataset query support (resolves #1933). #1992
Changes from 8 commits
```diff
@@ -72,13 +72,13 @@ def submit_query_job(
     return db_cursor.lastrowid


-def validate_dataset_exists(db_config: Database, dataset: str) -> None:
+def validate_datasets_exist(db_config: Database, datasets: list[str]) -> None:
     """
-    Validates that `dataset` exists in the metadata database.
+    Validates that all datasets in `datasets` exist in the metadata database.

     :param db_config:
-    :param dataset:
-    :raise: ValueError if the dataset doesn't exist.
+    :param datasets:
+    :raise: ValueError if any dataset doesn't exist.
     """
     sql_adapter = SqlAdapter(db_config)
     clp_db_connection_params = db_config.get_clp_connection_params_and_type(True)

@@ -87,8 +87,10 @@ def validate_dataset_exists(db_config: Database, dataset: str) -> None:
         closing(sql_adapter.create_connection(True)) as db_conn,
         closing(db_conn.cursor(dictionary=True)) as db_cursor,
     ):
-        if dataset not in fetch_existing_datasets(db_cursor, table_prefix):
-            raise ValueError(f"Dataset `{dataset}` doesn't exist.")
+        existing_datasets = fetch_existing_datasets(db_cursor, table_prefix)
+        for dataset in datasets:
+            if dataset not in existing_datasets:
+                raise ValueError(f"Dataset `{dataset}` doesn't exist.")


 def wait_for_query_job(sql_adapter: SqlAdapter, job_id: int) -> QueryJobStatus:
```
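The refactor above replaces a per-dataset lookup with a single fetch followed by in-memory membership checks. A minimal standalone sketch of that pattern (the database lookup is stubbed out; `validate_datasets_exist` and `fetch_existing_datasets` are the PR's real names, but the stub's contents and the lack of database arguments are illustrative only):

```python
def fetch_existing_datasets_stub() -> set[str]:
    # Stand-in for the real fetch_existing_datasets(db_cursor, table_prefix),
    # which queries the metadata database once.
    return {"web-logs", "app-logs"}


def validate_datasets_exist(datasets: list[str]) -> None:
    """Raises ValueError if any requested dataset doesn't exist."""
    # One fetch, then N cheap membership checks - instead of N fetches.
    existing_datasets = fetch_existing_datasets_stub()
    for dataset in datasets:
        if dataset not in existing_datasets:
            raise ValueError(f"Dataset `{dataset}` doesn't exist.")


validate_datasets_exist(["web-logs", "app-logs"])  # all exist: no error
try:
    validate_datasets_exist(["web-logs", "missing"])
except ValueError as e:
    print(e)  # → Dataset `missing` doesn't exist.
```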
```diff
@@ -432,6 +432,7 @@ class QueryScheduler(BaseModel):
     host: DomainStr = "localhost"
     port: Port = DEFAULT_PORT
     jobs_poll_delay: PositiveFloat = 0.1  # seconds
+    max_datasets_per_query: PositiveInt | None = 10
     num_archives_to_search_per_sub_job: PositiveInt = 16
     logging_level: LoggingLevel = "INFO"
```

Contributor (coderabbitai) 🧹 Nitpick | 🔵 Trivial: Add an inline comment documenting the `None` case. Users editing config files need to know that setting this field to `None` removes the limit.

📝 Suggested change:

```diff
-    max_datasets_per_query: PositiveInt | None = 10
+    max_datasets_per_query: PositiveInt | None = 10  # None means unlimited
```
```diff
@@ -605,6 +605,10 @@ CommandLineArguments::parse_arguments(int argc, char const** argv) {
             po::value<uint64_t>(&m_max_num_results)->value_name("MAX")->
                     default_value(m_max_num_results),
             "The maximum number of results to output"
+    )(
+            "dataset",
+            po::value<std::string>(&m_dataset)->value_name("DATASET"),
+            "The dataset name to include in each result document"
     );

     po::options_description file_output_handler_options("File Output Handler Options");
```

Comments on lines +608 to 612:

Member (author): The dataset name needs to be included in the query results before dumping them into the result cache. I know this feels like a hack; a more future-proof interface might be to accept a JSON object string.

Member (author): Alternatively, we can add multi-dataset support to the stream extraction flow as well; then we just pass an array of datasets. I think this may be more aligned with the future plan of using a single table for all datasets (though maybe I haven't had a correct understanding of how we're going to refactor the datasets feature).

Member (author): discussed offline - the
```diff
@@ -70,11 +70,13 @@ ResultsCacheOutputHandler::ResultsCacheOutputHandler(
         string const& collection,
         uint64_t batch_size,
         uint64_t max_num_results,
+        string dataset,
         bool should_output_timestamp
 )
         : ::clp_s::search::OutputHandler(should_output_timestamp, true),
           m_batch_size(batch_size),
-          m_max_num_results(max_num_results) {
+          m_max_num_results(max_num_results),
+          m_dataset(std::move(dataset)) {
```

Nit: per the coding guidelines, use brace initialization:

```diff
-          m_max_num_results(max_num_results),
-          m_dataset(std::move(dataset)) {
+          m_max_num_results{max_num_results},
+          m_dataset{std::move(dataset)} {
```
```diff
@@ -128,19 +128,22 @@ class ResultsCacheOutputHandler : public ::clp_s::search::OutputHandler {
             std::string_view message,
             epochtime_t timestamp,
             std::string_view archive_id,
-            int64_t log_event_idx
+            int64_t log_event_idx,
+            std::string_view dataset
     )
             : original_path(original_path),
               message(message),
               timestamp(timestamp),
               archive_id(archive_id),
-              log_event_idx(log_event_idx) {}
+              log_event_idx(log_event_idx),
+              dataset(dataset) {}
```

Suggested change (brace initialization, per the coding guidelines):

```diff
-              log_event_idx(log_event_idx),
-              dataset(dataset) {}
+              log_event_idx{log_event_idx},
+              dataset{dataset} {}
```
```diff
@@ -58,6 +58,7 @@ constexpr char cLogEventIx[]{"log_event_ix"};
 constexpr char cTimestamp[]{"timestamp"};
 constexpr char cMessage[]{"message"};
 constexpr char cArchiveId[]{"archive_id"};
+constexpr char cDataset[]{"dataset"};
```

Suggested change, to comply with our current coding guidelines:

```diff
-constexpr char cDataset[]{"dataset"};
+constexpr std::string_view cDataset{"dataset"};
```
Member (gibber9809): We should probably also move all of these results cache related constants out of archive_constants.hpp at some point, but not as part of this PR. @coderabbitai can you create an issue to track?

coderabbitai: @gibber9809, I'll create an issue to track moving the results cache constants out of archive_constants.hpp.
Script executed:

```shell
#!/bin/bash
# Create a GitHub issue to track refactoring results cache constants
gh issue create \
  --repo y-scope/clp \
  --title "Refactor: Move results cache constants out of archive_constants.hpp" \
  --body "## Description
The results cache related constants (in \`results_cache::decompression\` and \`results_cache::search\` namespaces) should be moved out of \`components/core/src/clp_s/archive_constants.hpp\` into a more appropriate location, as they are not directly related to archive structure.

## Affected Code
File: \`components/core/src/clp_s/archive_constants.hpp\`

Namespaces to relocate:
- \`clp_s::constants::results_cache::decompression\`
- \`clp_s::constants::results_cache::search\`

## Context
This refactoring was identified during review of PR #1992:
- PR: https://github.com/y-scope/clp/pull/1992
- Comment: https://github.com/y-scope/clp/pull/1992#discussion_r2843034948

Requested by: @gibber9809" \
  --assignee gibber9809
```
@gibber9809, I've created issue #2016 to track moving the results cache constants (results_cache::decompression and results_cache::search namespaces) out of archive_constants.hpp. The issue has been assigned to you and includes backlinks to this PR and comment for context.