feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. #1273
base: main
Walkthrough
Adds runtime validation in schema loading to enforce at most one regex capture per schema variable; refactors Archive to a token-centric, capture-aware processing path with a new add_token_to_dicts helper and timestamp-pattern lookup; replaces direct token type-id member access with accessor calls; and adds tests and fixtures for single- and multi-capture scenarios.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Caller as Caller
participant Archive as Archive::write_msg_using_schema
participant TS as TimestampPattern
participant TokenProc as Archive::add_token_to_dicts
participant Dict as Dictionaries/Encoders
Caller->>Archive: write_msg_using_schema(log_surgeon::LogEventView)
Archive->>TS: search_known_ts_patterns(buffer, start, end)
TS-->>Archive: pattern result / none
loop each token (token_view)
Archive->>TokenProc: add_token_to_dicts(log_view, token_view)
alt token is delimiter/newline
TokenProc-->>Dict: handle delimiter/newline
else token has capture group
TokenProc->>Dict: add pre-capture constant
TokenProc->>Dict: add encoded capture substring (register lookup)
TokenProc->>Dict: add post-capture constant
else no capture
TokenProc->>Dict: add whole token as variable
end
end
Archive-->>Caller: complete write
sequenceDiagram
autonumber
participant Loader as Schema Loader
participant Utils as Utils::load_lexer_from_file
participant Rule as Schema Rule (regex)
Loader->>Utils: load_lexer_from_file(schema_path)
loop for each schema_var rule
Utils->>Rule: rule->m_regex_ptr->get_subtree_positive_captures()
Rule-->>Utils: capture list
alt captures > 1
Utils-->>Loader: throw std::runtime_error(file,line,rule_name,capture_count)
else
Utils-->>Loader: continue
end
end
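The load-time check in the second diagram can be sketched as follows. This is a minimal illustration, not the code in Utils.cpp: `validate_capture_count` is a hypothetical helper, and the capture count is passed in directly rather than obtained from the rule's regex.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: rejects a schema rule that declares more than one capture group.
// In the PR the count comes from the rule's regex; here the caller supplies it.
void validate_capture_count(std::string const& rule_name, std::size_t num_captures) {
    if (num_captures > 1) {
        throw std::runtime_error(
                "Schema rule '" + rule_name + "' has " + std::to_string(num_captures)
                + " capture groups; at most one is supported."
        );
    }
}
```

Rules with zero or one capture pass through silently, which matches the "continue" branch of the diagram.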
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- components/core/src/clp/Utils.cpp (2 hunks)
- components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
- components/core/tests/test-ParserWithUserSchema.cpp (1 hunks)
- components/core/tests/test_schema_files/multiple_capture_groups.txt (1 hunks)
- components/core/tests/test_schema_files/single_capture_group.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer `false == <expression>` rather than `!<expression>`.
Files:
- components/core/tests/test-ParserWithUserSchema.cpp
- components/core/src/clp/Utils.cpp
- components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (2)
components/core/tests/test-ParserWithUserSchema.cpp (2)
- components/core/src/clp/Utils.cpp (2): load_lexer_from_file(125-254), load_lexer_from_file(126-126)
- components/core/src/clp/Utils.hpp (1): load_lexer_from_file(52-55)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
- components/core/src/clp/streaming_archive/writer/Archive.hpp (2): timestamp(146-146), logtype_id(229-232)
- components/core/src/clp/streaming_archive/writer/File.hpp (1): timestamp(89-95)
- components/core/src/clp/EncodedVariableInterpreter.hpp (3): encoded_var(52-52), encoded_var(111-111), id(51-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: antlr-code-committed (macos-15)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp/Utils.cpp (1)
11-11: Header addition looks good.
Adding the header resolves downstream std::string usage here.
components/core/src/clp/streaming_archive/writer/Archive.cpp (4)
362-368: Potential off-by-one in uncompressed byte count (verify semantics).
If m_end_pos is exclusive, end_pos - start_pos is correct; if inclusive, add +1. Please verify Token.m_end_pos semantics, especially across buffer wrap.
Consider a focused unit test with a two-token message exercising boundary conditions (start_pos == end_pos and wrap).
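To make the boundary conditions concrete, here is a minimal sketch of a length computation over a ring buffer. It assumes an exclusive end position, positions already wrapped into the buffer, and that start_pos == end_pos means an empty span; `span_length` and its parameters are hypothetical and not taken from the Token class.

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes in [start_pos, end_pos) inside a ring buffer of `buffer_size` bytes.
// Assumption: end_pos is exclusive and both positions are already wrapped (< buffer_size).
std::size_t span_length(std::size_t start_pos, std::size_t end_pos, std::size_t buffer_size) {
    assert(start_pos < buffer_size && end_pos < buffer_size);
    if (start_pos <= end_pos) {
        // No wrap: plain difference, no +1 because the end is exclusive.
        return end_pos - start_pos;
    }
    // Wrapped: bytes up to the end of the buffer plus bytes from its beginning.
    return buffer_size - start_pos + end_pos;
}
```

If the end position turned out to be inclusive instead, each branch would need one extra byte, which is the off-by-one this comment asks to rule out.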
371-382: Good: token_type made const and delimiter handling left intact.
No issues spotted; aligns with existing flow.
485-496: OK: zero-initialised logtype_id and downstream writes.
This aligns with safer initialisation and existing dictionary API.
317-341: No signature mismatch found
Header and implementation both declare write_msg_using_schema(log_surgeon::LogEventView const&); no action required.
components/core/tests/test_schema_files/single_capture_group.txt (1)
1-1: Fixture is minimal and appropriate.
Covers the intended single-capture scenario with surrounding literals.
components/core/tests/test_schema_files/multiple_capture_groups.txt (1)
1-1: Good negative fixture.
Triggers the >1 capture validation path as desired.
components/core/tests/test-ParserWithUserSchema.cpp (1)
195-204: Exact error assertion is OK; keep in sync if message changes.
If you accept the optional richer error in Utils.cpp, adjust this expectation accordingly (or match a substring).
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
362-366: Guard against empty buffer when computing end_pos.
pos() − 1 will underflow if the buffer is empty. Unlikely, but add a defensive check/assert.
Would you add a precondition (e.g., assert(log_output_buffer->pos() > 0)) before using pos() − 1?
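As an illustration of why the guard matters: with an unsigned position type, 0 − 1 wraps to the largest representable value instead of going negative. A minimal sketch, where `last_index` is a hypothetical helper and not part of the output buffer API:

```cpp
#include <cstddef>
#include <optional>

// Returns the index of the last written element, or std::nullopt when nothing
// has been written. Without the check, `pos - 1` on an unsigned 0 would wrap
// around to a huge value.
std::optional<std::size_t> last_index(std::size_t pos) {
    if (0 == pos) {
        return std::nullopt;
    }
    return pos - 1;
}
```

An assert on the precondition, as suggested above, is the lighter alternative when an empty buffer is considered a programming error rather than a recoverable state.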
391-405: Fix: track var_ids for int/float dictionary fallbacks and follow negation style.
When integer/float cannot be encoded, you add a dictionary entry but don’t push the var ID into m_var_ids, which breaks segment indexing. Also, prefer “false == …” per repo style.
-            encoded_variable_t encoded_var{};
-            if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+            encoded_variable_t encoded_var{};
+            if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                         token.to_string(),
                         encoded_var
                 ))
             {
                 variable_dictionary_id_t id{};
                 m_var_dict.add_entry(token.to_string(), id);
+                m_var_ids.push_back(id);
                 encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                 m_logtype_dict_entry.add_dictionary_var();
             } else {
                 m_logtype_dict_entry.add_int_var();
             }
             m_encoded_vars.push_back(encoded_var);
@@
-            encoded_variable_t encoded_var{};
-            if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
+            encoded_variable_t encoded_var{};
+            if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var(
                         token.to_string(),
                         encoded_var
                 ))
             {
                 variable_dictionary_id_t id{};
                 m_var_dict.add_entry(token.to_string(), id);
+                m_var_ids.push_back(id);
                 encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                 m_logtype_dict_entry.add_dictionary_var();
             } else {
                 m_logtype_dict_entry.add_float_var();
             }
             m_encoded_vars.push_back(encoded_var);
Also applies to: 407-421
♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
431-440: Also treat empty capture-id vectors as “no capture.”
The optional can report has_value() while holding an empty vector; the later at(0) would then throw. Handle empty as the no‑capture path.
-    auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
-    if (false == capture_ids.has_value()) {
+    auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
+    if (false == capture_ids.has_value() || capture_ids->empty()) {
         variable_dictionary_id_t id{};
         m_var_dict.add_entry(token.to_string(), id);
         m_var_ids.push_back(id);
         m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
         m_logtype_dict_entry.add_dictionary_var();
         break;
     }
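The distinction between an absent optional and a present-but-empty vector can be shown in isolation. This is a minimal sketch: `CaptureIds` is a hypothetical stand-in for the lookup's return type and `has_usable_capture` is an illustrative name, neither taken from log-surgeon.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the type returned by the capture-id lookup.
using CaptureIds = std::optional<std::vector<std::uint32_t>>;

// True only when the optional is engaged AND the vector has at least one id.
// An engaged optional holding an empty vector is treated as "no capture",
// because calling at(0) on it would throw std::out_of_range.
bool has_usable_capture(CaptureIds const& capture_ids) {
    return capture_ids.has_value() && false == capture_ids->empty();
}
```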
442-454: Validate register positions before using front()/back().
get_reversed_reg_positions(...) may return empty; front()/back() would be UB. Throw a clear error if empty.
-    auto const [start_reg_id, end_reg_id]{register_ids.value()};
-    auto const capture_start{token.get_reversed_reg_positions(start_reg_id).back()};
-    auto const capture_end{token.get_reversed_reg_positions(end_reg_id).front()};
+    auto const [start_reg_id, end_reg_id]{register_ids.value()};
+    auto const& start_positions = token.get_reversed_reg_positions(start_reg_id);
+    auto const& end_positions = token.get_reversed_reg_positions(end_reg_id);
+    if (start_positions.empty() || end_positions.empty()) {
+        throw(std::runtime_error(
+                "Empty register positions for variable's capture group. Full token: "
+                + token.to_string()
+        ));
+    }
+    auto const capture_start{start_positions.back()};
+    auto const capture_end{end_positions.front()};
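The same guard can be exercised on plain vectors. This is a minimal sketch: `Positions` and `capture_bounds` are hypothetical names, and the choice of back() for the start and front() for the end simply mirrors the suggestion above.

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

using Positions = std::vector<std::uint32_t>;

// Returns {start, end} for a capture, taking the last recorded start position
// and the first recorded end position. Throws instead of calling back()/front()
// on an empty vector, which would be undefined behaviour.
std::pair<std::uint32_t, std::uint32_t>
capture_bounds(Positions const& start_positions, Positions const& end_positions) {
    if (start_positions.empty() || end_positions.empty()) {
        throw std::runtime_error("Empty register positions for capture group.");
    }
    return {start_positions.back(), end_positions.front()};
}
```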
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- components/core/src/clp/Utils.cpp (2 hunks)
- components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer `false == <expression>` rather than `!<expression>`.
Files:
- components/core/src/clp/Utils.cpp
- components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
- components/core/src/clp/streaming_archive/writer/Archive.hpp (2): timestamp(146-146), logtype_id(229-232)
- components/core/src/clp/streaming_archive/writer/File.hpp (1): timestamp(89-95)
- components/core/src/clp/EncodedVariableInterpreter.cpp (6): convert_string_to_representable_integer_var(24-61), convert_string_to_representable_integer_var(24-27), convert_string_to_representable_float_var(63-142), convert_string_to_representable_float_var(63-66), encode_var_dict_id(199-201), encode_var_dict_id(199-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (3)
components/core/src/clp/Utils.cpp (1)
11-11: LGTM: include is appropriate.
Needed for the new error message construction.
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
5-10: LGTM: header additions are appropriate.
These headers match the new usage and remove transitive‑include reliance.
Also applies to: 15-15
479-491: LGTM: correct style and write path.
Style matches guideline (false == …) and the write/update path is consistent.
SharafMohamed left a comment
Everything looks good, but it would be nice to have a small example log and a unit test to check correctness.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
- components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
- components/core/src/clp/streaming_archive/writer/Archive.hpp (2 hunks)
- components/core/tests/test-ParserWithUserSchema.cpp (3 hunks)
- components/core/tests/test_log_files/log_with_capture.txt (1 hunks)
- components/core/tests/test_schema_files/single_capture_group.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer `false == <expression>` rather than `!<expression>`.
Files:
- components/core/src/clp/streaming_archive/writer/Archive.cpp
- components/core/tests/test-ParserWithUserSchema.cpp
- components/core/src/clp/streaming_archive/writer/Archive.hpp
🧠 Learnings (19)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1033
File: components/core/config/schemas.txt:42-43
Timestamp: 2025-07-17T16:08:23.185Z
Learning: In log-surgeon, regex patterns like `(?:\d{1,2}| \d)` are not supported. Log-surgeon has its own regex implementation with different capabilities compared to standard regex engines. Patterns like `[ 0-9]{2}` that allow character classes with spaces are valid in log-surgeon schemas.
📚 Learning: 2024-10-10T05:46:35.188Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:299-307
Timestamp: 2024-10-10T05:46:35.188Z
Learning: In the C++ function `get_schema_subtree_bitmap` in `KeyValuePairLogEvent.cpp`, when a loop uses `while (true)` with an internal check on `optional.has_value()`, and comments explain that this structure is used to silence `clang-tidy` warnings about unchecked optional access, this code is acceptable and should not be refactored to use `while (optional.has_value())`.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-09-22T15:46:34.873Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1311
File: integration-tests/tests/test_identity_transformation.py:93-97
Timestamp: 2025-09-22T15:46:34.873Z
Learning: In the CLP project, multi-line formatting is preferred over single-line ternary operators when the single line would exceed the project's line length limits, prioritizing code readability and adherence to coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-18T02:31:18.595Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/src/clp/ffi/ir_stream/utils.hpp:0-0
Timestamp: 2024-10-18T02:31:18.595Z
Learning: In `components/core/src/clp/ffi/ir_stream/utils.hpp`, the function `size_dependent_encode_and_serialize_schema_tree_node_id` assumes that the caller checks that `node_id` fits within the range of `encoded_node_id_t` before casting.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-ParserWithUserSchema.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/tests/test-ParserWithUserSchema.cpp
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/core/tests/test-ParserWithUserSchema.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.
Applied to files:
components/core/tests/test-ParserWithUserSchema.cpp
🧬 Code graph analysis (1)
components/core/tests/test-ParserWithUserSchema.cpp (4)
- components/core/src/clp/Utils.cpp (2): load_lexer_from_file(125-255), load_lexer_from_file(126-126)
- components/core/src/clp/Utils.hpp (1): load_lexer_from_file(52-55)
- components/core/src/clp/clp/run.hpp (1): run(5-5)
- components/core/src/clp/streaming_archive/reader/Archive.hpp (1): path(44-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: rust-checks
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (14)
components/core/tests/test_log_files/log_with_capture.txt (1)
1-1: LGTM! Test data correctly exercises capture group extraction.
The log line contains a timestamp and tokens that match the capture pattern in the single_capture_group.txt schema, allowing validation of capture-based dictionary storage.
components/core/tests/test-ParserWithUserSchema.cpp (6)
6-28: LGTM! Includes support new test infrastructure.
The added headers provide necessary support for CLP-enabled file I/O, test utilities, and capture group handling.
38-83: LGTM! Helper functions correctly construct test infrastructure.
The `run_clp_compress` function properly builds the command-line arguments and delegates to `clp::clp::run`. Path helpers follow established patterns.
171-175: LGTM! Accessor refactoring improves encapsulation.
The change from direct pointer access to the `get_type_ids()` accessor follows better encapsulation practices. The test context with controlled data makes this safe here.
183-193: LGTM! Validates single capture group registration.
The test correctly verifies that a schema with one capture group is loaded and the capture is properly named. Note that this tests schema loading only; end-to-end extraction is covered by the dictionary verification test below.
195-204: LGTM! Validates multiple capture group rejection.
The test correctly verifies that schemas with multiple capture groups are rejected at load time with a clear error message, enforcing the single-capture constraint.
206-240: LGTM! Comprehensive end-to-end validation of capture extraction.
The test validates the entire capture group pipeline: compression with schema, archive creation, and dictionary content verification. It correctly expects only the captured digits ("123", "4123") in the variable dictionary and the surrounding text as static logtype components.
components/core/tests/test_schema_files/single_capture_group.txt (1)
1-3: LGTM! Schema correctly defines single capture group pattern.
The schema defines a rule with one named capture group that matches alphabetic prefixes followed by digits, suitable for testing capture-based variable extraction.
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
152-152: LGTM! Parameter rename improves consistency.
Renaming `log_event_view` to `log_view` aligns with usage patterns in the implementation and improves code consistency.
292-299: LGTM! New helper method properly encapsulates token processing.
The `add_token_to_dicts` method provides clear separation of concerns by handling token-to-dictionary logic independently from the main message processing flow.
components/core/src/clp/streaming_archive/writer/Archive.cpp (4)
5-9: LGTM! Includes support new token processing functionality.
The added headers provide necessary types and utilities for capture group handling and error reporting.
363-418: LGTM! Capture group decomposition logic is sound.
The implementation correctly handles three cases:
- No capture: stores entire token as variable
- Has capture but no register positions: throws descriptive error
- Has capture with positions: decomposes into before/capture/after using Token's position manipulation
The approach of using `set_start_pos`/`set_end_pos` delegates ring-buffer handling to the Token class, which is cleaner and less error-prone than manual substring calculations.
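The three cases can be sketched on a plain string. This is an illustration only: it works on a contiguous `std::string_view` rather than the ring buffer, models the error case as an out-of-range capture, and `TokenParts`, `CaptureRange` and `split_token` are hypothetical names that do not appear in the PR.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

struct TokenParts {
    std::string before;    // static text preceding the capture
    std::string variable;  // text stored in the variable dictionary
    std::string after;     // static text following the capture
};

// Half-open [begin, end) offsets of the capture within the token.
using CaptureRange = std::pair<std::size_t, std::size_t>;

TokenParts split_token(std::string_view token, std::optional<CaptureRange> capture) {
    if (false == capture.has_value()) {
        // No capture: the whole token is the variable.
        return {std::string{}, std::string{token}, std::string{}};
    }
    auto const [begin, end] = capture.value();
    if (begin > end || end > token.size()) {
        throw std::runtime_error("Invalid capture range for token: " + std::string{token});
    }
    // Capture present: split into before/capture/after.
    return {std::string{token.substr(0, begin)},
            std::string{token.substr(begin, end - begin)},
            std::string{token.substr(end)}};
}
```

For a token such as `userID=123` with the capture covering offsets 7 to 10, only `123` ends up as the variable and `userID=` stays in the static text, which is the behaviour the end-to-end test checks via the variable dictionary.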
422-501: LGTM! Refactored message processing correctly delegates token handling.
The refactored `write_msg_using_schema` properly:
- Extracts and validates timestamps
- Handles file/archive splitting
- Delegates token processing to `add_token_to_dicts`
- Writes encoded messages only when logtype has content
The control flow is clear and maintains existing behaviour while supporting capture group decomposition.
394-395: I'll verify the type safety of position comparisons by investigating the return type of `get_reversed_reg_positions()` and the types of the containers being used.
Actionable comments posted: 4
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
377-384: Guard against empty capture-id vector.
`has_value()` can return true while the underlying vector is empty; `capture_ids.value().at(0)` on line 386 would throw. Treat an empty vector as "no capture."
-    if (false == capture_ids.has_value()) {
+    if (false == capture_ids.has_value() || capture_ids->empty()) {
         variable_dictionary_id_t id{};
         m_var_dict.add_entry(token_view.to_string(), id);
         m_var_ids.push_back(id);
         m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
         m_logtype_dict_entry.add_dictionary_var();
         break;
     }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
- components/core/src/clp/streaming_archive/writer/Archive.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer `false == <expression>` rather than `!<expression>`.
Files:
- components/core/src/clp/streaming_archive/writer/Archive.cpp
- components/core/src/clp/streaming_archive/writer/Archive.hpp
🧠 Learnings (15)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1033
File: components/core/config/schemas.txt:42-43
Timestamp: 2025-07-17T16:08:23.185Z
Learning: In log-surgeon, regex patterns like `(?:\d{1,2}| \d)` are not supported. Log-surgeon has its own regex implementation with different capabilities compared to standard regex engines. Patterns like `[ 0-9]{2}` that allow character classes with spaces are valid in log-surgeon schemas.
📚 Learning: 2024-10-10T05:46:35.188Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:299-307
Timestamp: 2024-10-10T05:46:35.188Z
Learning: In the C++ function `get_schema_subtree_bitmap` in `KeyValuePairLogEvent.cpp`, when a loop uses `while (true)` with an internal check on `optional.has_value()`, and comments explain that this structure is used to silence `clang-tidy` warnings about unchecked optional access, this code is acceptable and should not be refactored to use `while (optional.has_value())`.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-09-22T15:46:34.873Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1311
File: integration-tests/tests/test_identity_transformation.py:93-97
Timestamp: 2025-09-22T15:46:34.873Z
Learning: In the CLP project, multi-line formatting is preferred over single-line ternary operators when the single line would exceed the project's line length limits, prioritizing code readability and adherence to coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-18T02:31:18.595Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/src/clp/ffi/ir_stream/utils.hpp:0-0
Timestamp: 2024-10-18T02:31:18.595Z
Learning: In `components/core/src/clp/ffi/ir_stream/utils.hpp`, the function `size_dependent_encode_and_serialize_schema_tree_node_id` assumes that the caller checks that `node_id` fits within the range of `encoded_node_id_t` before casting.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
components/core/src/clp/EncodedVariableInterpreter.cpp (6)
convert_string_to_representable_integer_var (24-61)
convert_string_to_representable_integer_var (24-27)
encode_var_dict_id (199-201)
encode_var_dict_id (199-199)
convert_string_to_representable_float_var (63-142)
convert_string_to_representable_float_var (63-66)
components/core/src/clp/TimestampPattern.cpp (2)
search_known_ts_patterns (202-219)
search_known_ts_patterns (202-207)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: rust-checks
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (7)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
152-152: LGTM! The signature update to use `log_surgeon::LogEventView const&` aligns with the new token-centric processing approach.
292-299: LGTM! The new `add_token_to_dicts` method is appropriately declared in the private section with clear documentation describing its purpose.
components/core/src/clp/streaming_archive/writer/Archive.cpp (5)
5-9: LGTM! The new includes are appropriate for the added functionality: `<cstdint>`, `<stdexcept>` for error handling, log_surgeon types, and `TimestampPattern.hpp` for timestamp parsing.
Also applies to: 15-18, 26-27
398-420: Capture group slicing implementation looks correct. The implementation follows the suggested approach from previous reviews: manipulating token start/end positions directly rather than doing manual ring-buffer arithmetic. The before/capture/after segments are correctly extracted by adjusting `token_view`'s positions.
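The before/capture/after split described in this comment can be sketched as follows. This is a minimal illustration, not the code under review: it assumes the token is a flat `std::string_view` (the real buffer may wrap around), and `TokenSegments` and `split_token_around_capture` are hypothetical names that do not belong to log-surgeon or CLP.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical holder for the three pieces a token is split into.
struct TokenSegments {
    std::string_view before;   // static text preceding the capture
    std::string_view capture;  // text matched by the capture group
    std::string_view after;    // static text following the capture
};

// Splits `token` around the half-open range [capture_begin, capture_end).
// Both positions are offsets into `token` (an assumption of this sketch).
inline auto split_token_around_capture(
        std::string_view token,
        std::size_t capture_begin,
        std::size_t capture_end
) -> TokenSegments {
    assert(capture_begin <= capture_end);
    assert(capture_end <= token.size());
    return TokenSegments{
            token.substr(0, capture_begin),
            token.substr(capture_begin, capture_end - capture_begin),
            token.substr(capture_end)
    };
}
```

In this sketch the outer two segments would go to the logtype as static text and only the middle segment would be encoded as a variable.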
480-494: Token processing loop is well-structured. The loop correctly:
- Skips the timestamp token (index 0)
- Validates type_ids before access
- Handles delimiters appropriately
- Delegates to `add_token_to_dicts` for per-token processing
496-508: LGTM! Correct use of `false ==` pattern per coding guidelines, and the logtype dictionary entry is only written when non-empty.
470-474: I apologize, but I was unable to access the repository to verify the code review comment due to repeated cloning failures in the sandbox environment. Without access to the actual codebase, I cannot definitively verify:
- Whether early return guards exist that would prevent the out-of-bounds access described
- What schema guarantees exist regarding minimum token counts when `timestamp_pattern == nullptr`
- Whether additional bounds checks are already in place
The concern raised in the review is legitimate and requires manual verification by someone with repository access. The potential issue of accessing `get_token(1)` when only one token exists is a valid edge case that needs to be confirmed against the actual implementation.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
379-386: Guard against empty capture-id vector before accessing `at(0)`. The condition at line 379 checks `has_value()` but doesn't verify the vector is non-empty. Line 388 calls `capture_ids.value().at(0)`, which will throw if the vector is empty. This issue was previously flagged; an empty vector should be treated as "no capture." Apply this diff:
-            if (false == capture_ids.has_value()) {
+            if (false == capture_ids.has_value() || capture_ids->empty()) {
                 variable_dictionary_id_t id{};
                 m_var_dict.add_entry(token_view.to_string(), id);
                 m_var_ids.push_back(id);
                 m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                 m_logtype_dict_entry.add_dictionary_var();
                 break;
             }
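The guard suggested above can be isolated in a small helper. The sketch below is illustrative only: `capture_id_t` and `first_capture_id` are hypothetical names, and the element type of the real capture-id vector may differ. It treats a disengaged optional and an engaged-but-empty vector the same way, so `at(0)` is never reached on an empty vector.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical alias; the real capture-id type may differ.
using capture_id_t = std::uint32_t;

// Returns the first capture id, or std::nullopt when the optional is
// disengaged or holds an empty vector ("no capture" in both cases).
inline auto first_capture_id(std::optional<std::vector<capture_id_t>> const& capture_ids)
        -> std::optional<capture_id_t> {
    if (false == capture_ids.has_value() || capture_ids->empty()) {
        return std::nullopt;
    }
    return capture_ids->front();
}
```

Folding both checks into one condition keeps the "no capture" path in a single place, which is what the suggested diff does inline.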
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer `false == <expression>` rather than `!<expression>`.
Files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (19)
📚 Learning: 2025-06-09T17:48:56.024Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 988
File: components/log-viewer-webui/client/src/components/QueryBox/InputWithCaseSensitive/CaseSenstiveToggle.tsx:34-37
Timestamp: 2025-06-09T17:48:56.024Z
Learning: In the y-scope/clp project, prefer `false == <expression>` rather than `!<expression>` for boolean expressions in TypeScript/JavaScript files, as specified in the coding guidelines.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
Repo: y-scope/clp PR: 1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-14T03:45:21.323Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:331-335
Timestamp: 2024-10-14T03:45:21.323Z
Learning: When reviewing code, ensure that suggestions about unnecessary use of `std::move` with `std::shared_ptr` are only made if `std::move` is actually used in the code.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/EncodedVariableInterpreter.cpp (6)
convert_string_to_representable_integer_var (24-61)
convert_string_to_representable_integer_var (24-27)
encode_var_dict_id (199-201)
encode_var_dict_id (199-199)
convert_string_to_representable_float_var (63-142)
convert_string_to_representable_float_var (63-66)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
timestamp (89-95)
components/core/src/clp/TimestampPattern.cpp (2)
search_known_ts_patterns (202-219)
search_known_ts_patterns (202-207)
🔇 Additional comments (4)
components/core/src/clp/streaming_archive/writer/Archive.cpp (4)
5-9: LGTM! The new includes are appropriate for the token-centric processing path and timestamp pattern handling introduced in this PR.
Also applies to: 15-15, 18-18, 26-26
403-422: LGTM! The token position manipulation approach cleanly handles capture splitting without manual ring-buffer arithmetic. This follows the pattern suggested in previous reviews and avoids the substring calculation issues present in earlier versions.
429-454: LGTM with note on `const_cast`. The timestamp pattern search and handling logic is correct. The `const_cast` at line 450 was previously flagged and marked as addressed. While ideally `m_old_ts_pattern` would be declared as a pointer-to-const in the header, this appears to be acceptable in the current codebase context.
482-497: LGTM! The token processing loop correctly handles delimiter insertion and delegates token processing to `add_token_to_dicts`. The type ID extraction here is necessary for the delimiter conditional logic, so the duplication with validation in `add_token_to_dicts` is acceptable.
Description
Previously, when using log surgeon for parsing, the full match of a schema rule's regex pattern was stored as a variable in CLP. In certain cases this differed from the heuristic parser's behaviour.
For example, the heuristic parser's "equals" rule can be represented with the regex pattern `.*=(?<var>.*[a-zA-Z0-9].*)`. The heuristic parser stores only the `var` capture group as a variable (storing the prefix `.*=` as static text). When using log surgeon without capture groups, this behaviour was not possible because the full match (including the prefix `.*=`) would be stored as a variable.
This PR allows schema rules to contain at most one capture group. If a capture group is present, only the capture's match is stored as a variable and anything surrounding it is stored as static text. If the capture is repeated (e.g. `text(?<var>variable)+text`), all repetitions are stored together as a single variable.
Checklist
breaking change.
Validation performed
Added unit tests for schema creation with single or multiple captures.
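To make the single-capture behaviour from the Description concrete, here is a minimal sketch that uses `std::regex` purely as a stand-in for log-surgeon, which has its own regex engine. `std::regex` does not support named groups, so the "equals" rule is written with an unnamed group, `.*=(.*[a-zA-Z0-9].*)`. `SplitMatch` and `split_on_single_capture` are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type: static text around the capture plus the variable.
struct SplitMatch {
    std::string static_prefix;
    std::string variable;
    std::string static_suffix;
};

// Matches `token` against `rule`, which must contain exactly one capture
// group, and returns the text before, inside, and after that capture.
inline auto split_on_single_capture(std::string const& token, std::regex const& rule)
        -> std::optional<SplitMatch> {
    std::smatch match;
    if (false == std::regex_match(token, match, rule)) {
        return std::nullopt;
    }
    // match[0] is the full match, so one capture group means size() == 2.
    if (2 != match.size() || false == match[1].matched) {
        return std::nullopt;
    }
    auto const begin = static_cast<std::size_t>(match.position(1));
    auto const length = static_cast<std::size_t>(match.length(1));
    return SplitMatch{
            token.substr(0, begin),
            token.substr(begin, length),
            token.substr(begin + length)
    };
}
```

With this stand-in, a token such as `key=value42` splits into the static prefix `key=` and the variable `value42`, which mirrors the heuristic parser's treatment of the "equals" rule.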
Summary by CodeRabbit
New Features
Bug Fixes
Tests