@davidlion
Member

@davidlion davidlion commented Aug 28, 2025

Description

Previously, when parsing with log surgeon, the full match of a schema rule's regex pattern was stored as a variable in CLP. In certain cases, this differed from the heuristic parser's behaviour.

For example, the heuristic's "equals" rule can be represented with the regex pattern .*=(?<var>.*[a-zA-Z0-9].*). The heuristic parser stores only the var capture group as a variable and stores the prefix .*= as static text. With log surgeon and no capture groups, this behaviour was not possible because the full match (including the prefix .*=) was stored as a variable.

This PR allows schema rules to contain at most one capture group. If a capture group is present, only the capture's match is stored as a variable and anything surrounding it is stored as static text. If the capture is repeated (e.g., text(?<var>variable)+text), all repetitions are stored together as a single variable.
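
The splitting behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it uses std::regex as a stand-in for log surgeon's own regex engine (the default ECMAScript grammar in std::regex has no named groups, so the capture is unnamed here), and SplitToken and split_on_capture are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type: the static text around the capture and the captured variable.
struct SplitToken {
    std::string static_prefix;
    std::string variable;
    std::string static_suffix;
};

// Hypothetical helper: splits `token` around the first capture group of `rule`.
// If the rule does not match, or has no participating capture group, the whole token is
// treated as the variable (the pre-PR behaviour described above).
SplitToken split_on_capture(std::string const& token, std::regex const& rule) {
    std::smatch match;
    if (false == std::regex_match(token, match, rule) || match.size() < 2
        || false == match[1].matched)
    {
        return {"", token, ""};
    }
    auto const begin{static_cast<std::size_t>(match.position(1))};
    auto const length{static_cast<std::size_t>(match.length(1))};
    return {token.substr(0, begin), token.substr(begin, length), token.substr(begin + length)};
}
```

Under these assumptions, applying the pattern .*=(.*[a-zA-Z0-9].*) to the token user=alice42 yields the static prefix user= and the variable alice42.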

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Added unit tests for schema creation with single or multiple captures.

Summary by CodeRabbit

  • New Features

    • Extract variables from a single named capture group within a field while preserving surrounding text.
    • Schema now exposes the value after '=' as a named capture for direct access.
    • Improved token processing to more reliably store numeric and dictionary variables during encoding.
  • Bug Fixes

    • Runtime validation rejects rules with more than one capture group and reports file/line context.
    • Clearer handling and messages for timestamp pattern detection.
  • Tests

    • Added tests for single-capture schemas, multi-capture rejection, and dictionary-content verification.

@davidlion davidlion requested a review from a team as a code owner August 28, 2025 15:37
@coderabbitai
Contributor

coderabbitai bot commented Aug 28, 2025

Walkthrough

Adds runtime validation in schema loading to enforce at most one regex capture per schema variable; refactors Archive to a token-centric, capture-aware processing path with a new add_token_to_dicts helper and timestamp-pattern lookup; replaces direct token type-id member access with accessor calls; and adds tests and fixtures for single- and multi-capture scenarios.

Changes

Cohort / File(s) and summary:

  • Schema validation
    File(s): components/core/src/clp/Utils.cpp
    Summary: Adds #include <string> and inserts runtime checks in both loops over schema_vars inside load_lexer_from_file to compute num_captures from rule->m_regex_ptr->get_subtree_positive_captures().size() and throw std::runtime_error (including schema path, rule line, rule name and capture count) when num_captures > 1. Leaves the timestamp rule skip and documents the single-capture limitation in a comment.
  • Archive writer — capture-aware token encoding & timestamp plumbing
    File(s): components/core/src/clp/streaming_archive/writer/Archive.cpp, components/core/src/clp/streaming_archive/writer/Archive.hpp
    Summary: Adds public (and duplicate private) declaration add_token_to_dicts(log_surgeon::LogEventView const&, log_surgeon::Token), changes write_msg_using_schema to use log_surgeon::LogEventView const& log_view, introduces add_token_to_dicts implementation to handle token kinds (newline, uncaught string, ints, floats, captures), delegates per-token work to it, switches token iteration to token views with increment_start_pos(), integrates TimestampPattern::search_known_ts_patterns and updates m_old_ts_pattern when patterns change, and adjusts includes/initializations and error messages.
  • Parser & grep accessor refactor
    File(s): components/core/src/clp/GrepCore.cpp
    Summary: Replaces direct m_type_ids_ptr->at(...) member access with get_type_ids()->at(...) accessor calls where tokens’ type IDs are read.
  • Tests — parser/schema and helpers
    File(s): components/core/tests/test-ParserWithUserSchema.cpp
    Summary: Replaces direct token member access with get_type_ids() calls; adds test helpers (run_clp_compress, path helpers), CLP-enabled I/O wrappers, tests covering single-capture success and multi-capture error, and assertions on dictionary contents after compression.
  • Test fixtures — schema examples
    File(s): components/core/tests/test_schema_files/single_capture_group.txt, components/core/tests/test_schema_files/multiple_capture_groups.txt
    Summary: Adds delimiters: \r\n and lines capture:text(?<group>\d+) (single capture) and multicapture:text(?<group0>var0)text(?<group1>var1)text (multiple captures) to exercise the new validation.
  • Config schemas update
    File(s): components/core/config/schemas.txt
    Summary: Changes equals entry from equals:.*=.*[a-zA-Z0-9].* to equals:.*=(?<var>.*[a-zA-Z0-9].*), introducing a named capture var.
  • Test logs
    File(s): components/core/tests/test_log_files/log_with_capture.txt
    Summary: Adds a single-line log fixture 2016-05-08 07:34:05.251 MyDog123 APet4123 test.txt.
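
The schema validation summarized above (reject any rule with more than one capture) might look roughly like the sketch below. It is an assumption-laden illustration: the real check counts captures via log surgeon's regex AST, whereas count_named_captures here simply counts "(?<" openers in the raw pattern text, and validate_rule and the error wording are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for log surgeon's capture lookup: counts "(?<" openers in the raw
// pattern text. It ignores escaping, so it is only suitable for illustration.
std::size_t count_named_captures(std::string const& pattern) {
    std::string const opener{"(?<"};
    std::size_t count{0};
    for (auto pos{pattern.find(opener)}; std::string::npos != pos;
         pos = pattern.find(opener, pos + opener.size()))
    {
        ++count;
    }
    return count;
}

// Hypothetical validator mirroring the described check: throw when a rule has > 1 capture.
// The message includes the schema path, rule line, rule name, and capture count.
void validate_rule(
        std::string const& schema_path,
        std::size_t line,
        std::string const& rule_name,
        std::string const& pattern
) {
    auto const num_captures{count_named_captures(pattern)};
    if (num_captures > 1) {
        throw std::runtime_error(
                schema_path + ":" + std::to_string(line) + ": rule '" + rule_name + "' has "
                + std::to_string(num_captures) + " capture groups; at most 1 is supported."
        );
    }
}
```

With the two fixture patterns listed above, text(?<group>\d+) would pass and text(?<group0>var0)text(?<group1>var1)text would throw.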

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Caller
  participant Archive as Archive::write_msg_using_schema
  participant TS as TimestampPattern
  participant TokenProc as Archive::add_token_to_dicts
  participant Dict as Dictionaries/Encoders
  Caller->>Archive: write_msg_using_schema(log_surgeon::LogEventView)
  Archive->>TS: search_known_ts_patterns(buffer, start, end)
  TS-->>Archive: pattern result / none
  loop each token (token_view)
    Archive->>TokenProc: add_token_to_dicts(log_view, token_view)
    alt token is delimiter/newline
      TokenProc-->>Dict: handle delimiter/newline
    else token has capture group
      TokenProc->>Dict: add pre-capture constant
      TokenProc->>Dict: add encoded capture substring (register lookup)
      TokenProc->>Dict: add post-capture constant
    else no capture
      TokenProc->>Dict: add whole token as variable
    end
  end
  Archive-->>Caller: complete write
sequenceDiagram
  autonumber
  participant Loader as Schema Loader
  participant Utils as Utils::load_lexer_from_file
  participant Rule as Schema Rule (regex)
  Loader->>Utils: load_lexer_from_file(schema_path)
  loop for each schema_var rule
    Utils->>Rule: rule->m_regex_ptr->get_subtree_positive_captures()
    Rule-->>Utils: capture list
    alt captures > 1
      Utils-->>Loader: throw std::runtime_error(file,line,rule_name,capture_count)
    else
      Utils-->>Loader: continue
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Pay extra attention to:
    • The two identical validation insertion points in components/core/src/clp/Utils.cpp for consistent logic and error messaging.
    • add_token_to_dicts signature/definition and all call sites for exact log_surgeon::Token / LogEventView type compatibility.
    • Timestamp pattern lookup and the m_old_ts_pattern update via const_cast and related lifecycle/const-correctness.
    • Capture substring index math, register-ID handling, and lifetime of substring views when encoded into dictionaries.
    • Tests and new fixtures: ensure expected error strings/paths and dictionary content assertions align with the runtime messages.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title clearly and specifically describes the main change: adding support for single capture groups in schema rules to match heuristic parser behaviour, which is the core objective of the PR.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbd5a12 and a7a0be1.

📒 Files selected for processing (5)
  • components/core/src/clp/Utils.cpp (2 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
  • components/core/tests/test-ParserWithUserSchema.cpp (1 hunks)
  • components/core/tests/test_schema_files/multiple_capture_groups.txt (1 hunks)
  • components/core/tests/test_schema_files/single_capture_group.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-ParserWithUserSchema.cpp
  • components/core/src/clp/Utils.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (2)
components/core/tests/test-ParserWithUserSchema.cpp (2)
components/core/src/clp/Utils.cpp (2)
  • load_lexer_from_file (125-254)
  • load_lexer_from_file (126-126)
components/core/src/clp/Utils.hpp (1)
  • load_lexer_from_file (52-55)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
  • timestamp (146-146)
  • logtype_id (229-232)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
  • timestamp (89-95)
components/core/src/clp/EncodedVariableInterpreter.hpp (3)
  • encoded_var (52-52)
  • encoded_var (111-111)
  • id (51-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: antlr-code-committed (macos-15)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp/Utils.cpp (1)

11-11: Header addition looks good.

Adding <string> resolves downstream std::string usage here.

components/core/src/clp/streaming_archive/writer/Archive.cpp (4)

362-368: Potential off-by-one in uncompressed byte count (verify semantics).

If m_end_pos is exclusive, end_pos - start_pos is correct; if inclusive, add +1. Please verify Token.m_end_pos semantics, especially across buffer wrap.

Consider a focused unit test with a two-token message exercising boundary conditions (start_pos == end_pos and wrap).
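
To make the boundary cases concrete, here is a minimal sketch of the byte-count computation under the exclusive-end interpretation. It assumes a circular buffer whose positions are always smaller than the buffer size; span_length is a hypothetical helper, not a function from CLP or log surgeon.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes in [start_pos, end_pos) within a circular buffer of
// `buffer_size` bytes. Assumes `end_pos` is exclusive and both positions are < buffer_size.
// start_pos == end_pos is treated as an empty span.
std::size_t span_length(std::size_t start_pos, std::size_t end_pos, std::size_t buffer_size) {
    if (end_pos >= start_pos) {
        return end_pos - start_pos;
    }
    // The span wraps: count the bytes up to the end of the buffer, then from its start.
    return buffer_size - start_pos + end_pos;
}
```

If the end position were inclusive instead, each result would need one more byte, which is exactly the difference the suggested unit test would expose.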


371-382: Good: token_type made const and delimiter handling left intact.

No issues spotted; aligns with existing flow.


485-496: OK: zero-initialised logtype_id and downstream writes.

This aligns with safer initialisation and existing dictionary API.


317-341: No signature mismatch found
Header and implementation both declare write_msg_using_schema(log_surgeon::LogEventView const&); no action required.

components/core/tests/test_schema_files/single_capture_group.txt (1)

1-1: Fixture is minimal and appropriate.

Covers the intended single-capture scenario with surrounding literals.

components/core/tests/test_schema_files/multiple_capture_groups.txt (1)

1-1: Good negative fixture.

Triggers the >1 capture validation path as desired.

components/core/tests/test-ParserWithUserSchema.cpp (1)

195-204: Exact error assertion is OK; keep in sync if message changes.

If you accept the optional richer error in Utils.cpp, adjust this expectation accordingly (or match a substring).

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

362-366: Guard against empty buffer when computing end_pos.

pos() − 1 will underflow if the buffer is empty. Unlikely, but add a defensive check/assert.

Would you add a precondition (e.g., assert(log_output_buffer->pos() > 0)) before using pos() − 1?
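
As an alternative to an assert, the guard could be expressed as a small helper. This sketch assumes pos() returns an unsigned size, in which case 0 - 1 wraps around to the largest representable value instead of going negative; last_written_pos is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical helper: index of the last written element given the current write position,
// or std::nullopt if nothing has been written yet (avoids unsigned wrap-around on 0 - 1).
std::optional<std::size_t> last_written_pos(std::size_t pos) {
    if (0 == pos) {
        return std::nullopt;
    }
    return pos - 1;
}
```

A caller would then have to handle the empty case explicitly rather than silently computing a huge end position.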


391-405: Fix: track var_ids for int/float dictionary fallbacks and follow negation style.

When integer/float cannot be encoded, you add a dictionary entry but don’t push the var ID into m_var_ids, which breaks segment indexing. Also, prefer “false == …” per repo style.

-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_int_var();
                 }
                 m_encoded_vars.push_back(encoded_var);
@@
-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_float_var();
                 }
                 m_encoded_vars.push_back(encoded_var);

Also applies to: 407-421

♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

431-440: Also treat empty capture-id vectors as “no capture.”

has_value() can hold an empty vector; later at(0) would throw. Handle empty as the no‑capture path.

-                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
-                if (false == capture_ids.has_value()) {
+                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
+                if (false == capture_ids.has_value() || capture_ids->empty()) {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
                     m_var_ids.push_back(id);
                     m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                     m_logtype_dict_entry.add_dictionary_var();
 
                     break;
                 }
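
The distinction between "no value" and "empty value" can be shown in isolation. The sketch below assumes the capture IDs are exposed as an optional vector of integer IDs; has_usable_capture is a hypothetical helper and the element type is a guess.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: true only if the optional holds a vector with at least one capture ID.
// An engaged optional holding an empty vector is treated the same as std::nullopt, because
// calling at(0) on it would throw std::out_of_range.
bool has_usable_capture(std::optional<std::vector<std::uint32_t>> const& capture_ids) {
    return capture_ids.has_value() && false == capture_ids->empty();
}
```

Negating this predicate gives the combined condition used in the suggested diff.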

442-454: Validate register positions before using front()/back().

get_reversed_reg_positions(...) may return empty; front()/back() would be UB. Throw a clear error if empty.

-                auto const [start_reg_id, end_reg_id]{register_ids.value()};
-                auto const capture_start{token.get_reversed_reg_positions(start_reg_id).back()};
-                auto const capture_end{token.get_reversed_reg_positions(end_reg_id).front()};
+                auto const [start_reg_id, end_reg_id]{register_ids.value()};
+                auto const& start_positions = token.get_reversed_reg_positions(start_reg_id);
+                auto const& end_positions = token.get_reversed_reg_positions(end_reg_id);
+                if (start_positions.empty() || end_positions.empty()) {
+                    throw(std::runtime_error(
+                            "Empty register positions for variable's capture group. Full token: "
+                            + token.to_string()
+                    ));
+                }
+                auto const capture_start{start_positions.back()};
+                auto const capture_end{end_positions.front()};
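
The reason for taking back() of the start positions and front() of the end positions can be sketched on plain vectors. This illustration assumes the position lists are ordered most-recent-first (so a repeated capture contributes several entries) and that positions are offsets into the token itself; capture_span is a hypothetical helper, not log surgeon's API.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: returns the text spanning every repetition of a capture.
// `reversed_starts` / `reversed_ends` hold the start / exclusive end offsets of each
// repetition, most recent first, so back() is the earliest start and front() the latest end.
std::string_view capture_span(
        std::string_view token,
        std::vector<std::size_t> const& reversed_starts,
        std::vector<std::size_t> const& reversed_ends
) {
    if (reversed_starts.empty() || reversed_ends.empty()) {
        throw std::runtime_error("Empty register positions for capture group.");
    }
    auto const start{reversed_starts.back()};
    auto const end{reversed_ends.front()};
    if (start > end || end > token.size()) {
        throw std::out_of_range("Capture positions fall outside the token.");
    }
    return token.substr(start, end - start);
}
```

For the token textvariablevariabletext with two repetitions of the capture, this returns variablevariable, matching the PR description's statement that all repetitions are stored together as a single variable.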
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between dc591cd and 03a1f02.

📒 Files selected for processing (2)
  • components/core/src/clp/Utils.cpp (2 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/Utils.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
  • timestamp (146-146)
  • logtype_id (229-232)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
  • timestamp (89-95)
components/core/src/clp/EncodedVariableInterpreter.cpp (6)
  • convert_string_to_representable_integer_var (24-61)
  • convert_string_to_representable_integer_var (24-27)
  • convert_string_to_representable_float_var (63-142)
  • convert_string_to_representable_float_var (63-66)
  • encode_var_dict_id (199-201)
  • encode_var_dict_id (199-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (3)
components/core/src/clp/Utils.cpp (1)

11-11: LGTM: include is appropriate.

Needed for the new error message construction.

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

5-10: LGTM: header additions are appropriate.

These headers match the new usage and remove transitive‑include reliance.

Also applies to: 15-15


479-491: LGTM: correct style and write path.

Style matches guideline (false == …) and the write/update path is consistent.

Contributor

@SharafMohamed SharafMohamed left a comment

Everything looks good, but it would be nice to have a small example log and a unit test to check correctness.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d52a8c3 and 834fdb7.

📒 Files selected for processing (5)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.hpp (2 hunks)
  • components/core/tests/test-ParserWithUserSchema.cpp (3 hunks)
  • components/core/tests/test_log_files/log_with_capture.txt (1 hunks)
  • components/core/tests/test_schema_files/single_capture_group.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
  • components/core/tests/test-ParserWithUserSchema.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.hpp
🧠 Learnings (19)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1033
File: components/core/config/schemas.txt:42-43
Timestamp: 2025-07-17T16:08:23.185Z
Learning: In log-surgeon, regex patterns like `(?:\d{1,2}| \d)` are not supported. Log-surgeon has its own regex implementation with different capabilities compared to standard regex engines. Patterns like `[ 0-9]{2}` that allow character classes with spaces are valid in log-surgeon schemas.
📚 Learning: 2024-10-10T05:46:35.188Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:299-307
Timestamp: 2024-10-10T05:46:35.188Z
Learning: In the C++ function `get_schema_subtree_bitmap` in `KeyValuePairLogEvent.cpp`, when a loop uses `while (true)` with an internal check on `optional.has_value()`, and comments explain that this structure is used to silence `clang-tidy` warnings about unchecked optional access, this code is acceptable and should not be refactored to use `while (optional.has_value())`.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-09-22T15:46:34.873Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1311
File: integration-tests/tests/test_identity_transformation.py:93-97
Timestamp: 2025-09-22T15:46:34.873Z
Learning: In the CLP project, multi-line formatting is preferred over single-line ternary operators when the single line would exceed the project's line length limits, prioritizing code readability and adherence to coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-18T02:31:18.595Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/src/clp/ffi/ir_stream/utils.hpp:0-0
Timestamp: 2024-10-18T02:31:18.595Z
Learning: In `components/core/src/clp/ffi/ir_stream/utils.hpp`, the function `size_dependent_encode_and_serialize_schema_tree_node_id` assumes that the caller checks that `node_id` fits within the range of `encoded_node_id_t` before casting.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-ParserWithUserSchema.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/tests/test-ParserWithUserSchema.cpp
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/core/tests/test-ParserWithUserSchema.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

Applied to files:

  • components/core/tests/test-ParserWithUserSchema.cpp
🧬 Code graph analysis (1)
components/core/tests/test-ParserWithUserSchema.cpp (4)
components/core/src/clp/Utils.cpp (2)
  • load_lexer_from_file (125-255)
  • load_lexer_from_file (126-126)
components/core/src/clp/Utils.hpp (1)
  • load_lexer_from_file (52-55)
components/core/src/clp/clp/run.hpp (1)
  • run (5-5)
components/core/src/clp/streaming_archive/reader/Archive.hpp (1)
  • path (44-44)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: rust-checks
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (14)
components/core/tests/test_log_files/log_with_capture.txt (1)

1-1: LGTM! Test data correctly exercises capture group extraction.

The log line contains timestamp and tokens that match the capture pattern in single_capture_group.txt schema, allowing validation of capture-based dictionary storage.

components/core/tests/test-ParserWithUserSchema.cpp (6)

6-28: LGTM! Includes support new test infrastructure.

The added headers provide necessary support for CLP-enabled file I/O, test utilities, and capture group handling.


38-83: LGTM! Helper functions correctly construct test infrastructure.

The run_clp_compress function properly builds the command-line arguments and delegates to clp::clp::run. Path helpers follow established patterns.


171-175: LGTM! Accessor refactoring improves encapsulation.

The change from direct pointer access to get_type_ids() accessor follows better encapsulation practices. The test context with controlled data makes this safe here.


183-193: LGTM! Validates single capture group registration.

The test correctly verifies that a schema with one capture group is loaded and the capture is properly named. Note that this tests schema loading only; end-to-end extraction is covered by the dictionary verification test below.


195-204: LGTM! Validates multiple capture group rejection.

The test correctly verifies that schemas with multiple capture groups are rejected at load time with a clear error message, enforcing the single-capture constraint.
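The single-capture constraint can be illustrated with a small standalone sketch. It does not use log-surgeon: `std::regex` stands in for the schema rule, and `validate_capture_count`, its parameters, and the error message format are hypothetical, chosen only to show a load-time check that reports file and line context.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical validator (not the CLP implementation): throws if `pattern` contains more than
// one capture group. `schema_path` and `line_num` are only used to build the error message.
void validate_capture_count(
        std::string const& pattern,
        std::string const& schema_path,
        std::size_t line_num
) {
    std::regex const rule{pattern};
    // mark_count() returns the number of capturing groups in the compiled expression.
    if (rule.mark_count() > 1) {
        throw std::runtime_error(
                schema_path + ":" + std::to_string(line_num) + ": error: rule has "
                + std::to_string(rule.mark_count()) + " capture groups; at most 1 is allowed"
        );
    }
}
```

Rejecting the rule when the schema is loaded, rather than when a token is parsed, keeps the failure close to the line that caused it.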


206-240: LGTM! Comprehensive end-to-end validation of capture extraction.

The test validates the entire capture group pipeline: compression with schema, archive creation, and dictionary content verification. It correctly expects only the captured digits ("123", "4123") in the variable dictionary and the surrounding text as static logtype components.

components/core/tests/test_schema_files/single_capture_group.txt (1)

1-3: LGTM! Schema correctly defines single capture group pattern.

The schema defines a rule with one named capture group that matches alphabetic prefixes followed by digits, suitable for testing capture-based variable extraction.

components/core/src/clp/streaming_archive/writer/Archive.hpp (2)

152-152: LGTM! Parameter rename improves consistency.

Renaming log_event_view to log_view aligns with usage patterns in the implementation and improves code consistency.


292-299: LGTM! New helper method properly encapsulates token processing.

The add_token_to_dicts method provides clear separation of concerns by handling token-to-dictionary logic independently from the main message processing flow.

components/core/src/clp/streaming_archive/writer/Archive.cpp (4)

5-9: LGTM! Includes support new token processing functionality.

The added headers provide necessary types and utilities for capture group handling and error reporting.


363-418: LGTM! Capture group decomposition logic is sound.

The implementation correctly handles three cases:

  1. No capture: stores entire token as variable
  2. Has capture but no register positions: throws descriptive error
  3. Has capture with positions: decomposes into before/capture/after using Token's position manipulation

The approach of using set_start_pos/set_end_pos delegates ring-buffer handling to the Token class, which is cleaner and less error-prone than manual substring calculations.
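The three cases above can be sketched in a standalone form. This is an illustration under stated assumptions, not the code in Archive.cpp: `std::regex` stands in for the schema rule, plain substrings stand in for the Token position manipulation, and `TokenParts` and `split_on_capture` are hypothetical names.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical result type: static text before and after the capture, plus the captured variable.
struct TokenParts {
    std::string before;
    std::string capture;
    std::string after;
};

// Splits `token` around the first capture group of `rule`.
//   1. Rule has no capture group: the whole token is the variable.
//   2. Rule has a capture group that recorded no position: throw a descriptive error.
//   3. Otherwise: return the before / capture / after segments.
TokenParts split_on_capture(std::string const& token, std::regex const& rule) {
    std::smatch match;
    if (false == std::regex_match(token, match, rule)) {
        throw std::invalid_argument("token does not match rule");
    }
    if (0 == rule.mark_count()) {
        return {"", token, ""};
    }
    if (false == match[1].matched) {
        throw std::runtime_error("capture group has no recorded positions");
    }
    auto const start = static_cast<std::size_t>(match.position(1));
    auto const length = static_cast<std::size_t>(match.length(1));
    return {token.substr(0, start), token.substr(start, length), token.substr(start + length)};
}
```

In this sketch only `capture` would go to the variable dictionary; `before` and `after` would be appended to the logtype as static text.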


422-501: LGTM! Refactored message processing correctly delegates token handling.

The refactored write_msg_using_schema properly:

  • Extracts and validates timestamps
  • Handles file/archive splitting
  • Delegates token processing to add_token_to_dicts
  • Writes encoded messages only when logtype has content

The control flow is clear and maintains existing behaviour while supporting capture group decomposition.


394-395: Type safety of the position comparisons is unverified.

The following scripts check the return type of get_reversed_reg_positions() and the types of the containers being compared:

#!/bin/bash

# Search for the get_reversed_reg_positions method definition
rg -nP -A10 'get_reversed_reg_positions\s*\(' --type=cpp | head -100

# Search for the method definition with its return type
rg -nP 'get_reversed_reg_positions' --type=cpp -B2 -A15 | head -150

# Look at the context around lines 394-395 in Archive.cpp to understand the types
rg -nP 'start_positions' components/core/src/clp/streaming_archive/writer/Archive.cpp -B5 -A5 | head -100

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

377-384: Guard against empty capture-id vector.

has_value() can return true while the underlying vector is empty; capture_ids.value().at(0) on line 386 would throw. Treat an empty vector as "no capture."

-            if (false == capture_ids.has_value()) {
+            if (false == capture_ids.has_value() || capture_ids->empty()) {
                 variable_dictionary_id_t id{};
                 m_var_dict.add_entry(token_view.to_string(), id);
                 m_var_ids.push_back(id);
                 m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                 m_logtype_dict_entry.add_dictionary_var();
                 break;
             }
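The guard suggested above can be shown in isolation. This is a minimal sketch, assuming the capture IDs arrive as an optional vector of integers; `first_capture_id` is a hypothetical helper, not part of CLP or log-surgeon.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: returns the first capture ID, or std::nullopt when there is no capture.
// An engaged optional that holds an empty vector is treated the same as a disengaged optional,
// so the caller never reaches at(0) or front() on an empty vector.
std::optional<std::uint32_t> first_capture_id(
        std::optional<std::vector<std::uint32_t>> const& capture_ids
) {
    if (false == capture_ids.has_value() || capture_ids->empty()) {
        return std::nullopt;
    }
    return capture_ids->front();
}
```

The `has_value()` check must come first: `||` short-circuits, so `capture_ids->empty()` is evaluated only when the optional is engaged.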
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 834fdb7 and 8ce06ba.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.hpp
🧠 Learnings (15)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1033
File: components/core/config/schemas.txt:42-43
Timestamp: 2025-07-17T16:08:23.185Z
Learning: In log-surgeon, regex patterns like `(?:\d{1,2}| \d)` are not supported. Log-surgeon has its own regex implementation with different capabilities compared to standard regex engines. Patterns like `[ 0-9]{2}` that allow character classes with spaces are valid in log-surgeon schemas.
📚 Learning: 2024-10-10T05:46:35.188Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:299-307
Timestamp: 2024-10-10T05:46:35.188Z
Learning: In the C++ function `get_schema_subtree_bitmap` in `KeyValuePairLogEvent.cpp`, when a loop uses `while (true)` with an internal check on `optional.has_value()`, and comments explain that this structure is used to silence `clang-tidy` warnings about unchecked optional access, this code is acceptable and should not be refactored to use `while (optional.has_value())`.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-09-22T15:46:34.873Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1311
File: integration-tests/tests/test_identity_transformation.py:93-97
Timestamp: 2025-09-22T15:46:34.873Z
Learning: In the CLP project, multi-line formatting is preferred over single-line ternary operators when the single line would exceed the project's line length limits, prioritizing code readability and adherence to coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-18T02:31:18.595Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/src/clp/ffi/ir_stream/utils.hpp:0-0
Timestamp: 2024-10-18T02:31:18.595Z
Learning: In `components/core/src/clp/ffi/ir_stream/utils.hpp`, the function `size_dependent_encode_and_serialize_schema_tree_node_id` assumes that the caller checks that `node_id` fits within the range of `encoded_node_id_t` before casting.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
components/core/src/clp/EncodedVariableInterpreter.cpp (6)
  • convert_string_to_representable_integer_var (24-61)
  • convert_string_to_representable_integer_var (24-27)
  • encode_var_dict_id (199-201)
  • encode_var_dict_id (199-199)
  • convert_string_to_representable_float_var (63-142)
  • convert_string_to_representable_float_var (63-66)
components/core/src/clp/TimestampPattern.cpp (2)
  • search_known_ts_patterns (202-219)
  • search_known_ts_patterns (202-207)
🔇 Additional comments (7)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)

152-152: LGTM!

The signature update to use log_surgeon::LogEventView const& aligns with the new token-centric processing approach.


292-299: LGTM!

The new add_token_to_dicts method is appropriately declared in the private section with clear documentation describing its purpose.

components/core/src/clp/streaming_archive/writer/Archive.cpp (5)

5-9: LGTM!

The new includes are appropriate for the added functionality: <cstdint>, <stdexcept> for error handling, log_surgeon types, and TimestampPattern.hpp for timestamp parsing.

Also applies to: 15-18, 26-27


398-420: Capture group slicing implementation looks correct.

The implementation follows the suggested approach from previous reviews: manipulating token start/end positions directly rather than doing manual ring-buffer arithmetic. The before/capture/after segments are correctly extracted by adjusting token_view's positions.


480-494: Token processing loop is well-structured.

The loop correctly:

  • Skips the timestamp token (index 0)
  • Validates type_ids before access
  • Handles delimiters appropriately
  • Delegates to add_token_to_dicts for per-token processing
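A loop with the same overall shape can be sketched as follows. Everything here is a simplified assumption for illustration: `SketchToken`, `cStaticTextTypeId`, `build_logtype`, and the `<var>` placeholder are hypothetical, and the real delimiter handling in Archive.cpp is more involved than copying static text verbatim.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed token; not log-surgeon's Token class.
struct SketchToken {
    std::string text;
    std::vector<std::uint32_t> type_ids;
};

// Assumed type ID for static (non-variable) text.
constexpr std::uint32_t cStaticTextTypeId{0};

// Builds a logtype string. Static tokens are copied verbatim; every other token is recorded in
// `variables` and replaced by a placeholder. Index 0 is assumed to hold the timestamp token.
std::string
build_logtype(std::vector<SketchToken> const& tokens, std::vector<std::string>& variables) {
    std::string logtype;
    for (std::size_t i{1}; i < tokens.size(); ++i) {
        auto const& token = tokens[i];
        // Validate before reading the first type ID.
        if (token.type_ids.empty()) {
            throw std::runtime_error("token has no type IDs");
        }
        if (cStaticTextTypeId == token.type_ids.front()) {
            logtype += token.text;
            continue;
        }
        // In the PR, this step is delegated to a per-token helper.
        variables.push_back(token.text);
        logtype += "<var>";
    }
    return logtype;
}
```

Starting the index at 1 also means an event with only a timestamp token produces an empty logtype instead of reading past the end of the token list.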

496-508: LGTM!

Correct use of false == pattern per coding guidelines, and the logtype dictionary entry is only written when non-empty.


470-474: Unverified concern: possible out-of-bounds token access.

The repository could not be cloned in the sandbox environment, so the following could not be confirmed:

  1. Whether early return guards exist that would prevent the out-of-bounds access described
  2. What schema guarantees exist regarding minimum token counts when timestamp_pattern == nullptr
  3. Whether additional bounds checks are already in place

Accessing get_token(1) when only one token exists is a valid edge case. It should be checked manually against the actual implementation by someone with repository access.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)

379-386: Guard against empty capture-id vector before accessing at(0).

The condition at line 379 checks has_value() but doesn't verify the vector is non-empty. Line 388 calls capture_ids.value().at(0), which will throw if the vector is empty. This issue was previously flagged and should treat an empty vector as "no capture."

Apply this diff:

-            if (false == capture_ids.has_value()) {
+            if (false == capture_ids.has_value() || capture_ids->empty()) {
                 variable_dictionary_id_t id{};
                 m_var_dict.add_entry(token_view.to_string(), id);
                 m_var_ids.push_back(id);
                 m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                 m_logtype_dict_entry.add_dictionary_var();
                 break;
             }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce06ba and 3b36328.

📒 Files selected for processing (1)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (19)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1033
File: components/core/config/schemas.txt:42-43
Timestamp: 2025-07-17T16:08:23.185Z
Learning: In log-surgeon, regex patterns like `(?:\d{1,2}| \d)` are not supported. Log-surgeon has its own regex implementation with different capabilities compared to standard regex engines. Patterns like `[ 0-9]{2}` that allow character classes with spaces are valid in log-surgeon schemas.
📚 Learning: 2024-10-10T05:46:35.188Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:299-307
Timestamp: 2024-10-10T05:46:35.188Z
Learning: In the C++ function `get_schema_subtree_bitmap` in `KeyValuePairLogEvent.cpp`, when a loop uses `while (true)` with an internal check on `optional.has_value()`, and comments explain that this structure is used to silence `clang-tidy` warnings about unchecked optional access, this code is acceptable and should not be refactored to use `while (optional.has_value())`.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-09-22T15:46:34.873Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1311
File: integration-tests/tests/test_identity_transformation.py:93-97
Timestamp: 2025-09-22T15:46:34.873Z
Learning: In the CLP project, multi-line formatting is preferred over single-line ternary operators when the single line would exceed the project's line length limits, prioritizing code readability and adherence to coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-18T02:31:18.595Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/src/clp/ffi/ir_stream/utils.hpp:0-0
Timestamp: 2024-10-18T02:31:18.595Z
Learning: In `components/core/src/clp/ffi/ir_stream/utils.hpp`, the function `size_dependent_encode_and_serialize_schema_tree_node_id` assumes that the caller checks that `node_id` fits within the range of `encoded_node_id_t` before casting.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-06-09T17:48:56.024Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 988
File: components/log-viewer-webui/client/src/components/QueryBox/InputWithCaseSensitive/CaseSenstiveToggle.tsx:34-37
Timestamp: 2025-06-09T17:48:56.024Z
Learning: In the y-scope/clp project, prefer `false == <expression>` rather than `!<expression>` for boolean expressions in TypeScript/JavaScript files, as specified in the coding guidelines.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
Repo: y-scope/clp PR: 1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-14T03:45:21.323Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:331-335
Timestamp: 2024-10-14T03:45:21.323Z
Learning: When reviewing code, ensure that suggestions about unnecessary use of `std::move` with `std::shared_ptr` are only made if `std::move` is actually used in the code.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/EncodedVariableInterpreter.cpp (6)
  • convert_string_to_representable_integer_var (24-61)
  • convert_string_to_representable_integer_var (24-27)
  • encode_var_dict_id (199-201)
  • encode_var_dict_id (199-199)
  • convert_string_to_representable_float_var (63-142)
  • convert_string_to_representable_float_var (63-66)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
  • timestamp (89-95)
components/core/src/clp/TimestampPattern.cpp (2)
  • search_known_ts_patterns (202-219)
  • search_known_ts_patterns (202-207)
🔇 Additional comments (4)
components/core/src/clp/streaming_archive/writer/Archive.cpp (4)

5-9: LGTM!

The new includes are appropriate for the token-centric processing path and timestamp pattern handling introduced in this PR.

Also applies to: 15-15, 18-18, 26-26


403-422: LGTM!

The token position manipulation approach cleanly handles capture splitting without manual ring-buffer arithmetic. This follows the pattern suggested in previous reviews and avoids the substring calculation issues present in earlier versions.


429-454: LGTM with note on const_cast.

The timestamp pattern search and handling logic is correct. The const_cast at line 450 was previously flagged and marked as addressed. While ideally m_old_ts_pattern would be declared as a pointer-to-const in the header, this appears to be acceptable in the current codebase context.


482-497: LGTM!

The token processing loop correctly handles delimiter insertion and delegates token processing to add_token_to_dicts. The type ID extraction here is necessary for the delimiter conditional logic, so the duplication with validation in add_token_to_dicts is acceptable.


2 participants