-
Notifications
You must be signed in to change notification settings - Fork 88
feat(clp-s): Add --sanitize-invalid-json option for opt-in JSON sanitization #1850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add automatic sanitization of JSON buffers that contain unescaped control characters (0x00-0x1F) inside string values. When simdjson returns UNESCAPED_CHARS error, the buffer is sanitized by escaping control characters to \u00XX format, and parsing is retried. - Add StringUtils::sanitize_json_buffer() that escapes control chars inside JSON strings while respecting existing escape sequences - Integrate sanitization in JsonFileIterator::read_new_json() with automatic retry after sanitization - Track sanitization byte expansion for accurate byte counting - Log warnings with detailed counts of sanitized characters - Add comprehensive unit tests for sanitization logic
WalkthroughAdds JSON and UTF‑8 sanitization: new StringUtils sanitizers replace invalid UTF‑8 and escape control characters; JsonFileIterator uses sanitizers on parse errors, updates buffer/consumption accounting and logs results; JsonParser and CLI option propagate the sanitize flag; tests and CMake updated. Changes
Sequence DiagramsequenceDiagram
participant JFI as JsonFileIterator
participant Parser as simdjson Parser
participant SU as StringUtils
participant Log as Logger
JFI->>Parser: parse(buffer)
alt parse succeeds
Parser-->>JFI: document
else parse fails (UTF8_ERROR / UNESCAPED_CHARS)
Parser-->>JFI: error
JFI->>SU: sanitize_utf8_buffer(...) or sanitize_json_buffer(...)
SU-->>JFI: BufferSanitizeResult(new_buf_occupied, char_counts)
JFI->>JFI: update buffer, buf_size, m_sanitization_bytes_added
JFI->>Log: log sanitization details (char_counts, optional path)
JFI->>Parser: parse(sanitized buffer)
Parser-->>JFI: document or error
end
JFI->>JFI: get_num_bytes_consumed() includes sanitization bytes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 58-63: The boolean negation in can_parse_json uses
'!result.error()' which violates the guideline; replace that expression with the
preferred form 'false == result.error()' so the function returns the same
boolean result while following the coding style (look for the can_parse_json
function, the local variable result and its call to error()).
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (6)
components/core/CMakeLists.txtcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonFileIterator.hppcomponents/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/Utils.hppcomponents/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/Utils.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.hppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/Utils.hpp
🧠 Learnings (21)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/CMakeLists.txtcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2025-07-01T14:49:07.333Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/src/glt/SQLitePreparedStatement.hpp:4-4
Timestamp: 2025-07-01T14:49:07.333Z
Learning: In the CLP codebase, standard headers like `<cstdint>` and `<string>` in alphabetical order (as seen in components/core/src/glt/SQLitePreparedStatement.hpp) are correctly ordered and do not need adjustment.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/Utils.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp_s/Utils.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/Utils.hpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:21:14.506Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:102-102
Timestamp: 2024-10-10T15:21:14.506Z
Learning: In `KeyValuePairLogEvent.cpp`, the `get_next_child_schema_tree_node()` method in `JsonSerializationIterator` is always called after checking `has_next_child_schema_tree_node()`, ensuring proper iterator usage.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
Repo: y-scope/clp PR: 700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Applied to files:
components/core/src/clp_s/Utils.hpp
🧬 Code graph analysis (2)
components/core/src/clp_s/Utils.cpp (1)
components/core/src/clp_s/Utils.hpp (5)
buf(114-119)i(206-210)i(206-206)c(127-138)c(127-127)
components/core/tests/test-clp_s-StringUtils.cpp (3)
components/core/src/clp_s/Utils.hpp (1)
buf(114-119)components/core/src/clp_s/Utils.cpp (2)
sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)components/core/src/clp_s/clp-s.cpp (1)
parser(111-111)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/Utils.cpp
[information] 5-5: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[style] 167-167: The function 'sanitize_json_buffer' is never used.
(unusedFunction)
[style] 177-177: The function 'write_numeric_value' is never used.
(unusedFunction)
[style] 212-212: The function 'sub_span' is never used.
(unusedFunction)
components/core/tests/test-clp_s-StringUtils.cpp
[information] 1-1: Include file
(missingIncludeSystem)
[information] 2-2: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 4-4: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 146-146: The function 'char_to_escaped_four_char_hex' is never used.
(unusedFunction)
[style] 160-160: The function 'bit_cast' is never used.
(unusedFunction)
[style] 177-177: The function 'write_numeric_value' is never used.
(unusedFunction)
[style] 212-212: The function 'sub_span' is never used.
(unusedFunction)
components/core/src/clp_s/JsonFileIterator.cpp
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: check-generated
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (20)
components/core/CMakeLists.txt (1)
707-707: LGTM!The test file is correctly added to the unit test sources list.
components/core/src/clp_s/JsonFileIterator.hpp (2)
44-48: LGTM!The byte accounting correctly includes sanitization bytes in the total bytes read.
81-81: LGTM!The new member variable is appropriately placed and initialized.
components/core/src/clp_s/JsonFileIterator.cpp (3)
6-6: LGTM!The new includes are necessary for the sanitization functionality and detailed logging.
Also applies to: 9-10
70-113: LGTM!The sanitization flow is well-implemented:
- Only triggers on UNESCAPED_CHARS error as intended
- Correctly handles potential buffer reallocation
- Accurately tracks byte expansion
- Provides detailed logging for observability
- Properly retries parsing after sanitization
199-209: LGTM!The byte accounting correctly includes sanitization bytes in the consumed count, ensuring accurate compression metrics.
components/core/src/clp_s/Utils.cpp (2)
5-5: LGTM!The
<cstring>include is necessary forstd::memcpyused in the sanitization function.
167-234: LGTM!The sanitization implementation is correct and well-considered:
- State machine properly tracks string boundaries and escape sequences
- Control character detection correctly uses unsigned cast
- Buffer reallocation safely handles growth with proper padding
- Comments appropriately document edge cases and design trade-offs
components/core/src/clp_s/Utils.hpp (3)
8-8: LGTM!The
<map>include is necessary for thesanitized_char_countsmember inJsonSanitizeResult.
67-73: LGTM!The
JsonSanitizeResultstruct is well-designed for returning sanitization results with statistics.
93-119: LGTM!The API documentation is comprehensive and clearly explains:
- The function's purpose and behavior
- Buffer ownership and reallocation semantics
- Parameter roles and modification expectations
- Return value structure
components/core/tests/test-clp_s-StringUtils.cpp (9)
1-18: LGTM!The includes are appropriate, and the comment explaining the use of C++14 string literals for embedded null bytes is helpful for maintainability.
66-94: LGTM!Good coverage of no-op scenarios including the important case that control characters outside strings are left untouched.
96-127: LGTM!Comprehensive test coverage for control character escaping, including edge cases like control chars in keys and mixed with valid escape sequences.
129-171: LGTM!Excellent coverage of escape sequence edge cases. The comments explaining the expected behavior for each case (especially the triple backslash scenario) are valuable for maintainability.
173-185: LGTM!Good end-to-end validation ensuring sanitized output is parseable by simdjson.
187-210: LGTM!Good test for buffer growth scenario with verification of escape sequence count and JSON validity.
212-218: LGTM!Good coverage for JSONL format with control characters in multiple objects.
220-259: LGTM!Important edge case testing for truncated/malformed JSON. The relaxed assertions appropriately verify graceful handling without crashes, as documented in the implementation.
261-316: LGTM!The direct buffer management here is justified since these tests need access to the full
JsonSanitizeResultobject. The test logic correctly verifies character counts and the no-sanitization case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 59-64: The function can_parse_json uses a negated call
(!result.error()) which violates the coding guideline; replace the negation with
an explicit comparison so the return becomes false == result.error() (locate
this in the can_parse_json function where result is assigned from
parser.iterate(padded)). Ensure only that return expression is changed and
semantics remain the same.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
🧠 Learnings (15)
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-06-09T17:48:56.024Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 988
File: components/log-viewer-webui/client/src/components/QueryBox/InputWithCaseSensitive/CaseSenstiveToggle.tsx:34-37
Timestamp: 2025-06-09T17:48:56.024Z
Learning: In the y-scope/clp project, prefer `false == <expression>` rather than `!<expression>` for boolean expressions in TypeScript/JavaScript files, as specified in the coding guidelines.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/JsonFileIterator.cpp (2)
components/core/src/clp_s/JsonFileIterator.hpp (2)
JsonFileIterator(27-32)JsonFileIterator(33-33)components/core/src/clp_s/Utils.cpp (2)
sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (1)
buf(114-119)components/core/src/clp_s/Utils.cpp (2)
sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/JsonFileIterator.cpp
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 22-22: The function 'get_error_code' is never used.
(unusedFunction)
components/core/tests/test-clp_s-StringUtils.cpp
[information] 1-1: Include file
(missingIncludeSystem)
[information] 2-2: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 4-4: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 146-146: The function 'char_to_escaped_four_char_hex' is never used.
(unusedFunction)
[style] 160-160: The function 'bit_cast' is never used.
(unusedFunction)
[style] 177-177: The function 'write_numeric_value' is never used.
(unusedFunction)
[style] 212-212: The function 'sub_span' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: check-generated
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp_s/JsonFileIterator.cpp (4)
6-9: LGTM!The new includes are appropriate for the sanitization and logging functionality added.
12-24: LGTM!The constructor correctly accepts and stores the path parameter using
std::move, and the default value in the header maintains backward compatibility.
72-116: LGTM!The sanitization logic correctly handles the
UNESCAPED_CHARSerror case:
- Buffer reallocation is properly handled via pass-by-reference
m_sanitization_bytes_addedis accumulated correctly- The retry parsing after sanitization is well-structured
- Any parse error on retry is properly propagated through lines 118-121
The detailed logging with per-character counts and buffer expansion will be helpful for diagnosing data quality issues.
202-212: LGTM!The byte counting correctly includes
m_sanitization_bytes_addedin both code paths, ensuring accurate reporting of consumed bytes that reflect the sanitized content size.components/core/tests/test-clp_s-StringUtils.cpp (4)
1-18: LGTM!The includes are appropriate, and the explanatory comment about C++14 string literals is helpful for understanding why
std::string_literalsis used for embedded null bytes in tests.
67-95: LGTM!Good coverage of passthrough cases, including the important test that control characters outside JSON strings are left unchanged (letting the parser report the error).
97-181: LGTM!Excellent coverage of control character escaping and escape sequence handling. The edge cases for escaped backslashes and triple backslashes are particularly valuable for ensuring the sanitizer correctly tracks string state.
183-331: LGTM!Comprehensive test coverage including:
- Valid JSON output verification
- Buffer growth handling (100 null bytes → 600 bytes)
- JSONL multi-document format
- Graceful handling of truncated/malformed JSON
- Correct character count reporting
The tests thoroughly exercise the sanitization function and document expected behaviour for edge cases.
…ation Add a new CLI flag that enables sanitization of invalid JSON during ingestion. When enabled, the following issues are automatically fixed: - Invalid UTF-8 sequences: replaced with U+FFFD replacement character - Unescaped control characters (0x00-0x1F): escaped to \u00XX format - Invalid surrogate escapes: handled via simdjson's allow_replacement When disabled (default), parsing fails fast on any of these issues. Changes: - Add --sanitize-invalid-json flag to CommandLineArguments - Add sanitize_utf8_buffer() to Utils for UTF-8 validation/replacement - Update JsonFileIterator to conditionally sanitize on UTF8_ERROR and UNESCAPED_CHARS errors during both initial parsing and iteration - Update JsonParser to pass flag through and use conditional get_string()/unescaped_key() for surrogate escape handling
Add comprehensive tests for the new UTF-8 sanitization function covering: - Valid UTF-8 passthrough (ASCII, multibyte chars, JSON with UTF-8) - Invalid byte replacement (0xFF, 0xFE, continuation bytes without leader) - Truncated sequences (2/3/4-byte sequences cut short) - Overlong encodings (0xC0, 0xC1 lead bytes) - Surrogate code points (U+D800-U+DFFF) - Code points above U+10FFFF - Mixed valid/invalid UTF-8 - Correct replacement character counts - Buffer growth for many invalid bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @components/core/src/clp_s/JsonFileIterator.cpp:
- Around line 89-104: The condition uses a negation operator: change the check
in the JsonFileIterator.cpp block that currently reads
"!result.sanitized_char_counts.empty()" to use the preferred style "false ==
result.sanitized_char_counts.empty()"; keep the surrounding logic and logging
(the SPDLOG_WARN call and uses of total_replaced, m_path, m_buf_occupied,
old_buf_occupied) unchanged so only the conditional expression is replaced to
conform to the coding guideline.
- Around line 301-316: Replace the negated check using '!' in the
JsonFileIterator.cpp condition with the preferred explicit comparison form:
change the if condition from using '!result.sanitized_char_counts.empty()' to
'false == result.sanitized_char_counts.empty()' so the block that computes
total_replaced and calls SPDLOG_WARN (referencing total_replaced, m_path,
m_buf_occupied, old_buf_occupied) is executed under the same logic but follows
the coding guideline.
- Around line 129-150: The condition in the loop uses the negation operator
(!char_details.empty()) which violates the coding guideline; replace it with the
equality form (false == char_details.empty()) so the check becomes explicit.
Update the for-loop in JsonFileIterator.cpp where char_details is built (refer
to the variable char_details and the loop iterating
result.sanitized_char_counts) to use false == char_details.empty() before
appending the separator, leaving the rest of the logic and logging (SPDLOG_WARN
call) unchanged.
- Around line 231-255: Summary: Replace negated boolean expressions with the
coding-guideline-preferred form. In the JsonFileIterator.cpp block handling
result.sanitized_char_counts, change the two conditions "if
(!result.sanitized_char_counts.empty())" and "if (!char_details.empty())" to use
the preferred equality style (e.g., "if (false ==
result.sanitized_char_counts.empty())" and "if (false ==
char_details.empty())"), keeping all surrounding logic (fmt::format,
SPDLOG_WARN, variable names like result.sanitized_char_counts, char_details,
m_path, m_buf_occupied, old_buf_occupied) unchanged.
- Around line 74-159: The duplicated sanitization + logging blocks in
read_new_json and get_json should be extracted into two helper methods (e.g.,
sanitize_and_log_utf8(...) and sanitize_and_log_json(...)) that encapsulate:
calling StringUtils::sanitize_utf8_buffer / sanitize_json_buffer, updating
m_buf_occupied and m_sanitization_bytes_added, preserving m_buf_size
reallocation behavior, building and emitting the same SPDLOG_WARN messages, and
retrying parsing via m_parser.iterate_many(...) (returning or setting the
resulting error/stream). Replace the repeated blocks in read_new_json and
get_json with calls to these helpers so the callers simply invoke
sanitize_and_log_utf8() or sanitize_and_log_json() and use the returned
error/stream result, keeping all existing variable names (m_buf, m_buf_size,
m_buf_occupied, m_sanitization_bytes_added, m_parser, m_stream, m_path)
unchanged.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (10)
components/core/src/clp_s/CommandLineArguments.cppcomponents/core/src/clp_s/CommandLineArguments.hppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonFileIterator.hppcomponents/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/Utils.hppcomponents/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/CommandLineArguments.hppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonFileIterator.hppcomponents/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/Utils.hppcomponents/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/CommandLineArguments.cpp
🧠 Learnings (21)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
Applied to files:
components/core/src/clp_s/CommandLineArguments.hpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonFileIterator.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/CommandLineArguments.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/Utils.hpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cppcomponents/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:149-151
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Exceptions are disabled when calling `nlohmann::json::parse` by passing `nullptr` and `false` as the second and third arguments.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonFileIterator.hpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
Repo: y-scope/clp PR: 700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Applied to files:
components/core/src/clp_s/Utils.hpp
📚 Learning: 2024-11-12T18:46:20.933Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 584
File: components/core/src/clp_s/SchemaTree.hpp:171-171
Timestamp: 2024-11-12T18:46:20.933Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, it's acceptable to use `std::string_view` as keys in `m_node_map` because `SchemaNode`'s `m_key_name` remains valid even after move operations or reallocations, preventing dangling references.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-09T15:25:14.043Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:474-480
Timestamp: 2024-12-09T15:25:14.043Z
Learning: In `components/core/src/clp/ffi/KeyValuePairLogEvent.cpp`, node IDs are validated before accessing `child_schema_tree_node` in the function `serialize_node_id_value_pairs_to_json`, ensuring `get_node` does not throw exceptions due to invalid node IDs.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-11-12T18:47:03.828Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 584
File: components/core/src/clp_s/SchemaTree.hpp:91-94
Timestamp: 2024-11-12T18:47:03.828Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, the `SchemaNode` class uses `std::unique_ptr<char[]> m_key_buf` and `std::string_view m_key_name` to ensure that references to `m_key_name` remain valid even after `SchemaNode` is move-constructed.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (4)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (4)
buf(114-119)buf(137-142)i(229-233)i(229-229)components/core/src/clp_s/Utils.cpp (4)
sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)sanitize_utf8_buffer(328-376)sanitize_utf8_buffer(328-333)
components/core/src/clp_s/Utils.cpp (1)
components/core/src/clp_s/Utils.hpp (6)
buf(114-119)buf(137-142)i(229-233)i(229-229)c(150-161)c(150-150)
components/core/src/clp_s/JsonParser.cpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
m_sanitize_invalid_json(116-116)
components/core/src/clp_s/CommandLineArguments.cpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
m_sanitize_invalid_json(116-116)
🪛 Cppcheck (2.19.0)
components/core/tests/test-clp_s-StringUtils.cpp
[information] 1-1: Include file
(missingIncludeSystem)
[information] 2-2: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 4-4: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.
(unusedFunction)
[style] 183-183: The function 'bit_cast' is never used.
(unusedFunction)
[style] 200-200: The function 'write_numeric_value' is never used.
(unusedFunction)
[style] 235-235: The function 'sub_span' is never used.
(unusedFunction)
components/core/src/clp_s/JsonFileIterator.cpp
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 22-22: The function 'get_error_code' is never used.
(unusedFunction)
[style] 24-24: The function 'get_filename' is never used.
(unusedFunction)
[style] 142-142: The function 'read_numeric_value' is never used.
(unusedFunction)
[style] 235-235: The function 'sub_span' is never used.
(unusedFunction)
components/core/src/clp_s/Utils.cpp
[information] 5-5: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[style] 167-167: The function 'sanitize_json_buffer' is never used.
(unusedFunction)
[style] 328-328: The function 'sanitize_utf8_buffer' is never used.
(unusedFunction)
[style] 183-183: The function 'bit_cast' is never used.
(unusedFunction)
[style] 200-200: The function 'write_numeric_value' is never used.
(unusedFunction)
[style] 235-235: The function 'sub_span' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: check-generated
🔇 Additional comments (23)
components/core/src/clp_s/CommandLineArguments.cpp (1)
259-264: LGTM! Clear option description.The new
--sanitize-invalid-jsonoption is properly integrated with a comprehensive description of the sanitization behavior. The placement in the Compression options block is appropriate, and the use ofpo::bool_switchfollows the established pattern.components/core/src/clp_s/CommandLineArguments.hpp (1)
116-116: LGTM! Proper member initialization.The new getter and member variable follow established patterns. Inline initialization to
falseensures safe default behavior and aligns with modern C++ best practices.Also applies to: 207-207
components/core/src/clp_s/clp-s.cpp (1)
110-110: LGTM! Consistent initialization.The initialization of
option.sanitize_invalid_jsonfollows the established pattern and correctly propagates the CLI flag into the parser configuration.components/core/src/clp_s/JsonParser.hpp (1)
43-43: LGTM! Properly initialized members.Both the option field and class member are correctly initialized inline to
false, ensuring safe default behavior (strict mode) and following modern C++ best practices.Also applies to: 243-243
components/core/src/clp_s/JsonParser.cpp (3)
147-147: LGTM! Proper member initialization in constructor.The initialization of
m_sanitize_invalid_jsonfrom the option follows the established pattern in the constructor's initializer list.
672-673: All JsonFileIterator constructor call sites have been properly updated.There is one instantiation site in the codebase (JsonParser.cpp:672-673) that uses the updated constructor signature with both new parameters (
sanitize_invalid_jsonandpath.path) in the correct order.
227-227: Cannot verify the core claim without evidence of the previous code implementation.This review comment's assertion that "hardcoded
truewas replaced withm_sanitize_invalid_json" cannot be confirmed from the current codebase. While the current code at lines 227, 306, 411, 456, and 559 uses them_sanitize_invalid_jsonvariable (which defaults tofalsefor strict parsing), there is no visible evidence of the previous implementation in the repository.Additionally, the command-line interface indicates the feature is opt-in (
--sanitize-invalid-jsonflag with defaultfalse), meaning users must explicitly enable permissive mode. No migration guides, documentation updates, major version bumps, or changelog entries were found to suggest this is a breaking change requiring user notification.components/core/src/clp_s/JsonFileIterator.hpp (3)
22-34: LGTM: Constructor signature update for sanitization support.The new
sanitize_invalid_jsonandpathparameters are well-documented and appropriately positioned with a sensible default forpath. The API change enables opt-in sanitization behaviour.
50-54: LGTM: Byte counting now includes sanitization overhead.Returning
m_bytes_read + m_sanitization_bytes_addedcorrectly accounts for the fact that sanitized content (which is what gets compressed) may be larger than the original file bytes.
84-101: Remove this comment. The concern is based on a misunderstanding of C++ member initialization semantics. In C++, member variables are initialized in their declaration order—not in the order they appear in the initializer list—regardless of how the initializer list is written. The declaration order in the header file is correct (m_buf_size, m_max_document_size, m_buf, m_reader, m_path, m_sanitize_invalid_json), and there is no initialization order bug.Likely an incorrect or invalid review comment.
components/core/src/clp_s/Utils.hpp (2)
67-73: LGTM: Well-designed result struct for sanitization metadata.The
JsonSanitizeResultstruct cleanly encapsulates both the new buffer size and per-character sanitization counts, enabling detailed logging without complicating the function signature.
93-142: LGTM: Clear and comprehensive API documentation.The documentation properly explains:
- Buffer ownership and reallocation semantics
- The caller's responsibility for deleting the buffer
- The simdjson padding requirement
- Return value semantics
This makes the API safe to use correctly.
components/core/src/clp_s/Utils.cpp (3)
167-234: LGTM: Correct JSON control character sanitization logic.The implementation correctly:
- Tracks in-string state with proper escape sequence handling
- Only sanitizes control characters inside JSON strings
- Uses unicode escape format (
\u00XX) for consistency- Handles buffer reallocation when needed
- Returns early when no changes are made
236-326: LGTM: Comprehensive UTF-8 validation.The
validate_utf8_sequencehelper correctly validates:
- ASCII bytes
- Multi-byte sequence structure and continuation bytes
- Overlong encodings (C0, C1 lead bytes; E0 with byte2 < 0xA0; F0 with byte2 < 0x90)
- Surrogate code points (ED with byte2 >= 0xA0)
- Code points above U+10FFFF (F4 with byte2 > 0x8F)
The maximal subpart replacement strategy (advancing one byte on error) aligns with Unicode best practices.
328-376: LGTM: UTF-8 sanitization implementation is correct.The use of
0xFFas a sentinel key for counting invalid sequences is clever since 0xFF is never a valid UTF-8 byte. The implementation follows the same buffer management pattern assanitize_json_buffer.components/core/src/clp_s/JsonFileIterator.cpp (2)
12-24: LGTM: Constructor initialization order matches declaration order.The member initialization list correctly initializes members in their declaration order, preventing undefined behaviour.
339-349: LGTM: Byte counting correctly includes sanitization overhead.The updated
get_num_bytes_consumedproperly accounts for bytes added during sanitization, ensuring accurate byte tracking for both partial and complete buffer consumption scenarios.components/core/tests/test-clp_s-StringUtils.cpp (6)
20-65: LGTM: Well-designed test helpers with proper resource management.The
sanitize_stringhelper correctly handles both the reallocation and non-reallocation cases with appropriate exception safety. The try-catch ensures the buffer is always deleted.
67-95: LGTM: Good coverage of no-op cases.Testing that valid JSON passes through unchanged ensures the sanitizer doesn't inadvertently modify correct input.
133-181: LGTM: Thorough escape sequence edge case testing.The tests correctly verify:
- Escaped quotes don't toggle string state
- Escaped backslashes are handled correctly
- Control characters after escape sequences are properly identified
- Invalid escape targets (like
\<control-char>) are left as-isThis is critical for correctness.
281-324: LGTM: UTF-8 helper functions are correct.The
count_replacement_charsfunction correctly identifies U+FFFD sequences (0xEF 0xBF 0xBD) and properly skips past them after counting.
351-429: LGTM: Comprehensive UTF-8 invalid sequence testing.Excellent coverage of edge cases including:
- Invalid lead bytes (0xFF, 0xFE)
- Orphan continuation bytes
- Truncated multi-byte sequences
- Overlong encodings
- Surrogate code points
- Code points above U+10FFFF
These tests align well with the validation logic in
validate_utf8_sequence.
507-522: LGTM: Buffer growth test verifies correct expansion.Testing with 100 invalid bytes expanding to 300 bytes (each 1-byte invalid sequence becomes 3-byte U+FFFD) correctly exercises the buffer reallocation path.
Add guards in get_json() to exit with error if sanitization functions don't modify the buffer. This prevents infinite loops in edge cases where simdjson reports UNESCAPED_CHARS or UTF8_ERROR but the sanitization functions don't find anything to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @components/core/src/clp_s/JsonFileIterator.cpp:
- Around line 78-112: Replace the negated empty check using
!result.sanitized_char_counts.empty() with the guideline-preferred form false ==
result.sanitized_char_counts.empty(); locate this check inside the UTF-8
sanitization branch in JsonFileIterator.cpp (the block that handles
simdjson::error_code::UTF8_ERROR and uses result.sanitized_char_counts) and
change the if condition accordingly to match the project's consistency rule.
- Around line 294-339: Replace the negation using '!' with the
guideline-preferred equality comparison: change the conditional that checks
result.sanitized_char_counts.empty() (currently written as if
(!result.sanitized_char_counts.empty())) to use the form if (false ==
result.sanitized_char_counts.empty()) so the check in the JsonFileIterator.cpp
sanitization branch (involving result.sanitized_char_counts) follows the
project's coding style.
- Around line 115-158: Replace the use of negation with the guideline-preferred
comparison for the string emptiness check: change the condition in the loop that
builds char_details (the if that currently reads using !char_details.empty()) to
use the explicit comparison form (false == char_details.empty()) so the check in
the char_details accumulation block follows the project's coding style; update
only that condition in the char_details construction inside the JSON
sanitization branch of JsonFileIterator.cpp (the block that iterates
result.sanitized_char_counts).
- Around line 215-272: Replace the two uses of the negation operator with
explicit equality-to-false comparisons to match the coding guideline: change the
condition checking result.sanitized_char_counts.empty() (currently written as if
(!result.sanitized_char_counts.empty())) to if (false ==
result.sanitized_char_counts.empty()), and change the condition checking
char_details.empty() (currently if (!char_details.empty())) to if (false ==
char_details.empty()) in the JsonFileIterator logic that handles sanitization so
behavior remains identical but style conforms.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/JsonFileIterator.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/JsonFileIterator.cpp
🧠 Learnings (17)
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-06-09T17:48:56.024Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 988
File: components/log-viewer-webui/client/src/components/QueryBox/InputWithCaseSensitive/CaseSenstiveToggle.tsx:34-37
Timestamp: 2025-06-09T17:48:56.024Z
Learning: In the y-scope/clp project, prefer `false == <expression>` rather than `!<expression>` for boolean expressions in TypeScript/JavaScript files, as specified in the coding guidelines.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T05:46:35.188Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:299-307
Timestamp: 2024-10-10T05:46:35.188Z
Learning: In the C++ function `get_schema_subtree_bitmap` in `KeyValuePairLogEvent.cpp`, when a loop uses `while (true)` with an internal check on `optional.has_value()`, and comments explain that this structure is used to silence `clang-tidy` warnings about unchecked optional access, this code is acceptable and should not be refactored to use `while (optional.has_value())`.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/JsonFileIterator.cpp (2)
components/core/src/clp_s/JsonFileIterator.hpp (2)
JsonFileIterator(28-34)JsonFileIterator(35-35)components/core/src/clp_s/Utils.cpp (4)
sanitize_utf8_buffer(328-376)sanitize_utf8_buffer(328-333)sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/JsonFileIterator.cpp
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 22-22: The function 'get_error_code' is never used.
(unusedFunction)
[style] 24-24: The function 'get_filename' is never used.
(unusedFunction)
[style] 142-142: The function 'read_numeric_value' is never used.
(unusedFunction)
[style] 235-235: The function 'sub_span' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: check-generated
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
🔇 Additional comments (5)
components/core/src/clp_s/JsonFileIterator.cpp (5)
6-6: LGTM: Includes added for new sanitization functionality.The fmt/format.h include supports the detailed logging, and Utils.hpp provides the sanitization functions used throughout the file.
Also applies to: 9-9
12-26: LGTM: Constructor properly initialized for sanitization support.The new parameters (sanitize_invalid_json flag and path) are correctly initialized, and std::move appropriately avoids copying the path string.
255-260: LGTM: Proper guard against infinite loop.The check for empty
sanitized_char_countscorrectly prevents infinite loops by detecting when sanitization makes no changes despite simdjson reporting an error.
321-326: LGTM: Proper guard against infinite loop.The check correctly prevents infinite loops by detecting when UTF-8 sanitization makes no changes despite encountering a UTF8_ERROR.
349-359: LGTM: Byte counting correctly includes sanitization bytes.The logic properly accounts for bytes added during sanitization in both code paths, ensuring accurate tracking of the data that will be compressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 448-505: Extract a small test helper that centralizes buffer
allocation/copying, calls StringUtils::sanitize_utf8_buffer, builds the output
string from buf using result.new_buf_occupied, frees the buffer, and returns
both the JsonSanitizeResult and the output string (e.g., a
SanitizeResultWithOutput struct and a sanitize_utf8_with_result function).
Replace the repeated allocate/call/cleanup blocks in the TEST_CASE sections
(including the other repeated region around lines 528-583) with calls to this
helper and use its returned result and output for assertions to remove
duplication.
- Around line 30-52: Replace manual new/delete in sanitize_string with RAII:
allocate the buffer with std::unique_ptr<char[]> (e.g., unique_ptr<char[]>
buf(new char[buf_size + simdjson::SIMDJSON_PADDING]) and use buf.get() when
calling StringUtils::sanitize_json_buffer), and ensure you handle the
sanitizer's reallocation semantics by detecting if the returned buffer pointer
differs and resetting the unique_ptr to own the new pointer (or release before
the call and reset after). This preserves automatic cleanup on exceptions while
accommodating sanitize_json_buffer possibly deleting/replacing the buffer
(update references to buf, buf_size, and buf_occupied accordingly).
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/tests/test-clp_s-StringUtils.cpp
🧠 Learnings (11)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:149-151
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Exceptions are disabled when calling `nlohmann::json::parse` by passing `nullptr` and `false` as the second and third arguments.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
🧬 Code graph analysis (1)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (4)
buf(114-119)buf(137-142)i(229-233)i(229-229)components/core/src/clp_s/Utils.cpp (4)
sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)sanitize_utf8_buffer(328-376)sanitize_utf8_buffer(328-333)
🪛 Cppcheck (2.19.0)
components/core/tests/test-clp_s-StringUtils.cpp
[information] 1-1: Include file
(missingIncludeSystem)
[information] 2-2: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 4-4: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.
(unusedFunction)
[style] 183-183: The function 'bit_cast' is never used.
(unusedFunction)
[style] 200-200: The function 'write_numeric_value' is never used.
(unusedFunction)
[style] 235-235: The function 'sub_span' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: check-generated
- GitHub Check: build (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
components/core/tests/test-clp_s-StringUtils.cpp (7)
1-18: LGTM! Well-documented use of string literals.The comment explaining the use of C++14 string literals and string concatenation for embedded null bytes is helpful for future maintainers.
59-64: LGTM!Correctly uses
false == result.error()following the coding guideline preference for this pattern over!result.error().
67-95: LGTM! Comprehensive no-change scenarios.Good coverage of edge cases including escaped characters, unicode escapes, multiple objects, and control chars outside strings.
133-181: LGTM! Thorough escape sequence handling tests.Excellent coverage of tricky escape sequence scenarios including escaped quotes, escaped backslashes, and their interactions with control characters.
230-275: LGTM! Well-documented edge case testing.The comments clearly explain the expected behaviour for truncated/malformed JSON, and the assertions correctly verify graceful handling without crashes.
312-323: LGTM! Correct loop bounds for counting replacement characters.The condition
i + 2 < s.size()correctly ensures valid access tos[i],s[i+1], ands[i+2]. Thei += 2inside the match block, combined with the loop's++i, properly advances past each 3-byte replacement sequence.
351-429: LGTM! Comprehensive UTF-8 validation test coverage.Excellent coverage of UTF-8 edge cases including:
- Invalid standalone bytes (0xFF, 0xFE)
- Orphan continuation bytes (0x80-0xBF)
- Truncated multi-byte sequences (2, 3, and 4-byte)
- Overlong encodings
- Surrogate code points
- Code points above U+10FFFF
This aligns well with the maximal subpart replacement strategy documented in the implementation.
Extract sanitize_and_log_utf8() and sanitize_and_log_json() helper methods to reduce code duplication. The sanitization-and-log pattern was repeated 4 times with slight variations. Both helpers return true if sanitization made changes, allowing callers to detect infinite loop conditions in get_json().
- Add SanitizeResultWithOutput struct for returning both result and output - Add sanitize_json_with_result() and sanitize_utf8_with_result() helpers - Simplify sanitize_string() and sanitize_utf8_string() to use helpers - Replace manual buffer management in count tests with helper calls This reduces code duplication in test files and improves maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @components/core/src/clp_s/JsonFileIterator.cpp:
- Around line 182-191: The SPDLOG_WARN call in JsonFileIterator.cpp that logs
the buffer expansion uses unsigned subtraction for the byte-difference
(m_buf_occupied - old_buf_occupied); change this to use consistent signed
arithmetic (e.g., compute the difference as a signed type such as
ssize_t/ptrdiff_t or cast both m_buf_occupied and old_buf_occupied to a signed
integral) so the logged "Buffer expanded by" value matches the signed form used
in sanitize_and_log_utf8(), keeping variables like total_sanitized, char_details
and m_path unchanged.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/JsonFileIterator.cppcomponents/core/src/clp_s/JsonFileIterator.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/JsonFileIterator.hppcomponents/core/src/clp_s/JsonFileIterator.cpp
🧠 Learnings (1)
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp_s/JsonFileIterator.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/JsonFileIterator.cpp (2)
components/core/src/clp_s/JsonFileIterator.hpp (2)
JsonFileIterator(28-34)JsonFileIterator(35-35)components/core/src/clp_s/Utils.cpp (4)
sanitize_utf8_buffer(328-376)sanitize_utf8_buffer(328-333)sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/JsonFileIterator.cpp
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 22-22: The function 'get_error_code' is never used.
(unusedFunction)
[style] 24-24: The function 'get_filename' is never used.
(unusedFunction)
[style] 142-142: The function 'read_numeric_value' is never used.
(unusedFunction)
[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.
(unusedFunction)
[style] 183-183: The function 'bit_cast' is never used.
(unusedFunction)
[style] 235-235: The function 'sub_span' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: check-generated
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: build (macos-15)
🔇 Additional comments (9)
components/core/src/clp_s/JsonFileIterator.hpp (3)
28-34: LGTM! Well-designed constructor extension.The new parameters are logically ordered, with the required
sanitize_invalid_jsonflag before the optionalpathparameter. The documentation clearly explains the purpose of each new parameter.
50-54: Clear documentation and implementation for byte tracking.The inline implementation correctly adds
m_sanitization_bytes_addedto the total, ensuring that callers get an accurate count of bytes that will be compressed (including any bytes added during sanitization).
84-96: Well-documented helper method declarations.Both methods are appropriately marked
[[nodiscard]]since the return value indicates whether sanitization made changes, which is critical for preventing infinite loops in the calling code.components/core/src/clp_s/JsonFileIterator.cpp (6)
12-26: LGTM! Clean constructor initialization.The member initialization order follows the declaration order in the header, and
std::move(path)is correctly used to avoid unnecessary copies.
74-92: Good sequential sanitization approach with clear documentation.The sequential handling allows both UTF-8 and control character issues to be fixed in a single pass through
read_new_json(). The explicitstatic_cast<void>()with explanatory comments clearly communicates the intentional discard of return values, and any persistent errors are correctly caught at line 94.
125-154: Solid implementation with proper byte tracking.The use of signed arithmetic at line 149 (
static_cast<int64_t>) is appropriate since U+FFFD replacements could theoretically shrink or expand the buffer. The early return on emptysanitized_char_countscorrectly indicates no changes were made, which is critical for infinite loop prevention in the caller.
218-240: Excellent defensive programming for infinite loop prevention.The check at line 223 ensures that if
sanitize_and_log_json()makes no changes, the original error is reported rather than endlessly retrying. The complete re-initialization of the document stream (lines 231-239) correctly resets all tracking state before restarting iteration.
260-282: Consistent handling of UTF-8 sanitization with proper loop prevention.The pattern mirrors the UNESCAPED_CHARS handling above, maintaining code consistency. The heuristic for distinguishing mid-stream UTF-8 errors from truncation errors is well-documented, and the sanitization path correctly prevents infinite loops.
292-302: Correct byte accounting in both code paths.The implementation properly includes
m_sanitization_bytes_addedin both cases: when documents remain in the buffer (line 298-299) and when all documents are consumed (line 301). The comment at lines 295-296 clearly explains the rationale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 333-344: The skip logic in count_replacement_chars is implicit (it
uses i += 2 inside a for-loop and relies on the loop's ++i to advance 3 bytes);
replace the for-loop with an explicit while-loop or adjust the match branch to
explicitly advance by 3 (e.g., use while (i + 2 < s.size()) and on match do i +=
3) so the byte-skip is obvious; update the function count_replacement_chars
accordingly and keep the same matching checks for 0xEF, 0xBF, 0xBD.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/tests/test-clp_s-StringUtils.cpp
🧠 Learnings (14)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:149-151
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Exceptions are disabled when calling `nlohmann::json::parse` by passing `nullptr` and `false` as the second and third arguments.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/tests/test-clp_s-StringUtils.cpp
🧬 Code graph analysis (1)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (4)
buf(114-119)buf(137-142)i(229-233)i(229-229)components/core/src/clp_s/Utils.cpp (4)
sanitize_json_buffer(167-234)sanitize_json_buffer(167-172)sanitize_utf8_buffer(328-376)sanitize_utf8_buffer(328-333)
🪛 Cppcheck (2.19.0)
components/core/tests/test-clp_s-StringUtils.cpp
[information] 1-1: Include file
(missingIncludeSystem)
[information] 2-2: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[information] 4-4: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 4-4: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingIncludeSystem)
[information] 6-6: Include file
(missingIncludeSystem)
[information] 7-7: Include file
(missingIncludeSystem)
[information] 8-8: Include file
(missingIncludeSystem)
[information] 9-9: Include file
(missingIncludeSystem)
[information] 10-10: Include file
(missingIncludeSystem)
[information] 11-11: Include file
(missingIncludeSystem)
[information] 12-12: Include file
(missingIncludeSystem)
[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.
(unusedFunction)
[style] 183-183: The function 'bit_cast' is never used.
(unusedFunction)
[style] 200-200: The function 'write_numeric_value' is never used.
(unusedFunction)
[style] 235-235: The function 'sub_span' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
🔇 Additional comments (6)
components/core/tests/test-clp_s-StringUtils.cpp (6)
1-19: LGTM!The includes are appropriate, and the explanatory comment about C++14 string literals and concatenation for embedded null bytes is helpful for understanding the test input construction.
21-77: LGTM!The helper functions are well-documented. The raw pointer usage is appropriately justified in the comments, and line 75 correctly follows the coding guideline by using
false == result.error()instead of the negation operator. As per coding guidelines, this is the preferred pattern.
79-287: LGTM!Comprehensive test coverage for JSON sanitization, including edge cases for escape sequence handling, buffer growth, and truncated JSON. The tests appropriately document expected behaviour for malformed input scenarios.
347-370: LGTM!Good coverage of valid UTF-8 cases with helpful comments documenting the byte sequences for each Unicode character.
372-509: LGTM!Excellent coverage of UTF-8 edge cases including truncated sequences, overlong encodings, surrogate code points, and code points above U+10FFFF. These tests are important for ensuring correct handling of potentially malicious or malformed input.
515-536: LGTM!Good verification of the character count tracking functionality, covering both the case with multiple different control characters and the no-op case.
Code quality improvements: - Rename JsonSanitizeResult to BufferSanitizeResult for clarity - Rename helper methods: sanitize_and_log_utf8 -> sanitize_invalid_utf8_and_log, sanitize_and_log_json -> sanitize_control_chars_and_log - Fix typo: "delimeters" -> "delimiters" - Fix docs: "Pointer to pointer" -> "Reference to pointer" - Add @note about buffer reallocation to both sanitization helpers - Add null pointer checks to sanitization functions - Extract reinitialize_document_stream() helper to reduce duplication - Add named constant cUtf8ReplacementChar for U+FFFD magic string - Remove unused variable in char_to_hex() - Use consistent signed arithmetic in log messages Efficiency improvements: - Use lazy-copy approach: only allocate when first issue is found - Copy valid ranges in bulk using append(ptr, len) instead of byte-by-byte - Zero allocation in the common case when no sanitization is needed
…replacement_chars
Description
Add a new
--sanitize-invalid-jsonoption that provides consistent, opt-in sanitization of invalid JSON during ingestion.Previous Behavior (Inconsistent)
UTF8_ERRORUNESCAPED_CHARS\uD800)The inconsistency arose because
get_string(true)andunescaped_key(true)were hardcoded, which silently handled invalid surrogate escapes while other malformed JSON caused hard failures.New Behavior (Consistent)
--sanitize-invalid-json=falseor not specified (default - strict mode):UTF8_ERRORUNESCAPED_CHARS--sanitize-invalid-json=true(permissive mode):\u00XX, warning loggedMotivation
In rare circumstances, we have observed malformed JSON in log data:
0x00) or other control characters from C-string handling bugsThis change provides users explicit control over how to handle malformed JSON:
--sanitize-invalid-json=false--sanitize-invalid-json=trueChanges
--sanitize-invalid-jsonoption to CommandLineArguments (default: false)StringUtils::sanitize_utf8_buffer()for UTF-8 validation and replacement with U+FFFDJsonFileIteratorto conditionally sanitize onUTF8_ERRORandUNESCAPED_CHARSerrors during both initial parsing and document iterationJsonParserto use conditionalget_string(m_sanitize_invalid_json)andunescaped_key(m_sanitize_invalid_json)instead of hardcodedtrueCode Quality & Efficiency Improvements
Recent refactoring improved the implementation:
Code quality:
JsonSanitizeResulttoBufferSanitizeResultfor claritysanitize_invalid_utf8_and_log,sanitize_control_chars_and_log,reinitialize_document_stream)Efficiency:
Checklist
Validation performed
clp-core-dependencies-aarch64-manylinux_2_28:dev)UNESCAPED_CHARS✓--sanitize-invalid-json=true: succeeds, logs warning about escaped characters ✓UTF8_ERROR✓--sanitize-invalid-json=true: succeeds, logs warning about replaced sequences ✓--helpshows new option with descriptionUnit Tests
Comprehensive test coverage for both
sanitize_json_buffer()andsanitize_utf8_buffer():Summary by CodeRabbit
Release Notes
New Features
--sanitize-invalid-jsoncommand-line option to handle invalid encoding during compression. Escapes control characters in JSON and replaces invalid UTF-8 sequences with replacement characters.Tests
✏️ Tip: You can customize this high-level summary in your review settings.