Skip to content

Conversation

@jackluo923
Copy link
Member

@jackluo923 jackluo923 commented Jan 8, 2026

Description

Add a new --sanitize-invalid-json option that provides consistent, opt-in sanitization of invalid JSON during ingestion.

Previous Behavior (Inconsistent)

Issue Previous Behavior
Invalid UTF-8 sequences ❌ Fails with UTF8_ERROR
Unescaped control characters (0x00-0x1F) ❌ Fails with UNESCAPED_CHARS
Invalid surrogate escapes (e.g., \uD800) ✅ Silently replaced with U+FFFD

The inconsistency arose because get_string(true) and unescaped_key(true) were hardcoded, which silently handled invalid surrogate escapes while other malformed JSON caused hard failures.

New Behavior (Consistent)

--sanitize-invalid-json=false or not specified (default - strict mode):

Issue Behavior
Invalid UTF-8 sequences ❌ Fails with UTF8_ERROR
Unescaped control characters ❌ Fails with UNESCAPED_CHARS
Invalid surrogate escapes ❌ Fails (no silent replacement)

--sanitize-invalid-json=true (permissive mode):

Issue Behavior
Invalid UTF-8 sequences ✅ Replaced with U+FFFD, warning logged
Unescaped control characters ✅ Escaped to \u00XX, warning logged
Invalid surrogate escapes ✅ Replaced with U+FFFD

Motivation

In rare circumstances, we have observed malformed JSON in log data:

  • Null bytes (0x00) or other control characters from C-string handling bugs
  • Invalid UTF-8 sequences from encoding issues or corrupted data
  • Invalid Unicode surrogate escapes (e.g., unpaired surrogates)

This change provides users explicit control over how to handle malformed JSON:

  • Default (strict): Fail fast on any malformed JSON for data integrity, or be explicit with --sanitize-invalid-json=false
  • Permissive: Sanitize and continue for maximum data recovery with --sanitize-invalid-json=true

Changes

  • Add --sanitize-invalid-json option to CommandLineArguments (default: false)
  • Add StringUtils::sanitize_utf8_buffer() for UTF-8 validation and replacement with U+FFFD
  • Update JsonFileIterator to conditionally sanitize on UTF8_ERROR and UNESCAPED_CHARS errors during both initial parsing and document iteration
  • Update JsonParser to use conditional get_string(m_sanitize_invalid_json) and unescaped_key(m_sanitize_invalid_json) instead of hardcoded true
  • Track sanitization byte changes for accurate byte counting

Code Quality & Efficiency Improvements

Recent refactoring improved the implementation:

Code quality:

  • Renamed JsonSanitizeResult to BufferSanitizeResult for clarity
  • Extracted helper methods with descriptive names (sanitize_invalid_utf8_and_log, sanitize_control_chars_and_log, reinitialize_document_stream)
  • Added defensive null pointer checks
  • Fixed typos and documentation
  • Used named constants instead of magic strings

Efficiency:

  • Lazy-copy approach: zero allocation when no sanitization is needed
  • Bulk range copying instead of byte-by-byte operations
  • Single pass through the data with deferred copying

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Built and ran in Docker container (clp-core-dependencies-aarch64-manylinux_2_28:dev)
  • Tested control character sanitization:
    • Without flag: fails with UNESCAPED_CHARS
    • --sanitize-invalid-json=true: succeeds, logs warning about escaped characters ✓
  • Tested invalid UTF-8 sanitization:
    • Without flag: fails with UTF8_ERROR
    • --sanitize-invalid-json=true: succeeds, logs warning about replaced sequences ✓
  • Verified --help shows new option with description

Unit Tests

Comprehensive test coverage for both sanitize_json_buffer() and sanitize_utf8_buffer():

  • Valid input passthrough: ASCII, multibyte UTF-8 (2/3/4-byte), JSON with UTF-8, empty strings
  • Control character escaping: Null bytes, various control chars (0x00-0x1F), chars in keys/values
  • Escape sequence handling: Escaped quotes, escaped backslashes, control chars after escapes
  • Invalid UTF-8 replacement: 0xFF, 0xFE, orphan continuation bytes, truncated sequences
  • Edge cases: Overlong encodings, surrogate code points, code points above U+10FFFF
  • Buffer management: Growth with many control chars, correct replacement counts
  • Null/empty buffer handling: Defensive checks

Summary by CodeRabbit

Release Notes

  • New Features

    • Added JSON data sanitization capabilities with new --sanitize-invalid-json command-line option to handle invalid encoding during compression. Escapes control characters in JSON and replaces invalid UTF-8 sequences with replacement characters.
  • Tests

    • Added comprehensive test coverage for JSON and UTF-8 sanitization functionality.

✏️ Tip: You can customize this high-level summary in your review settings.


Add automatic sanitization of JSON buffers that contain unescaped
control characters (0x00-0x1F) inside string values. When simdjson
returns UNESCAPED_CHARS error, the buffer is sanitized by escaping
control characters to \u00XX format, and parsing is retried.

- Add StringUtils::sanitize_json_buffer() that escapes control chars
  inside JSON strings while respecting existing escape sequences
- Integrate sanitization in JsonFileIterator::read_new_json() with
  automatic retry after sanitization
- Track sanitization byte expansion for accurate byte counting
- Log warnings with detailed counts of sanitized characters
- Add comprehensive unit tests for sanitization logic
@jackluo923 jackluo923 requested review from a team and gibber9809 as code owners January 8, 2026 12:13
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 8, 2026

Walkthrough

Adds JSON and UTF‑8 sanitization: new StringUtils sanitizers replace invalid UTF‑8 and escape control characters; JsonFileIterator uses sanitizers on parse errors, updates buffer/consumption accounting and logs results; JsonParser and CLI option propagate the sanitize flag; tests and CMake updated.

Changes

Cohort / File(s) Summary
Build configuration
components/core/CMakeLists.txt
Added tests/test-clp_s-StringUtils.cpp to unitTest sources.
Sanitization utilities
components/core/src/clp_s/Utils.hpp, components/core/src/clp_s/Utils.cpp
Added BufferSanitizeResult / JsonSanitizeResult types and StringUtils::sanitize_json_buffer / sanitize_utf8_buffer; implement control‑char escaping and UTF‑8 replacement (U+FFFD), buffer growth/padding handling, and per‑char counts; added <map>/<cstring> includes.
JsonFileIterator
components/core/src/clp_s/JsonFileIterator.hpp, components/core/src/clp_s/JsonFileIterator.cpp
Constructor extended to accept bool sanitize_invalid_json and std::string path; new members m_path, m_sanitize_invalid_json, m_sanitization_bytes_added; on UTF8/UNESCAPED_CHARS errors optionally sanitize (UTF‑8 or JSON), update buffer/occupancy, reinitialize parser stream, log counts, and retry parsing; get_num_bytes_consumed() accounts for added bytes; added fmt/format.h and Utils.hpp includes.
Parser / ingest
components/core/src/clp_s/JsonParser.hpp, components/core/src/clp_s/JsonParser.cpp
Added sanitize_invalid_json option and m_sanitize_invalid_json; propagate flag and file path into JsonFileIterator construction; string/key reads use sanitize flag instead of always forcing unescaped behavior.
CLI wiring
components/core/src/clp_s/CommandLineArguments.hpp, components/core/src/clp_s/CommandLineArguments.cpp, components/core/src/clp_s/clp-s.cpp
Added --sanitize-invalid-json option, accessor get_sanitize_invalid_json(), and propagate value into JsonParserOption.sanitize_invalid_json.
Tests
components/core/tests/test-clp_s-StringUtils.cpp
New comprehensive Catch2 tests for JSON and UTF‑8 sanitization covering no‑ops, control‑char escaping, escape handling edge cases, buffer growth, JSONL, truncation, replacement counts, and simdjson parseability.

Sequence Diagram

sequenceDiagram
    participant JFI as JsonFileIterator
    participant Parser as simdjson Parser
    participant SU as StringUtils
    participant Log as Logger

    JFI->>Parser: parse(buffer)
    alt parse succeeds
        Parser-->>JFI: document
    else parse fails (UTF8_ERROR / UNESCAPED_CHARS)
        Parser-->>JFI: error
        JFI->>SU: sanitize_utf8_buffer(...) or sanitize_json_buffer(...)
        SU-->>JFI: BufferSanitizeResult(new_buf_occupied, char_counts)
        JFI->>JFI: update buffer, buf_size, m_sanitization_bytes_added
        JFI->>Log: log sanitization details (char_counts, optional path)
        JFI->>Parser: parse(sanitized buffer)
        Parser-->>JFI: document or error
    end
    JFI->>JFI: get_num_bytes_consumed() includes sanitization bytes
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the primary change: adding a new command-line option for opt-in JSON sanitization in clp-s.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 58-63: The boolean negation in can_parse_json uses
'!result.error()' which violates the guideline; replace that expression with the
preferred form 'false == result.error()' so the function returns the same
boolean result while following the coding style (look for the can_parse_json
function, the local variable result and its call to error()).
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 877d0fc and aae9a7e.

📒 Files selected for processing (6)
  • components/core/CMakeLists.txt
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonFileIterator.hpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/Utils.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.hpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/Utils.hpp
🧠 Learnings (21)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/CMakeLists.txt
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2025-07-01T14:49:07.333Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/src/glt/SQLitePreparedStatement.hpp:4-4
Timestamp: 2025-07-01T14:49:07.333Z
Learning: In the CLP codebase, standard headers like `<cstdint>` and `<string>` in alphabetical order (as seen in components/core/src/glt/SQLitePreparedStatement.hpp) are correctly ordered and do not need adjustment.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2025-06-02T18:22:24.060Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/Utils.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp_s/Utils.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/Utils.hpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:21:14.506Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:102-102
Timestamp: 2024-10-10T15:21:14.506Z
Learning: In `KeyValuePairLogEvent.cpp`, the `get_next_child_schema_tree_node()` method in `JsonSerializationIterator` is always called after checking `has_next_child_schema_tree_node()`, ensuring proper iterator usage.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
Repo: y-scope/clp PR: 700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Applied to files:

  • components/core/src/clp_s/Utils.hpp
🧬 Code graph analysis (2)
components/core/src/clp_s/Utils.cpp (1)
components/core/src/clp_s/Utils.hpp (5)
  • buf (114-119)
  • i (206-210)
  • i (206-206)
  • c (127-138)
  • c (127-127)
components/core/tests/test-clp_s-StringUtils.cpp (3)
components/core/src/clp_s/Utils.hpp (1)
  • buf (114-119)
components/core/src/clp_s/Utils.cpp (2)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
components/core/src/clp_s/clp-s.cpp (1)
  • parser (111-111)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/Utils.cpp

[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[style] 167-167: The function 'sanitize_json_buffer' is never used.

(unusedFunction)


[style] 177-177: The function 'write_numeric_value' is never used.

(unusedFunction)


[style] 212-212: The function 'sub_span' is never used.

(unusedFunction)

components/core/tests/test-clp_s-StringUtils.cpp

[information] 1-1: Include file

(missingIncludeSystem)


[information] 2-2: Include file

(missingIncludeSystem)


[information] 3-3: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 146-146: The function 'char_to_escaped_four_char_hex' is never used.

(unusedFunction)


[style] 160-160: The function 'bit_cast' is never used.

(unusedFunction)


[style] 177-177: The function 'write_numeric_value' is never used.

(unusedFunction)


[style] 212-212: The function 'sub_span' is never used.

(unusedFunction)

components/core/src/clp_s/JsonFileIterator.cpp

[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: check-generated
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (20)
components/core/CMakeLists.txt (1)

707-707: LGTM!

The test file is correctly added to the unit test sources list.

components/core/src/clp_s/JsonFileIterator.hpp (2)

44-48: LGTM!

The byte accounting correctly includes sanitization bytes in the total bytes read.


81-81: LGTM!

The new member variable is appropriately placed and initialized.

components/core/src/clp_s/JsonFileIterator.cpp (3)

6-6: LGTM!

The new includes are necessary for the sanitization functionality and detailed logging.

Also applies to: 9-10


70-113: LGTM!

The sanitization flow is well-implemented:

  • Only triggers on UNESCAPED_CHARS error as intended
  • Correctly handles potential buffer reallocation
  • Accurately tracks byte expansion
  • Provides detailed logging for observability
  • Properly retries parsing after sanitization

199-209: LGTM!

The byte accounting correctly includes sanitization bytes in the consumed count, ensuring accurate compression metrics.

components/core/src/clp_s/Utils.cpp (2)

5-5: LGTM!

The <cstring> include is necessary for std::memcpy used in the sanitization function.


167-234: LGTM!

The sanitization implementation is correct and well-considered:

  • State machine properly tracks string boundaries and escape sequences
  • Control character detection correctly uses unsigned cast
  • Buffer reallocation safely handles growth with proper padding
  • Comments appropriately document edge cases and design trade-offs
components/core/src/clp_s/Utils.hpp (3)

8-8: LGTM!

The <map> include is necessary for the sanitized_char_counts member in JsonSanitizeResult.


67-73: LGTM!

The JsonSanitizeResult struct is well-designed for returning sanitization results with statistics.


93-119: LGTM!

The API documentation is comprehensive and clearly explains:

  • The function's purpose and behavior
  • Buffer ownership and reallocation semantics
  • Parameter roles and modification expectations
  • Return value structure
components/core/tests/test-clp_s-StringUtils.cpp (9)

1-18: LGTM!

The includes are appropriate, and the comment explaining the use of C++14 string literals for embedded null bytes is helpful for maintainability.


66-94: LGTM!

Good coverage of no-op scenarios including the important case that control characters outside strings are left untouched.


96-127: LGTM!

Comprehensive test coverage for control character escaping, including edge cases like control chars in keys and mixed with valid escape sequences.


129-171: LGTM!

Excellent coverage of escape sequence edge cases. The comments explaining the expected behavior for each case (especially the triple backslash scenario) are valuable for maintainability.


173-185: LGTM!

Good end-to-end validation ensuring sanitized output is parseable by simdjson.


187-210: LGTM!

Good test for buffer growth scenario with verification of escape sequence count and JSON validity.


212-218: LGTM!

Good coverage for JSONL format with control characters in multiple objects.


220-259: LGTM!

Important edge case testing for truncated/malformed JSON. The relaxed assertions appropriately verify graceful handling without crashes, as documented in the implementation.


261-316: LGTM!

The direct buffer management here is justified since these tests need access to the full JsonSanitizeResult object. The test logic correctly verifies character counts and the no-sanitization case.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 59-64: The function can_parse_json uses a negated call
(!result.error()) which violates the coding guideline; replace the negation with
an explicit comparison so the return becomes false == result.error() (locate
this in the can_parse_json function where result is assigned from
parser.iterate(padded)). Ensure only that return expression is changed and
semantics remain the same.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b82533 and c9fc698.

📒 Files selected for processing (2)
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
🧠 Learnings (15)
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-06-09T17:48:56.024Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 988
File: components/log-viewer-webui/client/src/components/QueryBox/InputWithCaseSensitive/CaseSenstiveToggle.tsx:34-37
Timestamp: 2025-06-09T17:48:56.024Z
Learning: In the y-scope/clp project, prefer `false == <expression>` rather than `!<expression>` for boolean expressions in TypeScript/JavaScript files, as specified in the coding guidelines.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/JsonFileIterator.cpp (2)
components/core/src/clp_s/JsonFileIterator.hpp (2)
  • JsonFileIterator (27-32)
  • JsonFileIterator (33-33)
components/core/src/clp_s/Utils.cpp (2)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (1)
  • buf (114-119)
components/core/src/clp_s/Utils.cpp (2)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/JsonFileIterator.cpp

[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 22-22: The function 'get_error_code' is never used.

(unusedFunction)

components/core/tests/test-clp_s-StringUtils.cpp

[information] 1-1: Include file

(missingIncludeSystem)


[information] 2-2: Include file

(missingIncludeSystem)


[information] 3-3: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 146-146: The function 'char_to_escaped_four_char_hex' is never used.

(unusedFunction)


[style] 160-160: The function 'bit_cast' is never used.

(unusedFunction)


[style] 177-177: The function 'write_numeric_value' is never used.

(unusedFunction)


[style] 212-212: The function 'sub_span' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: check-generated
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp_s/JsonFileIterator.cpp (4)

6-9: LGTM!

The new includes are appropriate for the sanitization and logging functionality added.


12-24: LGTM!

The constructor correctly accepts and stores the path parameter using std::move, and the default value in the header maintains backward compatibility.


72-116: LGTM!

The sanitization logic correctly handles the UNESCAPED_CHARS error case:

  • Buffer reallocation is properly handled via pass-by-reference
  • m_sanitization_bytes_added is accumulated correctly
  • The retry parsing after sanitization is well-structured
  • Any parse error on retry is properly propagated through lines 118-121

The detailed logging with per-character counts and buffer expansion will be helpful for diagnosing data quality issues.


202-212: LGTM!

The byte counting correctly includes m_sanitization_bytes_added in both code paths, ensuring accurate reporting of consumed bytes that reflect the sanitized content size.

components/core/tests/test-clp_s-StringUtils.cpp (4)

1-18: LGTM!

The includes are appropriate, and the explanatory comment about C++14 string literals is helpful for understanding why std::string_literals is used for embedded null bytes in tests.


67-95: LGTM!

Good coverage of passthrough cases, including the important test that control characters outside JSON strings are left unchanged (letting the parser report the error).


97-181: LGTM!

Excellent coverage of control character escaping and escape sequence handling. The edge cases for escaped backslashes and triple backslashes are particularly valuable for ensuring the sanitizer correctly tracks string state.


183-331: LGTM!

Comprehensive test coverage including:

  • Valid JSON output verification
  • Buffer growth handling (100 null bytes → 600 bytes)
  • JSONL multi-document format
  • Graceful handling of truncated/malformed JSON
  • Correct character count reporting

The tests thoroughly exercise the sanitization function and document expected behaviour for edge cases.

…ation

Add a new CLI flag that enables sanitization of invalid JSON during
ingestion. When enabled, the following issues are automatically fixed:

- Invalid UTF-8 sequences: replaced with U+FFFD replacement character
- Unescaped control characters (0x00-0x1F): escaped to \u00XX format
- Invalid surrogate escapes: handled via simdjson's allow_replacement

When disabled (default), parsing fails fast on any of these issues.

Changes:
- Add --sanitize-invalid-json flag to CommandLineArguments
- Add sanitize_utf8_buffer() to Utils for UTF-8 validation/replacement
- Update JsonFileIterator to conditionally sanitize on UTF8_ERROR and
  UNESCAPED_CHARS errors during both initial parsing and iteration
- Update JsonParser to pass flag through and use conditional
  get_string()/unescaped_key() for surrogate escape handling
@jackluo923 jackluo923 changed the title feat(clp_s): Sanitize JSON buffers with unescaped control characters feat(clp-s): Add --sanitize-invalid-json option for opt-in JSON sanitization Jan 8, 2026
Add comprehensive tests for the new UTF-8 sanitization function covering:
- Valid UTF-8 passthrough (ASCII, multibyte chars, JSON with UTF-8)
- Invalid byte replacement (0xFF, 0xFE, continuation bytes without leader)
- Truncated sequences (2/3/4-byte sequences cut short)
- Overlong encodings (0xC0, 0xC1 lead bytes)
- Surrogate code points (U+D800-U+DFFF)
- Code points above U+10FFFF
- Mixed valid/invalid UTF-8
- Correct replacement character counts
- Buffer growth for many invalid bytes
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @components/core/src/clp_s/JsonFileIterator.cpp:
- Around line 89-104: The condition uses a negation operator: change the check
in the JsonFileIterator.cpp block that currently reads
"!result.sanitized_char_counts.empty()" to use the preferred style "false ==
result.sanitized_char_counts.empty()"; keep the surrounding logic and logging
(the SPDLOG_WARN call and uses of total_replaced, m_path, m_buf_occupied,
old_buf_occupied) unchanged so only the conditional expression is replaced to
conform to the coding guideline.
- Around line 301-316: Replace the negated check using '!' in the
JsonFileIterator.cpp condition with the preferred explicit comparison form:
change the if condition from using '!result.sanitized_char_counts.empty()' to
'false == result.sanitized_char_counts.empty()' so the block that computes
total_replaced and calls SPDLOG_WARN (referencing total_replaced, m_path,
m_buf_occupied, old_buf_occupied) is executed under the same logic but follows
the coding guideline.
- Around line 129-150: The condition in the loop uses the negation operator
(!char_details.empty()) which violates the coding guideline; replace it with the
equality form (false == char_details.empty()) so the check becomes explicit.
Update the for-loop in JsonFileIterator.cpp where char_details is built (refer
to the variable char_details and the loop iterating
result.sanitized_char_counts) to use false == char_details.empty() before
appending the separator, leaving the rest of the logic and logging (SPDLOG_WARN
call) unchanged.
- Around line 231-255: Summary: Replace negated boolean expressions with the
coding-guideline-preferred form. In the JsonFileIterator.cpp block handling
result.sanitized_char_counts, change the two conditions "if
(!result.sanitized_char_counts.empty())" and "if (!char_details.empty())" to use
the preferred equality style (e.g., "if (false ==
result.sanitized_char_counts.empty())" and "if (false ==
char_details.empty())"), keeping all surrounding logic (fmt::format,
SPDLOG_WARN, variable names like result.sanitized_char_counts, char_details,
m_path, m_buf_occupied, old_buf_occupied) unchanged.
- Around line 74-159: The duplicated sanitization + logging blocks in
read_new_json and get_json should be extracted into two helper methods (e.g.,
sanitize_and_log_utf8(...) and sanitize_and_log_json(...)) that encapsulate:
calling StringUtils::sanitize_utf8_buffer / sanitize_json_buffer, updating
m_buf_occupied and m_sanitization_bytes_added, preserving m_buf_size
reallocation behavior, building and emitting the same SPDLOG_WARN messages, and
retrying parsing via m_parser.iterate_many(...) (returning or setting the
resulting error/stream). Replace the repeated blocks in read_new_json and
get_json with calls to these helpers so the callers simply invoke
sanitize_and_log_utf8() or sanitize_and_log_json() and use the returned
error/stream result, keeping all existing variable names (m_buf, m_buf_size,
m_buf_occupied, m_sanitization_bytes_added, m_parser, m_stream, m_path)
unchanged.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9fc698 and b0b8b0f.

📒 Files selected for processing (10)
  • components/core/src/clp_s/CommandLineArguments.cpp
  • components/core/src/clp_s/CommandLineArguments.hpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonFileIterator.hpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/CommandLineArguments.hpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonFileIterator.hpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/CommandLineArguments.cpp
🧠 Learnings (21)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.

Applied to files:

  • components/core/src/clp_s/CommandLineArguments.hpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonFileIterator.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/CommandLineArguments.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/Utils.hpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:149-151
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Exceptions are disabled when calling `nlohmann::json::parse` by passing `nullptr` and `false` as the second and third arguments.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonFileIterator.hpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
Repo: y-scope/clp PR: 700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Applied to files:

  • components/core/src/clp_s/Utils.hpp
📚 Learning: 2024-11-12T18:46:20.933Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 584
File: components/core/src/clp_s/SchemaTree.hpp:171-171
Timestamp: 2024-11-12T18:46:20.933Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, it's acceptable to use `std::string_view` as keys in `m_node_map` because `SchemaNode`'s `m_key_name` remains valid even after move operations or reallocations, preventing dangling references.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-09T15:25:14.043Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:474-480
Timestamp: 2024-12-09T15:25:14.043Z
Learning: In `components/core/src/clp/ffi/KeyValuePairLogEvent.cpp`, node IDs are validated before accessing `child_schema_tree_node` in the function `serialize_node_id_value_pairs_to_json`, ensuring `get_node` does not throw exceptions due to invalid node IDs.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-11-12T18:47:03.828Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 584
File: components/core/src/clp_s/SchemaTree.hpp:91-94
Timestamp: 2024-11-12T18:47:03.828Z
Learning: In `components/core/src/clp_s/SchemaTree.hpp`, the `SchemaNode` class uses `std::unique_ptr<char[]> m_key_buf` and `std::string_view m_key_name` to ensure that references to `m_key_name` remain valid even after `SchemaNode` is move-constructed.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (4)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (4)
  • buf (114-119)
  • buf (137-142)
  • i (229-233)
  • i (229-229)
components/core/src/clp_s/Utils.cpp (4)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
  • sanitize_utf8_buffer (328-376)
  • sanitize_utf8_buffer (328-333)
components/core/src/clp_s/Utils.cpp (1)
components/core/src/clp_s/Utils.hpp (6)
  • buf (114-119)
  • buf (137-142)
  • i (229-233)
  • i (229-229)
  • c (150-161)
  • c (150-150)
components/core/src/clp_s/JsonParser.cpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
  • m_sanitize_invalid_json (116-116)
components/core/src/clp_s/CommandLineArguments.cpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
  • m_sanitize_invalid_json (116-116)
🪛 Cppcheck (2.19.0)
components/core/tests/test-clp_s-StringUtils.cpp

[information] 1-1: Include file

(missingIncludeSystem)


[information] 2-2: Include file

(missingIncludeSystem)


[information] 3-3: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.

(unusedFunction)


[style] 183-183: The function 'bit_cast' is never used.

(unusedFunction)


[style] 200-200: The function 'write_numeric_value' is never used.

(unusedFunction)


[style] 235-235: The function 'sub_span' is never used.

(unusedFunction)

components/core/src/clp_s/JsonFileIterator.cpp

[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 22-22: The function 'get_error_code' is never used.

(unusedFunction)


[style] 24-24: The function 'get_filename' is never used.

(unusedFunction)


[style] 142-142: The function 'read_numeric_value' is never used.

(unusedFunction)


[style] 235-235: The function 'sub_span' is never used.

(unusedFunction)

components/core/src/clp_s/Utils.cpp

[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[style] 167-167: The function 'sanitize_json_buffer' is never used.

(unusedFunction)


[style] 328-328: The function 'sanitize_utf8_buffer' is never used.

(unusedFunction)


[style] 183-183: The function 'bit_cast' is never used.

(unusedFunction)


[style] 200-200: The function 'write_numeric_value' is never used.

(unusedFunction)


[style] 235-235: The function 'sub_span' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: check-generated
🔇 Additional comments (23)
components/core/src/clp_s/CommandLineArguments.cpp (1)

259-264: LGTM! Clear option description.

The new --sanitize-invalid-json option is properly integrated with a comprehensive description of the sanitization behavior. The placement in the Compression options block is appropriate, and the use of po::bool_switch follows the established pattern.

components/core/src/clp_s/CommandLineArguments.hpp (1)

116-116: LGTM! Proper member initialization.

The new getter and member variable follow established patterns. Inline initialization to false ensures safe default behavior and aligns with modern C++ best practices.

Also applies to: 207-207

components/core/src/clp_s/clp-s.cpp (1)

110-110: LGTM! Consistent initialization.

The initialization of option.sanitize_invalid_json follows the established pattern and correctly propagates the CLI flag into the parser configuration.

components/core/src/clp_s/JsonParser.hpp (1)

43-43: LGTM! Properly initialized members.

Both the option field and class member are correctly initialized inline to false, ensuring safe default behavior (strict mode) and following modern C++ best practices.

Also applies to: 243-243

components/core/src/clp_s/JsonParser.cpp (3)

147-147: LGTM! Proper member initialization in constructor.

The initialization of m_sanitize_invalid_json from the option follows the established pattern in the constructor's initializer list.


672-673: All JsonFileIterator constructor call sites have been properly updated.

There is one instantiation site in the codebase (JsonParser.cpp:672-673) that uses the updated constructor signature with both new parameters (sanitize_invalid_json and path.path) in the correct order.


227-227: Cannot verify the core claim without evidence of the previous code implementation.

This review comment's assertion that "hardcoded true was replaced with m_sanitize_invalid_json" cannot be confirmed from the current codebase. While the current code at lines 227, 306, 411, 456, and 559 uses the m_sanitize_invalid_json variable (which defaults to false for strict parsing), there is no visible evidence of the previous implementation in the repository.

Additionally, the command-line interface indicates the feature is opt-in (--sanitize-invalid-json flag with default false), meaning users must explicitly enable permissive mode. No migration guides, documentation updates, major version bumps, or changelog entries were found to suggest this is a breaking change requiring user notification.

components/core/src/clp_s/JsonFileIterator.hpp (3)

22-34: LGTM: Constructor signature update for sanitization support.

The new sanitize_invalid_json and path parameters are well-documented and appropriately positioned with a sensible default for path. The API change enables opt-in sanitization behaviour.


50-54: LGTM: Byte counting now includes sanitization overhead.

Returning m_bytes_read + m_sanitization_bytes_added correctly accounts for the fact that sanitized content (which is what gets compressed) may be larger than the original file bytes.


84-101: Remove this comment. The concern is based on a misunderstanding of C++ member initialization semantics. In C++, member variables are initialized in their declaration order—not in the order they appear in the initializer list—regardless of how the initializer list is written. The declaration order in the header file is correct (m_buf_size, m_max_document_size, m_buf, m_reader, m_path, m_sanitize_invalid_json), and there is no initialization order bug.

Likely an incorrect or invalid review comment.

components/core/src/clp_s/Utils.hpp (2)

67-73: LGTM: Well-designed result struct for sanitization metadata.

The JsonSanitizeResult struct cleanly encapsulates both the new buffer size and per-character sanitization counts, enabling detailed logging without complicating the function signature.


93-142: LGTM: Clear and comprehensive API documentation.

The documentation properly explains:

  • Buffer ownership and reallocation semantics
  • The caller's responsibility for deleting the buffer
  • The simdjson padding requirement
  • Return value semantics

This makes the API safe to use correctly.

components/core/src/clp_s/Utils.cpp (3)

167-234: LGTM: Correct JSON control character sanitization logic.

The implementation correctly:

  • Tracks in-string state with proper escape sequence handling
  • Only sanitizes control characters inside JSON strings
  • Uses unicode escape format (\u00XX) for consistency
  • Handles buffer reallocation when needed
  • Returns early when no changes are made

236-326: LGTM: Comprehensive UTF-8 validation.

The validate_utf8_sequence helper correctly validates:

  • ASCII bytes
  • Multi-byte sequence structure and continuation bytes
  • Overlong encodings (C0, C1 lead bytes; E0 with byte2 < 0xA0; F0 with byte2 < 0x90)
  • Surrogate code points (ED with byte2 >= 0xA0)
  • Code points above U+10FFFF (F4 with byte2 > 0x8F)

The maximal subpart replacement strategy (advancing one byte on error) aligns with Unicode best practices.


328-376: LGTM: UTF-8 sanitization implementation is correct.

The use of 0xFF as a sentinel key for counting invalid sequences is clever since 0xFF is never a valid UTF-8 byte. The implementation follows the same buffer management pattern as sanitize_json_buffer.

components/core/src/clp_s/JsonFileIterator.cpp (2)

12-24: LGTM: Constructor initialization order matches declaration order.

The member initialization list correctly initializes members in their declaration order, preventing undefined behaviour.


339-349: LGTM: Byte counting correctly includes sanitization overhead.

The updated get_num_bytes_consumed properly accounts for bytes added during sanitization, ensuring accurate byte tracking for both partial and complete buffer consumption scenarios.

components/core/tests/test-clp_s-StringUtils.cpp (6)

20-65: LGTM: Well-designed test helpers with proper resource management.

The sanitize_string helper correctly handles both the reallocation and non-reallocation cases with appropriate exception safety. The try-catch ensures the buffer is always deleted.


67-95: LGTM: Good coverage of no-op cases.

Testing that valid JSON passes through unchanged ensures the sanitizer doesn't inadvertently modify correct input.


133-181: LGTM: Thorough escape sequence edge case testing.

The tests correctly verify:

  • Escaped quotes don't toggle string state
  • Escaped backslashes are handled correctly
  • Control characters after escape sequences are properly identified
  • Invalid escape targets (like \<control-char>) are left as-is

This is critical for correctness.


281-324: LGTM: UTF-8 helper functions are correct.

The count_replacement_chars function correctly identifies U+FFFD sequences (0xEF 0xBF 0xBD) and properly skips past them after counting.


351-429: LGTM: Comprehensive UTF-8 invalid sequence testing.

Excellent coverage of edge cases including:

  • Invalid lead bytes (0xFF, 0xFE)
  • Orphan continuation bytes
  • Truncated multi-byte sequences
  • Overlong encodings
  • Surrogate code points
  • Code points above U+10FFFF

These tests align well with the validation logic in validate_utf8_sequence.


507-522: LGTM: Buffer growth test verifies correct expansion.

Testing with 100 invalid bytes expanding to 300 bytes (each 1-byte invalid sequence becomes 3-byte U+FFFD) correctly exercises the buffer reallocation path.

Add guards in get_json() to exit with error if sanitization functions
don't modify the buffer. This prevents infinite loops in edge cases
where simdjson reports UNESCAPED_CHARS or UTF8_ERROR but the
sanitization functions don't find anything to fix.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @components/core/src/clp_s/JsonFileIterator.cpp:
- Around line 78-112: Replace the negated empty check using
!result.sanitized_char_counts.empty() with the guideline-preferred form false ==
result.sanitized_char_counts.empty(); locate this check inside the UTF-8
sanitization branch in JsonFileIterator.cpp (the block that handles
simdjson::error_code::UTF8_ERROR and uses result.sanitized_char_counts) and
change the if condition accordingly to match the project's consistency rule.
- Around line 294-339: Replace the negation using '!' with the
guideline-preferred equality comparison: change the conditional that checks
result.sanitized_char_counts.empty() (currently written as if
(!result.sanitized_char_counts.empty())) to use the form if (false ==
result.sanitized_char_counts.empty()) so the check in the JsonFileIterator.cpp
sanitization branch (involving result.sanitized_char_counts) follows the
project's coding style.
- Around line 115-158: Replace the use of negation with the guideline-preferred
comparison for the string emptiness check: change the condition in the loop that
builds char_details (the if that currently reads using !char_details.empty()) to
use the explicit comparison form (false == char_details.empty()) so the check in
the char_details accumulation block follows the project's coding style; update
only that condition in the char_details construction inside the JSON
sanitization branch of JsonFileIterator.cpp (the block that iterates
result.sanitized_char_counts).
- Around line 215-272: Replace the two uses of the negation operator with
explicit equality-to-false comparisons to match the coding guideline: change the
condition checking result.sanitized_char_counts.empty() (currently written as if
(!result.sanitized_char_counts.empty())) to if (false ==
result.sanitized_char_counts.empty()), and change the condition checking
char_details.empty() (currently if (!char_details.empty())) to if (false ==
char_details.empty()) in the JsonFileIterator logic that handles sanitization so
behavior remains identical but style conforms.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0b8b0f and ee13dd3.

📒 Files selected for processing (1)
  • components/core/src/clp_s/JsonFileIterator.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/JsonFileIterator.cpp
🧠 Learnings (17)
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T15:19:52.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-10-10T15:19:52.408Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-06-09T17:48:56.024Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 988
File: components/log-viewer-webui/client/src/components/QueryBox/InputWithCaseSensitive/CaseSenstiveToggle.tsx:34-37
Timestamp: 2025-06-09T17:48:56.024Z
Learning: In the y-scope/clp project, prefer `false == <expression>` rather than `!<expression>` for boolean expressions in TypeScript/JavaScript files, as specified in the coding guidelines.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-10T05:46:35.188Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:299-307
Timestamp: 2024-10-10T05:46:35.188Z
Learning: In the C++ function `get_schema_subtree_bitmap` in `KeyValuePairLogEvent.cpp`, when a loop uses `while (true)` with an internal check on `optional.has_value()`, and comments explain that this structure is used to silence `clang-tidy` warnings about unchecked optional access, this code is acceptable and should not be refactored to use `while (optional.has_value())`.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/JsonFileIterator.cpp (2)
components/core/src/clp_s/JsonFileIterator.hpp (2)
  • JsonFileIterator (28-34)
  • JsonFileIterator (35-35)
components/core/src/clp_s/Utils.cpp (4)
  • sanitize_utf8_buffer (328-376)
  • sanitize_utf8_buffer (328-333)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/JsonFileIterator.cpp

[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 22-22: The function 'get_error_code' is never used.

(unusedFunction)


[style] 24-24: The function 'get_filename' is never used.

(unusedFunction)


[style] 142-142: The function 'read_numeric_value' is never used.

(unusedFunction)


[style] 235-235: The function 'sub_span' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: check-generated
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
🔇 Additional comments (5)
components/core/src/clp_s/JsonFileIterator.cpp (5)

6-6: LGTM: Includes added for new sanitization functionality.

The fmt/format.h include supports the detailed logging, and Utils.hpp provides the sanitization functions used throughout the file.

Also applies to: 9-9


12-26: LGTM: Constructor properly initialized for sanitization support.

The new parameters (sanitize_invalid_json flag and path) are correctly initialized, and std::move appropriately avoids copying the path string.


255-260: LGTM: Proper guard against infinite loop.

The check for empty sanitized_char_counts correctly prevents infinite loops by detecting when sanitization makes no changes despite simdjson reporting an error.


321-326: LGTM: Proper guard against infinite loop.

The check correctly prevents infinite loops by detecting when UTF-8 sanitization makes no changes despite encountering a UTF8_ERROR.


349-359: LGTM: Byte counting correctly includes sanitization bytes.

The logic properly accounts for bytes added during sanitization in both code paths, ensuring accurate tracking of the data that will be compressed.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 448-505: Extract a small test helper that centralizes buffer
allocation/copying, calls StringUtils::sanitize_utf8_buffer, builds the output
string from buf using result.new_buf_occupied, frees the buffer, and returns
both the JsonSanitizeResult and the output string (e.g., a
SanitizeResultWithOutput struct and a sanitize_utf8_with_result function).
Replace the repeated allocate/call/cleanup blocks in the TEST_CASE sections
(including the other repeated region around lines 528-583) with calls to this
helper and use its returned result and output for assertions to remove
duplication.
- Around line 30-52: Replace manual new/delete in sanitize_string with RAII:
allocate the buffer with std::unique_ptr<char[]> (e.g., unique_ptr<char[]>
buf(new char[buf_size + simdjson::SIMDJSON_PADDING]) and use buf.get() when
calling StringUtils::sanitize_json_buffer), and ensure you handle the
sanitizer's reallocation semantics by detecting if the returned buffer pointer
differs and resetting the unique_ptr to own the new pointer (or release before
the call and reset after). This preserves automatic cleanup on exceptions while
accommodating sanitize_json_buffer possibly deleting/replacing the buffer
(update references to buf, buf_size, and buf_occupied accordingly).
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee13dd3 and 9c2140e.

📒 Files selected for processing (1)
  • components/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-clp_s-StringUtils.cpp
🧠 Learnings (11)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:149-151
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Exceptions are disabled when calling `nlohmann::json::parse` by passing `nullptr` and `false` as the second and third arguments.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
🧬 Code graph analysis (1)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (4)
  • buf (114-119)
  • buf (137-142)
  • i (229-233)
  • i (229-229)
components/core/src/clp_s/Utils.cpp (4)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
  • sanitize_utf8_buffer (328-376)
  • sanitize_utf8_buffer (328-333)
🪛 Cppcheck (2.19.0)
components/core/tests/test-clp_s-StringUtils.cpp

[information] 1-1: Include file

(missingIncludeSystem)


[information] 2-2: Include file

(missingIncludeSystem)


[information] 3-3: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.

(unusedFunction)


[style] 183-183: The function 'bit_cast' is never used.

(unusedFunction)


[style] 200-200: The function 'write_numeric_value' is never used.

(unusedFunction)


[style] 235-235: The function 'sub_span' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: check-generated
  • GitHub Check: build (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
components/core/tests/test-clp_s-StringUtils.cpp (7)

1-18: LGTM! Well-documented use of string literals.

The comment explaining the use of C++14 string literals and string concatenation for embedded null bytes is helpful for future maintainers.


59-64: LGTM!

Correctly uses false == result.error() following the coding guideline preference for this pattern over !result.error().


67-95: LGTM! Comprehensive no-change scenarios.

Good coverage of edge cases including escaped characters, unicode escapes, multiple objects, and control chars outside strings.


133-181: LGTM! Thorough escape sequence handling tests.

Excellent coverage of tricky escape sequence scenarios including escaped quotes, escaped backslashes, and their interactions with control characters.


230-275: LGTM! Well-documented edge case testing.

The comments clearly explain the expected behaviour for truncated/malformed JSON, and the assertions correctly verify graceful handling without crashes.


312-323: LGTM! Correct loop bounds for counting replacement characters.

The condition i + 2 < s.size() correctly ensures valid access to s[i], s[i+1], and s[i+2]. The i += 2 inside the match block, combined with the loop's ++i, properly advances past each 3-byte replacement sequence.


351-429: LGTM! Comprehensive UTF-8 validation test coverage.

Excellent coverage of UTF-8 edge cases including:

  • Invalid standalone bytes (0xFF, 0xFE)
  • Orphan continuation bytes (0x80-0xBF)
  • Truncated multi-byte sequences (2, 3, and 4-byte)
  • Overlong encodings
  • Surrogate code points
  • Code points above U+10FFFF

This aligns well with the maximal subpart replacement strategy documented in the implementation.

Extract sanitize_and_log_utf8() and sanitize_and_log_json() helper
methods to reduce code duplication. The sanitization-and-log pattern
was repeated 4 times with slight variations.

Both helpers return true if sanitization made changes, allowing
callers to detect infinite loop conditions in get_json().
- Add SanitizeResultWithOutput struct for returning both result and output
- Add sanitize_json_with_result() and sanitize_utf8_with_result() helpers
- Simplify sanitize_string() and sanitize_utf8_string() to use helpers
- Replace manual buffer management in count tests with helper calls

This reduces code duplication in test files and improves maintainability.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @components/core/src/clp_s/JsonFileIterator.cpp:
- Around line 182-191: The SPDLOG_WARN call in JsonFileIterator.cpp that logs
the buffer expansion uses unsigned subtraction for the byte-difference
(m_buf_occupied - old_buf_occupied); change this to use consistent signed
arithmetic (e.g., compute the difference as a signed type such as
ssize_t/ptrdiff_t or cast both m_buf_occupied and old_buf_occupied to a signed
integral) so the logged "Buffer expanded by" value matches the signed form used
in sanitize_and_log_utf8(), keeping variables like total_sanitized, char_details
and m_path unchanged.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c2140e and 4dbf8f3.

📒 Files selected for processing (2)
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/JsonFileIterator.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/JsonFileIterator.hpp
  • components/core/src/clp_s/JsonFileIterator.cpp
🧠 Learnings (1)
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp_s/JsonFileIterator.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/JsonFileIterator.cpp (2)
components/core/src/clp_s/JsonFileIterator.hpp (2)
  • JsonFileIterator (28-34)
  • JsonFileIterator (35-35)
components/core/src/clp_s/Utils.cpp (4)
  • sanitize_utf8_buffer (328-376)
  • sanitize_utf8_buffer (328-333)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
🪛 Cppcheck (2.19.0)
components/core/src/clp_s/JsonFileIterator.cpp

[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 22-22: The function 'get_error_code' is never used.

(unusedFunction)


[style] 24-24: The function 'get_filename' is never used.

(unusedFunction)


[style] 142-142: The function 'read_numeric_value' is never used.

(unusedFunction)


[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.

(unusedFunction)


[style] 183-183: The function 'bit_cast' is never used.

(unusedFunction)


[style] 235-235: The function 'sub_span' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: check-generated
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build (macos-15)
🔇 Additional comments (9)
components/core/src/clp_s/JsonFileIterator.hpp (3)

28-34: LGTM! Well-designed constructor extension.

The new parameters are logically ordered, with the required sanitize_invalid_json flag before the optional path parameter. The documentation clearly explains the purpose of each new parameter.


50-54: Clear documentation and implementation for byte tracking.

The inline implementation correctly adds m_sanitization_bytes_added to the total, ensuring that callers get an accurate count of bytes that will be compressed (including any bytes added during sanitization).


84-96: Well-documented helper method declarations.

Both methods are appropriately marked [[nodiscard]] since the return value indicates whether sanitization made changes, which is critical for preventing infinite loops in the calling code.

components/core/src/clp_s/JsonFileIterator.cpp (6)

12-26: LGTM! Clean constructor initialization.

The member initialization order follows the declaration order in the header, and std::move(path) is correctly used to avoid unnecessary copies.


74-92: Good sequential sanitization approach with clear documentation.

The sequential handling allows both UTF-8 and control character issues to be fixed in a single pass through read_new_json(). The explicit static_cast<void>() with explanatory comments clearly communicates the intentional discard of return values, and any persistent errors are correctly caught at line 94.


125-154: Solid implementation with proper byte tracking.

The use of signed arithmetic at line 149 (static_cast<int64_t>) is appropriate since U+FFFD replacements could theoretically shrink or expand the buffer. The early return on empty sanitized_char_counts correctly indicates no changes were made, which is critical for infinite loop prevention in the caller.


218-240: Excellent defensive programming for infinite loop prevention.

The check at line 223 ensures that if sanitize_and_log_json() makes no changes, the original error is reported rather than endlessly retrying. The complete re-initialization of the document stream (lines 231-239) correctly resets all tracking state before restarting iteration.


260-282: Consistent handling of UTF-8 sanitization with proper loop prevention.

The pattern mirrors the UNESCAPED_CHARS handling above, maintaining code consistency. The heuristic for distinguishing mid-stream UTF-8 errors from truncation errors is well-documented, and the sanitization path correctly prevents infinite loops.


292-302: Correct byte accounting in both code paths.

The implementation properly includes m_sanitization_bytes_added in both cases: when documents remain in the buffer (line 298-299) and when all documents are consumed (line 301). The comment at lines 295-296 clearly explains the rationale.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @components/core/tests/test-clp_s-StringUtils.cpp:
- Around line 333-344: The skip logic in count_replacement_chars is implicit (it
uses i += 2 inside a for-loop and relies on the loop's ++i to advance 3 bytes);
replace the for-loop with an explicit while-loop or adjust the match branch to
explicitly advance by 3 (e.g., use while (i + 2 < s.size()) and on match do i +=
3) so the byte-skip is obvious; update the function count_replacement_chars
accordingly and keep the same matching checks for 0xEF, 0xBF, 0xBD.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4dbf8f3 and c68c607.

📒 Files selected for processing (1)
  • components/core/tests/test-clp_s-StringUtils.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-clp_s-StringUtils.cpp
🧠 Learnings (14)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:149-151
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Exceptions are disabled when calling `nlohmann::json::parse` by passing `nullptr` and `false` as the second and third arguments.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/tests/test-clp_s-StringUtils.cpp
🧬 Code graph analysis (1)
components/core/tests/test-clp_s-StringUtils.cpp (2)
components/core/src/clp_s/Utils.hpp (4)
  • buf (114-119)
  • buf (137-142)
  • i (229-233)
  • i (229-229)
components/core/src/clp_s/Utils.cpp (4)
  • sanitize_json_buffer (167-234)
  • sanitize_json_buffer (167-172)
  • sanitize_utf8_buffer (328-376)
  • sanitize_utf8_buffer (328-333)
🪛 Cppcheck (2.19.0)
components/core/tests/test-clp_s-StringUtils.cpp

[information] 1-1: Include file

(missingIncludeSystem)


[information] 2-2: Include file

(missingIncludeSystem)


[information] 3-3: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 6-6: Include file

(missingIncludeSystem)


[information] 7-7: Include file

(missingIncludeSystem)


[information] 8-8: Include file

(missingIncludeSystem)


[information] 9-9: Include file

(missingIncludeSystem)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[style] 169-169: The function 'char_to_escaped_four_char_hex' is never used.

(unusedFunction)


[style] 183-183: The function 'bit_cast' is never used.

(unusedFunction)


[style] 200-200: The function 'write_numeric_value' is never used.

(unusedFunction)


[style] 235-235: The function 'sub_span' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
🔇 Additional comments (6)
components/core/tests/test-clp_s-StringUtils.cpp (6)

1-19: LGTM!

The includes are appropriate, and the explanatory comment about C++14 string literals and concatenation for embedded null bytes is helpful for understanding the test input construction.


21-77: LGTM!

The helper functions are well-documented. The raw pointer usage is appropriately justified in the comments, and line 75 correctly follows the coding guideline by using false == result.error() instead of the negation operator. As per coding guidelines, this is the preferred pattern.


79-287: LGTM!

Comprehensive test coverage for JSON sanitization, including edge cases for escape sequence handling, buffer growth, and truncated JSON. The tests appropriately document expected behaviour for malformed input scenarios.


347-370: LGTM!

Good coverage of valid UTF-8 cases with helpful comments documenting the byte sequences for each Unicode character.


372-509: LGTM!

Excellent coverage of UTF-8 edge cases including truncated sequences, overlong encodings, surrogate code points, and code points above U+10FFFF. These tests are important for ensuring correct handling of potentially malicious or malformed input.


515-536: LGTM!

Good verification of the character count tracking functionality, covering both the case with multiple different control characters and the no-op case.

jackluo923 and others added 3 commits January 9, 2026 04:25
Code quality improvements:
- Rename JsonSanitizeResult to BufferSanitizeResult for clarity
- Rename helper methods: sanitize_and_log_utf8 -> sanitize_invalid_utf8_and_log,
  sanitize_and_log_json -> sanitize_control_chars_and_log
- Fix typo: "delimeters" -> "delimiters"
- Fix docs: "Pointer to pointer" -> "Reference to pointer"
- Add @note about buffer reallocation to both sanitization helpers
- Add null pointer checks to sanitization functions
- Extract reinitialize_document_stream() helper to reduce duplication
- Add named constant cUtf8ReplacementChar for U+FFFD magic string
- Remove unused variable in char_to_hex()
- Use consistent signed arithmetic in log messages

Efficiency improvements:
- Use lazy-copy approach: only allocate when first issue is found
- Copy valid ranges in bulk using append(ptr, len) instead of byte-by-byte
- Zero allocation in the common case when no sanitization is needed
@junhaoliao junhaoliao added this to the Backlog milestone Jan 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants