Skip to content

Comments

perf(clp-s::timestamp_parser): Improve marshalling speed for padded integer fields (resolves #1968).#2013

Open
gibber9809 wants to merge 1 commit intoy-scope:mainfrom
gibber9809:fast-timestamp-marshal
Open

perf(clp-s::timestamp_parser): Improve marshalling speed for padded integer fields (resolves #1968).#2013
gibber9809 wants to merge 1 commit intoy-scope:mainfrom
gibber9809:fast-timestamp-marshal

Conversation

@gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Feb 20, 2026

Description

This PR significantly improves decompression speed for padded integer fields in timestamps, leading to much higher decompression speeds overall. As identified in #1968 most of the overhead for timestamp marshalling was coming from fmt::format() calls, which appeared to be dynamically parsing the format strings on each call.

Decompression speedup for open source datasets compared to decompression speed without this optimization:

dataset speedup %
mongodb +23.14%
cockroachdb +3.75%
elasticsearch +16.81%

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Validated that unit tests still pass
  • Benchmarked improvement in decompression speed

Summary by CodeRabbit

  • Refactor
    • Enhanced error handling and validation in date and time parsing operations to improve overall reliability.
    • Strengthened consistency in timestamp formatting throughout the application.
    • Improved robustness of temporal data processing with refined internal validation logic to ensure more predictable behaviour.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 20, 2026

Walkthrough

Introduces a new helper function append_positive_left_padded_integer() to replace fmt::format calls for padding integers with zeros throughout timestamp parsing and marshaling logic. The helper validates non-negativity and integrates error propagation via existing error-handling mechanisms.

Changes

Cohort / File(s) Summary
Timestamp Parser Helper Function
components/core/src/clp_s/timestamp_parser/TimestampParser.cpp
Adds append_positive_left_padded_integer() helper function for left-padded integer formatting with negative value validation. Replaces fmt::format calls in date-time and numeric timestamp marshaling for years, months, days, hours, minutes, seconds, sub-seconds, and timezone handling. Implements error propagation via YSTDLIB_ERROR_HANDLING_TRYV throughout affected code paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: a performance optimization for timestamp marshalling by improving padded integer field handling, and references the related issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/src/clp_s/timestamp_parser/TimestampParser.cpp (1)

751-773: 🧹 Nitpick | 🔵 Trivial

Near-duplicate logic in the 'T' case could be extracted into a shared helper.

Both the date-time marshal (lines 751–773) and numeric marshal (lines 872–894) create a temporary string, pad it, strip trailing zeros, check for all-zeros, and append to the buffer. This is an existing pattern, but the PR touches both blocks. Consider extracting this into a small helper to reduce duplication.

Also applies to: 872-894

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/timestamp_parser/TimestampParser.cpp` around lines
751 - 773, Extract the repeated logic that builds a 9-digit left-padded
nanoseconds string, trims trailing zeros, validates not-all-zero, and appends to
the output into a small helper (e.g., formatOrAppendFractionalNanoseconds or
appendTrimmedSubseconds) and call it from the 'T' branch in TimestampParser::...
(the date-time marshal case) and from the numeric marshal case; the helper
should accept the nanoseconds value (from time_of_day.subseconds().count() or
equivalent) plus a reference to the buffer (or return a string and an ErrorCode)
and must reproduce the current behavior: pad to 9 digits, strip trailing '0's,
return ErrorCode::IncompatibleTimestampPattern when all digits are zero, and
otherwise append the trimmed substring to the buffer so both code paths use the
same implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/timestamp_parser/TimestampParser.cpp`:
- Around line 946-951: The current code in TimestampParser.cpp creates a
temporary std::string via std::to_string(value) before appending padding and the
value to buffer; replace that allocation by formatting the integer directly into
a small stack char array using std::to_chars (include <charconv>), compute
digits written, compute num_padding_characters as before, then call
buffer.append(num_padding_characters, padding_character) and buffer.append(ptr,
ptr + digits_written) instead of using value_str—target the block that declares
value, length, padding_character and the buffer appends to eliminate the heap
allocation.
- Around line 936-953: The function append_positive_left_padded_integer
currently takes an int which causes implicit narrowing at call sites; change the
parameter type of value from int to int64_t in both the function declaration and
the function definition (append_positive_left_padded_integer) so it accepts
int64_t (matching extract_absolute_subsecond_nanoseconds()/epochtime_t and
chrono .count() results); ensure any includes for <cstdint> are present and
update any forward declarations or prototypes to the new signature so callers no
longer perform narrowing conversions.

---

Outside diff comments:
In `@components/core/src/clp_s/timestamp_parser/TimestampParser.cpp`:
- Around line 751-773: Extract the repeated logic that builds a 9-digit
left-padded nanoseconds string, trims trailing zeros, validates not-all-zero,
and appends to the output into a small helper (e.g.,
formatOrAppendFractionalNanoseconds or appendTrimmedSubseconds) and call it from
the 'T' branch in TimestampParser::... (the date-time marshal case) and from the
numeric marshal case; the helper should accept the nanoseconds value (from
time_of_day.subseconds().count() or equivalent) plus a reference to the buffer
(or return a string and an ErrorCode) and must reproduce the current behavior:
pad to 9 digits, strip trailing '0's, return
ErrorCode::IncompatibleTimestampPattern when all digits are zero, and otherwise
append the trimmed substring to the buffer so both code paths use the same
implementation.

Comment on lines +936 to +953
auto append_positive_left_padded_integer(
std::string& buffer,
size_t length,
int value,
char padding_character
) -> ystdlib::error_handling::Result<void> {
if (value < 0) {
return ErrorCode{ErrorCodeEnum::InvalidDate};
}

auto value_str{std::to_string(value)};
size_t const num_padding_characters{
value_str.length() >= length ? size_t{0} : length - value_str.length()
};
buffer.append(num_padding_characters, padding_character);
buffer.append(value_str);
return ystdlib::error_handling::success();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find epochtime_t definition
echo "=== epochtime_t definition ==="
rg -n 'using epochtime_t|typedef.*epochtime_t' --type=cpp --type=hpp

Repository: y-scope/clp

Length of output: 130


🏁 Script executed:

#!/bin/bash
# Find the file structure and locate the function declaration
echo "=== Function declaration (should be around line 312) ==="
sed -n '310,315p' components/core/src/clp_s/timestamp_parser/TimestampParser.cpp

echo ""
echo "=== Function definition (lines 936-953) ==="
sed -n '936,953p' components/core/src/clp_s/timestamp_parser/TimestampParser.cpp

Repository: y-scope/clp

Length of output: 889


🏁 Script executed:

#!/bin/bash
# Find ALL call sites of append_positive_left_padded_integer with context
echo "=== All call sites of append_positive_left_padded_integer ==="
rg -n 'append_positive_left_padded_integer' components/core/src/clp_s/timestamp_parser/TimestampParser.cpp -B2 -A2

Repository: y-scope/clp

Length of output: 6859


🏁 Script executed:

#!/bin/bash
# Search more broadly for epochtime_t definition
echo "=== Searching for epochtime_t in all files ==="
find . -type f \( -name "*.h" -o -name "*.hpp" -o -name "*.cpp" \) -exec grep -l "epochtime_t" {} \; | head -10

echo ""
echo "=== First file containing epochtime_t ==="
find . -type f \( -name "*.h" -o -name "*.hpp" -o -name "*.cpp" \) -exec grep -l "epochtime_t" {} \; | head -1 | xargs grep -n "epochtime_t" | head -5

Repository: y-scope/clp

Length of output: 8995


🏁 Script executed:

#!/bin/bash
# Find chrono types for hours, minutes, seconds, subseconds
echo "=== Check time_of_day type definition ==="
grep -rn "time_of_day\|make_time" components/core/src/clp_s/timestamp_parser/TimestampParser.cpp --include="*.cpp" | head -10

echo ""
echo "=== Search for date header and its types ==="
grep -n "#include.*date" components/core/src/clp_s/timestamp_parser/TimestampParser.cpp | head -5

Repository: y-scope/clp

Length of output: 927


🏁 Script executed:

#!/bin/bash
# Find extract_absolute_subsecond_nanoseconds return type
echo "=== extract_absolute_subsecond_nanoseconds definition/declaration ==="
rg -n "extract_absolute_subsecond_nanoseconds" components/core/src/clp_s/timestamp_parser/TimestampParser.cpp -B3 -A8 | head -40

echo ""
echo "=== cLeapSecond definition ==="
grep -n "cLeapSecond" components/core/src/clp_s/timestamp_parser/TimestampParser.cpp | head -5

Repository: y-scope/clp

Length of output: 2511


Parameter type int causes implicit narrowing conversions at call sites—use int64_t instead.

The value parameter receives values from extract_absolute_subsecond_nanoseconds() (which returns epochtime_t/int64_t) and chrono .count() methods (which return long long). Passing these wider types to an int parameter at lines 729, 740, 747, 754, 850, 861, 868, 875 triggers implicit narrowing conversions. While values fit in int range at runtime, this is a type safety concern that could cause -Wconversion warnings and is fragile if ranges ever change.

Changing the parameter to int64_t eliminates all narrowing with zero performance cost. Update both the declaration (line 309) and definition (line 936).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/timestamp_parser/TimestampParser.cpp` around lines
936 - 953, The function append_positive_left_padded_integer currently takes an
int which causes implicit narrowing at call sites; change the parameter type of
value from int to int64_t in both the function declaration and the function
definition (append_positive_left_padded_integer) so it accepts int64_t (matching
extract_absolute_subsecond_nanoseconds()/epochtime_t and chrono .count()
results); ensure any includes for <cstdint> are present and update any forward
declarations or prototypes to the new signature so callers no longer perform
narrowing conversions.

Comment on lines +946 to +951
auto value_str{std::to_string(value)};
size_t const num_padding_characters{
value_str.length() >= length ? size_t{0} : length - value_str.length()
};
buffer.append(num_padding_characters, padding_character);
buffer.append(value_str);
Copy link
Contributor

@coderabbitai coderabbitai bot Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Optional: avoid the temporary std::string allocation for further perf gains.

std::to_string(value) heap-allocates a temporary string on every call. Since this is a hot-path optimisation PR, you could write directly into buffer using a small stack array to eliminate that allocation entirely.

Sketch (not a complete diff)
-    auto value_str{std::to_string(value)};
-    size_t const num_padding_characters{
-            value_str.length() >= length ? size_t{0} : length - value_str.length()
-    };
-    buffer.append(num_padding_characters, padding_character);
-    buffer.append(value_str);
+    // Max digits for int64_t is 19 (or 20 with sign, but value is non-negative here).
+    std::array<char, 20> digits{};
+    auto const [end_ptr, ec] = std::to_chars(digits.data(), digits.data() + digits.size(), value);
+    auto const num_digits = static_cast<size_t>(end_ptr - digits.data());
+    size_t const num_padding_characters{
+            num_digits >= length ? size_t{0} : length - num_digits
+    };
+    buffer.append(num_padding_characters, padding_character);
+    buffer.append(digits.data(), num_digits);

This would require #include <charconv>.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/timestamp_parser/TimestampParser.cpp` around lines
946 - 951, The current code in TimestampParser.cpp creates a temporary
std::string via std::to_string(value) before appending padding and the value to
buffer; replace that allocation by formatting the integer directly into a small
stack char array using std::to_chars (include <charconv>), compute digits
written, compute num_padding_characters as before, then call
buffer.append(num_padding_characters, padding_character) and buffer.append(ptr,
ptr + digits_written) instead of using value_str—target the block that declares
value, length, padding_character and the buffer appends to eliminate the heap
allocation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually curious whether this would make a difference. Since you probably already have the archives set up, shall we try it and see if it's actually better?

    if (value < 0) {
        return ErrorCode{ErrorCodeEnum::InvalidDate};
    }

    std::array<char, std::numeric_limits<typeof(value)>::digits10 + 1> value_str_buffer{};
    auto const [end_ptr, ec] = std::to_chars(
            value_str_buffer.data(),
            value_str_buffer.data() + value_str_buffer.size(),
            value
    );
    if (auto err{std::make_error_code(ec)}; err) {
        return err;
    }

    std::string_view const value_str{value_str_buffer.data(), end_ptr};
    size_t const num_padding_characters{
            value_str.length() >= length ? size_t{0} : length - value_str.length()
    };
    buffer.append(num_padding_characters, padding_character);
    buffer.append(value_str);
    return ystdlib::error_handling::success();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LinZhihao-723, I appreciate your curiosity about the performance impact! Unfortunately, I don't have access to run benchmarks or build the codebase—the sandbox environment I work in doesn't support compilation or execution of repository code. I can only analyze code statically and run lightweight shell utilities.

However, regarding whether this optimization would make a difference:

Likelihood: It's a micro-optimization that might show marginal gains (perhaps 1-3%) on top of your main PR improvements, but the primary win comes from eliminating the fmt::format calls you already addressed. std::to_string does allocate a temporary string, but modern allocators are quite fast for small allocations. The benefit would be most noticeable if append_positive_left_padded_integer is called millions of times in tight loops during decompression.

Your implementation looks good, with one minor note:

  • typeof(value) is a GCC extension; decltype(value) would be more portable
  • The error checking for std::to_chars is thorough, though for integers with a correctly-sized buffer, ec should never indicate failure (but defensive programming doesn't hurt)

If you're curious, you could:

  1. Apply the std::to_chars change locally
  2. Re-run your existing benchmarks (mongodb, cockroachdb, elasticsearch)
  3. Compare the speedup percentages

Given that this is already a high-impact PR (+23% on mongodb!), I'd say this optimization is optional—ship what you have unless you're really curious about squeezing out every last bit of performance.


🧠 Learnings used
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.

Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-01T07:59:11.208Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.

* @return A void result on success, or an error code indicating the failure:
* - ErrorCodeEnum::InvalidDate if `value` is negative.
*/
auto append_positive_left_padded_integer(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto append_positive_left_padded_integer(
[[nodiscard]] auto append_positive_left_padded_integer(

Comment on lines +310 to +313
std::string& buffer,
size_t length,
int value,
char padding_character
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

        int value,
        size_t min_padded_length,
        char padding_character,
        std::string& buffer
  • length -> min_padded_length to be more explicit about what this length is.
  • Put buffer to the end since it's an output param.

Comment on lines +946 to +951
auto value_str{std::to_string(value)};
size_t const num_padding_characters{
value_str.length() >= length ? size_t{0} : length - value_str.length()
};
buffer.append(num_padding_characters, padding_character);
buffer.append(value_str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually curious whether this would make a difference. Since you probably already have the archives set up, shall we try it and see if it's actually better?

    if (value < 0) {
        return ErrorCode{ErrorCodeEnum::InvalidDate};
    }

    std::array<char, std::numeric_limits<typeof(value)>::digits10 + 1> value_str_buffer{};
    auto const [end_ptr, ec] = std::to_chars(
            value_str_buffer.data(),
            value_str_buffer.data() + value_str_buffer.size(),
            value
    );
    if (auto err{std::make_error_code(ec)}; err) {
        return err;
    }

    std::string_view const value_str{value_str_buffer.data(), end_ptr};
    size_t const num_padding_characters{
            value_str.length() >= length ? size_t{0} : length - value_str.length()
    };
    buffer.append(num_padding_characters, padding_character);
    buffer.append(value_str);
    return ystdlib::error_handling::success();

return ErrorCode{ErrorCodeEnum::InvalidDate};
}

auto value_str{std::to_string(value)};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto value_str{std::to_string(value)};
auto const value_str{std::to_string(value)};

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(clp-s::timestamp_parser): Improve marshalling speed for format specifiers representing padded integer fields.

2 participants