fix(clp-s): Skip empty files during compression to avoid false errors (fixes #1993). #2067

Open
junhaoliao wants to merge 1 commit into y-scope:main from junhaoliao:fix-empty-file-unstructured

junhaoliao (Member) commented Mar 4, 2026

Description

When compressing a directory containing empty files (e.g., empty syslog rotation files), clp-s
fails with errors because try_deduce_reader_type() returns FileType::Unknown for empty files —
there are no bytes to inspect for type detection.

This PR adds an empty-file check in both clp-s compression code paths — the log-converter
(--unstructured mode in CLP-JSON compress.sh) and JsonParser::ingest() (structured JSON mode) — so that empty filesystem files are skipped with an informational log message instead of causing task failures.

This brings clp-s behaviour in line with CLP-TEXT (clp binary), which already handles empty files
gracefully by iterating over zero messages without error.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

1. Unstructured compression with ~/samples/hive-24hr (clp-s, --unstructured)

Task: Verify that compressing ~/samples/hive-24hr — which contains empty syslog rotation
files — with --unstructured no longer fails with "Received input that was not unstructured
logtext".

Command:

cd build/clp-package
./sbin/start-clp.sh

./sbin/compress.sh --unstructured ~/samples/hive-24hr

Output:

2026-03-04T15:54:45.531 INFO [compress] Compression job 1 submitted.
2026-03-04T15:54:52.045 INFO [compress] Compressed 483.00MB into 4.52MB (106.78x). Speed: 97.81MB/s.
2026-03-04T15:54:58.557 INFO [compress] Compressed 571.48MB into 6.80MB (84.00x). Speed: 49.91MB/s.
2026-03-04T15:55:15.585 INFO [compress] Compressed 893.13MB into 14.98MB (59.61x). Speed: 31.36MB/s.
2026-03-04T15:55:20.092 INFO [compress] Compressed 1.18GB into 22.69MB (53.44x). Speed: 36.75MB/s.
2026-03-04T15:55:22.096 INFO [compress] Compressed 1.50GB into 30.88MB (49.65x). Speed: 43.82MB/s.
2026-03-04T15:55:31.113 INFO [compress] Compressed 1.81GB into 38.79MB (47.82x). Speed: 42.15MB/s.
2026-03-04T15:55:32.115 INFO [compress] Compressed 2.13GB into 46.84MB (46.51x). Speed: 48.40MB/s.
2026-03-04T15:55:32.615 INFO [compress] Compression finished.
2026-03-04T15:55:32.616 INFO [compress] Compressed 2.44GB into 54.57MB (45.81x). Speed: 55.35MB/s.

Explanation: The compression job completes successfully with no task errors. Previously, the
empty syslog rotation files would cause task failures with "Received input that was not unstructured
logtext". Now they are skipped gracefully and all 2.44 GB of data compresses into 54.57 MB.

2. Structured JSON compression with empty files (clp-s)

Task: Verify that compressing a directory containing an empty file in structured JSON mode also
succeeds.

Command:

mkdir -p /tmp/test-empty-json2
echo '{"timestamp": "2024-01-01T00:00:00.000Z", "msg": "hello"}' > /tmp/test-empty-json2/data.jsonl
touch /tmp/test-empty-json2/empty.jsonl

./sbin/compress.sh --timestamp-key timestamp /tmp/test-empty-json2

Output:

2026-03-04T15:34:41.879 INFO [compress] Compression job 3 submitted.
2026-03-04T15:34:42.381 INFO [compress] Compressed 58.00B into 608.00B (0.10x). Speed: 119.77B/s.
2026-03-04T15:34:42.882 INFO [compress] Compression finished.
2026-03-04T15:34:42.882 INFO [compress] Compressed 58.00B into 608.00B (0.10x). Speed: 114.18B/s.

Explanation: The structured JSON compression also handles the empty file gracefully — no
"Could not deduce content type" error.

3. CLP-TEXT consistency check

Task: Verify that CLP-TEXT (clp binary) already handles empty files gracefully, confirming
this fix makes clp-s consistent.

Command:

# Switch to clp-text config
cp etc/clp-config.template.text.yaml etc/clp-config.yaml
./sbin/stop-clp.sh && ./sbin/start-clp.sh

./sbin/compress.sh /tmp/test-empty

Output:

2026-03-04T15:36:41.276 INFO [compress] Compression job 5 submitted.
2026-03-04T15:36:41.777 INFO [compress] Compressed 30.00B into 195.00B (0.15x). Speed: 69.14B/s.
2026-03-04T15:36:42.278 INFO [compress] Compression finished.
2026-03-04T15:36:42.279 INFO [compress] Compressed 30.00B into 195.00B (0.15x). Speed: 51.43B/s.

Explanation: CLP-TEXT successfully compresses the directory with the empty file — no errors. Both
engines now handle empty files gracefully.

4. Regression check: normal JSON compression

Task: Verify that normal JSON compression (without empty files) is not affected.

Command:

# Restore clp-s config
cp etc/clp-config.template.json.yaml etc/clp-config.yaml
./sbin/stop-clp.sh && ./sbin/start-clp.sh

./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl

Output:

2026-03-04T15:55:39.069 INFO [compress] Compression job 2 submitted.
2026-03-04T15:55:41.573 INFO [compress] Compression finished.
2026-03-04T15:55:41.573 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 182.53MB/s.

Explanation: Normal JSON compression continues to work correctly with no regressions.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved handling of empty input files from filesystem sources. The system now detects and gracefully skips empty files during the ingestion and parsing workflows, preventing unnecessary processing attempts.

@junhaoliao junhaoliao requested a review from davidlion March 4, 2026 19:47
@junhaoliao junhaoliao requested review from a team and gibber9809 as code owners March 4, 2026 19:47
@junhaoliao junhaoliao added this to the March 2026 milestone Mar 4, 2026
coderabbitai bot (Contributor) commented Mar 4, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 35b5ef1 and cc463c1.

📒 Files selected for processing (2)
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/log_converter/log_converter.cpp

Walkthrough

The PR adds checks to skip empty input files during ingestion and parsing. It detects empty files by querying filesystem size before processing in both JsonParser and log_converter components. Empty files are skipped with informational logging when the input source is the filesystem.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Empty File Handling: components/core/src/clp_s/JsonParser.cpp, components/core/src/clp_s/log_converter/log_converter.cpp | Adds filesystem size checks to detect and skip empty input files before processing, with error-code-based detection and informational logging. Includes necessary headers for filesystem operations. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: skipping empty files during compression to avoid false errors, and references the related issue (#1993). |
