[FEAT] buffer implementation#658

Open
gaurav7261 wants to merge 2 commits into ClickHouse:main from gaurav7261:buffer_implementation

Conversation


@gaurav7261 gaurav7261 commented Feb 13, 2026

Summary

• Buffer implementation: Adds internal record buffering to ClickHouseSinkTask so that records from multiple poll() calls are accumulated and flushed as a single large batch, reducing the number of
inserts to ClickHouse.
• Two new config options: bufferCount (number of records to accumulate before flushing) and bufferFlushTime (max time in ms before flushing regardless of count). Both default to 0 (disabled), preserving
backward compatibility.
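The accumulate-then-flush idea described in the summary can be sketched roughly as follows. This is an illustrative, self-contained sketch, not the connector's actual code: the class and field names (`BufferSketch`, `flushes`) and the plain `String` records are hypothetical stand-ins for the `SinkRecord` handling in `ClickHouseSinkTask`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of count- and time-based buffering, as described in the PR summary.
public class BufferSketch {
    private final int bufferCount;        // flush when this many records accumulate (0 = disabled)
    private final long bufferFlushTimeMs; // flush after this much time even if under the count
    private final List<String> buffer = new ArrayList<>();
    private long lastFlushTime = System.currentTimeMillis();
    int flushes = 0;                      // how many flushes have happened (for illustration)

    BufferSketch(int bufferCount, long bufferFlushTimeMs) {
        this.bufferCount = bufferCount;
        this.bufferFlushTimeMs = bufferFlushTimeMs;
    }

    // Called once per poll() batch; only flushes when a threshold is crossed.
    void put(List<String> records) {
        if (bufferCount <= 0) {           // buffering disabled: insert every batch directly
            flush(records);
            return;
        }
        buffer.addAll(records);
        boolean countReached = buffer.size() >= bufferCount;
        boolean timeReached = bufferFlushTimeMs > 0
                && System.currentTimeMillis() - lastFlushTime >= bufferFlushTimeMs;
        if (countReached || timeReached) {
            flush(new ArrayList<>(buffer));
            buffer.clear();
            lastFlushTime = System.currentTimeMillis();
        }
    }

    private void flush(List<String> records) {
        flushes++; // a real task would insert the batch into ClickHouse here
        System.out.println("flush #" + flushes + " with " + records.size() + " records");
    }

    public static void main(String[] args) {
        BufferSketch sink = new BufferSketch(5, 0);
        sink.put(List.of("r1", "r2")); // 2 buffered, no flush
        sink.put(List.of("r3", "r4")); // 4 buffered, no flush
        sink.put(List.of("r5"));       // threshold hit: one flush of 5 records
    }
}
```

With both thresholds at 0 (the defaults), every `put()` goes straight through, which is what preserves backward compatibility.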

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials


CLAassistant commented Feb 13, 2026

CLA assistant check
All committers have signed the CLA.

@gaurav7261 gaurav7261 force-pushed the buffer_implementation branch 2 times, most recently from 7df69a7 to 6e823d9 on February 13, 2026 at 13:18
@rschu1ze
Member

@antaljanosbenjamin ^^

@antaljanosbenjamin
Member

I think @chernser is a better person to review this, as I don't have experience with this repo, only the Kafka table engine.

@chernser chernser requested a review from mzitnik February 17, 2026 15:53
@chernser
Contributor

@mzitnik would you please take a look?

@gaurav7261 thank you for the contribution!
However there are some immediate comments:

  • a feature description is missing. We require an open issue and a good explanation of why and how. It does not have to be a complete IP, but something we can use to review the PR
  • the feature requires a good set of tests.

@gaurav7261
Author

Yes, I am working on exactly that.

@gaurav7261 gaurav7261 force-pushed the buffer_implementation branch from 6e823d9 to ee627bd on February 17, 2026 at 17:35
@gaurav7261
Author

@mzitnik can you please review now?

@mzitnik
Collaborator

mzitnik commented Feb 17, 2026

@gaurav7261
thank you for your contribution!

As @chernser asked above: can you explain the purpose of the PR and what you are trying to solve?

@gaurav7261
Author

gaurav7261 commented Feb 17, 2026

@mzitnik Our replica.fetch.max.bytes and fetch.max.bytes are 20 MB and 55 MB on the broker, our Kafka throughput is 15 MB/s, and the average record size is 3 KB, so we cannot pull more than 7-9k records per poll and therefore cannot build large batches. I know the current implementation has a loophole: when buffering is enabled, put() returns successfully while records sit in an in-memory buffer, never written to ClickHouse. Since flush() is currently a no-op, offsets get committed for data that exists only in RAM, and if the process crashes or a rebalance happens those records are permanently lost, which breaks at-least-once delivery. Correct me if I am wrong here. I am working on manual offset commits when batching is enabled, the way the Confluent S3 sink connector does. Do you think that is the correct way forward? I am also implementing preCommit() for the case where buffering is enabled; please take a look.
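For reference, the offset-safety pattern being discussed here (preCommit() handing back only offsets already flushed to the sink, the approach the Confluent S3 sink connector also uses) can be sketched like this. It is a minimal illustration with plain maps standing in for Kafka's TopicPartition and OffsetAndMetadata types; the names are hypothetical, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: commit only offsets that were durably flushed, never buffered-only data.
public class PreCommitSketch {
    // partition -> next offset to commit, covering everything durably written to ClickHouse
    final Map<Integer, Long> flushedOffsets = new HashMap<>();

    // Record that a batch up to and including `offset` was flushed for `partition`.
    void markFlushed(int partition, long offset) {
        flushedOffsets.put(partition, offset + 1); // Kafka commits the *next* offset to read
    }

    // Analogue of SinkTask.preCommit(currentOffsets): ignore the framework's view of
    // consumed offsets and hand back only what has actually been flushed. Partitions
    // with nothing flushed yet are omitted, so their committed offsets do not advance.
    Map<Integer, Long> preCommit(Map<Integer, Long> currentOffsets) {
        Map<Integer, Long> safe = new HashMap<>();
        for (Integer partition : currentOffsets.keySet()) {
            Long flushed = flushedOffsets.get(partition);
            if (flushed != null) {
                safe.put(partition, flushed);
            }
        }
        return safe;
    }

    public static void main(String[] args) {
        PreCommitSketch task = new PreCommitSketch();
        task.markFlushed(0, 99);              // partition 0 flushed through offset 99
        Map<Integer, Long> consumed = new HashMap<>();
        consumed.put(0, 150L);                // consumed ahead of what was flushed
        consumed.put(1, 40L);                 // partition 1: nothing flushed yet
        System.out.println(task.preCommit(consumed)); // only partition 0 is committable
    }
}
```

Because partitions with nothing flushed are omitted from the returned map, the framework leaves their committed offsets untouched, so a crash or rebalance replays the buffered-but-unwritten records instead of losing them.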

@gaurav7261 gaurav7261 force-pushed the buffer_implementation branch from 9e115b7 to 6560216 on February 18, 2026 at 18:21
@gaurav7261 gaurav7261 requested a review from mzitnik February 18, 2026 18:21
@gaurav7261
Author

@mzitnik please review again; I have resolved the comments.

@gaurav7261
Author

@mzitnik can you please review it?

Contributor

Copilot AI left a comment


Pull request overview

This PR adds internal record buffering capability to the ClickHouse Kafka Connect sink connector. The implementation allows records from multiple poll() calls to be accumulated in memory and flushed to ClickHouse as a single large batch, reducing the number of insert operations and improving throughput.

Changes:

  • Added two new configuration options: bufferCount (record count threshold) and bufferFlushTime (time-based threshold in milliseconds), both defaulting to 0 (disabled) for backward compatibility
  • Implemented buffering logic in ClickHouseSinkTask with proper offset management to ensure at-least-once delivery semantics
  • Added comprehensive test coverage for buffering behavior, offset management, partition rebalancing, and error handling scenarios

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Files changed:

  • src/main/java/com/clickhouse/kafka/connect/sink/ClickHouseSinkTask.java: Core buffering implementation with put(), flushBuffer(), preCommit(), close(), and stop() logic
  • src/main/java/com/clickhouse/kafka/connect/sink/ClickHouseSinkConfig.java: Added BUFFER_COUNT and BUFFER_FLUSH_TIME configuration options with defaults and validation
  • src/test/java/com/clickhouse/kafka/connect/sink/ClickHouseSinkTaskBufferTest.java: Comprehensive test suite covering buffering, offset management, partition handling, and error scenarios
  • CHANGELOG.md: Added entry for version 1.3.6 documenting the new buffering feature


Comment on lines +68 to +72
this.bufferFlushTime = clickHouseSinkConfig.getBufferFlushTime();
this.bufferingEnabled = this.bufferCount > 0;
this.buffer = this.bufferingEnabled ? new ArrayList<>(this.bufferCount) : new ArrayList<>();
this.lastFlushTime = System.currentTimeMillis();
this.flushedOffsets = new HashMap<>();

Copilot AI Feb 23, 2026


Consider adding validation or a warning when bufferFlushTime > 0 but bufferCount = 0. While the config documentation states that bufferFlushTime is "Only effective when bufferCount > 0", users might miss this and wonder why time-based flushing isn't working. Adding a log warning in the start() method (after line 68) would help users catch this misconfiguration:

if (this.bufferFlushTime > 0 && this.bufferCount == 0) {
    LOGGER.warn("bufferFlushTime is set but will be ignored because bufferCount is 0");
}

Comment on lines 73 to 77

if (bufferingEnabled) {
LOGGER.info("Internal buffering enabled: bufferCount={}, bufferFlushTime={}ms",
bufferCount, bufferFlushTime);
}

Copilot AI Feb 23, 2026


The same log message "Internal buffering enabled: bufferCount={}, bufferFlushTime={}ms" appears in both ClickHouseSinkConfig constructor (line 285) and ClickHouseSinkTask.start() method (line 75). This will result in duplicate log entries when buffering is enabled. Consider removing one of these log statements - the one in ClickHouseSinkConfig seems more appropriate since it logs during configuration parsing, while the one in ClickHouseSinkTask is redundant.

Suggested change
if (bufferingEnabled) {
LOGGER.info("Internal buffering enabled: bufferCount={}, bufferFlushTime={}ms",
bufferCount, bufferFlushTime);
}

Contributor

Copilot AI left a comment


Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.
