@benjamin-awd
Contributor

@benjamin-awd benjamin-awd commented Oct 25, 2025

Summary

This PR adds the ArrowStream format option for the ClickHouse sink. ArrowStream is a binary format that is more efficient than the existing JSON formats for ingesting log data into ClickHouse, particularly at high throughput.

Vector configuration

  sinks:
    clickhouse:
      type: clickhouse
      endpoint: http://localhost:8123
      database: mydatabase
      table: logs
      format: arrow_stream  # New format option (defaults to JSONEachRow)
      compression: gzip
      auth:
        strategy: basic
        user: default
        password: "${CLICKHOUSE_PASSWORD}"

How did you test this PR?

Tested locally and in a development environment, ingesting data at a rate of a few hundred thousand rows per second.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Closes #24074

Notes

Internal comparison between formats (Vector pointed at two identical tables, the only difference being the insert format):

WITH
    a_log AS
    (
        SELECT
            `table`,
            format,
            rows,
            bytes,
            flush_query_id
        FROM system.asynchronous_insert_log
        WHERE (status = 'Ok') AND (`table` IN ('t1', 't2')) AND (event_time >= (now() - toIntervalMinute(15)))
    ),
    q_log AS
    (
        SELECT
            query_id,
            query_duration_ms
        FROM system.query_log
        PREWHERE (type = 'QueryFinish') AND (query_kind = 'AsyncInsertFlush') AND (event_time >= (now() - toIntervalMinute(15)))
    )
SELECT
    a.`table`,
    a.format,
    count() AS total_flushes,
    sum(a.rows) AS total_rows_inserted,
    formatReadableSize(sum(a.bytes)) AS total_data_inserted,
    sum(q.query_duration_ms) AS total_flush_time_ms,
    sum(a.rows) / sum(q.query_duration_ms / 1000.) AS avg_rows_per_second,
    concat(formatReadableSize(sum(a.bytes) / sum(q.query_duration_ms / 1000.)), '/s') AS avg_bytes_per_second,
    sum(a.rows) / count() AS avg_rows_per_flush
FROM a_log AS a
INNER JOIN q_log AS q ON a.flush_query_id = q.query_id
GROUP BY
    a.`table`,
    a.format
ORDER BY
    a.`table` ASC,
    a.format ASC

Query id: d41311c0-cb10-403c-8d4d-7f6cf6cb8f13

Row 1:
──────
table:                jsoneachrow_table
format:               JSONEachRow
total_flushes:        34084
total_rows_inserted:  42829429 -- 42.83 million
total_data_inserted:  65.39 GiB
total_flush_time_ms:  14707745 -- 14.71 million
avg_rows_per_second:  2912.0323339845772
avg_bytes_per_second: 4.55 MiB/s
avg_rows_per_flush:   1256.5845851425888

Row 2:
──────
table:                arrowstream_table
format:               ArrowStream
total_flushes:        35934
total_rows_inserted:  45153872 -- 45.15 million
total_data_inserted:  17.27 GiB
total_flush_time_ms:  3356282 -- 3.36 million
avg_rows_per_second:  13453.539362902164
avg_bytes_per_second: 5.27 MiB/s
avg_rows_per_flush:   1256.5779484610675

@benjamin-awd benjamin-awd requested a review from a team as a code owner October 25, 2025 14:15
@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Oct 25, 2025
@benjamin-awd benjamin-awd requested a review from a team as a code owner October 25, 2025 14:30
@github-actions github-actions bot added the domain: external docs Anything related to Vector's external, public documentation label Oct 25, 2025
@pront
Member

pront commented Oct 27, 2025

Hi @benjamin-awd, thank you for this contribution. It will take a while to review since it's >2.5k LoC. I was wondering if we can split this somehow. Would it make sense to make a generic arrow codec? Just an idea.

@pront pront self-assigned this Oct 27, 2025
@benjamin-awd
Contributor Author

Hey @pront, thanks for taking a look -- I think it'd be nice to have a generic arrow codec (assuming you mean like lib/codecs/src/encoding/format/arrow.rs) but it's quite tricky because of the batching logic involved. If I'm not wrong, this will require a significant overhaul of Vector's code and I'm not sure if it's something I have bandwidth for at the moment 😕

I think the current approach is a decent compromise since it's relatively generic. The only requirement is an override at the request-builder level, meaning that any sink can implement it.

@benjamin-awd
Contributor Author

So after playing around with Claude code for a bit (probably burned the equivalent of a few trees), it seems that it is somewhat possible to create a generic Arrow codec, although this requires the creation of a BatchEncoder struct and BatchSerializer trait. If that's something you're keen to review, I can split it into a separate PR without the Clickhouse changes https://github.com/vectordotdev/vector/compare/master...benjamin-awd:vector:add-ch-arrow-codec?expand=1

Although of course it will most likely take quite a bit of effort & time to push it through the gate compared to this one
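
As a rough illustration of why batching is the tricky part, the sketch below contrasts a per-event serializer with a batch serializer. Only the name `BatchSerializer` comes from the comment above. The trait signature, the `Event` alias, `EventSerializer`, `LineSerializer`, `ToyColumnarSerializer` and `sample_events` are all hypothetical and are not Vector's API or the code in the linked branch. The toy columnar layout is also not the Arrow IPC stream format; it only shows that a columnar writer needs the whole batch before it can emit anything.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a log event: field name -> string value.
type Event = BTreeMap<String, String>;

/// Per-event encoding: every event is serialized on its own.
trait EventSerializer {
    fn encode(&self, event: &Event, out: &mut Vec<u8>);
}

/// Batch encoding: the serializer receives the whole batch at once,
/// which a columnar layout requires.
trait BatchSerializer {
    fn encode_batch(&self, events: &[Event], out: &mut Vec<u8>);
}

/// Row-oriented toy format: one `key=value,...` line per event.
struct LineSerializer;

impl EventSerializer for LineSerializer {
    fn encode(&self, event: &Event, out: &mut Vec<u8>) {
        let fields: Vec<String> = event.iter().map(|(k, v)| format!("{k}={v}")).collect();
        out.extend_from_slice(fields.join(",").as_bytes());
        out.push(b'\n');
    }
}

/// Column-oriented toy format (NOT Arrow IPC): each column name is written
/// once, followed by that column's value from every event in the batch.
struct ToyColumnarSerializer {
    columns: Vec<String>,
}

impl BatchSerializer for ToyColumnarSerializer {
    fn encode_batch(&self, events: &[Event], out: &mut Vec<u8>) {
        for column in &self.columns {
            out.extend_from_slice(column.as_bytes());
            out.push(b':');
            // Missing fields are written as empty strings.
            let values: Vec<&str> = events
                .iter()
                .map(|event| event.get(column).map(String::as_str).unwrap_or(""))
                .collect();
            out.extend_from_slice(values.join(",").as_bytes());
            out.push(b'\n');
        }
    }
}

/// Two small events used to exercise both serializers.
fn sample_events() -> Vec<Event> {
    [("info", "started"), ("warn", "retrying")]
        .iter()
        .map(|(level, message)| {
            let mut event = Event::new();
            event.insert("level".to_string(), level.to_string());
            event.insert("message".to_string(), message.to_string());
            event
        })
        .collect()
}
```

The per-event serializer can run as events arrive, whereas the columnar one cannot write the `level` column until it has seen every event in the batch, which is why the batching layer has to be involved.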

@pront
Member

pront commented Oct 29, 2025

> So after playing around with Claude code for a bit (probably burned the equivalent of a few trees), it seems that it is somewhat possible to create a generic Arrow codec, although this requires the creation of a BatchEncoder struct and BatchSerializer trait. If that's something you're keen to review, I can split it into a separate PR without the Clickhouse changes master...benjamin-awd:vector:add-ch-arrow-codec?expand=1 (compare)
>
> Although of course it will most likely take quite a bit of effort & time to push it through the gate compared to this one

I think it's worth exploring as a followup instead. Also, it would be very helpful to create a feature request to record these details and gauge community interest.

@benjamin-awd
Contributor Author

Hi @pront, I've pushed the arrow codec changes to #24124. Regarding the feature request, I think it lies somewhere between adding support for batch codecs and supporting Arrow as a serialization format -- this also opens up some interesting paths where we could write to Arrow’s Variant type and write to Parquet without needing a schema, which would be quite nice.

@pront
Member

pront commented Nov 5, 2025

This is awesome @benjamin-awd, I need some time to review but I will make it a priority.

@LaurenceChau

LaurenceChau commented Dec 9, 2025

This feature looks great! Is this feature available in the nightly build Docker image?

@benjamin-awd
Contributor Author

Unfortunately not -- whatever is in this PR should work if you build a custom image but caveat emptor.

After #24288 is merged I'm planning to follow up with another PR for the Clickhouse sink-specific logic

@LaurenceChau

Is #24124 available in the nightly build? This is a very useful feature for reducing ClickHouse cost.

Labels

domain: external docs Anything related to Vector's external, public documentation domain: sinks Anything related to the Vector's sinks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add ArrowStream format to Clickhouse sink

4 participants