@benjamin-awd
Contributor

@benjamin-awd benjamin-awd commented Oct 31, 2025

Summary

This PR adds a generic Arrow encoder that supports Apache Arrow IPC serialization. It lets sinks (e.g. ClickHouse) serialize and transmit structured events more efficiently than row-oriented, text-based formats. With a unified Arrow serialization layer, Vector can interoperate more easily with Arrow-native systems and improve performance for columnar workflows.
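
To illustrate the row-to-column idea behind columnar encoding, here is a minimal sketch using only the Rust standard library. It is not the encoder from this PR and does not use the arrow crate; `Row`, `ColumnBatch`, and `rows_to_columns` are hypothetical names chosen for the example. Arrow tracks nulls with a validity bitmap, and a `Vec<bool>` stands in for that bitmap here.

```rust
/// A simplified row-oriented event (hypothetical; not Vector's event type).
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub host: String,
    /// `None` models a missing (null) value.
    pub status: Option<u16>,
}

/// A column-oriented batch: one contiguous vector per field.
#[derive(Debug, Default, PartialEq)]
pub struct ColumnBatch {
    pub host: Vec<String>,
    /// Null slots hold a placeholder 0; consult `status_valid` before reading.
    pub status: Vec<u16>,
    /// Validity mask: `true` where the matching `status` slot is non-null.
    pub status_valid: Vec<bool>,
}

/// Pivot a slice of rows into columns, pre-allocating each column
/// to the batch size so no vector has to grow while filling.
pub fn rows_to_columns(rows: &[Row]) -> ColumnBatch {
    let mut batch = ColumnBatch {
        host: Vec::with_capacity(rows.len()),
        status: Vec::with_capacity(rows.len()),
        status_valid: Vec::with_capacity(rows.len()),
    };
    for row in rows {
        batch.host.push(row.host.clone());
        batch.status.push(row.status.unwrap_or(0));
        batch.status_valid.push(row.status.is_some());
    }
    batch
}
```

Calling `rows_to_columns` on a batch of rows yields one vector per field, which is the general shape a columnar format serializes. A real Arrow encoder would additionally build typed arrays against a schema and write them as IPC messages.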

Vector configuration

An example of how this configuration would look with a sink:

sinks:
  my_clickhouse:  # placeholder sink ID
    type: clickhouse
    host: http://localhost:8123
    table: my_table
    batch_encoding:
      codec: arrow_stream

How did you test this PR?

Tested using a ClickHouse sink implementation (not included in this PR, to keep the scope limited).

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Split from: #24075 (comment)
Related: #24074 (requires this to be implemented)
Related: #1374 -- should hopefully allow this to move forward

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

Member

@pront pront left a comment

This is starting to look great. There are some failing checks that need attention.

@pront pront changed the title from "feat(codecs): add arrow IPC stream batch codec" to "feat(codecs): add arrow IPC stream batch encoder" Nov 11, 2025
@pront
Member

pront commented Nov 12, 2025

make check-clippy failed. Once all checks pass, we can queue this for merging!

@benjamin-awd
Contributor Author

@pront should be good now. I've been manually running cargo clippy --workspace --all-targets -- -D warnings && cargo fmt --all, but realized its output differs from that of make check-clippy.

I had to downgrade cmake because of incompatibilities with 4.x; it seems to work with 3.30:

export PATH=/opt/cmake-3.30/CMake.app/Contents/bin:$PATH && make check-clippy

@pront
Member

pront commented Nov 13, 2025

had to use a downgraded version of cmake due to incompatibilities with 4.x

Thank you for raising this, feel free to create a new issue to track this.

@pront pront enabled auto-merge November 13, 2025 15:27
@pront pront added this pull request to the merge queue Nov 14, 2025
Merged via the queue into vectordotdev:master with commit 41e3849 Nov 14, 2025
44 checks passed
elkh510 pushed a commit to elkh510/vector that referenced this pull request Nov 19, 2025
* enhancement(clickhouse sink): add support for ArrowStream format

* refactor: do not allow dynamic database/table for arrow

* chore: add type assertions

* refactor: use clearer function names in request builder

* chore: add snafu for request builder

* chore: move imports to top package

* chore: handle templates for table and database

* build(deps): move arrow under external libs

* build(deps): bump arrow/arrow-schema to latest

* chore: add support for all timeunits e.g. nanoseconds

* chore: add handling for LowCardinality

* chore: add support for Clickhouse decimal

* chore: add capacity and string zero copy optimization

* chore: use logging macros from extern crate

* chore: add debug logs for schema

* chore: use simpler message for NoSchemaProvided snafu

* chore: DRY target scale within decimal builders

* docs: add note about upcasting i128 -> i256

* chore: move i256 import statement to top

* chore: remove redundant schema log

* chore: only log schema a single time at startup

* chore: use arrow payload size instead of redundant json size

* chore: add support for uint

* chore: add support for rfc3339 timestamp

* refactor: move arrow to clickhouse sink subfolder

* refactor: move Arrow encoder to shared util

* chore: remove redundant feature gates

* refactor: rename schema.rs -> arrow_schema.rs

* chore: add changelog fragment

* chore: add helper function to handle clickhouse type parsing

* refactor: remove default coercions in arrow_schema

* refactor: add constants for decimal precision values

* refactor: extract type modifier unwrapping into helper

* refactor: remove unnecessary vector allocation

* refactor: organize type mapping with comments

* refactor: use constants and functional style in decimal parsing

* chore: update LICENSE-3rdparty

* chore: remove duplicated tests

* chore: update generated docs

* chore: remove redundant docstring

* refactor: create generic arrow codec

* chore: remove duplicated changelog

* chore: cargo fmt

* chore: fix lock file

* chore: clippy

* chore: add missing int/float types

* make fmt

* update licenses

* chore: remove schema provider logic

* chore: add with_capacity for helper functions

* chore: add support for null values

* chore: clean up encode_events_to_arrow_ipc_stream

* chore: use macros for primitives and null constraint checks

* add links to changelog

* chore: clippy

* chore: fix duplicate events

* chore: remove duplicated import

* chore: clippy

---------

Co-authored-by: Pavlos Rontidis <[email protected]>
@LaurenceChau

LaurenceChau commented Dec 12, 2025

I saw this PR was merged. Is the new arrow_stream format available in the nightly build now? We are also trying to reduce our ClickHouse costs by adopting the ArrowStream format.

@benjamin-awd
Contributor Author

benjamin-awd commented Dec 12, 2025

@LaurenceChau not yet, only after #24373 is merged

I haven't added support for complex types like arrays, tuples, etc. -- I'm planning to add that in another PR to reduce the total LOC

Labels

  • domain: codecs (Anything related to Vector's codecs (encoding/decoding))
  • domain: sinks (Anything related to Vector's sinks)
