feat(codecs): add arrow IPC stream batch encoder #24124
pront
left a comment
This is starting to look great. There are some failing checks that need attention.
@pront should be good now, I've been running the checks manually. I had to use a downgraded version of cmake due to incompatibilities with 4.x, but it seems to work with 3.30: `export PATH=/opt/cmake-3.30/CMake.app/Contents/bin:$PATH && make check-clippy`
Thank you for raising this. Feel free to create a new issue to track it.
* enhancement(clickhouse sink): add support for ArrowStream format
* refactor: do not allow dynamic database/table for arrow
* chore: add type assertions
* refactor: use clearer function names in request builder
* chore: add snafu for request builder
* chore: move imports to top package
* chore: handle templates for table and database
* build(deps): move arrow under external libs
* build(deps): bump arrow/arrow-schema to latest
* chore: add support for all timeunits e.g. nanoseconds
* chore: add handling for LowCardinality
* chore: add support for Clickhouse decimal
* chore: add capacity and string zero copy optimization
* chore: use logging macros from extern crate
* chore: add debug logs for schema
* chore: use simpler message for NoSchemaProvided snafu
* chore: DRY target scale within decimal builders
* docs: add note about upcasting i128 -> i256
* chore: move i256 import statement to top
* chore: remove redundant schema log
* chore: only log schema a single time at startup
* chore: use arrow payload size instead of redundant json size
* chore: add support for uint
* chore: add support for rfc3339 timestamp
* refactor: move arrow to clickhouse sink subfolder
* refactor: move Arrow encoder to shared util
* chore: remove redundant feature gates
* refactor: rename schema.rs -> arrow_schema.rs
* chore: add changelog fragment
* chore: add helper function to handle clickhouse type parsing
* refactor: remove default coercions in arrow_schema
* refactor: add constants for decimal precision values
* refactor: extract type modifier unwrapping into helper
* refactor: remove unnecessary vector allocation
* refactor: organize type mapping with comments
* refactor: use constants and functional style in decimal parsing
* chore: update LICENSE-3rdparty
* chore: remove duplicated tests
* chore: update generated docs
* chore: remove redundant docstring
* refactor: create generic arrow codec
* chore: remove duplicated changelog
* chore: cargo fmt
* chore: fix lock file
* chore: clippy
* chore: add missing int/float types
* make fmt
* update licenses
* chore: remove schema provider logic
* chore: add with_capacity for helper functions
* chore: add support for null values
* chore: clean up encode_events_to_arrow_ipc_stream
* chore: use macros for primitives and null constraint checks
* add links to changelog
* chore: clippy
* chore: fix duplicate events
* chore: remove duplicated import
* chore: clippy

---------

Co-authored-by: Pavlos Rontidis <[email protected]>
Saw this PR was merged. Is the new arrow_stream format available in the nightly build now? We are also trying to reduce our ClickHouse cost by adopting the arrow_stream format.
@LaurenceChau not yet, only after #24373 is merged.
I haven't added support for complex types like arrays, tuples, etc. I'm planning to add that in another PR to reduce the total LOC.
Summary
This PR adds a generic Arrow encoder to support Apache Arrow IPC serialization. This will enable sinks (e.g. ClickHouse) to serialize and transmit structured events more efficiently than with row-oriented, text-based formats. By introducing a unified Arrow serialization layer, Vector can interoperate more easily with Arrow-native systems and improve performance for columnar workflows.
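To illustrate the row-versus-columnar distinction, here is a minimal std-only sketch of the pivot a batch encoder has to perform before it can write a columnar batch: a slice of row-shaped events becomes one equal-length column per field, with missing fields recorded as nulls. The names `Value`, `Event`, `Column`, `to_columns` and `sample_events` are hypothetical and exist only for this illustration. They are not the types or functions added by this PR, and the sketch does not produce Arrow IPC bytes.

```rust
use std::collections::BTreeMap;

/// Toy scalar value (hypothetical; a real event value type is much richer).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

/// One row-oriented event: field name -> value.
type Event = BTreeMap<String, Value>;

/// One column: a slot per event, where `None` marks a missing (null) field.
type Column = Vec<Option<Value>>;

/// Pivot row-oriented events into named columns of equal length.
/// `fields` plays the role of a schema: only these fields are extracted.
fn to_columns(events: &[Event], fields: &[&str]) -> BTreeMap<String, Column> {
    let mut columns = BTreeMap::new();
    for &field in fields {
        // Walk every event once per field; absent fields become `None`.
        let column: Column = events.iter().map(|e| e.get(field).cloned()).collect();
        columns.insert(field.to_string(), column);
    }
    columns
}

/// Two sample events; the second one has no `status` field.
fn sample_events() -> Vec<Event> {
    let mut first = Event::new();
    first.insert("host".to_string(), Value::Text("a".to_string()));
    first.insert("status".to_string(), Value::Int(200));

    let mut second = Event::new();
    second.insert("host".to_string(), Value::Text("b".to_string()));

    vec![first, second]
}
```

Calling `to_columns(&sample_events(), &["host", "status"])` yields a `host` column with two text values and a `status` column whose second slot is `None`. A real encoder would then hand typed, nullable columns like these to an Arrow writer instead of keeping them in a map.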
Vector configuration
An example of how this configuration would look with a sink:
How did you test this PR?
Tested using a ClickHouse sink implementation (not included in this PR, to keep the scope limited).
Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
References
Split from: #24075 (comment)
Related: #24074 (requires this to be implemented)
Related: #1374 -- should hopefully allow this to move forward
Notes
* `@vectordotdev/vector` to reach out to us regarding this PR.
* `pre-push` hook, please see this template.
* `make fmt`
* `make check-clippy` (if there are failures it's possible some of them can be fixed with `make clippy-fix`)
* `make test`
* `git merge origin master` and `git push`.
* `Cargo.lock`), please run `make build-licenses` to regenerate the license inventory and commit the changes (if any). More details here.