Skip to content

Conversation

@wgtmac
Copy link
Member

@wgtmac wgtmac commented Feb 8, 2025

Closes #33

@wgtmac
Copy link
Member Author

wgtmac commented Feb 9, 2025

I have managed to add sparrow as a vendored thirdparty dependency to libiceberg. However, there are still two issues that break CI:

  1. It cannot compile on Windows due to int128_t is not supported by MSVC: https://github.com/apache/iceberg-cpp/actions/runs/13217571682/job/36898649736?pr=44.
  2. It cannot compile on MacOS with AppleClang 15.0.0.15000309 since Sparrow README recommends Apple Clang 16 or higher: https://github.com/apache/iceberg-cpp/actions/runs/13217571682/job/36898649906?pr=44.

There are some other minor issues:

  1. It should use PREPEND not APPEND at this line: https://github.com/man-group/sparrow/blob/532bf486243fc396f7a9392820fa7a77071782ad/CMakeLists.txt#L50. Otherwise I have to add list(PREPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_BINARY_DIR}/_deps/sparrow-src/cmake) on my side when using FetchContent to vendor sparrow to avoid error include could not find requested file: sanitizers.
  2. It would be good to add a namespace Sparrow:: when installing the exported config file at https://github.com/man-group/sparrow/blob/532bf486243fc396f7a9392820fa7a77071782ad/CMakeLists.txt#L354.
  3. It should use target-oriented functions like target_compile_definition for https://github.com/man-group/sparrow/blob/532bf486243fc396f7a9392820fa7a77071782ad/CMakeLists.txt#L140 and similar places. Otherwise, projects depending on it are required to add same definitions in order to compile to avoid errors with regards to the date library.

These are not blocking issues at the moment. I need more time to get familiar with its API and best practice. Just want to report what I have found so far. @JohanMabille @Alex-PLACET @pitrou

@JohanMabille
Copy link

JohanMabille commented Feb 10, 2025

Hi there,

Thanks for the test and report! We got similar issues when trying to build and use sparrow in WASM and we are pushing fixes for that (more specifically, the target_compile_definitions that are missing). I'll take the opportunity to fix the other issues you faced, we will release a new version right after.

@wgtmac wgtmac changed the title WIP: explore nanoarrow and sparrow Add Arrow C Data Interface and dependencies of nanoarrow & sparrow Mar 19, 2025
@wgtmac wgtmac marked this pull request as ready for review March 19, 2025 09:51
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably in the end we're only going to want one of the two libraries, right? It seems weird to potentially link up to 4 separate Arrow implementations in one project...

@wgtmac
Copy link
Member Author

wgtmac commented Mar 19, 2025

@lidavidm Yes, this is for the nanoarrow and sparrow comparison and finally one of them will be only used internally by the libiceberg. If libiceberg is pluggable enough, I think we can also remove the libiceberg-arrow and thus the dependency of arrow.

Let me fix the CI first. Then I think we need reach a consensus to choose one.

@lidavidm
Copy link
Member

IMO, it's going to be hard/pointless to choose one until we actually write code that needs it. I hope we can get the ball rolling sooner rather than later on actual implementation/binding work

@wgtmac
Copy link
Member Author

wgtmac commented Mar 19, 2025

Agreed. The point of this PR is to demonstrate that we can use the Arrow C data interface as the main API for file reader and writer and internally we can process the data with the help of nanoarrow/sparrow. This is an independent work. In the meanwhile, I'll try to figure out the minimal set of interfaces that we need and create a PR for PoC later this week or next week.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 19, 2025

1/4 Test #1: arrow_unittest ...................Exit code 0xc0000135
***Exception: 0.01 sec

I have run into an error: https://github.com/apache/iceberg-cpp/actions/runs/13943870527/job/39026477819. It seems that I can get rid of it by removing the sparrow dependency. So I'm inclined to use nanoarrow for now and revisit this issue in the future.

I will polish this PR to depend on only nanoarrow. As the next step, I can add a conversion between ArrowSchema and the Iceberg schema. WDYT? @lidavidm @zhjwpku @Fokko

@lidavidm
Copy link
Member

Sounds good to me

@wgtmac wgtmac changed the title Add Arrow C Data Interface and dependencies of nanoarrow & sparrow Add Arrow C Data Interface and nanoarrow dependency Mar 20, 2025
@zhjwpku
Copy link
Collaborator

zhjwpku commented Mar 20, 2025

Agreed, nanoarrow sounds good to me.

Copy link
Collaborator

@zhjwpku zhjwpku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 20, 2025

@Xuanwo @Fokko Could you help review this?

@Fokko Fokko merged commit 9fd3d53 into apache:main Mar 21, 2025
6 checks passed
@Fokko
Copy link
Contributor

Fokko commented Mar 21, 2025

This looks good to me, thanks @wgtmac for working on this, and thanks @lidavidm and @zhjwpku for the review 👍

@wgtmac
Copy link
Member Author

wgtmac commented Mar 21, 2025

Thank you, @Fokko!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DISCUSS] Columnar data protocol: Arrow or implement a new one?

5 participants