GH-36411: [C++][Python] Some fixes for sdist build and wheel building#1

Closed
raulcd wants to merge 1 commit into WillAyd:use-meson-python from raulcd:raulcd-use-meson-python

Conversation


@raulcd raulcd commented Jan 15, 2026

@WillAyd this is a PR against your branch. I thought it would be better to show you first! 😄

I have tested all the wheels and sdist at:
apache#48882

@github-actions

❌ GitHub issue apache#36411 could not be retrieved.

Comment on lines +218 to +219
# Avoid building Pyarrow if it is a source distribution.
if not get_option('sdist')

Author

The only change in this file is this if and endif; the rest is just formatting.

Owner

I wonder if we can change arrow_dep = [] in the top-level configuration to arrow_dep = disabler() when building an sdist; that should circumvent the need to constantly branch like this, I think?

Author

I tried; it fails with the same issue. I could add the changes in this specific file to the existing sdist option check (instead of having two separate checks):

diff --git a/python/meson.build b/python/meson.build
index c65a837868..f4622e09af 100644
--- a/python/meson.build
+++ b/python/meson.build
@@ -18,7 +18,6 @@
 project(
     'pyarrow',
     'cython',
-    'cpp',
     version: run_command(
         'python',
         '-m',
@@ -51,8 +50,10 @@ endif
 
 # https://github.com/mesonbuild/meson-python/issues/647
 if get_option('sdist')
-    arrow_dep = []
+    arrow_dep = disabler()
 else
+    add_languages('cpp', native: false)
+    cc = meson.get_compiler('cpp')
     arrow_dep = dependency(
         'arrow',
         'Arrow',
@@ -86,6 +87,4 @@ print(incdir)
     numpy_dep = declare_dependency(include_directories: incdir_numpy)
 endif
 
-cc = meson.get_compiler('cpp')
-
 subdir('pyarrow')

but the changes to avoid building Cython (python/pyarrow/meson.build) are also still necessary.

Owner

Ah OK. No strong preference here - I think it's a pretty minor consideration. Happy to keep as is.

Owner

WillAyd commented Jan 15, 2026

Nice thanks!

> The meson size seems 15% bigger than the current one, we might want to investigate if we are including something unnecessary afterwards. I've done a diffoscope between a sdist from main and the sdist from meson and the differences seem fine.

You are referring to the sdist right? And you are saying that you've diffed the files before/after and they are the same?

Author

raulcd commented Jan 15, 2026

Yes, I've diffed the sdists. There are some differences; for example, we must remove possible Python cache directories and some other minor things, but I'll validate both sdist and wheels exhaustively once I fix the wheels too.
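Stray cache artifacts like these can be caught mechanically when diffing sdists. A minimal sketch (the helper name and the exact check are mine, not part of the PR):

```python
import tarfile

def find_cache_entries(sdist_path):
    """Return member names in an sdist tarball that look like Python
    cache artifacts (__pycache__ directories or .pyc files)."""
    with tarfile.open(sdist_path) as tar:
        return [
            name for name in tar.getnames()
            if "__pycache__" in name.split("/") or name.endswith(".pyc")
        ]
```

Running this against both sdists would show at a glance whether the meson-python one picked up cache directories that setuptools excluded.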

Owner

WillAyd commented Jan 15, 2026

Ah nice catch. Although I have to say I'm surprised meson-python wouldn't take care of excluding the cache

delocate-listdeps ${source_dir}/python/dist/*.whl

echo "=== (${PYTHON_VERSION}) Bundle shared libraries into wheel ==="
delocate-wheel -w ${source_dir}/python/repaired_wheels -v ${source_dir}/python/dist/*.whl

Owner

This is required for macOS, but we don't need repairwheel on Linux? That seems a bit odd.

I ran into this a lot with Windows and it was called out during reviews, so worth double checking the need for this before going back

Author

Indeed, there seems to be something wrong with the manylinux wheels too. We should probably test those wheels on a clean environment where Arrow hasn't been built (and can't be found).
I can't find the libarrow shared libraries in the manylinux wheels, but they are included in the existing wheels; I looked at the latest uploaded to PyPI.
I started investigating the following conversations:

From what I can see there, it seems like "this shouldn't be meson's concern" but should instead be tackled with the auditwheel, delocate, and delvewheel tools.
With CMake we had the PYARROW_BUNDLE_ARROW_CPP flag and the bundle_arrow_lib function managing that; maybe we should ask on your PR.
BTW I am triggering CI jobs on a WIP PR: apache#48882 because I don't want to "pollute" your PR.

Owner

@WillAyd WillAyd Jan 20, 2026

Yea that's correct - for building and distributing to another system we need to use any of those tools.

However, building and running on the same system, I would think the "system" libraries would still be locatable at runtime.

Owner

I think (as you've already found) Meson really doesn't want to include this in its scope, likely because of all the differences across platforms.

> With CMake we had the PYARROW_BUNDLE_ARROW_CPP flag and the bundle_arrow_lib function managing that; maybe we should ask on your PR.

I admittedly struggle to read CMake so I might be overlooking something, but this function seems to just place the existing arrow library into the wheel without any name mangling or patching of existing libraries, right? If so, I think the pushback we would get is that's an incomplete solution that can easily lead to conflicts with the system, particularly depending on how runtime libraries have already been loaded. I'm also not sure how well that works on Windows.

In either case, I'm no expert, so may not hurt to ask again :-)

Author

@raulcd raulcd Jan 20, 2026

I will try to have all wheels green on my PR and validate that the contents are the expected ones. I'll manually check against the published 23.0.0 wheels. Once this is done I'll share the approach and ask for feedback. There's also something I want to investigate: it feels slightly complex/verbose to have to declare all the -Csetup-args="-Dvar=${PYARROW_WITH_VAR}" arguments, and I wonder about the developer experience when working locally. I'll see if I can investigate that too after I fix the Windows wheels.
Thanks for all the work you did here!

Owner

@WillAyd WillAyd Jan 20, 2026

> wondering about the developer experience when working locally.

I usually do -Csetup-args="-Dauto_features=enabled" to enable everything, and then explicitly disable things that aren't available on my system (ex: -Csetup-args="-Dcuda=disabled")

The whole -Csetup-args=... I find to be a poor experience, particularly since you have to repeat that pattern for every setting. I believe that limitation is enforced by pip:

https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-C

Author

I have done more digging on the manylinux side: we are already calling auditwheel repair, which is why it wasn't failing. Before auditwheel repair, running auditwheel -v show on the original wheel:

The following external shared libraries are required by the wheel:
{
    "libarrow.so.2400": "/tmp/arrow-dist/lib/libarrow.so.2400.0.0",
    "libarrow_acero.so.2400": "/tmp/arrow-dist/lib/libarrow_acero.so.2400.0.0",
    "libarrow_compute.so.2400": "/tmp/arrow-dist/lib/libarrow_compute.so.2400.0.0",
    "libarrow_dataset.so.2400": "/tmp/arrow-dist/lib/libarrow_dataset.so.2400.0.0",
    "libarrow_flight.so.2400": "/tmp/arrow-dist/lib/libarrow_flight.so.2400.0.0",
    "libarrow_substrait.so.2400": "/tmp/arrow-dist/lib/libarrow_substrait.so.2400.0.0",
    "libparquet.so.2400": "/tmp/arrow-dist/lib/libparquet.so.2400.0.0"
}

After auditwheel repair, running auditwheel -v show on the repaired wheel:

            "libparquet-84e0c9ca.so.2400.0.0": {
                "soname": "libparquet-84e0c9ca.so.2400.0.0",
                "path": "/tmp/tmps3l07x0n/pyarrow.libs/libparquet-84e0c9ca.so.2400.0.0",
                "realpath": "/tmp/tmps3l07x0n/pyarrow.libs/libparquet-84e0c9ca.so.2400.0.0",
                "platform": {
                    "_elf_osabi": "ELFOSABI_SYSV",
                    "_elf_class": 64,
                    "_elf_little_endian": true,
                    "_elf_machine": "EM_X86_64",
                    "_base_arch": "<Architecture.x86_64: 'x86_64'>",
                    "_ext_arch": null,
                    "_error_msg": null
                },
                "needed": [
                    "libarrow-85ba3dab.so.2400.0.0",
                    "libdl.so.2",
                    "libpthread.so.0",
                    "libstdc++.so.6",
                    "libm.so.6",
                    "libgcc_s.so.1",
                    "libc.so.6",
                    "ld-linux-x86-64.so.2"
                ]
            },

I initially thought we were missing the libraries because they are in a different location: auditwheel places the required shared libraries under pyarrow.libs, whereas bundling with CMake copied them (installing under the pyarrow subfolder), and we then linked against those and bundled them into the wheel.
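The relocation described above can be confirmed by listing the wheel archive; a rough sketch (pyarrow.libs is the directory auditwheel used in the logs above, but the helper itself is illustrative, not from the PR):

```python
import zipfile

def bundled_shared_libs(wheel_path, libs_dir="pyarrow.libs"):
    """List the shared libraries that a repair tool (auditwheel here;
    delocate/delvewheel on other platforms) vendored into the wheel
    under its vendored-libraries directory."""
    with zipfile.ZipFile(wheel_path) as whl:
        return [
            name for name in whl.namelist()
            if name.startswith(libs_dir + "/")
            and (".so" in name or name.endswith((".dll", ".dylib")))
        ]
```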

Owner

Ah nice find. I wonder if we need to update get_library_dirs for the auditwheel-fixed wheels then to avoid test failures? At a glance, I don't see that as adding pyarrow.libs to the return value (I had to do a similar fix for Windows with delvewheel)
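A sketch of what that extension might look like (a hypothetical helper; pyarrow's actual get_library_dirs lives in pyarrow/__init__.py and differs):

```python
import os

def get_library_dirs_with_vendored(package_dir):
    """Given the installed pyarrow package directory, also report the
    sibling directory that auditwheel bundles shared libraries into
    (pyarrow.libs), so downstream extension builds can link to them.
    In practice package_dir would be os.path.dirname(pyarrow.__file__)."""
    dirs = [package_dir]
    vendored = os.path.join(os.path.dirname(package_dir), "pyarrow.libs")
    if os.path.isdir(vendored):
        dirs.append(vendored)
    return dirs
```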

Owner

Alternatively there might be a setting for the 3 tools to control where the libraries are getting bundled relative to the package root

-DARROW_WITH_ZLIB=%ARROW_WITH_ZLIB% ^
-DARROW_WITH_ZSTD=%ARROW_WITH_ZSTD% ^
-DCMAKE_BUILD_TYPE=%CMAKE_BUILD_TYPE% ^
-DCMAKE_INSTALL_PREFIX=C:\arrow-dist ^

Owner

Can you point me to the logs for this one? I'm surprised you'd have to do anything here - does this not get executed within a conda environment?

Author

We use vcpkg for the dependencies of our Windows wheels, not conda. This is the previous failure:
https://github.com/ursacomputing/crossbow/actions/runs/21254041726/job/61163356938

Owner

Thanks. That's interesting that CMake installs the ArrowComputeConfig.cmake file but then can't find it on a search with the default path.

Checking the logs I see that ArrowComputeConfig.cmake gets installed to C:/Program Files/arrow/lib/cmake/ArrowCompute/ArrowComputeConfig.cmake. The CMake docs suggest that a search location of <prefix>/(lib/<arch>|lib*|share)/cmake/<name>*/ is a Unix convention, so I wonder if it's a bug in the Arrow CMake config to install it there in the first place?

Author

Sorry, there has been some trial and error. The initial failure was on:
https://github.com/ursacomputing/crossbow/actions/runs/21252214504/job/61156610086

-- Installing: C:/arrow-dist/lib/cmake/ArrowCompute/ArrowComputeConfig.cmake

We use C:/arrow-dist nowadays. The log I sent you is from when I tried installing on the system.

Author

All tests (for each commit) can be found on this PR and the triggered archery jobs:
apache#48882

Author

Ok, if we install on the system and define --cmake-prefix-path=C:\Program Files\arrow, the wheel is also built successfully (it is able to find arrow and its sublibraries), but bundling the libraries fails on delvewheel repair; last check:
https://github.com/ursacomputing/crossbow/actions/runs/21364405244/job/61492024534
We should investigate where delvewheel looks for them and potentially fix the installation path as discussed previously.

Owner

I'll have to check later, but I remember running into this issue already. Delvewheel has an --include-imports option that will include shared libraries, but the standard CMake packaging solution puts the .lib and .dll files into separate directories, which ultimately is not bundled correctly. That may be part of, or the entire, issue.

See also adang1345/delvewheel#66

Owner

There's also some discussion on the original issue about this, where I started to install conda on the Windows CI job because it was the only workaround I could think of at the time

apache#45854 (comment)

Author

Interesting. In the meantime I was able to build the wheel with delvewheel, passing the dll folders via --add-path "C:\Program Files\arrow\bin".
The current issue is that 4 tests fail. All those tests seem to require the libs (building extensions, for example). I'll try --include-imports and keep investigating. https://github.com/ursacomputing/crossbow/actions/runs/21366559041/job/61499813405

My idea is to have a PR where, if possible, all wheels are successful to share and discuss all those things we've found to try and come up with "better" solutions. I'll rebase all commits to have a cleaner history to share.

Owner

> All those tests seem to require the libs

Are you referring to the actual .lib file? I thought Windows always required that when using shared libraries to get the symbol visibility to work, i.e. just providing the .dll doesn't make things work. I think that's why those tests are failing.

> My idea is to have a PR where, if possible, all wheels are successful to share and discuss all those things we've found to try and come up with "better" solutions. I'll rebase all commits to have a cleaner history to share.

👍

%PYTHON_CMD% -m delvewheel repair -vv ^
--ignore-existing --with-mangle ^
--add-path "C:\Program Files\arrow\bin" ^
--include-imports "C:\Program Files\arrow\lib" ^

Owner

@WillAyd WillAyd Jan 27, 2026

I'm unclear how the crossbow jobs are getting triggered off of this, but my expectation is that this won't work given the .lib and .dll files are in a different folder. Let me know

Author

You are correct, I was just trying something. I am going to test copying *.lib to the bin folder. At this point I am just trying to learn more about what delvewheel expects.

Owner

Sounds good. An alternative is to build the Windows libraries as static, although I don't recall if I checked how much of a difference that made to package size

Author

The wheel was built and both *.lib and *.dll were bundled, but three tests building Cython extensions are currently failing. I think there's something going on with pa.get_libraries or pa.get_library_dirs and those newly bundled libraries. I'll have to keep investigating:
https://github.com/ursacomputing/crossbow/actions/runs/21398885144/job/61604207220

Owner

There was something similar I had to patch for conda environments to get it to resolve to the lib/bin files correctly. See https://github.com/apache/arrow/blob/ddca98cf82e5cd7a7057ee785a75e8c31e4d820c/python/pyarrow/__init__.py#L431

We probably need to extend that logic to include the directory that delvewheel is bundling these into within the wheel


%PYTHON_CMD% -m delvewheel repair -vv ^
--ignore-existing --with-mangle --include-imports ^
--no-mangle "arrow.dll;arrow_python.dll;arrow_acero.dll;arrow_dataset.dll;arrow_flight.dll;arrow_substrait.dll;parquet.dll" ^

Owner

@WillAyd WillAyd Jan 28, 2026

I think this makes sense currently, as there are places in the Python code that expect these names. delvewheel would probably say this makes it much easier to mix up system-installed libraries with vendored copies, which could lead to strange runtime behavior... but that wouldn't be a new problem, I think.

Author

The last wheel on Windows was successful 🥳
https://github.com/ursacomputing/crossbow/actions/runs/21437338927/job/61731199957

I am going to do a final validation for wheels and I am going to squash commits, rebase main (also on your branch) and try to prepare things to get merged on your branch so we can share all the different findings.

Owner

Awesome - nice work!

Author

ok, all wheels are green as seen here 🥳 :
apache#48882 (comment)

The sdist failure is related to LICENSE.txt not being packaged:

../meson.build:18:0: ERROR: File LICENSE.txt does not exist.

I think the wheels are also not packaging the LICENSE and NOTICE files; they are green only because we are not validating those at the moment. This is something I am currently fighting with setuptools too, and I am adding a step for it when validating the wheels.
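A validation step along those lines could be as simple as inspecting the wheel archive. A sketch (the .dist-info layout shown is the conventional one for wheels, not taken from this PR):

```python
import zipfile

def wheel_license_files(wheel_path):
    """Return LICENSE/NOTICE entries found inside the wheel's .dist-info
    directory, where build backends are expected to place license files."""
    with zipfile.ZipFile(wheel_path) as whl:
        return [
            name for name in whl.namelist()
            if ".dist-info/" in name
            and name.rsplit("/", 1)[-1].startswith(("LICENSE", "NOTICE"))
        ]
```

A CI check would then simply fail when this returns an empty list.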

Owner

Hmm yea, I had some upstream discussions about this. Will link when I find them again, but I think the best solution is to symlink the LICENSE/NOTICE.txt files when building from the source tree; then, as part of meson.add_dist_script, we need to copy in the actual files and replace the relative paths ../LICENSE.txt and ../NOTICE.txt in the meson.build file with just LICENSE.txt and NOTICE.txt, respectively.
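A minimal sketch of such a dist script (the function name, paths, and the exact license_files spelling are assumptions; Meson does export MESON_DIST_ROOT and MESON_SOURCE_ROOT to scripts registered via meson.add_dist_script):

```python
import os
import shutil

def fix_dist_licenses(dist_root, source_root):
    """Replace the LICENSE.txt/NOTICE.txt symlinks in the dist tree with
    real copies of the repository's files, then rewrite '../LICENSE.txt'
    style references in the packaged meson.build so that building from
    the sdist, outside the source tree, still finds them.

    In a real dist script, dist_root/source_root would come from the
    MESON_DIST_ROOT/MESON_SOURCE_ROOT environment variables."""
    for name in ("LICENSE.txt", "NOTICE.txt"):
        dest = os.path.join(dist_root, name)
        if os.path.lexists(dest):  # drop the symlink if present
            os.unlink(dest)
        shutil.copy(os.path.join(source_root, name), dest)

    build_file = os.path.join(dist_root, "meson.build")
    with open(build_file) as f:
        content = f.read()
    for name in ("LICENSE.txt", "NOTICE.txt"):
        content = content.replace(f"'../{name}'", f"'{name}'")
    with open(build_file, "w") as f:
        f.write(content)
```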

Owner

See https://github.com/apache/arrow/pull/45854/files#r2600276684 and the corresponding comment. On a second look, I never fully implemented that solution, but the pieces are there

Owner

One of the blockers to implementing that was a hesitation to use symlinks because of poor Windows support, but I don't think that matters much more now that Windows 10 is EOL. See also apache#47141 (comment) with some more background info

@raulcd raulcd force-pushed the raulcd-use-meson-python branch 2 times, most recently from 883ac0b to 3176285 Compare January 29, 2026 11:49
@raulcd raulcd force-pushed the raulcd-use-meson-python branch from 3176285 to 946977c Compare January 29, 2026 11:53
@raulcd raulcd changed the title GH-36411: [C++][Python] Some fixes for sdist build GH-36411: [C++][Python] Some fixes for sdist build and wheel building Jan 29, 2026
Author

raulcd commented Jan 29, 2026

I've rebased your main branch (use-meson-python) against the latest main and I've squashed the commits on this branch so it's easier to understand, removing a bunch of trial and error.
If you are OK with it, we can add this commit to your branch and keep working there on improvements and discussions. Currently all wheels and sdist jobs are green with this.

@github-actions

❌ GitHub issue apache#36411 could not be retrieved.

Owner

WillAyd commented Jan 29, 2026

Yep lets do it - thanks!

shutil.copy(notice_file, dest_dir)
dest_notice = dest_dir / 'NOTICE.txt'
dest_notice.unlink(missing_ok=True)
shutil.copy(notice_file, dest_notice)

Owner

I think this script will still need to modify the top-level meson.build file to convert '../LICENSE.txt' into 'LICENSE.txt' (ditto for NOTICE) - otherwise building from the sdist outside of the source tree won't work.

Author

Which meson.build do you refer to?
I can see:
license_files: ['LICENSE.txt'] in python/meson.build

Owner

Ah nevermind - as long as we have symlinks from python/LICENSE.txt to LICENSE.txt then nothing needs to change.

I was thinking about the situation where we only had the top-level LICENSE.txt. In that case, when building from the Arrow source repo, python/meson.build would reference license_files: ['../LICENSE.txt'], and we would have to change that to license_files: ['LICENSE.txt'] in the dist script while copying the files in, if the symlink approach ends up getting rejected.

Author

yes, the links are still necessary (I tried removing them) otherwise meson complains about missing licenses :)

Author

raulcd commented Jan 30, 2026

ok! Closing this! I've just pushed the commit to your branch!

@raulcd raulcd closed this Jan 30, 2026