Skip to content

Commit 736e2f1

Browse files
MarsBarLeenoatamir
andauthored
[BLOG] Add packaging_painpoints_in_2021 (#509)
Co-authored-by: Noa Tamir <[email protected]>
1 parent 18ee436 commit 736e2f1

File tree

3 files changed

+230
-0
lines changed

3 files changed

+230
-0
lines changed
Lines changed: 228 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,228 @@
1+
---
2+
title: 'Python packaging in 2021 - pain points and bright spots'
3+
published: January 24, 2021
4+
author: ralf-gommers
5+
description: 'This blog post contains the raw
6+
output of the 30-minute brainstorm (only cleaned up for textual issues) and
7+
my annotations on it (in italics) which capture some of the discussion during
8+
the session and links and context that may be helpful.'
9+
category: [Packaging]
10+
featuredImage:
11+
src: /posts/packaging_painpoints_in_2021/blog_feature_org.svg
12+
alt: 'An illustration of a brown and a dark brown hand coming towards each other to pass a business card with the logo of Quansight Labs.'
13+
hero:
14+
imageSrc: /posts/packaging_painpoints_in_2021/blog_hero_var1.svg
15+
imageAlt: 'An illustration of a brown hand holding up a microphone, with some graphical elements highlighting the top of the microphone'
16+
---
17+
18+
At Quansight we have a weekly "Q-share" session on Fridays where everyone can
19+
share/demo things they have worked on, recently learned, or that simply seem
20+
interesting to share with their colleagues. This can be about anything, from
21+
new utilities to low-level performance, from building inclusive communities
22+
to how to write better documentation, from UX design to what legal &
23+
accounting does to support the business. This week I decided to try something
24+
different: hold a brainstorm on the state of Python packaging today.
25+
26+
The ~30 participants were mostly from the PyData world, but not exclusively -
27+
it included people with backgrounds and preferences ranging from C, C++ and
28+
Fortran to JavaScript, R and DevOps - and with experience as end-users,
29+
packagers, library authors, and educators. This blog post contains the raw
30+
output of the 30-minute brainstorm (only cleaned up for textual issues) and
31+
my annotations on it (in italics) which capture some of the discussion during
32+
the session and links and context that may be helpful. I think it sketches a
33+
decent picture of the main pain points of Python packaging for users and
34+
developers interacting with the Python data and numerical computing ecosystem.
35+
36+
The main intended audience for this post is maintainers working on
37+
packaging and distribution tools, and library authors needing to package
38+
their projects. Readers newer to Python packaging may want to start with the
39+
blog posts linked in the Conclusion section.
40+
41+
42+
## Topic 1: The pip/wheel/PyPI world
43+
44+
- PyPI file size limit .deep learning, CUDA 11
45+
46+
_RG: the default file size limit on PyPI is (I believe) 60 MB currently, which is not enough for many libraries with heavy dependencies. You must file an issue to ask for an exemption, which can take quite a while to be resolved because PyPI has only a few volunteers. See, e.g., [this issue where MXNet asks for a 1 GB limit](https://github.com/pypa/pypi-support/issues/243). The issue will get significantly worse with CUDA 11, which is several times larger than CUDA 10. Looking at for example [PyTorch wheels](https://pypi.org/project/torch/#files), which are already 776 MB now, that simply won't fit. When I asked one of the PyPI maintainers recently, the answer was "interesting - that will be a problem"._
47+
48+
- Native dependencies for wheels
49+
- two proposed solutions: pynativelib, or spec to interface with outside packages
50+
51+
_RG: what projects do with non-Python dependencies today is simply don't declare the dependency, and use either static linking or put the needed shared libraries into the wheel with name mangling. The [pynativelib](https://mail.python.org/pipermail/wheel-builders/2016-April/000090.html) idea is technically feasible for simple scenarios, but will always be limited because of how pip, wheels and PyPI work. It's not in use today. The more sensible route seems to be to work out how to let packages interact with external package managers, and figure out what's out of scope completely. I'm not aware of significant work happening on this topic._
52+
53+
- ABI tags are limited: multi-CUDA options, multi-SIMD options
54+
55+
_RG: for packages with compiled code, ABI tags in the wheel filename can be used to specify certain properties, like what Python version, operating system or set of base libraries like glibc the wheel is for. Anything that doesn't have an ABI tag cannot be supported, though. Examples of unsupported features are CUDA version and CPU capabilities for SIMD instruction sets. Therefore projects can only make a single choice there for what they upload to PyPI; everything else must be hosted elsewhere._
56+
57+
- The rise of new (and old) architectures: Linux on ARM64, macOS M1, PowerPC, AIX, Z/OS
58+
59+
_RG: this is mostly a problem for library authors, who typically rely on public CI for each supported platform. When users clamour for macOS M1 support so they can make optimal use of their shiny new hardware, this puts maintainers in a difficult position - CI support appearing is likely to take many months still._
60+
61+
- Uniform install instructions for venv's cannot be given.
62+
63+
_RG: using virtual environments is good practice, but writing install instructions for them is very difficult because (a) there are too many tools and (b) activating environments is clunky and platform-specific (see, e.g., [this section of the Python Packaging User Guide](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments))._
64+
65+
- Setuptools incorporating distutils with a TBD new build API
66+
67+
_RG: `distutils` will be deprecated in Python 3.10 and removed in Python 3.12. This is likely good news in the long term, but for now it means building packages with nontrivial compiled dependencies is going to be unstable. See for example [this issue about `numpy.distutils` and `setuptools` interaction](https://github.com/pypa/setuptools/issues/2372)._
68+
69+
- manylinux: Alpine Linux, and (maybe) the death of CentOS
70+
- Is musl compatible yet?
71+
72+
_RG: I'm not sure if this is a major isssue. Right now Alpine Linux doesn't work with the `manylinux1/2010/2014` wheels on PyPI, because Alpine uses musl libc rather than glibc. The manylinux standard is based on CentOS, which is being discontinued. If that leads to growth of Alpine (perhaps not likely) then we'll be getting more users asking for wheels. Right now all we can say is "that's unsupported, sorry"._
73+
74+
- Isolated builds with `pyproject.toml` are quite complicated to spec, and still unclear how that will hold up in the real world
75+
76+
_RG: Isolated builds are a good idea, but with ABI compatibility constraints one needs to deal with like when building against the NumPy C API, getting `pyproject.toml` right is complicated. See for example [the `oldest-supported-numpy` meta-package](https://github.com/scipy/oldest-supported-numpy/blob/master/setup.cfg) to help with this._
77+
78+
- Filling out configuration is tedious
79+
- Which freaking tool should I use!
80+
- Reproducible builds! (practically impossible) – IPython is reproducible for both `.tar.gz` anf `.zip`, but needs custom post-process.
81+
82+
_RG: See [this blog post from Matthias Bussonnier](https://labs.quansight.org/blog/2020/08/ipython-reproducible-builds/) for more on this topic._
83+
84+
- `setup.py` is dynamic making downstream packaging/automation much harder (including for conda)
85+
- Use [flit](http://flit.readthedocs.org/) – completely static!
86+
87+
_RG: For pure Python packages Flit is great indeed. For packages with compiled code, we'll be stuck with `setup.py` for a while longer, I'm afraid. It is what it is - this has gotten better already with `pyproject.toml` and it's not one of the biggest issues anymore imho._
88+
89+
- Lack of relocation tools in Windows, i.e., mangling a DLL name and change it on DLLs that depend on it
90+
91+
_RG: Probably a gap indeed, not 100% sure. [Mach-O Mach-O mangler](https://github.com/njsmith/machomachomangler) supports Windows, but a Windows equivalent of [auditwheel](https://github.com/pypa/auditwheel) and [delocate](https://github.com/matthew-brett/delocate) seems to be missing._
92+
93+
- This world does not know that the conda world exists.
94+
95+
_RG: yeah, kinda. Devs do know conda and conda-forge exist, but as far as I can tell they do see the world as centered around PyPI, wheels and pip, and conda as one of many other package managers (next to Homebrew, Linux distros, Chocolatey, and so on). Which isn't quite right - PyData users now represent over half of all Python users, and the Conda ecosystem has specific capabilities related to that stack that are (and may remain) missing from PyPI/pip._
96+
97+
- Shipping JupyterLab 3.0 extensions without Node.js is kinda cool.
98+
- Rust/Maturin to develop Python packages is nice.
99+
- The relocation of depending binaries must be done manually in some cases.
100+
101+
102+
## Topic 2: The conda world
103+
104+
- Conda itself is almost unmaintained
105+
106+
_RG: a quick peek at [the conda repo](https://github.com/conda/conda/pulse) indeed shows declining activity. Hopefully the new Conda Community organization will help generate new energy for `conda` contributions._
107+
108+
- Mamba is still a little tricky/slow to install
109+
110+
_RG: this is definitely true if you start with conda; it can take minutes for conda to do the solve to install mamba and its dependencies from conda-forge. With the recently added [Mambaforge](https://github.com/conda-forge/miniforge#mambaforge) installer that problem should be mostly solved now._
111+
112+
- No TensorFlow
113+
- Does conda-forge count?
114+
115+
_RG: this is actually no longer true, the TensorFlow package in `defaults` is reasonably but not completely up to date (2.3.0 CPU and GPU packages at the moment, missing 2.3.1-2.4.1). In general it's a problem indeed that important packages like TensorFlow and PyTorch in `defaults` may be one or even more versions behind._
116+
117+
- Conda-forge is growing too large
118+
- In which respect?
119+
120+
_RG: As discussed in [this blog post on conda performance](https://www.anaconda.com/blog/how-we-made-conda-faster-4-7), the more packages there are in a channel, the slower the conda solves get. Given the ever-growing size of conda-forge, conda is getting slower and slower when used with conda-forge._
121+
122+
- Why oh why are defaults and conda-forge still not using a 100% compatible toolchain??
123+
124+
_RG: This would be so helpful. Not being able to mix `defaults` and `conda-forge`, and users doing it anyway because they need packages from both, is a long-standing issue._
125+
126+
- Conda solves in CI often get really bogged down
127+
- Also conda is never the default in CI so you need to do quite a bit of manipulation on some platforms
128+
- Confusion in the community over the new Anaconda ToS
129+
130+
_RG: Anaconda has done the right thing here imho: make large institutional users pay if they use the `defaults` channel. Packaging is hard and labour-intensive, and having a sustainable revenue stream to offset the cost of packaging difficult libraries, maintaining core tooling like `conda-build`, and serving packages to millions of users is important. Conda-forge and all other channels are still completely free. Unfortunately Anaconda communicated that very poorly, which led to unnecessary confusion. See [this blog posts from the conda-forge team](https://conda-forge.org/blog/posts/2020-11-20-anaconda-tos/)
131+
132+
- There's no way to distinguish packages that provide the same binaries, e.g., `turbojpeg` vs. `libjpeg`, both package `libjpeg`
133+
- Need to separate metadata (especially dependencies) from packages.
134+
135+
_RG: this to me seems like a much smaller issue than with PyPI. The comment was about the ability to patch metadata of published conda packages more easily._
136+
137+
- R community seems slow to adopt it
138+
139+
_RG: I've heard from R users before that conda felt a bit "foreign". In the discussion it was suggested that part of that is that conda is Python-based which can still bubble up to the surface on occasion. And that Mamba, which is written in C++, may find a more positive reception in the R community._
140+
141+
- End-to-end security?
142+
143+
_RG: [Purism](https://puri.sm/) and [this blog post from its devs about security in the software supply chain](https://puri.sm/posts/the-future-of-software-supply-chain-security/) were mentioned._
144+
145+
## Topic 3: Other
146+
147+
- The Conda and PyPA worlds still aren't talking to each other much.
148+
149+
_RG: It's great to see that both the PyPA team and the Conda community are getting better organized. There's now more funding for work on PyPI, Warehouse and Pip; the teams working on it have grown, there's good project management, roadmapping and even Pip user experience research going on. And the Conda community has set up a new organization to align on standards, incubate promising new projects. The next step will hopefully be more productive interactions between those two somewhat separate groups._
150+
151+
- `pip` and `conda` interaction still not great
152+
- I'd like a way to have "safe" (e.g., pure Python) packages so you can confidently mix them. `+100`
153+
- The pip dependency resolver is not great yet and it also ends up with conflicts with conda.
154+
155+
_RG: True. For doing serious scientific computing, data science and ML type work, you often need the combination, using `conda` for the core libraries, and filling in the gaps with `pip` for packages that are not packaged for conda or installing things from source._
156+
157+
158+
## Topic 4: There's always good news too
159+
160+
- [The death of `easy_install`](https://github.com/pypa/setuptools/pull/2544)!!!
161+
- Mambaforge
162+
- macos-arm64 support in conda-forge
163+
164+
_RG: It'll be a while before we get wheels for core PyData packages, but conda-forge already has support, see [this blog post](https://conda-forge.org/blog/posts/2020-10-29-macos-arm64/)
165+
166+
- The new conda community umbrella org
167+
- Pip and PyPI are now well-managed and have a nontrivial amount of funding
168+
- Pip has a solver :) ; the solver is a "backtracking solver" because metadata on PyPI is incomplete :(
169+
170+
171+
## Bonus topic: A puzzle
172+
173+
A challenge with a prize for who posts the most concise correct solution by
174+
tomorrow (note, you do need a very fast internet connection for this): create
175+
a concise recipe for installing the latest versions of NumPy, TensorFlow,
176+
PyTorch, CuPy, Dask, JAX and MXNet in a single environment.
177+
178+
👆🏼 This is a trick question I am sure
179+
180+
_RG: everyone had till the end of the next day to submit a solution. It turns out to be not so simple, the winning entry (with six out of seven packages, JAX is missing) came from Chris Ostrouchov - our local Nix evangelist:_
181+
182+
```bash
183+
$ nix-shell -I nixpkgs=https://github.com/nixos/nixpkgs/archive/986cf21c45b798c97f2424e17b819af3ecf8083e.tar.gz
184+
-p python3Packages.numpy
185+
python3Packages.tensorflow python3Packages.pytorch python3Packages.cupy python3Packages.dask python3Packages.mxnet
186+
```
187+
188+
_My solution:_
189+
190+
```bash
191+
# To create an environment with the most recent version of all mainstream
192+
# Python array/tensor libraries installed:
193+
$ conda create -n many-libs python=3.7
194+
$ conda activate many-libs
195+
$ conda install cudatoolkit=10.2
196+
$ pip install numpy torch jax jaxlib tensorflow mxnet cupy-cuda102 dask toolz sparse # you may need to turn off the resolver if you're on pip >= 20.3.1
197+
198+
# Conda doesn't manage to find a winning combination here; pip has a hard time
199+
# too and probably not all constraints are satisfied, but nothing crashes
200+
# and basic tests work as they are supposed to.
201+
```
202+
203+
## Conclusion
204+
205+
A lot of good things are happening in various places in the Python packaging world. We're in a better situation than several years ago. As the brainstorm content above shows there are challenges as well, but that has always been the case. If I could have just one wish, I'd wish for good program management across and coordination between the PyPA and Conda communities.
206+
207+
This is my blog post though, so I get a few more wishes:
208+
209+
- Merging Conda and Mamba so they share a single installer and UX with two modes ('fast' and 'stable/legacy').
210+
- Strictly enforced metadata correctness and manylinux compliance for new PyPI uploads. The stricter the better, see [packaging-problems #264](https://github.com/pypa/packaging-problems/issues/264).
211+
- GPU hardware for conda-forge so it can test its GPU packages in CI.
212+
- Consolidate the zoo of virtual environment tools (`venv`, `virtualenv`, `poetry`, `pipenv`, `pyenv`, `pyflow` - probably there are even more) so we can actually recommend one to users of our projects.
213+
- Not having to think about problems around Fortran on Windows ever again.
214+
215+
While I'm at it, let me also link to some of the most informative blog posts on the topic I know of and enjoyed reading over the years:
216+
217+
- [Tarek Ziadé - The fate of Distutils - Pycon Summit + Packaging Sprint detailed report (2010)](https://tarekziade.wordpress.com/2010/03/03/the-fate-of-distutils-pycon-summit-packaging-sprint-detailed-report/)
218+
- [Travis Oliphant - Why I promote conda (2013)](http://technicaldiscovery.blogspot.com/2013/12/why-i-promote-conda.html)
219+
- [Jake VanderPlas - Conda: Myths and Misconceptions (2016)](https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/)
220+
- [Wes McKinney - conda-forge and PyData's CentOS moment (2016)](https://wesmckinney.com/blog/conda-forge-centos-moment/)
221+
- [Donald Stufft - Powering the Python Package Index (2016)](https://caremad.io/posts/2016/05/powering-pypi/)
222+
- [Pauli Virtanen - Building Python wheels with Fortran for Windows (2017)](https://web.archive.org/web/20210616203153/https://pav.iki.fi/blog/2017-10-08/pywingfortran.html)
223+
- [Uwe Korn - How we build Apache Arrow's manylinux wheels (2019)](https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html)
224+
- [Pradyun Gedam - Testing the next-gen pip dependency resolver (2020)](https://pradyunsg.me/blog/2020/03/27/pip-resolver-testing/)
225+
- [Sumana Harihareswara - Releasing pip 20.3, featuring new dependency resolver (2020)](https://pyfound.blogspot.com/2020/11/pip-20-3-new-resolver.html)
226+
227+
228+
I hope this sketches a useful picture of where we are today. If we missed any major issues (I'm sure we did), I'd love to hear them!

apps/labs/public/posts/packaging_painpoints_in_2021/blog_feature_org.svg

Lines changed: 1 addition & 0 deletions
Loading

apps/labs/public/posts/packaging_painpoints_in_2021/blog_hero_var1.svg

Lines changed: 1 addition & 0 deletions
Loading

0 commit comments

Comments
 (0)