---
authors:
- cj-wright
tags: [conda-forge]
---

# The API Territory and Version Number Map

tl;dr Depending on specific version numbers of underlying libraries may be too inaccurate and cause headaches as upstream libraries evolve and change. A more detailed approach is needed. In this post I outline current and potential work on a path towards a more complete inspection of requirements based on APIs and dynamic pinning of libraries.

<!--truncate-->

## What Constitutes a Good Version Number

Version numbers should constitute a set that has the following properties:

1. The set must be unbounded
2. The set must be orderable (maybe)

Of course, sets that meet these requirements might not convey much information about the software they represent beyond whether two things are equivalent and which is older. Note that the requirement to be orderable may not be strictly needed, but it is generally useful when considering the idea of an "upgrade", since it provides a clear delineation between older and newer packages. In many cases, the structure of the version number provides additional information. For some projects the version number includes the date of the release, often following [CalVer](https://calver.org/). Many projects use [semantic versioning](https://semver.org/), which attempts to encode information about the underlying source code's API in the version number.
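
Both properties are easy to see in practice. As a quick sketch (using Python's `packaging` library, which pip vendors for parsing versions; the version strings are arbitrary examples), parsed versions order correctly even when plain string comparison would not:

```python
from packaging.version import Version

# Plain string comparison gets ordering wrong: "1.10.0" sorts before "1.9.2".
assert "1.10.0" < "1.9.2"

# Parsed versions order correctly, giving the clear old/new delineation.
assert Version("1.10.0") > Version("1.9.2")

# CalVer-style date versions parse and order the same way.
assert Version("2020.10.2") > Version("2020.9.30")
```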

## Version Numbers and API Pinning

One of the most important places where version numbers are specified is during the pinning of APIs. Source code often requires specific APIs from the libraries it uses. This requires a pin specifying which versions of the underlying libraries can be used. The package manager then uses these pins to make certain a compatible environment is created.
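
To make the mechanics concrete, here is a sketch of the basic question a package manager asks when applying a pin, again using the `packaging` library (the `scipy` bounds here are hypothetical):

```python
from packaging.specifiers import SpecifierSet

# A hypothetical pin declaring which scipy versions a package accepts.
pin = SpecifierSet(">=1.4,<1.6")

# The solver's core check: does a candidate version satisfy the pin?
assert "1.5.2" in pin
assert "1.6.0" not in pin
```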

However, these pins (or even the lack of pins) produce problems. Firstly, the pins are a one-time, local statement about the current and future, global ecosystem of packages. For instance, a pin of `scipy` to the current major version number may not hold up over time: newer versions of `scipy` may break the API while not changing the major version number. Similarly, the lack of a pin for `scipy` could prove false as the API breaks. Even pins that establish firm upper and lower bounds may turn out to be wrong as new versions of the pinned library restore the missing API. These issues are particularly problematic for dependency systems that tie the pins to a particular version of the source code, requiring a new version to be created just to update the pins. Conda-Forge is able to avoid some of these issues via [repodata patching](https://github.com/conda-forge/conda-forge-repodata-patches-feedstock), dynamically updating a package's stated requirements. Overall this process is fraught: each package depends on different portions of a library's API, so a version bump that breaks one package may leave others unscathed.
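
For intuition, a repodata patch boils down to rewriting the stated dependencies of an already-published package without rebuilding it. Below is a minimal sketch with an entirely hypothetical record; the real feedstock generates such patches against `repodata.json` at a much larger scale:

```python
# An entirely hypothetical repodata record; real records carry more fields.
record = {
    "name": "somepkg",
    "version": "1.0.0",
    "depends": ["python >=3.6", "scipy"],
}

def patch_depends(record):
    """Tighten the scipy pin after an upstream API break is discovered."""
    record["depends"] = [
        "scipy >=1.4,<1.5" if dep.split()[0] == "scipy" else dep
        for dep in record["depends"]
    ]
    return record

print(patch_depends(record)["depends"])
# ['python >=3.6', 'scipy >=1.4,<1.5']
```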

## A Potential Path Forward

All of the above issues are caused by mistaking [the map for the territory](https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation). The map, in this case the version number of a library, cannot accurately represent the territory, the API itself. To fix this issue we need a more accurate description of the territory. Achieving this will not be easy, but I think there is an approach that gets close enough to limit the number of errors.

We need a programmatic way to check if a particular library, for a particular version, provides the required API. I think this can be achieved iteratively, with each step providing additional clarity at the cost of additional implementation difficulty. Note that in the steps below I'm using python packaging as an example, but I imagine that these steps are general enough to apply to other languages and ecosystems.

1. Determine which libraries are requirements of the code. This is provided by tools like [depfinder](https://github.com/ericdill/depfinder), which is starting to be integrated into the Conda-Forge bot systems (although that integration is still highly experimental and being worked on).
2. Determine if a given version of the library provides the needed modules. This could be accomplished by using depfinder to find the imports and using the mapping provided by [libcfgraph](https://github.com/regro/libcfgraph/tree/master/import_maps) between the import names and the versions of packages that ship those imports.
3. Determine if an imported module provides the symbols being imported (see the sketch after this list). This would require a listing of all the symbols in a given python module, including top-level scoped variables, function names, class names, methods, etc.
4. For callables, determine if the used call signature matches the method or function definition.
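
The raw material for the first three steps can be read straight out of the source. Here is a minimal sketch using Python's standard-library `ast` module on a toy snippet (depfinder's real analysis is considerably more thorough):

```python
import ast

# Toy source whose requirements we want to discover.
source = """
import numpy as np
from scipy.optimize import minimize
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        # Step 1: a required library.
        for alias in node.names:
            print("requires module:", alias.name)
    elif isinstance(node, ast.ImportFrom):
        # Steps 2 and 3: the required module and the symbols pulled from it.
        print("requires module:", node.module)
        print("  symbols used:", [alias.name for alias in node.names])
```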

The [depfinder](https://github.com/ericdill/depfinder) project has made significant advances along this path, providing an easy-to-use tool to extract accurate import and package requirement data from source code. Depfinder even handles imports within code blocks that might make the requirement optional or that use the python standard library. Future work on depfinder, including using more accurate maps between imports and package names and providing metadata on package requirements that are collectively exhaustive (for instance, imports of `pyqt4` vs. `pyqt5` in a `try: except:` block), will provide even more accurate information on requirements.
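
The `try: except:` case is worth seeing in code. The hypothetical snippet below needs exactly one Qt binding, yet a naive import scan would report both as requirements:

```python
# Hypothetical conditional import: either binding satisfies the requirement.
try:
    from PyQt5 import QtCore  # preferred, newer binding
except ImportError:
    from PyQt4 import QtCore  # fallback for older environments
```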

At each one of the above stages we can provide significant value to users, maintainers, and source code authors by helping them keep their requirements consistent and warning them when there are conflicts. Conda-Forge can update its repodata as new versions of imported libraries are created, to properly represent whether each version is API compatible with its downstream consumers. Additionally, the tables that list all the symbols and call signatures can be provided to third-party consumers who may want to patch their own metadata or check if a piece of source code is self-consistent in its requirements. This will also help with the loosening of pins, creating more solvable environments for Conda-Forge and other packaging ecosystems. Furthermore, as this tooling matures and becomes more accurate, it can be incorporated into the Conda-Forge bot systems to automatically update dependencies during version bumps and repodata patches, helping reduce maintenance burden.
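
On the provider side, Python's standard-library `inspect` module already gets surprisingly far toward such symbol and signature tables. A sketch of steps 3 and 4 run against the `json` module, chosen purely as a convenient example:

```python
import inspect
import json

# Step 3: the symbols a given module version actually provides.
symbols = [name for name in dir(json) if not name.startswith("_")]
assert "dumps" in symbols

# Step 4: does a recorded call match the function's definition?
sig = inspect.signature(json.dumps)
sig.bind({"a": 1}, indent=2)  # binds cleanly: this call is compatible
try:
    sig.bind(indent=2)  # missing the required positional argument
except TypeError:
    print("call does not match json.dumps' signature")
```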

Tools built from the symbol table can also have impact far beyond Conda-Forge. For instance, the symbol tables could allow source code authors to perform a line-by-line inspection of their code, revealing which lines force the use of older or newer versions of dependencies. This could enable large-scale migrations of source code with surgical precision, enabling developers to extract and rewrite the few lines of code preventing the use of a new version of a library.

## Caveats

There are some important caveats to this approach that need to be kept in mind.

1. All of this work is aimed at understanding the API of a given library; this approach cannot provide insight into the code inside the API, or whether changes there impact downstream consumers. For instance, version updates that fix bugs and security flaws in library code may not change the API at all. From this tooling's perspective there is no reason to upgrade, since the API is not different. Of course there is a strong reason to upgrade in this case, since buggy or vulnerable libraries could be a huge headache and liability for downstream code and should be removed as quickly as possible.
2. Some features may depend on broader adoption by the community. For instance, this approach would benefit greatly from python type hints, since the API could be constrained down to the expected types. Such type constraints would make the API version range much more accurate, as any changes could be detected. However, type hints may not be adopted in the python community at a high enough rate to truly be useful for this application.
3. Source code is fundamentally flexible. There may be knots of code that even this approach could not cut through, especially as multiple languages and runtime module loading come into the picture. My personal hope is that the tooling recognizes when these situations occur, provides its best guess of what is going on, and provides sufficient metadata to users so that they understand the decreased accuracy of the results. Fundamentally the tooling can only provide very educated guesses and context to users, who then need to go figure out what is actually going on inside the code.

## Conclusion

Version-number-based pins are imprecise representations of API compatibility. More accurate representations based on source code inspection would make the Conda-Forge ecosystem more robust and flexible while reducing maintenance burden. Some of the path to achieving this is already built, and the near-future steps can be achieved with current tooling and databases.

0 commit comments

Comments
 (0)