---
authors:
- cj-wright
tags: [conda-forge]
---

# The API Territory and Version Number Map

tl;dr Depending on specific version numbers of underlying libraries may be too inaccurate and cause headaches as upstream libraries evolve and change. A more detailed approach is needed. In this post I outline current and potential work on a path towards a more complete inspection of requirements based on APIs and dynamic pinning of libraries.

<!--truncate-->

## What Constitutes a Good Version Number

Version numbers should constitute a set that has the following properties:

1. The set must be unbounded
2. The set must be orderable (maybe)

Of course, sets that meet these requirements might not convey much information about the software they represent beyond whether two things are equivalent and which is older. Note that the requirement to be orderable may not be strictly needed, but it is generally useful when considering the idea of an "upgrade", since it provides a clear delineation between older and newer packages. In many cases, the structure of the version number provides additional information. For some projects the version number includes the date of the release, often following [CalVer](https://calver.org/). Many projects use [semantic versioning](https://semver.org/), which attempts to encode information about the underlying source code's API in the version number.
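
Both properties are easy to see in practice. As a quick sketch (using Python's `packaging` library, which pip vendors for parsing versions; the version strings are arbitrary examples), parsed versions order correctly even when plain string comparison would not:

```python
from packaging.version import Version

# Plain string comparison gets ordering wrong: "1.10.0" sorts before "1.9.2".
assert "1.10.0" < "1.9.2"

# Parsed versions order correctly, giving the clear old/new delineation.
assert Version("1.10.0") > Version("1.9.2")

# CalVer-style date versions parse and order the same way.
assert Version("2020.10.2") > Version("2020.9.30")
```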

## Version Numbers and API Pinning

One of the most important places where version numbers are specified is during the pinning of APIs. Source code often requires specific APIs from the libraries it uses. This requires a pin specifying which versions of the underlying libraries can be used. The package manager then uses these pins to make certain a compatible environment is created.
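
To make the mechanics concrete, here is a sketch of the basic question a package manager asks when applying a pin, again using the `packaging` library (the `scipy` bounds here are hypothetical):

```python
from packaging.specifiers import SpecifierSet

# A hypothetical pin declaring which scipy versions a package accepts.
pin = SpecifierSet(">=1.4,<1.6")

# The solver's core check: does a candidate version satisfy the pin?
assert "1.5.2" in pin
assert "1.6.0" not in pin
```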

However, these pins (or even the lack of pins) produce problems. Firstly, the pins are a one-time, local statement about the current and future, global ecosystem of packages. For instance, a pin of `scipy` to the current major version number may not hold up over time: newer versions of `scipy` may break the API while not changing the major version number. Similarly, the lack of a pin for `scipy` could prove false as the API breaks. Even pins that establish firm upper and lower bounds may turn out to be wrong as new versions of the pinned library restore the missing API. These issues are particularly problematic for dependency systems that tie the pins to a particular version of the source code, requiring a new version to be created just to update the pins. Conda-Forge is able to avoid some of these issues via [repodata patching](https://github.com/conda-forge/conda-forge-repodata-patches-feedstock), dynamically updating a package's stated requirements. Overall this process is fraught: each package depends on different portions of a library's API, so a version bump that breaks one package may leave others unscathed.
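
For intuition, a repodata patch boils down to rewriting the stated dependencies of an already-published package without rebuilding it. Below is a minimal sketch with an entirely hypothetical record; the real feedstock generates such patches against `repodata.json` at a much larger scale:

```python
# An entirely hypothetical repodata record; real records carry more fields.
record = {
    "name": "somepkg",
    "version": "1.0.0",
    "depends": ["python >=3.6", "scipy"],
}

def patch_depends(record):
    """Tighten the scipy pin after an upstream API break is discovered."""
    record["depends"] = [
        "scipy >=1.4,<1.5" if dep.split()[0] == "scipy" else dep
        for dep in record["depends"]
    ]
    return record

print(patch_depends(record)["depends"])
# ['python >=3.6', 'scipy >=1.4,<1.5']
```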

## A Potential Path Forward

All of the above issues are caused by mistaking [the map for the territory](https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation). The map, in this case the version number of a library, cannot accurately represent the territory, the API itself. To fix this issue we need a more accurate description of the territory. Achieving this will not be easy, but I think there is an approach that gets close enough to limit the number of errors.

We need a programmatic way to check if a particular library, for a particular version, provides the required API. I think this can be achieved iteratively, with each step providing additional clarity at the cost of additional implementation difficulty. Note that in the steps below I'm using python packaging as an example, but I imagine that these steps are general enough to apply to other languages and ecosystems.

1. Determine which libraries are requirements of the code. This is provided by tools like [depfinder](https://github.com/ericdill/depfinder), which is starting to be integrated into the Conda-Forge bot systems (although that integration is still highly experimental and being worked on).
2. Determine if a given version of the library provides the needed modules. This could be accomplished by using depfinder to find the imports and using the mapping provided by [libcfgraph](https://github.com/regro/libcfgraph/tree/master/import_maps) between the import names and the versions of packages that ship those imports.
3. Determine if an imported module provides the symbols being imported (see the sketch after this list). This would require a listing of all the symbols in a given python module, including top-level scoped variables, function names, class names, methods, etc.
4. For callables, determine if the used call signature matches the method or function definition.
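
The raw material for the first three steps can be read straight out of the source. Here is a minimal sketch using Python's standard-library `ast` module on a toy snippet (depfinder's real analysis is considerably more thorough):

```python
import ast

# Toy source whose requirements we want to discover.
source = """
import numpy as np
from scipy.optimize import minimize
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        # Step 1: a required library.
        for alias in node.names:
            print("requires module:", alias.name)
    elif isinstance(node, ast.ImportFrom):
        # Steps 2 and 3: the required module and the symbols pulled from it.
        print("requires module:", node.module)
        print("  symbols used:", [alias.name for alias in node.names])
```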

The [depfinder](https://github.com/ericdill/depfinder) project has made significant advances along this path, providing an easy-to-use tool to extract accurate import and package requirement data from source code. Depfinder even handles imports within code blocks that might make the requirement optional or that use the python standard library. Future work on depfinder, including using more accurate maps between imports and package names and providing metadata on package requirements that are collectively exhaustive (for instance, imports of `pyqt4` vs. `pyqt5` in a `try: except:` block), will provide even more accurate information on requirements.
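
The `try: except:` case is worth seeing in code. The hypothetical snippet below needs exactly one Qt binding, yet a naive import scan would report both as requirements:

```python
# Hypothetical conditional import: either binding satisfies the requirement.
try:
    from PyQt5 import QtCore  # preferred, newer binding
except ImportError:
    from PyQt4 import QtCore  # fallback for older environments
```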

At each one of the above stages we can provide significant value to users, maintainers, and source code authors by helping them keep their requirements consistent and warning them when there are conflicts. Conda-Forge can update its repodata as new versions of imported libraries are created, to properly represent whether each version is API compatible with its downstream consumers. Additionally, the tables that list all the symbols and call signatures can be provided to third-party consumers who may want to patch their own metadata or check if a piece of source code is self-consistent in its requirements. This will also help with the loosening of pins, creating more solvable environments for Conda-Forge and other packaging ecosystems. Furthermore, as this tooling matures and becomes more accurate, it can be incorporated into the Conda-Forge bot systems to automatically update dependencies during version bumps and repodata patches, helping reduce maintenance burden.
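
On the provider side, Python's standard-library `inspect` module already gets surprisingly far toward such symbol and signature tables. A sketch of steps 3 and 4 run against the `json` module, chosen purely as a convenient example:

```python
import inspect
import json

# Step 3: the symbols a given module version actually provides.
symbols = [name for name in dir(json) if not name.startswith("_")]
assert "dumps" in symbols

# Step 4: does a recorded call match the function's definition?
sig = inspect.signature(json.dumps)
sig.bind({"a": 1}, indent=2)  # binds cleanly: this call is compatible
try:
    sig.bind(indent=2)  # missing the required positional argument
except TypeError:
    print("call does not match json.dumps' signature")
```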

Tools built from the symbol table can also have impact far beyond Conda-Forge. For instance, the symbol tables could allow source code authors to perform a line-by-line inspection of their code, revealing which lines force the use of older or newer versions of dependencies. This could enable large-scale migrations of source code with surgical precision, enabling developers to extract and rewrite the few lines of code preventing the use of a new version of a library.

## Caveats

There are some important caveats to this approach that need to be kept in mind.

1. All of this work is aimed at understanding the API of a given library; this approach cannot provide insight into the code inside the API, or whether changes there impact downstream consumers. For instance, version updates that fix bugs and security flaws in library code may not change the API at all. From this tooling's perspective there is no reason to upgrade, since the API is not different. Of course there is a strong reason to upgrade in this case, since buggy or vulnerable libraries could be a huge headache and liability for downstream code and should be removed as quickly as possible.
2. Some features may depend on broader adoption by the community. For instance, this approach would benefit greatly from python type hints, since the API could be constrained down to the expected types. Such type constraints would make the API version range much more accurate, as any changes could be detected. However, type hints may not be adopted in the python community at a high enough rate to truly be useful for this application.
3. Source code is fundamentally flexible. There may be knots of code that even this approach could not cut through, especially as multiple languages and runtime module loading come into the picture. My personal hope is that the tooling recognizes when these situations occur, provides its best guess of what is going on, and provides sufficient metadata to users so that they understand the decreased accuracy of the results. Fundamentally the tooling can only provide very educated guesses and context to users, who then need to go figure out what is actually going on inside the code.

## Conclusion

Version-number-based pins are imprecise representations of API compatibility. More accurate representations based on source code inspection would make the Conda-Forge ecosystem more robust and flexible while reducing maintenance burden. Some of the path to achieving this is already built, and the near-future steps can be achieved with current tooling and databases.

0 commit comments

Comments
 (0)