|
| 1 | +--- |
| 2 | +authors: |
| 3 | + - cj-wright |
| 4 | +tags: [conda-forge] |
| 5 | +--- |
| 6 | + |
| 7 | +# Conda-Forge Operational Risk |
| 8 | + |
| 9 | +Recently I've been thinking about operational risk (op. risk). |
| 10 | +Operational risks arise from failures of processes, for instance a |
| 11 | +missing email, or an automated software system not running properly. |
| 12 | +Many commercial institutions are interested in minimizing op. risk, |
| 13 | +since it is risk that produces no value, as opposed to risks associated |
| 14 | +with investing. This is also something I think about in my job at |
| 15 | +[Lab49](https://www.lab49.com/), where I'm a software engineering |
| 16 | +consultant focusing on financial institutions. I think there is also a |
| 17 | +good analogy for Conda-Forge, even though we are not a commercial |
| 18 | +outfit. In this case the risk we incur isn't the potential for lost |
| 19 | +earnings but frustration for our users and maintainers in the form of |
| 20 | +bugs and a lackluster user experience. In this post I explore three main |
| 21 | +sources of operational risk for Conda-Forge: Automation, Top-Down |
| 22 | +Control, and Self-Service Structure. |
| 23 | + |
| 24 | +<!--truncate--> |
| 25 | + |
| 26 | +## A Brief Conda-Forge Primer |
| 27 | + |
| 28 | +Conda-Forge is an ecosystem and community that grew around building |
| 29 | +packages for the conda package manager. Conda-Forge uses continuous |
| 30 | +integration services to build packages from GitHub repos called |
| 31 | +feedstocks. This structure enables teams of contributors to maintain |
| 32 | +packages via a pull-request-based workflow. At the time of writing, |
| 33 | +Conda-Forge has over 10,000 feedstocks and ships more than 120 million |
| 34 | +packages a month. |
| 35 | + |
| 36 | +## Self-Service Structure |
| 37 | + |
| 38 | +Conda-Forge is built around a self-service structure for each stage in a |
| 39 | +feedstock's lifecycle. The creation of new feedstocks relies on would-be |
| 40 | +maintainers to submit PRs to staged-recipes. Although language-specific |
| 41 | +help teams and staged-recipes reviewers provide some assistance and |
| 42 | +oversight, the PR submitter plays the most important role in proposing |
| 43 | +the package and shepherding it to acceptance. Once the feedstock is |
| 44 | +accepted, the maintenance is federated, with most upkeep being performed |
| 45 | +by the maintainers, who have extensive permissions and control over the |
| 46 | +feedstock. If fixes or updates are needed for a package, maintainers and |
| 47 | +users are encouraged to open their own pull requests. |
| 48 | + |
| 49 | +This structure can present a few challenges for minimizing op. risk. The |
| 50 | +most important challenge is the disconnect between feedstock maintainers |
| 51 | +and users. While most maintainers are package users, most of our users |
| 52 | +are not maintainers, and are unlikely to become maintainers. The |
| 53 | +disparity between maintainers and users can come from a few sources, |
| 54 | +some under our control and others not. For instance, we can write better |
| 55 | +documentation, lowering the barrier to entry, but we don't have control |
| 56 | +over how our users' incentive structures value Conda-Forge |
| 57 | +contributions. This produces a gap in representation in the Conda-Forge |
| 58 | +organizational structure, where non-maintainer users' issues and |
| 59 | +desires are not communicated to maintainers and Core. |
| 60 | + |
| 61 | +For instance, are we serving the needs of developers using our |
| 62 | +binaries as dependencies of code they are compiling locally? As another |
| 63 | +example, are there support gaps for developers and scientists using |
| 64 | +Conda-Forge in academic and government laboratories, who might not have |
| 65 | +the skills or capacity to fix feedstocks? Our reliance on the public |
| 66 | +GitHub platform may prevent some users without access from raising their |
| 67 | +concerns. Since these users may be under-represented, we don't even know |
| 68 | +if we are meeting their needs and how best to help. |
| 69 | + |
| 70 | +## Top-Down Control |
| 71 | + |
| 72 | +While the majority of Conda-Forge's permissions structure is federated, |
| 73 | +certain important parts are centralized, with the Core developers making |
| 74 | +key decisions. Often these decisions are focused on stability of the |
| 75 | +ecosystem, for instance which versions of languages to support. |
| 76 | +Additionally, maintenance and enhancements to the Conda-Forge |
| 77 | +infrastructure are mostly performed by Core developers. |
| 78 | + |
| 79 | +However, the Core developers are usually experienced feedstock |
| 80 | +maintainers, expert conda users, and have bought into the Conda-Forge |
| 81 | +ecosystem and mission. This means that decisions can be made without the |
| 82 | +perspective of new users or maintainers, or of potential users who |
| 83 | +are skeptical of the Conda-Forge approach. |
| 84 | + |
| 85 | +For instance, decisions about application binary interface (ABI) pins are |
| 86 | +usually made by Core, although these changes have impacts on downstream |
| 87 | +maintainers. It is possible that most maintainers don't know what |
| 88 | +these pins are, how they are changed, or how those changes affect their |
| 89 | +feedstocks. |
| 90 | + |
| 91 | +## Automation |
| 92 | + |
| 93 | +Automation has been used to great effect to make Conda-Forge possible. |
| 94 | +The various bots and web services enable Conda-Forge's current scale, |
| 95 | +providing help and support, from running builds to bumping versions and |
| 96 | +checking feedstock quality. However, this automation presents its own |
| 97 | +operational risks and magnifies existing operational risks. |
| 98 | + |
| 99 | +Automation has a tendency to fail when we least expect it, and often we |
| 100 | +lack the ability to fix it. The January 2018 Travis-CI outage is a great |
| 101 | +example of this, where the CI service we were using for macOS builds |
| 102 | +experienced reduced capacity and then a complete outage, causing builds |
| 103 | +to queue for days. Recently there was a sudden decrease in the number of |
| 104 | +parallel builds on Azure, causing a similar queue of builds. Automation |
| 105 | +can also cause issues by enabling users to make decisions without all the |
| 106 | +needed information. While many feedstocks have effective smoke tests for |
| 107 | +their packages, the autotick bot doesn't currently check for new |
| 108 | +dependencies, potentially leading to missing or incorrect package |
| 109 | +metadata. |
| 110 | + |
| 111 | +## Conclusion |
| 112 | + |
| 113 | +Overall, Conda-Forge has managed its operational risk well. Most |
| 114 | +importantly, Conda-Forge's transparent open source nature allows us to |
| 115 | +address these issues head-on by engaging with the community. |