Commit b24c975

jaimergp and CJ-Wright committed
add blog/2020-07-02-op-risk.md
Co-authored-by: cj-wright <[email protected]>
1 parent c49e927 commit b24c975

1 file changed

+115
-0
lines changed

blog/2020-07-02-op-risk.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
1+
---
2+
authors:
3+
- cj-wright
4+
tags: [conda-forge]
5+
---
6+
7+
# Conda-Forge Operational Risk
8+
9+
Recently I've been thinking about operational risk (op. risk).
10+
Operational risks arise from failures of processes, for instance a
11+
missing email, or an automated software system not running properly.
12+
Many commercial institutions are interested in minimizing op. risk,
13+
since it is risk that produces no value, as opposed to risks associated
14+
with investing. This is also something I think about in my job at
15+
[Lab49](https://www.lab49.com/), where I'm a software engineering
16+
consultant focusing on financial institutions. I think there is also a
17+
good analogy for Conda-Forge, even though we are not a commercial
18+
outfit. In this case the risk we incur isn't the potential for lost
19+
earnings but frustration for our users and maintainers in the form of
20+
bugs and lackluster user experience. In this post I explore three main
21+
sources of operational risk for Conda-Forge: Automation, Top-Down
22+
Control, and Self-Service Structure.
23+
24+
<!--truncate-->
25+
26+
## A brief conda-forge primer
27+
28+
Conda-Forge is an ecosystem and community that grew around building
29+
packages for the conda package manager. Conda-Forge uses continuous
30+
integration services to build packages from GitHub repos called
31+
feedstocks. This structure enables teams of contributors to maintain
32+
packages via a pull-request-based workflow. At the time of writing,
33+
Conda-Forge has over 10,000 feedstocks and ships more than 120 million
34+
packages a month.
35+
36+
## Self-Service Structure
37+
38+
Conda-Forge is built around a self-service structure for each stage in a
39+
feedstock's lifecycle. The creation of new feedstocks relies on would-be
40+
maintainers to submit PRs to staged-recipes. Although language-specific
41+
help teams and staged-recipes reviewers provide some assistance and
42+
oversight, the PR submitter plays the most important role in proposing
43+
the package and shepherding it to acceptance. Once the feedstock is
44+
accepted, maintenance is federated, with most upkeep being performed
45+
by the maintainers, who have extensive permissions and control over the
46+
feedstock. If fixes or updates are needed for a package, maintainers and
47+
users are encouraged to open their own pull requests.
48+
49+
This structure can present a few challenges for minimizing op. risk. The
50+
most important challenge is the disconnect between feedstock maintainers
51+
and users. While most maintainers are package users, most of our users
52+
are not maintainers, and are unlikely to become maintainers. The
53+
disparity between maintainers and users can come from a few sources,
54+
some under our control and others not. For instance, we can write better
55+
documentation, lowering the barrier to entry, but we don't have control
56+
over how our users' incentive structures value Conda-Forge
57+
contributions. This produces a gap in representation in the Conda-Forge
58+
organizational structure, where non-maintainer users' issues and
59+
desires are not communicated to maintainers and Core.
60+
61+
For instance, are we servicing the needs of developers using our
62+
binaries as dependencies for code they are compiling locally? As another
63+
example, are there support gaps for developers and scientists using
64+
Conda-Forge in academic and government laboratories, who might not have
65+
the skills or capacity to fix feedstocks? Our reliance on the public
66+
GitHub platform may prevent some users without access from raising their
67+
concerns. Since these users may be under-represented, we don't even know
68+
if we are meeting their needs and how best to help.
69+
70+
## Top-Down Control
71+
72+
While the majority of Conda-Forge's permissions structure is federated,
73+
certain important parts are centralized, with the Core developers making
74+
key decisions. Often these decisions are focused on stability of the
75+
ecosystem, for instance which language versions to support.
76+
Additionally, maintenance and enhancements to the Conda-Forge
77+
infrastructure are mostly performed by Core developers.
78+
79+
However, the Core developers are usually experienced feedstock
80+
maintainers, expert conda users, and have bought into the Conda-Forge
81+
ecosystem and mission. This means that decisions can be made without the
82+
perspective of new users or maintainers, or of potential users who
83+
are skeptical of the Conda-Forge approach.
84+
85+
For instance, decisions about application binary interface pins are
86+
usually made by Core, although these changes affect downstream
87+
maintainers. It is possible that most maintainers don't know what
88+
these pins are, how they are changed, and how that affects their
89+
feedstocks.
90+
91+
## Automation
92+
93+
Automation has been used to great effect to make Conda-Forge possible.
94+
The various bots and web services enable Conda-Forge's current scale,
95+
providing help and support by running builds, bumping versions, and
96+
checking feedstock quality. However, this automation presents its own
97+
operational risks and magnifies existing operational risks.
98+
99+
Automation has a tendency to fail when we least expect it, and often we
100+
lack the ability to fix it. The January 2018 Travis-CI outage is a great
101+
example of this, where the CI service we were using for macOS builds
102+
experienced reduced capacity and then a complete outage, causing builds
103+
to queue for days. Recently there was a sudden decrease in the number of
104+
parallel builds on Azure, causing a similar queue of builds. Automation
105+
can also cause issues by enabling users to make decisions without all the
106+
needed information. While many feedstocks have effective smoke tests for
107+
their packages, the autotick bot doesn't currently check for new
108+
dependencies, potentially leading to missing or incorrect package
109+
metadata.
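
To make the dependency gap concrete, here is a minimal sketch of the kind of comparison that could flag it. This is not how the autotick bot works; the function name `missing_dependencies` and both input lists are hypothetical, and the sketch assumes that upstream requirement names match conda package names, which is not always true.

```python
import re


def _package_name(spec):
    """Extract a normalized package name from a requirement string.

    Assumption: the name is everything before the first space or
    version operator, e.g. "numpy>=1.17" or "python >=3.6".
    """
    name = re.split(r"[\s<>=!~;\[(]", spec.strip(), maxsplit=1)[0]
    return name.lower().replace("_", "-")


def missing_dependencies(recipe_requirements, upstream_requirements):
    """Return upstream requirement names that the recipe does not list."""
    recipe_names = {_package_name(spec) for spec in recipe_requirements}
    upstream_names = {_package_name(spec) for spec in upstream_requirements}
    return sorted(upstream_names - recipe_names)


# Hypothetical example: the new upstream release added pyyaml,
# but the recipe's run requirements were never updated.
recipe = ["python >=3.6", "numpy", "requests >=2.0"]
upstream = ["numpy>=1.17", "requests", "pyyaml>=5.1"]
print(missing_dependencies(recipe, upstream))  # ['pyyaml']
```

A check like this only compares names, so it would surface candidates for a maintainer to review rather than decide anything on its own.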
110+
111+
## Conclusion
112+
113+
Overall, Conda-Forge has managed its operational risk well. Most
114+
importantly, Conda-Forge's transparent open source nature allows us to
115+
address these issues head on by engaging with the community.
