Conversation
|
Merging this PR will trigger the following deployment actions.

Support deployments: No support upgrades will be triggered
Staging deployments
Production deployments
|
|
FYI we've had two outages recently due to regressions in Helm null handling. I noticed it first with Project Pythia (incident report), and @GeorgianaElena noticed it in EarthScope. Although merging that PR has meant that EarthScope no longer has any null problems, we still see regressions for Project Pythia:

```bash
function test-helm() {
    local version="${1:?need arg}"
    test -d outputs && rm -r outputs
    mkdir outputs
    echo "Testing $version"
    podman run --rm -it -v "$PWD/outputs:/outputs" -v "$PWD:/app" -w /app \
        "alpine/helm:$version" template helm-charts/basehub \
        --values=config/clusters/projectpythia-binder/binderhub.values.yaml \
        --output-dir /outputs >/dev/null
    # note: the ** glob needs bash with `shopt -s globstar` enabled
    grep 'hub\.jupyter\.org/node-purpose' outputs/**/*.yaml
}

versions=("3.17.0" "3.17.1" "3.20.0" "4.1.1")
for version in "${versions[@]}"; do
    test-helm "$version"
    printf "\n\n"
done
```

I dug into this, and it's caused by changes to sibling merging (of values set in a child chart). Here's a reproducer: https://github.com/agoose77/reproducer-helm-merge-changes

I've opened a bug report here: helm/helm#31919

Anecdotally, it feels like Helm has been grappling with these bugs for a while.

So what do we do? It seems like this is only a bug in this particular merging scenario. We could either update the binderhub chart to filter out these nulls, or opt to set the value only once via e.g. jsonnet. Alternatively, we wait for Helm to fix this ... but that will likely take quite a while. I'm happy to dig in and figure out the fix for Helm, but I'm not sure that's the best use of our time. |
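For context on why the cluster config relies on null at all: Helm documents that setting a value to `null` in an overriding values file deletes the corresponding key during values coalescing. Here's a minimal Python sketch of that documented behaviour (a toy model, not Helm's actual implementation; the `nodeSelector` example mirrors the basehub default but is illustrative):

```python
def coalesce(base: dict, override: dict) -> dict:
    """Toy model of Helm values coalescing: the override wins,
    nested dicts merge recursively, and an explicit None
    (YAML null) deletes the key entirely."""
    result = dict(base)
    for key, value in override.items():
        if value is None:
            result.pop(key, None)  # null deletes the default
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = coalesce(result[key], value)
        else:
            result[key] = value
    return result

# Hypothetical chart default, mirroring basehub's node-purpose selector:
chart_defaults = {"nodeSelector": {"hub.jupyter.org/node-purpose": "core"}}
# A hub's values file sets the selector to null to drop it:
hub_values = {"nodeSelector": None}

print(coalesce(chart_defaults, hub_values))  # → {} : the selector is gone
```

The regression is that in the affected sibling-merge scenario, the null no longer deletes the default, so the selector leaks back into the rendered manifests.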
|
@agoose77 would upgrading to helm cause this regression to come back? Or is the concern that there will be other bugs that we run into? |
|
I think merging via helm 4 would break at least one cluster at the moment, given the open bug report. |
|
@agoose77 I looked into that, and I agree. I also think we can't fix this with |
|
@agoose77, I believe the only place where this breaks is when we set the selector to null here, right?

infrastructure/config/clusters/projectpythia-binder/binderhub.values.yaml Lines 122 to 124 in a22201b

As a workaround, until Helm fixes this, why not stop setting this node-purpose selector in the basehub:

infrastructure/helm-charts/basehub/values.yaml Lines 45 to 46 in a22201b

and instead set it in each hub's config? I know it's extra work, but we can write a script that does it, and it shouldn't take us much time. |
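The migration script mentioned above could be quite small. A hedged sketch (the directory layout and the exact YAML shape of the selector snippet are assumptions based on the paths quoted in this thread, so verify them against the repo before running anything like this):

```python
from pathlib import Path

# YAML snippet each hub would carry explicitly once the basehub chart
# stops providing it as a default. The key layout here is an assumption.
SELECTOR_SNIPPET = (
    "jupyterhub:\n"
    "  hub:\n"
    "    nodeSelector:\n"
    "      hub.jupyter.org/node-purpose: core\n"
)

def add_selector(values_text: str) -> str:
    """Append the selector snippet unless the file already mentions the
    label (including hubs that deliberately set it to null)."""
    if "hub.jupyter.org/node-purpose" in values_text:
        return values_text  # already configured one way or another - skip
    return values_text.rstrip("\n") + "\n\n" + SELECTOR_SNIPPET

def migrate(clusters_dir: str) -> list[str]:
    """Rewrite every *.values.yaml under the clusters directory,
    returning the paths that were changed."""
    changed = []
    for path in Path(clusters_dir).glob("*/*.values.yaml"):
        original = path.read_text()
        updated = add_selector(original)
        if updated != original:
            path.write_text(updated)
            changed.append(str(path))
    return changed
```

A real version would merge into the existing `jupyterhub:` block rather than appending a new one, which naive text appending gets wrong for files that already have that top-level key.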
|
@GeorgianaElena I'm a bit nervous because it might happen to other hubs. We'd need to validate that each hub is not currently broken by the upgrade. Future hub changes could also trigger these bugs, but we'd catch them in production just as we would in local development. (Not ideal to know about a Helm bug that we might step on at any time, but manageable.) I haven't yet taken the view "we're going to do this, how do we do it safely" — let me put that hat on now. |
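Validating each hub could be mechanised along the lines of the earlier grep: render every cluster with both Helm versions via `helm template --output-dir`, then scan the rendered manifests for the selector and diff the reports. A sketch of the scanning half (the label string is taken from the grep in this thread; everything else is illustrative):

```python
from pathlib import Path

LABEL = "hub.jupyter.org/node-purpose"

def find_selector_leaks(rendered_dir: str) -> dict[str, list[int]]:
    """Scan a directory of rendered manifests and report every line
    still carrying the node-purpose selector, keyed by file path.
    Running this on outputs from two Helm versions and comparing the
    results flags any hub where the null stopped taking effect."""
    hits: dict[str, list[int]] = {}
    for path in Path(rendered_dir).rglob("*.yaml"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if LABEL in line:
                hits.setdefault(str(path), []).append(lineno)
    return hits
```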
Tested it locally, and seems fine. Also bumps the kubectl version.

Removes a lot of superfluous comments that had drifted and were no longer accurate.

Once this is merged, we would need to make sure everyone upgrades their local install of helm to v4 as well.

Ref 2i2c-org#7470
|
I see this as basically us having found an issue in helm, and @agoose77 opening the issue upstream clearly helped - there's a fix coming in helm/helm#31946. We could just wait for that to land, and then deploy it. And yes, this is something we have to do, sooner rather than later :) |
|
Ok, I'm not sure we can wait for helm/helm#31946 - it looks LLM-generated, it's the author's first PR in helm, and they've recently opened a few hundred PRs across a few hundred other repos. I do believe this will eventually get fixed in helm, but that PR may not be it. Another option for us to consider is to move it to basehub/values.jsonnet and selectively apply it based on the name of the cluster. Normally I wouldn't suggest doing that, but given this is a time-limited regression, I'm ok with it. We'd also need to add a piece of documentation to any jetstream cluster saying that it needs an exception written in there. |
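The selective-apply idea could look roughly like the following. This is a Python stand-in for the proposed basehub/values.jsonnet conditional; the exception list and key layout are hypothetical, named here only to show the shape of the logic:

```python
# Clusters that must NOT get the selector (hypothetical list; in the
# jsonnet version this would be the documented per-cluster exception).
EXCEPTIONS = {"projectpythia-binder"}

def node_selector_values(cluster_name: str) -> dict:
    """Stand-in for a values.jsonnet conditional: emit the node-purpose
    selector only for non-exception clusters, so no hub ever needs to
    null it out and the buggy merge path is never exercised."""
    if cluster_name in EXCEPTIONS:
        return {}
    return {"nodeSelector": {"hub.jupyter.org/node-purpose": "core"}}
```

The win is that the default moves from "set everywhere, deleted via null where unwanted" to "set only where wanted", sidestepping null coalescing entirely.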