Late last week, Project Pythia encountered failing BinderHub builds on their JS2 BinderHub (see FreshDesk ticket). @GeorgianaElena and I both picked this up today.
@GeorgianaElena noticed that the ephemeral storage on the node was close to full (and warnings to that effect in the debug logs). The ephemeral storage is used by Docker via hostPath (see also https://binderhub-service.readthedocs.io/en/latest/explanation/architecture.html) to store the build cache and other dockerd state. We acknowledged that we've not seen this particular error across our other BinderHub deployments, and @GeorgianaElena's hypothesis was that this cluster is special: as a JS2 cluster, the user node pool never scales down to zero. This means that the build cache held by Docker grows over time and is never cleared, until eventually we run out of space. This contrasts with other clusters, which periodically scale down and so start again with an empty cache.
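One quick way to sanity-check this is to look at how long the user node has been up, and whether the kubelet reports disk pressure on it. A minimal sketch, assuming kubectl access to the JS2 cluster (the node name is a placeholder):

```bash
# A node pool that never scales to zero will show nodes with a very large AGE
kubectl get nodes -o wide

# Check the node's ephemeral-storage capacity/allocatable and its DiskPressure condition
kubectl describe node <node-name> | grep -iE "ephemeral-storage|DiskPressure"
```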
@GeorgianaElena took a look at the filesystem to support this suggestion:
$ df -hT
Filesystem Type Size Used Available Use% Mounted on
overlay overlay 28.9G 20.8G 8.1G 72% /
tmpfs tmpfs 64.0M 0 64.0M 0% /dev
/dev/vda1 ext4 28.9G 20.8G 8.1G 72% /etc/hosts
/dev/vda1 ext4 28.9G 20.8G 8.1G 72% /dev/termination-log
/dev/vda1 ext4 28.9G 20.8G 8.1G 72% /etc/hostname
/dev/vda1 ext4 28.9G 20.8G 8.1G 72% /etc/resolv.conf
shm tmpfs 64.0M 0 64.0M 0% /dev/shm
/dev/vda1 ext4 28.9G 20.8G 8.1G 72% /var/lib/docker
/dev/vda1 ext4 28.9G 20.8G 8.1G 72% /var/lib/binderhub-binderhub/docker-api
tmpfs tmpfs 5.9G 3.0M 5.9G 0% /run/binderhub-binderhub/docker-api
tmpfs tmpfs 58.8G 12.0K 58.8G 0% /run/secrets/kubernetes.io/serviceaccount
none tmpfs 29.4G 0 29.4G 0% /tmp
It can be seen that the filesystem backing /var/lib/docker (/dev/vda1) is already at 72% usage, with only ~8 GB free.
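To pin the usage on Docker's build cache specifically (rather than, say, pulled images), we could also ask dockerd for a breakdown. A minimal sketch, assuming we can exec into the pod running dockerd and that the docker CLI (plus the buildx plugin, for the second command) is available there; the namespace and pod name are placeholders:

```bash
# Summarise Docker's disk usage: images, containers, local volumes, build cache
kubectl exec -n <namespace> <docker-api-pod> -- docker system df

# Per-entry breakdown of the build cache (needs the buildx plugin)
kubectl exec -n <namespace> <docker-api-pod> -- docker buildx du
```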
We thought about different ways of resolving this. In normal usage, a low-occupancy BinderHub on another provider is unlikely to hit this, because its nodes scale down and the cache is discarded. On high-occupancy clusters this failure mode may well reappear. In any case, for this investigative work we were thinking only of JS2.
There are several possible approaches to a solution:
- Always clear the build cache after each build
- Disable layer caching at the repo2docker level (easier)
- Encourage k8s to provide a new node for building (which has a fresh cache) when the storage is near full
- Run a job to clear up the storage when it gets too full (a rough sketch follows this list)
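For the last option, the job could be as simple as a script that prunes the build cache whenever the filesystem backing /var/lib/docker crosses a usage threshold, run periodically next to dockerd (e.g. as a sidecar or CronJob). A rough sketch, assuming GNU df and access to the Docker socket; the 80% threshold is illustrative, not a decision:

```bash
#!/bin/sh
# Prune Docker's build cache if the filesystem holding /var/lib/docker
# is more than THRESHOLD percent full.
THRESHOLD=80
USED=$(df --output=pcent /var/lib/docker | tail -n 1 | tr -dc '0-9')

if [ "$USED" -ge "$THRESHOLD" ]; then
  # Drop all build cache entries; a --filter 'until=...' could keep recent layers
  docker builder prune --all --force
fi
```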
The easiest solution that doesn't entirely disable layer caching is to try to signal to k8s that a new node is required for builds once the cache approaches full, while still allowing k8s to schedule user pods on the nearly-full node. This could worsen startup times for builds if the node pool keeps resizing back down to 1, and there may be other behaviours depending upon the pod packing strategy.
I implemented a patch to the cluster that lets us specify the ephemeral-storage resource request per build. This is a bit of a sledgehammer -- it's not clear whether we can make an informed guess about the cache requirements of a random build in the same way that we do about memory requirements for singleuser.
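For reference, the effect is roughly that of a build pod carrying an ephemeral-storage request, so that a pod which no longer fits on the nearly-full node stays Pending and (with a cluster autoscaler) triggers a fresh node. A minimal sketch of such a pod, purely illustrative -- the name, image and the 10Gi figure are placeholders, not what the patch sets:

```bash
# Illustrative only: a pod whose ephemeral-storage request forces the scheduler
# to find (or the autoscaler to create) a node with enough unreserved disk.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: example-build
spec:
  restartPolicy: Never
  containers:
    - name: build
      image: busybox
      command: ["sleep", "3600"]
      resources:
        requests:
          ephemeral-storage: 10Gi
EOF
```

One caveat: the scheduler compares requests against the node's allocatable ephemeral storage minus what other pods have requested, not against the disk's actual free space, which is part of why this feels like a blunt instrument.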
Given that we have restarted the problematic node, we probably don't need to identify the fix today. So, there's now a chance to step back and think about what a proper solution looks like.
Note: There are a few assumptions here, so it's worth a second pair of eyes to validate the approach.