
Investigate build failures on Project Pythia BinderHub #6499

@agoose77


Late last week, Project Pythia encountered failing BinderHub builds on their JS2 BinderHub (see FreshDesk ticket). @GeorgianaElena and I both picked this up today.

@GeorgianaElena noticed that the ephemeral storage on the node was close to full (with warnings to that effect in the debug logs). The ephemeral storage is used by Docker via hostPath (see also https://binderhub-service.readthedocs.io/en/latest/explanation/architecture.html) to store the build cache and other dockerd state. We acknowledged that we've not seen this particular error across our other BinderHub deployments, and @GeorgianaElena's hypothesis was that this cluster is special: as a JS2 cluster, the user node pool never scales down to zero. This means that the build cache held by Docker grows over time and is never cleared, so eventually we run out of space. This contrasts with other clusters, which periodically scale down and thereby empty the cache.

@GeorgianaElena inspected the filesystem to support this hypothesis:

$ df -hT
Filesystem           Type            Size      Used Available Use% Mounted on
overlay              overlay        28.9G     20.8G      8.1G  72% /
tmpfs                tmpfs          64.0M         0     64.0M   0% /dev
/dev/vda1            ext4           28.9G     20.8G      8.1G  72% /etc/hosts
/dev/vda1            ext4           28.9G     20.8G      8.1G  72% /dev/termination-log
/dev/vda1            ext4           28.9G     20.8G      8.1G  72% /etc/hostname
/dev/vda1            ext4           28.9G     20.8G      8.1G  72% /etc/resolv.conf
shm                  tmpfs          64.0M         0     64.0M   0% /dev/shm
/dev/vda1            ext4           28.9G     20.8G      8.1G  72% /var/lib/docker
/dev/vda1            ext4           28.9G     20.8G      8.1G  72% /var/lib/binderhub-binderhub/docker-api
tmpfs                tmpfs           5.9G      3.0M      5.9G   0% /run/binderhub-binderhub/docker-api
tmpfs                tmpfs          58.8G     12.0K     58.8G   0% /run/secrets/kubernetes.io/serviceaccount
none                 tmpfs          29.4G         0     29.4G   0% /tmp

As the output shows, /var/lib/docker sits on the root filesystem, which is already at 72% usage.
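The usage figure above can also be pulled out programmatically, which any automated mitigation would need to do. A minimal sketch (the helper name `disk_use_pct` is ours, not something from the deployment):

```shell
# Extract the Use% column from POSIX df output for a given mount point.
# A monitoring or cleanup job could alert on this value before builds
# start failing, rather than discovering it in the debug logs.
disk_use_pct() {
    df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

disk_use_pct /   # reports 72 on the node captured above
```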

We thought about different ways of resolving this. In normal usage, a low-occupancy BinderHub on other providers is unlikely to see this, but on high-occupancy clusters the failure mode may become apparent again. In any case, for this investigative work we considered only JS2.

There are several possible approaches to a solution:

  1. Always clear the build cache after each build
  2. Disable layer caching at the repo2docker level (easier)
  3. Encourage k8s to provide a new node for building (which has a fresh cache) when the storage is nearly full
  4. Run a job to clear up the storage when it gets too full
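Option 4, for example, could be as small as a periodic script running where dockerd's filesystem is visible. A sketch, assuming an 80% threshold and a `--keep-storage` budget that would both need tuning (the `command -v docker` guard just lets the sketch run harmlessly where docker is absent):

```shell
#!/bin/sh
# Sketch of option 4: prune the Docker build cache once the filesystem
# backing /var/lib/docker crosses a usage threshold. The 80% threshold
# and 10GB keep-storage value are illustrative, not tuned.
THRESHOLD=80
used=$(df -P /var/lib/docker 2>/dev/null | awk 'NR==2 { sub("%", "", $5); print $5 }')
if [ "${used:-0}" -ge "$THRESHOLD" ] && command -v docker >/dev/null; then
    # --all drops unused cache entries, not just dangling ones;
    # --keep-storage retains the most recently used layers.
    docker builder prune --all --force --keep-storage 10GB
fi
```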

The easiest solution that doesn't entirely disable layer caching is to signal to k8s that a new node is required once the cache approaches full, while still allowing k8s to schedule user pods on the nearly-full node. This could worsen startup times for builds if the node pool keeps resizing back down to 1, and there may be other behaviours depending on the pod-packing strategy.

I implemented a patch to the cluster that lets us specify the ephemeral-storage resource request per build. This is a bit of a sledgehammer -- it's not clear whether we can make an informed guess about the cache requirements of a random build in the same way that we do about memory requirements for singleuser.
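For reference, an ephemeral-storage request on a build pod spec might look like the following; the 10Gi/20Gi figures are placeholders, not values from the patch:

```yaml
# Illustrative fragment of a build pod spec. Requesting ephemeral-storage
# lets the kube-scheduler place the pod only on nodes with enough free
# local disk; exceeding the limit gets the pod evicted.
resources:
  requests:
    ephemeral-storage: "10Gi"
  limits:
    ephemeral-storage: "20Gi"
```

Whether a sensible default exists for arbitrary repos is exactly the open question above.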

Given that we have restarted the problematic node, we probably don't need to identify the fix today. So, there's now a chance to step back and think about what a proper solution looks like.

Note

There are a few assumptions here, so it's worth a second pair of eyes to validate the approach.
