How to run custom GPU/nvidia containers specified in workflow #2592
jjhidalgar asked this question in Questions
I'm using the gha-runner-scale-set-controller Helm chart to deploy runners. I have the following requirement:
When using the new helm chart "gha-runner-scale-set-controller", there are two ways I can enable GitHub Actions "container customization", so that I can run a specific image for every job, like this:
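For example, a minimal workflow job that runs inside a specific image (the runner scale set name here is a placeholder):

```yaml
# Minimal example of the job-level container customization meant above:
# the job's steps run inside node:14.16 instead of directly on the runner.
name: container-demo
on: workflow_dispatch
jobs:
  build:
    runs-on: gpu-runner-set        # hypothetical runner scale set name
    container:
      image: node:14.16
    steps:
      - run: node --version
```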
These two options are "dind" (Docker-in-Docker) and "kubernetes". You select the mode in the runner's helm values.yaml.
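A sketch of the relevant part of the gha-runner-scale-set values.yaml (illustrative only; see the chart's own values file for the full option set):

```yaml
# Select how job containers are executed: "kubernetes" launches them as separate
# pods via the container hooks, "dind" runs a docker:dind sidecar in the runner pod.
containerMode:
  type: "kubernetes"               # or "dind"
  # kubernetes mode also needs a work volume shared with the workflow pod:
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard"   # assumed storage class
    resources:
      requests:
        storage: 1Gi
```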
Regardless of which mode you pick, the runner always starts from the GitHub Actions runner Docker image, using the pod template you specify in the values.yaml. However, the job "containers", like the node:14.16 one I specify above, run differently depending on which mode you select.
Kubernetes Mode
This would be my preferred method: first you get a runner pod, and then your container runs as a separate pod, which is the Kubernetes way. However, only the "runner" pod, which is not the one running your code, is customized with your pod template specification. The secondary pod with the image you chose is missing a lot, such as nodeSelectors, tolerations, and resource requests (e.g. nvidia.com/gpu: 1), which means your container doesn't have access to the GPU.
The pod with the "-workflow" suffix runs the node:14.16 image and has the proper env variables. However, it does not support "options" according to this, or any pod customization at all. I don't see any way to even add labels to the pod, which could be useful.
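For concreteness, this is roughly the spec the "-workflow" pod would need in order to land on (and use) a GPU node; none of it can currently be injected, and the field values and container name below are illustrative:

```yaml
# Hypothetical spec for the "-workflow" pod; today the kubernetes hook does not
# apply any of this, so the pod cannot be scheduled onto or use a GPU node.
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # assumed node label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: job                      # container name is illustrative
      image: node:14.16
      resources:
        limits:
          nvidia.com/gpu: 1
```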
This is addressed (in the upstream repo) by this PR, which would then need to be added to this repo: actions/runner-container-hooks#50
Dind Mode
On "dind" mode, basically you add a docker:dind image as a sidecar container in the pod, and you instruct the runner container in the pod to use the dind socket, so you are able to run containers on this sidecar.
That also works, but it doesn't feel Kubernetes-native and thus has some disadvantages. The biggest issue, though, is that this Docker-in-Docker setup doesn't give me access to the GPU either, or at least I have not been able to solve it. The runner container does have access to the GPU, but the containers created within it do not.
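A debug job along these lines makes the difference visible (the scale set name and CUDA image tag are illustrative, and it assumes the NVIDIA container toolkit is set up on the node):

```yaml
# Illustrative debug workflow: step 1 checks GPU visibility inside the runner
# container, step 2 checks a container started by hand against the dind daemon
# with an explicit --gpus flag.
name: gpu-debug
on: workflow_dispatch
jobs:
  gpu-debug:
    runs-on: gpu-runner-set          # hypothetical runner scale set name
    steps:
      - name: GPU visible to the runner container?
        run: nvidia-smi
      - name: GPU visible to a hand-started container?
        run: docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```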
Older post with some extra info:
This is my values.yaml for the helm chart gha-runner-scale-set (which is a template for the CRD).
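In outline it boils down to something like this (a sketch with placeholder names and values, not the exact file):

```yaml
# Outline of the scale set values: kubernetes container mode plus a runner pod
# template pinned to GPU nodes (URL, secret, labels, and tags are placeholders).
githubConfigUrl: https://github.com/my-org/my-repo
githubConfigSecret: gh-runner-secret
containerMode:
  type: "kubernetes"
template:
  spec:
    nodeSelector:
      nvidia.com/gpu.present: "true"
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            nvidia.com/gpu: 1
```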
What have I tried?
I tried to replace the dind container (and also the runner container) with ghcr.io/ddelange/actions-runner-controller-releases/actions-runner-dind:v2.300.2-ubuntu-20.04-c1e2c4e from https://github.com/ddelange/actions-runner-controller-releases (@ddelange), but with no success.
I get errors like "ERROR --- RUNNER_NAME" or "Executing the custom container implementation failed. Please contact your self hosted runner administrator."
I tried other nvidia dind images as well.
I tried setting some capabilities on both the dind and the runner containers, but with no success:
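Roughly along these lines (illustrative values only, not a known-working configuration):

```yaml
# Sketch of the securityContext tweaks tried on the runner and dind containers;
# the specific capabilities shown here are examples.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        securityContext:
          capabilities:
            add: ["SYS_ADMIN"]
      - name: dind
        image: docker:dind
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN"]
```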
Any ideas, suggestions or general help would be much appreciated.
Regards.