How to run custom GPU/nvidia containers specified in workflow #2592
jjhidalgar asked this question in Questions
I'm using the gha-runner-scale-set-controller Helm chart to deploy runners. I have the following requirement:
When using the new helm chart "gha-runner-scale-set-controller", there are two ways I can enable GitHub Actions "container customization", so that I can run a specific image for every job, like this:
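For example, a minimal workflow job that runs inside a specific image (the runner scale set name here is a placeholder):

```yaml
# Minimal example of the job-level container customization meant above:
# the job's steps run inside node:14.16 instead of directly on the runner.
name: container-demo
on: workflow_dispatch
jobs:
  build:
    runs-on: gpu-runner-set        # hypothetical runner scale set name
    container:
      image: node:14.16
    steps:
      - run: node --version
```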
These two options are "dind" (Docker-in-Docker) and "kubernetes". You select the mode in the runner's helm values.yaml.
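A sketch of the relevant part of the gha-runner-scale-set values.yaml (illustrative only; see the chart's own values file for the full option set):

```yaml
# Select how job containers are executed: "kubernetes" launches them as separate
# pods via the container hooks, "dind" runs a docker:dind sidecar in the runner pod.
containerMode:
  type: "kubernetes"               # or "dind"
  # kubernetes mode also needs a work volume shared with the workflow pod:
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard"   # assumed storage class
    resources:
      requests:
        storage: 1Gi
```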
Regardless of which mode you pick, the runner always starts from the GitHub Actions runner Docker image, using the pod template you specify in the values.yaml. However, the job "containers", like the node:14.16 one I specify above, run differently depending on which mode you select.
Kubernetes Mode
This would be my preferred method: first you get a runner pod, and then your container runs as a separate pod, which is the Kubernetes way. However, only the "runner" pod, which is not the one running your code, is customized with your pod template specification. The secondary pod with the image you chose is missing a lot, such as nodeSelectors, tolerations, and resource requests (e.g. nvidia.com/gpu: 1), which means your container doesn't have access to the GPU.
The pod with the "-workflow" suffix runs the node:14.16 image and has the proper env variables. However, it does not support "options" according to this, or any pod customization at all. I don't see any way to even add labels to the pod, which could be useful.
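For concreteness, this is roughly the spec the "-workflow" pod would need in order to land on (and use) a GPU node; none of it can currently be injected, and the field values and container name below are illustrative:

```yaml
# Hypothetical spec for the "-workflow" pod; today the kubernetes hook does not
# apply any of this, so the pod cannot be scheduled onto or use a GPU node.
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # assumed node label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: job                      # container name is illustrative
      image: node:14.16
      resources:
        limits:
          nvidia.com/gpu: 1
```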
This is addressed (in the upstream repo) by this PR, which would then need to be added to this repo: actions/runner-container-hooks#50
Dind Mode
On "dind" mode, basically you add a docker:dind image as a sidecar container in the pod, and you instruct the runner container in the pod to use the dind socket, so you are able to run containers on this sidecar.
That also works, but it doesn't feel Kubernetes-native and thus has some disadvantages. The biggest issue, though, is that this Docker-in-Docker setup doesn't give me access to the GPU either, or at least I have not been able to solve it. The runner container does have access to the GPU, but the containers created within it do not.
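A debug job along these lines makes the difference visible (the scale set name and CUDA image tag are illustrative, and it assumes the NVIDIA container toolkit is set up on the node):

```yaml
# Illustrative debug workflow: step 1 checks GPU visibility inside the runner
# container, step 2 checks a container started by hand against the dind daemon
# with an explicit --gpus flag.
name: gpu-debug
on: workflow_dispatch
jobs:
  gpu-debug:
    runs-on: gpu-runner-set          # hypothetical runner scale set name
    steps:
      - name: GPU visible to the runner container?
        run: nvidia-smi
      - name: GPU visible to a hand-started container?
        run: docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```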
Older post with some extra info:
This is my values.yaml for the helm chart gha-runner-scale-set (which is a template for the CRD).
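In outline it boils down to something like this (a sketch with placeholder names and values, not the exact file):

```yaml
# Outline of the scale set values: kubernetes container mode plus a runner pod
# template pinned to GPU nodes (URL, secret, labels, and tags are placeholders).
githubConfigUrl: https://github.com/my-org/my-repo
githubConfigSecret: gh-runner-secret
containerMode:
  type: "kubernetes"
template:
  spec:
    nodeSelector:
      nvidia.com/gpu.present: "true"
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            nvidia.com/gpu: 1
```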
What have I tried?
I tried to replace the dind container (and also the runner container) with ghcr.io/ddelange/actions-runner-controller-releases/actions-runner-dind:v2.300.2-ubuntu-20.04-c1e2c4e from https://github.com/ddelange/actions-runner-controller-releases (@ddelange), but with no success.
I get errors like "ERROR --- RUNNER_NAME" or "Executing the custom container implementation failed. Please contact your self hosted runner administrator."
I tried other nvidia dind images as well.
I tried setting some capabilities on both the dind and the runner containers, but with no success:
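Roughly along these lines (illustrative values only, not a known-working configuration):

```yaml
# Sketch of the securityContext tweaks tried on the runner and dind containers;
# the specific capabilities shown here are examples.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        securityContext:
          capabilities:
            add: ["SYS_ADMIN"]
      - name: dind
        image: docker:dind
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN"]
```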
Any ideas, suggestions or general help would be much appreciated.
Regards.