Testing clearml-agent in K8s with GPU

### Describe the bug a clear and concise description of what the bug is.

When I execute this, I always receive this error. If I use the pip agent on the same machine, it works fine:
[notice] To update, run: python3.10 -m pip install --upgrade pip
/root/__start_agent__.sh: line 14:   684 Segmentation fault      (core dumped) $LOCAL_PYTHON -m clearml_agent execute --full-monitoring --require-queue --id 5c06200ada0f4aa2bfe9a16e0ebee84b

### What's your helm version?

v3.16.3

### What's your kubectl version?

v1.27.16 +rke2r2

### What's the chart version?

5.2.2

### Enter the changed values of values.yaml?

``` yaml
# -- Global parameters section
global:
  # -- Images registry
  imageRegistry: "docker.io"

# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: false
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingSecret: ""
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd
  # -- Email
  email: someone@host.com

# -- ClearMl generic configurations
clearml:
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingAgentk8sglueSecret: ""
  # -- Agent k8s Glue basic auth key
  agentk8sglueKey: "xxx"
  # -- Agent k8s Glue basic auth secret
  agentk8sglueSecret: "xxx"

  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingClearmlConfigSecret: ""
  # The secret should be defined as the following example
  #
  # apiVersion: v1
  # kind: Secret
  # metadata:
  #   name: secret-name
  # stringData:
  #   clearml.conf: |-
  #     sdk {
  #     }
  # -- ClearML configuration file
  clearmlConfig: |-
    sdk {
    }

# -- This agent will spawn queued experiments in new pods, a good use case is to combine this with
# GPU autoscaling nodes.
# https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue
agentk8sglue:
  # -- Glue Agent image configuration
  image:
    registry: ""
    repository: "allegroai/clearml-agent-k8s-base"
    tag: "1.24-21"
  # -- Glue Agent number of pods
  replicaCount: 1

  # -- Glue Agent pod resources
  resources: {}

  # -- Glue Agent pod initContainers configs
  initContainers:
    # -- Glue Agent initcontainers pod resources
    resources: {}

  # -- if set, don't create a serviceAccountName but use defined existing one
  serviceExistingAccountName: ""

  # -- Check certificates validity for evefry UrlReference below.
  clearmlcheckCertificate: true

  # -- Reference to Api server url
  apiServerUrlReference: "https://api-clearml.xxx.com"
  # -- Reference to File server url
  fileServerUrlReference: "https://files-clearml.xxx.com"
  # -- Reference to Web server url
  webServerUrlReference: "https://app-clearml.xxx.com"

  # -- default container image for ClearML Task pod
  defaultContainerImage: clearml/fractional-gpu:u22-cu11.7-12gb
  # -- ClearML queue this agent will consume. Multiple queues can be specified with the following format: queue1,queue2,queue3
  queue: kollama
  # -- if ClearML queue does not exist, it will be create it if the value is set to true
  createQueueIfNotExists: false
  # -- labels setup for Agent pod (example in values.yaml comments)
  labels: {}
    # schedulerName: scheduler
  # -- annotations setup for Agent pod (example in values.yaml comments)
  annotations: {}
    # key1: value1
  # -- Extra Environment variables for Glue Agent
  extraEnvs: []
    # - name: PYTHONPATH
    #   value: "somepath"
  # -- container securityContext setup for Agent pod (example in values.yaml comments)
  podSecurityContext: {}
    #  runAsUser: 1001
    #  fsGroup: 1001
  # -- container securityContext setup for Agent pod (example in values.yaml comments)
  containerSecurityContext: {}
    #  runAsUser: 1001
    #  fsGroup: 1001
  # -- additional existing ClusterRoleBindings
  additionalClusterRoleBindings: []
    # - privileged
  # -- additional existing RoleBindings
  additionalRoleBindings: []
    # - privileged
  # -- nodeSelector setup for Agent pod (example in values.yaml comments)
  nodeSelector: {}
    # fleet: agent-nodes
  # -- tolerations setup for Agent pod (example in values.yaml comments)
  tolerations: []
  # -- affinity setup for Agent pod (example in values.yaml comments)
  affinity: {}
  # -- volumes definition for Glue Agent (example in values.yaml comments)
  volumes: []
    # - name: "yourvolume"
    #   nfs:
    #    server: 192.168.0.1
    #    path: /var/nfs/mount
  # -- volume mounts definition for Glue Agent (example in values.yaml comments)
  volumeMounts: []
    # - name: yourvolume
    #   mountPath: /yourpath
    #   subPath: userfolder

  # -- file definition for Glue Agent (example in values.yaml comments)
  fileMounts: []
    # - name: "integration.py"
    #   folderPath: "/mnt/python"
    #   fileContent: |-
    #     def get_template(*args, **kwargs):
    #       print("args: {}".format(args))
    #       print("kwargs: {}".format(kwargs))
    #       return {
    #           "template": {
    #           }
    #       }

  # -- base template for pods spawned to consume ClearML Task
  basePodTemplate:
    # -- labels setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    labels: {}
      # schedulerName: scheduler
    # -- annotations setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    annotations: {}
      # key1: value1
    # -- initContainers definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    hostPID: true
    # initContainers:
    #   - name: train-container
    #     image: clearml/fractional-gpu:u22-cu11.7-12gb
    #     command: ['python3', '-c', 'print(f"Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}")']
    # initContainers: []
      # - name: volume-dirs-init-cntr
      #   image: busybox:1.35
      #  command:
      #    - /bin/bash
      #    - -c
      #    - >
      #      /bin/echo "this is an init";
    # -- schedulerName setup for pods spawned to consume ClearML Task
    schedulerName: ""
    # -- volumes definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    volumes: []
      # - name: "yourvolume"
      #   nfs:
      #    server: 192.168.0.1
      #    path: /var/nfs/mount
    # -- volume mounts definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    volumeMounts: []
      # - name: yourvolume
      #   mountPath: /yourpath
      #   subPath: userfolder
    # -- file definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    fileMounts: []
      # - name: "mounted-file.txt"
      #   folderPath: "/mnt/"
      #   fileContent: |-
      #     this is a test file
      #     with test content
    # -- environment variables for pods spawned to consume ClearML Task (example in values.yaml comments)
    env: []
      # # to setup access to private repo, setup secret with git credentials:
      # - name: CLEARML_AGENT_GIT_USER
      #   value: mygitusername
      # - name: CLEARML_AGENT_GIT_PASS
      #   valueFrom:
      #     secretKeyRef:
      #       name: git-password
      #       key: git-password
      # - name: CURL_CA_BUNDLE
      #   value: ""
      # - name: PYTHONWARNINGS
      #   value: "ignore:Unverified HTTPS request"
    # -- resources declaration for pods spawned to consume ClearML Task (example in values.yaml comments)
    resources:
      limits:
        nvidia.com/gpu: 1
    # -- priorityClassName setup for pods spawned to consume ClearML Task
    priorityClassName: ""
    # -- nodeSelector setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    nodeSelector:  
      nvidia.com/gpu.product: NVIDIA-L40S-SHARED
    #   fleet: gpu-nodes
    # -- tolerations setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: "NoSchedule"
    # -- affinity setup for pods spawned to consume ClearML Task
    affinity: {}
    # -- securityContext setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    podSecurityContext: {}
      #  runAsUser: 1001
      #  fsGroup: 1001
    # -- securityContext setup for containers spawned to consume ClearML Task (example in values.yaml comments)
    containerSecurityContext: {}
      #  runAsUser: 1001
      #  fsGroup: 1001
    # -- hostAliases setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    hostAliases: []
    # - ip: "127.0.0.1"
    #   hostnames:
    #   - "foo.local"
    #   - "bar.local"

# -- Sessions internal service configuration
sessions:
  # -- Enable/Disable sessions portmode WARNING: only one Agent deployment can have this set to true
  portModeEnabled: false
  # -- specific annotations for session services
  svcAnnotations: {}
  # -- service type ("NodePort" or "ClusterIP" or "LoadBalancer")
  svcType: "NodePort"
  # -- External IP sessions clients can connect to
  externalIP: 0.0.0.0
  # -- starting range of exposed NodePorts
  startingPort: 30000
  # -- maximum number of NodePorts exposed
  maxServices: 20
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Testing clearml-agent in K8s with GPU #335

Describe the bug a clear and concise description of what the bug is.

What's your helm version?

What's your kubectl version?

What's the chart version?

Enter the changed values of values.yaml?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Testing clearml-agent in K8s with GPU #335

Description

Describe the bug a clear and concise description of what the bug is.

What's your helm version?

What's your kubectl version?

What's the chart version?

Enter the changed values of values.yaml?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions