Skip to content

Testing clearml-agent in K8s with GPU #335

@jjamgmv

Description

@jjamgmv

Describe the bug a clear and concise description of what the bug is.

When I execute this, I always receive this error. If I use the pip agent on the same machine, it works fine:
[notice] To update, run: python3.10 -m pip install --upgrade pip
/root/start_agent.sh: line 14: 684 Segmentation fault (core dumped) $LOCAL_PYTHON -m clearml_agent execute --full-monitoring --require-queue --id 5c06200ada0f4aa2bfe9a16e0ebee84b

What's your helm version?

v3.16.3

What's your kubectl version?

v1.27.16 +rke2r2

What's the chart version?

5.2.2

Enter the changed values of values.yaml?

# -- Global parameters section
global:
  # -- Images registry
  imageRegistry: "docker.io"

# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: false
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingSecret: ""
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd
  # -- Email
  email: someone@host.com

# -- ClearMl generic configurations
clearml:
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingAgentk8sglueSecret: ""
  # -- Agent k8s Glue basic auth key
  agentk8sglueKey: "xxx"
  # -- Agent k8s Glue basic auth secret
  agentk8sglueSecret: "xxx"

  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingClearmlConfigSecret: ""
  # The secret should be defined as the following example
  #
  # apiVersion: v1
  # kind: Secret
  # metadata:
  #   name: secret-name
  # stringData:
  #   clearml.conf: |-
  #     sdk {
  #     }
  # -- ClearML configuration file
  clearmlConfig: |-
    sdk {
    }

# -- This agent will spawn queued experiments in new pods, a good use case is to combine this with
# GPU autoscaling nodes.
# https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue
agentk8sglue:
  # -- Glue Agent image configuration
  image:
    registry: ""
    repository: "allegroai/clearml-agent-k8s-base"
    tag: "1.24-21"
  # -- Glue Agent number of pods
  replicaCount: 1

  # -- Glue Agent pod resources
  resources: {}

  # -- Glue Agent pod initContainers configs
  initContainers:
    # -- Glue Agent initcontainers pod resources
    resources: {}

  # -- if set, don't create a serviceAccountName but use defined existing one
  serviceExistingAccountName: ""

  # -- Check certificates validity for evefry UrlReference below.
  clearmlcheckCertificate: true

  # -- Reference to Api server url
  apiServerUrlReference: "https://api-clearml.xxx.com"
  # -- Reference to File server url
  fileServerUrlReference: "https://files-clearml.xxx.com"
  # -- Reference to Web server url
  webServerUrlReference: "https://app-clearml.xxx.com"

  # -- default container image for ClearML Task pod
  defaultContainerImage: clearml/fractional-gpu:u22-cu11.7-12gb
  # -- ClearML queue this agent will consume. Multiple queues can be specified with the following format: queue1,queue2,queue3
  queue: kollama
  # -- if ClearML queue does not exist, it will be create it if the value is set to true
  createQueueIfNotExists: false
  # -- labels setup for Agent pod (example in values.yaml comments)
  labels: {}
    # schedulerName: scheduler
  # -- annotations setup for Agent pod (example in values.yaml comments)
  annotations: {}
    # key1: value1
  # -- Extra Environment variables for Glue Agent
  extraEnvs: []
    # - name: PYTHONPATH
    #   value: "somepath"
  # -- container securityContext setup for Agent pod (example in values.yaml comments)
  podSecurityContext: {}
    #  runAsUser: 1001
    #  fsGroup: 1001
  # -- container securityContext setup for Agent pod (example in values.yaml comments)
  containerSecurityContext: {}
    #  runAsUser: 1001
    #  fsGroup: 1001
  # -- additional existing ClusterRoleBindings
  additionalClusterRoleBindings: []
    # - privileged
  # -- additional existing RoleBindings
  additionalRoleBindings: []
    # - privileged
  # -- nodeSelector setup for Agent pod (example in values.yaml comments)
  nodeSelector: {}
    # fleet: agent-nodes
  # -- tolerations setup for Agent pod (example in values.yaml comments)
  tolerations: []
  # -- affinity setup for Agent pod (example in values.yaml comments)
  affinity: {}
  # -- volumes definition for Glue Agent (example in values.yaml comments)
  volumes: []
    # - name: "yourvolume"
    #   nfs:
    #    server: 192.168.0.1
    #    path: /var/nfs/mount
  # -- volume mounts definition for Glue Agent (example in values.yaml comments)
  volumeMounts: []
    # - name: yourvolume
    #   mountPath: /yourpath
    #   subPath: userfolder

  # -- file definition for Glue Agent (example in values.yaml comments)
  fileMounts: []
    # - name: "integration.py"
    #   folderPath: "/mnt/python"
    #   fileContent: |-
    #     def get_template(*args, **kwargs):
    #       print("args: {}".format(args))
    #       print("kwargs: {}".format(kwargs))
    #       return {
    #           "template": {
    #           }
    #       }

  # -- base template for pods spawned to consume ClearML Task
  basePodTemplate:
    # -- labels setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    labels: {}
      # schedulerName: scheduler
    # -- annotations setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    annotations: {}
      # key1: value1
    # -- initContainers definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    hostPID: true
    # initContainers:
    #   - name: train-container
    #     image: clearml/fractional-gpu:u22-cu11.7-12gb
    #     command: ['python3', '-c', 'print(f"Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}")']
    # initContainers: []
      # - name: volume-dirs-init-cntr
      #   image: busybox:1.35
      #  command:
      #    - /bin/bash
      #    - -c
      #    - >
      #      /bin/echo "this is an init";
    # -- schedulerName setup for pods spawned to consume ClearML Task
    schedulerName: ""
    # -- volumes definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    volumes: []
      # - name: "yourvolume"
      #   nfs:
      #    server: 192.168.0.1
      #    path: /var/nfs/mount
    # -- volume mounts definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    volumeMounts: []
      # - name: yourvolume
      #   mountPath: /yourpath
      #   subPath: userfolder
    # -- file definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    fileMounts: []
      # - name: "mounted-file.txt"
      #   folderPath: "/mnt/"
      #   fileContent: |-
      #     this is a test file
      #     with test content
    # -- environment variables for pods spawned to consume ClearML Task (example in values.yaml comments)
    env: []
      # # to setup access to private repo, setup secret with git credentials:
      # - name: CLEARML_AGENT_GIT_USER
      #   value: mygitusername
      # - name: CLEARML_AGENT_GIT_PASS
      #   valueFrom:
      #     secretKeyRef:
      #       name: git-password
      #       key: git-password
      # - name: CURL_CA_BUNDLE
      #   value: ""
      # - name: PYTHONWARNINGS
      #   value: "ignore:Unverified HTTPS request"
    # -- resources declaration for pods spawned to consume ClearML Task (example in values.yaml comments)
    resources:
      limits:
        nvidia.com/gpu: 1
    # -- priorityClassName setup for pods spawned to consume ClearML Task
    priorityClassName: ""
    # -- nodeSelector setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    nodeSelector:  
      nvidia.com/gpu.product: NVIDIA-L40S-SHARED
    #   fleet: gpu-nodes
    # -- tolerations setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: "NoSchedule"
    # -- affinity setup for pods spawned to consume ClearML Task
    affinity: {}
    # -- securityContext setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    podSecurityContext: {}
      #  runAsUser: 1001
      #  fsGroup: 1001
    # -- securityContext setup for containers spawned to consume ClearML Task (example in values.yaml comments)
    containerSecurityContext: {}
      #  runAsUser: 1001
      #  fsGroup: 1001
    # -- hostAliases setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    hostAliases: []
    # - ip: "127.0.0.1"
    #   hostnames:
    #   - "foo.local"
    #   - "bar.local"

# -- Sessions internal service configuration
sessions:
  # -- Enable/Disable sessions portmode WARNING: only one Agent deployment can have this set to true
  portModeEnabled: false
  # -- specific annotations for session services
  svcAnnotations: {}
  # -- service type ("NodePort" or "ClusterIP" or "LoadBalancer")
  svcType: "NodePort"
  # -- External IP sessions clients can connect to
  externalIP: 0.0.0.0
  # -- starting range of exposed NodePorts
  startingPort: 30000
  # -- maximum number of NodePorts exposed
  maxServices: 20

Metadata

Metadata

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions