
Trainjob file-metricscollector does not exit after collecting metrics, but the same job is normal #2616

@wangyakun

Description


What happened?

I created an experiment with the Python SDK using a TrainJob trial, and the same thing happened as in #2614: the pod is stuck at 2/3, NotReady, and the metrics-collector container keeps running with no error.
However, when I duplicated the experiment YAML, changed only the trial kind from "TrainJob" to "Job", and ran it again, everything worked: the metrics-collector container exited normally and the experiment reached Succeeded.
So this looks like a new bug, specific to TrainJob trials and distinct from #2614.

problem katib experiment

python code:

import kubeflow.katib as katib
from kubeflow.katib import KatibClient
from kubeflow.katib.models import (
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1AlgorithmSpec,
    V1beta1ObjectiveSpec,
    V1beta1ParameterSpec,
    V1beta1TrialTemplate,
    V1beta1TrialParameterSpec,
    V1ObjectMeta
)

parameters = [
    V1beta1ParameterSpec(
        name="learning_rate",
        parameter_type="double",
        feasible_space={"min": "1e-05", "max": "5e-05"}
    ),
    V1beta1ParameterSpec(
        name="r",
        parameter_type="int",
        feasible_space={"min": "1", "max": "8"}
    )
]

TRAIN_IMAGE = "36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev"       
EXP_NAME = "katib-llamafactory-qwen-sft3-lyt-old"
NAMESPACE = "aict"

trial_spec={
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "spec": {
        "podTemplateOverrides": [
            {
                "spec": {
                    "containers": [
                        {
                            "name": "node",
                            "volumeMounts": [
                                {
                                    "mountPath": "/datas",
                                    "name": "trainer-datas",
                                },
                            ],
                        },
                    ],
                    "volumes": [
                        {
                            "name": "trainer-datas",
                            "persistentVolumeClaim": {
                                "claimName": "katib-llamafactory-qwen-sft"
                            },
                        },
                    ],
                },
                "targetJobs":[
                    {
                        "name": "node",
                    },
                ],
            },
        ],
        "runtimeRef": {
            "apiGroup": "trainer.kubeflow.org",
            "kind": "ClusterTrainingRuntime",
            "name": "custom-test",
        },
        "trainer": {
            "numNodes": 1,
            "image": TRAIN_IMAGE,
            "command": [
                "sh",
                "-c",
                "set -x;"
                "accelerate launch "
                "  --multi_gpu"
                "  src/train.py "
                "  --model_name_or_path=/datas/models "
                "  --output_dir=/datas/output "
                "  --dataset_dir /datas/datasets "
                "  --do_train "
                "  --report_to=tensorboard "
                "  --finetuning_type=lora "
                "  --flash_attn=auto "
                "  --packing=False "
                "  --plot_loss=True "
                "  --ddp_timeout=180000000 "
                "  --fp16=True "
                "  --cutoff_len=4096 "
                "  --dataset=default "
                "  --gradient_accumulation_steps=8 "
                "  --learning_rate=${trialParameters.learning_rate} "
                "  --logging_steps=5 "
                "  --lr_scheduler_type=cosine "
                "  --max_samples=100000 "
                "  --num_train_epochs=1 "
                "  --optim=adamw_torch "
                "  --per_device_train_batch_size=2 "
                "  --save_steps=256 "
                "  --stage=sft "
                "  --template=qwen "
                "  --lora_alpha=16 "
                "  --lora_dropout=0 "
                "  --lora_rank=${trialParameters.r} "
                "  --loraplus_lr_ratio=0 "
                "  --use_dora=false "
                "  --use_rslora=false "
                "  --overwrite_output_dir;"
                "if [ -f /datas/output/train_results.json ]; then"
                "  echo 'Converting all_results.json to single-line format...' >&2;"
                "  python3 -c \"import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))\" > /tmp/all_results_single.json;"
                "  mv /tmp/all_results_single.json /datas/output/train_results.json;"
                "  echo 'JSON conversion complete' >&2;"
                "fi;"
                "cat /datas/output/train_results.json;"
                "sync;"
                "sleep 10;"
                "echo completed > /datas/output/$$$$.pid;"
                "sleep 30;"
                "exit 0"
            ],
            "resourcesPerNode": {
                "limits": {
                    "cpu": "1",
                    "memory": "8Gi",
                    "nvidia.com/gpu": "1",      
                },
                "requests": {
                    "cpu": "1",
                    "memory": "8Gi",
                    "nvidia.com/gpu": "1",
                },
            },
        },
    }
}

trial_template = V1beta1TrialTemplate(
    primary_container_name="node",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="learning_rate", 
            description="Learning rate",
            reference="learning_rate"
        ),
        V1beta1TrialParameterSpec(
            name="r", 
            description="LoRA rank",
            reference="r"
        )
    ],
    trial_spec=trial_spec,
    success_condition='status.conditions.#(type=="Complete")#|#(status=="True")#',
    failure_condition='status.conditions.#(type=="Failed")#|#(status=="True")#',
    retain=True,
)

experiment_spec = V1beta1ExperimentSpec(
    algorithm=V1beta1AlgorithmSpec(algorithm_name="random"),
    objective=V1beta1ObjectiveSpec(
        type="minimize",
        goal=2.0,
        objective_metric_name="train_loss",
        metric_strategies=[{"name": "train_loss", "value": "min"}]
    ),
    parameters=parameters,
    trial_template=trial_template,
    max_trial_count=2,
    parallel_trial_count=1,
    metrics_collector_spec={
        "collector": {"kind": "File"},
        "source": {
            "fileSystemPath": {
                "kind": "File",
                "path": "/datas/output/train_results.json",
                "format": "JSON",
            }
        }
    },
    max_failed_trial_count=1
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=V1ObjectMeta(name=EXP_NAME, namespace=NAMESPACE),
    spec=experiment_spec
)

cl = KatibClient(namespace=NAMESPACE)
cl.create_experiment(experiment)

cl.wait_for_experiment_condition(name=EXP_NAME)
print(cl.get_optimal_hyperparameters(EXP_NAME))
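For reference, the JSON flattening that the trainer command performs inline with `python3 -c` (rewriting `train_results.json` as a single compact line with stringified values, so the file metrics collector can parse it) can be reproduced standalone. This sketch uses a throwaway temp file in place of `/datas/output/train_results.json`:

```python
import json
import os
import tempfile

def to_single_line(path_in: str, path_out: str) -> str:
    """Rewrite a pretty-printed results JSON as one compact line with all
    values stringified, mirroring the conversion step in the trial command."""
    with open(path_in) as f:
        data = json.load(f)
    line = json.dumps({k: str(v) for k, v in data.items()},
                      separators=(",", ":"))
    with open(path_out, "w") as f:
        f.write(line + "\n")
    return line

# Demo with a throwaway file standing in for /datas/output/train_results.json
demo = {"epoch": 1.0, "train_loss": 3.3930898904800415}
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "train_results.json")
    with open(src, "w") as f:
        json.dump(demo, f, indent=4)  # multi-line, like the trainer's output
    result = to_single_line(src, src)  # -> one compact line
```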

experiment yaml:

metadata:
  name: katib-llamafactory-qwen-sft3-lyt-old
  namespace: aict
  uid: b41ef14b-7287-42f9-9dbf-db43b0f8f1de
  resourceVersion: '146317884'
  generation: 1
  creationTimestamp: '2026-02-06T07:20:00Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: OpenAPI-Generator
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T07:20:00Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:metricsCollectorSpec:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
            f:source:
              .: {}
              f:fileSystemPath:
                .: {}
                f:format: {}
                f:kind: {}
                f:path: {}
          f:objective:
            .: {}
            f:goal: {}
            f:metricStrategies: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:parameters: {}
          f:trialTemplate:
            .: {}
            f:failureCondition: {}
            f:primaryContainerName: {}
            f:retain: {}
            f:successCondition: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:podTemplateOverrides: {}
                f:runtimeRef:
                  .: {}
                  f:apiGroup: {}
                  f:kind: {}
                  f:name: {}
                f:trainer:
                  .: {}
                  f:command: {}
                  f:image: {}
                  f:numNodes: {}
                  f:resourcesPerNode:
                    .: {}
                    f:limits:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                      f:nvidia.com/gpu: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                      f:nvidia.com/gpu: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T07:20:00Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T07:20:14Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:observation: {}
          f:runningTrialList: {}
          f:startTime: {}
          f:trials: {}
          f:trialsRunning: {}
      subresource: status
spec:
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        max: '5e-05'
        min: '1e-05'
        distribution: uniform
    - name: r
      parameterType: int
      feasibleSpace:
        max: '8'
        min: '1'
        distribution: uniform
  objective:
    type: minimize
    goal: 2
    objectiveMetricName: train_loss
    metricStrategies:
      - name: train_loss
        value: min
  algorithm:
    algorithmName: random
  trialTemplate:
    retain: true
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        podTemplateOverrides:
          - spec:
              containers:
                - name: node
                  volumeMounts:
                    - mountPath: /datas
                      name: trainer-datas
              volumes:
                - name: trainer-datas
                  persistentVolumeClaim:
                    claimName: katib-llamafactory-qwen-sft
            targetJobs:
              - name: node
        runtimeRef:
          apiGroup: trainer.kubeflow.org
          kind: ClusterTrainingRuntime
          name: custom-test
        trainer:
          command:
            - sh
            - '-c'
            - >-
              set -x;accelerate launch   --multi_gpu  src/train.py  
              --model_name_or_path=/datas/models   --output_dir=/datas/output  
              --dataset_dir /datas/datasets   --do_train  
              --report_to=tensorboard   --finetuning_type=lora  
              --flash_attn=auto   --packing=False   --plot_loss=True  
              --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096  
              --dataset=default   --gradient_accumulation_steps=8  
              --learning_rate=${trialParameters.learning_rate}  
              --logging_steps=5   --lr_scheduler_type=cosine  
              --max_samples=100000   --num_train_epochs=1  
              --optim=adamw_torch   --per_device_train_batch_size=2  
              --save_steps=256   --stage=sft   --template=qwen  
              --lora_alpha=16   --lora_dropout=0  
              --lora_rank=${trialParameters.r}   --loraplus_lr_ratio=0  
              --use_dora=false   --use_rslora=false   --overwrite_output_dir;if
              [ -f /datas/output/train_results.json ]; then  echo 'Converting
              all_results.json to single-line format...' >&2;  python3 -c
              "import json;
              data=json.load(open('/datas/output/train_results.json'));
              print(json.dumps({k: str(v) for k, v in data.items()},
              separators=(',', ':')))" > /tmp/all_results_single.json;  mv
              /tmp/all_results_single.json /datas/output/train_results.json; 
              echo 'JSON conversion complete' >&2;fi;cat
              /datas/output/train_results.json;sync;sleep 10;echo completed >
              /datas/output/$$$$.pid;sleep 30;exit 0
          image: >-
            36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
          numNodes: 1
          resourcesPerNode:
            limits:
              cpu: '1'
              memory: 8Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '1'
              memory: 8Gi
              nvidia.com/gpu: '1'
    trialParameters:
      - name: learning_rate
        description: Learning rate
        reference: learning_rate
      - name: r
        description: LoRA rank
        reference: r
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: '0'
      jobset.sigs.k8s.io/replicatedjob-name: node
    primaryContainerName: node
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 1
  maxTrialCount: 2
  maxFailedTrialCount: 1
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /datas/output/train_results.json
        kind: File
        format: JSON
    collector:
      kind: File
  resumePolicy: Never
status:
  startTime: '2026-02-06T07:20:00Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2026-02-06T07:20:00Z'
      lastTransitionTime: '2026-02-06T07:20:00Z'
    - type: Running
      status: 'True'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2026-02-06T07:20:14Z'
      lastTransitionTime: '2026-02-06T07:20:14Z'
  currentOptimalTrial:
    observation: {}
  runningTrialList:
    - katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
  trials: 1
  trialsRunning: 1

pod state:

# kubectl get pod | grep old
katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4   2/3     NotReady           0               58m
katib-llamafactory-qwen-sft3-lyt-old-random-7bbc5c78d4-nnpx4   1/1     Running            0               58m

metrics collector log:

# kubectl logs -f -c metrics-logger-and-collector katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
I0206 07:20:19.726497     105 main.go:400] Trial Name: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
I0206 07:37:20.817003     105 main.go:143] {
I0206 07:37:20.817104     105 main.go:143]     "epoch": 1.0,
I0206 07:37:20.817123     105 main.go:143]     "total_flos": 8476485388075008.0,
I0206 07:37:20.817127     105 main.go:143]     "train_loss": 3.3930898904800415,
I0206 07:37:20.817139     105 main.go:143]     "train_runtime": 986.1327,
I0206 07:37:20.817144     105 main.go:143]     "train_samples_per_second": 5.71,
I0206 07:37:20.817165     105 main.go:143]     "train_steps_per_second": 0.357
I0206 07:37:20.817169     105 main.go:143] }
2026/02/06 07:37:25 Re-opening truncated file /datas/output/train_results.json ...
2026/02/06 07:37:25 Successfully reopened truncated /datas/output/train_results.json
I0206 07:37:25.229170     105 main.go:143] {"epoch":"1.0","total_flos":"8476485388075008.0","train_loss":"3.3930898904800415","train_runtime":"986.1327","train_samples_per_second":"5.71","train_steps_per_second":"0.357"}
2026/02/06 07:37:25 Re-opening moved/deleted file /datas/output/train_results.json ...
2026/02/06 07:37:25 Successfully reopened /datas/output/train_results.json
I0206 07:37:25.229414     105 main.go:143] {"epoch":"1.0","total_flos":"8476485388075008.0","train_loss":"3.3930898904800415","train_runtime":"986.1327","train_samples_per_second":"5.71","train_steps_per_second":"0.357"}
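The last log line is the single-line JSON the collector picks up for the objective metric. As a minimal illustration (not Katib's actual parser), this extracts `train_loss` from that line and checks it against the `minimize` goal of 2.0 from the experiment spec; note that reaching the goal is optional, since the experiment can also succeed once maxTrialCount is reached:

```python
import json

# The exact metrics line from the collector log above.
log_line = ('{"epoch":"1.0","total_flos":"8476485388075008.0",'
            '"train_loss":"3.3930898904800415","train_runtime":"986.1327",'
            '"train_samples_per_second":"5.71","train_steps_per_second":"0.357"}')

metrics = json.loads(log_line)
train_loss = float(metrics["train_loss"])  # values were stringified upstream

# Objective from the experiment spec: type=minimize, goal=2.0
goal_reached = train_loss <= 2.0
```

So the metric was collected successfully; the question is only why the collector sidecar never exits after collecting it when the trial is a TrainJob.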

pod describe output:

# kubectl describe pod katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
Name:             katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
Namespace:        aict
Priority:         0
Service Account:  default
Node:             llm1/192.168.1.4
Start Time:       Fri, 06 Feb 2026 15:20:14 +0800
Labels:           batch.kubernetes.io/controller-uid=e614da32-fa81-4f5a-ac21-8232bc38224b
                  batch.kubernetes.io/job-completion-index=0
                  batch.kubernetes.io/job-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
                  controller-uid=e614da32-fa81-4f5a-ac21-8232bc38224b
                  job-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
                  jobset.sigs.k8s.io/global-replicas=1
                  jobset.sigs.k8s.io/group-name=default
                  jobset.sigs.k8s.io/group-replicas=1
                  jobset.sigs.k8s.io/job-global-index=0
                  jobset.sigs.k8s.io/job-group-index=0
                  jobset.sigs.k8s.io/job-index=0
                  jobset.sigs.k8s.io/job-key=77f78d77fd4a9c5f3dda557b211da8643fc5a4eb
                  jobset.sigs.k8s.io/jobset-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
                  jobset.sigs.k8s.io/jobset-uid=9ca13e49-5443-4656-ad74-327147852c5f
                  jobset.sigs.k8s.io/replicatedjob-name=node
                  jobset.sigs.k8s.io/replicatedjob-replicas=1
                  jobset.sigs.k8s.io/restart-attempt=0
                  katib.kubeflow.org/experiment=katib-llamafactory-qwen-sft3-lyt-old
                  katib.kubeflow.org/trial=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
                  security.istio.io/tlsMode=istio
                  service.istio.io/canonical-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
                  service.istio.io/canonical-revision=latest
Annotations:      batch.kubernetes.io/job-completion-index: 0
                  cni.projectcalico.org/containerID: 2358563f7dc46f4fc07175fdcd7d351a3901a777995d3f95a0b1117030b6f8af
                  cni.projectcalico.org/podIP: 10.42.0.119/32
                  cni.projectcalico.org/podIPs: 10.42.0.119/32
                  istio.io/rev: default
                  jobset.sigs.k8s.io/global-replicas: 1
                  jobset.sigs.k8s.io/group-name: default
                  jobset.sigs.k8s.io/group-replicas: 1
                  jobset.sigs.k8s.io/job-global-index: 0
                  jobset.sigs.k8s.io/job-group-index: 0
                  jobset.sigs.k8s.io/job-index: 0
                  jobset.sigs.k8s.io/job-key: 77f78d77fd4a9c5f3dda557b211da8643fc5a4eb
                  jobset.sigs.k8s.io/jobset-name: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
                  jobset.sigs.k8s.io/jobset-uid: 9ca13e49-5443-4656-ad74-327147852c5f
                  jobset.sigs.k8s.io/replicatedjob-name: node
                  jobset.sigs.k8s.io/replicatedjob-replicas: 1
                  jobset.sigs.k8s.io/restart-attempt: 0
                  kubectl.kubernetes.io/default-container: node
                  kubectl.kubernetes.io/default-logs-container: node
                  prometheus.io/path: /stats/prometheus
                  prometheus.io/port: 15020
                  prometheus.io/scrape: true
                  sidecar.istio.io/interceptionMode: REDIRECT
                  sidecar.istio.io/status:
                    {"initContainers":["istio-validation","istio-proxy"],"containers":null,"volumes":["workload-socket","credential-socket","workload-certs","...
                  traffic.sidecar.istio.io/excludeInboundPorts: 15020
                  traffic.sidecar.istio.io/includeInboundPorts: *
                  traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status:           Running
IP:               10.42.0.119
IPs:
  IP:           10.42.0.119
Controlled By:  Job/katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
Init Containers:
  istio-validation:
    Container ID:  containerd://279a472bd3f21151ca632eb9ab5a65d5b0f659cd122b4ef9cccb0b53db7e10f2
    Image:         36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
    Image ID:      36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      *
      -d
      15090,15021,15020
      --log_output_level=default:info
      --run-validation
      --skip-rule-apply
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 15:20:15 +0800
      Finished:     Fri, 06 Feb 2026 15:20:15 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
  istio-proxy:
    Container ID:  containerd://e389ec32b15abe0aac51909fbbc366210061e2a940d7f9e7ce9068637eecf0c2
    Image:         36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
    Image ID:      36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
    State:          Running
      Started:      Fri, 06 Feb 2026 15:20:17 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:15021/healthz/ready delay=0s timeout=3s period=15s #success=1 #failure=4
    Startup:    http-get http://:15021/healthz/ready delay=0s timeout=3s period=1s #success=1 #failure=600
    Environment:
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4 (v1:metadata.name)
      POD_NAMESPACE:                 aict (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      ISTIO_CPU_LIMIT:               2 (limits.cpu)
      PROXY_CONFIG:                  {"tracing":{}}
                                     
      ISTIO_META_POD_PORTS:          [
                                     ]
      ISTIO_META_APP_CONTAINERS:     node
      GOMEMLIMIT:                    1073741824 (limits.memory)
      GOMAXPROCS:                    2 (limits.cpu)
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_NODE_NAME:           (v1:spec.nodeName)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
      ISTIO_META_OWNER:              kubernetes://apis/batch/v1/namespaces/aict/jobs/katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/credential-uds from credential-socket (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
      /var/run/secrets/tokens from istio-token (rw)
      /var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
      /var/run/secrets/workload-spiffe-uds from workload-socket (rw)
Containers:
  node:
    Container ID:  containerd://c0bd832963b1034adde9b8843ca3d58f3eb9e5b898bd532b631d6575976eaf6a
    Image:         36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
    Image ID:      36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia@sha256:2899589618c16624103fb0170b865119fce8af891bb38dbf1be36b8c4f2cdc2f
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      set -x;accelerate launch   --multi_gpu  src/train.py   --model_name_or_path=/datas/models   --output_dir=/datas/output   --dataset_dir /datas/datasets   --do_train   --report_to=tensorboard   --finetuning_type=lora   --flash_attn=auto   --packing=False   --plot_loss=True   --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096   --dataset=default   --gradient_accumulation_steps=8   --learning_rate=2.5746808259899108e-05   --logging_steps=5   --lr_scheduler_type=cosine   --max_samples=100000   --num_train_epochs=1   --optim=adamw_torch   --per_device_train_batch_size=2   --save_steps=256   --stage=sft   --template=qwen   --lora_alpha=16   --lora_dropout=0   --lora_rank=8   --loraplus_lr_ratio=0   --use_dora=false   --use_rslora=false   --overwrite_output_dir;if [ -f /datas/output/train_results.json ]; then  echo 'Converting all_results.json to single-line format...' >&2;  python3 -c "import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))" > /tmp/all_results_single.json;  mv /tmp/all_results_single.json /datas/output/train_results.json;  echo 'JSON conversion complete' >&2;fi;cat /datas/output/train_results.json;sync;sleep 10;echo completed > /datas/output/$$$$.pid;sleep 30;exit 0 && echo completed > /datas/output/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 15:20:19 +0800
      Finished:     Fri, 06 Feb 2026 15:38:05 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:
      JOB_COMPLETION_INDEX:   (v1:metadata.labels['batch.kubernetes.io/job-completion-index'])
      KATIB_TRIAL_NAME:       (v1:metadata.labels['katib.kubeflow.org/trial'])
    Mounts:
      /datas from trainer-datas (rw)
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
  metrics-logger-and-collector:
    Container ID:  containerd://244906db4f40c2947327ecd487e3f3d356acc192391b79b6ec494bdd9655b1fe
    Image:         ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
    Image ID:      ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
      -m
      train_loss
      -o-type
      minimize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /datas/output/train_results.json
      -format
      JSON
    State:          Running
      Started:      Fri, 06 Feb 2026 15:20:19 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  workload-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  credential-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  workload-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  trainer-datas:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  katib-llamafactory-qwen-sft
    ReadOnly:   false
  kube-api-access-df5ql:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

How the problem experiment is displayed in the Kubeflow UI:

(screenshot)

normal katib experiment

experiment yaml, with only the trial kind changed from "TrainJob" to "Job":

metadata:
  name: katib-llamafactory-qwen-sft3-lyt
  namespace: aict
  uid: 704b3553-3c93-496a-adb2-f69e5b40a09b
  resourceVersion: '144115325'
  generation: 1
  creationTimestamp: '2026-02-06T02:27:46Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: OpenAPI-Generator
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T02:27:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:metricsCollectorSpec:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
            f:source:
              .: {}
              f:fileSystemPath:
                .: {}
                f:format: {}
                f:kind: {}
                f:path: {}
          f:objective:
            .: {}
            f:goal: {}
            f:metricStrategies: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:parameters: {}
          f:trialTemplate:
            .: {}
            f:failureCondition: {}
            f:primaryContainerName: {}
            f:retain: {}
            f:successCondition: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:template:
                  .: {}
                  f:metadata:
                    .: {}
                    f:annotations:
                      .: {}
                      f:cni.istio.io/exclude: {}
                      f:istio.io/rev: {}
                      f:sidecar.istio.io/inject: {}
                  f:spec:
                    .: {}
                    f:containers: {}
                    f:restartPolicy: {}
                    f:schedulerName: {}
                    f:volumes: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T02:27:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T02:45:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:completionTime: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:bestTrialName: {}
            f:observation:
              .: {}
              f:metrics: {}
            f:parameterAssignments: {}
          f:startTime: {}
          f:succeededTrialList: {}
          f:trials: {}
          f:trialsSucceeded: {}
      subresource: status
spec:
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        max: '5e-05'
        min: '1e-05'
        distribution: uniform
    - name: r
      parameterType: int
      feasibleSpace:
        max: '8'
        min: '1'
        distribution: uniform
  objective:
    type: minimize
    goal: 2
    objectiveMetricName: train_loss
    metricStrategies:
      - name: train_loss
        value: min
  algorithm:
    algorithmName: random
  trialTemplate:
    retain: true
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              cni.istio.io/exclude: 'true'
              istio.io/rev: ''
              sidecar.istio.io/inject: 'false'
          spec:
            containers:
              - args:
                  - >-
                    set -x;accelerate launch   --multi_gpu  src/train.py  
                    --model_name_or_path=/datas/models  
                    --output_dir=/datas/output   --dataset_dir /datas/datasets  
                    --do_train   --report_to=tensorboard  
                    --finetuning_type=lora   --flash_attn=auto  
                    --packing=False   --plot_loss=True  
                    --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096  
                    --dataset=default   --gradient_accumulation_steps=8  
                    --learning_rate=${trialParameters.learning_rate}  
                    --logging_steps=5   --lr_scheduler_type=cosine  
                    --max_samples=100000   --num_train_epochs=1  
                    --optim=adamw_torch   --per_device_train_batch_size=2  
                    --save_steps=256   --stage=sft   --template=qwen  
                    --lora_alpha=16   --lora_dropout=0  
                    --lora_rank=${trialParameters.r}   --loraplus_lr_ratio=0  
                    --use_dora=false   --use_rslora=false  
                    --overwrite_output_dir;if [ -f
                    /datas/output/train_results.json ]; then  echo 'Converting
                    all_results.json to single-line format...' >&2;  python3 -c
                    "import json;
                    data=json.load(open('/datas/output/train_results.json'));
                    print(json.dumps({k: str(v) for k, v in data.items()},
                    separators=(',', ':')))" > /tmp/all_results_single.json;  mv
                    /tmp/all_results_single.json
                    /datas/output/train_results.json;  echo 'JSON conversion
                    complete' >&2;fi;cat
                    /datas/output/train_results.json;sync;sleep 10;echo
                    completed > /datas/output/$$$$.pid;sleep 10;exit 0
                command:
                  - sh
                  - '-c'
                image: >-
                  36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
                name: node
                resources:
                  limits:
                    cpu: '1'
                    memory: 8Gi
                    nvidia.com/gpu: '1'
                  requests:
                    cpu: '1'
                    memory: 8Gi
                    nvidia.com/gpu: '1'
                volumeMounts:
                  - mountPath: /datas
                    name: trainer-datas
            restartPolicy: Never
            schedulerName: volcano
            volumes:
              - name: trainer-datas
                persistentVolumeClaim:
                  claimName: katib-llamafactory-qwen-sft
    trialParameters:
      - name: learning_rate
        description: Learning rate
        reference: learning_rate
      - name: r
        description: LoRA rank
        reference: r
    primaryContainerName: node
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 1
  maxTrialCount: 1
  maxFailedTrialCount: 0
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /datas/output/train_results.json
        kind: File
        format: JSON
    collector:
      kind: File
  resumePolicy: Never
status:
  startTime: '2026-02-06T02:27:46Z'
  completionTime: '2026-02-06T02:45:39Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2026-02-06T02:27:46Z'
      lastTransitionTime: '2026-02-06T02:27:46Z'
    - type: Running
      status: 'False'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2026-02-06T02:45:39Z'
      lastTransitionTime: '2026-02-06T02:45:39Z'
    - type: Succeeded
      status: 'True'
      reason: ExperimentMaxTrialsReached
      message: Experiment has succeeded because max trial count has reached
      lastUpdateTime: '2026-02-06T02:45:39Z'
      lastTransitionTime: '2026-02-06T02:45:39Z'
  currentOptimalTrial:
    bestTrialName: katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
    parameterAssignments:
      - name: learning_rate
        value: '3.5283942788578944e-05'
      - name: r
        value: '2'
    observation:
      metrics:
        - name: train_loss
          min: '3.3930898904800415'
          max: '3.3930898904800415'
          latest: '3.3930898904800415'
  succeededTrialList:
    - katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
  trials: 1
  trialsSucceeded: 1
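The successCondition and failureCondition in the trialTemplate above are gjson-style queries against the trial Job's status conditions. As a plain-Python sketch of the logic the success condition expresses (Katib itself evaluates the gjson expression; the helper name here is illustrative only):

```python
# Sketch of what successCondition
#   status.conditions.#(type=="Complete")#|#(status=="True")#
# checks: at least one condition with type "Complete" and status "True".
def job_succeeded(status: dict) -> bool:
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )

# A completed batch/v1 Job reports a condition like this:
completed_status = {"conditions": [{"type": "Complete", "status": "True"}]}
```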

State of the pod:

# kubectl get pod | grep lyt | grep -v old
katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x                0/2     Completed          0             5h54m

Log of the metrics collector:

# kubectl logs -f -c metrics-logger-and-collector katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x
I0206 02:28:24.004010      68 main.go:400] Trial Name: katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
I0206 02:45:13.921332      68 main.go:143] {
I0206 02:45:13.921369      68 main.go:143]     "epoch": 1.0,
I0206 02:45:13.921381      68 main.go:143]     "total_flos": 8399292649701376.0,
I0206 02:45:13.921394      68 main.go:143]     "train_loss": 3.3930898904800415,
I0206 02:45:13.921406      68 main.go:143]     "train_runtime": 969.046,
I0206 02:45:13.921410      68 main.go:143]     "train_samples_per_second": 5.811,
I0206 02:45:13.921423      68 main.go:143]     "train_steps_per_second": 0.363
I0206 02:45:13.921689      68 main.go:143] }
W0206 02:45:36.212115      68 file-metricscollector.go:143] Metrics will not have timestamp since {"epoch":"1.0","total_flos":"8399292649701376.0","train_loss":"3.3930898904800415","train_runtime":"969.046","train_samples_per_second":"5.811","train_steps_per_second":"0.363"} doesn't have the key timestamp
I0206 02:45:36.236663      68 main.go:459] Metrics reported. :
metric_logs:{time_stamp:"0001-01-01T00:00:00Z"  metric:{name:"train_loss"  value:"3.3930898904800415"}}
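The W0206 warning above is expected with this setup: the single-line JSON written by the trial contains no `timestamp` key, so the collector reports the zero timestamp. A minimal sketch of the JSON conversion step that also injects a timestamp, assuming the same file path used in this trial (the function name is illustrative):

```python
# Sketch: rewrite train_results.json as single-line JSON with string values
# (matching the shell one-liner in the trial spec) and add a "timestamp" key
# so the file metrics collector can attach a real timestamp.
import json
import datetime

def convert_results(path="/datas/output/train_results.json"):
    with open(path) as f:
        data = json.load(f)
    out = {k: str(v) for k, v in data.items()}
    out["timestamp"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with open(path, "w") as f:
        f.write(json.dumps(out, separators=(",", ":")))
```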

Describe output for the pod:

# kubectl describe pod katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x 
Name:             katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x
Namespace:        aict
Priority:         0
Service Account:  default
Node:             llm1/192.168.1.4
Start Time:       Fri, 06 Feb 2026 10:28:22 +0800
Labels:           batch.kubernetes.io/controller-uid=91c17c88-fb99-4d33-8bf3-83b2fde6eb03
                  batch.kubernetes.io/job-name=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
                  controller-uid=91c17c88-fb99-4d33-8bf3-83b2fde6eb03
                  job-name=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
                  katib.kubeflow.org/experiment=katib-llamafactory-qwen-sft3-lyt
                  katib.kubeflow.org/trial=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
Annotations:      cni.istio.io/exclude: true
                  cni.projectcalico.org/containerID: 84d074e5dab08b026f5ac653a2d78a6ff3e9cc4744d8a6b3ceacfe71a7f8df33
                  cni.projectcalico.org/podIP: 
                  cni.projectcalico.org/podIPs: 
                  istio.io/rev: 
                  scheduling.k8s.io/group-name: podgroup-91c17c88-fb99-4d33-8bf3-83b2fde6eb03
                  sidecar.istio.io/inject: false
Status:           Succeeded
IP:               10.42.0.187
IPs:
  IP:           10.42.0.187
Controlled By:  Job/katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
Containers:
  node:
    Container ID:  containerd://34cca64a540c52daede81cc96587da6432fc3a4b1d9cac0368c032c59c96b0da
    Image:         36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
    Image ID:      36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia@sha256:2899589618c16624103fb0170b865119fce8af891bb38dbf1be36b8c4f2cdc2f
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      set -x;accelerate launch   --multi_gpu  src/train.py   --model_name_or_path=/datas/models   --output_dir=/datas/output   --dataset_dir /datas/datasets   --do_train   --report_to=tensorboard   --finetuning_type=lora   --flash_attn=auto   --packing=False   --plot_loss=True   --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096   --dataset=default   --gradient_accumulation_steps=8   --learning_rate=3.5283942788578944e-05   --logging_steps=5   --lr_scheduler_type=cosine   --max_samples=100000   --num_train_epochs=1   --optim=adamw_torch   --per_device_train_batch_size=2   --save_steps=256   --stage=sft   --template=qwen   --lora_alpha=16   --lora_dropout=0   --lora_rank=2   --loraplus_lr_ratio=0   --use_dora=false   --use_rslora=false   --overwrite_output_dir;if [ -f /datas/output/train_results.json ]; then  echo 'Converting all_results.json to single-line format...' >&2;  python3 -c "import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))" > /tmp/all_results_single.json;  mv /tmp/all_results_single.json /datas/output/train_results.json;  echo 'JSON conversion complete' >&2;fi;cat /datas/output/train_results.json;sync;sleep 10;echo completed > /datas/output/$$$$.pid;sleep 10;exit 0 && echo completed > /datas/output/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 10:28:23 +0800
      Finished:     Fri, 06 Feb 2026 10:45:34 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:
      KATIB_TRIAL_NAME:   (v1:metadata.labels['katib.kubeflow.org/trial'])
    Mounts:
      /datas from trainer-datas (rw)
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l7s79 (ro)
  metrics-logger-and-collector:
    Container ID:  containerd://e41a258f1595fa7b6df7e59cc6bf1e48eff53f29c840f0187556245660ecb8ce
    Image:         ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
    Image ID:      ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
      -m
      train_loss
      -o-type
      minimize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /datas/output/train_results.json
      -format
      JSON
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 10:28:24 +0800
      Finished:     Fri, 06 Feb 2026 10:45:36 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l7s79 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  trainer-datas:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  katib-llamafactory-qwen-sft
    ReadOnly:   false
  kube-api-access-l7s79:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

The normal experiment as displayed in the Kubeflow UI:

(screenshot)

What did you expect to happen?

I expect the TrainJob experiment to work like the Job experiment: the file-metricscollector should exit normally after collecting metrics, and the trial status should show Succeeded in the Kubeflow UI.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.32.10+rke2r1
Kustomize Version: v5.5.0
Server Version: v1.32.10+rke2r1

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/katib/katib-controller:v0.19.0

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.19.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: [premnath.vel@gmail.com](mailto:premnath.vel@gmail.com)
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, grpcio, kubeflow-training, kubernetes, protobuf, setuptools, six, urllib3
Required-by:

Impacted by this bug?

The Katib TrainJob experiment does not work correctly.
