Description
What happened?
I created an experiment with the Python SDK using a TrainJob trial, and the same thing occurred as in #2614 (pod state is 2/3, NotReady; the metrics-collector container is still running with no error).
However, when I duplicated the experiment YAML and only changed the trial kind from "TrainJob" to "Job", it worked fine: the metrics-collector container exited normally and the experiment reached the Succeeded state.
I think this may be a new bug related to TrainJob, different from #2614.
Problem Katib experiment
Python code:
import kubeflow.katib as katib
from kubeflow.katib import KatibClient
from kubeflow.katib.models import (
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1AlgorithmSpec,
    V1beta1ObjectiveSpec,
    V1beta1ParameterSpec,
    V1beta1TrialTemplate,
    V1beta1TrialParameterSpec,
    V1ObjectMeta
)

parameters = [
    V1beta1ParameterSpec(
        name="learning_rate",
        parameter_type="double",
        feasible_space={"min": "1e-05", "max": "5e-05"}
    ),
    V1beta1ParameterSpec(
        name="r",
        parameter_type="int",
        feasible_space={"min": "1", "max": "8"}
    )
]
TRAIN_IMAGE = "36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev"
EXP_NAME = "katib-llamafactory-qwen-sft3-lyt-old"
NAMESPACE = "aict"
trial_spec = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "spec": {
        "podTemplateOverrides": [
            {
                "spec": {
                    "containers": [
                        {
                            "name": "node",
                            "volumeMounts": [
                                {
                                    "mountPath": "/datas",
                                    "name": "trainer-datas",
                                },
                            ],
                        },
                    ],
                    "volumes": [
                        {
                            "name": "trainer-datas",
                            "persistentVolumeClaim": {
                                "claimName": "katib-llamafactory-qwen-sft"
                            },
                        },
                    ],
                },
                "targetJobs": [
                    {
                        "name": "node",
                    },
                ],
            },
        ],
        "runtimeRef": {
            "apiGroup": "trainer.kubeflow.org",
            "kind": "ClusterTrainingRuntime",
            "name": "custom-test",
        },
        "trainer": {
            "numNodes": 1,
            "image": TRAIN_IMAGE,
            "command": [
                "sh",
                "-c",
                "set -x;"
                "accelerate launch "
                " --multi_gpu"
                " src/train.py "
                " --model_name_or_path=/datas/models "
                " --output_dir=/datas/output "
                " --dataset_dir /datas/datasets "
                " --do_train "
                " --report_to=tensorboard "
                " --finetuning_type=lora "
                " --flash_attn=auto "
                " --packing=False "
                " --plot_loss=True "
                " --ddp_timeout=180000000 "
                " --fp16=True "
                " --cutoff_len=4096 "
                " --dataset=default "
                " --gradient_accumulation_steps=8 "
                " --learning_rate=${trialParameters.learning_rate} "
                " --logging_steps=5 "
                " --lr_scheduler_type=cosine "
                " --max_samples=100000 "
                " --num_train_epochs=1 "
                " --optim=adamw_torch "
                " --per_device_train_batch_size=2 "
                " --save_steps=256 "
                " --stage=sft "
                " --template=qwen "
                " --lora_alpha=16 "
                " --lora_dropout=0 "
                " --lora_rank=${trialParameters.r} "
                " --loraplus_lr_ratio=0 "
                " --use_dora=false "
                " --use_rslora=false "
                " --overwrite_output_dir;"
                "if [ -f /datas/output/train_results.json ]; then"
                " echo 'Converting all_results.json to single-line format...' >&2;"
                " python3 -c \"import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))\" > /tmp/all_results_single.json;"
                " mv /tmp/all_results_single.json /datas/output/train_results.json;"
                " echo 'JSON conversion complete' >&2;"
                "fi;"
                "cat /datas/output/train_results.json;"
                "sync;"
                "sleep 10;"
                "echo completed > /datas/output/$$$$.pid;"
                "sleep 30;"
                "exit 0"
            ],
            "resourcesPerNode": {
                "limits": {
                    "cpu": "1",
                    "memory": "8Gi",
                    "nvidia.com/gpu": "1",
                },
                "requests": {
                    "cpu": "1",
                    "memory": "8Gi",
                    "nvidia.com/gpu": "1",
                },
            },
        },
    }
}
trial_template = V1beta1TrialTemplate(
    primary_container_name="node",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="learning_rate",
            description="Learning rate",
            reference="learning_rate"
        ),
        V1beta1TrialParameterSpec(
            name="r",
            description="LoRA rank",
            reference="r"
        )
    ],
    trial_spec=trial_spec,
    success_condition='status.conditions.#(type=="Complete")#|#(status=="True")#',
    failure_condition='status.conditions.#(type=="Failed")#|#(status=="True")#',
    retain=True,
)
experiment_spec = V1beta1ExperimentSpec(
    algorithm=V1beta1AlgorithmSpec(algorithm_name="random"),
    objective=V1beta1ObjectiveSpec(
        type="minimize",
        goal=2.0,
        objective_metric_name="train_loss",
        metric_strategies=[{"name": "train_loss", "value": "min"}]
    ),
    parameters=parameters,
    trial_template=trial_template,
    max_trial_count=2,
    parallel_trial_count=1,
    metrics_collector_spec={
        "collector": {"kind": "File"},
        "source": {
            "fileSystemPath": {
                "kind": "File",
                "path": "/datas/output/train_results.json",
                "format": "JSON",
            }
        }
    },
    max_failed_trial_count=1
)
experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=V1ObjectMeta(name=EXP_NAME, namespace=NAMESPACE),
    spec=experiment_spec
)

cl = KatibClient(namespace=NAMESPACE)
cl.create_experiment(experiment)
cl.wait_for_experiment_condition(name=EXP_NAME)
print(cl.get_optimal_hyperparameters(EXP_NAME))
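For comparison, here is a minimal sketch of the trial_spec used by the working variant: the only change is that the trial runs as a plain batch/v1 Job instead of a TrainJob. The values are reconstructed from the second experiment YAML further down (this is a sketch, not the exact script I ran), and the long training command is elided:

# Sketch of the working trial_spec, reconstructed from the succeeding experiment's YAML below.
# Only the trial resource kind changes; the rest of the script above stays the same.
job_trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "cni.istio.io/exclude": "true",
                    "istio.io/rev": "",
                    "sidecar.istio.io/inject": "false",
                },
            },
            "spec": {
                "restartPolicy": "Never",
                "schedulerName": "volcano",
                "containers": [
                    {
                        "name": "node",
                        "image": TRAIN_IMAGE,
                        "command": ["sh", "-c"],
                        # same training command as above, including the
                        # ${trialParameters.learning_rate} and ${trialParameters.r} placeholders
                        "args": ["set -x; accelerate launch ... --overwrite_output_dir; ... exit 0"],
                        "resources": {
                            "limits": {"cpu": "1", "memory": "8Gi", "nvidia.com/gpu": "1"},
                            "requests": {"cpu": "1", "memory": "8Gi", "nvidia.com/gpu": "1"},
                        },
                        "volumeMounts": [{"mountPath": "/datas", "name": "trainer-datas"}],
                    },
                ],
                "volumes": [
                    {
                        "name": "trainer-datas",
                        "persistentVolumeClaim": {"claimName": "katib-llamafactory-qwen-sft"},
                    },
                ],
            },
        },
    },
}

Passing this dict as trial_spec to V1beta1TrialTemplate (everything else unchanged) produced the succeeded experiment shown below.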
Experiment YAML:
metadata:
name: katib-llamafactory-qwen-sft3-lyt-old
namespace: aict
uid: b41ef14b-7287-42f9-9dbf-db43b0f8f1de
resourceVersion: '146317884'
generation: 1
creationTimestamp: '2026-02-06T07:20:00Z'
finalizers:
- update-prometheus-metrics
managedFields:
- manager: OpenAPI-Generator
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2026-02-06T07:20:00Z'
fieldsType: FieldsV1
fieldsV1:
f:spec:
.: {}
f:algorithm:
.: {}
f:algorithmName: {}
f:maxFailedTrialCount: {}
f:maxTrialCount: {}
f:metricsCollectorSpec:
.: {}
f:collector:
.: {}
f:kind: {}
f:source:
.: {}
f:fileSystemPath:
.: {}
f:format: {}
f:kind: {}
f:path: {}
f:objective:
.: {}
f:goal: {}
f:metricStrategies: {}
f:objectiveMetricName: {}
f:type: {}
f:parallelTrialCount: {}
f:parameters: {}
f:trialTemplate:
.: {}
f:failureCondition: {}
f:primaryContainerName: {}
f:retain: {}
f:successCondition: {}
f:trialParameters: {}
f:trialSpec:
.: {}
f:apiVersion: {}
f:kind: {}
f:spec:
.: {}
f:podTemplateOverrides: {}
f:runtimeRef:
.: {}
f:apiGroup: {}
f:kind: {}
f:name: {}
f:trainer:
.: {}
f:command: {}
f:image: {}
f:numNodes: {}
f:resourcesPerNode:
.: {}
f:limits:
.: {}
f:cpu: {}
f:memory: {}
f:nvidia.com/gpu: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:nvidia.com/gpu: {}
- manager: katib-controller
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2026-02-06T07:20:00Z'
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"update-prometheus-metrics": {}
- manager: katib-controller
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2026-02-06T07:20:14Z'
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:conditions: {}
f:currentOptimalTrial:
.: {}
f:observation: {}
f:runningTrialList: {}
f:startTime: {}
f:trials: {}
f:trialsRunning: {}
subresource: status
spec:
parameters:
- name: learning_rate
parameterType: double
feasibleSpace:
max: '5e-05'
min: '1e-05'
distribution: uniform
- name: r
parameterType: int
feasibleSpace:
max: '8'
min: '1'
distribution: uniform
objective:
type: minimize
goal: 2
objectiveMetricName: train_loss
metricStrategies:
- name: train_loss
value: min
algorithm:
algorithmName: random
trialTemplate:
retain: true
trialSpec:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
spec:
podTemplateOverrides:
- spec:
containers:
- name: node
volumeMounts:
- mountPath: /datas
name: trainer-datas
volumes:
- name: trainer-datas
persistentVolumeClaim:
claimName: katib-llamafactory-qwen-sft
targetJobs:
- name: node
runtimeRef:
apiGroup: trainer.kubeflow.org
kind: ClusterTrainingRuntime
name: custom-test
trainer:
command:
- sh
- '-c'
- >-
set -x;accelerate launch --multi_gpu src/train.py
--model_name_or_path=/datas/models --output_dir=/datas/output
--dataset_dir /datas/datasets --do_train
--report_to=tensorboard --finetuning_type=lora
--flash_attn=auto --packing=False --plot_loss=True
--ddp_timeout=180000000 --fp16=True --cutoff_len=4096
--dataset=default --gradient_accumulation_steps=8
--learning_rate=${trialParameters.learning_rate}
--logging_steps=5 --lr_scheduler_type=cosine
--max_samples=100000 --num_train_epochs=1
--optim=adamw_torch --per_device_train_batch_size=2
--save_steps=256 --stage=sft --template=qwen
--lora_alpha=16 --lora_dropout=0
--lora_rank=${trialParameters.r} --loraplus_lr_ratio=0
--use_dora=false --use_rslora=false --overwrite_output_dir;if
[ -f /datas/output/train_results.json ]; then echo 'Converting
all_results.json to single-line format...' >&2; python3 -c
"import json;
data=json.load(open('/datas/output/train_results.json'));
print(json.dumps({k: str(v) for k, v in data.items()},
separators=(',', ':')))" > /tmp/all_results_single.json; mv
/tmp/all_results_single.json /datas/output/train_results.json;
echo 'JSON conversion complete' >&2;fi;cat
/datas/output/train_results.json;sync;sleep 10;echo completed >
/datas/output/$$$$.pid;sleep 30;exit 0
image: >-
36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
numNodes: 1
resourcesPerNode:
limits:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
requests:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
trialParameters:
- name: learning_rate
description: Learning rate
reference: learning_rate
- name: r
description: LoRA rank
reference: r
primaryPodLabels:
batch.kubernetes.io/job-completion-index: '0'
jobset.sigs.k8s.io/replicatedjob-name: node
primaryContainerName: node
successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
parallelTrialCount: 1
maxTrialCount: 2
maxFailedTrialCount: 1
metricsCollectorSpec:
source:
fileSystemPath:
path: /datas/output/train_results.json
kind: File
format: JSON
collector:
kind: File
resumePolicy: Never
status:
startTime: '2026-02-06T07:20:00Z'
conditions:
- type: Created
status: 'True'
reason: ExperimentCreated
message: Experiment is created
lastUpdateTime: '2026-02-06T07:20:00Z'
lastTransitionTime: '2026-02-06T07:20:00Z'
- type: Running
status: 'True'
reason: ExperimentRunning
message: Experiment is running
lastUpdateTime: '2026-02-06T07:20:14Z'
lastTransitionTime: '2026-02-06T07:20:14Z'
currentOptimalTrial:
observation: {}
runningTrialList:
- katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
trials: 1
trialsRunning: 1
Pod state:
# kubectl get pod | grep old
katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4 2/3 NotReady 0 58m
katib-llamafactory-qwen-sft3-lyt-old-random-7bbc5c78d4-nnpx4 1/1 Running 0 58m
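To confirm which container keeps the trial pod stuck at 2/3 NotReady, the container statuses can also be read with the standard kubernetes Python client (a quick sketch; pod name and namespace are the ones shown above):

from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

pod = core_v1.read_namespaced_pod(
    name="katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4",
    namespace="aict",
)
for cs in pod.status.container_statuses:
    # the training container reports state.terminated (Completed),
    # while metrics-logger-and-collector still reports state.running
    if cs.state.running:
        state = "running"
    elif cs.state.terminated:
        state = f"terminated ({cs.state.terminated.reason})"
    else:
        state = "waiting"
    print(f"{cs.name}: {state}, ready={cs.ready}")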
Metrics collector log:
# kubectl logs -f -c metrics-logger-and-collector katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
I0206 07:20:19.726497 105 main.go:400] Trial Name: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
I0206 07:37:20.817003 105 main.go:143] {
I0206 07:37:20.817104 105 main.go:143] "epoch": 1.0,
I0206 07:37:20.817123 105 main.go:143] "total_flos": 8476485388075008.0,
I0206 07:37:20.817127 105 main.go:143] "train_loss": 3.3930898904800415,
I0206 07:37:20.817139 105 main.go:143] "train_runtime": 986.1327,
I0206 07:37:20.817144 105 main.go:143] "train_samples_per_second": 5.71,
I0206 07:37:20.817165 105 main.go:143] "train_steps_per_second": 0.357
I0206 07:37:20.817169 105 main.go:143] }
2026/02/06 07:37:25 Re-opening truncated file /datas/output/train_results.json ...
2026/02/06 07:37:25 Successfully reopened truncated /datas/output/train_results.json
I0206 07:37:25.229170 105 main.go:143] {"epoch":"1.0","total_flos":"8476485388075008.0","train_loss":"3.3930898904800415","train_runtime":"986.1327","train_samples_per_second":"5.71","train_steps_per_second":"0.357"}
2026/02/06 07:37:25 Re-opening moved/deleted file /datas/output/train_results.json ...
2026/02/06 07:37:25 Successfully reopened /datas/output/train_results.json
I0206 07:37:25.229414 105 main.go:143] {"epoch":"1.0","total_flos":"8476485388075008.0","train_loss":"3.3930898904800415","train_runtime":"986.1327","train_samples_per_second":"5.71","train_steps_per_second":"0.357"}
Pod describe output:
# kubectl describe pod katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
Name: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
Namespace: aict
Priority: 0
Service Account: default
Node: llm1/192.168.1.4
Start Time: Fri, 06 Feb 2026 15:20:14 +0800
Labels: batch.kubernetes.io/controller-uid=e614da32-fa81-4f5a-ac21-8232bc38224b
batch.kubernetes.io/job-completion-index=0
batch.kubernetes.io/job-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
controller-uid=e614da32-fa81-4f5a-ac21-8232bc38224b
job-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
jobset.sigs.k8s.io/global-replicas=1
jobset.sigs.k8s.io/group-name=default
jobset.sigs.k8s.io/group-replicas=1
jobset.sigs.k8s.io/job-global-index=0
jobset.sigs.k8s.io/job-group-index=0
jobset.sigs.k8s.io/job-index=0
jobset.sigs.k8s.io/job-key=77f78d77fd4a9c5f3dda557b211da8643fc5a4eb
jobset.sigs.k8s.io/jobset-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
jobset.sigs.k8s.io/jobset-uid=9ca13e49-5443-4656-ad74-327147852c5f
jobset.sigs.k8s.io/replicatedjob-name=node
jobset.sigs.k8s.io/replicatedjob-replicas=1
jobset.sigs.k8s.io/restart-attempt=0
katib.kubeflow.org/experiment=katib-llamafactory-qwen-sft3-lyt-old
katib.kubeflow.org/trial=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
service.istio.io/canonical-revision=latest
Annotations: batch.kubernetes.io/job-completion-index: 0
cni.projectcalico.org/containerID: 2358563f7dc46f4fc07175fdcd7d351a3901a777995d3f95a0b1117030b6f8af
cni.projectcalico.org/podIP: 10.42.0.119/32
cni.projectcalico.org/podIPs: 10.42.0.119/32
istio.io/rev: default
jobset.sigs.k8s.io/global-replicas: 1
jobset.sigs.k8s.io/group-name: default
jobset.sigs.k8s.io/group-replicas: 1
jobset.sigs.k8s.io/job-global-index: 0
jobset.sigs.k8s.io/job-group-index: 0
jobset.sigs.k8s.io/job-index: 0
jobset.sigs.k8s.io/job-key: 77f78d77fd4a9c5f3dda557b211da8643fc5a4eb
jobset.sigs.k8s.io/jobset-name: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
jobset.sigs.k8s.io/jobset-uid: 9ca13e49-5443-4656-ad74-327147852c5f
jobset.sigs.k8s.io/replicatedjob-name: node
jobset.sigs.k8s.io/replicatedjob-replicas: 1
jobset.sigs.k8s.io/restart-attempt: 0
kubectl.kubernetes.io/default-container: node
kubectl.kubernetes.io/default-logs-container: node
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/interceptionMode: REDIRECT
sidecar.istio.io/status:
{"initContainers":["istio-validation","istio-proxy"],"containers":null,"volumes":["workload-socket","credential-socket","workload-certs","...
traffic.sidecar.istio.io/excludeInboundPorts: 15020
traffic.sidecar.istio.io/includeInboundPorts: *
traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status: Running
IP: 10.42.0.119
IPs:
IP: 10.42.0.119
Controlled By: Job/katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
Init Containers:
istio-validation:
Container ID: containerd://279a472bd3f21151ca632eb9ab5a65d5b0f659cd122b4ef9cccb0b53db7e10f2
Image: 36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
Image ID: 36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
--log_output_level=default:info
--run-validation
--skip-rule-apply
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 06 Feb 2026 15:20:15 +0800
Finished: Fri, 06 Feb 2026 15:20:15 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 100m
memory: 128Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
istio-proxy:
Container ID: containerd://e389ec32b15abe0aac51909fbbc366210061e2a940d7f9e7ce9068637eecf0c2
Image: 36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
Image ID: 36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
State: Running
Started: Fri, 06 Feb 2026 15:20:17 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 100m
memory: 128Mi
Readiness: http-get http://:15021/healthz/ready delay=0s timeout=3s period=15s #success=1 #failure=4
Startup: http-get http://:15021/healthz/ready delay=0s timeout=3s period=1s #success=1 #failure=600
Environment:
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4 (v1:metadata.name)
POD_NAMESPACE: aict (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
ISTIO_CPU_LIMIT: 2 (limits.cpu)
PROXY_CONFIG: {"tracing":{}}
ISTIO_META_POD_PORTS: [
]
ISTIO_META_APP_CONTAINERS: node
GOMEMLIMIT: 1073741824 (limits.memory)
GOMAXPROCS: 2 (limits.cpu)
ISTIO_META_CLUSTER_ID: Kubernetes
ISTIO_META_NODE_NAME: (v1:spec.nodeName)
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_META_WORKLOAD_NAME: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
ISTIO_META_OWNER: kubernetes://apis/batch/v1/namespaces/aict/jobs/katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
ISTIO_META_MESH_ID: cluster.local
TRUST_DOMAIN: cluster.local
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/credential-uds from credential-socket (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
/var/run/secrets/tokens from istio-token (rw)
/var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
/var/run/secrets/workload-spiffe-uds from workload-socket (rw)
Containers:
node:
Container ID: containerd://c0bd832963b1034adde9b8843ca3d58f3eb9e5b898bd532b631d6575976eaf6a
Image: 36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
Image ID: 36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia@sha256:2899589618c16624103fb0170b865119fce8af891bb38dbf1be36b8c4f2cdc2f
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
set -x;accelerate launch --multi_gpu src/train.py --model_name_or_path=/datas/models --output_dir=/datas/output --dataset_dir /datas/datasets --do_train --report_to=tensorboard --finetuning_type=lora --flash_attn=auto --packing=False --plot_loss=True --ddp_timeout=180000000 --fp16=True --cutoff_len=4096 --dataset=default --gradient_accumulation_steps=8 --learning_rate=2.5746808259899108e-05 --logging_steps=5 --lr_scheduler_type=cosine --max_samples=100000 --num_train_epochs=1 --optim=adamw_torch --per_device_train_batch_size=2 --save_steps=256 --stage=sft --template=qwen --lora_alpha=16 --lora_dropout=0 --lora_rank=8 --loraplus_lr_ratio=0 --use_dora=false --use_rslora=false --overwrite_output_dir;if [ -f /datas/output/train_results.json ]; then echo 'Converting all_results.json to single-line format...' >&2; python3 -c "import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))" > /tmp/all_results_single.json; mv /tmp/all_results_single.json /datas/output/train_results.json; echo 'JSON conversion complete' >&2;fi;cat /datas/output/train_results.json;sync;sleep 10;echo completed > /datas/output/$$$$.pid;sleep 30;exit 0 && echo completed > /datas/output/$$$$.pid
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 06 Feb 2026 15:20:19 +0800
Finished: Fri, 06 Feb 2026 15:38:05 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
Environment:
JOB_COMPLETION_INDEX: (v1:metadata.labels['batch.kubernetes.io/job-completion-index'])
KATIB_TRIAL_NAME: (v1:metadata.labels['katib.kubeflow.org/trial'])
Mounts:
/datas from trainer-datas (rw)
/datas/output from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
metrics-logger-and-collector:
Container ID: containerd://244906db4f40c2947327ecd487e3f3d356acc192391b79b6ec494bdd9655b1fe
Image: ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
Image ID: ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
Port: <none>
Host Port: <none>
Args:
-t
katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
-m
train_loss
-o-type
minimize
-s-db
katib-db-manager.kubeflow:6789
-path
/datas/output/train_results.json
-format
JSON
State: Running
Started: Fri, 06 Feb 2026 15:20:19 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
Requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
Environment: <none>
Mounts:
/datas/output from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
workload-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
credential-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
workload-certs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
trainer-datas:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: katib-llamafactory-qwen-sft
ReadOnly: false
kube-api-access-df5ql:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
metrics-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
This problem Katib experiment as displayed in the Kubeflow UI:

Normal Katib experiment
Experiment YAML, with only the trial kind changed from "TrainJob" to "Job":
metadata:
name: katib-llamafactory-qwen-sft3-lyt
namespace: aict
uid: 704b3553-3c93-496a-adb2-f69e5b40a09b
resourceVersion: '144115325'
generation: 1
creationTimestamp: '2026-02-06T02:27:46Z'
finalizers:
- update-prometheus-metrics
managedFields:
- manager: OpenAPI-Generator
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2026-02-06T02:27:46Z'
fieldsType: FieldsV1
fieldsV1:
f:spec:
.: {}
f:algorithm:
.: {}
f:algorithmName: {}
f:maxFailedTrialCount: {}
f:maxTrialCount: {}
f:metricsCollectorSpec:
.: {}
f:collector:
.: {}
f:kind: {}
f:source:
.: {}
f:fileSystemPath:
.: {}
f:format: {}
f:kind: {}
f:path: {}
f:objective:
.: {}
f:goal: {}
f:metricStrategies: {}
f:objectiveMetricName: {}
f:type: {}
f:parallelTrialCount: {}
f:parameters: {}
f:trialTemplate:
.: {}
f:failureCondition: {}
f:primaryContainerName: {}
f:retain: {}
f:successCondition: {}
f:trialParameters: {}
f:trialSpec:
.: {}
f:apiVersion: {}
f:kind: {}
f:spec:
.: {}
f:template:
.: {}
f:metadata:
.: {}
f:annotations:
.: {}
f:cni.istio.io/exclude: {}
f:istio.io/rev: {}
f:sidecar.istio.io/inject: {}
f:spec:
.: {}
f:containers: {}
f:restartPolicy: {}
f:schedulerName: {}
f:volumes: {}
- manager: katib-controller
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2026-02-06T02:27:46Z'
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"update-prometheus-metrics": {}
- manager: katib-controller
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2026-02-06T02:45:39Z'
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:completionTime: {}
f:conditions: {}
f:currentOptimalTrial:
.: {}
f:bestTrialName: {}
f:observation:
.: {}
f:metrics: {}
f:parameterAssignments: {}
f:startTime: {}
f:succeededTrialList: {}
f:trials: {}
f:trialsSucceeded: {}
subresource: status
spec:
parameters:
- name: learning_rate
parameterType: double
feasibleSpace:
max: '5e-05'
min: '1e-05'
distribution: uniform
- name: r
parameterType: int
feasibleSpace:
max: '8'
min: '1'
distribution: uniform
objective:
type: minimize
goal: 2
objectiveMetricName: train_loss
metricStrategies:
- name: train_loss
value: min
algorithm:
algorithmName: random
trialTemplate:
retain: true
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
template:
metadata:
annotations:
cni.istio.io/exclude: 'true'
istio.io/rev: ''
sidecar.istio.io/inject: 'false'
spec:
containers:
- args:
- >-
set -x;accelerate launch --multi_gpu src/train.py
--model_name_or_path=/datas/models
--output_dir=/datas/output --dataset_dir /datas/datasets
--do_train --report_to=tensorboard
--finetuning_type=lora --flash_attn=auto
--packing=False --plot_loss=True
--ddp_timeout=180000000 --fp16=True --cutoff_len=4096
--dataset=default --gradient_accumulation_steps=8
--learning_rate=${trialParameters.learning_rate}
--logging_steps=5 --lr_scheduler_type=cosine
--max_samples=100000 --num_train_epochs=1
--optim=adamw_torch --per_device_train_batch_size=2
--save_steps=256 --stage=sft --template=qwen
--lora_alpha=16 --lora_dropout=0
--lora_rank=${trialParameters.r} --loraplus_lr_ratio=0
--use_dora=false --use_rslora=false
--overwrite_output_dir;if [ -f
/datas/output/train_results.json ]; then echo 'Converting
all_results.json to single-line format...' >&2; python3 -c
"import json;
data=json.load(open('/datas/output/train_results.json'));
print(json.dumps({k: str(v) for k, v in data.items()},
separators=(',', ':')))" > /tmp/all_results_single.json; mv
/tmp/all_results_single.json
/datas/output/train_results.json; echo 'JSON conversion
complete' >&2;fi;cat
/datas/output/train_results.json;sync;sleep 10;echo
completed > /datas/output/$$$$.pid;sleep 10;exit 0
command:
- sh
- '-c'
image: >-
36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
name: node
resources:
limits:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
requests:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
volumeMounts:
- mountPath: /datas
name: trainer-datas
restartPolicy: Never
schedulerName: volcano
volumes:
- name: trainer-datas
persistentVolumeClaim:
claimName: katib-llamafactory-qwen-sft
trialParameters:
- name: learning_rate
description: Learning rate
reference: learning_rate
- name: r
description: LoRA rank
reference: r
primaryContainerName: node
successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
parallelTrialCount: 1
maxTrialCount: 1
maxFailedTrialCount: 0
metricsCollectorSpec:
source:
fileSystemPath:
path: /datas/output/train_results.json
kind: File
format: JSON
collector:
kind: File
resumePolicy: Never
status:
startTime: '2026-02-06T02:27:46Z'
completionTime: '2026-02-06T02:45:39Z'
conditions:
- type: Created
status: 'True'
reason: ExperimentCreated
message: Experiment is created
lastUpdateTime: '2026-02-06T02:27:46Z'
lastTransitionTime: '2026-02-06T02:27:46Z'
- type: Running
status: 'False'
reason: ExperimentRunning
message: Experiment is running
lastUpdateTime: '2026-02-06T02:45:39Z'
lastTransitionTime: '2026-02-06T02:45:39Z'
- type: Succeeded
status: 'True'
reason: ExperimentMaxTrialsReached
message: Experiment has succeeded because max trial count has reached
lastUpdateTime: '2026-02-06T02:45:39Z'
lastTransitionTime: '2026-02-06T02:45:39Z'
currentOptimalTrial:
bestTrialName: katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
parameterAssignments:
- name: learning_rate
value: '3.5283942788578944e-05'
- name: r
value: '2'
observation:
metrics:
- name: train_loss
min: '3.3930898904800415'
max: '3.3930898904800415'
latest: '3.3930898904800415'
succeededTrialList:
- katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
trials: 1
trialsSucceeded: 1
Pod state:
# kubectl get pod | grep lyt | grep -v old
katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x 0/2 Completed 0 5h54m
Metrics collector log:
# kubectl logs -f -c metrics-logger-and-collector katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x
I0206 02:28:24.004010 68 main.go:400] Trial Name: katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
I0206 02:45:13.921332 68 main.go:143] {
I0206 02:45:13.921369 68 main.go:143] "epoch": 1.0,
I0206 02:45:13.921381 68 main.go:143] "total_flos": 8399292649701376.0,
I0206 02:45:13.921394 68 main.go:143] "train_loss": 3.3930898904800415,
I0206 02:45:13.921406 68 main.go:143] "train_runtime": 969.046,
I0206 02:45:13.921410 68 main.go:143] "train_samples_per_second": 5.811,
I0206 02:45:13.921423 68 main.go:143] "train_steps_per_second": 0.363
I0206 02:45:13.921689 68 main.go:143] }
W0206 02:45:36.212115 68 file-metricscollector.go:143] Metrics will not have timestamp since {"epoch":"1.0","total_flos":"8399292649701376.0","train_loss":"3.3930898904800415","train_runtime":"969.046","train_samples_per_second":"5.811","train_steps_per_second":"0.363"} doesn't have the key timestamp
I0206 02:45:36.236663 68 main.go:459] Metrics reported. :
metric_logs:{time_stamp:"0001-01-01T00:00:00Z" metric:{name:"train_loss" value:"3.3930898904800415"}}
Pod describe output:
# kubectl describe pod katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x
Name: katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x
Namespace: aict
Priority: 0
Service Account: default
Node: llm1/192.168.1.4
Start Time: Fri, 06 Feb 2026 10:28:22 +0800
Labels: batch.kubernetes.io/controller-uid=91c17c88-fb99-4d33-8bf3-83b2fde6eb03
batch.kubernetes.io/job-name=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
controller-uid=91c17c88-fb99-4d33-8bf3-83b2fde6eb03
job-name=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
katib.kubeflow.org/experiment=katib-llamafactory-qwen-sft3-lyt
katib.kubeflow.org/trial=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
Annotations: cni.istio.io/exclude: true
cni.projectcalico.org/containerID: 84d074e5dab08b026f5ac653a2d78a6ff3e9cc4744d8a6b3ceacfe71a7f8df33
cni.projectcalico.org/podIP:
cni.projectcalico.org/podIPs:
istio.io/rev:
scheduling.k8s.io/group-name: podgroup-91c17c88-fb99-4d33-8bf3-83b2fde6eb03
sidecar.istio.io/inject: false
Status: Succeeded
IP: 10.42.0.187
IPs:
IP: 10.42.0.187
Controlled By: Job/katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
Containers:
node:
Container ID: containerd://34cca64a540c52daede81cc96587da6432fc3a4b1d9cac0368c032c59c96b0da
Image: 36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
Image ID: 36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia@sha256:2899589618c16624103fb0170b865119fce8af891bb38dbf1be36b8c4f2cdc2f
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
set -x;accelerate launch --multi_gpu src/train.py --model_name_or_path=/datas/models --output_dir=/datas/output --dataset_dir /datas/datasets --do_train --report_to=tensorboard --finetuning_type=lora --flash_attn=auto --packing=False --plot_loss=True --ddp_timeout=180000000 --fp16=True --cutoff_len=4096 --dataset=default --gradient_accumulation_steps=8 --learning_rate=3.5283942788578944e-05 --logging_steps=5 --lr_scheduler_type=cosine --max_samples=100000 --num_train_epochs=1 --optim=adamw_torch --per_device_train_batch_size=2 --save_steps=256 --stage=sft --template=qwen --lora_alpha=16 --lora_dropout=0 --lora_rank=2 --loraplus_lr_ratio=0 --use_dora=false --use_rslora=false --overwrite_output_dir;if [ -f /datas/output/train_results.json ]; then echo 'Converting all_results.json to single-line format...' >&2; python3 -c "import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))" > /tmp/all_results_single.json; mv /tmp/all_results_single.json /datas/output/train_results.json; echo 'JSON conversion complete' >&2;fi;cat /datas/output/train_results.json;sync;sleep 10;echo completed > /datas/output/$$$$.pid;sleep 10;exit 0 && echo completed > /datas/output/$$$$.pid
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 06 Feb 2026 10:28:23 +0800
Finished: Fri, 06 Feb 2026 10:45:34 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
Environment:
KATIB_TRIAL_NAME: (v1:metadata.labels['katib.kubeflow.org/trial'])
Mounts:
/datas from trainer-datas (rw)
/datas/output from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l7s79 (ro)
metrics-logger-and-collector:
Container ID: containerd://e41a258f1595fa7b6df7e59cc6bf1e48eff53f29c840f0187556245660ecb8ce
Image: ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
Image ID: ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
Port: <none>
Host Port: <none>
Args:
-t
katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
-m
train_loss
-o-type
minimize
-s-db
katib-db-manager.kubeflow:6789
-path
/datas/output/train_results.json
-format
JSON
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 06 Feb 2026 10:28:24 +0800
Finished: Fri, 06 Feb 2026 10:45:36 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
Requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
Environment: <none>
Mounts:
/datas/output from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l7s79 (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
trainer-datas:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: katib-llamafactory-qwen-sft
ReadOnly: false
kube-api-access-l7s79:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
metrics-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
This normal Katib experiment as displayed in the Kubeflow UI:
What did you expect to happen?
I expect the TrainJob experiment to work as the Job experiment does: the file-metricscollector should exit normally, and the trial status should show Succeeded in the Kubeflow UI.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.32.10+rke2r1
Kustomize Version: v5.5.0
Server Version: v1.32.10+rke2r1
Katib controller version:
$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/katib/katib-controller:v0.19.0
Katib Python SDK version:
$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.19.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: premnath.vel@gmail.com
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, grpcio, kubeflow-training, kubernetes, protobuf, setuptools, six, urllib3
Required-by:
Impacted by this bug?
The Katib TrainJob experiment does not work properly.