Open
Description
Reported by Maryam: the GKM Agent gets into a state where it loops continuously, logging "GKMCacheNode CacheStatus Missing!!!!" on every pass.
Sample logs
kubectl logs -f -n gkm-system gkm-agent-bvwfk
{"level":"info","ts":"2025-09-11T17:23:22Z","logger":"setup","msg":"Logging","Level":"info"}
{"level":"info","ts":"2025-09-11T17:23:22Z","logger":"setup","msg":"starting gkm-agent"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"starting server","name":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache","source":"kind source: *v1alpha1.GKMCacheNode"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache","source":"kind source: *v1alpha1.GKMCache"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache","source":"kind source: *v1alpha1.ClusterGKMCacheNode"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache","source":"kind source: *v1alpha1.ClusterGKMCache"}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting Controller","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache"}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting workers","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache","worker count":1}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting Controller","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache"}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting workers","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache","worker count":1}
{"level":"info","ts":"2025-09-11T17:24:12Z","logger":"agent-ns","msg":"No GKMCacheNode found"}
{"level":"info","ts":"2025-09-11T17:24:12Z","logger":"agent-ns","msg":"Create GKMCacheNode object","Namespace":"gkm-test-ns-scoped","CacheNodeName":"llama-3-1-8b-instruct-62bb947f"}
time="2025-09-11T17:24:28Z" level=error msg="Failed to init device: failed to get GPU information: could not get system info"
time="2025-09-11T17:24:28Z" level=error msg="Failed to start device of type AMD"
time="2025-09-11T17:24:28Z" level=error msg="unsupported Device"
time="2025-09-11T17:24:28Z" level=error msg="Could not init the gpu device going to try again"
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"Detected GPU Devices:","gpus":{"gpus":[{"gpuType":"Aldebaran/MI200 [Instinct MI210]","driverVersion":"Linuxversion6.12.10-100.fc40.x86_64(mockbuild@03d6781744ee4221b1f3020091f6fc28)(gcc(GCC)14.2.120240912(RedHat14.2.1-3),GNUldversion2.41-38.fc40)#1SMPPREEMPT_DYNAMICFriJan1718:03:20UTC2025","ids":[0,1]}]}}
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"Calling KubeAPI to Update GKMCacheNode Status","reason":"Update GPU list","Namespace":"gkm-test-ns-scoped","CacheNodeName":"llama-3-1-8b-instruct-62bb947f"}
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"Cache being deleted, removing extracted cache from host","namespace":"gkm-test-ns-scoped","name":"vector-add-cache-rocm","digest":"sha256:91b5421b460d1a5c71bf32cafb6db5419048e90960c2b6ccaf4ce0d5584dc69f"}
{"level":"info","ts":"2025-09-11T17:24:55Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:25:00Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:25:05Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:25:10Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
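To confirm the loop from a saved copy of the agent log, a small script can tally how often each structured message appears. This is only a triage sketch, not part of GKM: the function name `count_messages` and the file name `agent.log` are hypothetical, and it assumes the log mixes JSON lines with plain `time="..." level=...` lines, as in the sample above.

```python
import json
import sys
from collections import Counter


def count_messages(lines):
    """Count JSON log entries by (logger, msg); non-JSON lines are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            # Plain-text lines such as: time="..." level=error msg="..."
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Truncated or malformed JSON line; ignore it.
            continue
        counts[(entry.get("logger", ""), entry.get("msg", ""))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python3 count_msgs.py agent.log
    with open(sys.argv[1], encoding="utf-8") as fh:
        for (logger, msg), n in count_messages(fh).most_common(5):
            print(f"{n:6d}  {logger}  {msg}")
```

One way to use it is to save the output of `kubectl logs -n gkm-system gkm-agent-bvwfk` to a file and pass that file as the argument. A count for "GKMCacheNode CacheStatus Missing!!!!" that keeps growing between runs would be consistent with the loop reported here, which in the sample repeats roughly every 5 seconds.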