GKM Agent gets in a state where a GKMCacheNode object doesn't have a status associated with it #54

@Billy99

Description

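Reported by Maryam: the GKM Agent gets into a state where it loops continuously, logging "GKMCacheNode CacheStatus Missing!!!!" every few seconds. In the sample logs below, the message starts right after the agent updates the GKMCacheNode status with the GPU list, and its "CacheNodeName" field carries the namespace ("gkm-test-ns-scoped") rather than the name of the GKMCacheNode that was created ("llama-3-1-8b-instruct-62bb947f").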

Sample logs

kubectl logs -f -n gkm-system gkm-agent-bvwfk 
{"level":"info","ts":"2025-09-11T17:23:22Z","logger":"setup","msg":"Logging","Level":"info"}
{"level":"info","ts":"2025-09-11T17:23:22Z","logger":"setup","msg":"starting gkm-agent"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"starting server","name":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache","source":"kind source: *v1alpha1.GKMCacheNode"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache","source":"kind source: *v1alpha1.GKMCache"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache","source":"kind source: *v1alpha1.ClusterGKMCacheNode"}
{"level":"info","ts":"2025-09-11T17:23:22Z","msg":"Starting EventSource","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache","source":"kind source: *v1alpha1.ClusterGKMCache"}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting Controller","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache"}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting workers","controller":"clustergkmcache","controllerGroup":"gkm.io","controllerKind":"ClusterGKMCache","worker count":1}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting Controller","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache"}
{"level":"info","ts":"2025-09-11T17:23:23Z","msg":"Starting workers","controller":"gkmcache","controllerGroup":"gkm.io","controllerKind":"GKMCache","worker count":1}
{"level":"info","ts":"2025-09-11T17:24:12Z","logger":"agent-ns","msg":"No GKMCacheNode found"}
{"level":"info","ts":"2025-09-11T17:24:12Z","logger":"agent-ns","msg":"Create GKMCacheNode object","Namespace":"gkm-test-ns-scoped","CacheNodeName":"llama-3-1-8b-instruct-62bb947f"}
time="2025-09-11T17:24:28Z" level=error msg="Failed to init device: failed to get GPU information: could not get system info"
time="2025-09-11T17:24:28Z" level=error msg="Failed to start device of type AMD"
time="2025-09-11T17:24:28Z" level=error msg="unsupported Device"
time="2025-09-11T17:24:28Z" level=error msg="Could not init the gpu device going to try again"
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"Detected GPU Devices:","gpus":{"gpus":[{"gpuType":"Aldebaran/MI200 [Instinct MI210]","driverVersion":"Linuxversion6.12.10-100.fc40.x86_64(mockbuild@03d6781744ee4221b1f3020091f6fc28)(gcc(GCC)14.2.120240912(RedHat14.2.1-3),GNUldversion2.41-38.fc40)#1SMPPREEMPT_DYNAMICFriJan1718:03:20UTC2025","ids":[0,1]}]}}
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"Calling KubeAPI to Update GKMCacheNode Status","reason":"Update GPU list","Namespace":"gkm-test-ns-scoped","CacheNodeName":"llama-3-1-8b-instruct-62bb947f"}
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:24:54Z","logger":"agent-ns","msg":"Cache being deleted, removing extracted cache from host","namespace":"gkm-test-ns-scoped","name":"vector-add-cache-rocm","digest":"sha256:91b5421b460d1a5c71bf32cafb6db5419048e90960c2b6ccaf4ce0d5584dc69f"}
{"level":"info","ts":"2025-09-11T17:24:55Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:25:00Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:25:05Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
{"level":"info","ts":"2025-09-11T17:25:10Z","logger":"agent-ns","msg":"GKMCacheNode CacheStatus Missing!!!!","Namespace":"gkm-test-ns-scoped","Name":"llama-3-1-8b-instruct","CacheNodeName":"gkm-test-ns-scoped","Digest":"sha256:07395351038f43228e6f9c5983bd1b7d8b3650c95b540f712ed703514483b9fe"}
