Description
Hi,

You probably have no control over how the pd-csi DaemonSet is deployed on GKE, but I am taking my chances.

After upgrading to GKE 1.28, some gce-pd-driver containers started getting OOM killed. Before a container is killed, its last log line is `Checking for issues with fsck on disk: /dev/disk/by-id/google-restore-aus-southeast1-fcb9-pg-data-pg-main-0-7279`. That disk is a multi-TB disk attached to a pod. My guess is that 50MB (the resources.memory.limit GKE sets on the gce-pd-driver container) is not enough to run fsck on such a large disk.

Any chance you could reach out to someone at GKE to increase that memory limit (although baseline usage, fsck excepted, is ~10MB, so 50MB otherwise seems reasonable)? Or how could I skip that fsck check?
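In case it is useful, here is a minimal client-go sketch for checking the memory limit currently set on that container. It assumes the GKE-managed DaemonSet is called pdcsi-node and lives in the kube-system namespace (both names are my assumption, not something I verified for every GKE version), and since GKE manages that DaemonSet this only inspects the value rather than changing it.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Fetch the node DaemonSet; "pdcsi-node" / "kube-system" are assumptions
	// about how GKE deploys the managed PD CSI driver.
	ds, err := clientset.AppsV1().DaemonSets("kube-system").Get(context.Background(), "pdcsi-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Print the memory limit declared on each container, including gce-pd-driver.
	for _, c := range ds.Spec.Template.Spec.Containers {
		limit := c.Resources.Limits[corev1.ResourceMemory]
		fmt.Printf("container %s memory limit: %s\n", c.Name, limit.String())
	}
}
```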
If that helps, this is the part of my Go code (running elsewhere) that seems to, down the line, trigger the call to fsck:
```go
// Build a GET request for /api/v1/nodes/<nodeName>/proxy/metrics, which the
// API server proxies to the kubelet's /metrics endpoint on that node.
req := clientset.
	CoreV1().
	RESTClient().
	Get().
	Resource("nodes").
	Name(nodeName).
	SubResource("proxy").
	Suffix("metrics")

respBody, err := req.DoRaw(ctx)
if err != nil {
	return errors.Errorf("failed to get stats from kubelet on node %s: %v", nodeName, err)
}
```
Thanks!

(please don't recommend I reach out to my TAM at GCP, we don't have one haha)