
gce-pd-driver container OOM killed after upgrade to GKE 1.28 #1782

@JordanP

Description

Hi,
You probably have no control over how the pd-csi DaemonSet is deployed on GKE, but I am taking my chances.

After upgrading to GKE 1.28, some gce-pd-driver containers started getting OOM killed. Before a container is killed, its last log line is "Checking for issues with fsck on disk: /dev/disk/by-id/google-restore-aus-southeast1-fcb9-pg-data-pg-main-0-7279". That disk is a multi-TB disk attached to a pod. My guess is that 50MB (the resources.limits.memory GKE sets for the gce-pd-driver container) is not enough to run fsck on such a large disk.

Any chance you could reach out to someone at GKE to increase that memory limit? (Baseline usage, fsck excepted, is ~10MB, so 50MB otherwise seems reasonable.) Or how could I skip that fsck check?
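For reference, this is roughly what the relevant part of the managed DaemonSet looks like; the names and values here are illustrative (on GKE the DaemonSet is deployed as pdcsi-node in kube-system and is managed, so edits to it get reverted):

```yaml
# Illustrative fragment of the kube-system/pdcsi-node DaemonSet spec.
# Managed by GKE: manual changes to these limits are reconciled away.
containers:
  - name: gce-pd-driver
    resources:
      limits:
        memory: 50Mi   # fsck on a multi-TB ext4 volume can exceed this
```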

If it helps, this is the part of my Go code (running elsewhere) that seems to trigger, down the line, the call to fsck.

	// Query the kubelet /metrics endpoint on nodeName via the API
	// server's "nodes/proxy" subresource and handle the response.
	req := clientset.
		CoreV1().
		RESTClient().
		Get().
		Resource("nodes").
		Name(nodeName).
		SubResource("proxy").
		Suffix("metrics")
	respBody, err := req.DoRaw(ctx)
	if err != nil {
		return errors.Errorf("failed to get stats from kubelet on node %s: %s", nodeName, err)
	}
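For context, the payload that DoRaw returns is Prometheus text format. A minimal stdlib-only sketch of pulling one metric family out of it (the metric name here, kubelet_volume_stats_used_bytes, is one of the kubelet's per-PVC volume stats; the parsing is deliberately naive and skips HELP/TYPE comments):

```go
package main

import (
	"fmt"
	"strings"
)

// parseGauge extracts "series value" pairs for a given metric name from a
// Prometheus text-format payload. Comment lines and other metrics are skipped.
func parseGauge(payload, metric string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(payload, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metric) {
			continue
		}
		// A sample line looks like: name{labels} value
		if i := strings.LastIndex(line, " "); i > 0 {
			out[line[:i]] = line[i+1:]
		}
	}
	return out
}

func main() {
	sample := `# HELP kubelet_volume_stats_used_bytes Number of used bytes in the volume
kubelet_volume_stats_used_bytes{namespace="db",persistentvolumeclaim="pg-data-pg-main-0"} 1.2e+12
`
	for series, value := range parseGauge(sample, "kubelet_volume_stats_used_bytes") {
		fmt.Println(series, "=", value)
	}
}
```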

Thanks !

(Please don't recommend I reach out to my TAM at GCP, we don't have one haha)

Metadata

Labels: lifecycle/frozen (indicates that an issue or PR should not be auto-closed due to staleness)