
gce-pd-driver container OOM killed after upgrade to GKE 1.28 #1782

@JordanP

Description

Hi,
You probably have no control over how the pd-csi DaemonSet is deployed on GKE, but I am taking my chances.

After upgrading to GKE 1.28, some gce-pd-driver containers started getting OOM killed. Before a container is killed, its last log line is "Checking for issues with fsck on disk: /dev/disk/by-id/google-restore-aus-southeast1-fcb9-pg-data-pg-main-0-7279". That disk is a multi-TB disk attached to a pod. My guess is that 50MB (the resources.limits.memory GKE sets for the gce-pd-driver container) is not enough to run fsck on such a large disk.

Any chance you could reach out to someone at GKE to increase that memory limit? (Baseline usage, fsck excepted, is ~10MB, so 50MB otherwise seems reasonable.) Or how could I skip that fsck check?
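For reference, this is roughly what the relevant part of the managed DaemonSet looks like; the names and values here are illustrative (on GKE the DaemonSet is deployed as pdcsi-node in kube-system and is managed, so edits to it get reverted):

```yaml
# Illustrative fragment of the kube-system/pdcsi-node DaemonSet spec.
# Managed by GKE: manual changes to these limits are reconciled away.
containers:
  - name: gce-pd-driver
    resources:
      limits:
        memory: 50Mi   # fsck on a multi-TB ext4 volume can exceed this
```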

If it helps, this is the part of my Go code (running elsewhere) that seems to trigger, down the line, the call to fsck.

	// Query the kubelet /metrics endpoint on nodeName via the API
	// server's "nodes/proxy" subresource and handle the response.
	req := clientset.
		CoreV1().
		RESTClient().
		Get().
		Resource("nodes").
		Name(nodeName).
		SubResource("proxy").
		Suffix("metrics")
	respBody, err := req.DoRaw(ctx)
	if err != nil {
		return errors.Errorf("failed to get stats from kubelet on node %s: %s", nodeName, err)
	}
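For context, the payload that DoRaw returns is Prometheus text format. A minimal stdlib-only sketch of pulling one metric family out of it (the metric name here, kubelet_volume_stats_used_bytes, is one of the kubelet's per-PVC volume stats; the parsing is deliberately naive and skips HELP/TYPE comments):

```go
package main

import (
	"fmt"
	"strings"
)

// parseGauge extracts "series value" pairs for a given metric name from a
// Prometheus text-format payload. Comment lines and other metrics are skipped.
func parseGauge(payload, metric string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(payload, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metric) {
			continue
		}
		// A sample line looks like: name{labels} value
		if i := strings.LastIndex(line, " "); i > 0 {
			out[line[:i]] = line[i+1:]
		}
	}
	return out
}

func main() {
	sample := `# HELP kubelet_volume_stats_used_bytes Number of used bytes in the volume
kubelet_volume_stats_used_bytes{namespace="db",persistentvolumeclaim="pg-data-pg-main-0"} 1.2e+12
`
	for series, value := range parseGauge(sample, "kubelet_volume_stats_used_bytes") {
		fmt.Println(series, "=", value)
	}
}
```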

Thanks !

(Please don't recommend I reach out to my TAM at GCP, we don't have one haha)

Metadata

Labels: lifecycle/frozen (indicates that an issue or PR should not be auto-closed due to staleness)