Memory issue with qctool-worker on K8s #37

@ch-koehler-gaf

Description

We are trying to deploy the tool on a new Kubernetes cluster and are facing issues with the worker: it keeps consuming memory until it hits the configured memory limit and is OOM-killed, even with extremely high limits such as 50 GiB. This happens immediately after startup, with no load on the worker.
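For reference, this is roughly how one can confirm the memory limit the kernel actually enforces inside the container (a sketch; the standard cgroup v2 and v1 paths are assumed, and may differ depending on the runtime):

```shell
# Sketch: print the memory limit the kernel enforces for this container.
# These are the standard cgroup v2 and v1 locations; adjust if your
# runtime mounts cgroups elsewhere.
if [ -f /sys/fs/cgroup/memory.max ]; then
    LIMIT=$(cat /sys/fs/cgroup/memory.max)                      # cgroup v2
elif [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
    LIMIT=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)    # cgroup v1
else
    LIMIT="unknown"
fi
echo "enforced memory limit: $LIMIT"
```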

Jumping into the container before it gets killed shows that the squid cache seems to be responsible for this behavior:

```
top - 08:43:45 up 1 day, 18:27,  0 users,  load average: 3.02, 0.89, 0.43
Tasks:  22 total,   2 running,  20 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.8 us,  3.0 sy,  0.0 ni, 84.8 id,  0.1 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem : 128795.3 total,  11098.1 free, 107866.0 used,   9831.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  19747.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     11 root      20   0   10.1g   1.0g  28928 S 363.3   0.8   1:33.64 java
     18 root      20   0  568.1g  31.7g  12544 R  99.3  25.2   0:22.17 squid
      1 root      20   0    2616   1280   1280 S   0.0   0.0   0:00.01 sh
      7 root      20   0   25772  20968   8960 S   0.0   0.0   0:00.19 supervisord
      9 root      20   0    4212   3072   2816 S   0.0   0.0   0:00.00 bash
     10 root      20   0    2616   1536   1536 S   0.0   0.0   0:00.00 apachectl
     12 postgres  20   0  216260  29184  27136 S   0.0   0.0   0:00.04 postgres
     14 root      20   0  100796  20992   8448 S   0.0   0.0   0:00.14 python3
     25 root      20   0   11268   8392   6856 S   0.0   0.0   0:00.03 apache2
     39 www-data  20   0 2002356   4632   2816 S   0.0   0.0   0:00.00 apache2
     40 www-data  20   0 2002356   4632   2816 S   0.0   0.0   0:00.00 apache2
     95 postgres  20   0  216400   5812   3840 S   0.0   0.0   0:00.00 postgres
     98 postgres  20   0  216392   6068   4096 S   0.0   0.0   0:00.00 postgres
    102 postgres  20   0  216392  10164   8192 S   0.0   0.0   0:00.00 postgres
    103 postgres  20   0  217844   9396   7168 S   0.0   0.0   0:00.00 postgres
    104 postgres  20   0  217852   8372   6144 S   0.0   0.0   0:00.00 postgres
    229 root      20   0    2616   1280   1280 S   0.0   0.0   0:00.00 sh
    235 root      20   0    2616    256    256 S   0.0   0.0   0:00.00 sh
    236 root      20   0    2644   1536   1536 S   0.0   0.0   0:00.00 script
    237 root      20   0    2616   1280   1280 S   0.0   0.0   0:00.00 sh
    238 root      20   0    4248   3328   2816 S   0.0   0.0   0:00.00 bash
    241 root      20   0    6112   3072   2560 R   0.0   0.0   0:00.00 top
```

Tested with versions 2.2.6, 2.2.9 and 2.3.3.

Any idea how to address this? Is there a possible fix or workaround? Maybe updating squid to a newer version would help (the image ships squid v4.10, while the latest release is v7.4)?
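As a possible workaround (an untested assumption on our side), squid's in-memory cache could be capped explicitly via the `cache_mem` directive, in case squid is sizing it from the node's total RAM rather than the container limit. The file path and values below are placeholders, not the image's actual layout:

```shell
# Append a hypothetical cap to a squid config fragment; path and values
# are placeholders for illustration only.
cat >> /tmp/squid-memory.conf <<'EOF'
cache_mem 256 MB
maximum_object_size_in_memory 512 KB
EOF
cat /tmp/squid-memory.conf
```

`cache_mem` bounds the memory cache itself, not squid's total footprint, so this may only mitigate rather than fix the problem.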

Note: worker version 2.2.6 used to run without this issue on an old cluster. Unfortunately, I can't say what the exact differences were at the cluster/node level; the old one may not have used cgroups v2, for example, but I can't tell for sure.
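A quick way to check which cgroup version a node is running (sketch): the filesystem type mounted at `/sys/fs/cgroup` is `cgroup2fs` under the unified v2 hierarchy, and typically `tmpfs` under v1.

```shell
# cgroup2fs -> unified cgroup v2 hierarchy; tmpfs -> legacy v1 layout.
FSTYPE=$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null || echo "unknown")
echo "cgroup filesystem: $FSTYPE"
```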
