
[πŸ› Bug]: Memory leak on hub and nodes when using google kubernetes (v4.26)Β #2476

@NicoIodice

What happened?

### After executing some tests with parallelization, the resource baseline increases a little each time and never drops back to the previous baseline.

For example, when we start the hub and nodes, the average memory used is around 500MB. After running the tests a first time, resources reach their peak (2GB) and the new baseline settles near 700MB. On the next run the baseline rises to 800MB, and so on. The exact numbers vary from run to run, but there is a clearly visible increase of the baseline memory value, which causes the OOMKilled event to be triggered faster on the nodes, sometimes in the middle of a test execution.

On our Google Cloud test automation infrastructure, we use Google Kubernetes Engine (GKE) to host the Selenium Hub and Selenium Nodes in different pods of the same namespace. There is one deployment with a single replica of the Selenium Hub and another with 5 replicas of the Selenium Node (chrome), each allowing 8 max sessions (a total of 40 parallel sessions).
[Screenshot: 2024-11-26_18h23_27]

We used to run the tests sequentially and cannot say whether this problem already occurred with that setup, although we remember occasionally having issues and needing to force a restart. Either way, with parallelization enabled the problem is more persistent, and it led us to increase the resources in order to have a bigger buffer and minimize these occurrences.

The current resources configuration of the chrome nodes is the following:

resources:
  limits:
    cpu: "10"
    memory: 2560Mi
  requests:
    cpu: "2"
    memory: 2560Mi

The current node configuration (collected from the UI):

OS Arch: amd64
OS Name: Linux
OS Version: 6.1.85+
Total slots: 8
Grid version: 4.26.0 (revision 69f9e5e)

We have identified the selenium-server.jar process as the main cause of the resource consumption, as can be seen in the following images.
Before running tests and after a restart:
[Screenshot: sg_mem_1]
After running tests:
[Screenshot: sg_mem_2]

Questions:
I've read that, in order to minimize this effect, we can use the following parameter:
**--drain-after-session-count** — drain and shut down the Node after X sessions have been executed. Useful for environments like Kubernetes. A value higher than zero enables this feature.
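
As a minimal sketch, assuming the docker-selenium node image maps an environment variable to this flag, it could be wired through the same extraEnvironmentVariables block we already use in the chart values; the variable name and the value of 40 are assumptions to be verified against the image version in use:

chromeNode:
  extraEnvironmentVariables:
    # Assumption: the node image translates this variable into
    # --drain-after-session-count; verify for the image version in use.
    - name: SE_DRAIN_AFTER_SESSION_COUNT
      value: "40"
    # Fallback if that variable is not supported: append the flag via SE_OPTS.
    # - name: SE_OPTS
    #   value: "--drain-after-session-count 40"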

  • Does it mean that Selenium Grid is not optimized to work with Kubernetes?
  • In that case, is it mandatory to use this parameter?
  • What are the drawbacks of using it? Will the test executions still be stable?
  • Is it possible to define the number X of sessions and still have the node keep processing test requests/sessions from the hub, in case the drain would kick in in the middle of a test execution?
  • Should we use this approach, or instead restart the nodes at the end of each test execution, or, for example, schedule a restart in the GitLab pipelines every day at a time when we are certain nobody is using the grid?
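
For the scheduled-restart option, a rough sketch of a GitLab job triggered by a pipeline schedule is shown here; the deployment name, namespace, and kubectl image are illustrative assumptions, and the runner is assumed to already have credentials for the cluster:

stages:
  - maintenance

restart-selenium-nodes:
  stage: maintenance
  image:
    name: bitnami/kubectl:latest   # assumed image providing kubectl
    entrypoint: [""]
  rules:
    # Only run when triggered by a GitLab pipeline schedule.
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    # Deployment name and namespace are placeholders for illustration.
    - kubectl rollout restart deployment/selenium-chrome-node -n selenium
    - kubectl rollout status deployment/selenium-chrome-node -n selenium --timeout=5m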

Command used to start Selenium Grid with Docker (or Kubernetes)

global:
  selenium:
    imageTag: 4.0
    nodesImageTag: 4.0

isolateComponents: false

busConfigMap:
  name: selenium-event-bus-config
  annotations: {}

hub:
  enabled: true
  host: ~
  imageName: /selenium/hub
  imageTag: 4.26.0
  imagePullPolicy: IfNotPresent
  annotations: {}
  labels: {}
  publishPort: - # removed for privacy/security purposes
  subscribePort: - # removed for privacy/security purposes
  port: - # removed for privacy/security purposes
  ingress:
    enabled: true
    path: /
    host: "-" # removed for privacy/security purposes
    annotations:
      kubernetes.io/ingress.class: nginx
    tls:
      enabled: false
      secretName: selenium-hub-tls
  livenessProbe:
    enabled: true
    path: /wd/hub/status
    initialDelaySeconds: 10
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  readinessProbe:
    enabled: true
    path: /wd/hub/status
    initialDelaySeconds: 12
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  extraEnvironmentVariables:
    - name: SE_DISTRIBUTOR_MAX_THREADS
      value: "50"
    - name: SE_ENABLE_TRACING
      value: "false"
  resources: {}
  serviceType: ClusterIP
  serviceAnnotations: {}
  tolerations: []
  nodeSelector: {}

chromeNode:
  enabled: true
  replicas: 5
  autoscale:
    enabled: false
    minReplicas: 2
    maxReplicas: 5
    pollingInterval: 30
    cooldownPeriod: 300
  imageName: selenium/node-chrome
  imageTag: 130.0
  imagePullPolicy: IfNotPresent
  ports:
    - - # removed for privacy/security purposes
    - - # removed for privacy/security purposes
  seleniumPort: - # removed for privacy/security purposes
  seleniumServicePort: - # removed for privacy/security purposes
  annotations: {}
  labels: {}
  resources:
    limits:
      cpu: '10'
      memory: 2560Mi
    requests:
      cpu: '2'
      memory: 2560Mi
  tolerations: []
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node_pool
                operator: In
                values:
                  - standard
  antiAffinity: "hard"
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    values: [ ]
  extraEnvironmentVariables:
    - name: SE_NODE_MAX_SESSIONS
      value: "10"
    - name: SE_NODE_MAX_THREADS
      value: "10"
  service:
    enabled: true
    type: ClusterIP
    annotations: {}
  terminationGracePeriodSeconds: 300
  dshmVolumeSizeLimit: 1Gi

customLabels: {}

Relevant log output

N/A

Operating System

Kubernetes (GKE)

Docker Selenium version (image tag)

4.26.0 (revision 69f9e5e)

Selenium Grid chart version (chart version)

No response
