Description
What happened?
After hours of inactivity, a new session that triggers a node scale-up in a new node pool is ignored until the distributor is restarted.
Command used to start Selenium Grid with Docker (or Kubernetes)
Verified on DigitalOcean Kubernetes, which uses Cilium as its native CNI (this could be relevant).
Installation was done via the Helm chart with separated components, with the Chrome node pinned to a dedicated node pool through a nodeSelector (autoscaling enabled on that pool):
helm install selenium-grid-release ./selenium-grid --set chromeNode.nodeEnableManagedDownloads=true --set chromeNode.replicas=1 --set isolateComponents=true --set chromeNode.nodeSelector."doks\.digitalocean\.com/node-pool"=pool-green --set firefoxNode.enabled=false --set edgeNode.enabled=false
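For readability, the same installation can also be expressed with a values file; the sketch below is a hypothetical equivalent of the command above (the keys simply mirror the --set flags used, and pool-green is the dedicated node pool):

# Hypothetical values file mirroring the --set flags above (sketch only).
cat > grid-values.yaml <<'EOF'
isolateComponents: true
chromeNode:
  replicas: 1
  nodeEnableManagedDownloads: true
  nodeSelector:
    doks.digitalocean.com/node-pool: pool-green
firefoxNode:
  enabled: false
edgeNode:
  enabled: false
EOF
helm install selenium-grid-release ./selenium-grid -f grid-values.yaml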
When scaling down the Chrome deployment, the distributor immediately drains the node.
kubectl scale deployment selenium-grid-release-selenium-node-chrome --replicas=0
After waiting a few hours (e.g., 8 hours), delete and recreate the aforementioned node pool and then scale the Chrome deployment back up: a node scale-up is triggered in the new pool, and the new Chrome node starts sending its registration request, but the distributor does not recognize it.
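Roughly, the full reproduction sequence looks like this (a sketch; the cluster ID, node size and the distributor deployment name are placeholders/assumptions based on the default chart naming):

# 1. Scale the Chrome nodes down and leave the grid idle for several hours.
kubectl scale deployment selenium-grid-release-selenium-node-chrome --replicas=0

# 2. Delete and recreate the dedicated node pool (ID and size are placeholders).
doctl kubernetes cluster node-pool delete <cluster-id> pool-green
doctl kubernetes cluster node-pool create <cluster-id> --name pool-green \
  --size s-2vcpu-4gb --count 1 --auto-scale --min-nodes 1 --max-nodes 3

# 3. Scale the Chrome deployment back up; this triggers a node scale-up in the new pool.
kubectl scale deployment selenium-grid-release-selenium-node-chrome --replicas=1

# 4. Follow the distributor log: it stops updating and never registers the new node.
kubectl logs -f deployment/selenium-grid-release-selenium-distributor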
At the network level, connections from the pod appear fine when checked from a bash shell inside it (see the sketch below). The issue is likely related to the distributor's ZMQ cache becoming stale; using the ZMQ heartbeat might help (see zeromq/jeromq#364).
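The bash check from inside the new Chrome pod was along these lines (a sketch; the pod name is a placeholder, the event-bus and distributor service names are assumed from the default chart naming, and 4442/4443/5553 are the default event-bus and distributor ports):

# Shell into the newly started Chrome node pod (pod name is a placeholder).
kubectl exec -it <chrome-node-pod> -- bash

# Inside the pod: raw TCP checks against the event bus and distributor all succeed,
# so plain connectivity does not appear to be the problem.
for target in \
    selenium-grid-release-selenium-event-bus:4442 \
    selenium-grid-release-selenium-event-bus:4443 \
    selenium-grid-release-selenium-distributor:5553; do
  host=${target%%:*}; port=${target##*:}
  (exec 3<>"/dev/tcp/${host}/${port}") && echo "${target} reachable"
done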
Relevant log output
The distributor log is no longer updated once the new Chrome pod is running; the Chrome pod simply times out after 120 seconds, as described above.
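The only workaround found so far is restarting the distributor, after which the new node is no longer ignored (deployment name assumed from the default chart naming):

# Workaround: restart the distributor; the waiting Chrome node then registers.
kubectl rollout restart deployment selenium-grid-release-selenium-distributor
kubectl logs -f deployment/selenium-grid-release-selenium-distributor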
Operating System
DigitalOcean Kubernetes
Docker Selenium version (image tag)
latest
Selenium Grid chart version (chart version)
latest