Describe the issue:
Per https://kubernetes.dask.org/en/latest/operator_resources.html#daskautoscaler, only the default worker group should autoscale. But I'm observing other worker groups also getting scaled down at the end of a computation.
Minimal Complete Verifiable Example:
import time

import distributed
from dask_kubernetes.operator import KubeCluster

# Enable the scheduler's HTTP API routes
env = dict(DASK_DISTRIBUTED__SCHEDULER__HTTP__ROUTES='["distributed.http.scheduler.prometheus", "distributed.http.scheduler.info", "distributed.http.scheduler.json", "distributed.http.health", "distributed.http.proxy", "distributed.http.statics", "distributed.http.scheduler.api"]')

cluster = KubeCluster(name="foo", image="ghcr.io/dask/dask:latest", env=env)

# Additional (non-default) worker group, which should not be autoscaled
cluster.add_worker_group(name="highmem", n_workers=2, resources={"requests": {"memory": "2Gi"}, "limits": {"memory": "64Gi"}}, env=env)

# Adaptive scaling, which should only apply to the default worker group
cluster.adapt(minimum=0, maximum=1000)

client = distributed.Client(cluster)

def wait(i):
    time.sleep(1)
    return i

futures = [client.submit(wait, i) for i in range(10000)]

Obviously I don't see the exact same behavior every time, but I consistently see one or two of the highmem workers get retired by the scheduler:
[dask-kubernetes-operator-74cc5c8855-zc7nm] [2024-10-25 19:27:26,748] kopf.objects  [INFO    ] [default/foo-default] Workers to close: [..., 'foo-highmem-worker-ca6c5d33ea', ...]
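
For reference, here is how I've been checking which workers remain in each group after the scale-down. This is just a sketch: it relies on the worker names following the "<cluster>-<group>-worker-<hash>" pattern visible in the log line above, and workers_by_group is a throwaway helper I wrote for this issue, not part of any API.

from collections import Counter

def workers_by_group(client):
    # client.scheduler_info()["workers"] maps address -> worker metadata, including "name"
    names = [w["name"] for w in client.scheduler_info()["workers"].values()]
    # Strip the "-worker-<hash>" suffix to recover "<cluster>-<group>"
    return Counter(name.rsplit("-worker-", 1)[0] for name in names)

print(workers_by_group(client))
# If only the default group were autoscaled, the 'foo-highmem' count should stay at 2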
@jacobtomlinson happy to help debug further, but I'd need some pointers on where the decision about which workers to close is being made.
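
In case it's a useful starting point: my unconfirmed guess is that the retirement list ultimately comes from the scheduler's Scheduler.workers_to_close, so the sketch below queries that directly from the client. candidates is just a throwaway helper.

def candidates(dask_scheduler, n=2):
    # Ask the scheduler which n workers it would currently pick to retire
    # (returns worker addresses by default).
    return dask_scheduler.workers_to_close(n=n)

print(client.run_on_scheduler(candidates, n=2))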
Environment:
- Dask version: 2024.9.1 for the above example, but I saw the same on 2024.10.0
- Python version: 3.10.12
- Operating System: Linux