Autoscaling removes workers in other worker groups besides default #914

@bnaul

Describe the issue:
Per the DaskAutoscaler documentation (https://kubernetes.dask.org/en/latest/operator_resources.html#daskautoscaler), only the default worker group should autoscale. However, I'm observing workers in other worker groups also being scaled down at the end of a computation.

Minimal Complete Verifiable Example:

from dask_kubernetes.operator import KubeCluster

# Enable HTTP API
env = dict(DASK_DISTRIBUTED__SCHEDULER__HTTP__ROUTES='["distributed.http.scheduler.prometheus", "distributed.http.scheduler.info", "distributed.http.scheduler.json", "distributed.http.health", "distributed.http.proxy", "distributed.http.statics", "distributed.http.scheduler.api"]')

cluster = KubeCluster(name="foo", image="ghcr.io/dask/dask:latest", env=env)
cluster.add_worker_group(name="highmem", n_workers=2, resources={"requests": {"memory": "2Gi"}, "limits": {"memory": "64Gi"}}, env=env)
cluster.adapt(minimum=0, maximum=1000)

import time

import distributed

client = distributed.Client(cluster)

# Each task sleeps for one second, so the 10,000 tasks keep adaptive scaling busy.
def wait(i):
    time.sleep(1)
    return i

futures = [client.submit(wait, i) for i in range(10000)]
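
While the example runs, it may help to track how many workers each group still has. The following is a minimal sketch, not part of the original report: the helper `workers_per_group` is hypothetical, and it assumes worker names follow the `<cluster>-<group>-worker-<suffix>` pattern seen in the operator log and that the input has the shape returned by `client.scheduler_info()`.

```python
from collections import Counter


def workers_per_group(scheduler_info, cluster_name="foo"):
    """Count workers per worker group from a scheduler_info()-shaped dict.

    Assumption: worker names look like <cluster>-<group>-worker-<suffix>.
    """
    prefix = f"{cluster_name}-"
    counts = Counter()
    for worker in scheduler_info.get("workers", {}).values():
        name = str(worker.get("name", ""))
        if name.startswith(prefix) and "-worker-" in name:
            # Strip the cluster prefix, then drop "-worker-<suffix>".
            group = name[len(prefix):].rsplit("-worker-", 1)[0]
        else:
            # Name does not match the assumed pattern.
            group = "<unknown>"
        counts[group] += 1
    return dict(counts)
```

Calling `workers_per_group(client.scheduler_info())` periodically should show the `highmem` count staying at 2 if only the default group is autoscaled, assuming the installed `distributed` version returns the full worker list from `scheduler_info()`.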

The exact behavior varies from run to run, but I consistently see one or two of the highmem workers get retired by the scheduler:

[dask-kubernetes-operator-74cc5c8855-zc7nm] [2024-10-25 19:27:26,748] kopf.objects  [INFO    ] [default/foo-default] Workers to close: [..., 'foo-highmem-worker-ca6c5d33ea', ...]
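
To check a "Workers to close" list like the one above for workers outside the autoscaled group, a small filter is enough. This is only an illustrative sketch: the helper `unexpected_closures` and the worker name `foo-default-worker-aaaa` are hypothetical, and the `<group>-worker-` prefix is an assumption based on the name in the log line.

```python
def unexpected_closures(workers_to_close, autoscaled_group="foo-default"):
    """Return worker names that do not belong to the autoscaled group.

    Assumption: workers in a group are named <group>-worker-<suffix>.
    """
    prefix = f"{autoscaled_group}-worker-"
    return [name for name in workers_to_close if not name.startswith(prefix)]


# The first name is hypothetical; the second is taken from the log line.
to_close = ["foo-default-worker-aaaa", "foo-highmem-worker-ca6c5d33ea"]
print(unexpected_closures(to_close))
```

Any name this returns is a worker that the autoscaler for `foo-default` asked to close even though it belongs to a different group.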

@jacobtomlinson I'm happy to help debug further, but I would need some pointers to where the decision about which workers to close is made.

Environment:

  • Dask version: 2024.9.1 for the above example but I saw the same on 2024.10.0
  • Python version: 3.10.12
  • Operating System: Linux

Labels: needs info, operator
