Environment
- Stateful cloud (gcp-us-west2)
- Serverless QA
Build:
"build": {
  "hash": "8ccfb227c2131c859033f409ee37a87023fada62",
  "date": "2024-10-16T05:50:43.944345200Z"
}
Steps to reproduce
- Deploy a serverless or stateful cluster; for a stateful cluster, make sure ML autoscaling is ON
- Create an inference endpoint with adaptive allocations ON:
PUT _inference/sparse_embedding/elser-endpoint
{
"service": "elser",
"service_settings": {"num_threads": 4, "adaptive_allocations": {"enabled": true}}
}
- Wait a few minutes for the scale-up event: an ML node becomes available and is allocated to that inference endpoint. You can confirm by running:
GET _ml/trained_models/elser-endpoint/_stats
- Run inference; you can follow the steps in this tutorial: https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-search-semantic-text.html
- After that, wait at least 15 minutes for the allocation to scale down to 0:
{
"count": 1,
"trained_model_stats": [
{
"model_id": ".elser_model_2_linux-x86_64",
"model_size_stats": {
"model_size_bytes": 274756282,
"required_native_memory_bytes": 2101346304
},
"pipeline_count": 1,
"ingest": {
"total": {
"count": 0,
"time_in_millis": 0,
"current": 0,
"failed": 0
},
"pipelines": {
".kibana-elastic-ai-assistant-ingest-pipeline-knowledge-base": {
"count": 0,
"time_in_millis": 0,
"current": 0,
"failed": 0,
"ingested_as_first_pipeline_in_bytes": 0,
"produced_as_first_pipeline_in_bytes": 0,
"processors": [
{
"inference": {
"type": "inference",
"stats": {
"count": 0,
"time_in_millis": 0,
"current": 0,
"failed": 0
}
}
}
]
}
}
},
"inference_stats": {
"failure_count": 0,
"inference_count": 0,
"cache_miss_count": 0,
"missing_all_fields_count": 0,
"timestamp": 1729097542245
},
"deployment_stats": {
"deployment_id": "elser-endpoint",
"model_id": ".elser_model_2_linux-x86_64",
"threads_per_allocation": 4,
"number_of_allocations": 0,
"adaptive_allocations": {
"enabled": true
},
"queue_capacity": 1024,
"state": "started",
"allocation_status": {
"allocation_count": 0,
"target_allocation_count": 0,
"state": "fully_allocated"
},
"cache_size": "262mb",
"priority": "normal",
"start_time": 1729044099355,
"peak_throughput_per_minute": 0,
"nodes": []
}
}
]
}
- After the allocation scales down to 0, ML node autoscaling (down to 0) should happen in ~1 hour
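The scale-down check in the last two steps can be scripted. Below is a minimal sketch (not part of the original report), assuming the `_stats` response shape shown above; fetching the JSON from a live cluster is left out, and `stats` is assumed to be the already-parsed body:

```python
# Sketch: given a parsed GET _ml/trained_models/<id>/_stats response
# (shape as in the output above), check whether the deployment has
# scaled its allocations down to 0.

def allocations_scaled_to_zero(stats: dict) -> bool:
    """Return True when every deployment in the stats response reports
    zero allocations and no assigned nodes."""
    for model in stats.get("trained_model_stats", []):
        deployment = model.get("deployment_stats")
        if deployment is None:
            continue
        if deployment["number_of_allocations"] != 0:
            return False
        if deployment["nodes"]:
            return False
    return True
```

For the response pasted above (`number_of_allocations: 0`, `nodes: []`) this returns True, which is the precondition for the ML node scale-down that the next section shows never happening.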
Observed:
After waiting for hours, ML node autoscaling (down to 0) did not happen.
- For stateful,
GET /_autoscaling/capacity returns:
"ml": {
"required_capacity": {
"node": {
"memory": 0,
"processors": 4
},
"total": {
"memory": 0,
"processors": 0
}
},
"current_capacity": {
"node": {
"storage": 0,
"memory": 8585740288,
"processors": 4
},
"total": {
"storage": 0,
"memory": 17171480576,
"processors": 8
}
},
"current_nodes": [
{
"name": "instance-0000000003"
},
{
"name": "instance-0000000004"
}
],
"deciders": {
"ml": {
"required_capacity": {
"node": {
"memory": 0,
"processors": 4
},
"total": {
"memory": 0,
"processors": 0
}
},
"reason_summary": "[memory_decider] Requesting scale down as tier and/or node size could be smaller; [processor_decider] requesting scale down as tier and/or node size could be smaller",
"reason_details": {
"waiting_analytics_jobs": [],
"waiting_anomaly_jobs": [],
"waiting_models": [],
"configuration": {},
"perceived_current_capacity": {
"node": {
"memory": 8585740288,
"processors": 4
},
"total": {
"memory": 17171480576,
"processors": 8
}
},
"reason": "[memory_decider] Requesting scale down as tier and/or node size could be smaller; [processor_decider] requesting scale down as tier and/or node size could be smaller"
}
}
}
}
- For serverless,
GET /_internal/serverless/autoscaling returns:
"ml": {
"metrics": {
"nodes": {
"value": 1,
"quality": "exact"
},
"node_memory_in_bytes": {
"value": 34359738368,
"quality": "exact"
},
"model_memory_in_bytes": {
"value": 0,
"quality": "exact"
},
"min_nodes": {
"value": 0,
"quality": "exact"
},
"extra_single_node_model_memory_in_bytes": {
"value": 2101346304,
"quality": "exact"
},
"extra_single_node_processors": {
"value": 0,
"quality": "exact"
},
"extra_model_memory_in_bytes": {
"value": 2101346304,
"quality": "exact"
},
"extra_processors": {
"value": 0,
"quality": "exact"
},
"remove_node_memory_in_bytes": {
"value": 0,
"quality": "exact"
},
"per_node_memory_overhead_in_bytes": {
"value": 31457280,
"quality": "exact"
}
}
}
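The inconsistency in the stateful response above can be spotted mechanically: the ml decider requests zero total memory and processors, yet `current_nodes` still lists two instances. A hedged sketch of that check (assuming the `/_autoscaling/capacity` policy shape shown above; the function name is mine, not an Elasticsearch API):

```python
# Sketch: given the parsed "ml" policy object from a
# GET /_autoscaling/capacity response, report whether a requested
# scale-down to zero has not yet been applied: required total
# capacity is 0 while ML nodes are still present.

def ml_scale_down_pending(ml_policy: dict) -> bool:
    required = ml_policy["required_capacity"]["total"]
    wants_zero = required["memory"] == 0 and required["processors"] == 0
    has_nodes = len(ml_policy.get("current_nodes", [])) > 0
    return wants_zero and has_nodes
```

With the values from the response above (required total memory 0, processors 0, two current nodes), this returns True, matching the observed bug: the deciders request a scale-down that never takes effect.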