Environment
- Stateful cloud (gcp-us-west2)
- Serverless QA
Build:
"build": {
  "hash": "8ccfb227c2131c859033f409ee37a87023fada62",
  "date": "2024-10-16T05:50:43.944345200Z"
}
Steps to reproduce
- Deploy a serverless or stateful cluster; for a stateful cluster, make sure ML autoscaling is ON
- Create an inference endpoint with adaptive allocations ON:
PUT _inference/sparse_embedding/elser-endpoint
{
"service": "elser",
"service_settings": {"num_threads": 4, "adaptive_allocations": {"enabled": true}}
}
- Wait a few minutes for the scale-up event: an ML node becomes available and is allocated to that inference endpoint. You can confirm by running:
GET _ml/trained_models/elser-endpoint/_stats
- Run inference; you can follow the steps in this tutorial: https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-search-semantic-text.html
- After that, wait at least 15 minutes for the allocation to scale down to 0:
{
"count": 1,
"trained_model_stats": [
{
"model_id": ".elser_model_2_linux-x86_64",
"model_size_stats": {
"model_size_bytes": 274756282,
"required_native_memory_bytes": 2101346304
},
"pipeline_count": 1,
"ingest": {
"total": {
"count": 0,
"time_in_millis": 0,
"current": 0,
"failed": 0
},
"pipelines": {
".kibana-elastic-ai-assistant-ingest-pipeline-knowledge-base": {
"count": 0,
"time_in_millis": 0,
"current": 0,
"failed": 0,
"ingested_as_first_pipeline_in_bytes": 0,
"produced_as_first_pipeline_in_bytes": 0,
"processors": [
{
"inference": {
"type": "inference",
"stats": {
"count": 0,
"time_in_millis": 0,
"current": 0,
"failed": 0
}
}
}
]
}
}
},
"inference_stats": {
"failure_count": 0,
"inference_count": 0,
"cache_miss_count": 0,
"missing_all_fields_count": 0,
"timestamp": 1729097542245
},
"deployment_stats": {
"deployment_id": "elser-endpoint",
"model_id": ".elser_model_2_linux-x86_64",
"threads_per_allocation": 4,
"number_of_allocations": 0,
"adaptive_allocations": {
"enabled": true
},
"queue_capacity": 1024,
"state": "started",
"allocation_status": {
"allocation_count": 0,
"target_allocation_count": 0,
"state": "fully_allocated"
},
"cache_size": "262mb",
"priority": "normal",
"start_time": 1729044099355,
"peak_throughput_per_minute": 0,
"nodes": []
}
}
]
}
- After the allocation scales down to 0, ML node autoscaling (down to 0) should happen in ~1 hour
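The scale-down check in the last two steps can be scripted. Below is a minimal sketch (not part of the original report), assuming the `_stats` response shape shown above; fetching the JSON from a live cluster is left out, and `stats` is assumed to be the already-parsed body:

```python
# Sketch: given a parsed GET _ml/trained_models/<id>/_stats response
# (shape as in the output above), check whether the deployment has
# scaled its allocations down to 0.

def allocations_scaled_to_zero(stats: dict) -> bool:
    """Return True when every deployment in the stats response reports
    zero allocations and no assigned nodes."""
    for model in stats.get("trained_model_stats", []):
        deployment = model.get("deployment_stats")
        if deployment is None:
            continue
        if deployment["number_of_allocations"] != 0:
            return False
        if deployment["nodes"]:
            return False
    return True
```

For the response pasted above (`number_of_allocations: 0`, `nodes: []`) this returns True, which is the precondition for the ML node scale-down that the next section shows never happening.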
Observed:
After waiting for hours, ML node autoscaling (down to 0) did not happen.
- For stateful,
GET /_autoscaling/capacity returns:
"ml": {
"required_capacity": {
"node": {
"memory": 0,
"processors": 4
},
"total": {
"memory": 0,
"processors": 0
}
},
"current_capacity": {
"node": {
"storage": 0,
"memory": 8585740288,
"processors": 4
},
"total": {
"storage": 0,
"memory": 17171480576,
"processors": 8
}
},
"current_nodes": [
{
"name": "instance-0000000003"
},
{
"name": "instance-0000000004"
}
],
"deciders": {
"ml": {
"required_capacity": {
"node": {
"memory": 0,
"processors": 4
},
"total": {
"memory": 0,
"processors": 0
}
},
"reason_summary": "[memory_decider] Requesting scale down as tier and/or node size could be smaller; [processor_decider] requesting scale down as tier and/or node size could be smaller",
"reason_details": {
"waiting_analytics_jobs": [],
"waiting_anomaly_jobs": [],
"waiting_models": [],
"configuration": {},
"perceived_current_capacity": {
"node": {
"memory": 8585740288,
"processors": 4
},
"total": {
"memory": 17171480576,
"processors": 8
}
},
"reason": "[memory_decider] Requesting scale down as tier and/or node size could be smaller; [processor_decider] requesting scale down as tier and/or node size could be smaller"
}
}
}
}
- For serverless,
GET /_internal/serverless/autoscaling returns:
"ml": {
"metrics": {
"nodes": {
"value": 1,
"quality": "exact"
},
"node_memory_in_bytes": {
"value": 34359738368,
"quality": "exact"
},
"model_memory_in_bytes": {
"value": 0,
"quality": "exact"
},
"min_nodes": {
"value": 0,
"quality": "exact"
},
"extra_single_node_model_memory_in_bytes": {
"value": 2101346304,
"quality": "exact"
},
"extra_single_node_processors": {
"value": 0,
"quality": "exact"
},
"extra_model_memory_in_bytes": {
"value": 2101346304,
"quality": "exact"
},
"extra_processors": {
"value": 0,
"quality": "exact"
},
"remove_node_memory_in_bytes": {
"value": 0,
"quality": "exact"
},
"per_node_memory_overhead_in_bytes": {
"value": 31457280,
"quality": "exact"
}
}
}
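The inconsistency in the stateful response above can be spotted mechanically: the ml decider requests zero total memory and processors, yet `current_nodes` still lists two instances. A hedged sketch of that check (assuming the `/_autoscaling/capacity` policy shape shown above; the function name is mine, not an Elasticsearch API):

```python
# Sketch: given the parsed "ml" policy object from a
# GET /_autoscaling/capacity response, report whether a requested
# scale-down to zero has not yet been applied: required total
# capacity is 0 while ML nodes are still present.

def ml_scale_down_pending(ml_policy: dict) -> bool:
    required = ml_policy["required_capacity"]["total"]
    wants_zero = required["memory"] == 0 and required["processors"] == 0
    has_nodes = len(ml_policy.get("current_nodes", [])) > 0
    return wants_zero and has_nodes
```

With the values from the response above (required total memory 0, processors 0, two current nodes), this returns True, matching the observed bug: the deciders request a scale-down that never takes effect.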