[ML] ingest node crashes when running semantic_text inference during ml node shutdown #114909

@wwang500

Description

Version:

9.0.0

Build:

  "build": {
    "hash": "8ccfb227c2131c859033f409ee37a87023fada62",
    "date": "2024-10-16T00:43:30.449150814Z"
  },

Error:

Node crashed with the error: java.lang.IllegalStateException: index [0] has already been set

Steps to reproduce:

  1. Deploy a multi-node environment locally. It must contain one ML node and one ingest node. My environment (GET _cat/nodes):
192.168.68.51 57 100 21 4.52   it - node-5
192.168.68.51 61 100 24 4.52   l  - node-4
192.168.68.51 59 100 21 4.52   m  - node-1
192.168.68.51 12 100 21 4.52   m  * node-0
192.168.68.51 51 100 21 4.52   d  - node-3
192.168.68.51 33 100 21 4.52   m  - node-2

node-4 is the ML node and node-5 is the ingest node.

  2. Create an inference endpoint:
PUT _inference/sparse_embedding/elser-endpoint
{
  "service": "elser",
  "service_settings": {"num_threads": 4, "adaptive_allocations": {"enabled": true}}
}
  3. Create an index with a semantic_text field in its mapping:
...
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "content": {
          "type": "semantic_text",
          "inference_id": "elser-endpoint"
        },
        ...
      }
    }
  4. Start ingesting data into the index above. I was using the Python client: es.index(index=index_name, document=document)
  5. While indexing is in progress, manually stop the ML node (node-4).
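
The indexing loop from steps 4 and 5 can be sketched roughly as below. This is a minimal sketch, not the reporter's actual script: the helper name `index_until_failure` and the logger name are assumptions made for illustration. Only the call `es.index(index=index_name, document=document)` comes from the report. The client is passed in as a parameter so the loop can be exercised without a running cluster.

```python
import logging
from typing import Any, Iterable, Optional, Tuple

logger = logging.getLogger("repro")


def index_until_failure(
    client: Any, index_name: str, documents: Iterable[dict]
) -> Tuple[int, Optional[Exception]]:
    """Index documents one at a time and stop at the first error.

    Returns (count of documents indexed successfully, the exception or None).
    `client` is any object exposing index(index=..., document=...), such as
    the Elasticsearch Python client used in the report.
    """
    indexed = 0
    for document in documents:
        try:
            # Same call as in the report.
            client.index(index=index_name, document=document)
        except Exception as exc:
            # The report saw ApiError(500, 'node_disconnected_exception', ...)
            # at this point, after the ML node had been stopped.
            logger.error("Exception type: %s", type(exc).__name__)
            logger.error("Error message: %s", exc)
            return indexed, exc
        indexed += 1
    return indexed, None
```

To use it against a real cluster, pass an `Elasticsearch` client connected to `https://localhost:9200` (credentials and TLS settings depend on the local setup) and documents containing the `@timestamp` and `content` fields from the mapping above, then stop node-4 while the loop is running.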

Observed:
After a few seconds, the Python client logs the following:

2024-10-16 00:02:18,557 DEBUG : Starting new HTTPS connection (2): localhost:9200
2024-10-16 00:02:46,400 DEBUG : https://localhost:9200 "POST /news-rss-feeds-espn-2024-10/_doc HTTP/11" 500 0
2024-10-16 00:02:46,401 ERROR : Exception type: ApiError
2024-10-16 00:02:46,401 ERROR : Error message: ApiError(500, 'node_disconnected_exception', '[node-5][192.168.68.51:9304][indices:data/write/bulk] disconnected')
2024-10-16 00:02:46,405 ERROR : Traceback (most recent call last):

Then the ingest node (node-5) crashed.

The Elasticsearch log shows:

[2024-10-16T00:00:23,419][ERROR][o.e.a.s.SubscribableListener] [node-5] exception thrown while handling another exception in listener [org.elasticsearch.action.ActionListenerImplementations$MappedActionListener/org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.tasks.TaskManager$1{org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.xpack.ml.action.TransportInternalInferModelAction$1@b59a5ea}{CancellableTask{Task{id=1373, type='transport', action='cluster:monitor/xpack/ml/trained_models/deployment/infer', description='infer_trained_model_deployment[elser-endpoint]', parentTask=ubKEYCjuSpC_vtLNjywQMw:1372, startTime=1729051218557, headers={}, startTimeNanos=804107876203041}, reason='null', isCancelled=false}}/org.elasticsearch.action.support.TransportAction$$Lambda/0x00001e000249ca88@7bf152ac/org.elasticsearch.xpack.security.action.filter.SecurityActionFilter$$Lambda/0x00001e000249a110@3b622e21]
java.lang.IllegalStateException: index [0] has already been set
	at org.elasticsearch.common.util.concurrent.AtomicArray.setOnce(AtomicArray.java:70) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.xpack.inference.chunking.EmbeddingRequestChunker$DebatchingListener.onFailure(EmbeddingRequestChunker.java:327) ~[?:?]
	at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:64) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:75) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
	at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:32) ~[elasticsearch-9.0.0-SNAPSHOT.jar:?]
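
To illustrate the exception in the trace above, here is a simplified model of set-once semantics. It is not the Elasticsearch code: `SetOnceArray`, `BatchListener`, and `notify_twice` are hypothetical names, and the real `AtomicArray` and `EmbeddingRequestChunker.DebatchingListener` are not shown in this report. The sketch only demonstrates that a listener which writes its outcome into a write-once slot will throw if it is notified a second time for the same slot. Why the slot was already populated in the actual crash is not established here.

```python
import threading
from typing import Any, List, Optional


class SetOnceArray:
    """Fixed-size array whose slots may each be written exactly once."""

    def __init__(self, size: int) -> None:
        self._slots: List[Optional[Any]] = [None] * size
        self._lock = threading.Lock()

    def set_once(self, i: int, value: Any) -> None:
        with self._lock:
            if self._slots[i] is not None:
                # Same message as the exception in the log above.
                raise RuntimeError(f"index [{i}] has already been set")
            self._slots[i] = value


class BatchListener:
    """Hypothetical per-batch listener that records its outcome in one slot."""

    def __init__(self, results: SetOnceArray, slot: int) -> None:
        self._results = results
        self._slot = slot

    def on_response(self, value: Any) -> None:
        self._results.set_once(self._slot, value)

    def on_failure(self, exc: Exception) -> None:
        self._results.set_once(self._slot, exc)


def notify_twice() -> str:
    """Notify the same listener of two failures and return the resulting error."""
    listener = BatchListener(SetOnceArray(1), 0)
    listener.on_failure(ConnectionError("node disconnected"))
    try:
        # Second notification for slot 0: the slot is already populated.
        listener.on_failure(ConnectionError("node shutting down"))
    except RuntimeError as exc:
        return str(exc)
    return "no error"
```

In this model the second notification raises inside the failure handler itself, which matches the shape of the "exception thrown while handling another exception in listener" log line.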

Full Elasticsearch log:
8-nodes-9.0.0-main-884.log

Metadata

Labels: :ml, >bug, Feature:GenAI, Team:ML
