
ModelNotHereException causing 8 retry iterations exhausted for model #523

@GolanLevy

Description

Describe the bug

From time to time, our system spins out of control, throwing many ModelNotHereExceptions that eventually lead to "8 retry iterations exhausted for model" failures.

Our registration process is completely automated and is triggered by a registerModel gRPC request (instead of a yaml configuration), followed by an ensureLoaded request to validate that the registration completed successfully.
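For clarity, here is a minimal sketch of that flow. The real calls go over gRPC to ModelMesh; the client class, method names, and timing below are illustrative stand-ins, not the actual API surface.

```python
import time

class FakeModelMeshClient:
    """Hypothetical in-memory stand-in for our ModelMesh gRPC client.

    The real system issues registerModel / ensureLoaded gRPC requests;
    this fake just simulates an asynchronous load completing shortly
    after registration.
    """

    def __init__(self):
        self._registered = {}

    def register_model(self, model_id):
        # Record when registration happened; loading is asynchronous.
        self._registered[model_id] = time.monotonic()

    def ensure_loaded(self, model_id):
        # Pretend the model finishes loading ~50 ms after registration.
        t = self._registered.get(model_id)
        return t is not None and time.monotonic() - t >= 0.05

def register_and_wait(client, model_id, timeout_s=30.0, poll_interval_s=0.01):
    """Register a model, then poll ensureLoaded until it reports success."""
    client.register_model(model_id)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.ensure_loaded(model_id):
            return True
        time.sleep(poll_interval_s)
    return False

client = FakeModelMeshClient()
print(register_and_wait(client, "4774912c"))  # True once loading completes
```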

Models:
The issue is not consistent per model: a failing invocation of a model can succeed on the next try, if that request happens to be directed to a non-faulty mm pod (see the next section).

MM pods:
We have a few dozen mm pods, and the issue is prominent in only some of them (<50%), referred to below as "faulty" pods. Faulty pods are still functioning, meaning they are able to serve, run predictions, and invoke internal requests, but they have a very high error rate due to the ModelNotHereExceptions.
It looks like faulty pods are somehow out of sync with ETCD and direct internal requests to seemingly random pods.
None of the mm pods is new; each had been running for hours or days before the issue started.
Note that non-faulty pods also throw these errors from time to time.
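To make the failure shape concrete, here is a toy sketch (not ModelMesh code) of how a forwarding retry loop can exhaust its budget: if every candidate pod in the forwarder's (possibly stale) view of placements answers ModelNotHere, 8 retries are burned and the request fails with an error like the one in the title. All names and the retry structure are assumptions for illustration.

```python
import random

MAX_RETRIES = 8  # matches the "8 retry iterations exhausted" message we see

class ModelNotHereException(Exception):
    pass

def invoke_with_retries(model_id, candidate_pods, pods_with_model):
    """Forward to randomly chosen candidates until one has the model."""
    last_err = None
    for _ in range(MAX_RETRIES):
        pod = random.choice(candidate_pods)
        if pod in pods_with_model:
            return f"served by {pod}"
        last_err = ModelNotHereException(f"{model_id} not on {pod}")
    raise RuntimeError(
        f"{MAX_RETRIES} retry iterations exhausted for model {model_id}"
    ) from last_err

# The pod that actually holds the model is absent from the candidate list,
# mirroring the q9564 case in the attached log.
candidates = [f"faulty-pod-{i}" for i in range(8)]
try:
    invoke_with_retries(
        "4774912c",
        candidates,
        pods_with_model={"modelmesh-serving-triton-2.x-768448c4fb-q9564"},
    )
except RuntimeError as e:
    print(e)  # 8 retry iterations exhausted for model 4774912c
```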

ETCD:
We do, however, suspect ETCD, since its pods were restarted (for reasons still unclear to us) and the faulty pods are precisely the ones that were created prior to the ETCD restart.
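Our working hypothesis, sketched below as a self-contained simulation (again, not ModelMesh code): a pod whose ETCD watch silently died at the restart keeps routing from a stale snapshot of the model placement registry, so any model registered after the restart is unknown to it. The registry/pod classes are invented for this illustration.

```python
class Registry:
    """Stands in for the ETCD-backed model -> pod placement registry."""

    def __init__(self):
        self.placement = {}  # model_id -> pod name

class Pod:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.cache = dict(registry.placement)  # local view, refreshed by a watch
        self.watching = True

    def on_watch_event(self):
        if self.watching:  # a dead watch never refreshes the cache
            self.cache = dict(self.registry.placement)

    def route(self, model_id):
        pod = self.cache.get(model_id)
        if pod is None:
            raise LookupError(f"ModelNotHere: {model_id} unknown to {self.name}")
        return pod

registry = Registry()
old_pod = Pod("pre-restart-pod", registry)

# ETCD restarts; suppose the old pod's watch dies and is never re-established.
old_pod.watching = False

# A new model is registered and placed after the restart.
registry.placement["4774912c"] = "modelmesh-serving-triton-2.x-768448c4fb-q9564"
old_pod.on_watch_event()  # no-op: the watch is dead, the cache stays stale

fresh_pod = Pod("post-restart-pod", registry)
print(fresh_pod.route("4774912c"))  # a post-restart pod routes correctly

try:
    old_pod.route("4774912c")
except LookupError as e:
    print(e)  # the stale pod cannot find the model, like ModelNotHereException
```

This would explain why only pods created before the ETCD restart are faulty, and why they misroute only some models (those registered after the restart).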

Mitigation:
The issue usually stops when there is a scale-in event, i.e. when some of the pods are terminated.
Note that the faulty pod itself might not be the one terminated; the errors can stop due to the termination of a different pod (perhaps one that the problematic model was loaded on).

Example:
In the attached log file, you can see that a newly registered model 4774912c is facing this issue, even though it was loaded on modelmesh-serving-triton-2.x-768448c4fb-q9564.
The external requests to the many faulty pods are directed to 8 pods, none of which is modelmesh-serving-triton-2.x-768448c4fb-q9564.

report.csv

As you can see, the situation is very peculiar, and we are not sure how to investigate further.
We are curious:

  1. Why does the ForwardingLB decide to forward inference requests to seemingly random other pods, assuming the model is already loaded there?
  2. How should we continue this investigation?

Thanks!
