[Inference API] Support service and task type aware rate-limiting #122333
Conversation
```java
// TODO: add service() and taskType()
String service();

TaskType taskType();
```
I considered making these static, but then they can't be overridden, and we can't enforce that new sub-classes implement them, so I added them to this interface.
However, I was wondering if it makes sense to make these abstract methods of BaseRequestManager instead of adding them to the RequestManager interface. I see that the RequestExecutorService expects instances of RequestManager, so I'm wondering in what situations the RequestExecutorService's methods will be passed instances of RequestManager rather than BaseRequestManager, or whether we can restrict the type a bit more. Maybe @jonathan-buttner or @timgrein can help clear things up for me? I don't have a strong opinion either way, but mostly want to understand the implications.
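To make the trade-off concrete, here's a simplified, hypothetical sketch of the two pieces being discussed: the names mirror the PR, but the real Elasticsearch classes (ThreadPool, rate-limit settings, etc.) are omitted.

```java
// Simplified sketch; names mirror the PR, not the actual Elasticsearch classes.
enum TaskType { TEXT_EMBEDDING, COMPLETION, CHAT_COMPLETION }

interface RequestManager {
    // Declaring these as instance methods on the interface (rather than
    // statics) means the compiler forces every implementation to provide
    // them, and callers that only see the RequestManager type can still
    // call them.
    String service();
    TaskType taskType();
}

abstract class BaseRequestManager implements RequestManager {
    private final String service;
    private final TaskType taskType;

    BaseRequestManager(String service, TaskType taskType) {
        this.service = service;
        this.taskType = taskType;
    }

    @Override
    public String service() {
        return service;
    }

    @Override
    public TaskType taskType() {
        return taskType;
    }
}
```

With this shape, putting the declarations on the interface and the storage in the base class gives both guarantees at once: the contract is enforced on every implementation, and subclasses don't have to re-implement the accessors.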
I haven't looked at all the request managers recently, but I suspect that their base class already has the service name and task type information. It should be stored in the model classes.
So I think what we can do is modify the RequestManager interface like you have to include those new methods. Then modify the BaseRequestManager constructor to take the service name string and the task type and implement the required methods you just added to the interface.
Then in the service abstract class access the generic model to retrieve the information and pass it along to BaseRequestManager.
For example here are the changes I'm thinking:
```java
BaseRequestManager(
    ThreadPool threadPool,
    String inferenceEntityId,
    Object rateLimitGroup,
    RateLimitSettings rateLimitSettings,
    String serviceName,
    TaskType taskType
) { ... }

protected OpenAiRequestManager(ThreadPool threadPool, OpenAiModel model, CheckedSupplier<URI, URISyntaxException> uriBuilder) {
    super(
        threadPool,
        model.getInferenceEntityId(),
        RateLimitGrouping.of(model, uriBuilder),
        model.rateLimitServiceSettings().rateLimitSettings(),
        model.getConfigurations().getService(),
        model.getConfigurations().getTaskType()
    );
}
```
Unfortunately we have to do that for every request manager 😭 Hopefully soon we can do some refactoring to reduce the footprint of changes like this in the future.
Ahh, I see! Makes sense. Will refactor!
```java
@Override
public TaskType taskType() {
    return COMPLETION;
}
```
For classes like ElasticInferenceServiceUnifiedCompletionRequestManager, GoogleAiStudioCompletionRequestManager, and OpenAiCompletionRequestManager (all completion classes in general?), should we be using COMPLETION or CHAT_COMPLETION? The use of ChatCompletionInput in each of these classes points me to COMPLETION, but I wanted to make sure this is correct.
Yeah sorry our naming is confusing. I think my comment above should address this too though because we shouldn't need to add the method directly to these classes anymore. But in general when you see UnifiedChatInput that means chat_completion and then confusingly when you see ChatCompletionInput it means completion 😭
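For reference, the naming-to-task-type mapping described above can be summarized in a small sketch. This is an assumption inferred from the comment, not the real plugin code; the actual input classes live in the Elasticsearch inference plugin and are only name-checked here.

```java
// Assumed mapping, inferred from the reviewer's comment; not the real plugin code.
enum TaskType { COMPLETION, CHAT_COMPLETION }

final class InputTaskTypes {
    // Counter-intuitively, ChatCompletionInput corresponds to the
    // "completion" task type, while UnifiedChatInput corresponds to
    // "chat_completion".
    static TaskType forInput(String inputClassName) {
        return switch (inputClassName) {
            case "UnifiedChatInput" -> TaskType.CHAT_COMPLETION;
            case "ChatCompletionInput" -> TaskType.COMPLETION;
            default -> throw new IllegalArgumentException("unknown input: " + inputClassName);
        };
    }
}
```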
```java
@Override
public TaskType taskType() {
    return TaskType.TEXT_EMBEDDING;
}
```
For HuggingFaceRequestManager, I used TEXT_EMBEDDING as the TaskType since it is directly mentioned in the requestType within the HuggingFaceActionCreator and the only subclasses of HuggingFaceModel are HuggingFaceEmbeddingsModel and HuggingFaceElserModel.
Let me know if the model approach I mentioned addresses this.
Yep, should be good here. Thanks!
Force-pushed from bca4565 to 162763b
The goal of this PR is to address the rate-limiting follow-up TODOs introduced by this PR, in order to support service- and task-type-aware rate-limiting.