Change requests queue implementation from a channel to more complex queue #172

@mayabar

Currently, the request queue is implemented using a Go channel. This causes some limitations:

Known limitation 1: max-loras is not taken into consideration when a worker pulls a request from the queue
The component that receives requests pushes each new request to a channel; workers wait on this channel and process the requests they receive.
Requests are always processed in arrival order.
In some scenarios, this behavior leads to requests running in parallel with more LoRA adapters than the max-loras parameter allows.

Example:
max-loras is set to 2, and the number of parallel requests is 3
Queue contains: R1 (lora1), R2 (lora2), R3 (lora3), R4 (lora1)
The current implementation will pull 3 requests for processing: R1, R2, and R3.
Required behavior: pull R1, R2, and R4 (R3 cannot be sent for processing because it would require loading more than 2 LoRAs)
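
The difference between the two behaviors in this example can be sketched in Go. This is only an illustration of the selection rule, not the simulator's code: the `Request` type and the functions `pickFIFO`, `pickLoraAware`, and `ids` are hypothetical names introduced here, and the sketch assumes that no adapters are loaded before the first request is pulled.

```go
package main

import "fmt"

// Request is a hypothetical queued request; Lora names the adapter it needs.
type Request struct {
	ID   string
	Lora string
}

// pickFIFO models the channel behavior: take the first n requests
// strictly by arrival time, ignoring which adapters they need.
func pickFIFO(queue []Request, n int) []Request {
	if n > len(queue) {
		n = len(queue)
	}
	return queue[:n]
}

// pickLoraAware scans the queue in arrival order and skips any request
// whose adapter would raise the number of distinct adapters above maxLoras.
func pickLoraAware(queue []Request, n, maxLoras int) []Request {
	loaded := map[string]bool{} // assumption: nothing is loaded initially
	var picked []Request
	for _, r := range queue {
		if len(picked) == n {
			break
		}
		if !loaded[r.Lora] && len(loaded) >= maxLoras {
			continue // one adapter too many; leave the request queued
		}
		loaded[r.Lora] = true
		picked = append(picked, r)
	}
	return picked
}

// ids extracts request IDs for printing.
func ids(rs []Request) []string {
	out := make([]string, 0, len(rs))
	for _, r := range rs {
		out = append(out, r.ID)
	}
	return out
}

func main() {
	queue := []Request{
		{"R1", "lora1"}, {"R2", "lora2"}, {"R3", "lora3"}, {"R4", "lora1"},
	}
	fmt.Println(ids(pickFIFO(queue, 3)))         // [R1 R2 R3]
	fmt.Println(ids(pickLoraAware(queue, 3, 2))) // [R1 R2 R4]
}
```

With max-loras = 2 and 3 parallel slots, the FIFO rule selects R1, R2, R3 (three adapters), while the adapter-aware rule selects R1, R2, R4 and leaves R3 queued.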

Known limitation 2: extra entries in loraInfo metrics are reported
When a single request is received, it is pushed to the queue (the channel), which produces a metrics report stating that this request's LoRA is in the waiting list. This report does not affect the LoraAwareScorer, but it should be removed in a future version. Implementing the new queue will fix this behavior.

Solution:
Implement a queue class that exposes an API similar to a channel: workers wait for a new request to process.
The queue will skip requests whose LoRAs cannot be loaded right now.
Design: TBD
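
Since the design is still TBD, the following is only one possible shape for such a queue, sketched under stated assumptions. The names `LoraQueue`, `NewLoraQueue`, `Push`, `Pop`, and `Done` are hypothetical. The sketch also assumes that an adapter counts as loaded only while at least one running request uses it, which may not match how the simulator tracks loaded adapters.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a hypothetical queued request; an empty Lora means base model.
type Request struct {
	ID   string
	Lora string
}

// LoraQueue is a hypothetical channel replacement that is aware of max-loras.
type LoraQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	pending  []Request
	maxLoras int
	inUse    map[string]int // adapter name -> running requests using it
}

func NewLoraQueue(maxLoras int) *LoraQueue {
	q := &LoraQueue{maxLoras: maxLoras, inUse: map[string]int{}}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Push appends a request and wakes any waiting workers.
func (q *LoraQueue) Push(r Request) {
	q.mu.Lock()
	q.pending = append(q.pending, r)
	q.mu.Unlock()
	q.cond.Broadcast()
}

// Pop blocks until some pending request can run without exceeding maxLoras,
// then returns the earliest such request. Ineligible requests stay queued.
func (q *LoraQueue) Pop() Request {
	q.mu.Lock()
	defer q.mu.Unlock()
	for {
		for i, r := range q.pending {
			if r.Lora == "" || q.inUse[r.Lora] > 0 || len(q.inUse) < q.maxLoras {
				q.pending = append(q.pending[:i], q.pending[i+1:]...)
				if r.Lora != "" {
					q.inUse[r.Lora]++
				}
				return r
			}
		}
		q.cond.Wait() // nothing eligible; wait for a Push or Done
	}
}

// Done marks a request as finished and frees its adapter slot if unused.
func (q *LoraQueue) Done(r Request) {
	q.mu.Lock()
	if r.Lora != "" {
		q.inUse[r.Lora]--
		if q.inUse[r.Lora] <= 0 {
			delete(q.inUse, r.Lora)
		}
	}
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := NewLoraQueue(2)
	for _, r := range []Request{
		{"R1", "lora1"}, {"R2", "lora2"}, {"R3", "lora3"}, {"R4", "lora1"},
	} {
		q.Push(r)
	}
	a, b, c := q.Pop(), q.Pop(), q.Pop()
	fmt.Println(a.ID, b.ID, c.ID) // R1 R2 R4
	q.Done(b)                     // lora2 is released, so R3 becomes eligible
	fmt.Println(q.Pop().ID)       // R3
}
```

A worker would call `Pop` where it currently receives from the channel and `Done` when it finishes a request. One open design question this sketch does not address is starvation: a skipped request such as R3 can wait indefinitely if requests for already-loaded adapters keep arriving.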
