This is a batch-processing proxy that wraps the API of Text Embeddings Inference. It is implemented in Rust, using Tokio as the async runtime, Actix-web as the HTTP API framework, and reqwest as the underlying HTTP client.
The proxy uses a lightweight actor model to manage shared state.
- Request arrival: When a request arrives, the proxy extracts all common parameters (basically everything except input) and uses them to assign the request to a worker instance.
- Worker messaging: The proxy sends the worker a message containing both the client's reply handle and the main request payload.
- Batching logic: The worker waits for new requests or until the configured waiting timeout (max_waiting_timeout) expires. If the queued input count exceeds the configured max_batch_size, the worker flushes the batch immediately (see the sketch below).
- Request execution: On flush, the worker combines the batch's inputs with the common API parameters, sends them to the target API, and distributes the resulting responses back to the corresponding clients.
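For illustration, here is a minimal sketch of what such a batching worker loop could look like, assuming a Tokio mpsc channel for incoming jobs and a oneshot channel as the reply handle; the Job struct and call_inference_api placeholder are illustrative assumptions, not the exact code from this repository.

```rust
// Illustrative sketch only: the actual worker in this repo may differ. It shows
// the batching idea: collect requests from a channel and flush either when
// max_batch_size is reached or when max_waiting_timeout expires.
use std::time::Duration;
use tokio::sync::{mpsc, oneshot};

// One queued request: the input plus a handle to send the response back.
struct Job {
    input: String,                    // TReq for /embed
    reply: oneshot::Sender<Vec<f64>>, // TResp for /embed
}

async fn worker(
    mut rx: mpsc::Receiver<Job>,
    max_batch_size: usize,
    max_waiting_timeout: Duration,
) {
    loop {
        // Wait for the first request of the next batch.
        let Some(first) = rx.recv().await else { return };
        let mut batch = vec![first];

        // Keep collecting until the batch is full or the timeout expires.
        let deadline = tokio::time::Instant::now() + max_waiting_timeout;
        while batch.len() < max_batch_size {
            match tokio::time::timeout_at(deadline, rx.recv()).await {
                Ok(Some(job)) => batch.push(job),
                _ => break, // timeout elapsed or channel closed
            }
        }

        // Flush: send all inputs upstream in one call (placeholder here),
        // then fan the responses back out to the waiting clients.
        let inputs: Vec<String> = batch.iter().map(|j| j.input.clone()).collect();
        let responses: Vec<Vec<f64>> = call_inference_api(inputs).await;
        for (job, resp) in batch.into_iter().zip(responses) {
            let _ = job.reply.send(resp);
        }
    }
}

// Stand-in for the real upstream call made through the ApiClient.
async fn call_inference_api(inputs: Vec<String>) -> Vec<Vec<f64>> {
    inputs.iter().map(|_| Vec::new()).collect()
}
```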
It is assumed that all requests that can be batched can be represented as a Vec<TReq>, and all responses as a Vec<TResp>. For the embed endpoint this means TReq=String and TResp=Vec<f64>.
To define a new endpoint, you need to implement the type container trait ApiEndpoint and define the required types. In addition, you need to define GroupingParams for the common parameters that requests can be grouped by, and add the endpoint's API call definition to the ApiClient.
You can take a look at the /embed endpoint for an example implementation.
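For orientation, here is a minimal sketch of what an endpoint definition along these lines could look like; the trait shape, field names, and grouping parameters below are illustrative assumptions rather than the exact definitions used in this codebase.

```rust
// Hypothetical sketch: the real ApiEndpoint trait and GroupingParams in this
// repo may be shaped differently; the names and fields here are illustrative.

/// Common parameters a request can be grouped by (everything except the inputs).
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct EmbedGroupingParams {
    pub normalize: Option<bool>,
    pub truncate: Option<bool>,
}

/// Type container describing one batchable endpoint.
pub trait ApiEndpoint {
    /// A single client input; one element of the batched request (TReq).
    type Req;
    /// A single response element returned to one client (TResp).
    type Resp;
    /// Parameters shared by every request in the same batch.
    type GroupingParams;

    /// Upstream path the batch is sent to.
    const PATH: &'static str;
}

pub struct EmbedEndpoint;

impl ApiEndpoint for EmbedEndpoint {
    type Req = String;    // TReq = String
    type Resp = Vec<f64>; // TResp = Vec<f64>
    type GroupingParams = EmbedGroupingParams;

    const PATH: &'static str = "/embed";
}
```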
The proxy is configured via settings.toml. Local development overrides can be placed in settings.local.toml.
All settings can also be overridden using environment variables, which must:
- Have the BATCH_PROXY prefix
- Use double underscores (__) as separators between nested fields
Example: To override inference_api.target_url, set:
BATCH_PROXY__INFERENCE_API__TARGET_URL
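As a sketch of how this layering could be wired up (the repository may use a different crate or structure), the config crate supports this pattern of TOML files plus prefixed environment variables:

```rust
// Illustrative sketch using the `config` crate; the actual configuration code
// in this repo may differ. Later sources override earlier ones.
use config::{Config, Environment, File};

fn load_settings() -> Result<Config, config::ConfigError> {
    Config::builder()
        // Base settings.
        .add_source(File::with_name("settings.toml"))
        // Optional local development overrides.
        .add_source(File::with_name("settings.local.toml").required(false))
        // Environment variables such as BATCH_PROXY__INFERENCE_API__TARGET_URL.
        .add_source(
            Environment::with_prefix("BATCH_PROXY")
                .prefix_separator("__")
                .separator("__"),
        )
        .build()
}
```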
The proxy can be run with the included docker-compose.yml.
Running docker compose up --profile cpu will start both the proxy and the underlying text inference API, which defaults to the nomic-ai/nomic-embed-text-v1.5 model.
To run only the proxy, without the API container, use docker compose start proxy and configure the proxy to point to the correct API URL.
In addition, you can use docker compose up --profile gpu to run the text inference API on your GPU.
Benchmarks were performed with the Hey load-testing tool.
Test command:
hey -m POST -d '{"inputs": "hello"}' -H "Content-Type: application/json" -n 5000 ${API}
where API points either to the proxy endpoint or directly to the text inference API.
The GPU version of the text inference API was used, running on the NVIDIA RTX 2060 Super.
Settings: max_batch_size=32, max_waiting_timeout=8ms, RUST_LOG=error
Proxy (batched requests):

    Summary:
      Total:        3.8910 secs
      Slowest:      0.0869 secs
      Fastest:      0.0257 secs
      Average:      0.0388 secs
      Requests/sec: 1285.0072

      Total data:   47605000 bytes
      Size/request: 9521 bytes

    Response time histogram:
      0.026 [1]    |
      0.032 [60]   |■
      0.038 [2819] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.044 [1781] |■■■■■■■■■■■■■■■■■■■■■■■■■
      0.050 [229]  |■■■
      0.056 [64]   |■
      0.062 [0]    |
      0.069 [0]    |
      0.075 [0]    |
      0.081 [32]   |
      0.087 [14]   |

    Latency distribution:
      10% in 0.0356 secs
      25% in 0.0365 secs
      50% in 0.0376 secs
      75% in 0.0401 secs
      90% in 0.0430 secs
      95% in 0.0451 secs
      99% in 0.0541 secs

    Details (average, fastest, slowest):
      DNS+dialup:  0.0000 secs, 0.0257 secs, 0.0869 secs
      DNS-lookup:  0.0000 secs, 0.0000 secs, 0.0034 secs
      req write:   0.0000 secs, 0.0000 secs, 0.0020 secs
      resp wait:   0.0386 secs, 0.0257 secs, 0.0830 secs
      resp read:   0.0001 secs, 0.0000 secs, 0.0015 secs

    Status code distribution:
      [200] 5000 responses
Direct calls to the text inference API:

    Summary:
      Total:        6.5275 secs
      Slowest:      0.4035 secs
      Fastest:      0.0063 secs
      Average:      0.0641 secs
      Requests/sec: 765.9908

      Total data:   47605000 bytes
      Size/request: 9521 bytes

    Response time histogram:
      0.006 [1]    |
      0.046 [4277] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.086 [40]   |
      0.125 [1]    |
      0.165 [91]   |■
      0.205 [248]  |■■
      0.245 [114]  |■
      0.284 [35]   |
      0.324 [123]  |■
      0.364 [20]   |
      0.403 [50]   |

    Latency distribution:
      10% in 0.0358 secs
      25% in 0.0368 secs
      50% in 0.0379 secs
      75% in 0.0405 secs
      90% in 0.1837 secs
      95% in 0.2354 secs
      99% in 0.3920 secs

    Details (average, fastest, slowest):
      DNS+dialup:  0.0000 secs, 0.0063 secs, 0.4035 secs
      DNS-lookup:  0.0000 secs, 0.0000 secs, 0.0050 secs
      req write:   0.0000 secs, 0.0000 secs, 0.0047 secs
      resp wait:   0.0639 secs, 0.0062 secs, 0.4033 secs
      resp read:   0.0001 secs, 0.0000 secs, 0.0030 secs

    Status code distribution:
      [200] 5000 responses
As the results show, requests going through the batch proxy perform better: the proxy reaches a throughput of about 1285 requests/sec, while the raw inference API reaches about 766 requests/sec.
In addition, the average response time through the proxy is 0.0388 seconds (38.8 ms), compared to 0.0641 seconds (64.1 ms) for direct calls to the inference API.
In the current implementation the workers stay in memory forever. This opens the door to DoS attacks, which could easily be mitigated by removing idle workers on a periodic basis.
Currently, if something goes wrong, the user receives a generic error. It would be better to determine what exactly went wrong and, depending on the error, relay that information to the user.
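As a sketch of one possible direction (not part of the current implementation), a dedicated error type implementing Actix-web's ResponseError could map distinct failure modes to appropriate status codes; the variants below are illustrative assumptions.

```rust
// Hypothetical sketch of a more descriptive error type; the current
// implementation returns a generic error instead. Requires the `thiserror`
// and `actix-web` crates.
use actix_web::{http::StatusCode, ResponseError};

#[derive(Debug, thiserror::Error)]
enum ProxyError {
    #[error("upstream inference API returned an error: {0}")]
    Upstream(String),
    #[error("request to the inference API timed out")]
    Timeout,
    #[error("batch worker is unavailable")]
    WorkerUnavailable,
}

impl ResponseError for ProxyError {
    // Map each failure mode to a status code; the default error_response
    // implementation then relays the Display message to the client.
    fn status_code(&self) -> StatusCode {
        match self {
            ProxyError::Upstream(_) => StatusCode::BAD_GATEWAY,
            ProxyError::Timeout => StatusCode::GATEWAY_TIMEOUT,
            ProxyError::WorkerUnavailable => StatusCode::SERVICE_UNAVAILABLE,
        }
    }
}
```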