btwotwo/batch_proxy

Description

This is a batch-processing proxy that wraps the API of Text Embeddings Inference. It is implemented in Rust, using Tokio as the async runtime, Actix-web as the HTTP API framework, and reqwest as the underlying HTTP client.

Architecture

The proxy uses a lightweight actor model to manage shared state.

  1. Request arrival: When a request arrives, the proxy extracts all common parameters (essentially everything except input) and assigns the request to a worker instance based on those parameters.
  2. Worker messaging: The proxy sends the worker a message containing both the client’s reply handle and the main request payload.
  3. Batching logic: The worker waits for new requests or until the configured waiting timeout (max_waiting_timeout) expires. If the count of queued inputs exceeds the configured max_batch_size, the worker flushes the batch immediately (see the sketch after this list).
  4. Request execution: On flushing, the worker combines the batch’s inputs and common API parameters, sends them to the target API, and distributes the resulting responses back to the corresponding clients.
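
Below is a minimal sketch of the batching loop from steps 3 and 4, assuming each worker receives queued items over a Tokio mpsc channel and answers clients through oneshot channels. The names used here (BatchItem, flush, fake_api_call) are illustrative, not the crate's actual types; the real implementation calls the target API through reqwest instead of the stub.

// Requires tokio with the "full" feature.
use tokio::sync::{mpsc, oneshot};
use tokio::time::{sleep_until, Duration, Instant};

/// One queued request: the input plus the reply handle used to answer the client.
struct BatchItem {
    input: String,
    reply: oneshot::Sender<Vec<f64>>,
}

/// Collect items until the batch is full or the waiting timeout expires, then flush.
async fn worker(
    mut rx: mpsc::Receiver<BatchItem>,
    max_batch_size: usize,
    max_waiting_timeout: Duration,
) {
    loop {
        // Wait for the first item of a new batch; exit once all senders are gone.
        let Some(first) = rx.recv().await else { return };
        let mut batch = vec![first];
        let deadline = Instant::now() + max_waiting_timeout;

        while batch.len() < max_batch_size {
            tokio::select! {
                maybe_item = rx.recv() => match maybe_item {
                    Some(item) => batch.push(item),
                    None => break,
                },
                _ = sleep_until(deadline) => break,
            }
        }

        flush(batch).await;
    }
}

/// Combine the inputs, call the upstream API once, and scatter the responses.
async fn flush(batch: Vec<BatchItem>) {
    let inputs: Vec<String> = batch.iter().map(|item| item.input.clone()).collect();
    let responses = fake_api_call(inputs).await; // stands in for the real reqwest call
    for (item, resp) in batch.into_iter().zip(responses) {
        let _ = item.reply.send(resp); // the client may be gone already; ignore the error
    }
}

/// Placeholder for the real /embed request.
async fn fake_api_call(inputs: Vec<String>) -> Vec<Vec<f64>> {
    inputs.iter().map(|_| vec![0.0; 3]).collect()
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(64);
    tokio::spawn(worker(rx, 32, Duration::from_millis(8)));

    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(BatchItem { input: "hello".into(), reply: reply_tx }).await.unwrap();
    println!("embedding: {:?}", reply_rx.await.unwrap());
}

The select! between the channel and sleep_until is what lets a partially filled batch flush as soon as the waiting timeout elapses.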

Abstractions

It is assumed that every batchable request can be represented as a Vec<TReq> and that the corresponding responses come back as a Vec<TResp>. For the embed endpoint this means TReq=String and TResp=Vec<f64>.

To define a new endpoint, implement the type-container trait ApiEndpoint and define the required types. In addition, define GroupingParams for the common parameters that requests can be grouped by, and add the endpoint's API call to the ApiClient.

You can take a look at the /embed endpoint for an example implementation.
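
As a rough illustration, a hypothetical shape for such a type container is sketched below; the actual trait in the repository may use different names, bounds, and parameters, and EmbedGroupingParams with its fields is an assumption rather than the real definition.

use serde::{de::DeserializeOwned, Serialize};

/// Ties together the element types and grouping parameters of one batchable endpoint.
trait ApiEndpoint {
    /// A single input element; the whole batch is a Vec<Req>.
    type Req: Serialize + Send;
    /// A single response element; the whole batch is a Vec<Resp>.
    type Resp: DeserializeOwned + Send;
    /// The common parameters a batch is grouped by (everything except the input).
    type GroupingParams: Clone + Eq + std::hash::Hash + Send;

    /// Path of the upstream endpoint, e.g. "/embed".
    const PATH: &'static str;
}

/// Example wiring for the /embed endpoint, where TReq = String and TResp = Vec<f64>.
struct EmbedEndpoint;

/// Illustrative grouping parameters; the real struct may expose different fields.
#[derive(Clone, PartialEq, Eq, Hash)]
struct EmbedGroupingParams {
    normalize: bool,
    truncate: bool,
}

impl ApiEndpoint for EmbedEndpoint {
    type Req = String;
    type Resp = Vec<f64>;
    type GroupingParams = EmbedGroupingParams;
    const PATH: &'static str = "/embed";
}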

Configuration

The proxy is configured via settings.toml. Local development overrides can be placed in settings.local.toml. All settings can also be overridden using environment variables, which must:

  • Have the BATCH_PROXY prefix
  • Use double underscores (__) as separators between nested fields

Example: To override inference_api.target_url, set: BATCH_PROXY__INFERENCE_API__TARGET_URL
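
For illustration, this is how such a layered setup is commonly wired with the config crate in Rust. It is only a sketch under the assumption that the project follows this pattern, and the Settings struct shown is limited to the inference_api.target_url field mentioned above.

use config::{Config, ConfigError, Environment, File};
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct InferenceApi {
    target_url: String,
}

#[derive(Debug, Deserialize)]
struct Settings {
    inference_api: InferenceApi,
}

fn load_settings() -> Result<Settings, ConfigError> {
    Config::builder()
        // Base settings checked into the repository.
        .add_source(File::with_name("settings.toml"))
        // Optional local overrides for development.
        .add_source(File::with_name("settings.local.toml").required(false))
        // Environment overrides, e.g. BATCH_PROXY__INFERENCE_API__TARGET_URL.
        .add_source(
            Environment::with_prefix("BATCH_PROXY")
                .prefix_separator("__")
                .separator("__"),
        )
        .build()?
        .try_deserialize()
}

fn main() {
    match load_settings() {
        Ok(settings) => println!("target_url = {}", settings.inference_api.target_url),
        Err(err) => eprintln!("failed to load settings: {err}"),
    }
}

Sources added later take precedence, so environment variables override both settings files.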

How to run

The proxy can be run with the included docker-compose.yml. Running docker compose up --profile cpu will start both the proxy and the underlying text inference API, which defaults to the nomic-ai/nomic-embed-text-v1.5 model.

To run only the proxy, without the API container, use docker compose start proxy and configure the proxy to point to the correct API URL.

In addition, you can use docker compose up --profile gpu to run the text inference API on your GPU.

Benchmarks

Benchmarks were performed with the Hey load-testing tool.

Test command: hey -m POST -d '{"inputs": "hello"}' -H "Content-Type: application/json" -n 5000 ${API}, where API is set to either the proxy endpoint or the text inference API directly.

The GPU version of the text inference API was used, running on an NVIDIA RTX 2060 Super.

Batch proxy

Settings: max_batch_size=32, max_waiting_time=8ms, RUST_LOG=error

Summary:
  Total:	3.8910 secs
  Slowest:	0.0869 secs
  Fastest:	0.0257 secs
  Average:	0.0388 secs
  Requests/sec:	1285.0072

  Total data:	47605000 bytes
  Size/request:	9521 bytes

Response time histogram:
  0.026 [1]	|
  0.032 [60]	|■
  0.038 [2819]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.044 [1781]	|■■■■■■■■■■■■■■■■■■■■■■■■■
  0.050 [229]	|■■■
  0.056 [64]	|■
  0.062 [0]	|
  0.069 [0]	|
  0.075 [0]	|
  0.081 [32]	|
  0.087 [14]	|


Latency distribution:
  10% in 0.0356 secs
  25% in 0.0365 secs
  50% in 0.0376 secs
  75% in 0.0401 secs
  90% in 0.0430 secs
  95% in 0.0451 secs
  99% in 0.0541 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0257 secs, 0.0869 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0034 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0020 secs
  resp wait:	0.0386 secs, 0.0257 secs, 0.0830 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0015 secs

Status code distribution:
  [200]	5000 responses

Text inference API

Summary:
  Total:	6.5275 secs
  Slowest:	0.4035 secs
  Fastest:	0.0063 secs
  Average:	0.0641 secs
  Requests/sec:	765.9908

  Total data:	47605000 bytes
  Size/request:	9521 bytes

Response time histogram:
  0.006 [1]	|
  0.046 [4277]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.086 [40]	|
  0.125 [1]	|
  0.165 [91]	|■
  0.205 [248]	|■■
  0.245 [114]	|■
  0.284 [35]	|
  0.324 [123]	|■
  0.364 [20]	|
  0.403 [50]	|


Latency distribution:
  10% in 0.0358 secs
  25% in 0.0368 secs
  50% in 0.0379 secs
  75% in 0.0405 secs
  90% in 0.1837 secs
  95% in 0.2354 secs
  99% in 0.3920 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0063 secs, 0.4035 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0050 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0047 secs
  resp wait:	0.0639 secs, 0.0062 secs, 0.4033 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0030 secs

Status code distribution:
  [200]	5000 responses

Summary

As the numbers show, requests going through the batch proxy complete faster: the proxy sustains roughly 1285 requests/sec, while hitting the inference API directly reaches roughly 766 requests/sec.

In addition, the average response time for direct inference API calls is 0.0641 seconds, while the proxy averages 0.0388 seconds. The direct calls also show a much longer tail, reaching 0.39 seconds at the 99th percentile versus 0.054 seconds through the proxy.

Improvement points

Workers cleanup

In the current implementation, workers stay in memory forever. Since a worker is kept around for every distinct set of grouping parameters, this opens the door to DoS attacks; it can be mitigated by removing idle workers on a periodic basis (a sketch follows below).
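
A minimal sketch of what such periodic cleanup could look like, assuming workers are tracked in a shared map (keyed here by a String for simplicity); WorkerHandle, the key type, and the TTL values are all illustrative.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::time::{interval, Duration, Instant};

/// Per-worker bookkeeping; the real handle would also hold the channel to the worker task.
struct WorkerHandle {
    last_used: Instant,
}

/// Periodically drop workers that have been idle longer than `idle_ttl`.
fn spawn_cleanup(workers: Arc<Mutex<HashMap<String, WorkerHandle>>>, idle_ttl: Duration) {
    tokio::spawn(async move {
        let mut tick = interval(Duration::from_secs(60));
        loop {
            tick.tick().await;
            // Removing an entry drops its sender, so the worker's channel closes
            // and the worker task shuts down on its own.
            workers
                .lock()
                .unwrap()
                .retain(|_, worker| worker.last_used.elapsed() < idle_ttl);
        }
    });
}

#[tokio::main]
async fn main() {
    let workers = Arc::new(Mutex::new(HashMap::new()));
    spawn_cleanup(workers.clone(), Duration::from_secs(300));
    // The proxy would keep serving requests here; sleep briefly for the demo.
    tokio::time::sleep(Duration::from_millis(10)).await;
}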

Better error handling

Currently, if something goes wrong, the user receives a generic error. It would be better to identify what exactly went wrong and, depending on the error, relay the details to the user.

About

An auto-batching proxy for https://github.com/huggingface/text-embeddings-inference
