
Commit dae1bdb

feat(ai-gateway): Add new load balancing algorithms (#3659)

1 parent 1201d5c commit dae1bdb

4 files changed: +189 −2 lines changed
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
title: 'Load balancing: Least-connections'
description: 'Configure the plugin to use two OpenAI models and route requests based on in-flight connection counts and spare capacity.'

extended_description: |
  {% new_in 3.13 %} Configure the plugin to use two OpenAI models and route requests to the backend with the highest spare capacity based on in-flight connection counts.

  In this example, both models have equal weight (2), so requests are distributed based on which backend has fewer active connections. The algorithm automatically routes new requests to backends with more spare capacity, making it particularly effective when backends have varying response times.

weight: 111

requirements:
  - An OpenAI account

config:
  balancer:
    algorithm: least-connections
    retries: 3
    failover_criteria:
      - error
      - timeout
      - http_429
      - non_idempotent
  targets:
    - model:
        name: gpt-4o
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 2
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true
    - model:
        name: gpt-4o-mini
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 2
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true

variables:
  key:
    value: $OPENAI_API_KEY
    description: The API key to use to connect to OpenAI.

tools:
  - deck
  - admin-api
  - konnect-api
  - kic
  - terraform

group: load-balancing
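
For context, here is a minimal sketch of how this example's `config` block might attach to a Gateway service in a decK declaration. The service name, route name, paths, and upstream URL below are hypothetical placeholders, not part of this commit:

_format_version: "3.0"
services:
  - name: ai-service                # hypothetical service name
    url: http://localhost:32000     # placeholder upstream; the plugin routes to OpenAI targets
    routes:
      - name: ai-chat-route         # hypothetical route name
        paths:
          - /chat
    plugins:
      - name: ai-proxy-advanced
        config:
          balancer:
            algorithm: least-connections
            retries: 3
            failover_criteria:
              - error
              - timeout
              - http_429
              - non_idempotent
          # targets: ...as defined in the example file above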
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
title: 'Load balancing: Semantic with fallback'
description: 'Configure the plugin to route requests based on semantic similarity between prompts and model descriptions, with automatic fallback among models sharing identical descriptions.'

extended_description: |
  {% new_in 3.13 %} Configure the plugin to use three OpenAI models and route requests based on semantic similarity between the prompt and model descriptions.

  In this example, two targets share the same description ("Specialist in programming problems"). When a prompt matches this description, the plugin first routes to the target with weight 75 (gpt-4o). If that target fails, it falls back to the target with weight 25 (gpt-4o-mini) using round-robin. The third target, with a different description ("Specialist in real life topics"), handles prompts about non-technical topics.

weight: 111

min_version:
  gateway: '3.13'

requirements:
  - An OpenAI account
  - A Redis instance for vector storage

config:
  balancer:
    algorithm: semantic
    retries: 3
    failover_criteria:
      - error
      - timeout
      - http_429
      - http_503
      - non_idempotent
  embeddings:
    auth:
      header_name: Authorization
      header_value: Bearer ${key}
    model:
      name: text-embedding-3-small
      provider: openai
  vectordb:
    strategy: redis
    distance_metric: cosine
    threshold: 0.7
    dimensions: 1024
    redis:
      host: localhost
      port: 6379
  targets:
    - model:
        name: gpt-4o
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 2
      description: Specialist in real life topics
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true
    - model:
        name: gpt-4o
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 75
      description: Specialist in programming problems
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true
    - model:
        name: gpt-4o-mini
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 25
      description: Specialist in programming problems
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true

variables:
  key:
    value: $OPENAI_API_KEY
    description: The API key to use to connect to OpenAI.

tools:
  - deck
  - admin-api
  - konnect-api
  - kic
  - terraform

group: load-balancing
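
The fallback mechanism in this example hinges on two targets sharing an identical `description` string. A stripped-down sketch of just the relevant fields (auth, options, and logging omitted for brevity):

balancer:
  algorithm: semantic
targets:
  - model: { name: gpt-4o, provider: openai }
    route_type: llm/v1/chat
    weight: 75                                       # preferred when the description matches
    description: Specialist in programming problems  # identical description...
  - model: { name: gpt-4o-mini, provider: openai }
    route_type: llm/v1/chat
    weight: 25                                       # ...so this target serves as the round-robin fallback
    description: Specialist in programming problems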

app/_kong_plugins/ai-proxy-advanced/index.md

Lines changed: 6 additions & 1 deletion
@@ -171,6 +171,9 @@ rows:
   - algorithm: "[Consistent-hashing (sticky-session on given header value)](/plugins/ai-proxy-advanced/examples/consistent-hashing/)"
     description: |
       The consistent-hashing algorithm routes requests based on a specified header value (`X-Hashing-Header`). Requests with the same header are repeatedly routed to the same model, enabling sticky sessions for maintaining context or affinity across user interactions.
+  - algorithm: "[Least-connections](/plugins/ai-proxy-advanced/examples/least-connections/)"
+    description: |
+      {% new_in 3.13 %} The least-connections algorithm tracks the number of in-flight requests for each backend. Weights are used to calculate the connection capacity of a backend. Requests are routed to the backend with the highest spare capacity. This option is more dynamic, automatically routing new requests to other backends when slower backends accumulate more open connections.
   - algorithm: "[Lowest-latency](/plugins/ai-proxy-advanced/examples/lowest-latency/)"
     description: |
       The lowest-latency algorithm is based on the response time for each model. It distributes requests to models with the lowest response time.
@@ -189,10 +192,12 @@ rows:
       The priority algorithm routes requests to groups of models based on assigned weights. Higher-weighted groups are preferred, and if all models in a group fail, the plugin falls back to the next group. This allows for reliable failover and cost-aware routing across multiple AI models.
   - algorithm: "[Round-robin (weighted)](/plugins/ai-proxy-advanced/examples/round-robin/)"
     description: |
-      The round-robin algorithm distributes requests across models based on their respective weights. For example, if your models `gpt-4`, `gpt-4o-mini`, and `gpt-3` have weights of `70`, `25`, and `5` respectively, theyll receive approximately 70%, 25%, and 5% of the traffic in turn. Requests are distributed proportionally, independent of usage or latency metrics.
+      The round-robin algorithm distributes requests across models based on their respective weights. For example, if your models `gpt-4`, `gpt-4o-mini`, and `gpt-3` have weights of `70`, `25`, and `5` respectively, they'll receive approximately 70%, 25%, and 5% of the traffic in turn. Requests are distributed proportionally, independent of usage or latency metrics.
   - algorithm: "[Semantic](/plugins/ai-proxy-advanced/examples/semantic/)"
     description: |
       The semantic algorithm distributes requests to different models based on the similarity between the prompt in the request and the description provided in the model configuration. This allows Kong to automatically select the model that is best suited for the given domain or use case.
+
+      {% new_in 3.13 %} Multiple targets can be [configured with identical descriptions](/plugins/ai-proxy-advanced/examples/semantic-with-fallback/). When multiple targets share the same description, the AI balancer performs round-robin fallback among these targets if the primary target fails. Weights affect the order in which fallback targets are selected.
 {% endtable %}
 <!--vale on-->
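
The round-robin row above describes a 70/25/5 split. As a rough sketch only (model names and weights taken from that description; auth, route_type, and other required fields omitted), the corresponding targets could look like:

balancer:
  algorithm: round-robin
targets:
  - model: { name: gpt-4, provider: openai }
    weight: 70   # receives roughly 70% of requests
  - model: { name: gpt-4o-mini, provider: openai }
    weight: 25   # roughly 25%
  - model: { name: gpt-3, provider: openai }
    weight: 5    # roughly 5%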

app/ai-gateway/load-balancing.md

Lines changed: 16 additions & 1 deletion
@@ -53,6 +53,12 @@ Kong AI Gateway supports multiple load balancing strategies to optimize traffic
 
 The table below provides a detailed overview of the available algorithms, along with considerations to keep in mind when selecting the best option for your use case.
 
+### Load balancing algorithms
+
+Kong AI Gateway supports multiple load balancing strategies to optimize traffic distribution across AI models. Each algorithm is suited for different performance goals such as balancing load, improving cache-hit ratios, reducing latency, or ensuring [failover reliability](#retry-and-fallback).
+
+The table below provides a detailed overview of the available algorithms, along with considerations to keep in mind when selecting the best option for your use case.
+
 <!--vale off-->
 {% table %}
 columns:
@@ -77,6 +83,13 @@ rows:
       * Especially effective with consistent keys like user IDs.
       * Requires diverse hash inputs for balanced distribution.
      * Ideal for maintaining session persistence.
+  - algorithm: "[Least-connections](/plugins/ai-proxy-advanced/examples/least-connections/)"
+    description: |
+      {% new_in 3.13 %} Routes requests to backends with the highest spare capacity based on in-flight request counts. In the configuration, the [`weight`](/plugins/ai-proxy-advanced/reference/#schema--config-targets-weight) parameter calculates the connection capacity of each backend.
+    considerations: |
+      * Provides good distribution of traffic.
+      * More dynamic, automatically routing new requests to other backends when slower backends accumulate more open connections.
+      * Does not improve cache-hit ratios.
   - algorithm: "[Lowest-usage](/plugins/ai-proxy-advanced/examples/lowest-usage/)"
     description: |
       Routes requests to the least-utilized models based on resource usage metrics. In the configuration, the [`tokens_count_strategy`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-tokens-count-strategy) (for example, `prompt-tokens`) defines how usage is measured, focusing on prompt tokens or other resource indicators.
@@ -88,14 +101,16 @@ rows:
     description: |
       Routes requests to the models with the lowest observed latency. In the configuration, the [`latency_strategy`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-latency-strategy) parameter (for example, `latency_strategy: e2e`) defines how latency is measured, typically based on end-to-end response times. By default, the latency is calculated based on the time the model takes to generate each token (`tpot`).
 
-      The latency algorithm is based on peak EWMA (Exponentially Weighted Moving Average), which ensures that the balancer selects the backend by the lowest latency. The latency metric used is the full request cycle, from TCP connect to body response time. Since its a moving average, the metrics will decay over time.
+      The latency algorithm is based on peak EWMA (Exponentially Weighted Moving Average), which ensures that the balancer selects the backend by the lowest latency. The latency metric used is the full request cycle, from TCP connect to body response time. Since it's a moving average, the metrics will decay over time.
     considerations: |
       * Prioritizes models with the fastest response times.
       * Optimizes for real-time performance in time-sensitive applications.
       * Less suitable for long-lived or persistent connections (e.g., WebSockets).
   - algorithm: "[Semantic](/plugins/ai-proxy-advanced/examples/semantic/)"
     description: |
       Routes requests based on semantic similarity between the prompt and model descriptions. In the configuration, embeddings are generated using a specified model (e.g., `text-embedding-3-small`), and similarity is calculated using vector search.
+
+      {% new_in 3.13 %} Multiple targets can be configured with [identical descriptions](/plugins/ai-proxy-advanced/examples/semantic-with-fallback/). When multiple targets share the same description, the AI balancer performs round-robin fallback among these targets if the primary target fails. Weights affect the order in which fallback targets are selected.
     considerations: |
       * Uses vector search (for example, Redis) to find the best match based on prompt embeddings.
       * `distance_metric` and `threshold` settings fine-tune matching sensitivity.
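
To make the least-connections row above concrete, here is a hypothetical snapshot. The spare-capacity reasoning in the comments is an illustration inferred from the description ("weight calculates connection capacity; requests go to the backend with the highest spare capacity"), not a documented formula:

balancer:
  algorithm: least-connections
targets:
  - model: { name: gpt-4o, provider: openai }
    weight: 2   # e.g., 3 requests currently in flight → less spare capacity
  - model: { name: gpt-4o-mini, provider: openai }
    weight: 2   # e.g., 1 request in flight → highest spare capacity, receives the next request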
