**Example: least-connections load balancing**

```yaml
description: 'Configure the plugin to use two OpenAI models and route requests based on in-flight connection counts and spare capacity.'

extended_description: |
  {% new_in 3.13 %} Configure the plugin to use two OpenAI models and route requests to the backend with the highest spare capacity based on in-flight connection counts.

  In this example, both models have equal weight (2), so requests are distributed based on which backend has fewer active connections. The algorithm automatically routes new requests to backends with more spare capacity, making it particularly effective when backends have varying response times.

weight: 111

requirements:
  - An OpenAI account

config:
  balancer:
    algorithm: least-connections
    retries: 3
    failover_criteria:
      - error
      - timeout
      - http_429
      - non_idempotent
  targets:
    - model:
        name: gpt-4o
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 2
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true
    - model:
        name: gpt-4o-mini
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 2
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true

variables:
  key:
    value: $OPENAI_API_KEY
    description: The API key to use to connect to OpenAI.
```
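The selection logic behind `least-connections` is easiest to see in miniature. The sketch below is a simplified illustration, not the plugin's actual Lua implementation; the `Backend` class and `pick_backend` helper are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    weight: int      # configured capacity, like `weight: 2` in the config above
    in_flight: int   # requests currently open against this backend

def pick_backend(backends):
    # Spare capacity: the share of a backend's weighted capacity that is
    # not already consumed by in-flight requests.
    def spare(b):
        return 1.0 - b.in_flight / b.weight if b.weight else 0.0
    return max(backends, key=spare)

backends = [Backend("gpt-4o", weight=2, in_flight=2),
            Backend("gpt-4o-mini", weight=2, in_flight=1)]
print(pick_backend(backends).name)  # -> gpt-4o-mini (more spare capacity)
```

With equal weights, as in this example, the comparison reduces to picking the backend with the fewest active connections; raising one target's `weight` gives it proportionally more connection capacity, so it keeps receiving new requests until that larger capacity is used up.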
**Example: semantic routing with fallback**

```yaml
description: 'Configure the plugin to route requests based on semantic similarity between prompts and model descriptions, with automatic fallback among models sharing identical descriptions.'

extended_description: |
  {% new_in 3.13 %} Configure the plugin to use three OpenAI models and route requests based on semantic similarity between the prompt and model descriptions.

  In this example, two targets share the same description ("Specialist in programming problems"). When a prompt matches this description, the plugin first routes to the target with weight 75 (gpt-4o). If that target fails, it falls back to the target with weight 25 (gpt-4o-mini) using round-robin. The third target, with a different description ("Specialist in real life topics"), handles prompts about non-technical topics.

weight: 111

min_version:
  gateway: '3.13'

requirements:
  - An OpenAI account
  - A Redis instance for vector storage

config:
  balancer:
    algorithm: semantic
    retries: 3
    failover_criteria:
      - error
      - timeout
      - http_429
      - http_503
      - non_idempotent
  embeddings:
    auth:
      header_name: Authorization
      header_value: Bearer ${key}
    model:
      name: text-embedding-3-small
      provider: openai
  vectordb:
    strategy: redis
    distance_metric: cosine
    threshold: 0.7
    dimensions: 1024
    redis:
      host: localhost
      port: 6379
  targets:
    - model:
        name: gpt-4o
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 2
      description: Specialist in real life topics
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true
    - model:
        name: gpt-4o
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 75
      description: Specialist in programming problems
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true
    - model:
        name: gpt-4o-mini
        provider: openai
        options:
          max_tokens: 1024
          temperature: 1.0
      route_type: llm/v1/chat
      weight: 25
      description: Specialist in programming problems
      auth:
        header_name: Authorization
        header_value: Bearer ${key}
      logging:
        log_statistics: true
        log_payloads: true

variables:
  key:
    value: $OPENAI_API_KEY
    description: The API key to use to connect to OpenAI.
```
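To illustrate how the `semantic` algorithm and its fallback groups behave, here is a minimal sketch. It is an approximation under stated assumptions: plain in-process cosine similarity against precomputed description embeddings, with hypothetical `route` and `cosine_similarity` helpers; the real plugin uses the configured vector database (Redis here) rather than this direct comparison:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(prompt_vec, targets, threshold=0.7):
    """Return the ordered fallback group for a prompt embedding.

    Each target is a dict with "name", "weight", "description", and
    "desc_vec" (a precomputed embedding of its description).
    """
    best = max(targets, key=lambda t: cosine_similarity(prompt_vec, t["desc_vec"]))
    if cosine_similarity(prompt_vec, best["desc_vec"]) < threshold:
        return []  # nothing matches closely enough (cf. `threshold: 0.7`)
    # Targets sharing the winning description form one fallback group;
    # higher weight is tried first (gpt-4o at 75 before gpt-4o-mini at 25).
    group = [t for t in targets if t["description"] == best["description"]]
    return sorted(group, key=lambda t: t["weight"], reverse=True)
```

Applied to the configuration above, a programming prompt would return the two "Specialist in programming problems" targets in weight order, so gpt-4o handles the request and gpt-4o-mini only sees traffic if gpt-4o fails one of the `failover_criteria`.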
**app/_kong_plugins/ai-proxy-advanced/index.md** (6 additions, 1 deletion)
```diff
@@ -171,6 +171,9 @@ rows:
   - algorithm: "[Consistent-hashing (sticky-session on given header value)](/plugins/ai-proxy-advanced/examples/consistent-hashing/)"
     description: |
       The consistent-hashing algorithm routes requests based on a specified header value (`X-Hashing-Header`). Requests with the same header are repeatedly routed to the same model, enabling sticky sessions for maintaining context or affinity across user interactions.
+      {% new_in 3.13 %} The least-connections algorithm tracks the number of in-flight requests for each backend. Weights are used to calculate the connection capacity of a backend. Requests are routed to the backend with the highest spare capacity. This option is more dynamic, automatically routing new requests to other backends when slower backends accumulate more open connections.
       The lowest-latency algorithm is based on the response time for each model. It distributes requests to models with the lowest response time.
@@ -189,10 +192,12 @@ rows:
       The priority algorithm routes requests to groups of models based on assigned weights. Higher-weighted groups are preferred, and if all models in a group fail, the plugin falls back to the next group. This allows for reliable failover and cost-aware routing across multiple AI models.
-      The round-robin algorithm distributes requests across models based on their respective weights. For example, if your models `gpt-4`, `gpt-4o-mini`, and `gpt-3` have weights of `70`, `25`, and `5` respectively, they’ll receive approximately 70%, 25%, and 5% of the traffic in turn. Requests are distributed proportionally, independent of usage or latency metrics.
+      The round-robin algorithm distributes requests across models based on their respective weights. For example, if your models `gpt-4`, `gpt-4o-mini`, and `gpt-3` have weights of `70`, `25`, and `5` respectively, they'll receive approximately 70%, 25%, and 5% of the traffic in turn. Requests are distributed proportionally, independent of usage or latency metrics.
       The semantic algorithm distributes requests to different models based on the similarity between the prompt in the request and the description provided in the model configuration. This allows Kong to automatically select the model that is best suited for the given domain or use case.
+
+      {% new_in 3.13 %} Multiple targets can be [configured with identical descriptions](/plugins/ai-proxy-advanced/examples/semantic-with-fallback/). When multiple targets share the same description, the AI balancer performs round-robin fallback among these targets if the primary target fails. Weights affect the order in which fallback targets are selected.
```
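The weighted round-robin behavior described above (weights of 70, 25, and 5 yielding roughly 70%, 25%, and 5% of traffic) can be simulated in a few lines. This is an illustrative sketch using the smooth weighted round-robin technique, not necessarily the scheduler Kong uses internally:

```python
from collections import Counter

weights = {"gpt-4": 70, "gpt-4o-mini": 25, "gpt-3": 5}

def weighted_round_robin(weights):
    # Smooth weighted round-robin: each step, bump every model's credit by
    # its weight, pick the model with the highest credit, then subtract the
    # total weight from the winner so it waits its turn again.
    credits = {m: 0 for m in weights}
    total = sum(weights.values())
    while True:
        for m, w in weights.items():
            credits[m] += w
        best = max(credits, key=credits.get)
        credits[best] -= total
        yield best

gen = weighted_round_robin(weights)
print(Counter(next(gen) for _ in range(1000)))  # ~700 / 250 / 50
```

The smooth variant interleaves picks rather than sending 70 requests in a burst, which keeps short-term load close to the configured proportions.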
**app/ai-gateway/load-balancing.md** (16 additions, 1 deletion)
```diff
@@ -53,6 +53,12 @@ Kong AI Gateway supports multiple load balancing strategies to optimize traffic
 
 The table below provides a detailed overview of the available algorithms, along with considerations to keep in mind when selecting the best option for your use case.
 
+### Load balancing algorithms
+
+Kong AI Gateway supports multiple load balancing strategies to optimize traffic distribution across AI models. Each algorithm is suited for different performance goals such as balancing load, improving cache-hit ratios, reducing latency, or ensuring [failover reliability](#retry-and-fallback).
+
+The table below provides a detailed overview of the available algorithms, along with considerations to keep in mind when selecting the best option for your use case.
+
 <!--vale off-->
 {% table %}
 columns:
@@ -77,6 +83,13 @@ rows:
       * Especially effective with consistent keys like user IDs.
       * Requires diverse hash inputs for balanced distribution.
+      {% new_in 3.13 %} Routes requests to backends with the highest spare capacity based on in-flight request counts. In the configuration, the [`weight`](/plugins/ai-proxy-advanced/reference/#schema--config-targets-weight) parameter calculates the connection capacity of each backend.
+    considerations: |
+      * Provides good distribution of traffic.
+      * More dynamic, automatically routing new requests to other backends when slower backends accumulate more open connections.
       Routes requests to the least-utilized models based on resource usage metrics. In the configuration, the [`tokens_count_strategy`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-tokens-count-strategy) (for example, `prompt-tokens`) defines how usage is measured, focusing on prompt tokens or other resource indicators.
@@ -88,14 +101,16 @@ rows:
     description: |
       Routes requests to the models with the lowest observed latency. In the configuration, the [`latency_strategy`](/plugins/ai-proxy-advanced/reference/#schema--config-balancer-latency-strategy) parameter (for example, `latency_strategy: e2e`) defines how latency is measured, typically based on end-to-end response times. By default, the latency is calculated based on the time the model takes to generate each token (`tpot`).
 
-      The latency algorithm is based on peak EWMA (Exponentially Weighted Moving Average), which ensures that the balancer selects the backend by the lowest latency. The latency metric used is the full request cycle, from TCP connect to body response time. Since it’s a moving average, the metrics will decay over time.
+      The latency algorithm is based on peak EWMA (Exponentially Weighted Moving Average), which ensures that the balancer selects the backend with the lowest latency. The latency metric used is the full request cycle, from TCP connect to body response time. Since it's a moving average, the metrics will decay over time.
     considerations: |
       * Prioritizes models with the fastest response times.
       * Optimizes for real-time performance in time-sensitive applications.
       * Less suitable for long-lived or persistent connections (e.g., WebSockets).
       Routes requests based on semantic similarity between the prompt and model descriptions. In the configuration, embeddings are generated using a specified model (e.g., `text-embedding-3-small`), and similarity is calculated using vector search.
+
+      {% new_in 3.13 %} Multiple targets can be configured with [identical descriptions](/plugins/ai-proxy-advanced/examples/semantic-with-fallback/). When multiple targets share the same description, the AI balancer performs round-robin fallback among these targets if the primary target fails. Weights affect the order in which fallback targets are selected.
     considerations: |
       * Uses vector search (for example, Redis) to find the best match based on prompt embeddings.
       * `distance_metric` and `threshold` settings fine-tune matching sensitivity.
```
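The peak-EWMA idea behind the lowest-latency algorithm can be sketched briefly. The class below is a hand-rolled approximation; the decay constant and the jump-on-peak rule are assumptions for illustration, not Kong's implementation:

```python
import math
import time

class EwmaLatency:
    DECAY_S = 10.0  # assumed: how quickly old observations fade

    def __init__(self):
        self.value = 0.0
        self.last = time.monotonic()

    def observe(self, latency_s):
        now = time.monotonic()
        # The old average's weight decays exponentially with elapsed time,
        # which is why stale measurements "decay over time".
        w = math.exp(-(now - self.last) / self.DECAY_S)
        if latency_s > self.value:
            # "Peak" behavior: react immediately to a slow response so a
            # backend that just stalled stops winning the comparison.
            self.value = latency_s
        else:
            self.value = self.value * w + latency_s * (1.0 - w)
        self.last = now
        return self.value

# A balancer would keep one tracker per backend and route each request to
# the backend whose tracker currently reports the lowest value.
```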