Commit 7ab2ef3

Blog post introducing Queue-Based Scaling
Signed-off-by: Alex Ellis (OpenFaaS Ltd) <[email protected]>
1 parent 3083240 commit 7ab2ef3

File tree

6 files changed: +242 -1 lines changed

_posts/2022-10-06-pdf-generation-at-scale-on-kubernetes.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -18,7 +18,7 @@ Learn how to run headless browsers on Kubernetes with massive scale using OpenFaaS
 > Intro from Alex:
 >
-> We had a call with a team from Deel (an international payroll company) who told us they'd recently migrated their PDF generation functions from AWS Lambda to Kubernetes. They told us that AWS Lambda scaled exactly how they wanted, despite its high cost. Then, after moving to Kubernetes, they started to run into various problems scaling headless Chrome and it is still a pain point for them today.
+> We had a call with a team from [Deel](https://deel.com) (an international payroll company) who told us they'd recently migrated their PDF generation functions from AWS Lambda to Kubernetes. They told us that AWS Lambda scaled exactly how they wanted, despite its high cost. Then, after moving to Kubernetes, they started to run into various problems scaling headless Chrome and it is still a pain point for them today.
 >
 > This article shows how we approached the problem using built-in features of OpenFaaS - connection-based (aka capacity), hard limits (aka max_inflight) and a flexible queuing system that can retry failed invocations.
 >
```
Lines changed: 241 additions & 0 deletions
@@ -0,0 +1,241 @@
---
title: "Introducing Queue Based Scaling for Functions"
description: "Queue-Based Scaling is a long-awaited feature that matches queued requests to the exact number of replicas almost instantly."
date: 2025-07-29
author_staff_member: alex
categories:
- queue
- async
- autoscaling
- kubernetes
- serverless
dark_background: true
# image: "/images/2025-07-headroom/background.png"
hide_header_image: true
---

Queue-Based Scaling is a long-awaited feature for OpenFaaS that matches queued requests to the exact number of replicas almost instantly.

The initial version of OpenFaaS released in 2016 had effective, but rudimentary autoscaling based upon Requests Per Second (RPS), driven through AlertManager, a component of the Prometheus project. In 2019, with the growing needs of commercial users running long-running jobs, we rewrote the autoscaler to query metrics directly from functions and Kubernetes to fine-tune how functions scaled.

OpenFaaS already has a versatile set of scaling modes that can be fine-tuned: Requests Per Second (RPS), Capacity (inflight connections/concurrency), CPU, and Custom. This new mode is specialised to match the needs of large numbers of background tasks and long-running processing jobs.

## What is Queue-Based Scaling?
Queue-Based Scaling is a new autoscaling mode for OpenFaaS functions. It is made possible by supporting changes that emit queue depth metrics for each function that's being invoked asynchronously.

This new scaling mode fits well for functions that:

* Are primarily invoked asynchronously
* May have a large backlog of requests
* Need to scale up to the maximum number of replicas as quickly as possible
* Run in batches, bursts, or spikes for minutes to hours

Typical tasks include: Extract, Transform, Load (ETL) jobs, security/asset auditing and analysis, data processing, image processing, video transcoding, file scanning, backup/synchronisation, and other background tasks.
All previous scaling modes used *output metrics* from the function to determine the number of replicas, which can involve some lag as invocations build up from a few per second to hundreds or thousands per second.

When using the queue depth, we have an *input metric* that is available immediately, and which can be used to set the exact number of replicas needed to process the backlog of requests.

**A note from a customer**

[Surge](https://www.workwithsurge.com) is a lending platform providing in-depth financial analysis, insights and risk management for their clients. They use dozens of OpenFaaS functions to process data in long-running asynchronous jobs. Part of that involves synchronising data between [Salesforce.com](https://www.salesforce.com) and Snowflake, a data warehousing solution.

Kevin Lindsay, Principal Engineer at Surge, rolled out Queue-Based Scaling for their existing functions and said:

> "We just changed the `com.openfaas.scale.type` to `queue` and now async is basically instantly reactive, burning through large queues in minutes"

Kevin explained that Surge makes heavy use of Datadog for logging and insights, which charges based upon various factors, including the number of Pods and Nodes in the cluster. Unnecessary Pods and extra capacity in the cluster mean a larger bill, so having reactive horizontal scaling and scale to zero is a big win for them.
**Load test - Comparing Queue-Based Scaling to Capacity Scaling**

We ran a load test to compare the new Queue-Based Scaling mode to the existing Capacity scaling mode. Capacity mode is also effective for asynchronous invocations, and for functions that are invoked in a hybrid manner (i.e. a mixture of both synchronous and asynchronous invocations).

For the test, we used `hey` to generate 1000 invocations of the sleep function from the store. Each invocation had a variable run-time of 10-25s to simulate a long-running job.

You will see a number of retries in the graphs, emitted as 429 responses from the function. This is because we set a hard limit of 5 inflight connections per replica to simulate a limited or expensive resource such as API calls or database connections.
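For reference, here is a hedged sketch of a comparable setup. The exact labels, limits and flags used for the post's benchmark aren't shown, so treat these values as assumptions:

```bash
# Hedged sketch of a comparable setup - the labels, limits and flags are assumptions.

# Deploy the sleep function from the store with a hard limit of 5 inflight requests
faas-cli store deploy sleep \
  --label com.openfaas.scale.type=capacity \
  --label com.openfaas.scale.max=10 \
  --env max_inflight=5

# Enqueue 1000 asynchronous invocations, 100 at a time
hey -m POST -n 1000 -c 100 http://127.0.0.1:8080/async-function/sleep
```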
First up - Capacity Scaling:

![Load test with capacity mode](/images/2025-07-queue-based/capacity-scaling.png)

We see that the load starts low and builds up as the number of inflight connections increases, and the autoscaler responds by adding more replicas.

It is effective, but given that all of the invocations are asynchronous, we already had the data to scale up to the maximum number of replicas immediately.

Next up - Queue-Based Scaling:

![Load test with queue mode](/images/2025-07-queue-based/queue-scaling.png)

The load metric in this screenshot is the equivalent of the pending queue-depth.

We see the number of replicas jump to the maximum of 10 and remain there until the queue is emptied, which means the load (the number of invocations) is also able to start out at the maximum level.
## How does it work?

Just like all the other autoscaling modes, basic ranges are set in the [function's stack.yaml](https://docs.openfaas.com/reference/yaml/) file, or via a [REST API call](https://docs.openfaas.com/reference/rest-api/).

**A quick recap on scaling modes**

One size does not fit all, so to give a quick summary:

* RPS - the default, and useful for most functions that execute quickly
* Capacity - also known as "inflight connections" or "concurrency" - best for long-running jobs or those which are going to be limited on concurrency
* CPU - a good fit when RPS/Capacity aren't working as expected
* Custom - any metric that you can find in Prometheus, or emit from some component of your stack, can be used to drive scaling
**Demo with Queue-Based Scaling**

First, you can set a custom range for the minimum and maximum number of replicas (or use the defaults):

```yaml
functions:
  etl:
    labels:
      com.openfaas.scale.min: "1"
      com.openfaas.scale.max: "100"
```

Then, you specify whether it should also scale to zero, with an optional custom idle period:

```yaml
labels:
  com.openfaas.scale.zero: "true"
  com.openfaas.scale.zero-duration: "5m"
```

Finally, you can set the scaling type and how many requests per Pod to target:
```yaml
labels:
  com.openfaas.scale.type: "queue"
  com.openfaas.scale.target: "10"
  com.openfaas.scale.target-proportion: "1"
```
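The same labels can also be applied when deploying through the OpenFaaS REST API. A minimal sketch, assuming the default gateway address and basic-auth credentials from a standard Helm installation; the image name is a placeholder:

```bash
# Hedged sketch: create a function with queue-based scaling labels via the REST API.
# The image name is a placeholder; gateway address and credentials are the defaults.
PASSWORD=$(kubectl get secret -n openfaas basic-auth \
  -o jsonpath="{.data.basic-auth-password}" | base64 --decode)

curl -s -u admin:$PASSWORD \
  -X POST http://127.0.0.1:8080/system/functions \
  -H "Content-Type: application/json" \
  -d '{
    "service": "etl",
    "image": "docker.io/example/etl:latest",
    "labels": {
      "com.openfaas.scale.min": "1",
      "com.openfaas.scale.max": "100",
      "com.openfaas.scale.type": "queue",
      "com.openfaas.scale.target": "10"
    }
  }'
```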
With all of the above, we have a function that:

* Scales from 1 to 100 replicas
* Scales to zero after 5 minutes of inactivity
* Gets 1 Pod for every 10 requests in the queue

So if you have to scan 1,000,000 CSV files from an AWS S3 Bucket, you could enqueue one request for each file. This would create a queue depth of 1M requests, and so the autoscaler would immediately create 100 Pods (the maximum set via the label).
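As a hedged illustration of how those requests could be enqueued, assuming a function named `etl` and a placeholder bucket name:

```bash
# Hedged sketch: enqueue one asynchronous invocation per S3 object key.
# The bucket name is a placeholder; "etl" is the function defined above.
aws s3 ls s3://example-bucket --recursive | awk '{print $4}' | \
while read -r key; do
  curl -s -X POST -d "$key" http://127.0.0.1:8080/async-function/etl
done
```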
In any of the prior modes, the Queue Worker would have to build up a steady flow of requests in order for the scaling to take place.

If you wanted to generate load in a rudimentary way, you could use the open source tool `hey` to submit, for example, 2.5 million requests to the above function.

```bash
hey -d PAYLOAD -m POST -n 2500000 -c 100 http://127.0.0.1:8080/async-function/etl
```

Any function invoked via the queue-worker can also return its result via a webhook, if you pass in a URL via the `X-Callback-Url` header.
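For example, a single queued invocation with a callback might look like this (the payload and callback URL are placeholders):

```bash
# Enqueue one invocation; the queue-worker POSTs the result to the callback URL
# once the function has completed. The payload and callback URL are placeholders.
curl -i -X POST \
  -d '{"file": "reports/2025-07.csv"}' \
  -H "X-Callback-Url: https://example.com/hooks/etl-result" \
  http://127.0.0.1:8080/async-function/etl
```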
## Concurrency limiting and retrying requests

Queued requests can be limited in concurrency, and retried if they fail.

Hard concurrency limiting can be achieved by setting the `max_inflight` environment variable, e.g. a value of `10` would mean an 11th concurrent request gets a 429 Too Many Requests response.

```yaml
environment:
  max_inflight: "10"
```

[Retries](https://docs.openfaas.com/openfaas-pro/retries/) are already configured as a system-wide default from the Helm chart, but they can be overridden on a per-function basis, which is important for long-running jobs that may take a while to complete.

```yaml
annotations:
  com.openfaas.retry.attempts: "100"
  com.openfaas.retry.codes: "429"
  com.openfaas.retry.min_wait: "5s"
  com.openfaas.retry.max_wait: "5m"
```
## Better fairness and efficiency

The previous version of the Queue Worker created a single Consumer for all invocations.

That meant that if you had 10,000 invocations come in from one tenant for their functions, they would likely block any other requests that came in after that.

The new mode creates a Consumer per function, where each Consumer gets scheduled independently into a work queue.

If you do find that certain tenants or functions are monopolising the queue, you can provision dedicated queues using the [Queue Worker Helm chart](https://github.com/openfaas/faas-netes/tree/master/chart/queue-worker).
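A minimal sketch of installing such a dedicated queue, assuming the chart is available from the `openfaas` Helm repository; the release name, namespace and values file are placeholders, so check the chart's own values.yaml for the exact options:

```bash
# Hedged sketch: install an additional, dedicated queue-worker alongside the default one.
# The release name, namespace and values file are placeholders.
helm upgrade slow-queue openfaas/queue-worker \
  --install \
  --namespace openfaas \
  -f slow-queue-values.yaml
```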
Let's picture the difference by observing the Grafana Dashboard for the Queue Worker.

In the first picture, we'll show the default mode, "static", where a single Consumer is created for all functions, and asynchronous invocations are processed in a FIFO manner.

The sleep-1 function has all of its invocations processed first, and sleep-2 is unable to make any progress until the first function has been processed.

![Queue metrics dashboard in static mode](/images/2025-07-queue-based/fairness-static.png)

Next, we show two functions that are invoked asynchronously, but this time with the new "function" mode. Each function has its own Consumer, and so they can be processed independently.

![Queue metrics dashboard in function mode](/images/2025-07-queue-based/fairness-function.png)

Here, we see that the sleep-1 function is still being processed first, but the sleep-2 function is also able to make progress at the same time.
## What changes have been made?

A number of changes have been made to support Queue-Based Scaling:

* Queue Worker - the component that performs asynchronous invocations

    When set to run in "function" mode, it will now create a Consumer per function with queued requests. It deletes any Consumers once all available invocations have been processed.

* Helm chart - new scaling rule and type "queue"

    No changes were needed in the autoscaler, however the Helm chart introduces a new scaling rule named "queue".

* Gateway - publish invocations to an updated subject

    Previously, all messages were published to a single subject in NATS, which meant no metric could be obtained on a per-function basis. The updated subject format includes the function name, allowing precise queue depth metrics to be collected.

Note that the 0.5.x gateway will start publishing messages to a new subject format, so if you update the gateway, you must also update the Queue Worker to 0.4.x or later, otherwise the Queue Worker will not be able to consume any messages.
## How do you turn it all on?

Since these features change the way that OpenFaaS works, and we value backwards compatibility, Queue-Based Scaling is an opt-in feature.

First, update to the latest version of the OpenFaaS Helm chart, which includes:

* Queue Worker 0.4.x or later
* Gateway 0.5.x or later

Then configure the following in your `values.yaml` file:

```yaml
jetstreamQueueWorker:
  mode: function
```

The `mode` variable can be set to `static` to use the previous FIFO / single Consumer model, or `function` to use the new Consumer per function model.

At the same time as introducing this new setting, we have deprecated an older configuration option that is no longer needed: `queueMode`.

So if you have a `queueMode` setting in your `values.yaml`, you can now safely remove it, so long as you stay on a newer version of the Helm chart.
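If you installed OpenFaaS with Helm, applying the change is a standard chart upgrade; a minimal sketch, assuming the usual release name and namespace:

```bash
# Apply the updated values.yaml to an existing installation.
# The release name and namespace are the common defaults - adjust as needed.
helm repo update

helm upgrade openfaas openfaas/openfaas \
  --install \
  --namespace openfaas \
  -f values.yaml
```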
## Wrapping up

A quick summary of Queue-Based Scaling:

* The Queue Worker consumes messages in a fairer way than previously
* It creates Consumers per function, but only when they have some work to do
* The new `queue` scaling mode is reactive and precise - setting the exact number of replicas immediately
* It's better for multi-tenant deployments, where one tenant cannot monopolise the queue as easily

If you'd like a demo of asynchronous processing or long-running jobs, please reach out via the [form on our pricing page](https://openfaas.com/pricing).

Use-cases:

* [Generate PDFs at Scale](/blog/pdf-generation-at-scale-on-kubernetes)
* [Exploring the Fan out and Fan in pattern](/blog/fan-out-and-back-in-using-functions/)
* [On Autoscaling - What Goes Up Must Come Down](/blog/what-goes-up-must-come-down/)

Docs:

* [Docs: OpenFaaS Asynchronous Invocations](https://docs.openfaas.com/async/)
* [Docs: OpenFaaS Queue Worker](https://docs.openfaas.com/pro/jetstream-queue-worker/)
