feat(core): Expose service instance metrics via API and add new pending worker messages metric. by ammeek · Pull Request #14495 · kestra-io/kestra

ammeek · 2026-02-09T11:24:54Z

✨ Description

What does this PR change?
This PR adds the ability to query service instance metrics via the web server.
This PR also adds a new metric for the number of unconsumed messages on the worker queue.

🔗 Related Issue

Part of #424

🛠️ Backend Checklist

If this PR does not include any backend changes, delete this entire section.

Code compiles successfully and passes all checks
All unit and integration tests pass

github-actions · 2026-02-09T11:31:35Z

🐋 Docker image

ghcr.io/kestra-io/kestra-pr:14495

docker run --pull=always --rm -it -p 8080:8080 --user=root -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/tmp ghcr.io/kestra-io/kestra-pr:14495 server local

🧪 Java Unit Tests

	Tests	Passed ✅	Skipped ⚠️	Failed	Time ⏱
Java Tests Report	3902 ran	3887 ✅	15 ⚠️	0 ❌	42m 40s 557ms

github-actions · 2026-02-09T11:46:48Z

Tests report quick summary:

success ✅ > tests: 3902, success: 3887, skipped: 15, failed: 0

Project	Status	Success	Skipped
cli	success ✅	80	0
core	success ✅	1856	1
executor	success ✅	4	0
jdbc	success ✅	12	0
jdbc-h2	success ✅	473	0
jdbc-mysql	success ✅	476	0
jdbc-postgres	success ✅	476	0
processor	success ✅	7	0
runner-memory	success ✅	1	0
scheduler	success ✅	23	0
script	success ✅	11	0
storage-local	success ✅	64	0
webserver	success ✅	414	0
worker	success ✅	4	0

Develocity build scan: https://develocity.kestra.io/s/qyjkzscd7yope

loicmathieu

This is not a full review, @fhussonnois will do the final one.
But I think you are mixing up consumerGroup with queueType; the worker group will be passed using a consumerGroup.

core/src/main/java/io/kestra/core/metrics/MetricRegistry.java

core/src/main/java/io/kestra/core/queues/QueueInterface.java

core/src/main/java/io/kestra/core/queues/WorkerQueueSizePoller.java

jdbc-mysql/src/main/java/io/kestra/runner/mysql/MysqlQueue.java

core/src/main/java/io/kestra/core/queues/QueueLagPoller.java

ammeek · 2026-02-16T12:13:35Z

@fhussonnois and @loicmathieu this PR is now ready for review.

loicmathieu · 2026-02-16T14:55:28Z

core/src/main/java/io/kestra/core/queues/QueueLagPoller.java

+        QueueInterface<WorkerJob> workerJobQueue = workerJobQueueProvider.get();
+        this.register(
+            getQueueLagForConsumerGroup(WORKERJOB_NAMED, null, Worker.class, workerJobQueue),
+            MetricRegistry.TAG_WORKER_GROUP, "default",


The issue is that if a user creates a worker group named default, which is totally acceptable, it will clash with this tag value.

@loicmathieu This seems like a valid issue. Unfortunately, we are currently using the Micronaut management metrics endpoint to fetch this value for the scaling work. The metrics endpoint only supports filtering on a specific tag value, not on metrics that lack the tag. This means that if we remove this value, we won't be able to tell the difference between the cumulative metric value across all tags and the metric value for workers that are not in an explicitly defined worker group. I can see three ways to resolve this issue, ordered by preference from top to bottom:

1. Add a new endpoint that lets us filter metric values with greater granularity. If we were to do this, I would suggest making it match the Kubernetes metric provider spec, since we could then expose the values as external metrics, which could be useful in the future.

2. Instead of defaulting to 'default' as the tag value for workers that are not in an explicitly defined worker group, we could use the string value 'null', which is significantly less likely to be picked by an end user.

3. We could restrict the names allowed for worker groups to stop people from picking default. This solution seems the least appealing, as it adds a restriction for the user and degrades the UX.

We use (default) in a bunch of other places, so it may be a good idea to use that.
Unfortunately, we didn't add validation to the worker group key, so we accept any character.

My point is that default has a high risk of being used and null is a bit awkward, so something else should be found that uses special characters, making it less likely to be chosen by a user.

(default), as when we log a worker group with no key, <default>, or __default__ all work for me, but not just default.
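To illustrate the sentinel idea discussed in this thread, here is a minimal sketch of how a tag value could be resolved for workers without a worker group. The class and method names are hypothetical and not taken from the PR; only the `__default__` value comes from the discussion.

```java
// Hypothetical helper, not part of the PR: picks the value for the worker-group tag.
final class WorkerGroupTags {
    // Sentinel for workers that belong to no worker group. The underscores
    // make a clash with a user-chosen key less likely than plain "default".
    static final String NO_WORKER_GROUP = "__default__";

    private WorkerGroupTags() {}

    static String tagValue(String workerGroupKey) {
        // A null or blank key is treated as "no worker group" (an assumption of this sketch).
        if (workerGroupKey == null || workerGroupKey.isBlank()) {
            return NO_WORKER_GROUP;
        }
        return workerGroupKey;
    }
}
```

With this shape, a user-created group literally named default keeps its own tag value, while ungrouped workers are reported under the sentinel. Since worker group keys are not validated, a clash is still possible in principle, only less likely.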

loicmathieu · 2026-02-16T14:58:08Z

core/src/main/java/io/kestra/core/queues/QueueLagPoller.java

+        .expireAfterWrite(Duration.ofSeconds(30))
+        .build();
+
+    private final Set<String> availableWorkerGroups = new HashSet<>();


This should be a concurrent collection

But anyway, I'm not a fan of maintaining the list of worker groups here; we already have that inside the WorkerGroupExecutor

I will remove this set and always fetch worker groups from the MetaStore; they are cached there anyway

jdbc-mysql/src/main/java/io/kestra/runner/mysql/MysqlQueue.java

loicmathieu · 2026-02-16T15:01:40Z

jdbc/src/main/java/io/kestra/jdbc/runner/JdbcQueue.java


    private final ExecutorService poolExecutor;
    private final ExecutorService asyncPoolExecutor;
+    private final WorkerGroupExecutorInterface workerGroupExecutor;


Seems not used

loicmathieu · 2026-02-16T15:01:58Z

scheduler/src/test/java/io/kestra/scheduler/SchedulerConditionTest.java


            scheduler.run();
-            assertTrue(queueCount.await(15, TimeUnit.SECONDS));
+            assertTrue(queueCount.await(30, TimeUnit.SECONDS));


Not a good idea, it's already too long

loicmathieu · 2026-02-16T15:06:55Z

core/src/main/java/io/kestra/core/queues/QueueLagPoller.java

+    }
+
+
+    @Scheduled(fixedDelay = "30000s", initialDelay = "30s")


You refresh every 8h?
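The arithmetic behind the question above can be checked with `java.time.Duration`: a delay of 30000 seconds is a little over eight hours. The helper below is an illustrative sketch with a hypothetical name, not code from the PR.

```java
import java.time.Duration;

// Illustrative helper (hypothetical name): renders a delay as hours and minutes.
final class FixedDelayCheck {
    private FixedDelayCheck() {}

    static String describe(Duration delay) {
        // toHours() truncates to whole hours; toMinutesPart() is the remainder in minutes.
        return delay.toHours() + "h " + delay.toMinutesPart() + "m";
    }
}
```

`FixedDelayCheck.describe(Duration.ofSeconds(30_000))` yields `8h 20m`, whereas `Duration.ofMinutes(5)` yields `0h 5m`.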

loicmathieu · 2026-02-16T15:09:35Z

webserver/src/main/java/io/kestra/webserver/services/SharedServiceInstanceMetricService.java

+                    if (metric.tags().stream().map(Metric.Tag::key).noneMatch(key -> key.equals("instance_id"))) {
+                        tags.add(MetricRegistry.SERVICE_ID);
+                        tags.add(serviceInstance.uid());
+                        metricTags.add(new Metric.Tag(MetricRegistry.SERVICE_ID, serviceInstance.uid()));


Not a good idea, this would lead to metrics explosion

loicmathieu · 2026-02-16T15:10:27Z

webserver/src/main/java/io/kestra/webserver/services/SharedServiceInstanceMetricService.java

+                        metricKey, k -> new AtomicReference<>()
+                    ).set(metric.value());
+
+                    String filteredName = metric.name().replaceAll("^kestra\\.", "");


Why are you doing that?

@loicmathieu In the service instance table, metrics are stored in the following format, with the kestra prefix:

kestra.metric.name

This means that if we push the above value directly into the web server's metric registry, the resulting name will have two kestra prefixes, e.g.

kestra.kestra.metric.name

This is configurable; you must get the configured prefix from kestra.metrics.prefix and do a substring
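A minimal sketch of the substring approach suggested above, assuming the configured prefix has already been read from `kestra.metrics.prefix` and is passed in as a plain string. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of the PR: strips a configurable prefix from a metric name.
final class MetricNames {
    private MetricNames() {}

    static String stripPrefix(String metricName, String configuredPrefix) {
        if (configuredPrefix == null || configuredPrefix.isEmpty()) {
            return metricName;
        }
        // Match on "prefix." so that a name such as "kestrafoo.x" is left untouched.
        String withDot = configuredPrefix.endsWith(".") ? configuredPrefix : configuredPrefix + ".";
        return metricName.startsWith(withDot)
            ? metricName.substring(withDot.length())
            : metricName;
    }
}
```

Compared with a hard-coded `replaceAll("^kestra\\.", "")`, this keeps working when the prefix is changed in configuration and avoids treating the prefix as a regular expression.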

loicmathieu · 2026-02-16T15:11:09Z

webserver/src/main/java/io/kestra/webserver/services/SharedServiceInstanceMetricService.java

+
+        toRemove.forEach(metricKey -> {
+            log.debug("Removing metric {} from shared metrics, as the associated service instance is no longer active", metricKey);
+            metricRegistry.removeMeter(sharedMetricsGauges.remove(metricKey));


This is not a good idea, metrics should never be removed

Thanks @loicmathieu. What do you think the ideal behaviour should be here when no service instance is pushing a metric, for example when all workers are scaled to zero? Should the metric value reset to zero?

Yes, setting the gauge to 0 will work
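The agreed behaviour (keep the gauge registered and reset its value to 0) could look roughly like the sketch below. It uses plain string keys and omits Micrometer entirely, so the class and its methods are assumptions for illustration rather than the PR's implementation.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for the values backing shared gauges.
final class SharedGaugeValues {
    private final Map<String, AtomicReference<Number>> values = new ConcurrentHashMap<>();

    // Stores the latest value reported for a metric key.
    void record(String metricKey, Number value) {
        values.computeIfAbsent(metricKey, k -> new AtomicReference<Number>(0)).set(value);
    }

    // Resets (rather than removes) every value whose key is no longer reported.
    void resetInactive(Set<String> activeKeys) {
        values.forEach((key, ref) -> {
            if (!activeKeys.contains(key)) {
                ref.set(0);
            }
        });
    }

    // Returns the current value, or 0 for a key that was never recorded.
    Number current(String metricKey) {
        AtomicReference<Number> ref = values.get(metricKey);
        return ref == null ? 0 : ref.get();
    }
}
```

A gauge that reads from the `AtomicReference` would then report 0 once all workers are scaled to zero, instead of disappearing from the registry.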

loicmathieu · 2026-02-16T15:11:48Z

webserver/src/main/java/io/kestra/webserver/services/SharedServiceInstanceMetricService.java

+    private final Map<MetricKey, AtomicReference<Number>> sharedMetricsValues = new HashMap<>();
+
+    private final Map<MetricKey, io.micrometer.core.instrument.Gauge> sharedMetricsGauges = new HashMap<>();


They are set and used by different threads so you need to use concurrent collections
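As a sketch of the point above, the following stand-alone example writes to a `ConcurrentHashMap` from two threads. The class is hypothetical and uses string keys in place of the PR's `MetricKey` type.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical store, not part of the PR: safe to update from several threads.
final class ConcurrentMetricStore {
    private final Map<String, AtomicReference<Number>> values = new ConcurrentHashMap<>();

    void set(String key, Number value) {
        // computeIfAbsent is atomic on ConcurrentHashMap, so no entry is lost or created twice.
        values.computeIfAbsent(key, k -> new AtomicReference<Number>()).set(value);
    }

    int size() {
        return values.size();
    }

    // Writes perThread distinct keys from each of two threads and returns the final size.
    static int writeFromTwoThreads(int perThread) {
        ConcurrentMetricStore store = new ConcurrentMetricStore();
        Thread a = new Thread(() -> { for (int i = 0; i < perThread; i++) store.set("a-" + i, i); });
        Thread b = new Thread(() -> { for (int i = 0; i < perThread; i++) store.set("b-" + i, i); });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return store.size();
    }
}
```

With a plain `HashMap`, the same concurrent writes could lose entries or corrupt the table during a resize, which is the risk the review comment points at.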

steven meek added 4 commits January 27, 2026 11:24

feat(webserver): add endpoint to expose metrics for running workers.

d3d64f5

feature: add ability to query queue lag and expose service instance m…

57f1aa0

…etrics via the web server.

Merge branch 'develop' into feat/scale-workers-to-zero

df16c25

chore: revert metric config updated accidentally

d7e2a64

ammeek requested review from a team and fhussonnois February 9, 2026 11:24

github-project-automation bot added this to Pull Requests Feb 9, 2026

github-project-automation bot moved this to To review in Pull Requests Feb 9, 2026

Merge branch 'develop' into feat/scale-workers-to-zero

fe106ea

loicmathieu requested changes Feb 9, 2026

steven meek added 6 commits February 12, 2026 16:03

feature(core): refactor SharedServiceInstanceMetricService to push me…

ce1f494

…trics through the MetricRegistry/micrometer.

chore(core): refactor QueueLagPoller to use caffine cache.

8941b91

chore(core): ensure that QueueLagPoller ocacnially refreshes the aval…

01c5e9c

…ible metric names.

chore(core): ensure that a base metric value is added for all shared …

5caa454

…instance metrics.

Merge remote-tracking branch 'refs/remotes/origin/feat/scale-workers-…

3499806

…to-zero' into feat/scale-workers-to-zero

chore(core): remove k8 client.

0707cfd

github-code-quality bot found potential problems Feb 13, 2026

core/src/main/java/io/kestra/core/queues/QueueLagPoller.java (fixed)

steven meek and others added 11 commits February 13, 2026 12:17

chore(core): disable queue poller by default.

0d1e0f7

chore(core): remove redundant condition from query

31a4137

.

Merge branch 'develop' into feat/scale-workers-to-zero

0179d71

chore(core): fix flaky tests in SharedServiceInstanceMetricServiceTest

a8c9ec6

.

chore(core): annotate flaky tests in SharedServiceInstanceMetricServi…

0faf1d5

…ceTest. .

Merge remote-tracking branch 'origin/feat/scale-workers-to-zero' into…

cf45d93

… feat/scale-workers-to-zero

chore(core): fix flaky tests in SharedServiceInstanceMetricServiceTest.

25f8717

.

Merge branch 'develop' into feat/scale-workers-to-zero

5428f61

chore(core): add conditional check for null consumers.

c21712a

chore(core): make queue lag tests always use consumer groups to reduc…

748bc08

…e flakiness.

chore(core): move isBasicAuthInitialized to be a flaky test.

4539b71

ammeek and others added 5 commits February 13, 2026 16:58

Merge branch 'develop' into feat/scale-workers-to-zero

8565994

chore: add cluser service instance endpoint to openapi spec

8de7359

chore: annotate WorkerTest.killed with flaky test

b32e6bb

chore(core): chore: annotate TriggerControllerTest.disableByTriggers …

396bb91

…with flaky test.

chore(core): chore: increase timeout on SchedulerConditionTest.schedu…

0e207a9

…le to reduce flaky failures.

loicmathieu requested changes Feb 16, 2026

steven meek added 11 commits February 16, 2026 16:11

chore(core): remove unused property on JdbcQueue.

a18507c

chore(core): reduce timeout of SchedulerConditionTest.schedule and ma…

05c1feb

…rk as flaky.

chore(core): stop tracking service instance id on metrics.

c98b480

chore(core): reduce polling interval for QueueLagPoller.

e5cabe9

chore(core): update default worker group key from 'default' to '__def…

8b21666

…ault__'.

chore(core): remove worker tracking set and replace query if a metric…

39008d8

… already exists.

chore(core): remove shared instance metrics prefix based on metric co…

99e0b9d

…nfig.

chore(core): move to using ConcurrentHashMap in SharedServiceInstance…

21e253f

…MetricService where maps shared between threads.

chore(core): update QueueLagPoller polling interval to five minutes.

3d6b589

chore(core): match sharedMetricConfig with prefix value.

1f755d4

chore(core): update default sharedServiceInstanceMetrics config to in…

a6649a5

…clude correct queue lag metric name.
