U-6624: Kubernetes service discovery by curusarn · Pull Request #7 · BetterStackHQ/collector

curusarn · 2025-07-25T20:52:40Z

Run Kubernetes service discovery only if the kubernetes_discovery_* wildcard is used in the Vector config.
Validate generated Kubernetes discovery Vector configs with a minimal main config -> reject the generated config if validation doesn't pass, to prevent Kubernetes discovery from breaking the entire config.
Validate the upstream Vector config after download, using a minimal Kubernetes discovery config -> keep the downloaded version if validation fails, update the symlink to the latest valid version if validation passes (any validation errors are sent to Better Stack).
Validate the final upstream Vector config together with the actual generated configs -> promote the valid config and SIGHUP Vector; if validation fails, send the errors to Better Stack.
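The "update symlink only if validation passes" step above can be sketched as follows. This is a minimal illustration, not the collector's actual code: `promote_if_valid`, the `validate` callback, and the directory layout are all hypothetical names chosen for the example.

```python
import os
from pathlib import Path
from typing import Callable


def promote_if_valid(version_dir: Path, current_link: Path,
                     validate: Callable[[Path], bool]) -> bool:
    """Point current_link at version_dir only if validate(version_dir) passes.

    `validate` is a hypothetical callback; in practice it would run the
    real config validation for the given version directory.
    """
    if not validate(version_dir):
        return False  # leave the existing symlink target untouched

    # Create the new symlink under a temporary name, then rename it over
    # the old one. The rename is atomic on POSIX, so a reader never sees
    # a missing or half-updated link.
    tmp_link = current_link.with_name(current_link.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(version_dir, tmp_link)
    os.replace(tmp_link, current_link)
    return True
```

Because a rejected version never touches the link, whatever was last promoted stays in effect until a newer version validates.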

- Extract workload type and name from pod ownerReferences
- Add as label (e.g. workload: "daemonset/cilium")
- Works for both service-based and direct pod discovery
- Helps identify the source workload type in metrics

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add get_workload_from_pod helper method
- For ReplicaSets, follow chain to find parent Deployment
- Shows "deployment/name" instead of "replicaset/name" in workload label
- Other workload types (DaemonSet, StatefulSet, Job) remain unchanged

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
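The ownership-chain resolution described in the two commits above can be sketched roughly as below. This is an illustrative Python version working on plain dictionaries shaped like Kubernetes API objects; `workload_label` and the `get_replicaset` lookup callback are hypothetical names, not the PR's `get_workload_from_pod` implementation.

```python
from typing import Callable, Optional


def workload_label(pod: dict,
                   get_replicaset: Callable[[str, str], Optional[dict]]) -> Optional[str]:
    """Return a label such as "daemonset/cilium", or None for an unowned pod.

    `get_replicaset(namespace, name)` is a hypothetical lookup that returns
    the ReplicaSet object as a dict, or None if it cannot be fetched.
    """
    metadata = pod.get("metadata", {})
    owners = metadata.get("ownerReferences") or []
    if not owners:
        return None

    owner = owners[0]
    kind, name = owner["kind"], owner["name"]

    if kind == "ReplicaSet":
        # Follow the chain one level up: a ReplicaSet is usually owned
        # by a Deployment, which is the name users recognise.
        namespace = metadata.get("namespace", "default")
        replicaset = get_replicaset(namespace, name) or {}
        rs_owners = replicaset.get("metadata", {}).get("ownerReferences") or []
        for rs_owner in rs_owners:
            if rs_owner.get("kind") == "Deployment":
                return f"deployment/{rs_owner['name']}"

    # DaemonSet, StatefulSet, Job, or a ReplicaSet without a Deployment parent.
    return f"{kind.lower()}/{name}"
```

Passing the lookup in as a callback keeps the function testable without a cluster; a standalone ReplicaSet simply falls through to "replicaset/name".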

- Add VectorConfigTest for all vector configuration methods
- Add KubernetesDiscoveryTest for discovery functionality
- Add BetterStackClientPingTest for ping workflow
- Test workload ownership chain resolution
- Fix broken tests after refactoring
- Remove obsolete test methods

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- KubernetesDiscoveryIntegrationTest: Complex workload types, node filtering, deduplication
- VectorConfigEdgeCasesTest: Malicious commands, symlinks, race conditions
- BetterStackClientErrorHandlingTest: Network errors, partial failures, security
- UtilsEdgeCasesTest: Invalid inputs, binary content, unicode handling

Tests cover error scenarios, security concerns, and concurrent operations.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

curusarn · 2025-07-28T10:15:49Z

For context:

We have an upstream Vector config coming from Better Stack.
This PR adds a way to additionally give Vector extra configs for services (pods) discovered in the cluster, based on standard Prometheus labels.
We'll reference the sources via the kubernetes_discovery_* glob in the main Vector config.

Problems that came up during development and that this PR addresses:
Vector errors on configs that contain globs that don't match anything, e.g. no_sources_with_this_prefix_* will cause Vector to fail.

This is problematic because we can no longer download the upstream config and validate it on its own. -> We validate it with a minimal dummy service discovery config.
Another problem arises when no services are discovered. -> Again, we solve this with minimal empty service discovery configs.

Under no circumstances should Vector crash because of a broken upstream config, broken dynamic discovery configs, or some odd combination of both.
We triple-validate the Vector config:

The upstream config is only used if it passes validation with a minimal service discovery config.
Dynamic configs are deleted if they don't pass validation with a minimal base Vector config.
The two separately validated configs are then validated together and only used if they pass validation.
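The three validation passes above could be orchestrated along these lines. This is a simplified sketch under stated assumptions: `validate` is a hypothetical callback that takes a list of config paths and returns an error string or None, and a rejected dynamic config is swapped for the minimal one before the combined check, which is one possible reading of the description rather than the PR's confirmed behaviour.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical validator: returns error text, or None when the configs are valid.
Validator = Callable[[List[str]], Optional[str]]


@dataclass
class Outcome:
    use_upstream: bool
    use_discovery: bool
    promote: bool
    errors_for_ping: List[str] = field(default_factory=list)


def triple_validate(upstream: str, discovery: str,
                    minimal_upstream: str, minimal_discovery: str,
                    validate: Validator) -> Outcome:
    errors: List[str] = []

    # Pass 1: upstream config against a minimal discovery config.
    upstream_err = validate([upstream, minimal_discovery])
    if upstream_err:
        errors.append(upstream_err)

    # Pass 2: generated discovery config against a minimal base config.
    # Failures here are not reported to the user, so they are not collected.
    discovery_err = validate([minimal_upstream, discovery])

    use_upstream = upstream_err is None
    use_discovery = discovery_err is None

    # Pass 3: the combination that would actually be promoted.
    promote = False
    if use_upstream:
        combined = [upstream, discovery if use_discovery else minimal_discovery]
        combined_err = validate(combined)
        if combined_err:
            errors.append(combined_err)
        else:
            promote = True

    return Outcome(use_upstream, use_discovery, promote, errors)
```

Only upstream and combined failures end up in `errors_for_ping`, so the caller decides separately how to log a rejected dynamic config.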

Any errors in the upstream config or in the final combination of configs are sent to the Better Stack UI via "ping".
Errors in dynamic configs are logged but not surfaced to the user.

Kubernetes discovery:

Aims to stay close to what Prometheus does.
Only scrapes metrics from pods on the same node.
Uses the local node network.
Tags metrics with the Service object if the pod has one.
Tags pods with the workload name - context: https://linear.app/betterstack/issue/T-8738/feature-collector-track-workloadservice-name-and-replicacontainer-name
Kubernetes discovery won't run if the upstream Vector config doesn't reference kubernetes_discovery_*, so that we don't burn CPU.
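Two of the points above, gating discovery on the kubernetes_discovery_* reference and restricting it to pods on the local node, can be sketched as below. The function names are hypothetical, and pods are plain dictionaries shaped like Kubernetes API objects (`spec.nodeName` is the standard field holding the node a pod is scheduled on).

```python
import re
from typing import Iterable, List

# Matches the literal glob, e.g. inputs: ["kubernetes_discovery_*"].
# The word boundary avoids matching names like my_kubernetes_discovery_*.
DISCOVERY_GLOB = re.compile(r"\bkubernetes_discovery_\*")


def references_discovery(config_text: str) -> bool:
    """True if the config text mentions the kubernetes_discovery_* glob."""
    return DISCOVERY_GLOB.search(config_text) is not None


def pods_on_node(pods: Iterable[dict], node_name: str) -> List[dict]:
    """Keep only pods scheduled on the given node."""
    return [p for p in pods if p.get("spec", {}).get("nodeName") == node_name]


def discovery_targets(config_text: str, pods: Iterable[dict],
                      node_name: str) -> List[dict]:
    # Skip all discovery work when the upstream config never uses it.
    if not references_discovery(config_text):
        return []
    return pods_on_node(pods, node_name)
```

Checking the config text first means the pod list is never filtered, or even needed, when the glob is absent.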

paweljw

Thanks! Looks good to me 🙏

curusarn and others added 10 commits July 25, 2025 20:23

Kubernetes discovery with defensive validation draft

7a49573

remove claude settings

70bffa9

remove claude notes

73ad4c6

remove claude helmvalues

ef54e15

Remove Claude settings from git and add to .gitignore

f0bd2e8

update tests

c4141bd

paweljw self-requested a review July 28, 2025 08:42

paweljw approved these changes Jul 28, 2025

use both generated and manual vector config, better k8s labels, bette…

e216deb

…r dummy metric

curusarn merged commit 30f6f4f into main Jul 29, 2025
2 checks passed

curusarn deleted the sl/kubernetes_service_discovery branch July 29, 2025 21:34

curusarn merged 11 commits into main from sl/kubernetes_service_discovery

2 participants
