88 commits
661676e
set virtual service to return non retryable error (500)
jeffreylimnardy Mar 11, 2026
6ba6f51
lower fault injection since now retries are not happening
jeffreylimnardy Mar 11, 2026
48af641
Merge branch 'main' into virtual-service-return-500
jeffreylimnardy Mar 12, 2026
a3f3526
increase fault injection again to 85
jeffreylimnardy Mar 12, 2026
3b3bc2c
Merge branch 'main' into virtual-service-return-500
jeffreylimnardy Mar 12, 2026
00d9f54
try debugging with tmate
jeffreylimnardy Mar 12, 2026
e4d871e
detach
jeffreylimnardy Mar 12, 2026
d233392
Merge branch 'main' into virtual-service-return-500
jeffreylimnardy Mar 12, 2026
3d35faa
lower fault injection again to make tests flaky
jeffreylimnardy Mar 12, 2026
95b43b0
Merge branch 'main' into virtual-service-return-500
jeffreylimnardy Mar 12, 2026
5079112
remove tmate
jeffreylimnardy Mar 12, 2026
8093303
debug self monitor on failure
jeffreylimnardy Mar 12, 2026
34ea29e
dont use canceled context
jeffreylimnardy Mar 12, 2026
2c9bc3f
move
jeffreylimnardy Mar 12, 2026
60f3576
Merge branch 'main' into virtual-service-return-500
k15r Mar 18, 2026
b55805a
improve tests for istio
k15r Mar 18, 2026
48174a8
in progress
k15r Mar 19, 2026
9432dd3
simplify the selfmonitor tests
k15r Mar 19, 2026
096f207
test
k15r Mar 19, 2026
8576297
hostpath cleanup
k15r Mar 19, 2026
2653a56
add cleanup
k15r Mar 19, 2026
6a5ebb4
ds cleanup
k15r Mar 19, 2026
f4e7a6b
another try
k15r Mar 19, 2026
b33622e
Merge branch 'main' into virtual-service-return-500
k15r Mar 19, 2026
0090ecc
move fluentbit cleanup to setuptest
k15r Mar 19, 2026
1d59311
create virtual services before generators and backends
k15r Mar 19, 2026
4870ada
increase timeout as 5min rate window can cause more than 10 mins for the
k15r Mar 19, 2026
4cebea2
implement code review comments
k15r Mar 20, 2026
3b8a121
Address review: buffer alert ops note, VS chain, drop unused delay fault
k15r Mar 20, 2026
26e5943
chore: extract fluent-bit buffer alert changes to separate PR
k15r Mar 20, 2026
6bb4138
Merge branch 'main' into virtual-service-return-500
k15r Mar 20, 2026
e7839b6
Address review: rename backendScaledToZero, add default panics, clari…
k15r Mar 20, 2026
3587481
Address review: add default panic to switch, clarify comments, fix cl…
k15r Mar 20, 2026
8b5e77d
feat: replace Istio fault injection with lightweight mock-backend
k15r Mar 20, 2026
31f5090
chore: remove accidentally committed mock-backend binary
k15r Mar 20, 2026
955cf36
use mock-backend
k15r Mar 23, 2026
34329dc
Merge remote-tracking branch 'upstream/main' into virtual-service-ret…
k15r Mar 23, 2026
562a9dc
add support for delays in mock-backend
k15r Mar 23, 2026
e47bad0
Address review: improve comments on fault injection semantics and sta…
k15r Mar 23, 2026
a05dbf3
Merge branch 'main' into virtual-service-return-500
k15r Mar 23, 2026
f557691
use the correct mock-backend
k15r Mar 23, 2026
7834f15
make generate
k15r Mar 23, 2026
dbc239f
Address review: replace busybox:1.36 with mirrored alpine, normalize …
k15r Mar 23, 2026
8825f04
add make targets
k15r Mar 23, 2026
ab4b4b1
Address review: add useIstio path for metric-agent, fix backend.go co…
k15r Mar 24, 2026
589bc81
fix: clarify Istio sourceLabels semantics in VirtualService builder
k15r Mar 24, 2026
1bfe8e1
Merge branch 'main' into virtual-service-return-500
k15r Mar 24, 2026
b94d959
feat(selfmonitor): implement runtime FaultEnabler pattern for self-mo…
k15r Mar 25, 2026
5ac9e96
test(selfmonitor): log when faults are enabled via VirtualService
k15r Mar 25, 2026
307df69
chore: use :dev tag for mock-backend image to prevent always-pull wit…
k15r Mar 25, 2026
64e85da
chore: use :main tag for mock-backend image to match other image conv…
k15r Mar 25, 2026
40a368b
test(selfmonitor): remove redundant healthy check in assertFlowDegrad…
k15r Mar 25, 2026
4419646
test(selfmonitor): drop FlowHealthy(true) from transition sequence — …
k15r Mar 25, 2026
e687a05
test(selfmonitor): rename flowHealthyThenDegraded to degradedReasons
k15r Mar 25, 2026
42b2e5d
Merge branch 'main' into virtual-service-return-500
k15r Mar 25, 2026
a6db751
Merge upstream/main into virtual-service-return-500
k15r Mar 26, 2026
43eb07f
chore: rename mock-backend to fault-backend in testkit and self-monit…
k15r Mar 26, 2026
5b44b44
increase rate
k15r Mar 26, 2026
0b6205c
skip healthy baseline for outage tests where alert fires from boot
k15r Mar 26, 2026
950f92b
make alertConditionDescription component-aware for Fluent Bit
k15r Mar 26, 2026
b9f966b
use gRPC status fault for metric-agent Istio VS to fix missing send_f…
k15r Mar 26, 2026
49ed72e
decrease the rate while querying the Selfmonitor tests
k15r Mar 26, 2026
581f574
Merge branch 'main' into virtual-service-return-500
k15r Mar 26, 2026
72b15c7
wait for FlowHealthy=True before enabling faults in backpressure tests
k15r Mar 26, 2026
bbac34f
add missing SomeTelemetryDataDropped descriptions to alertConditionDe…
k15r Mar 26, 2026
c4b97b4
wait for Prometheus rate metrics to be non-zero before enabling faults
k15r Mar 26, 2026
6039d89
log rate baseline query results in assertSelfMonitorRateNonZero
k15r Mar 26, 2026
f12c5e5
configure selfmonitor tests to use 1 gateway replica
k15r Mar 27, 2026
ff966b3
Merge branch 'main' into virtual-service-return-500
k15r Mar 27, 2026
ab30a79
fix lint in helpers.go
k15r Mar 27, 2026
c7cc184
log selfmonitor scrape targets in eventually debug output
k15r Mar 27, 2026
ec08629
improve rate baseline log messages to distinguish no-data from query …
k15r Mar 27, 2026
cf572f8
log selfmonitor targets in assertSelfMonitorRateNonZero eventually block
k15r Mar 27, 2026
a4cbf07
add separator line before rate baseline check output
k15r Mar 27, 2026
2d52176
increase rate baseline timeout to 5 minutes
k15r Mar 27, 2026
c601246
set kubernetes SD refresh_interval to 1 minute in selfmonitor scrape …
k15r Mar 27, 2026
2d85a50
revert refresh_interval: not a valid field in kubernetes_sd_configs
k15r Mar 27, 2026
b77beda
increase k3d server memory to 4g
k15r Mar 27, 2026
e3ffd8b
revert k3d server memory: no limit is the default (adding one would c…
k15r Mar 27, 2026
6262d46
fix selfmonitor target discovery: use endpointslice role for Promethe…
k15r Mar 27, 2026
d8391f5
log dropped targets in selfmonitor target output
k15r Mar 27, 2026
2409312
revert endpointslice back to endpoints role in selfmonitor scrape config
k15r Mar 27, 2026
336f61d
add workload status logging to selfmonitor debug output
k15r Mar 27, 2026
ecbf6d1
log k8s endpoints in selfmonitor rate baseline loop
k15r Mar 27, 2026
35ce474
bump selfmonitor resource limits
k15r Mar 27, 2026
0de0f57
fix selfmonitor scrape: use endpointslice role with correct service l…
k15r Mar 27, 2026
7a5317b
grant manager endpointslices RBAC to allow self-monitor role creation
k15r Mar 27, 2026
5238480
add endpointslices RBAC to self-monitor role
k15r Mar 27, 2026
5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -50,4 +50,9 @@ telemetry-manager-experimental.yaml
# The default Telemetry CR which is installed by the lifecycle-manager
telemetry-default-cr.yaml
dependencies/sample-app/vendor
dependencies/mock-backend/mock-backend
.vscode

# PR review loop (Claude skill) — local reviewer/implementer hand-off; never commit
.pr-review-loop/
review-loop.diff
34 changes: 34 additions & 0 deletions CLAUDE.md
@@ -6,6 +6,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Telemetry Manager is a Kubernetes operator (built with Kubebuilder) that manages telemetry pipelines for logs, traces, and metrics in Kyma clusters. It deploys and configures OpenTelemetry Collectors and Fluent Bit agents based on user-defined pipeline CRDs.

## Agent skills (Claude Code)

Project skills live under **`.claude/skills/`** — e.g. **`.claude/skills/e2e/SKILL.md`** (local E2E runs) and **`.claude/skills/pr-review-loop/SKILL.md`** (merge-base/GitHub PR context, **`.pr-review-loop/*.md`** hand-off, implementer commits, loop until approved).

## Common Commands

### Build and Development
@@ -116,6 +120,36 @@ make update-golden-files # Update golden files for config builder tests

The `.env` file contains default image versions and configuration. Key environment variables for the manager are defined in `main.go` (envConfig struct).

## PR Title Convention

PR titles must follow the [Conventional Commits](https://www.conventionalcommits.org/) format, enforced by the `amannn/action-semantic-pull-request` GitHub Action.

**Format:** `<type>: <subject>` or `<type>(<scope>): <subject>`

**Allowed types:** `deps`, `chore`, `docs`, `feat`, `fix`, `test`

**Rules:**
- The subject (text after the colon) must **not** start with an uppercase character
- Scopes are optional (`requireScope: false`)

**Examples:**
```
feat: add retry logic to log pipeline reconciler
fix: resolve race condition in metric exporter
docs: update architecture diagrams
chore: bump Go version to 1.23
test: add e2e tests for trace pipeline validation
deps: update controller-runtime to v0.19
feat(metrics): support histogram aggregation
```

**Label mapping:** The PR title prefix automatically sets a `kind/` label:
- `fix` → `kind/bug`
- `feat` → `kind/feature`
- `docs` → `kind/docs`
- `chore` → `kind/chore`
- `test` → `kind/test`
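
The title rules and label mapping above can be sketched as a small Go check. This is only an illustration under assumptions: `checkTitle`, `titlePattern`, and `kindLabels` are hypothetical names, and the regular expression approximates the convention as described here rather than reproducing the actual configuration of the `amannn/action-semantic-pull-request` action.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
	"unicode/utf8"
)

// titlePattern approximates "<type>: <subject>" and "<type>(<scope>): <subject>"
// for the allowed types listed above. It is an assumption, not the action's own regex.
var titlePattern = regexp.MustCompile(`^(deps|chore|docs|feat|fix|test)(\([^()]+\))?: (.+)$`)

// kindLabels mirrors the label mapping listed above; "deps" has no entry there.
var kindLabels = map[string]string{
	"fix":   "kind/bug",
	"feat":  "kind/feature",
	"docs":  "kind/docs",
	"chore": "kind/chore",
	"test":  "kind/test",
}

// checkTitle (hypothetical helper) reports whether a PR title follows the
// convention and returns the kind/ label derived from its type, if any.
func checkTitle(title string) (label string, ok bool) {
	m := titlePattern.FindStringSubmatch(title)
	if m == nil {
		return "", false
	}
	// The subject must not start with an uppercase character.
	first, _ := utf8.DecodeRuneInString(m[3])
	if unicode.IsUpper(first) {
		return "", false
	}
	return kindLabels[m[1]], true
}

func main() {
	titles := []string{
		"feat(metrics): support histogram aggregation",
		"fix: Resolve race condition in metric exporter",
		"deps: update controller-runtime to v0.19",
	}
	for _, t := range titles {
		label, ok := checkTitle(t)
		fmt.Printf("%q ok=%t label=%q\n", t, ok, label)
	}
}
```

A title with an unknown type, a missing colon, or an uppercase subject is rejected; a valid `deps` title is accepted but yields an empty label because the mapping above does not list one.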

## Documentation Guidelines
Review comment (Contributor): Should we establish the documentation guidelines simply as a Claude skill?


When adding, updating, or removing any documentation inside the `docs/` folder, you must always follow the guidelines in [docs/claude-docs.md](docs/CLAUDE.md).
1 change: 1 addition & 0 deletions dependencies/populateimages/main.go
@@ -24,6 +24,7 @@ const (
SelfMonitorImage = "{{ .ENV_SELFMONITOR_IMAGE }}"
SelfMonitorFIPSImage = "{{ .ENV_SELFMONITOR_FIPS_IMAGE }}"
FaultBackendImage = "{{ .ENV_FAULT_BACKEND_IMAGE }}"
AlpineImage = "{{ .ENV_ALPINE_IMAGE }}"
)
`,
}
8 changes: 8 additions & 0 deletions helm/templates/manager-rbac.yaml
@@ -284,6 +284,14 @@ rules:
- get
- list
- watch
- apiGroups:
- discovery.k8s.io
resources:
- endpointslices
verbs:
- get
- list
- watch

################################
# Policy rules for trace-gateway
11 changes: 8 additions & 3 deletions internal/resources/selfmonitor/resources.go
@@ -37,9 +37,9 @@ const (

var (
storageVolumeSize = resource.MustParse("1000Mi")
- cpuRequest = resource.MustParse("10m")
- memoryRequest = resource.MustParse("50Mi")
- memoryLimit = resource.MustParse("180Mi")
+ cpuRequest = resource.MustParse("50m")
+ memoryRequest = resource.MustParse("100Mi")
+ memoryLimit = resource.MustParse("300Mi")
)

type ApplierDeleter struct {
@@ -164,6 +164,11 @@ func (ad *ApplierDeleter) makeRole() *rbacv1.Role {
Resources: []string{"services", "endpoints", "pods"},
Verbs: []string{"get", "list", "watch"},
},
{
APIGroups: []string{"discovery.k8s.io"},
Resources: []string{"endpointslices"},
Verbs: []string{"get", "list", "watch"},
},
},
}
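
The added rule matters because RBAC rules are matched per API group: a rule that grants `endpoints` does not cover `endpointslices` in `discovery.k8s.io`. The following is a minimal sketch of that matching, not the Kubernetes authorizer. `policyRule` is a local stand-in for `rbacv1.PolicyRule`, `allows` is a hypothetical helper, and the core API group (`""`) for the existing rule is an assumption because the hunk does not show its `APIGroups` field.

```go
package main

import "fmt"

// policyRule is a local stand-in for rbacv1.PolicyRule (assumption: only these
// three fields are needed for the illustration).
type policyRule struct {
	apiGroups []string
	resources []string
	verbs     []string
}

// contains reports whether list holds want or the "*" wildcard.
func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want || item == "*" {
			return true
		}
	}
	return false
}

// allows (hypothetical helper) reports whether any rule grants verb on
// resource in the given API group.
func allows(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.apiGroups, group) && contains(r.resources, resource) && contains(r.verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	readOnly := []string{"get", "list", "watch"}

	// Assumed shape of the role before this change (core API group).
	before := []policyRule{
		{apiGroups: []string{""}, resources: []string{"services", "endpoints", "pods"}, verbs: readOnly},
	}
	// The role after this change adds the discovery.k8s.io rule from the hunk above.
	after := append(before, policyRule{
		apiGroups: []string{"discovery.k8s.io"},
		resources: []string{"endpointslices"},
		verbs:     readOnly,
	})

	fmt.Println(allows(before, "discovery.k8s.io", "endpointslices", "list"))
	fmt.Println(allows(after, "discovery.k8s.io", "endpointslices", "list"))
}
```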

16 changes: 12 additions & 4 deletions internal/resources/selfmonitor/testdata/self-monitor.yaml
@@ -90,7 +90,7 @@ spec:
- --log.format=json
env:
- name: GOMEMLIMIT
- value: "150994880"
+ value: "251658240"
livenessProbe:
failureThreshold: 5
httpGet:
@@ -113,10 +113,10 @@ spec:
timeoutSeconds: 3
resources:
limits:
- memory: 180Mi
+ memory: 300Mi
requests:
- cpu: 10m
- memory: 50Mi
+ cpu: 50m
+ memory: 100Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
@@ -262,6 +262,14 @@ rules:
- get
- list
- watch
- apiGroups:
- discovery.k8s.io
resources:
- endpointslices
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
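
The old and new `GOMEMLIMIT` values in this golden file are both consistent with 80% of the container memory limit, computed with integer division (`limit / 100 * 80`). That formula is inferred from the numbers only; the code that derives the value is not part of this diff, and `goMemLimit` is a hypothetical name.

```go
package main

import "fmt"

const mib = 1024 * 1024

// goMemLimit returns 80% of a memory limit given in bytes.
// Assumption: dividing first and then multiplying (integer arithmetic)
// reproduces the golden-file values; the real implementation is not shown here.
func goMemLimit(limitBytes int64) int64 {
	return limitBytes / 100 * 80
}

func main() {
	fmt.Println(goMemLimit(180 * mib)) // 150994880, the old value for the 180Mi limit
	fmt.Println(goMemLimit(300 * mib)) // 251658240, the new value for the 300Mi limit
}
```

The order of operations matters for the old value: multiplying before dividing would give 150994944 for 180Mi, which does not match the golden file.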
3 changes: 2 additions & 1 deletion internal/selfmonitor/config/config.go
@@ -58,7 +58,8 @@ type KubernetesDiscoveryConfig struct {
type Role string

const (
- RoleEndpoints Role = "endpoints"
+ RoleEndpoints     Role = "endpoints"
+ RoleEndpointSlice Role = "endpointslice"
)

type RelabelConfig struct {
6 changes: 3 additions & 3 deletions internal/selfmonitor/config/config_builder.go
@@ -61,7 +61,7 @@ func makeScrapeConfig(scrapeNamespace string) []ScrapeConfig {
Regex: "true",
},
{
- SourceLabels: []string{"__meta_kubernetes_endpoints_label_telemetry_kyma_project_io_self_monitor"},
+ SourceLabels: []string{"__meta_kubernetes_service_label_telemetry_kyma_project_io_self_monitor"},
Action: Keep,
Regex: "enabled",
},
@@ -89,7 +89,7 @@ func makeScrapeConfig(scrapeNamespace string) []ScrapeConfig {
TargetLabel: "service",
},
{
- SourceLabels: []string{"__meta_kubernetes_pod_node_name"},
+ SourceLabels: []string{"__meta_kubernetes_endpointslice_endpoint_node_name"},
Action: Replace,
TargetLabel: "node",
},
@@ -118,7 +118,7 @@ func makeScrapeConfig(scrapeNamespace string) []ScrapeConfig {
},
},
KubernetesDiscoveryConfigs: []KubernetesDiscoveryConfig{{
- Role: RoleEndpoints,
+ Role: RoleEndpointSlice,
Namespaces: Names{Name: []string{scrapeNamespace}},
}},
},
6 changes: 3 additions & 3 deletions internal/selfmonitor/config/testdata/config.yaml
@@ -20,7 +20,7 @@ scrape_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
regex: "true"
action: keep
- - source_labels: [__meta_kubernetes_endpoints_label_telemetry_kyma_project_io_self_monitor]
+ - source_labels: [__meta_kubernetes_service_label_telemetry_kyma_project_io_self_monitor]
regex: enabled
action: keep
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
@@ -38,7 +38,7 @@ scrape_configs:
- source_labels: [__meta_kubernetes_service_name]
target_label: service
action: replace
- - source_labels: [__meta_kubernetes_pod_node_name]
+ - source_labels: [__meta_kubernetes_endpointslice_endpoint_node_name]
target_label: node
action: replace
metric_relabel_configs:
@@ -54,7 +54,7 @@ scrape_configs:
target_label: pipeline_name
action: replace
kubernetes_sd_configs:
- - role: endpoints
+ - role: endpointslice
namespaces:
names:
- kyma-system
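
The two `keep` rules in this scrape config can be illustrated with a simplified Go model of relabeling. It is a sketch under assumptions, not Prometheus code: `keepRule` and `keepTarget` are hypothetical names, and the example label set is made up. It relies on two documented relabeling behaviours: source label values are joined with `;` by default, and the regex is anchored at both ends. A label that the discovery role does not set reads as an empty string, so a `keep` rule on it drops the target.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keepRule is a simplified stand-in for a relabel rule with action "keep".
type keepRule struct {
	sourceLabels []string
	regex        string
}

// matches reports whether the joined source label values fully match the regex.
// Missing labels contribute an empty string.
func (r keepRule) matches(labels map[string]string) bool {
	values := make([]string, 0, len(r.sourceLabels))
	for _, name := range r.sourceLabels {
		values = append(values, labels[name])
	}
	re := regexp.MustCompile("^(?:" + r.regex + ")$")
	return re.MatchString(strings.Join(values, ";"))
}

// keepTarget returns true only if every keep rule matches the target's labels.
func keepTarget(rules []keepRule, labels map[string]string) bool {
	for _, r := range rules {
		if !r.matches(labels) {
			return false
		}
	}
	return true
}

func main() {
	scrape := keepRule{[]string{"__meta_kubernetes_service_annotation_prometheus_io_scrape"}, "true"}
	oldRules := []keepRule{scrape,
		{[]string{"__meta_kubernetes_endpoints_label_telemetry_kyma_project_io_self_monitor"}, "enabled"}}
	newRules := []keepRule{scrape,
		{[]string{"__meta_kubernetes_service_label_telemetry_kyma_project_io_self_monitor"}, "enabled"}}

	// Hypothetical target: only service-level metadata is present, and the
	// endpoints-level label is absent.
	target := map[string]string{
		"__meta_kubernetes_service_annotation_prometheus_io_scrape":              "true",
		"__meta_kubernetes_service_label_telemetry_kyma_project_io_self_monitor": "enabled",
	}

	fmt.Println("old rules keep target:", keepTarget(oldRules, target))
	fmt.Println("new rules keep target:", keepTarget(newRules, target))
}
```

With this label set, the rule keyed on the endpoints-level label drops the target while the rule keyed on the service-level label keeps it, which matches the direction of the change in this hunk.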