test: refactor selfmonitor tests with shared helpers and HTTP 400/429 fault injection #3132

Open
jeffreylimnardy wants to merge 88 commits into kyma-project:main from jeffreylimnardy:virtual-service-return-500

Conversation

Contributor

@jeffreylimnardy jeffreylimnardy commented Mar 12, 2026

Description

Changes proposed in this pull request (what was done and why):

  • Refactor selfmonitor E2E tests to extract shared helper functions (flowHealthyThenDegraded, assertFlowDegraded, buildPipeline, defaultGenerator, assertComponentReady, assertPipelineHealthy) that eliminate repetition across outage, backpressure, and healthy test scenarios
  • Replace Istio VirtualService fault injection with the mock-backend (merged in #3283, "test: add mock-backend for lightweight fault injection in tests") for most components. The mock-backend supports configurable HTTP status codes, delays, and connection-close behavior, which this PR uses to simulate NoLogsDelivered (TCP close on port 9880) and BufferFillingUp (HTTP 429 retries + delayed 200 responses) without requiring Istio
  • Metric-agent tests still use Istio VirtualService with source-label targeting (app.kubernetes.io/name: telemetry-metric-agent): the metric-agent and gateway share the same backend, so faulting it at the backend level would affect both legs; a source-label-scoped VirtualService selectively blocks only agent→gateway traffic while leaving the gateway's own exports healthy
  • Replace HTTP 500 (which triggered retries) with HTTP 400 as the universal non-retryable fault code, and HTTP 429 as the universal retryable fault code, making test behavior more deterministic across OTel Collector and Fluent Bit
  • Split fluent-bit backpressure into two test cases: fluent-bit-buffer-filling-up (429 retryable + delayed 200 to hold the queue full) and fluent-bit-data-dropped (400 non-retryable drop)
  • Split fluent-bit outage into fluent-bit-no-logs-delivered (TCP close, simulating ECONNREFUSED-like behavior) and fluent-bit-all-data-dropped (100% HTTP 400)
  • Add WithFluentBitHostPathCleanup to clear /var/telemetry-fluent-bit on nodes after Fluent Bit tests, preventing buffer state bleed between test runs
  • Centralize FlowHealthConditionTransitionTimeout (10 min) in periodic.go to replace duplicated inline 10*time.Minute constants

Changes refer to particular issues, PRs or documents:

Traceability

  • The PR is linked to a GitHub issue.
  • The follow-up issues (if any) are linked in the Related Issues section.
  • If the change is user-facing, the documentation has been adjusted.
  • If a CRD is changed, the corresponding Busola ConfigMap has been adjusted.
  • The feature is unit-tested. — N/A (test-only change)
  • The feature is e2e-tested. — N/A (this PR IS the e2e test change)

@jeffreylimnardy jeffreylimnardy requested a review from a team as a code owner March 12, 2026 08:29
@github-actions github-actions bot added this to the 1.60.0 milestone Mar 12, 2026
@jeffreylimnardy jeffreylimnardy added the area/tests Writing/adding/Refactoring tests or checks label Mar 12, 2026
@jeffreylimnardy jeffreylimnardy changed the title from "test: virtual service return 500" to "test: virtual service return non retryable errors" Mar 12, 2026
@skhalash skhalash enabled auto-merge (squash) March 12, 2026 08:32
skhalash previously approved these changes Mar 12, 2026
@TeodorSAP TeodorSAP modified the milestones: 1.60.0, 1.61.0 Mar 17, 2026
k15r added 30 commits March 26, 2026 19:46
Add WithStartFaulted() to faultbackend so the backend starts rejecting
requests immediately, and skipHealthyBaseline flag to outage tests to
skip the assertPipelineHealthy wait. This avoids the 5-minute rate()
window decay that would otherwise delay alert firing.

fluent-bit-all-data-dropped still requires a healthy baseline since the
dropped_records metric series does not exist until the exporter has been
active.
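
The commit message above describes two knobs: an option that starts the backend in a faulted state and a per-test flag that skips the healthy-baseline wait. A rough sketch of how these might fit together follows. Only the names `WithStartFaulted`, `skipHealthyBaseline`, `assertPipelineHealthy`, and `assertFlowDegraded` come from this PR; the types, signatures, and the `plannedSteps` helper are assumptions made for illustration.

```go
package main

// faultBackend is a hypothetical stand-in for the backend configured by the
// faultbackend package; its real fields are not shown in this PR page.
type faultBackend struct {
	faulted bool // true when the backend rejects requests from the start
}

// Option follows the functional-options pattern (assumed signature).
type Option func(*faultBackend)

// WithStartFaulted makes the backend reject requests immediately.
func WithStartFaulted() Option {
	return func(b *faultBackend) { b.faulted = true }
}

// newFaultBackend applies the given options to a backend that is healthy by default.
func newFaultBackend(opts ...Option) *faultBackend {
	b := &faultBackend{}
	for _, opt := range opts {
		opt(b)
	}
	return b
}

// outageCase is a hypothetical test-table entry.
type outageCase struct {
	name                string
	skipHealthyBaseline bool
}

// plannedSteps lists the assertions a case would run, in order.
// Cases that need a healthy baseline wait for it before expecting degradation.
func plannedSteps(c outageCase) []string {
	var steps []string
	if !c.skipHealthyBaseline {
		steps = append(steps, "assertPipelineHealthy")
	}
	return append(steps, "assertFlowDegraded")
}
```

Under this sketch, a case such as fluent-bit-all-data-dropped would leave `skipHealthyBaseline` false, matching the note that it still requires a healthy baseline.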

Labels

area/tests Writing/adding/Refactoring tests or checks kind/test

7 participants