@timebertt
Owner

What this PR does / why we need it:

This PR adds another scenario to the experiment tool: chaos.
The scenario runs for 15 minutes, creates about 4.5k Websites, and terminates a random shard every 5 minutes.

Which issue(s) this PR fixes:
Fixes n/a

Special notes for your reviewer:

With #647, I can run load test experiments on my shoot cluster reliably again.
The chaos scenario reveals that it's hard to meet the P99 SLOs over a 15-minute time frame when multiple shards are terminated. This will be even harder during a rolling update.

However, the results also show that the P99 SLI on a 1-minute time frame recovers quickly (~2m), now that the sharder performs drain/move operations concurrently (see #637).

P99 is not met:
Screenshot 2025-08-24 at 18 49 40

P95 is met:
Screenshot 2025-08-24 at 18 50 02
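
One way to read the gap between the P99 and P95 results: over a long window, a percentile is only affected once the share of slow samples exceeds its tail (1% for P99, 5% for P95). The sketch below shows this with nearest-rank percentiles on made-up latencies. The sample values, the 3% share of delayed samples, and the helpers `percentile` and `syntheticLatencies` are illustrative assumptions; they are neither measurements from this experiment nor the queries used for the SLIs.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of samples.
// It returns 0 for an empty slice.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Sort a copy so that the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// syntheticLatencies builds n samples, of which slow samples take slowSeconds
// and the remaining ones take fastSeconds. The data is purely hypothetical.
func syntheticLatencies(n, slow int, fastSeconds, slowSeconds float64) []float64 {
	samples := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		if i < slow {
			samples = append(samples, slowSeconds)
		} else {
			samples = append(samples, fastSeconds)
		}
	}
	return samples
}

func main() {
	// Hypothetical window: 1000 reconciliations, 30 of them (3%) delayed
	// by shard terminations.
	window := syntheticLatencies(1000, 30, 1, 20)
	fmt.Println("P95 (s):", percentile(window, 95))
	fmt.Println("P99 (s):", percentile(window, 99))
}
```

With these assumed numbers, 3% of the samples are slow, so P95 stays at the fast latency while P99 lands on the slow one. A shorter window that contains no delayed samples would report a fast P99 again, which matches the quick recovery of the 1-minute SLI described above.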

@timebertt added the enhancement label Aug 24, 2025
@timebertt enabled auto-merge (squash) August 24, 2025 16:57
@timebertt moved this from Backlog to In progress in kubernetes-controller-sharding Aug 24, 2025
@timebertt merged commit 7b9b47f into main Aug 24, 2025
4 checks passed
@timebertt deleted the chaos branch August 24, 2025 17:04
@github-project-automation (bot) moved this from In progress to Done in kubernetes-controller-sharding Aug 24, 2025
