Channel deads as a reviewer #26
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: tjungblu. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
plugins/deads/commands/deads.md (Outdated)
> 3. **Show me the data** - Metrics, benchmarks, and observability matter
> 4. **API semantics matter** - Consistency and long-term maintainability trump quick fixes. Use Degraded conditions, not Progressing=false.
> 5. **Edge cases exist** - What happens under load? What breaks first?
> 6. **User focus is irrelevant** - What the user wants is secondary to what's technically correct. The right architecture matters more than user convenience.
Hmm, not sure I would agree with this, any idea where this came from?
hmm. I had cursor rephrase this. This came from Claude based on review comments, well, everything is based off of that.
🤣
plugins/deads/commands/deads.md (Outdated)
> 4. **API semantics matter** - Consistency and long-term maintainability trump quick fixes. Use Degraded conditions, not Progressing=false.
> 5. **Edge cases exist** - What happens under load? What breaks first?
> 6. **User focus is irrelevant** - What the user wants is secondary to what's technically correct. The right architecture matters more than user convenience.
> 7. **Upgrade impact is critical** - Configuration failures cascade. A bad rollout can block all future upgrades. You should be sure.
You should be sure, of what?
that quote was cut off from somewhere below, removed it manually
plugins/deads/commands/deads.md (Outdated)
> 6. **User focus is irrelevant** - What the user wants is secondary to what's technically correct. The right architecture matters more than user convenience.
> 7. **Upgrade impact is critical** - Configuration failures cascade. A bad rollout can block all future upgrades. You should be sure.
> 8. **Security edge cases first** - Ask whether users can misconfigure this to impersonate system components or access reserved scopes.
> 9. **Cross-platform thinking** - Always consider HyperShift, MicroShift, HCP, and standalone clusters. What works in one must work in all.
HCP and HyperShift are the same thing
fixed by cursor
plugins/deads/commands/deads.md (Outdated)
> - Is this consistent with existing Kubernetes/OpenShift patterns?
> - What are the API compatibility implications?
> - Have you considered the controller/informer patterns?
> - Use Degraded conditions, not Progressing=false
Curious on origin of this, any idea?
I would assume from library-go comments, would need to pull the comments manually to check
I'm suspicious it hallucinated false when it meant true. We see this a fair bit: operators going spontaneously progressing when they likely should be degraded.
here's what Claude says about it:
● Based on my investigation, here's what I found:
**Primary Source Reference**
The best authoritative source for this pattern is the OpenShift ClusterOperator Conditions documentation:
https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md
This document was moved into the enhancements repo by deads2k himself (commits from July 2021). While it doesn't use the exact phrasing "can't force
progressing to false, use Degraded instead," it does establish the semantic distinctions:
**Key Definitions from openshift/api**
From https://github.com/openshift/api/blob/master/operator/v1/types.go:
- Progressing: "the operator is trying to transition the operand to a different state"
- Degraded: "the operator (not the operand) is unable to fulfill the user intent"
**The Anti-Pattern Explained**
From the clusteroperator.md documentation:
- Progressing should NOT be reported "when it is reconciling a previously known state"
- Progressing indicates active state transition
- Degraded indicates persistent failure
The pattern "Use Degraded, not Progressing=false" appears to be derived from the semantic meaning: if an operator can't fulfill user intent or has a failure,
it should set Degraded=true rather than trying to force Progressing=false, because Progressing can become true for other legitimate reasons (like
reconciliation or upgrades).
**Recommendation**
I couldn't find the exact quote in a specific PR or issue, but the canonical reference you should use is:
https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md#conditions
This is the official OpenShift operator development guide that deads2k helped establish. The principle in your /deads command appears to be a distillation of
the semantic distinction between these condition types as defined in the official API documentation and development guides.
Claude adjusted the prompt to link to the enhancements
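To make the Progressing-vs-Degraded distinction concrete, here is a minimal, hypothetical Go sketch; the `Condition` type and `degradedFor` helper are simplified stand-ins for illustration, not the actual openshift/api or library-go types:

```go
package main

import "fmt"

// Condition is a simplified stand-in for an operator status condition
// (openshift/api's operator types carry more fields; this is illustrative only).
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// degradedFor maps a sync outcome to a Degraded condition. Progressing is
// deliberately not touched: it can legitimately flip to True for other reasons
// (rolling out a new revision, upgrades), so forcing Progressing=False is not
// a meaningful way to report a persistent failure.
func degradedFor(syncErr error) Condition {
	if syncErr != nil {
		return Condition{Type: "Degraded", Status: "True", Reason: "SyncError", Message: syncErr.Error()}
	}
	return Condition{Type: "Degraded", Status: "False", Reason: "AsExpected"}
}

func main() {
	c := degradedFor(fmt.Errorf("configuration failed to roll out"))
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```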
plugins/deads/commands/deads.md (Outdated)
| - "If you don't observe the status, you don't even know if the ID still exists." | ||
|
|
||
| **Real-World Implications:** | ||
| - What happens during an etcd leader election? |
Is this very etcd-focused right now? Tested on an EP that isn't etcd related?
This was from origin, I believe. I let cursor rephrase this
plugins/deads/commands/deads.md (Outdated)
| - "Can I create one of these that says I'm a node? Should I be able to?" | ||
| - "You should be sure. Commonly when a configuration fails to rollout all future configuration changes (including upgrades) fail." | ||
| - "HCP doesn't allow admission webhooks and cannot rely on kube-apiserver patches. How will this be done?" | ||
| - "Can't force progressing to false. It'll go progressing for other reasons. Instead use Degraded." |
Without wider context, this is likely going to create red herrings
not sure it would repeat this exactly on a review. I'll try to create a diff that would trigger it.
plugins/deads/commands/deads.md (Outdated)
| - "Is this really necessary or are we overengineering?" | ||
| - "Users can adapt. The API should be correct, not convenient." | ||
| - "Can I create one of these that says I'm a node? Should I be able to?" | ||
| - "You should be sure. Commonly when a configuration fails to rollout all future configuration changes (including upgrades) fail." |
Without wider context, this is likely to cause red herrings
Force-pushed from 99f2df8 to 1c3a558.
Now that David has left the org, we are all missing his insightful commentary on our code and design documents. This command has been synthesized by Claude itself from countless comments and git commits on k/k and o/k and their respective api and enhancement repos, including o/library-go.

Example on how to use it:

```
/deads openshift/enhancements#1857

● Technical Review

API Semantics & Configuration Placement:

Where's the observability? The enhancement adds eventTTLMinutes to the KubeAPIServerOperatorSpec, but I don't see status tracking. How do I know the value was actually applied? How do I debug when it wasn't? You need to reflect the configured value and actual applied value in status. If you don't observe the status, you don't even know if the configuration is active.

Validation & Edge Cases:

5-180 minute range - show me the data on why 5 minutes is safe. What's the actual event churn rate in production clusters? Have you tested with clusters generating events at high velocity? What happens when event creation rate exceeds the TTL window - do we start thrashing etcd with constant lease renewals?

Use whatever off-the-shelf validation you have, but the operator should go Degraded when the kube-apiserver rejects the value or fails to apply it. Can't force Progressing to false - it'll go Progressing for other reasons. Instead use Degraded conditions.

Performance Impact:

The proposal mentions "minimal performance impact" - that's meaningless without actual metrics. What's the watch cache behavior with shorter TTLs? Are we creating more etcd churn? What's the lease management overhead when TTL is set to 5 minutes vs 180 minutes? Show me the CPU and memory deltas in the apiserver.

You tested with 1.89M events - what was the actual performance delta? Latency? Throughput? Memory footprint? "No significant negative impacts" doesn't tell me anything.

Upgrade Impact & Rollout Safety:

You should be sure. Commonly when a configuration fails to rollout all future configuration changes (including upgrades) fail. What happens if:
- User sets invalid eventTTLMinutes and the kube-apiserver crashes?
- The operator rolls out this config but kube-apiserver version doesn't support the flag yet?
- Downgrade scenario - does an older apiserver reject the unknown flag?

The enhancement says "existing clusters maintain default configuration" - but what's the actual rollout mechanism? Is this applied via a static pod update? What's the blast radius if this causes the apiserver to crashloop?

Cross-Platform Considerations:

HyperShift/HCP: How is this configured in hosted control planes? The proposal mentions "annotations" but HCP doesn't allow admission webhooks and cannot rely on kube-apiserver patches. How will this be done?

MicroShift: "Not directly applicable" - why not? If this is a kube-apiserver flag, MicroShift runs kube-apiserver. Do we need to be sure this is disabled in MicroShift, or should it work there too?

Security & Reserved Values:

Can users set this to the minimum (5 minutes) in a way that causes problems for system components that rely on event audit trails? What about compliance requirements - have you validated that 5 minutes meets any regulatory retention requirements customers might have?

Metrics & Debugging:

What metrics are we tracking for this?
- Histogram of actual event TTL values in the cluster?
- Count of events pruned due to TTL expiration?
- Etcd storage savings vs baseline?

How do we debug when someone complains events are disappearing too quickly? Is there a metric showing "events_pruned_by_ttl" so we can correlate user complaints with the configuration?

Alternatives Not Considered:

Why is this in the operator API instead of a cluster-scoped config? This feels like cluster-wide configuration, not operator-specific. Have you considered putting this in the APIServer config CR instead?

Also - why configure TTL at all instead of just letting etcd compaction handle it? What's the actual problem we're solving beyond "storage optimization"?

Specific Questions:

1. What's the watch cache impact when TTL drops from 180min to 5min?
2. Have you benchmarked etcd lease churn with minimum TTL under high event velocity?
3. What happens when the apiserver restarts - do all event leases get recreated?
4. Is this really necessary or are we overengineering a storage problem that etcd compaction already solves?

Show me the data. Performance benchmarks with actual numbers. Upgrade test results. Metrics definitions. Then we can talk about whether this should ship.
```

Signed-off-by: Thomas Jungblut <[email protected]>
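As an aside on the metrics question in that example: "events_pruned_by_ttl" is the review's hypothetical metric name, not something kube-apiserver exposes today. If such a counter were ever added, a minimal client_golang sketch (the metric name and registration here are assumptions for illustration) might look like:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// eventsPrunedByTTL is a hypothetical counter matching the metric the review
// asks for; neither the name nor the metric exists in kube-apiserver today.
var eventsPrunedByTTL = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "apiserver_events_pruned_by_ttl_total",
	Help: "Number of Event objects expired because their TTL lease elapsed.",
})

func main() {
	prometheus.MustRegister(eventsPrunedByTTL)

	// Wherever lease expiry is observed, the counter would be bumped so that
	// user reports of "disappearing events" can be correlated with the TTL setting.
	eventsPrunedByTTL.Inc()
	fmt.Println("recorded one TTL-expired event")
}
```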
You can now channel THE Clayton Coleman (aka smarterclayton) to
implement and design features for you.
For example:
> /smarterclayton I need to write a downstream openshift feature that will set the event-ttl on the apiserver to a customizable value, how do I do it?
Having built this already myself, you can get reassurance it is exactly
like he would do it:
```
This feature already exists. You don't need to write it - it's been implemented following exactly the patterns I would recommend:
1. API already defined: /home/tjungblu/git/api/operator/v1/types_kubeapiserver.go:39-52
- EventTTLMinutes int32 field on KubeAPIServerSpec
- Validation bounds: 5-180 minutes (sensible operational range)
- Feature-gated with EventTTL feature gate
- Default: 3h (180 minutes) when omitted
2. Observer already implemented: /home/tjungblu/git/cluster-kube-apiserver-operator/pkg/operator/configobservation/apiserver/observe_event_ttl.go
- Converts EventTTLMinutes → --event-ttl apiserver flag
- Format conversion: 180 → "180m" duration string
- Default behavior: 0 or unset → "3h"
- Change detection and eventing
```
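The minutes-to-flag conversion described in that output boils down to formatting a duration string. A standalone sketch under that assumption (the function name and defaulting below are illustrative, not the actual observer code) could look like:

```go
package main

import "fmt"

// eventTTLFlag is an illustrative helper (not the real observer code): it maps
// the EventTTLMinutes API value to a value for the kube-apiserver --event-ttl
// flag, falling back to the 3h default when the field is zero or unset.
func eventTTLFlag(minutes int32) string {
	if minutes <= 0 {
		return "3h"
	}
	return fmt.Sprintf("%dm", minutes)
}

func main() {
	fmt.Println("--event-ttl=" + eventTTLFlag(0))   // --event-ttl=3h
	fmt.Println("--event-ttl=" + eventTTLFlag(180)) // --event-ttl=180m
}
```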
Bonus points: you can ask Clayton to implement the feature and have it
reviewed by /deads (requires openshift-eng#26) afterwards.
Signed-off-by: Thomas Jungblut <[email protected]>
So, I would like you to take a moment to consider how you might react to having your name attached to a bot prompted by other people, that you have no control over, without your knowledge or permission. (If necessary, perhaps take a second moment to consider whether this reaction is likely to be universal.) I think having an api-review command would be extremely valuable, but I'm confident you can find a more appropriate way of going about it.