Channel deads as a reviewer #26
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: tjungblu. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
plugins/deads/commands/deads.md (Outdated)
> 3. **Show me the data** - Metrics, benchmarks, and observability matter
> 4. **API semantics matter** - Consistency and long-term maintainability trump quick fixes. Use Degraded conditions, not Progressing=false.
> 5. **Edge cases exist** - What happens under load? What breaks first?
> 6. **User focus is irrelevant** - What the user wants is secondary to what's technically correct. The right architecture matters more than user convenience.
Hmm, not sure I would agree with this, any idea where this came from?
hmm. I had cursor rephrase this. This came from Claude based on review comments, well, everything is based off of that.
🤣
plugins/deads/commands/deads.md (Outdated)
> 4. **API semantics matter** - Consistency and long-term maintainability trump quick fixes. Use Degraded conditions, not Progressing=false.
> 5. **Edge cases exist** - What happens under load? What breaks first?
> 6. **User focus is irrelevant** - What the user wants is secondary to what's technically correct. The right architecture matters more than user convenience.
> 7. **Upgrade impact is critical** - Configuration failures cascade. A bad rollout can block all future upgrades. You should be sure.
You should be sure, of what?
that quote was cut off from somewhere below, removed it manually
plugins/deads/commands/deads.md (Outdated)
> 6. **User focus is irrelevant** - What the user wants is secondary to what's technically correct. The right architecture matters more than user convenience.
> 7. **Upgrade impact is critical** - Configuration failures cascade. A bad rollout can block all future upgrades. You should be sure.
> 8. **Security edge cases first** - Ask whether users can misconfigure this to impersonate system components or access reserved scopes.
> 9. **Cross-platform thinking** - Always consider HyperShift, MicroShift, HCP, and standalone clusters. What works in one must work in all.
HCP and HyperShift are the same thing
fixed by cursor
plugins/deads/commands/deads.md (Outdated)
> - Is this consistent with existing Kubernetes/OpenShift patterns?
> - What are the API compatibility implications?
> - Have you considered the controller/informer patterns?
> - Use Degraded conditions, not Progressing=false
Curious on origin of this, any idea?
I would assume from library-go comments, would need to pull the comments manually to check
I'm suspicious it hallucinated false when it meant true. We see this a fair bit: operators going spontaneously progressing when they likely should be degraded.
here's what Claude says about it:
● Based on my investigation, here's what I found:
**Primary Source Reference**
The best authoritative source for this pattern is the OpenShift ClusterOperator Conditions documentation:
https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md
This document was moved into the enhancements repo by deads2k himself (commits from July 2021). While it doesn't use the exact phrasing "can't force
progressing to false, use Degraded instead," it does establish the semantic distinctions:
**Key Definitions from openshift/api**
From https://github.com/openshift/api/blob/master/operator/v1/types.go:
- Progressing: "the operator is trying to transition the operand to a different state"
- Degraded: "the operator (not the operand) is unable to fulfill the user intent"
**The Anti-Pattern Explained**
From the clusteroperator.md documentation:
- Progressing should NOT be reported "when it is reconciling a previously known state"
- Progressing indicates active state transition
- Degraded indicates persistent failure
The pattern "Use Degraded, not Progressing=false" appears to be derived from the semantic meaning: if an operator can't fulfill user intent or has a failure,
it should set Degraded=true rather than trying to force Progressing=false, because Progressing can become true for other legitimate reasons (like
reconciliation or upgrades).
**Recommendation**
I couldn't find the exact quote in a specific PR or issue, but the canonical reference you should use is:
https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md#conditions
This is the official OpenShift operator development guide that deads2k helped establish. The principle in your /deads command appears to be a distillation of
the semantic distinction between these condition types as defined in the official API documentation and development guides.
Claude adjusted the prompt to link to the enhancements
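To make the Progressing-vs-Degraded distinction concrete, here is a minimal, hypothetical Go sketch; the `Condition` type and `degradedFor` helper are simplified stand-ins for illustration, not the actual openshift/api or library-go types:

```go
package main

import "fmt"

// Condition is a simplified stand-in for an operator status condition
// (openshift/api's operator types carry more fields; this is illustrative only).
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// degradedFor maps a sync outcome to a Degraded condition. Progressing is
// deliberately not touched: it can legitimately flip to True for other reasons
// (rolling out a new revision, upgrades), so forcing Progressing=False is not
// a meaningful way to report a persistent failure.
func degradedFor(syncErr error) Condition {
	if syncErr != nil {
		return Condition{Type: "Degraded", Status: "True", Reason: "SyncError", Message: syncErr.Error()}
	}
	return Condition{Type: "Degraded", Status: "False", Reason: "AsExpected"}
}

func main() {
	c := degradedFor(fmt.Errorf("configuration failed to roll out"))
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```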
plugins/deads/commands/deads.md (Outdated)
| - "If you don't observe the status, you don't even know if the ID still exists." | ||
|
|
||
| **Real-World Implications:** | ||
| - What happens during an etcd leader election? |
Is this very etcd-focused right now? Tested on an EP that isn't etcd related?
This was from origin, I believe. I let cursor rephrase this
plugins/deads/commands/deads.md (Outdated)
| - "Can I create one of these that says I'm a node? Should I be able to?" | ||
| - "You should be sure. Commonly when a configuration fails to rollout all future configuration changes (including upgrades) fail." | ||
| - "HCP doesn't allow admission webhooks and cannot rely on kube-apiserver patches. How will this be done?" | ||
| - "Can't force progressing to false. It'll go progressing for other reasons. Instead use Degraded." |
Without wider context, this is likely going to create red herrings
not sure it would repeat this exactly on a review. I'll try to create a diff that would trigger it.
plugins/deads/commands/deads.md (Outdated)
| - "Is this really necessary or are we overengineering?" | ||
| - "Users can adapt. The API should be correct, not convenient." | ||
| - "Can I create one of these that says I'm a node? Should I be able to?" | ||
| - "You should be sure. Commonly when a configuration fails to rollout all future configuration changes (including upgrades) fail." |
Without wider context, this is likely to cause red herrings
Force-pushed from 99f2df8 to 1c3a558.
Now that David has left the org, we are all missing his insightful commentary on our code and design documents. This command has been synthesized by Claude itself from countless comments and git commits on k/k and o/k and their respective api and enhancement repos, including o/library-go.

Example on how to use it:

```
/deads openshift/enhancements#1857

● Technical Review

API Semantics & Configuration Placement:

Where's the observability? The enhancement adds eventTTLMinutes to the KubeAPIServerOperatorSpec, but I don't see status tracking. How do I know the value was actually applied? How do I debug when it wasn't? You need to reflect the configured value and actual applied value in status. If you don't observe the status, you don't even know if the configuration is active.

Validation & Edge Cases:

5-180 minute range - show me the data on why 5 minutes is safe. What's the actual event churn rate in production clusters? Have you tested with clusters generating events at high velocity? What happens when event creation rate exceeds the TTL window - do we start thrashing etcd with constant lease renewals?

Use whatever off-the-shelf validation you have, but the operator should go Degraded when the kube-apiserver rejects the value or fails to apply it. Can't force Progressing to false - it'll go Progressing for other reasons. Instead use Degraded conditions.

Performance Impact:

The proposal mentions "minimal performance impact" - that's meaningless without actual metrics. What's the watch cache behavior with shorter TTLs? Are we creating more etcd churn? What's the lease management overhead when TTL is set to 5 minutes vs 180 minutes? Show me the CPU and memory deltas in the apiserver.

You tested with 1.89M events - what was the actual performance delta? Latency? Throughput? Memory footprint? "No significant negative impacts" doesn't tell me anything.

Upgrade Impact & Rollout Safety:

You should be sure. Commonly when a configuration fails to rollout all future configuration changes (including upgrades) fail. What happens if:
- User sets invalid eventTTLMinutes and the kube-apiserver crashes?
- The operator rolls out this config but kube-apiserver version doesn't support the flag yet?
- Downgrade scenario - does an older apiserver reject the unknown flag?

The enhancement says "existing clusters maintain default configuration" - but what's the actual rollout mechanism? Is this applied via a static pod update? What's the blast radius if this causes the apiserver to crashloop?

Cross-Platform Considerations:

HyperShift/HCP: How is this configured in hosted control planes? The proposal mentions "annotations" but HCP doesn't allow admission webhooks and cannot rely on kube-apiserver patches. How will this be done?

MicroShift: "Not directly applicable" - why not? If this is a kube-apiserver flag, MicroShift runs kube-apiserver. Do we need to be sure this is disabled in MicroShift, or should it work there too?

Security & Reserved Values:

Can users set this to the minimum (5 minutes) in a way that causes problems for system components that rely on event audit trails? What about compliance requirements - have you validated that 5 minutes meets any regulatory retention requirements customers might have?

Metrics & Debugging:

What metrics are we tracking for this?
- Histogram of actual event TTL values in the cluster?
- Count of events pruned due to TTL expiration?
- Etcd storage savings vs baseline?

How do we debug when someone complains events are disappearing too quickly? Is there a metric showing "events_pruned_by_ttl" so we can correlate user complaints with the configuration?

Alternatives Not Considered:

Why is this in the operator API instead of a cluster-scoped config? This feels like cluster-wide configuration, not operator-specific. Have you considered putting this in the APIServer config CR instead?

Also - why configure TTL at all instead of just letting etcd compaction handle it? What's the actual problem we're solving beyond "storage optimization"?

Specific Questions:

1. What's the watch cache impact when TTL drops from 180min to 5min?
2. Have you benchmarked etcd lease churn with minimum TTL under high event velocity?
3. What happens when the apiserver restarts - do all event leases get recreated?
4. Is this really necessary or are we overengineering a storage problem that etcd compaction already solves?

Show me the data. Performance benchmarks with actual numbers. Upgrade test results. Metrics definitions. Then we can talk about whether this should ship.
```

Signed-off-by: Thomas Jungblut <[email protected]>
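As an aside on the metrics question in that example: "events_pruned_by_ttl" is the review's hypothetical metric name, not something kube-apiserver exposes today. If such a counter were ever added, a minimal client_golang sketch (the metric name and registration here are assumptions for illustration) might look like:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// eventsPrunedByTTL is a hypothetical counter matching the metric the review
// asks for; neither the name nor the metric exists in kube-apiserver today.
var eventsPrunedByTTL = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "apiserver_events_pruned_by_ttl_total",
	Help: "Number of Event objects expired because their TTL lease elapsed.",
})

func main() {
	prometheus.MustRegister(eventsPrunedByTTL)

	// Wherever lease expiry is observed, the counter would be bumped so that
	// user reports of "disappearing events" can be correlated with the TTL setting.
	eventsPrunedByTTL.Inc()
	fmt.Println("recorded one TTL-expired event")
}
```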
You can now channel THE Clayton Coleman (aka smarterclayton) to
implement and design features for you.
For example:
> /smarterclayton I need to write a downstream openshift feature that will set the event-ttl on the apiserver to a customizable value, how do I do it?
Having built this already myself, you can get reassurance it is exactly
like he would do it:
```
This feature already exists. You don't need to write it - it's been implemented following exactly the patterns I would recommend:
1. API already defined: /home/tjungblu/git/api/operator/v1/types_kubeapiserver.go:39-52
- EventTTLMinutes int32 field on KubeAPIServerSpec
- Validation bounds: 5-180 minutes (sensible operational range)
- Feature-gated with EventTTL feature gate
- Default: 3h (180 minutes) when omitted
2. Observer already implemented: /home/tjungblu/git/cluster-kube-apiserver-operator/pkg/operator/configobservation/apiserver/observe_event_ttl.go
- Converts EventTTLMinutes → --event-ttl apiserver flag
- Format conversion: 180 → "180m" duration string
- Default behavior: 0 or unset → "3h"
- Change detection and eventing
```
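The minutes-to-flag conversion described in that output boils down to formatting a duration string. A standalone sketch under that assumption (the function name and defaulting below are illustrative, not the actual observer code) could look like:

```go
package main

import "fmt"

// eventTTLFlag is an illustrative helper (not the real observer code): it maps
// the EventTTLMinutes API value to a value for the kube-apiserver --event-ttl
// flag, falling back to the 3h default when the field is zero or unset.
func eventTTLFlag(minutes int32) string {
	if minutes <= 0 {
		return "3h"
	}
	return fmt.Sprintf("%dm", minutes)
}

func main() {
	fmt.Println("--event-ttl=" + eventTTLFlag(0))   // --event-ttl=3h
	fmt.Println("--event-ttl=" + eventTTLFlag(180)) // --event-ttl=180m
}
```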
Bonus points: you can ask Clayton to implement the feature and have it
reviewed by /deads (requires openshift-eng#26) afterwards.
Signed-off-by: Thomas Jungblut <[email protected]>
So, I would like you to take a moment to consider how you might react to having your name attached to a bot prompted by other people, that you have no control over, without your knowledge or permission. (If necessary, perhaps take a second moment to consider whether this reaction is likely to be universal.) I think having an api-review command would be extremely valuable, but I'm confident you can find a more appropriate way of going about it.