CNTRLPLANE-1575: Add support for event-ttl in Kube API Server Operator #1857
Conversation
@tjungblu: This pull request references CNTRLPLANE-1575 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
I'm also curious about the event-ttl setting in OCP. Please correct me if I'm missing something: why not set event-ttl in the APIServer resource (config.openshift.io/v1)? Maybe we will set more parameters for all OpenShift-provided API servers in the future, not only event-ttl.
Thanks for the review @lance5890 - events are only created through the kube-apiserver, so adding it to the others doesn't make much sense. As for other configuration values (we have debated the GOAWAY chance parameter recently), those would be more suitable on the general apiserver config CRD.
1. Allow customers to configure the event-ttl setting for kube-apiserver through the OpenShift API
2. Provide a reasonable range of values (5 minutes to 3 hours) that covers most customer needs
3. Maintain backward compatibility with the current default of 3 hours (180 minutes)
Is this just so that we don't have to configure this value in CI? Given kube's default is 1 hour and we state there is little reason for a customer to need a longer period than 3 hours, could we also argue customers would generally want the default to be 1 hour and it's only 3 because of our internal processes?
As usual, nothing was functionally specified about what the intended behavior is supposed to be.
I would also argue we could reduce the default to 1h safely and configure the CI for 3h. That would be a discussion for the apiserver folks to chime in on, though.
If we have one customer running a batch job to export events every 2 hours, then automatically reducing the TTL to 1 hour breaks them.
Events are meant to be transient, and if you need to retain them, they can be replicated elsewhere. That sort of replication is not currently a part of the platform (maybe there are optional operators on top that can be configured to do this), and I'm afraid to know how many programs are reliant on events for doing "real work".
I would prefer to keep the scope as small as possible while satisfying the request to allow shorter event TTLs. Users with constrained deployments or high event creation rates will appreciate the escape valve. I'm not opposed to migrating the default eventually, but it carries more risk and is not necessary to satisfy the RFE. Besides, the steady-state number of events depends on both the arrival rate and the TTL. I suspect APF would be a more effective way to mitigate the worst-case number of events by limiting the arrival rate.
## Proposal

We propose to add an `eventTTLMinutes` field to the `APIServer` resource in `config.openshift.io/v1` that allows customers to configure the event-ttl setting for kube-apiserver.
The API PR is currently against operator, not config, btw; which do you intend to proceed with?
Good one; no, this is going into the KAS Operator CRD, it's an exclusively kube-apiserver setting.
This isn't important for other API servers? Looking at the existing operator CR there isn't any config there right now; it might be worth checking with the API server folks where they think this should be landing.
Only the kube-apiserver handles events and attaches leases to them, so it would only make sense there.
might be worth checking with the API server folks where they think this should be landing
Yep, just pinged them on slack. Happy to move it wherever it makes the most sense.
Worth noting that, in Hypershift, the APIServer cluster config is part of the HostedCluster but the KASO operator config is not, and we do not allow configuration inside the hosted cluster to modify how components in the hosted control plane (HCP) are configured.
@tjungblu is right that events are a kube-apiserver resource, not a generic API server thing. The event-ttl option only exists for kube-apiserver. I think we want to avoid expanding the surface of the config/v1 APIServer type with things that are not common across most/all API servers. At least, that's the stated intent:
// APIServer holds configuration (like serving certificates, client CA and CORS domains)
// shared by all API servers in the system, among them especially kube-apiserver
// and openshift-apiserver. The canonical name of an instance is 'cluster'.
KubeAPIServerConfig looks like it could work; it seems abandoned besides last year's addition of MinimumKubeletVersion.
@sjenning I think you're referring to this, right?
https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/hostedcluster_types.go#L1855-L1859
that would be:
https://github.com/openshift/api/blob/master/config/v1/types_apiserver.go#L37
if the others are OK with moving the setting there, we can also lift and shift this into Hypershift that way.
KubeAPIServerConfig in kubecontrolplane/v1 is the format of the config file; it's not a resource that we serve anywhere.
I also think that in general config/v1 APIServer is meant to be used for settings shared among most API servers.
The operator/v1 KubeAPIServer resource seems like a good candidate for a setting that applies only to the kube-apiserver. This resource already contains the spec.logLevel field, which is used to set the log level for the operand.
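For illustration, a minimal Go sketch of what such a field on the operator/v1 KubeAPIServer spec could look like; the struct name, godoc wording, and validation markers are assumptions here, and the final shape is defined by the API PR (openshift/api#2520):

```go
package v1

// EventTTLSketch is a hypothetical stand-in for the KubeAPIServer operator
// spec; only the field below matters for this discussion.
type EventTTLSketch struct {
	// eventTTLMinutes is the amount of time, in minutes, that the
	// kube-apiserver retains events before they expire. Lower values reduce
	// the etcd space used by events at the cost of a shorter debugging
	// window and more expensive expiry/compaction work.
	// +kubebuilder:validation:Minimum=5
	// +kubebuilder:validation:Maximum=180
	// +optional
	EventTTLMinutes int32 `json:"eventTTLMinutes,omitempty"`
}
```

The 5-180 bounds mirror the 5 minutes to 3 hours range stated in the goals; leaving the field unset keeps the current 3h default.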
As a cluster administrator in a regulated environment, I want to configure a longer event retention period so that I can meet compliance requirements for audit trails and debugging.

#### Story 2: Storage Optimization

As a cluster administrator with limited etcd storage, I want to configure a shorter event retention period so that I can reduce etcd storage usage while maintaining sufficient event history for troubleshooting.
Do we have any metrics in terms of bytes of data that events generate over time in a regular openshift cluster?
I'll check telemeter tomorrow; I think we may have the counts. Bytes are usually difficult to obtain because etcd does not expose those per resource type.
I think it would be useful to understand the impact on the data, so if we can get some estimates of the impact of this change without too much effort, I suspect that will be useful for folks to know.
I hope I got the query right:
topk(10, cluster:usage:resources:sum{resource="events"})
The biggest cluster has about 3-4 million events in it and then it levels off quickly; putting it into quantiles:
(screenshot: per-cluster event counts by quantile)
The very large majority of clusters probably won't notice changing the TTL to 5m with so few events in storage.
The biggest cluster, with 3-4 million events, also has not much else beyond events in it, so taking it as a sample against its five gigabytes of used etcd storage gives roughly 1.5 KB per event on average.
Looking into the arrival rate here, it is constant at about 0.2/s, with a one to two hour window at 0.8/s. Not particularly large either.
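As a rough illustration of why most clusters probably won't notice, a back-of-the-envelope Go sketch using the simple model of steady-state count ≈ arrival rate × TTL and the ~1.5 KB/event figure above; the rates and TTLs are just the samples quoted in this thread, not a general claim:

```go
package main

import "fmt"

func main() {
	const bytesPerEvent = 1500.0                 // ~1.5 KB per event, from the sample above
	rates := []float64{0.2, 0.8}                 // observed arrival rates in events/s
	ttls := []float64{5 * 60, 60 * 60, 180 * 60} // TTLs of 5m, 1h, 3h in seconds

	for _, r := range rates {
		for _, ttl := range ttls {
			count := r * ttl // steady-state event count under the simple model
			fmt.Printf("rate=%.1f/s ttl=%6.0fs -> ~%7.0f events, ~%5.2f MB\n",
				r, ttl, count, count*bytesPerEvent/1e6)
		}
	}
}
```

Even at 0.8/s and a 3h TTL this stays around ten megabytes; clusters holding millions of events are presumably seeing far higher burst rates than the averages quoted here.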
#### Hypershift / Hosted Control Planes

This enhancement does not apply to Hypershift.
Do we know if HyperShift also uses 3h by default? Do they have similar requests to be able to change the config?
github is down more than it is up right now, but they also use 3h:
https://github.com/openshift/hypershift/blob/69c68a4003cbeb0048c9c31d8b68bed6fc287970/control-plane-operator/controllers/hostedcontrolplane/v2/kas/config.go#L214
As for the other question, I'll ping Seth tomorrow on reviewing this.
Hypershift follows KASO for default config generally and does use 3h for event-ttl, same as KASO.
While we don't have an explicit request for this in HCP, we should look to allow this for all topologies.
I'll note this down; we can do the necessary Hypershift changes as well.
@sjenning I was just looking into how GOAWAY chance was implemented and it seems to be based on an annotation:
https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go#L121-L123
do you think this could be a viable implementation here, too? Given that, in the other thread, we wanted to keep it in the operator CRD for OCP.
In general we shouldn't be adding new annotations; is there any reason this couldn't be a proper structured API on whichever is the appropriate HyperShift object?
kind: APIServer
apiVersion: config.openshift.io/v1
spec:
  eventTTLMinutes: 60 # Integer value in minutes, e.g., 60, 180
Are there any configuration subsections that it might make sense to group this into? How does this look in the upstream API server configuration file?
This is just a plain command-line argument taking a duration in kube-apiserver:
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
--event-ttl duration     Default: 1h0m0s
Amount of time to retain events.
Downstream, I'm not sure; also something for the apiserver folks to chime in on.
There is no resource/file for configuring kas upstream; settings are passed as CLI flags.
We could group this under apiServerArguments, but I'm not sure that would provide a good UX for administrators to set/use.
I suspect then we just leave it where it is for now
Yeah, for the downstream fork we have KubeAPIServerConfig, which is filled by the operator and then consumed by the forked kas.
I think the only alternative would be to put this into config/v1 APIServer and mark the field as kas-specific.
(Unless we also consider a ConfigMap created by admins, but that would provide terrible UX/validation.)
Cool, then let's leave it where it is. We need to find a way to add this into Hypershift then.
1. **etcd Compaction Bandwidth**: With faster expiring events, etcd will need to perform compaction more frequently to remove expired events. This increases the bandwidth usage for etcd compaction operations.

2. **etcd CPU Usage**: More frequent compaction operations will increase CPU usage on etcd nodes, as the compaction process requires CPU cycles to identify and remove expired events.

3. **Event Availability**: Events will be deleted more quickly, potentially reducing the time window available for debugging and troubleshooting.
These impacts might be worth touching on briefly in the godoc for the field, in terms of trading lower storage for increased CPU usage. At the moment the godoc only states the positive and doesn't explain the trade-off.
I still need to quantify the order of magnitude here. I think this is too negligible to even notice, but I haven't been able to test this in depth yet.
having said that, the compactions do not become more frequent, they just become more expensive.
Is there a practical risk of decreasing event-ttl by 90%+ for a cluster with many events (i.e. how expensive can the first expiry cycle be)?
Good scenario; let me also measure this. The compaction is always O(n), where n is the number of revisions to be compacted: either 10k or whatever we write in the 5-minute interval.
When we reduce the TTL to 1h, we would keep deleting the pre-existing events over the next 3h, and, starting 1h after the change, also the events that arrived since the TTL was changed. For the remaining two hours there would be an elevated rate of those deletes until the 3h backlog is fully expired; then it should settle into some steady state for the 1h TTL.
What could also be worrisome is any watchers running on the events resource. I'm not sure whether a lease-based delete triggers a delete notification; that seems quite wasteful, and I think the keys are just internally tombstoned in the index until compacted.
2. **Cluster Administrator** edits the operator configuration resource
3. **Cluster Administrator** sets the `eventTTLMinutes` field to the desired value in minutes (e.g., 60, 180)
4. **kube-apiserver-operator** detects the configuration change
5. **kube-apiserver-operator** updates the kube-apiserver deployment with the new configuration
Yeah, I think in practice we will create an observer that sets apiServerArguments/event-ttl, which will eventually be propagated to the operator's observedConfig, and later propagated as config to kas.
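A minimal Go sketch of the observedConfig fragment such an observer might produce, assuming it reads an eventTTLMinutes value from the KubeAPIServer operator spec and renders it as a duration for apiServerArguments (the helper name and zero-value handling are assumptions):

```go
package observer

import "time"

// eventTTLObservedConfig is a hypothetical helper: given the configured
// eventTTLMinutes, it returns the observedConfig fragment that would be
// merged by the operator and eventually rendered as the --event-ttl flag
// on the kube-apiserver.
func eventTTLObservedConfig(eventTTLMinutes int32) map[string]interface{} {
	if eventTTLMinutes == 0 {
		// Unset: override nothing and keep the current 3h default.
		return map[string]interface{}{}
	}
	ttl := time.Duration(eventTTLMinutes) * time.Minute
	return map[string]interface{}{
		"apiServerArguments": map[string]interface{}{
			// kube-apiserver accepts a Go duration string, e.g. "1h30m0s".
			"event-ttl": []interface{}{ttl.String()},
		},
	}
}
```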
#### Impact of removing 3 gigabytes of events

To represent the worst case of removing 3 gigabytes of events, we have filled a 4.21 nightly cluster with 3 million events and the default TTL.
Then we configured a 5 minute TTL and watched the resource usage over the following three hours...
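For illustration, a minimal client-go sketch of one way such an event flood could be generated against a throwaway test cluster; the kubeconfig path, namespace, and involved object are hypothetical, and this is not necessarily the exact procedure used for the numbers above:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig pointing at the throwaway 4.21 nightly cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/tmp/test-cluster.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const total = 3_000_000 // matches the 3 million events used for the worst case
	ctx := context.Background()
	for i := 0; i < total; i++ {
		_, err := client.CoreV1().Events("default").Create(ctx, &corev1.Event{
			ObjectMeta: metav1.ObjectMeta{GenerateName: "ttl-flood-"},
			InvolvedObject: corev1.ObjectReference{
				Kind:      "Pod",
				Namespace: "default",
				Name:      "ttl-flood-target", // hypothetical object the events refer to
			},
			Reason:         "TTLFloodTest",
			Message:        "synthetic event used to measure expiry cost",
			Type:           corev1.EventTypeNormal,
			FirstTimestamp: metav1.Now(),
			LastTimestamp:  metav1.Now(),
			Count:          1,
		}, metav1.CreateOptions{})
		if err != nil {
			panic(err)
		}
	}
	fmt.Printf("created %d events\n", total)
}
```

In practice one would parallelize the creation and spread events across namespaces, but the single loop is enough to show the API surface involved.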
it would be great if you could share some data, especially regarding cpu/memory usage of etcd/kas.
in general kas shouldn’t be affected, since the watch cache should be/is (will try to verify this) disabled for events.
@tjungblu: all tests passed!
API PR in openshift/api#2520
Feature Gate PR in openshift/api#2525