-
Notifications
You must be signed in to change notification settings - Fork 14
ADD: Resilient Deployment #119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,339 @@ | ||||||
| # RFC Template | ||||||
|
|
||||||
| - Feature Name: Resilient Deployment | ||||||
| - Start Date: 2025-04-07 | ||||||
| - RFC PR: [Kuadrant/architecture#119](https://github.com/Kuadrant/architecture/pull/119) | ||||||
| - Issue tracking: [Kuadrant/architecture#117](https://github.com/Kuadrant/architecture/issues/117) | ||||||
|
|
||||||
| # Summary | ||||||
| [summary]: #summary | ||||||
|
|
||||||
| <!-- TODO: One paragraph explanation of the feature. --> | ||||||
|
|
||||||
| # Motivation | ||||||
| [motivation]: #motivation | ||||||
|
|
||||||
| <!-- Why are we doing this? What use cases does it support? What is the expected outcome? --> | ||||||
| As our user move to deploying kuadrant into production environments there is a want to create more resilient deployments. | ||||||
| This includes deploying multiply replicas of the data plane components, Authorino & Limitador. | ||||||
| In the case of Limitador, this also includes being able to persist counters. | ||||||
|
|
||||||
| As Kuadrant we want to provide a user experience that makes deploying a more resilient product possible. | ||||||
|
|
||||||
|
|
||||||
| # Guide-level explanation | ||||||
| [guide-level-explanation]: #guide-level-explanation | ||||||
|
|
||||||
| <!-- Explain the proposal as if it was implemented and you were teaching it to Kuadrant user. That generally means: --> | ||||||
| <!----> | ||||||
| <!-- - Introducing new named concepts. --> | ||||||
| <!-- - Explaining the feature largely in terms of examples. --> | ||||||
| <!-- - Explaining how a user should *think* about the feature, and how it would impact the way they already use Kuadrant. It should explain the impact as concretely as possible. --> | ||||||
| <!-- - If applicable, provide sample error messages, deprecation warnings, or migration guidance. --> | ||||||
| <!-- - If applicable, describe the differences between teaching this to existing and new Kuadrant users. --> | ||||||
|
|
||||||
| At the core, there are two different areas of focus required here. | ||||||
| There is the resilience of the deployments, Authorino & Limitador, and there is the resilience of the counters used within Limitador. | ||||||
| These two areas address the resiliency as whole. | ||||||
|
|
||||||
| For the deployments there are a number of configurations required. | ||||||
| - TopologyConstraints | ||||||
| - PodDisruptionBudget | ||||||
| - Resource Limits | ||||||
maleck13 marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
| - Replicas | ||||||
| With Authoring and Limitador there are different ways of configuring some of the features. | ||||||
|
||||||
| With Authoring and Limitador there are different ways of configuring some of the features. | |
| With Authorino and Limitador there are different ways of configuring some of the features. |
maleck13 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given We have observability and mTLS. Should we add an option here for
resilience:
enable: true
I see it is discussed here: Kuadrant/kuadrant-operator#1329 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to responded sooner about this. For the enable field I agree to having it there, but what would be the behavior in this case. With in the resilience section there is 3 other fields. Two with are boolean values, and the third being a config object. One of the boolean fields requires the config object. So the question is, if the user only sets enable: true what is expected to happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the storage config is needed. I see this as effectively a short cut to say the same thing as
authorization: True
rateLimiting: True
counterStorage: {"config"...}
So when they set the top level enable value to true, we should treat it as though they have set authorization and rateLimiting to true and so expect counterStorage to be explicitly set also: Valid would be:
resilience:
enable: true
counterStorage: {"config"....}
One other note here. Authorino can do more than just authorization so I am not sure we should name this authorization and perhaps the short hand auth is a better option
maleck13 marked this conversation as resolved.
Show resolved
Hide resolved
maleck13 marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be blank for a new installation right? If I install the operator and then create the kuadrant CR for first time with the resilience block filled in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this is a simple configuration as we are going from nothing to something well defined. Is this perhaps what you were trying to convey. No kuadrant CR means no existing configuration ?
maleck13 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
maleck13 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
maleck13 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
maleck13 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
maleck13 marked this conversation as resolved.
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| From time to time SRE teams would need to change configurations for other to address issues. | |
| From time to time SRE teams may need to change some configuration to address specific issues. |
maleck13 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The user should be allowed to that, but the Kuadrant CR should state the configuration is below expected spec. | |
| The user should be allowed to do that, but the Kuadrant CR should state the configuration is below expected spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is needed. I think it would be reasonable to expect alerts to be set up based on the expected number of replicas and dashboards rather than needing this to expressed as a warning in the status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In one sense, I agree, and in the other this is more for use. There has being a number time were I have seen an issue arise from testing (nightly, or RC), and the one of the first thing asked for is copy of the configure in question. I have never seen the alerts being asked for. I think both alerts, and good status are required. An alert will give information faster, but a status can give more detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true to a point. If they delete the PDB, I would expect we put it back for example. If we have an explicit value in the kuadrant CRD that should also be reconciled back IMO. but for this feature in general I agree. But I wanted to call out the deletion as I don't think we want a situation where a PDB is deleted by accident and in order to get it back I first have to remove the config from the Kuadrant CR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is good to call out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder can we just generalise this in some way. I think given the user is making a change we don't have to say "its below minimum" I think we just want to reflect "custom configuration applied to limitador" for example ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it was generalized, does the status lose meaning? I think the detail adds value here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah so I think perhaps (see above) we just aim to have a consistent status for when the configuration has been changed (whether below or above expectations) and a status for when it is as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above. don't think we need a different status for each just a "config changed status"
maleck13 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but will be requiring modifying the limitador deployment for the `topologySpreadConstraints`. | |
| For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but it will be required to modify the limitador deployment for the `topologySpreadConstraints`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we call it an opinionated data plane configuration? To a certain extent we are setting it up based on our own understanding of these components and I would say common topology setups (AZ for example).