Skip to content

ADD: Resilient Deployment#119

Open
Boomatang wants to merge 3 commits intomainfrom
RFC_Resilient_Deployment
Open

ADD: Resilient Deployment#119
Boomatang wants to merge 3 commits intomainfrom
RFC_Resilient_Deployment

Conversation

@Boomatang
Copy link
Member

closes: #117

Start the RFC for Resilient Deployment

Signed-off-by: Jim Fitzpatrick <jfitzpat@redhat.com>
@Boomatang Boomatang marked this pull request as draft April 8, 2025 14:28
Signed-off-by: Jim Fitzpatrick <jfitzpat@redhat.com>
@Boomatang Boomatang marked this pull request as ready for review April 11, 2025 10:04
@Boomatang Boomatang self-assigned this Apr 11, 2025
This includes deploying multiply replicas of the data plane components, Authorino & Limitador.
In the case of Limitador, this also includes being able to persist counters.

As Kuadrant we want to provide a user experience that makes deploying a more resilient product possible.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we call it an opinionated data plane configuration? To a certain extent we are setting it up based on our own understanding of these components and I would say common topology setups (AZ for example).

- PodDisruptionBudget
- Resource Limits
- Replicas
With Authoring and Limitador there are different ways of configuring some of the features.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
With Authoring and Limitador there are different ways of configuring some of the features.
With Authorino and Limitador there are different ways of configuring some of the features.


### New installations
This is by far the simplest configuration.
By default, the all resilience configuration will be blank.
Copy link
Collaborator

@maleck13 maleck13 Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be blank for a new installation right? If I install the operator and then create the kuadrant CR for first time with the resilience block filled in?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this is a simple configuration as we are going from nothing to something well defined. Is this perhaps what you were trying to convey. No kuadrant CR means no existing configuration ?

Why we do this is to simplify the reconciliation of configuration.

### SRE reaction events
From time to time SRE teams would need to change configurations for other to address issues.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
From time to time SRE teams would need to change configurations for other to address issues.
From time to time SRE teams may need to change some configuration to address specific issues.

From time to time SRE teams would need to change configurations for other to address issues.
We should allow the user configure the deployment to a state that is we regard as not resilient.
The one example that comes to mind is scaling replicas to zero for address some issue.
The user should be allowed to that, but the Kuadrant CR should state the configuration is below expected spec.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The user should be allowed to that, but the Kuadrant CR should state the configuration is below expected spec.
The user should be allowed to do that, but the Kuadrant CR should state the configuration is below expected spec.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is needed. I think it would be reasonable to expect alerts to be set up based on the expected number of replicas and dashboards rather than needing this to expressed as a warning in the status

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one sense, I agree, and in the other this is more for use. There has being a number time were I have seen an issue arise from testing (nightly, or RC), and the one of the first thing asked for is copy of the configure in question. I have never seen the alerts being asked for. I think both alerts, and good status are required. An alert will give information faster, but a status can give more detail.

### The user wants more
The user knows more about their deployment than we ever can.
We need to allow the user to modify the configuration to suit their needs.
When a user modifies the configuration, the kuadrant-operator should not reconcile back the changes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true to a point. If they delete the PDB, I would expect we put it back for example. If we have an explicit value in the kuadrant CRD that should also be reconciled back IMO. but for this feature in general I agree. But I wanted to call out the deletion as I don't think we want a situation where a PDB is deleted by accident and in order to get it back I first have to remove the config from the Kuadrant CR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is good to call out.

When a user modifies the configuration, the kuadrant-operator should not reconcile back the changes.

If the user changes are below the minimum spec that we state, this should be reflected in the status.
If the user changes the configuration at all it should be reflected in the status.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder can we just generalise this in some way. I think given the user is making a change we don't have to say "its below minimum" I think we just want to reflect "custom configuration applied to limitador" for example ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it was generalized, does the status lose meaning? I think the detail adds value here.

If the user changes are below the minimum spec that we state, this should be reflected in the status.
If the user changes the configuration at all it should be reflected in the status.
However, when the spec is more than what we regard as minimum the status should be a highlight not warning or error.
How this is reflected in the status of the Kuadrant CR is unknown currently.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so I think perhaps (see above) we just aim to have a consistent status for when the configuration has been changed (whether below or above expectations) and a status for when it is as expected

The authorino CR allows the setting of replicas `spec.replicas`.
This should be set to 2.
The user should have the power to modify this number to what they want.
If the number is less than the minimum we recommend (2), the kuadrant CR should report an error in the status.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above. don't think we need a different status for each just a "config changed status"



## spec.resilience.rateLimiting
For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but will be requiring modifying the limitador deployment for the `topologySpreadConstraints`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but will be requiring modifying the limitador deployment for the `topologySpreadConstraints`.
For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but it will be required to modify the limitador deployment for the `topologySpreadConstraints`.

Address a number of issues raised during the review process

Signed-off-by: Jim Fitzpatrick <jfitzpat@redhat.com>
@Boomatang Boomatang moved this to Ready For Review in Kuadrant Apr 25, 2025
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
spec:
resilience:
Copy link
Collaborator

@maleck13 maleck13 May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given We have observability and mTLS. Should we add an option here for

resilience:
      enable: true

I see it is discussed here: Kuadrant/kuadrant-operator#1329 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to responded sooner about this. For the enable field I agree to having it there, but what would be the behavior in this case. With in the resilience section there is 3 other fields. Two with are boolean values, and the third being a config object. One of the boolean fields requires the config object. So the question is, if the user only sets enable: true what is expected to happen?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the storage config is needed. I see this as effectively a short cut to say the same thing as

    authorization: True
    rateLimiting: True
    counterStorage: {"config"...} 

So when they set the top level enable value to true, we should treat it as though they have set authorization and rateLimiting to true and so expect counterStorage to be explicitly set also: Valid would be:

resilience:
      enable: true
      counterStorage: {"config"....}

One other note here. Authorino can do more than just authorization so I am not sure we should name this authorization and perhaps the short hand auth is a better option

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Ready For Review

Development

Successfully merging this pull request may close these issues.

RFC Resilient Deployment

2 participants