Skip to content

gep: standardizing behavior for invalid BackendTLSPolicy #3909

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Aug 13, 2025

Conversation

snorwin
Copy link
Member

@snorwin snorwin commented Jul 11, 2025

What type of PR is this?

/kind gep

What this PR does / why we need it:
This PR updates the GEP (#1897) to clarify the behavior of BackendTLSPolicy in cases where certificate ref resolving fails or when system-trusted certificates are not defined.

/cc @candita

Which issue(s) this PR fixes:

Fixes #3516

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot requested a review from candita July 11, 2025 08:40
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/gep PRs related to Gateway Enhancement Proposal(GEP) cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 11, 2025

For an invalid BackendTLSPolicy, implementations MUST NOT fall back to unencrypted (plaintext) connections.
Instead, the corresponding TLS connection MUST fail, and the client MUST receive an HTTP error response.
Additionally, the `Accepted` status condition of the BackendTLSPolicy MUST be set to `False` with the reason `Invalid`.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at it again, it might make sense to introduce the ResolvedRefs condition for policies as well. However, I’m not sure whether it fits within the current schedule for graduating the BackendTLSPolicy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be difficult to implement due to the distinct processes of policy application and certificate validation against the backend. It's a valid scenario that the BTP was accepted, and configuration properly propagated by the controller, but connectivity is broken due to certificate misconfiguration. All information needed to debug this issue should be passed by inspecting the BTP spec against the Service to which the Policy was applied.

Copy link
Member

@robscott robscott Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @kl52752 that ResolvedRefs would be difficult to set for anything that was tied to dataplane/connectivity. On the other hand, ResolvedRefs seems like a useful concept on BackendTLSPolicy for the case that a CACertRef is invalid/points to something that doesn't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there can be a maximum of 8 CACertificateRefs on the BTP validation type, so we would also want to specify here whether a minimum of one wrong CACertificateRef causes Accepted to become false or ResolvedRefs to become false, or for either to reflect some degree of problems (1 of 4 CACertificateRefsInvalid?). Isn't this something we should rather leave as an implementation detail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize: I think that ResolvedRefs should be included and used for certificates that do not exist or are not valid (where "not valid" means "does not contain certain keys", not "unwrap the certificate and check certificate properties").

Decoding the certificate should never be required for an implementation to be conformant. Implementations MAY unwrap if they wish and do additional error handling, but those errors should be in addition to the included error handling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing Nick's thoughts on here, I guess it would be good to have the definition of what are the conditions that cause an invalid certificate explicit somewhere (eg.: ref does not exist, ref exist but does not contain the right keys, ref is not of secret type tls, etc). This would probably be good for conformance as well.

I still think that for some cases it would be good to say that controllers CAN validate the certificate content, as I can think on numerous cases where a secret with a tls.crt key that contains multiple certificates has some sort of bad formation (invalid PEM, etc) that can cause the whole gateway to go down if it tries to blindly use the certificate content.

Again, not a MUST but probably a recommendation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that adding some explicit examples of what makes a reference invalid is fine - but the second item - "ref exist but does not contain the right keys" requires decoding the Secret, so it can only be a MAY.

trusted certificates, then the associated TLS connection must fail.
trusted certificates, the BackendTLSPolicy is considered invalid.

For an invalid BackendTLSPolicy, implementations MUST NOT fall back to unencrypted (plaintext) connections.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that "Invalid" isn't the right term here because BTP spec is validated by CEL, making an invalid BTP impossible.

If this refers to a hostname or certificate mismatch with backend configuration, please state that explicitly.
Besides, this case is already covered with implementation specific way.
.

On the question of how to signal that there was a failure in the certificate validation, this is left up to the implementation to return a response error that is appropriate, such as one of the HTTP error codes: 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), or other signal that makes the failure sufficiently clear to the requester without revealing too much about the transaction, based on established security requirements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed off-channel, I had two cases in mind where I would consider a BackendTLSPolicy to be invalid, apart from the standard CEL validation or runtime errors:

If the CertificateRef cannot be resolved or does not include a certificate (tls.crt), the BackendTLSPolicy is considered invalid.

If WellKnownCACertificates is set to "System" and there are no system trusted certificates or the implementation doesn't define system
trusted certificates, the BackendTLSPolicy is considered invalid.

My main concern for raising this is to understand, how would the BackendTLSPolicy signal that the referenced resource in CertificateRef does not exist or cannot be resolved?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can created new conditions similar to "ResolvedRefs" condition which is dedicated for listeners?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's exactly what I've suggested in #3909 (comment)

An example structure could look like this:

status:
  ancestors:
  - ancestorRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: gw
    conditions:
    - type: Accepted
      reason: Accepted
      status: "True"
      message: BackendTLSPolicy is accepted
    - type: ResolvedRefs
      reason: InvalidCertificateRef | UnsuppportedWellKnownCACertificates
      status: "False"
      message: (implementation specific error)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 that anything that references other objects should include a ResolvedRefs condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kl52752 has a point about the pre-existing text regarding failures to connect. I prefer this over an HTTP 5XX error response. What do you think @robscott @youngnick ?

On the question of how to signal that there was a failure in the certificate validation, this is left up to the implementation to return a response error that is appropriate, such as one of the HTTP error codes: 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), or other signal that makes the failure sufficiently clear to the requester without revealing too much about the transaction, based on established security requirements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To go back to the original point, in this document we're defining an "invalid" BackendTrafficPolicy to be one that either uses an unsupported feature (WellKnownCACertificates) or has zero valid CertificateRefs, and that's a bit different to "has no syntactic errors", which CEL prevents.


For an invalid BackendTLSPolicy, implementations MUST NOT fall back to unencrypted (plaintext) connections.
Instead, the corresponding TLS connection MUST fail, and the client MUST receive an HTTP error response.
Additionally, the `Accepted` status condition of the BackendTLSPolicy MUST be set to `False` with the reason `Invalid`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be difficult to implement due to the distinct processes of policy application and certificate validation against the backend. It's a valid scenario that the BTP was accepted, and configuration properly propagated by the controller, but connectivity is broken due to certificate misconfiguration. All information needed to debug this issue should be passed by inspecting the BTP spec against the Service to which the Policy was applied.

@snorwin
Copy link
Member Author

snorwin commented Jul 22, 2025

To evaluate the current behavior of various gateway implementations, I performed a test using envoy-gateway, kgateway, nginx-gateway-fabric and the airlock-microgateway. The goal was to observe how each reacts when a BackendTLSPolicy is configured with a CertificateRefs that points to a resource which does not exist, e.g.:

    caCertificateRefs:
      - kind: ConfigMap
        group: ""
        name: does-not-exist

Result:

    - ancestorRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: envoy-gateway
        namespace: envoy
      conditions:
      - lastTransitionTime: "2025-07-22T06:07:05Z"
        message: No ca found in referred configmaps.
        observedGeneration: 1
        reason: Invalid
        status: "False"
        type: Accepted
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    - ancestorRef:
        group: ""
        kind: Service
        name: backend
      conditions:
      - lastTransitionTime: "2025-07-22T06:09:33Z"
        message: 'Policy error: "configmap default/does-not-exist not found"'
        reason: Invalid
        status: "False"
        type: Accepted
      controllerName: kgateway.dev/kgateway
    - ancestorRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: airlock-microgateway
        namespace: airlock
      conditions:
      - lastTransitionTime: "2025-07-22T06:10:31Z"
        message: |-
          Resolving BackendTLSPolicy failed
          Missing referenced ConfigMap 'does-not-exist'
        observedGeneration: 1
        reason: Invalid
        status: "False"
        type: Accepted
      controllerName: microgateway.airlock.com/gatewayclass-controller

Observations:

  • kgateway sets ancestorRef to the target Service, others set it to the Gateway.
  • All implementations (except nginx-gateway-fabric) set Accepted condition to False with reason Invalid for missing resources.
  • envoy-gateway accepts the policy if at least one CertificateRefs is valid.
  • envoy-gateway and kgateway overwrite each other’s status updates.
  • nginx-gateway-fabric sets an implementation-specific reason on the ResolvedRefs condition on the HTTPRoute's status:
          - lastTransitionTime: "2025-07-22T06:07:05Z"
            message: 'Failed to process route rule 0 backendRef 0: no ca found in referred
              configmaps.'
            observedGeneration: 1
            reason: InvalidBackendTLS
            status: "False"
            type: ResolvedRefs
          controllerName: gateway.envoyproxy.io/gatewayclass-controller
    
  • In cases where the BackendTLSPolicy is invalid envoy-gateway, nginx-gateway-fabric and airlock-microgateway respond to incoming traffic with an HTTP 500 error. In contrast, kgateway proceeds to connect to the backend over plaintext, despite the invalid policy.

Conclusion:
IMO the behavior of BackendTLSPolicy in cases where CertificateRefs point to non-existent resources should not be left to implementation-specific handling. Allowing each implementation to behave differently leads to confusion for the user. To ensure consistency, it is important that we define an opinionated design for this scenario, providing a predictable and well-defined API behavior.

Signed-off-by: Norwin Schnyder <[email protected]>
@snorwin snorwin force-pushed the gep-backendtlspolicy branch from 2a97e7b to 49457d6 Compare July 23, 2025 05:53
@robscott
Copy link
Member

Thanks @snorwin, this is a great reference point!

  • kgateway sets ancestorRef to the target Service, others set it to the Gateway.

It seems like we should settle on Gateway as the ancestor as that's the thing that's truly unique. With that said, if all Gateways for a given controllerName are guaranteed to have the same behavior, I'd recommend

  • All implementations (except nginx-gateway-fabric) set Accepted condition to False with reason Invalid for missing resources.

To clarify, is this only true when all refs are invalid, or is it also true when there's a mix of valid and invalid refs?

  • envoy-gateway accepts the policy if at least one CertificateRefs is valid.

This seems correct to me and lines up with my related comment.

  • envoy-gateway and kgateway overwrite each other’s status updates.

This feels like it could benefit from some conformance test coverage as I doubt they're the only ones doing this.

  • nginx-gateway-fabric sets an implementation-specific reason on the ResolvedRefs condition on the HTTPRoute's status

Interesting, maybe our docs aren't sufficiently clear on recommended status?

  • In cases where the BackendTLSPolicy is invalid envoy-gateway, nginx-gateway-fabric and airlock-microgateway respond to incoming traffic with an HTTP 500 error. In contrast, kgateway proceeds to connect to the backend over plaintext, despite the invalid policy.

I believe spec requires HTTP 500, definitely worth some conformance test coverage to ensure consistency here.

/cc @arkodg @danehans

@k8s-ci-robot k8s-ci-robot requested review from arkodg and danehans July 25, 2025 02:56
@snorwin
Copy link
Member Author

snorwin commented Jul 25, 2025

@robscott I created a follow-up issue (#3953) to cover PolicyAncestorStatus handling in the conformance tests.

@shaneutt shaneutt moved this to Review in Release v1.4.0 Jul 29, 2025
@shaneutt shaneutt added this to the v1.4.0 milestone Jul 29, 2025
@shaneutt shaneutt self-assigned this Jul 29, 2025
@youngnick
Copy link
Contributor

It seems like we should settle on Gateway as the ancestor as that's the thing that's truly unique. With that said, if all Gateways for a given controllerName are guaranteed to have the same behavior, I'd recommend

I think you missed the rest of this sentence @robscott...

@youngnick
Copy link
Contributor

I agree with @snorwin and @robscott about a few things:

  • If you support BackendTLSPolicy, then there absolutely should be no fallback for BackendTLSPolicy-covered Services.
  • ResolvedRefs Condition should be included and will be set to false if at least one CACertificateRef contains an error.
  • The entire Route should only not be Accepted if no CA certificates are readable (all CACertificateRefs are invalid)
  • No status should require decoding the certificates, whether they are in CACertificateRefs or the System ones.
  • Implementations MUST NOT change any other status than their own. (We need conformance tests about this for many other things as well)
  • We must specify what happens when a BackendTLSPolicy is configured and references a Service, but there is no valid CA config referenced; This should be either the exact status code 500 or at least a 500-class response. I prefer the former.

@shaneutt shaneutt requested a review from candita August 5, 2025 17:15
@candita
Copy link
Contributor

candita commented Aug 5, 2025

@youngnick, regarding

  • The entire Route should only not be Accepted if no CA certificates are readable (all CACertificateRefs are invalid)

You do mean, the BackendTLSPolicy, not the Route, right?

@youngnick
Copy link
Contributor

Yes, sorry. I was also thinking about partial Route invalidity.

@snorwin snorwin requested a review from candita August 8, 2025 07:40
@candita
Copy link
Contributor

candita commented Aug 8, 2025

As much as I'd like to see this get into v1.4.0, at this stage in the release cycle, I'd like to take the pulse of the community. For any implementations that pass current conformance tests, this change would require new conformance tests that they may not pass for v1.4.0.

/hold

@youngnick
Copy link
Contributor

Okay, I guess I wasn't clear before, so I'll try to be more clear now.

To me, adding the ResolvedRefs Condition is a requirement for Standard because:

  • it makes troubleshooting for end users much easier
  • it forces us to clarify what happens in various error cases (for example, one invalid CertificateRef vs all invalid)
  • Adding this after this transition point is actually worse for implementations, because they may end up accepting things that turn out to be invalid later

We only get one shot at this transition, it's better to do the work here to make it happen, and implementations can catch up after the fact. It's better to front-load big changes all into one change than to do them piecemeal, in this case.

For the details of the ResolvedRefs Condition, for some of the feedback:

  • I don't agree that 400-class errors are appropriate here - if a certificate is misconfigured, then that is exactly what a 500-class error is intended to describe. In the case that the Gateway implementation attempts to connect to a backend and fails because it does not have the correct TLS details (which is what will happen when a CertificateReference is invalid), then a 500-class error is appropriate. (tbh in that case, a 503 feels right to me, since this is an internal server error from the client's POV).
  • Once we add the ResolvedRefs condition, we are locking in partial validity - which I think is good. But we should be clear about the rules: a data path MUST NOT use a certificate that is invalid at all, and MUST reject any connection that uses it.
  • Implementations MAY decode certificates to further check validity, and if they do, they MUST mark certificates that fail their own validity checks by setting ResolvedRefs to false, and use a domain-prefixed implementation specific reason (like implementation.io/InvalidKeys, for example) in addition to specifying the invalid references in the message field. This functionality will be ImplementationSpecific support and will not be conformance tested.
  • For message, the message is specifically not intended to be machine-parseable, so we can really only say that the message SHOULD contain the details of the invalid references, not MUST.
  • The way to distinguish between any or all references being invalid is then: if any CertificateRef is not valid, but at least one CertificateRef is valid, then ResolvedRefs is present and the BackendTrafficPolicy is Accepted. If zero CertificateRefs are valid, then ResolvedRefs is present and the BackendTrafficPolicy is NOT Accepted. (This may actually be different behavior to what the conformance tests currently reflect for HTTPRoute partial validity, but I think it makes more sense.

I've added some specific comments to the file with some wording here.

I think this is very close, and then folks can get on to writing more conformance tests to check these behaviors.

@snorwin
Copy link
Member Author

snorwin commented Aug 11, 2025

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Aug 11, 2025
Signed-off-by: Norwin Schnyder <[email protected]>
@shaneutt
Copy link
Member

shaneutt commented Aug 12, 2025

We talked about this on the community call today:

There was general agreement on the call that in terms of WHAT is implemented, there's a preference for the mixed-mode where some misconfigured certificates are tolerated, and statuses to reflect the misconfiguration are propagated.

Depending on whether we can get this done before code freeze for v1.4.0 (August 26th), we may soon need to make a decision whether we consider this critical to release BackendTLSPolicy to Standard, or whether it can be added in a later release.

Copy link
Member

@shaneutt shaneutt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following today's community call and my re-review of this PR I am convinced that defining this behavior is important. I support the "mixed-mode" approach, which allows some invalid references without causing the entire BackendTLSPolicy to fail.

My reasoning stems from various scenarios, particularly the one I mentioned during the call: when the individual creating the references is different from the one managing the certificates. It would be counterintuitive and potentially damaging for a certificate manager to remove a certificate (to revoke it) without awareness of its reference, leading to a complete disruption of traffic instead of just a portion.

So generally I approve:

/approve

Since we're a couple weeks out from v1.4.0 code freeze, we also have to decide on whether or not we would consider this and the spec and test updates required a blocker for GA, as mentioned here. I'm interested in hearing other people's thoughts especially those who have already implemented BackendTLSPolicy 🤔

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 12, 2025
@shaneutt shaneutt requested review from youngnick and kl52752 August 12, 2025 19:16
Signed-off-by: Norwin Schnyder <[email protected]>
@candita
Copy link
Contributor

candita commented Aug 13, 2025

@snorwin would you consider requiring a 502 rather than 500 error response?

@snorwin
Copy link
Member Author

snorwin commented Aug 13, 2025

@candita, from a Gateway API perspective, I would argue that the semantics of an invalid backend reference should be similar to those of a backend reference with an invalid BackendTLSPolicy. In that cases, an HTTP 500 error would be appropriate, which is also the behavior adopted by most current implementations.

From an Envoy perspective, however, an HTTP 503 might be more suitable, as it is equal to the case where TLS validation for an upstream connection fails.

In general, achieving a specific HTTP status code can be cumbersome, depending on the technology and implementation. Therefore, I would leave it to implementers to decide whether they prefer returning 500, 503, or even 502.

@shaneutt if we can get this PR merged by the end of the week, I’ll be able to prepare the spec changes and propose the necessary conformance tests in time (before the code freeze). IMO we should aim to include this in v1.4.0.

Copy link
Contributor

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment about domain-prefixing of Reason, but it's non blocking, so this LGTM.

To be specific as well: I do think that this, along with the API changes involved, is required to move BackendTLSPolicy to Standard. If we don't specify this stuff, it will be hard to get good conformance.

Copy link
Contributor

@kl52752 kl52752 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Looks good to me :)

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kl52752, shaneutt, snorwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaneutt shaneutt removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 13, 2025
@robscott
Copy link
Member

Thanks @snorwin!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit bb58b0d into kubernetes-sigs:main Aug 13, 2025
19 checks passed
@github-project-automation github-project-automation bot moved this from Review to Done in Release v1.4.0 Aug 13, 2025
@shaneutt
Copy link
Member

@shaneutt if we can get this PR merged by the end of the week, I’ll be able to prepare the spec changes and propose the necessary conformance tests in time (before the code freeze). IMO we should aim to include this in v1.4.0.

Sounds good, thank you very much. If you wouldn't mind creating an issue for this last part of the effort so I can sub-task it under #1897 for tracking, I would appreciate it. 🖖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/gep PRs related to Gateway Enhancement Proposal(GEP) lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Standardizing Behavior for Invalid BackendTLSPolicy
8 participants