Active health checks fail when using upstream TLS with client certificate #7384

@tsaarni

What steps did you take and what happened:

Prerequisites:

  • Contour uses a client certificate for Envoy upstream TLS.
  • HTTPProxy uses upstream TLS
  • HTTPProxy uses Envoy's active health checks.

All of these must be true for the bug to occur. If any one of them is removed, the problem goes away.

The Problem:

The upstream service is unavailable for a certain period of time:

  • Scenario 1: The client gets a 503 Service Unavailable "no healthy upstream" error for 4 minutes after Envoy restarts
    • This is counted from the moment Envoy is back up and ready with all configuration from Contour. It doesn't include the time Envoy takes to become ready to serve requests.
    • The downtime lasts for healthyThresholdCount * no_traffic_interval (for example: 4 * 60s = 4min).
    • The no_traffic_interval defaults to 60 seconds in Envoy and is not configurable in Contour.
  • Scenario 2: The client gets a 503 Service Unavailable "no healthy upstream" error for several seconds after rotating the Envoy client certificate.
    • The downtime lasts for healthyThresholdCount * intervalSeconds seconds (for example: 4 * 5s = 20s).

What did you expect to happen:

There should be no service interruption.

Anything else you would like to add:

I've added the steps to reproduce this in the comments below.

Environment:

  • Contour version:
  • Kubernetes version: (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Labels

kind/bug: Categorizes issue or PR as related to a bug.
