Skip to content

Conversation

@openshift-service-mesh-bot
Copy link

@openshift-service-mesh-bot openshift-service-mesh-bot commented Feb 9, 2026

Generated by Automator - 2026-02-11T05:04:32+00:00

krinkinmu and others added 4 commits February 6, 2026 07:19
* Reduce flakiness of the TestServiceRestart test

In my local environment the test flakes 2-3 times per 1000 runs. Looking
at the causes of the flakes I found two:

1. The echo process gets terminated prematurely - before the pod is
   removed from the list of service endpoints
2. DNS resolution fails with timeout - I've only seen this once, so it's
   quite rare and it's not an Istio issue, so I'm going to ignore this
   cause for now (we can potentially harden echo implementation by
   allowing retries of DNS failures, though)

The way a restart works in k8s deployment is roughly happens in the
following order:

- deployment brings up a new pod
- deployment waits for the new pod to become healthy
- deployment terminates the old pod

In parallel with termination of the new pod, if deployment pods are
behind a service, the terminating pod gets removed from the list of
the service endpoints.

The important thing here is that terminating pod and removing it from
the service endpoints list happens concurrently, so you could end up in
a situation, when the processes in the pod already terminated, but the
pod is still listed as a service endpoint and can get requests routed to
it. Obviously, such requests will fail.

In echo app implementation used in our tests we protect against that by
delaying echo process termination. The way it works is roughly as
follows:

1. Echo receives a SIGTERM from Kubelet - this is Kubelet's way to ask
   the pod politely to shut down.
2. Echo starts a 1 second timer:
   - if during this 1 second interval echo receives a request it rests
     the timer and start waiting for another 1 second
3. If Echo does not receive a request before timer elapses, Echo shuts
   down.

And basically the intention here is to wait for as long as we get
requests routed to the pod. Once requests stop, we assume that the pod
has been deleted from service endpoints list and it's safe to shut down.

The `TestServiceRestart` test currently sends a request every 100ms, so
during the 1 second the Echo waits for requests to come we will have ~10
attempts - that seems like enough, but let's count.

In the test setup normally we have 2 deployments (v1 and v2) each with 1
pod, so 2 pods in total. However, during the restart in one of the
deployments we first bring up a new pod and only after that start
terminating the old pod, so for a period of time we will have 3 pods to
choose from.

That means that each request, assuming uniformity, has about 2/3 chance
of *not* getting routed to the terminating pod. And assuming
independence all ~10 attempts can miss the terminating pod with about
1-2% chance, so it's small, but not insignificant.

If all 10 attempts miss the terminating pod, it will stop listening on
its ports and shut down. If by the time it happens the pod is not yet
removed from the service endpoint list it can still get new requests and
those requests, if routed to the pod, will fail, failing the test along
the way.

This PR changes the call interval from 100ms to 40ms. Due to an
exponential nature of the probaility calculation function, this change
results in a significantly lower chance of hitting the flake (e.g. more
than 100 times lower chance, though the calculations I did are rather
rough).

Related to #58226

Signed-off-by: Mikhail Krinkin <[email protected]>

* Fix spelling and formatting

Signed-off-by: Mikhail Krinkin <[email protected]>

---------

Signed-off-by: Mikhail Krinkin <[email protected]>
@MaxBab
Copy link
Contributor

MaxBab commented Feb 9, 2026

/retest

@MaxBab
Copy link
Contributor

MaxBab commented Feb 9, 2026

/test istio-integration-telemetry

@MaxBab
Copy link
Contributor

MaxBab commented Feb 9, 2026

/test unit

@MaxBab
Copy link
Contributor

MaxBab commented Feb 9, 2026

/test istio-integration-sail-pilot

…echo client/server (#58918)

* test: add support for customizing min TLS version and ECDH curves in echo client/server

Signed-off-by: Jacek Ewertowski <[email protected]>

* Break line to fix linter error

Signed-off-by: Jacek Ewertowski <[email protected]>

* Set TLS min version and curve preferences in the deployment template

Signed-off-by: Jacek Ewertowski <[email protected]>

* Deduplicate implementation of parse TLS settings

Signed-off-by: Jacek Ewertowski <[email protected]>

* Support only standard curve names

Signed-off-by: Jacek Ewertowski <[email protected]>

* Update curve names in flag descriptions

Signed-off-by: Jacek Ewertowski <[email protected]>

* Return error on invalid TLS version or curve preference

Signed-off-by: Jacek Ewertowski <[email protected]>

* Fix ineffassign issue

Signed-off-by: Jacek Ewertowski <[email protected]>

* Wrap parse errors

Signed-off-by: Jacek Ewertowski <[email protected]>

---------

Signed-off-by: Jacek Ewertowski <[email protected]>
@MaxBab
Copy link
Contributor

MaxBab commented Feb 9, 2026

/retest

@openshift-service-mesh-bot openshift-service-mesh-bot force-pushed the none-master-merge_upstream_istio_master-6253864e branch from a1b74c8 to c21412e Compare February 10, 2026 05:15
@openshift-ci openshift-ci bot added size/L and removed size/M labels Feb 10, 2026
@MaxBab
Copy link
Contributor

MaxBab commented Feb 10, 2026

/retest

1 similar comment
@MaxBab
Copy link
Contributor

MaxBab commented Feb 10, 2026

/retest

ilrudie and others added 5 commits February 10, 2026 02:48
* wip: add tls-inspector when using tls dynmaic dns so we have sni information for the dfp

Signed-off-by: Ian Rudie <[email protected]>

* release note

Signed-off-by: Ian Rudie <[email protected]>

* simplify, test

Signed-off-by: Ian Rudie <[email protected]>

---------

Signed-off-by: Ian Rudie <[email protected]>
* upstream/master:
  gatewayapi: make the cors origin stricter with wildcards (#59026)
  Automator: update istio/client-go@master dependency in istio/istio@master (#59036)
  Automator: update istio/client-go@master dependency in istio/istio@master (#59034)
  add tls-inspector for listener with only TLS ports (#59028)
  test: add support for customizing min TLS version and ECDH curves in echo client/server (#58918)
  Automator: update go-control-plane in istio/istio@master (#59025)
  improve bugreport output (#58977)
  update EOL information (#58991)
  Reduce flakiness of the TestServiceRestart test (#59022)
@openshift-service-mesh-bot openshift-service-mesh-bot force-pushed the none-master-merge_upstream_istio_master-6253864e branch from c21412e to d1b8fcd Compare February 11, 2026 05:04
@MaxBab
Copy link
Contributor

MaxBab commented Feb 11, 2026

/retest

@openshift-merge-bot openshift-merge-bot bot merged commit e1ade98 into openshift-service-mesh:master Feb 11, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants