Conversation

@stankevich (Contributor) commented on May 22, 2025:

The end-to-end tests have been pretty flaky lately, which makes it harder to trust them and slows down pull requests. Let's try to get this fixed! Here, I focused on the initial Kind cluster setup, port forwarding, teardown, and some of the most temperamental parts of the test suite. Quick summary of the changes:

  • Added a cache mount for Go module and build caches in the Docker image. On my machine, this frees up disk space and cuts image build time from 40+ seconds to 1–5 seconds, depending on the code changes.
  • Moved the Kind cluster setup steps into a function, DRYing up the code across all tests.
  • Did the same for the teardown logic.
  • Replaced hostPath volumes with persistentVolumeClaim volumes. That simplified the code, cluster setup, and file permissions, and removed the occasional chmod-ing of backup directories.
  • Made port forwarding more robust by waiting for expected ports to become ready to serve requests. That helped remove a few sleeps from the tests.
  • Reduced VTGate's tablet_refresh_interval from 60 to 10 seconds (same as in vttestserver), which speeds up the initial Vitess cluster setup and upgrades.
  • Added a post-upgrade cluster spec verification step. Combined with the previous point, that stabilized the most frequently failing upgrade test.
  • Moved commerce keyspace verification commands into a function, DRYing up a decent chunk of the code.
  • Removed the local binary build as it isn't used in the tests.
  • Decreased the timeouts of Buildkite jobs and added retries on unexpected Buildkite Agent failures.
  • Added support for running the tests in environments where multiple Buildkite Agents share a single Docker service.
  • Reduced the output when installing or downloading the tools. Fixed the output when building the Docker image. Reduced MySQL client warnings.
  • Other minor tweaks that help with test stability and general cleanup.

I ran some "before" and "after" comparisons. On my machine, the average time for 7 test runs on the main branch was 36 minutes 1 second, with a standard deviation of 1 minute 4 seconds. After these changes, the average dropped to 27 minutes 57 seconds, with a standard deviation of 1 minute 1 second. More importantly though, I haven't seen any of the flakiness that was happening before. Will see what CI says and make further improvements if needed.

The diff of this pull request is pretty big, and it's much easier to review commit-by-commit. I decided not to split it into multiple pull requests because I was worried that flaky failures in intermediate ones would stall all of them. 😅

Related to #582.

stankevich added 19 commits on May 21, 2025
Without a cache mount, any change to the repository results in a new
image build, even if there are no changes to the operator's codebase.
Not only does each build take over 40 seconds (on my machine), it also
creates about 4 GB of Docker layers every time. This adds up quickly
and can fill up the Docker disk allocation after only a handful of runs
on a Mac.

This change drops the image build time from 40+ to 2 seconds. The layers
that store the Go module and build caches are reused across builds.
As a result, the disk growth between builds is not nearly as noticeable.

It should be a completely transparent change locally and in CI.
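
For reference, the resulting Dockerfile fragment looks roughly like this (the exact changed line is also quoted in the review discussion below):

    WORKDIR /go/src/planetscale.dev/vitess-operator
    COPY . /go/src/planetscale.dev/vitess-operator
    # Cache mounts keep the Go module and build caches out of the image layers,
    # so they survive the layer invalidation triggered by COPY:
    RUN --mount=type=cache,target=/go/pkg/mod \
        --mount=type=cache,target=/root/.cache/go-build \
        CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager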

Signed-off-by: Sergey Stankevich <[email protected]>
To be used at the start of the test run. Will DRY up the setup phase across all
tests.
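
A rough sketch of what the shared setup function can look like (the function and variable names are illustrative, not the exact utils.sh implementation; the echoed steps match the CI output quoted further down):

    setupKindCluster() {
      echo "Creating Kind cluster"
      kind create cluster --name "$CLUSTER_NAME" --image "$KIND_NODE_IMAGE"
      echo "Loading docker image into Kind cluster"
      kind load docker-image vitess-operator-pr:latest --name "$CLUSTER_NAME"
      echo "Creating the example namespace"
      kubectl create namespace example
    }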

Signed-off-by: Sergey Stankevich <[email protected]>
It makes sense to keep these cluster setup routines together. This also
makes the function Shellcheck-clean.

Signed-off-by: Sergey Stankevich <[email protected]>
Wraps the `killall kubectl` and `./pf.sh` calls into a reusable function that
also checks that the forwarded ports are ready to serve requests.
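
A minimal sketch of the idea (the function name, port-check tool, and timeout are illustrative):

    restartPortForwarding() {
      killall kubectl 2>/dev/null || true
      ./pf.sh >/dev/null 2>&1 &
      local port i
      for port in "$@"; do
        # Wait up to 60s for each forwarded port to accept connections.
        for i in $(seq 1 60); do
          nc -z localhost "$port" && break
          sleep 1
        done
      done
    }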

Signed-off-by: Sergey Stankevich <[email protected]>
Once the port forwarding is set up and verified to be working, a
waitForKeyspaceToBeServing() call makes sure the keyspace is alive.
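
The helper itself isn't shown in this thread; one illustrative way to implement such a check is to poll vtgate's `show vitess_tablets` output until the keyspace reports a SERVING tablet (the forwarded port 15306 and the grep-based matching are assumptions):

    waitForKeyspaceToBeServing() {
      local keyspace=$1 shard=$2 i
      for i in $(seq 1 60); do
        # Assumes pf.sh forwards vtgate's MySQL port to localhost:15306.
        if mysql -h 127.0.0.1 -P 15306 -e "show vitess_tablets" \
            | grep "$keyspace" | grep "$shard" | grep -q "SERVING"; then
          return 0
        fi
        sleep 1
      done
      echo "Timed out waiting for keyspace $keyspace/$shard to be serving"
      return 1
    }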

Signed-off-by: Sergey Stankevich <[email protected]>
This results in a significantly quicker Vitess cluster startup, which
reduces the probability of race conditions caused by a delayed start of
Vitess components.
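
In VitessCluster terms this is a single vtgate flag; a hedged sketch of where it can be set (the exact placement in the test manifests is an assumption):

    spec:
      cells:
        - name: zone1
          gateway:
            extraFlags:
              # Match vttestserver's faster refresh; the default was 60s.
              tablet_refresh_interval: 10s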

Signed-off-by: Sergey Stankevich <[email protected]>
Adds a checkPodSpecBySelectorWithTimeout() function that waits for
post-upgrade spec changes to take effect. The new flow verifies that all
components have been recreated with the latest changes before continuing
with other verifications.
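
A minimal sketch of such a wait (the selector handling, image-only comparison, and timeout are illustrative simplifications of whatever the real function checks):

    checkPodSpecBySelectorWithTimeout() {
      local selector=$1 expected_image=$2 timeout=${3:-300}
      local waited=0 images
      while [ "$waited" -lt "$timeout" ]; do
        images=$(kubectl get pods -l "$selector" -o \
          jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}')
        # Succeed once pods exist and every one of them runs the expected image.
        if [ -n "$images" ] && ! echo "$images" | grep -qv "^${expected_image}$"; then
          return 0
        fi
        sleep 5
        waited=$((waited + 5))
      done
      echo "Timed out waiting for pods matching '$selector' to be updated"
      return 1
    }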

Signed-off-by: Sergey Stankevich <[email protected]>
Simplifies the function by removing one kubectl call and makes it
Shellcheck-clean.

Signed-off-by: Sergey Stankevich <[email protected]>
DRYing up the code a little, and removing unnecessary `sleep`s.

Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Adds a check that waits for the VTAdmin port to be ready. Removes
unnecessary sleeps.

Signed-off-by: Sergey Stankevich <[email protected]>
A tiny improvement, but every bit helps. Everything depends on etcd, so it
should be checked first.

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich (Contributor, Author) commented:

Of course all of them failed! 😁 Is there any way for me to see the reason? Pasting the relevant errors here would be helpful. Thanks!

@GuptaManan100 (Contributor) commented:

This is the error message I see in the Buildkite output:

[2025-05-22T09:36:40Z]  => => naming to docker.io/library/vitess-operator-pr:latest                                                                                               0.0s
[2025-05-22T09:36:40Z] Setting up the Kind config
[2025-05-22T09:36:40Z] ./test/endtoend/utils.sh: line 512: vtdataroot/config.yaml: No such file or directory
[2025-05-22T09:36:40Z] Creating Kind cluster
[2025-05-22T09:36:40Z] ERROR: failed to create cluster: error reading file: open ./vtdataroot/config.yaml: no such file or directory
[2025-05-22T09:36:40Z] Loading docker image into Kind cluster
[2025-05-22T09:36:40Z] ERROR: no nodes found for cluster "kind-0196f755-c056-4ed2-a1fc-d65fa5775d3d"
[2025-05-22T09:36:40Z] Error response from daemon: network kind not found
[2025-05-22T09:36:40Z] ./test/endtoend/utils.sh: line 449: /root/.kube/config: No such file or directory
[2025-05-22T09:36:40Z] Creating the example namespace
[2025-05-22T09:36:40Z] The connection to the server localhost:8080 was refused - did you specify the right host or port?
[2025-05-22T09:36:40Z] Apply latest operator-latest.yaml
[2025-05-22T09:36:40Z] error: error validating "operator-latest.yaml": error validating data: failed to download openapi: Get "http://localhost:8080/openapi/v2?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false
[2025-05-22T09:36:40Z] E0522 09:36:40.566215   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.567677   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.569088   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.570509   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.571900   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] The connection to the server localhost:8080 was refused - did you specify the right host or port?

Create the backup directory in /workdir, not in the Buildkite Agent path

Signed-off-by: Sergey Stankevich <[email protected]>
It isn't used in the tests. The Docker image build would fail just as early if
there are issues with building the binary.

Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from 3e95812 to 88f2f46 on May 26, 2025 at 08:47
* Chromium isn't used in the unmanaged tablet test
* MySQL Server isn't used anywhere

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from 88f2f46 to b033d0a on May 26, 2025 at 09:36
Using `hostPath` as a volume in Kind adds a bit of complexity: a configuration file must be generated, a directory must be pre-created, permissions need to be set, a cleanup step is required when the cluster is destroyed, and so on. Worse, permissions on backup subdirectories have to be corrected midway through test runs just to let the backup subcontroller "see" the backups and surface them in the `kubectl get vitessbackups` output. This is pretty fragile and prone to edge cases.

A Kubernetes-native way to deal with this issue is setting `securityContext.fsGroup` on all workloads involved in taking and reading backups. Unfortunately, `fsGroup` doesn't work with `hostPath` in Kubernetes or Kind (rancher/local-path-provisioner#41 (comment), kubernetes-sigs/kind#830). Additionally, using `hostPath` volumes is generally discouraged (https://kubernetes.io/docs/concepts/storage/volumes/#hostpath).

Hence this commit. It replaces `hostPath` volumes with `persistentVolumeClaim` volumes. This makes `fsGroup` work and eliminates all `chmod`-ing. It also removes Kind config generation and host-container directory sharing. The teardown process also becomes simpler, as backup files are automatically removed during Kind cluster deletion. Lastly, this setup is slightly more representative of real-world usage.

To make it work, `securityContext.fsGroup` of the operator Deployment must be set to the `fsGroup` value of Vitess workloads, which is 999 by default. The backup subcontroller then inherits the `fsGroup` of the operator. Otherwise, it runs as UID 1001, which doesn't have permissions to read Vitess backup files from the mounted persistent volume.
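
A hedged sketch of the two pieces this boils down to (names and sizes are illustrative):

    # The backup volume becomes a regular PVC instead of a hostPath:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: vitess-backup
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
    ---
    # And the operator Deployment (under spec.template.spec) matches the Vitess
    # workloads' default fsGroup so the backup subcontroller can read the files
    # on the mounted volume:
    securityContext:
      fsGroup: 999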

Signed-off-by: Sergey Stankevich <[email protected]>
The tests do not reliably work in environments where multiple Buildkite Agents share a single Docker service. The issues arise because `BUILDKITE_BUILD_ID` is identical for different tests triggered by the same commit. As the Kind cluster name uses `BUILDKITE_BUILD_ID`, this can lead to multiple tests attempting to use the same Kind cluster. Additionally, in these shared environments, it isn't uncommon for `docker container ls` to return multiple results, which breaks the `setupKubectlAccessForCI()` function.

This change switches from using `BUILDKITE_BUILD_ID` to `BUILDKITE_JOB_ID`, which is unique for each test case. It also moves away from `docker container ls` to `hostname -s` to determine the container name. This works well because the hostname remains constant throughout the container's lifetime.
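
In shell terms, the change amounts to something like this (variable names are illustrative):

    CLUSTER_NAME="kind-${BUILDKITE_JOB_ID}"  # unique per test case, unlike BUILDKITE_BUILD_ID
    CONTAINER_NAME="$(hostname -s)"          # stable for the container's entire lifetime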

Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Normally, all tests except for the upgrade test should complete in under
10 minutes. The upgrade test should be done in under 20 minutes.
Anything longer than that almost guarantees that the test will fail, so
let's make it fail faster to free up resources and get feedback sooner.

Signed-off-by: Sergey Stankevich <[email protected]>
Buildkite Agents can be marked as lost if they can't talk to the Buildkite
API. They can also get shut down at any time because the instance they
are running on receives a termination signal (e.g., spot instances).
This change handles the most common signals and retries the affected job if
the agent running it has been shut down unexpectedly.
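
A hedged sketch of what the corresponding Buildkite step settings can look like, covering both this commit and the timeout change above (the label, command, and limits are illustrative):

    steps:
      - label: "e2e: upgrade test"
        command: test/endtoend/upgrade_test.sh
        timeout_in_minutes: 20
        retry:
          automatic:
            # Exit status -1 is what Buildkite reports when the agent is lost.
            - exit_status: -1
              limit: 2
            - signal_reason: agent_stop
              limit: 2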

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from ac33fe2 to e6e4d70 on May 29, 2025 at 10:25
Suppresses this MySQL client warning: "WARNING: option --ssl-verify-server-cert is disabled, because of an insecure passwordless login."

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from ea7b762 to 4d35938 on May 29, 2025 at 15:11
@stankevich (Contributor, Author) commented:

I made a few more tweaks to stabilize the tests, handle the failures better, and clean up the output. The original pull request description has been updated. Two notable changes since opening this pull request:

  • The hostPath volumes (backup storage) have been replaced with persistentVolumeClaim volumes. This is a more Kubernetes-native approach and is closer to real-world usage outside of using object storage. It significantly simplifies the configuration, cluster setup, permissions, and cleanup.
  • Decreased the timeouts of Buildkite jobs and added retries for unexpected Buildkite Agent failures. This catches sporadic failures caused by external factors, such as spot instance interruptions. It also handles occasional test timeouts, although they are now quite rare. Still, it would be nice to eliminate them completely and not rely on retries.

At this point, the tests are passing fairly reliably. The only one that occasionally gets stuck is the Backup Schedule Test (example). I haven't been able to get to the bottom of it yet. It times out because one out of three tablets fails to become ready after the initial deployment. This only seems to happen in this specific test and only on the public Buildkite Elastic CI Stack in this repository. I haven't been able to reproduce it once in over 30 local test runs or over 50 runs in our custom Buildkite Agent setup, which runs in Kubernetes and uses the Buildkite pipeline from this repository. In any case, I'll keep an eye on the tests and address any further failures that come up.

This is ready to review. I'm happy to split it into multiple pull requests if that would help with the review process. Thanks!

@deepthi requested review from GuptaManan100 and frouioui on June 2, 2025
@GuptaManan100 (Contributor) left a comment:

Looks good to me!

@frouioui (Member) left a comment:

This is really good. Thank you for doing this!

 WORKDIR /go/src/planetscale.dev/vitess-operator
 COPY . /go/src/planetscale.dev/vitess-operator
-RUN CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager
+RUN --mount=type=cache,target=/go/pkg/mod --mount=type=cache,target=/root/.cache/go-build CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager
A Member commented:

Why was this change necessary?

@stankevich (Contributor, Author) replied:
I've added some background to the commit message. Quick summary: before this change, every image build, even after a small and unrelated change (like a README update), would redownload all dependencies and recompile the binary from scratch. This was caused by the COPY . step invalidating the cached Docker layer and running a fresh go install.

This change enables two cache mounts: one for Go modules to avoid redownloading dependencies, and one for build caches to reuse previously built artifacts. That allows go install to reuse outputs from previous builds and cuts down the overall image build time, even if COPY . invalidates the cached layer.

Locally, the speedup is significant, going from 40+ seconds to only a few seconds. In CI, it depends on the setup, but it's never slower than before and can be much faster if the image build runs on the same Buildkite Agent (backed by a Docker service with the cache mounts).

The Member replied:
That sounds great to me!

@frouioui merged commit db1dc3b into planetscale:main on Jun 13, 2025
12 checks passed
@stankevich deleted the refactor-e2e-tests branch on June 16, 2025