Conversation

@stankevich (Contributor) commented on May 22, 2025:

The end-to-end tests have been pretty flaky lately, which makes it harder to trust them and slows down pull requests. Let's try to get this fixed! Here, I focused on the initial Kind cluster setup, port forwarding, teardown, and some of the most temperamental parts of the test suite. Quick summary of the changes:

  • Added a cache mount for Go module and build caches in the Docker image. On my machine, this frees up disk space and cuts image build time from 40+ seconds to 1–5 seconds, depending on the code changes.
  • Moved the Kind cluster setup steps into a function, DRYing up the code across all tests.
  • Did the same for the teardown logic.
  • Replaced hostPath volumes with persistentVolumeClaim volumes. That simplified the code, cluster setup, and file permissions, and removed the occasional chmod-ing of backup directories.
  • Made port forwarding more robust by waiting for expected ports to become ready to serve requests. That helped remove a few sleeps from the tests.
  • Reduced VTGate's tablet_refresh_interval from 60 to 10 seconds (same as in vttestserver), which speeds up the initial Vitess cluster setup and upgrades.
  • Added a post-upgrade cluster spec verification step. Combined with the previous point, that stabilized the most frequently failing upgrade test.
  • Moved commerce keyspace verification commands into a function, DRYing up a decent chunk of the code.
  • Removed the local binary build as it isn't used in the tests.
  • Decreased the timeouts of Buildkite jobs and added retries on unexpected Buildkite Agent failures.
  • Added support for running the tests in environments where multiple Buildkite Agents share a single Docker service.
  • Reduced the output when installing or downloading the tools. Fixed the output when building the Docker image. Reduced MySQL client warnings.
  • Other minor tweaks that help with test stability and general cleanup.

I ran some "before" and "after" comparisons. On my machine, the average time for 7 test runs on the main branch was 36 minutes 1 second, with a standard deviation of 1 minute 4 seconds. After these changes, the average dropped to 27 minutes 57 seconds, with a standard deviation of 1 minute 1 second. More importantly though, I haven't seen any of the flakiness that was happening before. Will see what CI says and make further improvements if needed.

The diff of this pull request is pretty big, and it's much easier to review commit-by-commit. I decided not to split it into multiple pull requests because I was worried that flaky failures in intermediate ones would stall all of them. 😅

Related to #582.

stankevich added 19 commits on May 21, 2025
Without a cache mount, any change to the repository results in a new
image build, even if there are no changes to the operator's codebase.
Not only does each build take over 40 seconds (on my machine), it also
creates about 4 GB of Docker layers every time. This adds up quickly
and can fill up the Docker disk allocation after only a handful of runs
on a Mac.

This change drops the image build time from 40+ to 2 seconds. The layers
that store the Go module and build caches are reused across builds.
As a result, the disk growth between builds is not nearly as noticeable.

It should be a completely transparent change locally and in CI.
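
For reference, the resulting Dockerfile fragment looks roughly like this (the exact changed line is also quoted in the review discussion below):

    WORKDIR /go/src/planetscale.dev/vitess-operator
    COPY . /go/src/planetscale.dev/vitess-operator
    # Cache mounts keep the Go module and build caches out of the image layers,
    # so they survive the layer invalidation triggered by COPY:
    RUN --mount=type=cache,target=/go/pkg/mod \
        --mount=type=cache,target=/root/.cache/go-build \
        CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager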

Signed-off-by: Sergey Stankevich <[email protected]>
To be used at the start of the test run. Will DRY up the setup phase across all
tests.
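
A rough sketch of what the shared setup function can look like (the function and variable names are illustrative, not the exact utils.sh implementation; the echoed steps match the CI output quoted further down):

    setupKindCluster() {
      echo "Creating Kind cluster"
      kind create cluster --name "$CLUSTER_NAME" --image "$KIND_NODE_IMAGE"
      echo "Loading docker image into Kind cluster"
      kind load docker-image vitess-operator-pr:latest --name "$CLUSTER_NAME"
      echo "Creating the example namespace"
      kubectl create namespace example
    }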

Signed-off-by: Sergey Stankevich <[email protected]>
It makes sense to keep these cluster setup routines together. This also
makes the function Shellcheck-clean.

Signed-off-by: Sergey Stankevich <[email protected]>
Wraps the `killall kubectl` and `./pf.sh` calls into a reusable function that
also checks that the forwarded ports are ready to serve requests.
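
A minimal sketch of the idea (the function name, port-check tool, and timeout are illustrative):

    restartPortForwarding() {
      killall kubectl 2>/dev/null || true
      ./pf.sh >/dev/null 2>&1 &
      local port i
      for port in "$@"; do
        # Wait up to 60s for each forwarded port to accept connections.
        for i in $(seq 1 60); do
          nc -z localhost "$port" && break
          sleep 1
        done
      done
    }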

Signed-off-by: Sergey Stankevich <[email protected]>
Once the port forwarding is set up and verified to be working, a
waitForKeyspaceToBeServing() call makes sure the keyspace is alive.
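
The helper itself isn't shown in this thread; one illustrative way to implement such a check is to poll vtgate's `show vitess_tablets` output until the keyspace reports a SERVING tablet (the forwarded port 15306 and the grep-based matching are assumptions):

    waitForKeyspaceToBeServing() {
      local keyspace=$1 shard=$2 i
      for i in $(seq 1 60); do
        # Assumes pf.sh forwards vtgate's MySQL port to localhost:15306.
        if mysql -h 127.0.0.1 -P 15306 -e "show vitess_tablets" \
            | grep "$keyspace" | grep "$shard" | grep -q "SERVING"; then
          return 0
        fi
        sleep 1
      done
      echo "Timed out waiting for keyspace $keyspace/$shard to be serving"
      return 1
    }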

Signed-off-by: Sergey Stankevich <[email protected]>
This results in a significantly quicker Vitess cluster startup, which
reduces the probability of race conditions caused by a delayed start of
Vitess components.
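
In VitessCluster terms this is a single vtgate flag; a hedged sketch of where it can be set (the exact placement in the test manifests is an assumption):

    spec:
      cells:
        - name: zone1
          gateway:
            extraFlags:
              # Match vttestserver's faster refresh; the default was 60s.
              tablet_refresh_interval: 10s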

Signed-off-by: Sergey Stankevich <[email protected]>
Adds a checkPodSpecBySelectorWithTimeout() function that waits for
post-upgrade spec changes to take effect. The new flow verifies that all
components have been recreated with the latest changes before continuing
with other verifications.
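
A minimal sketch of such a wait (the selector handling, image-only comparison, and timeout are illustrative simplifications of whatever the real function checks):

    checkPodSpecBySelectorWithTimeout() {
      local selector=$1 expected_image=$2 timeout=${3:-300}
      local waited=0 images
      while [ "$waited" -lt "$timeout" ]; do
        images=$(kubectl get pods -l "$selector" -o \
          jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}')
        # Succeed once pods exist and every one of them runs the expected image.
        if [ -n "$images" ] && ! echo "$images" | grep -qv "^${expected_image}$"; then
          return 0
        fi
        sleep 5
        waited=$((waited + 5))
      done
      echo "Timed out waiting for pods matching '$selector' to be updated"
      return 1
    }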

Signed-off-by: Sergey Stankevich <[email protected]>
Simplifies the function by removing one kubectl call and makes it
Shellcheck-clean.

Signed-off-by: Sergey Stankevich <[email protected]>
DRYing up the code a little, and removing unnecessary `sleep`s.

Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Adds a check that waits for the VTAdmin port to be ready. Removes
unnecessary sleeps.

Signed-off-by: Sergey Stankevich <[email protected]>
A tiny improvement, but every bit helps. Everything depends on etcd, so it
should be checked first.

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich (Contributor, Author) commented:

Of course all of them failed! 😁 Is there any way for me to see the reason? Pasting the relevant errors here would be helpful. Thanks!

@GuptaManan100 (Contributor) commented:

This is the error message I see in the Buildkite output:

[2025-05-22T09:36:40Z]  => => naming to docker.io/library/vitess-operator-pr:latest                                                                                               0.0s
[2025-05-22T09:36:40Z] Setting up the Kind config
[2025-05-22T09:36:40Z] ./test/endtoend/utils.sh: line 512: vtdataroot/config.yaml: No such file or directory
[2025-05-22T09:36:40Z] Creating Kind cluster
[2025-05-22T09:36:40Z] ERROR: failed to create cluster: error reading file: open ./vtdataroot/config.yaml: no such file or directory
[2025-05-22T09:36:40Z] Loading docker image into Kind cluster
[2025-05-22T09:36:40Z] ERROR: no nodes found for cluster "kind-0196f755-c056-4ed2-a1fc-d65fa5775d3d"
[2025-05-22T09:36:40Z] Error response from daemon: network kind not found
[2025-05-22T09:36:40Z] ./test/endtoend/utils.sh: line 449: /root/.kube/config: No such file or directory
[2025-05-22T09:36:40Z] Creating the example namespace
[2025-05-22T09:36:40Z] The connection to the server localhost:8080 was refused - did you specify the right host or port?
[2025-05-22T09:36:40Z] Apply latest operator-latest.yaml
[2025-05-22T09:36:40Z] error: error validating "operator-latest.yaml": error validating data: failed to download openapi: Get "http://localhost:8080/openapi/v2?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false
[2025-05-22T09:36:40Z] E0522 09:36:40.566215   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.567677   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.569088   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.570509   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] E0522 09:36:40.571900   14030 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused"
[2025-05-22T09:36:40Z] The connection to the server localhost:8080 was refused - did you specify the right host or port?

Create the backup directory in /workdir, not in the Buildkite Agent path

Signed-off-by: Sergey Stankevich <[email protected]>
It isn't used in the tests. The Docker image build would fail just as early if
there are issues with building the binary.

Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from 3e95812 to 88f2f46 on May 26, 2025 at 08:47
* Chromium isn't used in the unmanaged tablet test
* MySQL Server isn't used anywhere

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from 88f2f46 to b033d0a on May 26, 2025 at 09:36
Using `hostPath` as a volume in Kind adds a bit of complexity: a configuration file must be generated, a directory must be pre-created, permissions need to be set, a cleanup step is required when the cluster is destroyed, and so on. Worse, permissions on backup subdirectories have to be corrected midway through test runs just to let the backup subcontroller "see" the backups and surface them in the `kubectl get vitessbackups` output. This is pretty fragile and prone to edge cases.

A Kubernetes-native way to deal with this issue is setting `securityContext.fsGroup` on all workloads involved in taking and reading backups. Unfortunately, `fsGroup` doesn't work with `hostPath` in Kubernetes or Kind (rancher/local-path-provisioner#41 (comment), kubernetes-sigs/kind#830). Additionally, using `hostPath` volumes is generally discouraged (https://kubernetes.io/docs/concepts/storage/volumes/#hostpath).

Hence this commit. It replaces `hostPath` volumes with `persistentVolumeClaim` volumes. This makes `fsGroup` work and eliminates all `chmod`-ing. It also removes Kind config generation and host-container directory sharing. The teardown process also becomes simpler, as backup files are automatically removed during Kind cluster deletion. Lastly, this setup is slightly more representative of real-world usage.

To make it work, `securityContext.fsGroup` of the operator Deployment must be set to the `fsGroup` value of Vitess workloads, which is 999 by default. The backup subcontroller then inherits the `fsGroup` of the operator. Otherwise, it runs as UID 1001, which doesn't have permissions to read Vitess backup files from the mounted persistent volume.
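
A hedged sketch of the two pieces this boils down to (names and sizes are illustrative):

    # The backup volume becomes a regular PVC instead of a hostPath:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: vitess-backup
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
    ---
    # And the operator Deployment (under spec.template.spec) matches the Vitess
    # workloads' default fsGroup so the backup subcontroller can read the files
    # on the mounted volume:
    securityContext:
      fsGroup: 999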

Signed-off-by: Sergey Stankevich <[email protected]>
The tests do not reliably work in environments where multiple Buildkite Agents share a single Docker service. The issues arise because `BUILDKITE_BUILD_ID` is identical for different tests triggered by the same commit. As the Kind cluster name uses `BUILDKITE_BUILD_ID`, this can lead to multiple tests attempting to use the same Kind cluster. Additionally, in these shared environments, it isn't uncommon for `docker container ls` to return multiple results, which breaks the `setupKubectlAccessForCI()` function.

This change switches from using `BUILDKITE_BUILD_ID` to `BUILDKITE_JOB_ID`, which is unique for each test case. It also moves away from `docker container ls` to `hostname -s` to determine the container name. This works well because the hostname remains constant throughout the container's lifetime.
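
In shell terms, the change amounts to something like this (variable names are illustrative):

    CLUSTER_NAME="kind-${BUILDKITE_JOB_ID}"  # unique per test case, unlike BUILDKITE_BUILD_ID
    CONTAINER_NAME="$(hostname -s)"          # stable for the container's entire lifetime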

Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Normally, all tests except for the upgrade test should complete in under
10 minutes. The upgrade test should be done in under 20 minutes.
Anything longer than that almost guarantees that the test will fail, so
let's make it fail faster to free up resources and get feedback sooner.

Signed-off-by: Sergey Stankevich <[email protected]>
Buildkite Agents can be marked as lost if they can't talk to the Buildkite
API. They can also get shut down at any time because the instance they
are running on receives a termination signal (e.g., spot instances).
This change handles the most common signals and retries the affected job if
the agent running it has been shut down unexpectedly.
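
A hedged sketch of what the corresponding Buildkite step settings can look like, covering both this commit and the timeout change above (the label, command, and limits are illustrative):

    steps:
      - label: "e2e: upgrade test"
        command: test/endtoend/upgrade_test.sh
        timeout_in_minutes: 20
        retry:
          automatic:
            # Exit status -1 is what Buildkite reports when the agent is lost.
            - exit_status: -1
              limit: 2
            - signal_reason: agent_stop
              limit: 2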

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from ac33fe2 to e6e4d70 on May 29, 2025 at 10:25
Suppresses this MySQL client warning: "WARNING: option --ssl-verify-server-cert is disabled, because of an insecure passwordless login."

Signed-off-by: Sergey Stankevich <[email protected]>
@stankevich force-pushed the refactor-e2e-tests branch from ea7b762 to 4d35938 on May 29, 2025 at 15:11
@stankevich (Contributor, Author) commented:

I made a few more tweaks to stabilize the tests, handle the failures better, and clean up the output. The original pull request description has been updated. Two notable changes since opening this pull request:

  • The hostPath volumes (backup storage) have been replaced with persistentVolumeClaim volumes. This is a more Kubernetes-native approach and is closer to real-world usage outside of using object storage. It significantly simplifies the configuration, cluster setup, permissions, and cleanup.
  • Decreased the timeouts of Buildkite jobs and added retries for unexpected Buildkite Agent failures. This catches sporadic failures caused by external factors, such as spot instance interruptions. It also handles occasional test timeouts, although they are now quite rare. Still, it would be nice to eliminate them completely and not rely on retries.

At this point, the tests are passing fairly reliably. The only one that occasionally gets stuck is the Backup Schedule Test (example). I haven't been able to get to the bottom of it yet. It times out because one out of three tablets fails to become ready after the initial deployment. This only seems to happen in this specific test and only on the public Buildkite Elastic CI Stack in this repository. I haven't been able to reproduce it once in over 30 local test runs or over 50 runs in our custom Buildkite Agent setup, which runs in Kubernetes and uses the Buildkite pipeline from this repository. In any case, I'll keep an eye on the tests and address any further failures that come up.

This is ready to review. I'm happy to split it into multiple pull requests if that would help with the review process. Thanks!

@deepthi requested review from GuptaManan100 and frouioui on June 2, 2025
@GuptaManan100 (Contributor) left a comment:

Looks good to me!

@frouioui (Member) left a comment:

This is really good. Thank you for doing this!

 WORKDIR /go/src/planetscale.dev/vitess-operator
 COPY . /go/src/planetscale.dev/vitess-operator
-RUN CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager
+RUN --mount=type=cache,target=/go/pkg/mod --mount=type=cache,target=/root/.cache/go-build CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager
A Member commented:

Why was this change necessary?

@stankevich (Contributor, Author) replied:
I've added some background to the commit message. Quick summary: before this change, every image build, even after a small and unrelated change (like a README update), would redownload all dependencies and recompile the binary from scratch. This was caused by the COPY . step invalidating the cached Docker layer and running a fresh go install.

This change enables two cache mounts: one for Go modules to avoid redownloading dependencies, and one for build caches to reuse previously built artifacts. That allows go install to reuse outputs from previous builds and cuts down the overall image build time, even if COPY . invalidates the cached layer.

Locally, the speedup is significant, going from 40+ seconds to only a few seconds. In CI, it depends on the setup, but it's never slower than before and can be much faster if the image build runs on the same Buildkite Agent (backed by a Docker service with the cache mounts).

The Member replied:
That sounds great to me!

@frouioui merged commit db1dc3b into planetscale:main on Jun 13, 2025
12 checks passed
@stankevich deleted the refactor-e2e-tests branch on June 16, 2025