Refactor end-to-end tests #700
Conversation
Without a cache mount, any change to the repository results in a new image build, even if there are no changes to the operator's codebase. Not only does each build take over 40 seconds (on my machine), it also creates about 4 GB of Docker layers, every time. This adds up quickly and can fill up the Docker disk allocation after only a handful of runs on a Mac. This change drops the image build time from 40+ to 2 seconds. The layers that store the Go module and build caches are reused across builds. As a result, the disk growth between builds is not nearly as noticeable. It should be a completely transparent change locally and in CI. Signed-off-by: Sergey Stankevich <[email protected]>
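For context, `RUN --mount=type=cache` requires BuildKit. A minimal sketch of a local build that picks up the cache mounts (the image tag is an assumption, not the one the tests use):

```sh
# Recent Docker enables BuildKit by default; setting DOCKER_BUILDKIT=1 makes it
# explicit. Repeating this build after an unrelated change reuses the Go module
# and build caches instead of redownloading and recompiling everything.
DOCKER_BUILDKIT=1 docker build -t vitess-operator:e2e .
```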
Signed-off-by: Sergey Stankevich <[email protected]>
To be used at the start of the test run. Will DRY up the setup phase across all tests. Signed-off-by: Sergey Stankevich <[email protected]>
Makes sense to keep these cluster setup routines together. Also makes the function Shellcheck-clean. Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Wraps the `killall kubectl` and `./pf.sh` calls into a reusable function that also checks that the forwarded ports are ready to serve requests. Signed-off-by: Sergey Stankevich <[email protected]>
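A rough sketch of what such a wrapper could look like; the function name, the use of `nc` as the readiness probe, and the timeout are assumptions rather than the exact implementation:

```sh
restartPortForwarding() {
  local ports=("$@")
  local port

  # Drop any stale kubectl port-forward processes, then start fresh ones.
  killall kubectl 2>/dev/null || true
  ./pf.sh >/dev/null 2>&1 &

  # Wait until each forwarded port actually accepts connections.
  for port in "${ports[@]}"; do
    for _ in $(seq 1 60); do
      if nc -z 127.0.0.1 "$port" 2>/dev/null; then
        continue 2
      fi
      sleep 1
    done
    echo "Port $port never became ready" >&2
    return 1
  done
}
```

A caller would pass whichever forwarded ports the test actually depends on.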
Once the port forwarding is set up and verified to be working, there's a waitForKeyspaceToBeServing() to make sure the keyspace is alive. Signed-off-by: Sergey Stankevich <[email protected]>
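A minimal sketch of such a check, assuming the vtgate MySQL protocol port has been forwarded to 127.0.0.1:15306 (a placeholder) and that the `mysql` client is available; the real helper may inspect the cluster differently:

```sh
waitForKeyspaceToBeServing() {
  local keyspace=$1 shard=$2 expected=$3
  local serving

  for _ in $(seq 1 120); do
    # "show vitess_tablets" reports one tablet per row, including its serving state.
    serving=$(mysql -h 127.0.0.1 -P 15306 -N -e "show vitess_tablets" 2>/dev/null |
      grep -- "$keyspace" | grep -- "$shard" | grep -wc "SERVING" || true)
    if [ "${serving:-0}" -ge "$expected" ]; then
      return 0
    fi
    sleep 5
  done
  echo "Keyspace $keyspace/$shard did not become serving in time" >&2
  return 1
}
```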
This results in a significantly quicker Vitess cluster startup, which reduces the probability of race conditions caused by a delayed start of Vitess components. Signed-off-by: Sergey Stankevich <[email protected]>
Adds a checkPodSpecBySelectorWithTimeout() function that waits for post-upgrade spec changes to take effect. The new flow verifies that all components have been recreated with the latest changes before continuing with other verifications. Signed-off-by: Sergey Stankevich <[email protected]>
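The function name comes from the commit; the label selector, jsonpath, and timeout values in this sketch are assumptions:

```sh
checkPodSpecBySelectorWithTimeout() {
  local selector=$1 expected_image=$2 timeout=${3:-300}

  for _ in $(seq 1 "$((timeout / 5))"); do
    # Collect the images of all pods matching the selector.
    images=$(kubectl get pods -l "$selector" \
      -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}')

    # Succeed once at least one pod exists and every matching pod reports the
    # expected image, i.e. the old pods have been replaced.
    if [ -n "$images" ] && ! echo "$images" | grep -qv "$expected_image"; then
      return 0
    fi
    sleep 5
  done
  echo "Pods matching '$selector' were not updated to $expected_image in time" >&2
  return 1
}
```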
Simplifies the function by removing one `kubectl` call and makes it Shellcheck-clean. Signed-off-by: Sergey Stankevich <[email protected]>
DRYing up the code a little, and removing unnecessary `sleep`s. Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Adds a check that waits for the VTAdmin port to be ready. Removes unnecessary sleeps. Signed-off-by: Sergey Stankevich <[email protected]>
A tiny improvement, but every bit helps. Everything depends on etcd, so it should be checked first. Signed-off-by: Sergey Stankevich <[email protected]>
Of course all of them failed! 😁 Is there any way for me to see the reason? Pasting the relevant errors here would be helpful. Thanks!
This is the error message I see in the Buildkite output:
Create the backup directory in /workdir, not in the Buildkite Agent path Signed-off-by: Sergey Stankevich <[email protected]>
It isn't used in the tests. The Docker image build would fail just as early if there are issues with building the binary. Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
* Chromium isn't used in the unmanaged tablet test
* MySQL Server isn't used anywhere

Signed-off-by: Sergey Stankevich <[email protected]>
Using `hostPath` as a volume in Kind adds a bit of complexity. A configuration file must be generated, a directory must be pre-created, permissions need to be set, a cleanup step is required when the cluster is destroyed, etc. This isn't as big of a deal as correcting permissions on backup subdirectories midway through test runs only to allow the backup subcontroller to "see" the backups and surface them in the `kubectl get vitessbackups` output. This is pretty fragile and prone to edge cases. A Kubernetes-native way to deal with this issue is setting `securityContext.fsGroup` on all workloads involved in taking and reading backups. Unfortunately, `fsGroup` doesn't work with `hostPath` in Kubernetes or Kind (rancher/local-path-provisioner#41 (comment), kubernetes-sigs/kind#830). Additionally, using `hostPath` volumes is generally discouraged (https://kubernetes.io/docs/concepts/storage/volumes/#hostpath). Hence this commit. It replaces `hostPath` volumes with `persistentVolumeClaim` volumes. This makes `fsGroup` work and eliminates all `chmod`-ing. It also removes Kind config generation and host-container directory sharing. The teardown process also becomes simpler, as backup files are automatically removed during Kind cluster deletion. Lastly, this setup is slightly more representative of real-world usage. To make it work, `securityContext.fsGroup` of the operator Deployment must be set to the `fsGroup` value of Vitess workloads, which is 999 by default. The backup subcontroller then inherits the `fsGroup` of the operator. Otherwise, it runs as UID 1001, which doesn't have permissions to read Vitess backup files from the mounted persistent volume. Signed-off-by: Sergey Stankevich <[email protected]>
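The key piece is the pod-level `securityContext.fsGroup` on the operator Deployment. As an illustration only (the deployment name, namespace, and the use of an ad-hoc `kubectl patch` are assumptions; the actual change presumably edits the operator manifest):

```sh
# 999 matches the default fsGroup of Vitess workloads, so the backup
# subcontroller can read backup files on the shared persistent volume.
kubectl -n default patch deployment vitess-operator --type merge \
  -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup":999}}}}}'
```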
The tests do not reliably work in environments where multiple Buildkite Agents share a single Docker service. The issues arise because `BUILDKITE_BUILD_ID` is identical for different tests triggered by the same commit. As the Kind cluster name uses `BUILDKITE_BUILD_ID`, this can lead to multiple tests attempting to use the same Kind cluster. Additionally, in these shared environments, it isn't uncommon for `docker container ls` to return multiple results, which breaks the `setupKubectlAccessForCI()` function. This change switches from using `BUILDKITE_BUILD_ID` to `BUILDKITE_JOB_ID`, which is unique for each test case. It also moves away from `docker container ls` to `hostname -s` to determine the container name. This works well because the hostname remains constant throughout the container's lifetime. Signed-off-by: Sergey Stankevich <[email protected]>
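A sketch of the naming and discovery scheme described above; apart from `BUILDKITE_JOB_ID` and `hostname -s`, the details are assumptions:

```sh
# Unique per test case, unlike BUILDKITE_BUILD_ID, which is shared by all jobs
# triggered from the same commit.
CLUSTER_NAME="kind-${BUILDKITE_JOB_ID}"

setupKubectlAccessForCI() {
  # The agent container's hostname is stable for its lifetime, so it is a more
  # reliable identifier than parsing `docker container ls` output.
  local container_name
  container_name=$(hostname -s)

  # Attach this container to Kind's Docker network ("kind" is Kind's default
  # network name) and fetch a kubeconfig that uses the in-network API address.
  docker network connect kind "$container_name" 2>/dev/null || true
  kind get kubeconfig --name "$CLUSTER_NAME" --internal > "$HOME/.kube/config"
}
```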
Signed-off-by: Sergey Stankevich <[email protected]>
Normally, all tests except for the upgrade test should complete in under 10 minutes. The upgrade test should be done in under 20 minutes. Anything longer than that almost guarantees that the test will fail, so let's make it fail faster to free up resources and get feedback sooner. Signed-off-by: Sergey Stankevich <[email protected]>
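One way to express those bounds, purely as an illustration; the Buildkite pipeline may enforce them at the step level instead, and the script paths are placeholders:

```sh
# Fail fast instead of letting a stuck run hold CI resources for an hour.
timeout 10m ./test/endtoend/operator_test.sh
timeout 20m ./test/endtoend/upgrade_test.sh
```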
Buildkite Agents can be marked as lost if they can't talk to the Buildkite API. They can also get shut down at any time because the instance they are running on receives a termination signal (e.g., spot instances). This change handles the most common signals and retries the affected job if the agent running it has been shut down unexpectedly. Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
Signed-off-by: Sergey Stankevich <[email protected]>
WARNING: option --ssl-verify-server-cert is disabled, because of an insecure passwordless login. Signed-off-by: Sergey Stankevich <[email protected]>
I made a few more tweaks to stabilize the tests, handle the failures better, and clean up the output. The original pull request description has been updated. Two notable changes since opening this pull request:
At this point, the tests are passing fairly reliably. The only one that occasionally gets stuck is the Backup Schedule Test (example). I haven't been able to get to the bottom of it yet. It times out because one out of three tablets fails to become ready after the initial deployment. This only seems to happen in this specific test and only on the public Buildkite Elastic CI Stack in this repository. I haven't been able to reproduce it once in over 30 local test runs or over 50 runs in our custom Buildkite Agent setup, which runs in Kubernetes and uses the Buildkite pipeline from this repository. In any case, I'll keep an eye on the tests and address any further failures that come up. This is ready to review. I'm happy to split it into multiple pull requests if that would help with the review process. Thanks!
GuptaManan100 left a comment:
Looks good to me!
frouioui left a comment:
This is really good. Thank you for doing this
```diff
 WORKDIR /go/src/planetscale.dev/vitess-operator
 COPY . /go/src/planetscale.dev/vitess-operator
-RUN CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager
+RUN --mount=type=cache,target=/go/pkg/mod --mount=type=cache,target=/root/.cache/go-build CGO_ENABLED=0 go install /go/src/planetscale.dev/vitess-operator/cmd/manager
```
Why was this change necessary?
I've added some background to the commit message. Quick summary: before this change, every image build, even after a small and unrelated change (like a README update), would redownload all dependencies and recompile the binary from scratch. This was caused by the `COPY .` step invalidating the cached Docker layer and running a fresh `go install`.
This change enables two cache mounts: one for Go modules to avoid redownloading dependencies, and one for the build cache to reuse previously built artifacts. That allows `go install` to reuse outputs from previous builds and cuts down the overall image build time, even if `COPY .` invalidates the cached layer.
Locally, the speedup is significant, going from 40+ seconds to only a few seconds. In CI, it depends on the setup, but it's never slower than before and can be much faster if the image build runs on the same Buildkite Agent (backed by a Docker service with the cache mounts).
That sounds great to me!
The end-to-end tests have been pretty flaky lately, which makes it harder to trust them and slows down pull requests. Let's try to get this fixed! Here, I focused on the initial Kind cluster setup, port forwarding, teardown, and some of the most temperamental parts of the test suite. Quick summary of the changes:
* Replaced `hostPath` volumes with `persistentVolumeClaim` volumes. That simplified the code, cluster setup, and file permissions, and removed the occasional `chmod`-ing of backup directories.
* Removed unnecessary `sleep`s from the tests.
* Reduced `tablet_refresh_interval` from 60 to 10 seconds (same as in vttestserver), which speeds up the initial Vitess cluster setup and upgrades.
* Wrapped the `commerce` keyspace verification commands into a function, DRYing up a decent chunk of the code.

I ran some "before" and "after" comparisons. On my machine, the average time for 7 test runs on the `main` branch was 36 minutes 1 second, with a standard deviation of 1 minute 4 seconds. After these changes, the average dropped to 27 minutes 57 seconds, with a standard deviation of 1 minute 1 second. More importantly though, I haven't seen any of the flakiness that was happening before. Will see what CI says and make further improvements if needed.

The diff of this pull request is pretty big, and it's much easier to review commit-by-commit. I decided not to split it into multiple pull requests because I was worried that flaky failures in intermediate ones would stall all of them. 😅
Related to #582.