Fix some of the NFR tests to reduce error logs and pipeline failures #3827
base: main
Changes from all commits: 39ba9e7, e30ed0b, d5a1a1b, 277f0db, 35e5317
@@ -0,0 +1,18 @@
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+metadata:
+  name: route
+spec:
+  parentRefs:
+  - name: gateway
+    sectionName: listener
+  hostnames:
+  - "*.example.com"
+  rules:
+  - matches:
+    - path:
+        type: PathPrefix
+        value: /
+    backendRefs:
+    - name: backend
+      port: 80
@@ -3,6 +3,7 @@ package main
 import (
 	"bytes"
 	"context"
+	"errors"
 	"fmt"
 	"io"
 	"os"
@@ -40,6 +41,10 @@ var _ = Describe("Scale test", Ordered, Label("nfr", "scale"), func() {
 		"scale/upstreams.yaml",
 	}
 
+	httpRouteManifests = []string{
+		"scale/httproute.yaml",
+	}
+
 	namespace = "scale"
 
 	scrapeInterval = 15 * time.Second
@@ -468,6 +473,10 @@ The logs are attached only if there are errors.
 		Expect(resourceManager.ApplyFromFiles(upstreamsManifests, namespace)).To(Succeed())
 		Expect(resourceManager.WaitForAppsToBeReady(namespace)).To(Succeed())
 
+		// apply HTTPRoute after upstreams are ready
+		Expect(resourceManager.ApplyFromFiles(httpRouteManifests, namespace)).To(Succeed())
+		Expect(resourceManager.WaitForAppsToBeReady(namespace)).To(Succeed())
+
 		var nginxPodNames []string
 		var err error
 		Eventually(
@@ -835,7 +844,16 @@ var _ = Describe("Zero downtime scale test", Ordered, Label("nfr", "zero-downtim
 	AfterAll(func() {
 		_, err := fmt.Fprint(outFile)
 		Expect(err).ToNot(HaveOccurred())
-		Expect(outFile.Close()).To(Succeed())
+
+		if outFile != nil {
+			if err := outFile.Close(); err != nil {
+				if errors.Is(err, os.ErrClosed) || strings.Contains(err.Error(), "file already closed") {
+					GinkgoWriter.Printf("Warning: attempted to close already closed file: %v\n", err)
+				} else {
+					Expect(err).ToNot(HaveOccurred())
+				}
+			}
+		}
 
 		// restoring NGF shared among tests in the suite
 		cfg := getDefaultSetupCfg()

Reviewer (on the errors.Is check): Are these two checks redundant, or are they different?

Author: I only added this for sanity, because Ginkgo appends errors in case of multiple errors, and I wasn't able to reproduce the failure, so I wanted to be sure.

Reviewer (on the inner Expect): You've already checked that err != nil, so this will never pass.

Author: Aah, good catch.
@@ -1012,13 +1030,13 @@ var _ = Describe("Zero downtime scale test", Ordered, Label("nfr", "zero-downtim
 	checkGatewayListeners := func(num int) {
 		Eventually(
 			func() error {
-				ctx, cancel := context.WithTimeout(context.Background(), timeoutConfig.GetTimeout)
+				ctx, cancel := context.WithTimeout(context.Background(), timeoutConfig.GetTimeout*2)
 				defer cancel()
 
 				var gw v1.Gateway
 				key := types.NamespacedName{Namespace: ns.Name, Name: "gateway"}
 				if err := resourceManager.K8sClient.Get(ctx, key, &gw); err != nil {
-					return err
+					return fmt.Errorf("failed to get gateway: %w", err)
 				}
 
 				if len(gw.Status.Listeners) != num {
@@ -1028,8 +1046,8 @@ var _ = Describe("Zero downtime scale test", Ordered, Label("nfr", "zero-downtim
 				return nil
 			},
 		).
-			WithTimeout(5 * time.Second).
-			WithPolling(100 * time.Millisecond).
+			WithTimeout(timeoutConfig.GatewayListenerUpdateTimeout).
+			WithPolling(1 * time.Second).
 			Should(Succeed())
 	}
Reviewer: I'm wondering how this could already be closed in any situation.

Author: I don't know. I saw it as a possible issue in the pipeline runs. I tried to understand it by asking GitHub Copilot, but none of the answers made sense to me. I wanted to make sure we fix it for the pipeline, so I added these checks. From what I understand, these tests don't run in parallel, looking at the workflow. But is it worth risking the biweekly NFR pipeline run failing for these reasons and never reaching the PR step?

Reviewer: Ideally I'd like to dig deeper into why this specific test could have this issue, rather than essentially just ignoring it. All the tests should open and close their respective files as part of setup and teardown, so this one shouldn't be any different.

Author: Maybe I didn't communicate it well earlier. I actually tried to reproduce this bug with multiple runs, but I couldn't get the exact error to show up. I also dug into whether the file might be getting closed in any of the helper functions, or whether the sequence was off, but I didn't find anything unusual there either. I added this check mainly as a safeguard so the pipeline wouldn't fail on what seems like a simple case, but I agree it's worth digging deeper. I'll keep investigating to see if I can pinpoint the real cause.

Author: Just an update: I'm on my 5th local run and haven't hit a file close error. One run couldn't get the LB IP and timed out, but I don't think that's related; it happens when GKE hasn't assigned the external IP yet.

Reviewer: Oh, I saw it here: https://github.com/nginx/nginx-gateway-fabric/actions/runs/17382468629/job/49342855140. Attempt 1 failed, so it retried.

Author: I checked the logs for "Retrying" and saw one entry for the test that failed. Confused.

Reviewer: Yeah, since we have flake-attempts=2, when there's a failure it'll retry the test, which is what's happening there.

Author: No, I meant I have never seen a flaky attempt pass for NFR. So, to agree with your original point: why keep them if they will lead us to different file issues? Also, in one of the runs I saw this error for the scale tests: "failed on waiting for createNGFInstallConfig". It's similar to something Tina shared with us when images were not built and pushed up in her cluster, which is concerning. I'll recheck the workflow, but this is my first time seeing this error.

Author: Had the first successful run of the NFR pipelines. I ran it against my branch, but the test results don't look good for some cases compared against the March results. I can remove the flaky attempts if we think they don't help.