-
Notifications
You must be signed in to change notification settings - Fork 137
Fix some of the NFR tests to reduce error logs and pipeline failures #3827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ead05a1
to
f53133f
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3827 +/- ##
==========================================
+ Coverage 86.96% 87.00% +0.04%
==========================================
Files 128 128
Lines 16434 16434
Branches 62 62
==========================================
+ Hits 14292 14299 +7
+ Misses 1964 1959 -5
+ Partials 178 176 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
f53133f
to
755a95d
Compare
if strings.Contains(err.Error(), "file already closed") { | ||
GinkgoWriter.Printf("Warning: attempted to close already closed file: %v\n", err) | ||
} else { | ||
Expect(err).ToNot(HaveOccurred()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've already checked that err != nil, so this will never pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aah good catch
Just to clarify, the err != nil
check is there so I can branch on the type of error. The intention is to only warn (and not fail the test) if the error is specifically "file already closed", while treating all other errors as failures.
@@ -835,7 +843,19 @@ var _ = Describe("Zero downtime scale test", Ordered, Label("nfr", "zero-downtim | |||
AfterAll(func() { | |||
_, err := fmt.Fprint(outFile) | |||
Expect(err).ToNot(HaveOccurred()) | |||
Expect(outFile.Close()).To(Succeed()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering how this could already be closed in any situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. I saw it as a possible issue in the pipeline runs. I was trying to understand the same by asking github copilot but all the answers did not make sense to me. But I wanted to ensure we fix it for the pipeline so added these checks.
From what I understand -- I don't think these tests run in parallel looking at the workflow. But is it worth risking the NFR biweekly pipeline run to fail for these reasons and not get to the PR step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I'd like to dig deeper into why this specific test could have this issue, rather than essentially just ignoring it. Since all the tests should open and close their respective fields as part of the setup and teardown. This one shouldn't be any different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I didn’t communicate it well earlier — I actually tried to reproduce this bug with multiple runs, but I couldn’t get the exact error to show up. I also dug into whether the file might be getting closed in any of the helper functions or if the sequence was off, but I didn’t find anything unusual there either.
I added this check mainly as a safeguard so the pipeline wouldn’t fail on what seems like a simple case, but I agree it’s worth digging deeper. I’ll keep investigating to see if I can pinpoint the real cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an update -- on my 5th run of this locally but haven't had a file close error. But one time I couldn't get LB IP and it timed out. Don't think it is related happens when GKE is hasn't assigned the external IP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I saw it here: https://github.com/nginx/nginx-gateway-fabric/actions/runs/17382468629/job/49342855140
Attempt 1 failed so it retried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, since we have flake-atttempts=2
, when there's a failure, it'll retry the test again, which is what is happening there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I meant like I ave never seen a flakey attempt pass for NFR, so to agree to your original point -- why keep them if they will lead us to different file issues
- https://github.com/nginx/nginx-gateway-fabric/actions/runs/17382468629/job/49342855140
- https://github.com/nginx/nginx-gateway-fabric/actions/runs/16993918579/job/48179870697
- https://github.com/nginx/nginx-gateway-fabric/actions/runs/16993918579/job/48179870629
Also from one of the runs, I saw this error for scale tests
FAILED] in [AfterAll] - /home/username/nginx-gateway-fabric/tests/suite/system_suite_test.go:264 @ 08/15/25 18:37:27.954
• [FAILED] [1090.394 seconds]
Scale test [BeforeEach] scales HTTP routes to 1000 [nfr, scale]
[BeforeEach] /home/username/nginx-gateway-fabric/tests/suite/scale_test.go:99
[It] /home/username/nginx-gateway-fabric/tests/suite/scale_test.go:570
[FAILED] install.go:224: 2025-08-15 18:25:42.463026353 +0000 UTC m=+0.038108761 [debug] Original chart version: ""
install.go:241: 2025-08-15 18:25:42.463081634 +0000 UTC m=+0.038164040 [debug] CHART PATH: /home/username/nginx-gateway-fabric/charts/nginx-gateway-fabric
walk.go:75: found symbolic link in path: /home/username/nginx-gateway-fabric/charts/nginx-gateway-fabric/crds resolves to /home/username/nginx-gateway-fabric/config/crd/bases. Contents of linked file included and used
client.go:142: 2025-08-15 18:25:42.5001725 +0000 UTC m=+0.075254914 [debug] creating 1 resource(s)
client.go:142: 2025-08-15 18:25:42.536370754 +0000 UTC m=+0.111453183 [debug] creating 1 resource(s)
client.go:142: 2025-08-15 18:25:42.596179176 +0000 UTC m=+0.171261591 [debug] creating 1 resource(s)
client.go:142: 2025-08-15 18:25:43.007022036 +0000 UTC m=+0.582104435 [debug] creating 1 resource(s)
client.go:142: 2025-08-15 18:25:43.085068053 +0000 UTC m=+0.660150467 [debug] creating 1 resource(s)
client.go:142: 2025-08-15 18:25:43.14430961 +0000 UTC m=+0.719392046 [debug] creating 1 resource(s)
wait.go:50: 2025-08-15 18:25:43.237721342 +0000 UTC m=+0.812803751 [debug] beginning wait for 6 resources with timeout of 1m0s
install.go:212: 2025-08-15 18:25:45.666880466 +0000 UTC m=+3.241962882 [debug] Clearing REST mapper cache
client.go:142: 2025-08-15 18:25:46.294039937 +0000 UTC m=+3.869122350 [debug] creating 1 resource(s)
client.go:486: 2025-08-15 18:25:46.460743731 +0000 UTC m=+4.0[358](https://github.com/nginx/nginx-gateway-fabric/actions/runs/16993918579/job/48179870697#step:11:359)26141 [debug] Starting delete for "ngf-test-nginx-gateway-fabric-cert-generator" ServiceAccount
client.go:490: 2025-08-15 18:25:46.501964695 +0000 UTC m=+4.077047109 [debug] Ignoring delete failure for "ngf-test-nginx-gateway-fabric-cert-generator" /v1, Kind=ServiceAccount: serviceaccounts "ngf-test-nginx-gateway-fabric-cert-generator" not found
wait.go:104: 2025-08-15 18:25:46.502099876 +0000 UTC m=+4.077182288 [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: 2025-08-15 18:25:46.758832678 +0000 UTC m=+4.333915093 [debug] creating 1 resource(s)
client.go:486: 2025-08-15 18:25:46.817617819 +0000 UTC m=+4.392700224 [debug] Starting delete for "ngf-test-nginx-gateway-fabric-cert-generator" Role
client.go:490: 2025-08-15 18:25:46.851225828 +0000 UTC m=+4.426308242 [debug] Ignoring delete failure for "ngf-test-nginx-gateway-fabric-cert-generator" rbac.authorization.k8s.io/v1, Kind=Role: roles.rbac.authorization.k8s.io "ngf-test-nginx-gateway-fabric-cert-generator" not found
wait.go:104: 2025-08-15 18:25:46.851272367 +0000 UTC m=+4.426354760 [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: 2025-08-15 18:25:47.057280013 +0000 UTC m=+4.632362426 [debug] creating 1 resource(s)
client.go:486: 2025-08-15 18:25:47.141462002 +0000 UTC m=+4.716544407 [debug] Starting delete for "ngf-test-nginx-gateway-fabric-cert-generator" RoleBinding
client.go:490: 2025-08-15 18:25:47.174935325 +0000 UTC m=+4.750017744 [debug] Ignoring delete failure for "ngf-test-nginx-gateway-fabric-cert-generator" rbac.authorization.k8s.io/v1, Kind=RoleBinding: rolebindings.rbac.authorization.k8s.io "ngf-test-nginx-gateway-fabric-cert-generator" not found
wait.go:104: 2025-08-15 18:25:47.175001271 +0000 UTC m=+4.750083677 [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: 2025-08-15 18:25:47.379579502 +0000 UTC m=+4.954661916 [debug] creating 1 resource(s)
client.go:486: 2025-08-15 18:25:47.454800946 +0000 UTC m=+5.029883351 [debug] Starting delete for "ngf-test-nginx-gateway-fabric-cert-generator" Job
client.go:490: 2025-08-15 18:25:47.491223753 +0000 UTC m=+5.066306166 [debug] Ignoring delete failure for "ngf-test-nginx-gateway-fabric-cert-generator" batch/v1, Kind=Job: jobs.batch "ngf-test-nginx-gateway-fabric-cert-generator" not found
wait.go:104: 2025-08-15 18:25:47.491272508 +0000 UTC m=+5.066354919 [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: 2025-08-15 18:25:47.748791178 +0000 UTC m=+5.32387[359](https://github.com/nginx/nginx-gateway-fabric/actions/runs/16993918579/job/48179870697#step:11:360)1 [debug] creating 1 resource(s)
client.go:712: 2025-08-15 18:25:47.805465089 +0000 UTC m=+5.380547487 [debug] Watching for changes to Job ngf-test-nginx-gateway-fabric-cert-generator with timeout of 5m0s
client.go:740: 2025-08-15 18:25:47.834789407 +0000 UTC m=+5.409871818 [debug] Add/Modify event for ngf-test-nginx-gateway-fabric-cert-generator: ADDED
client.go:779: 2025-08-15 18:25:47.834843163 +0000 UTC m=+5.409925574 [debug] ngf-test-nginx-gateway-fabric-cert-generator: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:740: 2025-08-15 18:25:47.940051746 +0000 UTC m=+5.515134161 [debug] Add/Modify event for ngf-test-nginx-gateway-fabric-cert-generator: MODIFIED
client.go:779: 2025-08-15 18:25:47.940094635 +0000 UTC m=+5.515177045 [debug] ngf-test-nginx-gateway-fabric-cert-generator: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: INSTALLATION FAILED: failed pre-install: 1 error occurred:
* timed out waiting for the condition
helm.go:86: 2025-08-15 18:30:47.996973087 +0000 UTC m=+305.572055494 [debug] failed pre-install: 1 error occurred:
* timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
helm.sh/helm/v3/cmd/helm/install.go:158
github.com/spf13/cobra.(*Command).execute
github.com/spf13/[email protected]/command.go:985
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/[email protected]/command.go:1117
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/[email protected]/command.go:1041
main.main
helm.sh/helm/v3/cmd/helm/helm.go:85
runtime.main
runtime/proc.go:272
runtime.goexit
runtime/asm_amd64.s:1700
Unexpected error:
<*exec.ExitError | 0xc005170000>:
exit status 1
{
ProcessState: {
pid: 38573,
status: 256,
rusage: {
Utime: {Sec: 1, Usec: 412618},
Stime: {Sec: 0, Usec: 159787},
Maxrss: 3481956,
Ixrss: 0,
Idrss: 0,
Isrss: 0,
Minflt: 19401,
Majflt: 0,
Nswap: 0,
Inblock: 0,
Oublock: 792,
Msgsnd: 0,
Msgrcv: 0,
Nsignals: 0,
Nvcsw: 4897,
Nivcsw: 806,
},
},
Stderr: nil,
}
occurred
Error: BeforeEach 08/15/25 18:30:48.001
Full Stack Trace
github.com/nginx/nginx-gateway-fabric/v2/tests/suite.createNGFInstallConfig({{0x31caa9a, 0x8}, {0xc0005dd640, 0x3f}, {0x7ffc112c[363](https://github.com/nginx/nginx-gateway-fabric/actions/runs/16993918579/job/48179870697#step:11:364)8, 0x5}, 0x1, 0x1, 0x1, 0x0}, ...)
/home/username/nginx-gateway-fabric/tests/suite/system_suite_test.go:264 +0xbec
github.com/nginx/nginx-gateway-fabric/v2/tests/suite.setup({{0x31caa9a, 0x8}, {0xc0005dd640, 0x3f}, {0x7ffc112c3638, 0x5}, 0x1, 0x1, 0x1, 0x0}, ...)
/home/username/nginx-gateway-fabric/tests/suite/system_suite_test.go:180 +0x12e8
github.com/nginx/nginx-gateway-fabric/v2/tests/suite.init.func10.2()
/home/username/nginx-gateway-fabric/tests/suite/scale_test.go:104 +0x10a
There were additional failures detected. To view them in detail run ginkgo -vv
------------------------------
failed on waiting for createNGFInstallConfig
similar to something Tina shared with us when images were not built and pushed up in her cluster, concerning. I'll recheck and see the workflow but this is my first time seeing this error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had the first successful run for NFR pipelines. I ran it against my branch but tests results don't look good for some cases compared against the March results
I can remove the flakey attempts if we think it doesn't help.
cd4a559
to
35e5317
Compare
|
||
if outFile != nil { | ||
if err := outFile.Close(); err != nil { | ||
if errors.Is(err, os.ErrClosed) || strings.Contains(err.Error(), "file already closed") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these two checks redundant or is they different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only added this for sanity because ginkgo appends errors in case of multiple errors. And I wasn't able to reproduce so wanted to be sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @salonichf5 🎉
Proposed changes
Write a clear and concise description that helps reviewers understand the purpose and impact of your changes. Use the
following format:
Problem: NFR tests had errors and had pipeline failures.
Solution: I have tried to add fixes for some of the errors and fix the pipeline failures
This PR also adds
--server-side
flag to gateway API crd installationTesting: Re ran tests locally to ensure they pass
What's excluded - Not investigating longevity tests because the NGINX agent bug should solve issues around it that's being monitored and investigated of by another teammate.
Please focus on (optional): If you any specific areas where you would like reviewers to focus their attention or provide
specific feedback, add them here.
Closes #3728
Closes #3796
Checklist
Before creating a PR, run through this checklist and mark each as complete.
Release notes
If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.