Summary
TestTranslSubscribe fails approximately 22% of the time in CI (Azure Pipelines), causing nearly all PRs to fail regardless of code correctness.
Symptom
The test panics or errors immediately with:
bind: address already in use
This occurs before any subtest executes, inside createServer(t, 8081) at the parent TestTranslSubscribe level.
Root Cause
The root cause is a hardcoded port 8081 passed to createServer(t, 8081) at line 41 of gnmi_server/transl_sub_test.go. When a prior test run leaves the port in TIME_WAIT or the socket has not yet been released by the OS, the new server fails to bind.
Contributing factors:
- Long-running SAMPLE subtests: several subtests use SAMPLE mode with 25-second intervals, meaning the gRPC server lingers for tens of seconds after the subtest ends. The defer s.s.Stop() at the parent level only fires when all subtests complete, so the port stays in use until the parent exits.
- Incomplete Redis flush: prepareDbTranslib(t) flushes only one Redis DB. If a prior test leaves state in a different DB number that translib touches during ACL writes, subsequent test runs may behave unexpectedly and increase the chance of server teardown racing.
- All 15 active subtests share a single createServer call: a single port conflict fails the entire test function immediately. There is no per-subtest server creation, so there is no partial fallback.
Affected Tests
All subtests under TestTranslSubscribe:
origin=openconfig, origin=sonic-db, and about 13 more active subtests covering ONCE, POLL, STREAM, ON_CHANGE, SAMPLE, and TARGET_DEFINED subscription modes.
Proposed Fix
Temporary (this PR): Skip the test with a t.Skip() linking back to this issue, allowing PRs to pass CI while the root cause is tracked.
Permanent: Allocate a dynamic port using net.Listen("tcp", ":0") in createServer instead of hardcoding 8081, eliminating the port-reuse race entirely. This requires updating all callers.
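A minimal sketch of the dynamic-port approach, assuming nothing about the real createServer signature: `listenOnFreePort` is a hypothetical helper, not an existing function in gnmi_server. It binds to port 0 so the OS picks an unused port, and returns the listener still open so that nothing else can claim the port between allocation and use.

```go
package main

import (
	"fmt"
	"net"
)

// listenOnFreePort (hypothetical helper) asks the OS for an unused TCP port
// by binding to port 0. It returns the open listener together with the port
// number the OS assigned. The caller owns the listener and must close it.
func listenOnFreePort() (net.Listener, int, error) {
	lis, err := net.Listen("tcp", ":0")
	if err != nil {
		return nil, 0, err
	}
	// For a "tcp" listener the address is a *net.TCPAddr.
	port := lis.Addr().(*net.TCPAddr).Port
	return lis, port, nil
}

func main() {
	lis, port, err := listenOnFreePort()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer lis.Close()
	fmt.Printf("listening on OS-assigned port %d\n", port)
}
```

Handing the open listener to the server is safer than closing it and passing only the port number, since another process could bind the port in the gap. Whether createServer can accept a listener or needs the port number is an open design question for the permanent fix; either way, the client side of each test must read the assigned port rather than assume 8081.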