Labels
bug (Something isn't working)
Description
Describe the bug
The following is the current behaviour:
1) Agent establishes initial connection to principal
   - (... additional time passes ...)
   - (agent loses connection to principal)
2) maintainConnection (agent/connection.go) creates a new connection to principal (once connection to principal becomes possible)
3) handleStreamEvents (agent/connection.go) creates an EventWriter to fully replace the old EventWriter, and that new EventWriter begins receiving events
However, step 3) is incorrect: the old EventWriter object from step 1) still exists and still has events left to send in unsentEvents and sentEvents. Those events can never be delivered, because the old EventWriter keeps trying to send them over the previous, dead connection.
Principal code does not have this problem, because it calls EventWriter.UpdateTarget when a new connection is established (thus preserving the existing events).
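The difference between the two reconnect strategies can be sketched as follows. This is a minimal model, not the actual argocd-agent code: the `conn` and `eventWriter` types, their fields, and the `simulate` helper are hypothetical simplifications, and `updateTarget` only mirrors the idea behind the `EventWriter.UpdateTarget` call mentioned above.

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical stand-in for the stream between agent and principal.
type conn struct {
	alive     bool
	delivered []string
}

// send records the event as delivered, or fails if the connection is dead.
func (c *conn) send(ev string) error {
	if !c.alive {
		return errors.New("connection is dead")
	}
	c.delivered = append(c.delivered, ev)
	return nil
}

// eventWriter is a simplified, hypothetical model of an event writer:
// it queues events and sends them to its current target connection.
type eventWriter struct {
	target *conn
	unsent []string
}

func newEventWriter(target *conn) *eventWriter {
	return &eventWriter{target: target}
}

func (w *eventWriter) add(ev string) { w.unsent = append(w.unsent, ev) }

// flush tries to send every queued event and keeps the ones that fail.
func (w *eventWriter) flush() {
	var remaining []string
	for _, ev := range w.unsent {
		if err := w.target.send(ev); err != nil {
			remaining = append(remaining, ev)
		}
	}
	w.unsent = remaining
}

// updateTarget points the existing writer at a new connection,
// preserving whatever is still queued.
func (w *eventWriter) updateTarget(target *conn) { w.target = target }

// reconnectByReplacing models the behaviour described in step 3):
// a brand-new writer is created and the old queue is left behind.
func reconnectByReplacing(_ *eventWriter, fresh *conn) *eventWriter {
	return newEventWriter(fresh)
}

// reconnectByUpdating models the principal-side behaviour:
// the old writer is kept and only its target changes.
func reconnectByUpdating(old *eventWriter, fresh *conn) *eventWriter {
	old.updateTarget(fresh)
	return old
}

// simulate queues one event while the connection is down, reconnects using
// the given strategy, and returns what arrives on the new connection.
func simulate(reconnect func(*eventWriter, *conn) *eventWriter) []string {
	first := &conn{alive: true}
	w := newEventWriter(first)

	first.alive = false // connection to principal is lost
	w.add("app-sync-status")
	w.flush() // fails, so the event stays queued in the old writer

	second := &conn{alive: true} // connection becomes possible again
	w = reconnect(w, second)
	w.add("heartbeat")
	w.flush()
	return second.delivered
}

func main() {
	fmt.Println("replace:", simulate(reconnectByReplacing))
	fmt.Println("update: ", simulate(reconnectByUpdating))
}
```

In this model, replacing the writer delivers only the event queued after the reconnect, while updating the target also delivers the event that was queued during the outage. The event names are placeholders chosen for illustration.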
Found using chaos engineering via toxiproxy, which I will contribute separately.
Reproducible on the main branch as of commit 31282236c1d088ac579544f71dab8720a90ea8c7 (Jan 14).
To reproduce
These steps simulate persistent connection failures between the managed agent and the principal. (The autonomous agent is not modified here.)
# Pull branch for bug reproduction
git clone https://github.com/jgwest/argocd-agent
cd argocd-agent
git checkout delme-repro-argocd-agent-715-jan-2026
# Set up vcluster and enable argocd-agent config
make setup-e2e
# Start toxiproxy server (using podman, or docker)
podman run --rm --net=host -it ghcr.io/shopify/toxiproxy
# Chaos tester tells the toxiproxy server to block all connections between principal<->agent every 10-15 seconds, for 5-10 seconds.
go run hack/chaos-tester/main.go
# Start local argocd-agent (agent uses toxiproxy as intermediary to principal)
ARGOCD_AGENT_REMOTE_PORT=8475 make start-e2e
# Run Test_SyncManaged over and over until it fails
# For me, it takes 20-25 minutes to fail.
until-fail.sh go test -count=1 -v -run TestSyncTestSuite/Test_SyncManaged ./test/e2e/
# until-fail.sh is a simple shell script:
# https://gist.github.com/jgwest/7048a765d398519837f990120cf3fdd0
Example failure logs when running repro steps
# Round N-1 passes
=== RUN TestSyncTestSuite/Test_SyncManaged
fixture.go:103: Test begun at: 2026-01-29 12:50:55.562902512 -0500 EST m=+0.108581668
sync_test.go:104: jgw: waiting for outofsync guestbook agent-managed 2026-01-29 12:51:02.58232843 -0500 EST m=+7.128007588
sync_test.go:113: jgw: starting syncapplication of guestbook agent-managed 2026-01-29 12:51:03.587190644 -0500 EST m=+8.132869811
sync_test.go:118: jgw: completing syncapplication 2026-01-29 12:51:03.601917405 -0500 EST m=+8.147596602
sync_test.go:127: jgw: completing eventually 2026-01-29 12:51:04.607752216 -0500 EST m=+9.153431369
fixture.go:107: Test ended at: 2026-01-29 12:51:06.64945961 -0500 EST m=+11.195138776
--- PASS: TestSyncTestSuite (28.34s)
--- PASS: TestSyncTestSuite/Test_SyncManaged (28.30s)
PASS
ok github.com/argoproj-labs/argocd-agent/test/e2e 28.373s
=== RUN TestSyncTestSuite
# Round N fails
=== RUN TestSyncTestSuite/Test_SyncManaged
fixture.go:103: Test begun at: 2026-01-29 12:51:26.843821979 -0500 EST m=+0.110578000
sync_test.go:104: jgw: waiting for outofsync guestbook agent-managed 2026-01-29 12:51:27.862541937 -0500 EST m=+1.129297953
sync_test.go:107:
Error Trace: /home/jgw/workspace/argo-cd/argocd-agent/test/e2e/sync_test.go:107
/usr/lib/golang/src/runtime/asm_amd64.s:1693
Error: Condition never satisfied
Test: TestSyncTestSuite/Test_SyncManaged
fixture.go:107: Test ended at: 2026-01-29 12:53:27.862777208 -0500 EST m=+121.129533233
--- FAIL: TestSyncTestSuite (139.28s)
--- FAIL: TestSyncTestSuite/Test_SyncManaged (139.24s)
FAIL
FAIL github.com/argoproj-labs/argocd-agent/test/e2e 139.314s
FAIL