Managed/Autonomous agents lose unsent/retriable events when principal connection is lost #715

@jgwest

Describe the bug

The following is the current behaviour:

  1. Agent establishes initial connection to principal
  • (... additional time passes ...)
  • ( agent loses connection to principal )
  2. maintainConnection (agent/connection.go) creates a new connection to principal (once connection to principal becomes possible)
  3. handleStreamEvents (agent/connection.go) creates a new EventWriter to fully replace the old EventWriter, and that new EventWriter begins receiving events

However, 3) is incorrect: the old EventWriter object from 1) still exists and still has events left to send in unsentEvents and sentEvents. Sending those events will keep failing, because the old EventWriter is still trying to use the previous, dead connection, so they never reach the principal.

Principal code does not have this problem, because it calls EventWriter.UpdateTarget when a new connection is established (thus preserving the existing events).
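To make the difference concrete, here is a minimal, self-contained model of the two reconnect strategies. It is only a sketch: `conn`, `eventWriter`, `reconnectByReplacing` and `reconnectByRetargeting` are hypothetical names invented for illustration, not the real argocd-agent types, and the model tracks a single unsent queue rather than the real unsentEvents/sentEvents bookkeeping.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for the stream to the principal.
type conn struct {
	alive bool
	got   []string // events that were actually delivered
}

// send delivers ev, or fails if the connection is dead.
func (c *conn) send(ev string) error {
	if !c.alive {
		return fmt.Errorf("connection is dead")
	}
	c.got = append(c.got, ev)
	return nil
}

// eventWriter is a simplified model: a target connection plus a queue of
// events that have not been delivered yet.
type eventWriter struct {
	target *conn
	unsent []string
}

func (w *eventWriter) add(ev string) { w.unsent = append(w.unsent, ev) }

// updateTarget swaps the connection but keeps the queued events.
func (w *eventWriter) updateTarget(c *conn) { w.target = c }

// flush sends queued events in order and stops at the first failure.
func (w *eventWriter) flush() {
	for len(w.unsent) > 0 {
		if err := w.target.send(w.unsent[0]); err != nil {
			return
		}
		w.unsent = w.unsent[1:]
	}
}

// reconnectByReplacing models the behaviour reported for the agent:
// a brand-new writer is created and the old queue is abandoned.
func reconnectByReplacing(_ *eventWriter, c *conn) *eventWriter {
	return &eventWriter{target: c}
}

// reconnectByRetargeting models the behaviour described for the principal:
// the existing writer is pointed at the new connection.
func reconnectByRetargeting(old *eventWriter, c *conn) *eventWriter {
	old.updateTarget(c)
	return old
}

// simulate queues two events while the connection is dead, reconnects with
// the given strategy, queues one more event, and returns what was delivered.
func simulate(reconnect func(*eventWriter, *conn) *eventWriter) []string {
	w := &eventWriter{target: &conn{alive: false}}
	w.add("e1")
	w.add("e2")
	w.flush() // fails, both events stay queued

	fresh := &conn{alive: true}
	w = reconnect(w, fresh)
	w.add("e3")
	w.flush()
	return fresh.got
}

func main() {
	fmt.Println("replace: ", simulate(reconnectByReplacing))   // [e3]
	fmt.Println("retarget:", simulate(reconnectByRetargeting)) // [e1 e2 e3]
}
```

In this model, replacing the writer silently drops e1 and e2, while retargeting delivers them in order once the new connection is up.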

Found using chaos engineering via toxiproxy, which I will contribute separately.

Reproducible on main branch as of commit 31282236c1d088ac579544f71dab8720a90ea8c7 (Jan 14).

To reproduce

This will simulate persistent failures from managed agent <-> principal. (Autonomous is not modified here)

# Pull branch for bug reproduction
git clone https://github.com/jgwest/argocd-agent
cd argocd-agent
git checkout delme-repro-argocd-agent-715-jan-2026

# Set up vcluster and enable argocd-agent config
make setup-e2e

# start toxiproxy server (or docker)
podman run --rm --net=host -it ghcr.io/shopify/toxiproxy

# chaos tester tells the toxiproxy server to block all connections between principal<->agent every 10-15 seconds, for 5-10 seconds.
go run hack/chaos-tester/main.go

# Start local argocd-agent (agent uses toxiproxy as intermediary to principal)
ARGOCD_AGENT_REMOTE_PORT=8475 make start-e2e

# Run Test_SyncManaged over and over until it fails
# For me, it takes 20-25 minutes to fail.
until-fail.sh   go test -count=1 -v -v -run TestSyncTestSuite/Test_SyncManaged ./test/e2e/

# until-fail.sh is a simple shell script:
# https://gist.github.com/jgwest/7048a765d398519837f990120cf3fdd0
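The outage schedule described in the chaos tester comment above (block the connection every 10-15 seconds, for 5-10 seconds) can be pictured as the loop below. This is a sketch only and not the contents of hack/chaos-tester/main.go: the `toggler` interface, `recordingToggler`, `randBetween`, `runChaos` and `simulateRounds` are all hypothetical, and no real toxiproxy client call is shown. The sleep function is injected so the schedule can be inspected without waiting.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// toggler is a hypothetical abstraction over "enable or disable the proxy".
type toggler interface {
	SetEnabled(enabled bool) error
}

// recordingToggler records every call instead of talking to a real proxy.
type recordingToggler struct{ calls []bool }

func (t *recordingToggler) SetEnabled(enabled bool) error {
	t.calls = append(t.calls, enabled)
	return nil
}

// randBetween returns a random duration in the inclusive range [lo, hi].
func randBetween(r *rand.Rand, lo, hi time.Duration) time.Duration {
	return lo + time.Duration(r.Int63n(int64(hi-lo)+1))
}

// runChaos alternates a healthy period of 10-15s with an outage of 5-10s.
func runChaos(t toggler, r *rand.Rand, rounds int, sleep func(time.Duration)) error {
	for i := 0; i < rounds; i++ {
		sleep(randBetween(r, 10*time.Second, 15*time.Second)) // healthy period
		if err := t.SetEnabled(false); err != nil {
			return err
		}
		sleep(randBetween(r, 5*time.Second, 10*time.Second)) // outage
		if err := t.SetEnabled(true); err != nil {
			return err
		}
	}
	return nil
}

// simulateRounds runs the schedule with a fake sleep and returns the
// requested sleep durations and the recorded enable/disable calls.
func simulateRounds(seed int64, rounds int) ([]time.Duration, []bool) {
	var sleeps []time.Duration
	rec := &recordingToggler{}
	r := rand.New(rand.NewSource(seed))
	fakeSleep := func(d time.Duration) { sleeps = append(sleeps, d) }
	_ = runChaos(rec, r, rounds, fakeSleep) // recordingToggler never fails
	return sleeps, rec.calls
}

func main() {
	sleeps, calls := simulateRounds(1, 2)
	fmt.Println("sleeps:", sleeps)
	fmt.Println("calls: ", calls)
}
```

Passing `time.Sleep` as the sleep function and a real proxy client as the `toggler` would turn the same loop into a live fault injector.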

Example failure logs when running repro steps


# Round N-1 passes
=== RUN   TestSyncTestSuite/Test_SyncManaged
    fixture.go:103: Test begun at: 2026-01-29 12:50:55.562902512 -0500 EST m=+0.108581668
    sync_test.go:104: jgw: waiting for outofsync guestbook agent-managed 2026-01-29 12:51:02.58232843 -0500 EST m=+7.128007588
    sync_test.go:113: jgw: starting syncapplication of guestbook agent-managed 2026-01-29 12:51:03.587190644 -0500 EST m=+8.132869811
    sync_test.go:118: jgw: completing syncapplication 2026-01-29 12:51:03.601917405 -0500 EST m=+8.147596602
    sync_test.go:127: jgw: completing eventually 2026-01-29 12:51:04.607752216 -0500 EST m=+9.153431369
    fixture.go:107: Test ended at: 2026-01-29 12:51:06.64945961 -0500 EST m=+11.195138776
--- PASS: TestSyncTestSuite (28.34s)
    --- PASS: TestSyncTestSuite/Test_SyncManaged (28.30s)
PASS
ok  	github.com/argoproj-labs/argocd-agent/test/e2e	28.373s
=== RUN   TestSyncTestSuite

# Round N fails
=== RUN   TestSyncTestSuite/Test_SyncManaged
    fixture.go:103: Test begun at: 2026-01-29 12:51:26.843821979 -0500 EST m=+0.110578000
    sync_test.go:104: jgw: waiting for outofsync guestbook agent-managed 2026-01-29 12:51:27.862541937 -0500 EST m=+1.129297953
    sync_test.go:107: 
        	Error Trace:	/home/jgw/workspace/argo-cd/argocd-agent/test/e2e/sync_test.go:107
        	            				/usr/lib/golang/src/runtime/asm_amd64.s:1693
        	Error:      	Condition never satisfied
        	Test:       	TestSyncTestSuite/Test_SyncManaged
    fixture.go:107: Test ended at: 2026-01-29 12:53:27.862777208 -0500 EST m=+121.129533233
--- FAIL: TestSyncTestSuite (139.28s)
    --- FAIL: TestSyncTestSuite/Test_SyncManaged (139.24s)
FAIL
FAIL	github.com/argoproj-labs/argocd-agent/test/e2e	139.314s
FAIL

Labels: bug