
Deadlock in sotw v3 server #503


Description

@lobkovilya

The problem occurs in our tests after updating to the latest go-control-plane version, but I believe it could also happen in real-world use.

The stack traces for the two deadlocked goroutines look like this:

```
1 @ 0x103ca25 0x100759a 0x10072f5 0x1bb58af 0x1bb4507 0x1bbcdb9 0x1ea2695 0x1bb3ec6 0x1074121
#	0x1bb58ae	github.com/kumahq/kuma/pkg/util/xds/v3.(*snapshotCache).respond+0x12e		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/xds/v3/cache.go:311
#	0x1bb4506	github.com/kumahq/kuma/pkg/util/xds/v3.(*snapshotCache).SetSnapshot+0x5a6	/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/xds/v3/cache.go:168
#	0x1bbcdb8	github.com/kumahq/kuma/pkg/kds/reconcile.(*reconciler).Reconcile+0x1f8		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/kds/reconcile/reconciler.go:46
#	0x1ea2694	github.com/kumahq/kuma/pkg/kds/server.newSyncTracker.func1.2+0x194		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/kds/server/components.go:93
#	0x1bb3ec5	github.com/kumahq/kuma/pkg/util/watchdog.(*SimpleWatchdog).Start+0xe5		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/watchdog/watchdog.go:25

...

1 @ 0x103ca25 0x104e6c5 0x104e6ae 0x106fd67 0x107f225 0x1080990 0x1080922 0x1bb8a06 0x1bab955 0x1bad89a 0x1e9fecb 0x284e9b3 0x1074121
#	0x106fd66	sync.runtime_SemacquireMutex+0x46							/usr/local/Cellar/go/1.16.5/libexec/src/runtime/sema.go:71
#	0x107f224	sync.(*Mutex).lockSlow+0x104								/usr/local/Cellar/go/1.16.5/libexec/src/sync/mutex.go:138
#	0x108098f	sync.(*Mutex).Lock+0x8f									/usr/local/Cellar/go/1.16.5/libexec/src/sync/mutex.go:81
#	0x1080921	sync.(*RWMutex).Lock+0x21								/usr/local/Cellar/go/1.16.5/libexec/src/sync/rwmutex.go:111
#	0x1bb8a05	github.com/kumahq/kuma/pkg/util/xds/v3.(*snapshotCache).cancelWatch.func1+0x65		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/xds/v3/cache.go:283
#	0x1bab954	github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3.(*server).process+0x7b4	/Users/lobkovilya/go/src/github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3/server.go:418
#	0x1bad899	github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3.(*server).StreamHandler+0xb9	/Users/lobkovilya/go/src/github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3/server.go:449
#	0x1e9feca	github.com/kumahq/kuma/pkg/kds/server.(*server).StreamKumaResources+0x8a		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/kds/server/kds.go:30
#	0x284e9b2	github.com/kumahq/kuma/pkg/test/kds/setup.StartServer.func1+0x72			/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/test/kds/setup/server.go:60
```

While the first goroutine calls SetSnapshot and tries to update all watchers, the server goroutine receives a DiscoveryRequest and tries to call cancel. Both SetSnapshot and cancel call cache.mu.Lock().

The first goroutine, inside SetSnapshot, can't update the watchers: the values.responses channel is full (it has capacity 5), so the send blocks until the server goroutine reads from that channel. But the server goroutine can't read from values.responses, because it's inside cancel, waiting for cache.mu to be unlocked.
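For illustration, here is a minimal, self-contained sketch of the same lock-ordering cycle (hypothetical names; this is not the actual go-control-plane or Kuma code): one goroutine sends to a bounded channel while holding a mutex, and the only reader of that channel must take the same mutex before it can drain it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cache mimics the relevant shape of snapshotCache: one mutex guarding
// its state, plus a bounded per-watch response channel (capacity 5,
// like values.responses in the report above).
type cache struct {
	mu        sync.Mutex
	responses chan string
}

// SetSnapshot holds mu while pushing responses to the watcher. Once the
// channel buffer is full, the send blocks with mu still held.
func (c *cache) SetSnapshot(resp string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.responses <- resp // blocks when the buffer is full
}

// cancelWatch takes mu before doing anything else, so the goroutine
// that should drain c.responses stalls here instead.
func (c *cache) cancelWatch() {
	c.mu.Lock()
	defer c.mu.Unlock()
	// ... delete the watch from the cache ...
}

func main() {
	c := &cache{responses: make(chan string, 5)}

	// Goroutine 1 (the reconciler): keeps calling SetSnapshot until the
	// sixth call blocks on the full channel while holding mu.
	go func() {
		for i := 0; ; i++ {
			c.SetSnapshot(fmt.Sprintf("snapshot-%d", i))
		}
	}()

	time.Sleep(100 * time.Millisecond) // let the channel buffer fill up

	// Goroutine 2 (the sotw server): on a new DiscoveryRequest it cancels
	// the old watch *before* reading from c.responses, so it blocks on mu
	// and the cycle is closed.
	c.cancelWatch()
	<-c.responses // never reached
}
```

This small program deadlocks the same way: the sender holds the lock and blocks on the full channel, while the only reader blocks on the lock.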

cc: @jpeach
