Controller panics due to concurrent map writes when using semaphore #15218

@qti-haeyoon

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Note: This is not deterministically reproducible because it is a concurrency issue.

The controller sometimes panics with concurrent map writes when a semaphore is used. This is suspected to be a regression caused by #14321. That PR switched to a read-write mutex and took RLock instead of Lock, but release() deletes the key from the map when releasing the lock, which is a write operation. Because RLock can be held by several goroutines at once, concurrent calls to release() can write to the map at the same time, which crashes the controller. The suggested fix is to switch back to an exclusive lock.

Version(s)

v3.7.3

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: synchronization-tmpl-level-
  labels:
    workflows.argoproj.io/no-test: "environment"
spec:
  entrypoint: synchronization-tmpl-level-example
  templates:
  - name: synchronization-tmpl-level-example
    steps:
    - - name: synchronization-acquire-lock
        template: acquire-lock
        arguments:
          parameters:
          - name: seconds
            value: "{{item}}"
        withParam: '["1","2","3","4","5"]'

  - name: acquire-lock
    synchronization:
      semaphores: # v3.6 and after
        - configMapKeyRef:
            name: my-config
            key: template
    container:
      image: alpine:3.23
      command: [sh, -c]
      args: ["sleep 10; echo acquired lock"]

Logs from the workflow controller

goroutine 447 [running]:
internal/runtime/maps.fatal({0x2fdb7b7?, 0xc006f84a00?})
 /usr/local/go/src/runtime/panic.go:1058 +0x18
github.com/argoproj/argo-workflows/v3/workflow/sync.(*prioritySemaphore).release(0xc00044d680, {0xc0380a9180, 0x9b})
 /go/src/github.com/argoproj/argo-workflows/workflow/sync/semaphore.go:101 +0x5c
github.com/argoproj/argo-workflows/v3/workflow/sync.(*Manager).Release(0xc0004ff180, {0x3440998, 0xc02aeda810}, 0xc004951688, {0xc03b5056d0?, 0xc005419500?}, 0xc03a7f6000)
 /go/src/github.com/argoproj/argo-workflows/workflow/sync/sync_manager.go:478 +0x250
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate(0xc007a5da40, {0x3440998, 0xc02aeda810}, {0xc01249c460, 0x4d}, {0x3447f60, 0xc009aa8900}, 0xc01704da40, {{0x0, 0x0, ...}, ...}, ...)
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2047 +0x4bc8
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeStepGroup(0xc007a5da40, {0x3440998, 0xc02aeda810}, {0xc009aa83c0, 0x1, 0x1}, {0xc03b505f90, 0x42}, 0xc02d843f80)
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/steps.go:285 +0x567
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeSteps(0xc007a5da40, {0x3440998, 0xc02aeda810}, {0xc00c2fa000, 0x3f}, 0xc01704da40, {0xc03b505d60, 0x45}, 0xc00541d8c8, {0x3447f60, ...}, ...)
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/steps.go:110 +0x685
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate(0xc007a5da40, {0x3440998, 0xc02aeda810}, {0xc00c2fa000, 0x3f}, {0x3447f60, 0xc007a5de00}, 0xc01704da00, {{0xc035aaaa80, 0x1, ...}, ...}, ...)
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2294 +0x3428
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate(0xc007a5da40, {0x3440998, 0xc02aeda810})
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:368 +0x1d6b
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem(0xc0006eb708, {0x34409d0, 0xc00044cd70})
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:766 +0x728
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker(0xc0006eb708, {0x34409d0, 0xc00044cd70})
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:677 +0x9e
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext.func1({0x34409d0?, 0xc00044cd70?}, 0xc001d0c000?)
 /go/pkg/mod/k8s.io/apimachinery@v0.33.1/pkg/util/wait/backoff.go:255 +0x51
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext({0x34409d0, 0xc00044cd70}, 0xc002ae4a20, {0x3404760, 0xc001d0c000}, 0x1)
 /go/pkg/mod/k8s.io/apimachinery@v0.33.1/pkg/util/wait/backoff.go:256 +0xe5
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x34409d0, 0xc00044cd70}, 0xc002ae4a20, 0x3b9aca00, 0x0, 0x1)
 /go/pkg/mod/k8s.io/apimachinery@v0.33.1/pkg/util/wait/backoff.go:223 +0x8f
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
 /go/pkg/mod/k8s.io/apimachinery@v0.33.1/pkg/util/wait/backoff.go:172
created by github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run in goroutine 202
 /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:382 +0x19bc

Logs from your workflow's wait container

N/A

Labels

type/regression: Regression from previous behavior (a specific type of bug)