EMRContainers JobRun accumulation #2594

@max-wing

Description

Describe the bug
Seems related to #2592

JobRun objects remain on the cluster indefinitely, even though a TTL is set to delete them 24 hours after completion. Removing the finalizer finalizers.emrcontainers.services.k8s.aws/JobRun by patching the object appears to resolve it.
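The workaround amounts to a single JSON merge patch. A minimal sketch; the JobRun name and namespace are placeholders to fill in, and the kubectl invocation is shown commented out since it needs cluster access:

```shell
# JSON merge patch that clears the finalizer list on a stuck JobRun.
# Setting "finalizers" to null removes all finalizers, letting the
# pending deletion complete.
PATCH='{"metadata":{"finalizers":null}}'
echo "$PATCH"

# Apply it to a specific object (placeholders for your cluster):
# kubectl patch jobrun <name> -n <namespace> --type merge -p "$PATCH"
```

Note that this clears every finalizer on the object, not just the ACK one; on these JobRuns the ACK finalizer is the only one present, so the effect is the same.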

Steps to reproduce
I have an Argo CronWorkflow that runs every 20 minutes and creates 6 EMR job runs sequentially. They are not deleted automatically and accumulate over time. Once a few thousand such objects have accumulated, the controller attempts to cancel them (including those already in a non-cancellable/completed state) and gets stuck in a reconciliation error loop.

Example for an error:

{"level":"error","ts":"2025-07-16T08:05:05.079Z","msg":"Reconciler error","controller":"jobrun","controllerGroup":"emrcontainers.services.k8s.aws","controllerKind":"JobRun","JobRun":{"name":"redacted","namespace":"[[redacted]]"},"namespace":"redacted","name":"redacted","reconcileID":"e1e8eef7-e5b2-4d6b-941f-b6fe13d2de47","error":"operation error EMR containers: CancelJobRun, failed to get rate limit token, retry quota exceeded, 1 available, 5 requested","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255"}

Example of a completed job run:

Name:          [[redacted]]
Namespace:     [[redacted]]
Labels:       <none>
Annotations:  <none>
API Version:  emrcontainers.services.k8s.aws/v1alpha1
Kind:         JobRun
Metadata:
  Creation Timestamp:             2025-08-04T15:41:22Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-08-04T16:47:36Z
  Finalizers:
    finalizers.emrcontainers.services.k8s.aws/JobRun
  Generation:  2
  Owner References:
    API Version:           argoproj.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Workflow
    Name:                   [[redacted]]
    UID:                   0ab79df5-f92d-4c7c-89b0-646369c0237b
  Resource Version:        95869895
  UID:                     3b2fd48a-48e7-4e63-b703-4ef75459bf99
Spec:
  Configuration Overrides:  ApplicationConfiguration:
  - classification: spark-defaults
    properties:
      spark.kubernetes.container.image:  [[redacted]]
      spark.kubernetes.driver.podTemplateFile:  [[redacted]]
      spark.kubernetes.executor.podTemplateFile:  [[redacted]]
      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.path: /var/spark/spill
      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.readOnly: "false"
      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.sizeLimit: 80Gi
      spark.kubernetes.node.selector.topology.kubernetes.io/zone: us-east-1b
      spark.local.dir: /var/spark/spill
  - classification: emr-containers-defaults
    properties:
      logging.image:  [[redacted]]

  Execution Role ARN:   [[redacted]]
  Job Driver:
    Spark Submit Job Driver:
      Entry Point:  redacted
      Entry Point Arguments:
        --step
        app-sync
        --run-id
        20250804152002
      Spark Submit Parameters:  --conf spark.executor.instances=20 --conf spark.executor.memory=21G --conf spark.driver.memory=20G --conf spark.executor.cores=4 --conf spark.driver.cores=6
  Name:                         redacted
  Release Label:                emr-7.7.0-latest
  Virtual Cluster Ref:
    From:
      Name:  redacted
Status:
  Ack Resource Metadata:
    Arn:               redacted
    Owner Account ID:  redacted
    Region:            us-east-1
  Conditions:
    Last Transition Time:  2025-08-04T16:47:25Z
    Status:                True
    Type:                  ACK.ReferencesResolved
    Status:                True
    Type:                  ACK.ResourceSynced
    Message:               ValidationException: Job run 0000000363lnqe40g2e is not in a cancellable state
    Status:                True
    Type:                  ACK.Terminal
  Id:                      0000000363lnqe40g2e
  State:                   COMPLETED
Events:                    <none>

My workflow manifest:

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name:  [[redacted]]
  namespace:  [[redacted]]
spec:
  ttlStrategy:
    secondsAfterCompletion: 300
  schedules:
      - '*/20 * * * *'
  workflowSpec:
    serviceAccountName: emr-application-creator
    entrypoint: main
    metrics:
      prometheus:
      - name:  [[redacted]]
        help: "Duration gauge"
        gauge:
          value: "{{workflow.duration}}"
    synchronization:
      mutexes:
        - name:  [[redacted]]
    templates:
      - name: main
        dag:
          tasks:
            [[redacted]]

      - name: data-pipeline-job
        inputs:
          parameters:
            - name: step
            - name: run-id
        ttlStrategy:
          secondsAfterCompletion: 300
        resource:
          action: create
          setOwnerReference: true
          successCondition: status.state == COMPLETED
          failureCondition: status.state == FAILED
          manifest: |
            apiVersion: emrcontainers.services.k8s.aws/v1alpha1
            kind: JobRun
            metadata:
              name: {{workflow.name}}-{{inputs.parameters.step}}
              namespace:  [[redacted]]
            spec:
              name: {{workflow.name}}-{{inputs.parameters.step}}
              virtualClusterRef:
                from:
                  name:  [[redacted]]
              executionRoleARN:  [[redacted]]
              releaseLabel: emr-7.7.0-latest
              jobDriver:
                sparkSubmitJobDriver:
                  entryPoint:  [[redacted]]
                  entryPointArguments:
                  - --step
                  - '{{inputs.parameters.step}}'
                  - --run-id
                  - '{{inputs.parameters.run-id}}'
                  sparkSubmitParameters: "--conf spark.executor.instances=20 \
                    --conf spark.executor.memory=21G \
                    --conf spark.driver.memory=20G \
                    --conf spark.executor.cores=4 \
                    --conf spark.driver.cores=6"
              configurationOverrides: |
                ApplicationConfiguration:
                  - classification: spark-defaults
                    properties:
                      spark.kubernetes.container.image:  [[redacted]]
                      spark.kubernetes.driver.podTemplateFile: [[redacted]]
                      spark.kubernetes.executor.podTemplateFile:  [[redacted]]
                      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.path: /var/spark/spill
                      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.readOnly: "false"
                      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.sizeLimit: 80Gi
                      spark.kubernetes.node.selector.topology.kubernetes.io/zone: us-east-1b
                      spark.local.dir: /var/spark/spill
                  - classification: emr-containers-defaults
                    properties:
                      logging.image:  [[redacted]]

Expected outcome
I expect JobRuns older than the provided TTL to be deleted from the cluster.
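Until the TTL behavior is fixed, the accumulated objects can be cleared in bulk with the same finalizer patch. A rough sketch: the loop emits one patch command per JobRun so they can be reviewed before being piped to sh; the names below are placeholders standing in for real output of kubectl, which is shown in the comment:

```shell
# Print (rather than run) one patch command per stuck JobRun.
# The names here are placeholders; in practice the list would come from:
#   kubectl get jobruns -n <namespace> -o name
for jr in jobrun/example-a jobrun/example-b; do
  echo "kubectl patch $jr -n <namespace> --type merge -p '{\"metadata\":{\"finalizers\":null}}'"
done
```

Reviewing the emitted commands first avoids accidentally stripping finalizers from JobRuns that are still running.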

Environment

  • Kubernetes version: 1.32
  • Using EKS: yes, eks.13 / v1.32.5-eks-5d4a308
  • AWS service targeted: EMR
  • EMR containers controller version: 1.0.26
