Describe the bug
Seems related to #2592
JobRun objects are kept on the cluster indefinitely, although a TTL is set to delete them 24h after completion. Removing the finalizers.emrcontainers.services.k8s.aws/JobRun finalizer by patching the object seems to resolve it (see the sketch below).
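For reference, this is roughly the patch I apply as a workaround. The resource name and namespace are placeholders, and the JSON-patch index 0 assumes the ACK finalizer is the only finalizer on the object:

# remove the stuck ACK finalizer so Kubernetes can finish deleting the object
kubectl patch jobruns.emrcontainers.services.k8s.aws <name> -n <namespace> \
  --type=json -p '[{"op": "remove", "path": "/metadata/finalizers/0"}]'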
Steps to reproduce
I have an Argo CronWorkflow that runs every 20 minutes and creates 6 EMR JobRuns sequentially. They are not deleted automatically and accumulate over time. Once a few thousand of these objects have piled up, the controller starts trying to cancel them (even those already in a non-cancellable/completed state) and gets stuck in a reconciliation error loop.
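For context, a quick way to see the accumulation (the namespace is a placeholder):

# count JobRun objects lingering in the namespace
kubectl get jobruns.emrcontainers.services.k8s.aws -n <namespace> --no-headers | wc -l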
An example of the error:
{"level":"error","ts":"2025-07-16T08:05:05.079Z","msg":"Reconciler error","controller":"jobrun","controllerGroup":"emrcontainers.services.k8s.aws","controllerKind":"JobRun","JobRun":{"name":"redacted","namespace":"[[redacted]]"},"namespace":"redacted","name":"redacted","reconcileID":"e1e8eef7-e5b2-4d6b-941f-b6fe13d2de47","error":"operation error EMR containers: CancelJobRun, failed to get rate limit token, retry quota exceeded, 1 available, 5 requested","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255"}
Example of a completed job run:
Name:         [[redacted]]
Namespace:    [[redacted]]
Labels:       <none>
Annotations:  <none>
API Version:  emrcontainers.services.k8s.aws/v1alpha1
Kind:         JobRun
Metadata:
  Creation Timestamp:             2025-08-04T15:41:22Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-08-04T16:47:36Z
  Finalizers:
    finalizers.emrcontainers.services.k8s.aws/JobRun
  Generation:  2
  Owner References:
    API Version:           argoproj.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Workflow
    Name:                  [[redacted]]
    UID:                   0ab79df5-f92d-4c7c-89b0-646369c0237b
  Resource Version:  95869895
  UID:               3b2fd48a-48e7-4e63-b703-4ef75459bf99
Spec:
  Configuration Overrides:  ApplicationConfiguration:
    - classification: spark-defaults
      properties:
        spark.kubernetes.container.image: [[redacted]]
        spark.kubernetes.driver.podTemplateFile: [[redacted]]
        spark.kubernetes.executor.podTemplateFile: [[redacted]]
        spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.path: /var/spark/spill
        spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.readOnly: "false"
        spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.sizeLimit: 80Gi
        spark.kubernetes.node.selector.topology.kubernetes.io/zone: us-east-1b
        spark.local.dir: /var/spark/spill
    - classification: emr-containers-defaults
      properties:
        logging.image: [[redacted]]
  Execution Role ARN:  [[redacted]]
  Job Driver:
    Spark Submit Job Driver:
      Entry Point:  redacted
      Entry Point Arguments:
        --step
        app-sync
        --run-id
        20250804152002
      Spark Submit Parameters:  --conf spark.executor.instances=20 --conf spark.executor.memory=21G --conf spark.driver.memory=20G --conf spark.executor.cores=4 --conf spark.driver.cores=6
  Name:           redacted
  Release Label:  emr-7.7.0-latest
  Virtual Cluster Ref:
    From:
      Name:  redacted
Status:
  Ack Resource Metadata:
    Arn:               redacted
    Owner Account ID:  redacted
    Region:            us-east-1
  Conditions:
    Last Transition Time:  2025-08-04T16:47:25Z
    Status:                True
    Type:                  ACK.ReferencesResolved
    Status:                True
    Type:                  ACK.ResourceSynced
    Message:               ValidationException: Job run 0000000363lnqe40g2e is not in a cancellable state
    Status:                True
    Type:                  ACK.Terminal
  Id:     0000000363lnqe40g2e
  State:  COMPLETED
Events:  <none>
My workflow manifest:
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: [[redacted]]
  namespace: [[redacted]]
spec:
  ttlStrategy:
    secondsAfterCompletion: 300
  schedules:
    - '*/20 * * * *'
  workflowSpec:
    serviceAccountName: emr-application-creator
    entrypoint: main
    metrics:
      prometheus:
        - name: [[redacted]]
          help: "Duration gauge"
          gauge:
            value: "{{workflow.duration}}"
    synchronization:
      mutexes:
        - name: [[redacted]]
    templates:
      - name: main
        dag:
          tasks:
            [[redacted]]
      - name: data-pipeline-job
        inputs:
          parameters:
            - name: step
            - name: run-id
        ttlStrategy:
          secondsAfterCompletion: 300
        resource:
          action: create
          setOwnerReference: true
          successCondition: status.state == COMPLETED
          failureCondition: status.state == FAILED
          manifest: |
            apiVersion: emrcontainers.services.k8s.aws/v1alpha1
            kind: JobRun
            metadata:
              name: {{workflow.name}}-{{inputs.parameters.step}}
              namespace: [[redacted]]
            spec:
              name: {{workflow.name}}-{{inputs.parameters.step}}
              virtualClusterRef:
                from:
                  name: [[redacted]]
              executionRoleARN: [[redacted]]
              releaseLabel: emr-7.7.0-latest
              jobDriver:
                sparkSubmitJobDriver:
                  entryPoint: [[redacted]]
                  entryPointArguments:
                    - --step
                    - '{{inputs.parameters.step}}'
                    - --run-id
                    - '{{inputs.parameters.run-id}}'
                  sparkSubmitParameters: "--conf spark.executor.instances=20 \
                    --conf spark.executor.memory=21G \
                    --conf spark.driver.memory=20G \
                    --conf spark.executor.cores=4 \
                    --conf spark.driver.cores=6"
              configurationOverrides: |
                ApplicationConfiguration:
                  - classification: spark-defaults
                    properties:
                      spark.kubernetes.container.image: [[redacted]]
                      spark.kubernetes.driver.podTemplateFile: [[redacted]]
                      spark.kubernetes.executor.podTemplateFile: [[redacted]]
                      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.path: /var/spark/spill
                      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.readOnly: "false"
                      spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-sparkspill.mount.sizeLimit: 80Gi
                      spark.kubernetes.node.selector.topology.kubernetes.io/zone: us-east-1b
                      spark.local.dir: /var/spark/spill
                  - classification: emr-containers-defaults
                    properties:
                      logging.image: [[redacted]]
Expected outcome
I expect JobRun objects older than the configured TTL to be deleted from the cluster.
Environment
- Kubernetes version: 1.32
- Using EKS (yes/no), if so version? Yes, eks.13 / v1.32.5-eks-5d4a308
- AWS service targeted (S3, RDS, etc.): EMR
- EMR containers controller version: 1.0.26