[Bug] Volcano PodGroup Stuck in Inqueue State After RayJob Completes #4473

@fangyinc

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

What happened

After a RayJob completes successfully (status SUCCEEDED / Complete), the associated Volcano PodGroup remains stuck in Inqueue state indefinitely, instead of transitioning to Completed or being deleted.

Production environment example:

kubectl get rayjob
# NAME                     STATUS      JOBDEPLOYMENTSTATUS   ...
# ray-xxxxx-ray-job-xxxxx   SUCCEEDED   Complete

kubectl get podgroup
# NAME                              PHASE     MINMEMBER   AGE
# ray-xxxxx-ray-job-xxxxx-pg        Inqueue   31          2d1h
# ray-yyyyy-ray-job-yyyyy-pg        Inqueue   21          8d

kubectl describe podgroup ray-xxxxx-ray-job-xxxxx-pg
# Status:
#   Phase:      Inqueue
#   Succeeded:  1
# Conditions:
#   Type:          Unschedulable
#   Message:       30/1 tasks in gang unschedulable: pod group is not ready, 1 Succeeded, 31 minAvailable
# Events:
#   Warning  Unschedulable  volcano  0/1 tasks in gang unschedulable...

The RayJob has reached SUCCEEDED, but the PodGroup stays in Inqueue, and Volcano keeps trying to schedule pods that no longer exist.

What you expected to happen

When a RayJob completes successfully, the PodGroup should either:

  1. Transition to Completed status, OR
  2. Be deleted automatically

The PodGroup should not remain in Inqueue state after the job has finished.

Root Cause Analysis

From code analysis, the issue stems from:

  1. PodGroup OwnerReference (volcano_scheduler.go:198-211):

    • For RayJobs, the PodGroup's OwnerReference points to the RayJob, not the RayCluster
    • Kubernetes garbage collection therefore only deletes the PodGroup when the RayJob itself is deleted
  2. No cleanup logic in the RayJob completion handler (rayjob_controller.go:410-431):

    • When the RayJob reaches a terminal state (Complete/Failed), the default behavior is to do nothing
    • There is NO logic to update or delete the associated PodGroup
  3. shutdownAfterJobFinishes only deletes the RayCluster (rayjob_controller.go:1331-1372):

    • Only the RayCluster is deleted, not the PodGroup
    • Comment in the code: "We don't need to delete the submitter Kubernetes Job so that users can still access the driver logs"
    • The PodGroup is completely overlooked
  4. The issue has existed since the initial Volcano integration (PR [Feature] Support Volcano for batch scheduling #755, Dec 2022):

    • The PodGroup creation logic never included cleanup on RayJob completion
    • The BatchScheduler interface only has DoBatchSchedulingOnSubmission and no cleanup method (a hedged sketch of such a hook follows this list)
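
For illustration, a hedged sketch of what such a cleanup hook could look like. The interface name, method name, and signature below are hypothetical and are not part of the current KubeRay code; the existing BatchScheduler interface is elided.

// Hypothetical extension of the batch-scheduler abstraction that the Volcano
// plugin (volcano_scheduler.go) implements. Nothing below exists in KubeRay
// today; it only illustrates the missing hook described above.
package schedulerinterface

import "context"

// BatchSchedulerCleanup would be implemented by the Volcano plugin and called
// by the RayJob controller once the job reaches Complete/Failed, giving the
// plugin a chance to mark the PodGroup Completed or delete it.
type BatchSchedulerCleanup interface {
	CleanupOnCompletion(ctx context.Context, namespace, rayJobName string) error
}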

Impact

  • PodGroups accumulate indefinitely (observed stuck for 8d, 19d in production)
  • Volcano scheduler wastes resources trying to schedule non-existent pods
  • User confusion: completed jobs appear to still be waiting in the queue
  • Hard to distinguish jobs that are actually queued from jobs that have already finished

Reproduction script

# Step 1: Create a queue with limited resources
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ray-queue
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 4
    memory: 8Gi
---
# Step 2: Create RayJob
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: test-ray-job-reproduce
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: ray-queue
spec:
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 0

  rayClusterSpec:
    rayVersion: "2.53.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.53.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "1"
                memory: "2Gi"

    workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: worker-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.53.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "1"
                memory: "2Gi"

  submissionMode: K8sJobMode
  entrypoint: python -c "import ray; ray.init(); print('Job started'); import time; time.sleep(30); print('Job completed successfully')"
  activeDeadlineSeconds: 600
  backoffLimit: 0

Steps to reproduce

  1. Create the queue with limited resources (4 CPU, 8Gi):

    kubectl apply -f queue.yaml
  2. Create the first RayJob (requests 3 CPU, 6Gi in total). The steps below assume two copies of the manifest above, named test-ray-job-1 and test-ray-job-2:

    kubectl apply -f rayjob-1.yaml
  3. Wait for the first RayJob to complete:

    kubectl get rayjob test-ray-job-1
    # STATUS: SUCCEEDED, JOBDEPLOYMENTSTATUS: Complete
  4. Check the first PodGroup status - BUG: stuck in Inqueue instead of Completed:

    kubectl get podgroup ray-test-ray-job-1-pg
    # PHASE: Inqueue (should be Completed or deleted)
  5. Create a second RayJob with the same resource requirements:

    kubectl apply -f rayjob-2.yaml
  6. Observe the second PodGroup - BUG: stuck in Pending indefinitely:

    kubectl get podgroup ray-test-ray-job-2-pg
    # PHASE: Pending (should be able to run since the first job is done)
    kubectl describe podgroup ray-test-ray-job-2-pg
    # Events: queue resource quota insufficient

     The events on PodGroup ray-test-ray-job-2-pg look like:

       Type     Reason         Age                 From     Message
       ----     ------         ----                ----     -------
       Normal   Unschedulable  20s (x25 over 44s)  volcano  queue resource quota insufficient: insufficient cpu, insufficient memory
       Warning  Unschedulable  20s (x25 over 44s)  volcano  3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending, 3 minAvailable; Pending: 3 Unschedulable
    

    The second RayJob cannot run because the first PodGroup still holds the queue resources, even though the first RayJob has already completed.

Anything else

Environment

  • Kubernetes version: v1.29 and v1.34 (both reproduce the issue)
  • KubeRay version: v1.5.1
  • Volcano version: v1.14.0

Possible solutions

  1. Option 1 (Recommended): Update the PodGroup to Completed when the RayJob reaches a terminal state (see the sketch after this list)

    • Add a cleanup method to the BatchScheduler interface
    • Call it when the RayJob transitions to Complete/Failed
  2. Option 2: Delete the PodGroup when the RayJob completes

    • Simpler, but loses the scheduling history
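
A minimal sketch of Option 1, assuming the logic lives in the Volcano scheduler plugin and is invoked by the RayJob controller on terminal states. The helper name markPodGroupCompleted is hypothetical; PodGroupCompleted comes from volcano.sh/apis scheduling/v1beta1, and the sketch assumes the PodGroup CRD exposes a status subresource.

package volcano

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// markPodGroupCompleted (hypothetical helper) flips the RayJob's PodGroup to
// Completed so it stops holding queue quota after the job finishes.
func markPodGroupCompleted(ctx context.Context, c client.Client, namespace, podGroupName string) error {
	pg := &volcanov1beta1.PodGroup{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: podGroupName}, pg); err != nil {
		// The PodGroup may already be gone (e.g. the RayJob was deleted).
		return client.IgnoreNotFound(err)
	}
	if pg.Status.Phase == volcanov1beta1.PodGroupCompleted {
		return nil // already terminal, nothing to do
	}
	pg.Status.Phase = volcanov1beta1.PodGroupCompleted
	// Phase is reported in status; if the CRD does not enable the status
	// subresource, a plain c.Update(ctx, pg) would be needed instead.
	return c.Status().Update(ctx, pg)
}

Option 2 would instead call c.Delete(ctx, pg) at the same point; either way the hook would run when JobDeploymentStatus becomes Complete or Failed.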

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
