[Bug] Volcano PodGroup Stuck in Inqueue State After RayJob Completes #4473

@fangyinc

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

What happened

After a RayJob completes successfully (status SUCCEEDED / Complete), the associated Volcano PodGroup remains stuck in Inqueue state indefinitely, instead of transitioning to Completed or being deleted.

Production environment example:

kubectl get rayjob
# NAME                     STATUS      JOBDEPLOYMENTSTATUS   ...
# ray-xxxxx-ray-job-xxxxx   SUCCEEDED   Complete

kubectl get podgroup
# NAME                              PHASE     MINMEMBER   AGE
# ray-xxxxx-ray-job-xxxxx-pg        Inqueue   31          2d1h
# ray-yyyyy-ray-job-yyyyy-pg        Inqueue   21          8d

kubectl describe podgroup ray-xxxxx-ray-job-xxxxx-pg
# Status:
#   Phase:      Inqueue
#   Succeeded:  1
# Conditions:
#   Type:          Unschedulable
#   Message:       30/1 tasks in gang unschedulable: pod group is not ready, 1 Succeeded, 31 minAvailable
# Events:
#   Warning  Unschedulable  volcano  0/1 tasks in gang unschedulable...

The RayJob has reached SUCCEEDED, but the PodGroup stays in Inqueue, and Volcano keeps trying to schedule pods that no longer exist.

What you expected to happen

When a RayJob completes successfully, the PodGroup should either:

  1. Transition to Completed status, OR
  2. Be deleted automatically

The PodGroup should not remain in Inqueue state after the job has finished.

Root Cause Analysis

From code analysis, the issue stems from:

  1. PodGroup OwnerReference (volcano_scheduler.go:198-211):

    • For RayJobs, the PodGroup's OwnerReference points to the RayJob, not the RayCluster
    • Kubernetes garbage collection therefore only deletes the PodGroup when the RayJob itself is deleted
  2. No cleanup logic in the RayJob completion handler (rayjob_controller.go:410-431):

    • When the RayJob reaches a terminal state (Complete/Failed), the default behavior is to do nothing
    • There is NO logic to update or delete the associated PodGroup
  3. shutdownAfterJobFinishes only deletes the RayCluster (rayjob_controller.go:1331-1372):

    • Only the RayCluster is deleted, not the PodGroup
    • Comment in the code: "We don't need to delete the submitter Kubernetes Job so that users can still access the driver logs"
    • The PodGroup is completely overlooked
  4. The issue has existed since the initial Volcano integration (PR [Feature] Support Volcano for batch scheduling #755, Dec 2022):

    • The PodGroup creation logic never included cleanup on RayJob completion
    • The BatchScheduler interface only has DoBatchSchedulingOnSubmission and no cleanup method (a hedged sketch of such a hook follows this list)
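
For illustration, a hedged sketch of what such a cleanup hook could look like. The interface name, method name, and signature below are hypothetical and are not part of the current KubeRay code; the existing BatchScheduler interface is elided.

// Hypothetical extension of the batch-scheduler abstraction that the Volcano
// plugin (volcano_scheduler.go) implements. Nothing below exists in KubeRay
// today; it only illustrates the missing hook described above.
package schedulerinterface

import "context"

// BatchSchedulerCleanup would be implemented by the Volcano plugin and called
// by the RayJob controller once the job reaches Complete/Failed, giving the
// plugin a chance to mark the PodGroup Completed or delete it.
type BatchSchedulerCleanup interface {
	CleanupOnCompletion(ctx context.Context, namespace, rayJobName string) error
}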

Impact

  • PodGroups accumulate indefinitely (observed stuck for 8d, 19d in production)
  • Volcano scheduler wastes resources trying to schedule non-existent pods
  • User confusion: completed jobs appear to still be waiting in the queue
  • Hard to distinguish jobs that are actually queued from jobs that have already finished

Reproduction script

# Step 1: Create a queue with limited resources
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ray-queue
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 4
    memory: 8Gi
---
# Step 2: Create RayJob
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: test-ray-job-reproduce
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: ray-queue
spec:
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 0

  rayClusterSpec:
    rayVersion: "2.53.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.53.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "1"
                memory: "2Gi"

    workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: worker-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.53.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "1"
                memory: "2Gi"

  submissionMode: K8sJobMode
  entrypoint: python -c "import ray; ray.init(); print('Job started'); import time; time.sleep(30); print('Job completed successfully')"
  activeDeadlineSeconds: 600
  backoffLimit: 0

Steps to reproduce

  1. Create the queue with limited resources (4 CPU, 8Gi):

    kubectl apply -f queue.yaml
  2. Create the first RayJob (requests 3 CPU, 6Gi in total). The steps below assume two copies of the manifest above, named test-ray-job-1 and test-ray-job-2:

    kubectl apply -f rayjob-1.yaml
  3. Wait for the first RayJob to complete:

    kubectl get rayjob test-ray-job-1
    # STATUS: SUCCEEDED, JOBDEPLOYMENTSTATUS: Complete
  4. Check the first PodGroup status - BUG: stuck in Inqueue instead of Completed:

    kubectl get podgroup ray-test-ray-job-1-pg
    # PHASE: Inqueue (should be Completed or deleted)
  5. Create a second RayJob with the same resource requirements:

    kubectl apply -f rayjob-2.yaml
  6. Observe the second PodGroup - BUG: stuck in Pending indefinitely:

    kubectl get podgroup ray-test-ray-job-2-pg
    # PHASE: Pending (should be able to run since the first job is done)
    kubectl describe podgroup ray-test-ray-job-2-pg
    # Events: queue resource quota insufficient

     The events on PodGroup ray-test-ray-job-2-pg look like:

       Type     Reason         Age                 From     Message
       ----     ------         ----                ----     -------
       Normal   Unschedulable  20s (x25 over 44s)  volcano  queue resource quota insufficient: insufficient cpu, insufficient memory
       Warning  Unschedulable  20s (x25 over 44s)  volcano  3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending, 3 minAvailable; Pending: 3 Unschedulable
    

    The second RayJob cannot run because the first PodGroup still holds the queue resources, even though the first RayJob has already completed.

Anything else

Environment

  • Kubernetes version: v1.29 and v1.34 (both reproduce the issue)
  • KubeRay version: v1.5.1
  • Volcano version: v1.14.0

Possible solutions

  1. Option 1 (Recommended): Update the PodGroup to Completed when the RayJob reaches a terminal state (see the sketch after this list)

    • Add a cleanup method to the BatchScheduler interface
    • Call it when the RayJob transitions to Complete/Failed
  2. Option 2: Delete the PodGroup when the RayJob completes

    • Simpler, but loses the scheduling history
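
A minimal sketch of Option 1, assuming the logic lives in the Volcano scheduler plugin and is invoked by the RayJob controller on terminal states. The helper name markPodGroupCompleted is hypothetical; PodGroupCompleted comes from volcano.sh/apis scheduling/v1beta1, and the sketch assumes the PodGroup CRD exposes a status subresource.

package volcano

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// markPodGroupCompleted (hypothetical helper) flips the RayJob's PodGroup to
// Completed so it stops holding queue quota after the job finishes.
func markPodGroupCompleted(ctx context.Context, c client.Client, namespace, podGroupName string) error {
	pg := &volcanov1beta1.PodGroup{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: podGroupName}, pg); err != nil {
		// The PodGroup may already be gone (e.g. the RayJob was deleted).
		return client.IgnoreNotFound(err)
	}
	if pg.Status.Phase == volcanov1beta1.PodGroupCompleted {
		return nil // already terminal, nothing to do
	}
	pg.Status.Phase = volcanov1beta1.PodGroupCompleted
	// Phase is reported in status; if the CRD does not enable the status
	// subresource, a plain c.Update(ctx, pg) would be needed instead.
	return c.Status().Update(ctx, pg)
}

Option 2 would instead call c.Delete(ctx, pg) at the same point; either way the hook would run when JobDeploymentStatus becomes Complete or Failed.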

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
