
jac-scale: code-sync pod fails with Multi-Attach error when code-server holds PVC on a different node #5239

@udithishanka

Description


Summary

When jac start --scale is run to update a deployed app, jac-scale creates a code-sync pod to push new code to the PVC. If the existing code-server pod is scheduled on a different node, the EBS volume (ReadWriteOnce) cannot be mounted by both pods simultaneously, causing a Multi-Attach error. The code-sync pod stays stuck in ContainerCreating indefinitely, so the app code is never updated even though CI reports a successful deployment.

Observed Behaviour

Warning  FailedAttachVolume  kubelet  Multi-Attach error for volume "pvc-...":
Volume is already used by pod(s) jac-builder-dev-code-server-<hash>
  • code-sync pod stays in ContainerCreating indefinitely
  • jac start --scale hangs until the CI timeout wrapper kills it (exit code 124)
  • kubectl rollout status passes (the Deployment rolls out, just with old code)
  • App appears healthy but is running stale code from before the CI run

Root Cause

jac-scale's code distribution architecture:

  1. code-server — a permanent Deployment (busybox httpd) that mounts the code PVC and serves it as a tar.gz over HTTP
  2. code-sync — a standalone Pod created at deploy time to push updated code onto the same PVC
  3. Main app init container — downloads the tar.gz from code-server on pod startup

The problem: EBS volumes are ReadWriteOnce — an EBS volume can be attached to only one node at a time, so two pods can share it only if they are scheduled on that same node. jac-scale creates code-sync without ensuring it lands on the same node as code-server, and without scaling code-server down first to release the volume. When K8s schedules them on different nodes (which is common in multi-node clusters), the Multi-Attach error occurs.
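For reference, the kind of claim involved looks like this (resource names here are illustrative, not taken from the cluster):

# Illustrative EBS-backed claim via the EBS CSI driver. ReadWriteOnce
# means the backing volume attaches to a single node at a time, so pods
# on two different nodes can never mount it simultaneously.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jac-code-pvc        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 1Gi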

Steps to Reproduce

  1. Deploy an app with jac start --scale on a multi-node EKS cluster
  2. Push a code change and trigger jac start --scale again
  3. Observe code-sync pod stuck in ContainerCreating with Multi-Attach error
  4. The app deployment rolls out (readiness probe passes on old code) — CI shows green
  5. Exec into the new pod: the code is unchanged from the first deployment

Impact

  • Code updates silently fail — CI is green but the app runs stale code
  • Developers have no indication the code update didn't apply
  • The only workaround is a manual kubectl cp into the code-sync pod, which requires engineering a brief window in which code-server has released the volume

Proposed Fix

Option 1 (recommended): Scale down code-server before code-sync runs

In jac-scale's deploy/update logic, before creating the code-sync pod:

# Scale code-server to 0 to release the PVC
kubectl scale deployment {app}-code-server -n {namespace} --replicas=0
kubectl wait --for=delete pod -l app={app}-code-server -n {namespace} --timeout=60s

# Now code-sync can mount the PVC exclusively
# ... push code to PVC via code-sync ...

# Scale code-server back up (it repacks the tar.gz)
kubectl scale deployment {app}-code-server -n {namespace} --replicas=1

Option 2: Node affinity

Add a podAffinity rule to the code-sync pod spec so it is always scheduled on the same node as the code-server pod. A single EBS attachment can be mounted by multiple pods on that node, so co-scheduling sidesteps the Multi-Attach error.
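A sketch of such a rule, assuming the code-server pods carry an app: {app}-code-server label (the label key/value is an assumption):

# Added to the code-sync pod spec: co-schedule with code-server so both
# pods mount the already-attached EBS volume from the same node.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: jac-builder-dev-code-server   # assumed label
        topologyKey: kubernetes.io/hostname

Note this uses requiredDuringScheduling, so if no node can satisfy the rule (e.g. code-server is down), code-sync stays Pending rather than scheduling elsewhere.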

Option 3: Switch PVC to ReadWriteMany (EFS)

Replace the EBS-backed PVC with an EFS-backed one (storageClassName: efs-sc, accessModes: [ReadWriteMany]). This allows simultaneous mounts from any number of nodes and eliminates the problem entirely, but requires EFS to be provisioned in the cluster.
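A minimal sketch of the replacement claim, assuming the EFS CSI driver and an efs-sc StorageClass already exist in the cluster (the PVC name is illustrative):

# Illustrative EFS-backed claim; requires the EFS CSI driver and an
# "efs-sc" StorageClass to be provisioned in the cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jac-code-pvc
spec:
  accessModes:
    - ReadWriteMany       # any number of nodes may mount concurrently
  storageClassName: efs-sc
  resources:
    requests:
      storage: 1Gi        # EFS is elastic; this request is nominal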

Environment

  • EKS (us-east-2), multi-node cluster
  • EBS gp2/gp3 PVC (ReadWriteOnce)
  • jac-scale with --scale --experimental flags
  • Confirmed on jaseci-cluster, namespace jac-builder-dev

Workaround (for deploy workflows)

Add a scale-down/scale-up step around the jac-scale deploy in CI:

- name: Release PVC before deploy
  run: |
    kubectl scale deployment {app}-code-server -n {namespace} --replicas=0
    kubectl wait --for=delete pod -l app={app}-code-server -n {namespace} --timeout=60s || true

- name: Deploy with jac-scale
  run: timeout 600 jac start main.jac --scale --experimental || ...

- name: Restore code-server
  run: kubectl scale deployment {app}-code-server -n {namespace} --replicas=1

Labels

bug — Something isn't working as expected.