[BUG] cosmic-comp does not handle PrepareForSleep D-Bus signal — causes NVIDIA suspend failures and 60s freeze timeout #2078

@Lcstyle

Summary

cosmic-comp does not listen for the org.freedesktop.login1.Manager.PrepareForSleep D-Bus signal. During system suspend, this causes a race condition where the NVIDIA driver revokes DRM access while cosmic-comp is still actively rendering. The compositor enters a tight error-retry loop of failing page flips, which causes a 60-second user.slice freeze timeout that delays or prevents proper S3 entry.

Other major Wayland compositors (Mutter, KWin) subscribe to PrepareForSleep and stop rendering before the NVIDIA driver takes control.

Environment

  • OS: Fedora 43, COSMIC Desktop (alpha 6)
  • Kernel: 6.18.7 / 6.18.8
  • GPU: NVIDIA GeForce RTX 3090 Ti
  • Driver: 580.119.02 (Open kernel module)
  • Session: Wayland (cosmic-comp)
  • Sleep mode: S3 deep

NVIDIA Module Configuration

options nvidia_drm modeset=1 fbdev=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=/tmp

What Happens

  1. User triggers systemctl suspend
  2. nvidia-suspend.service runs (ordered Before=systemd-suspend.service)
  3. nvidia-sleep.sh executes chvt 63 then echo "suspend" > /proc/driver/nvidia/suspend
  4. NVIDIA revokes DRM access from userspace
  5. cosmic-comp is still rendering — gets "Permission denied" errors
  6. Failed page flips trigger immediate retry via queue_redraw(true) with no backoff
  7. systemd-suspend.service starts — tries to freeze user.slice
  8. Freeze times out after 60 seconds because cosmic-comp is in a bad state
  9. System either enters a degraded suspend or fails to suspend entirely

Journal Evidence

Feb  9 22:42:36 ICESWORD systemd-logind[6634]: The system will suspend now!
Feb  9 22:42:36 ICESWORD systemd[1]: Reached target sleep.target - Sleep.
Feb  9 22:42:36 ICESWORD systemd[1]: Starting nvidia-suspend.service - NVIDIA system suspend actions...

Feb  9 22:42:36 ICESWORD cosmic-comp[7334]: Failed to submit rendering: Failed to submit result for display
    Caused by:
        0: The underlying drm surface encountered an error: DRM access error: Page flip commit failed on device `Some("/dev/dri/card1")` (Permission denied (os error 13))
        1: DRM access error: Page flip commit failed on device `Some("/dev/dri/card1")` (Permission denied (os error 13))
        2: Permission denied (os error 13)

Feb  9 22:42:36 ICESWORD cosmic-comp[7334]: Failed to submit rendering: [repeated 3x]

Feb  9 22:42:41 ICESWORD systemd[1]: Finished nvidia-suspend.service - NVIDIA system suspend actions.
Feb  9 22:42:41 ICESWORD systemd[1]: Starting systemd-suspend.service - System Suspend...
Feb  9 22:43:41 ICESWORD systemd-sleep[65872]: Failed to freeze unit 'user.slice': Connection timed out
Feb  9 22:43:41 ICESWORD systemd-sleep[65872]: Performing sleep operation 'suspend'...

Note: No kernel PM: suspend entry (deep) message appears — the system may never have fully entered S3.

Root Cause Analysis

After investigating the cosmic-comp source, I identified three contributing issues:

1. No PrepareForSleep D-Bus signal handling (primary cause)

cosmic-comp does not subscribe to org.freedesktop.login1.Manager.PrepareForSleep. The only logind integration is for lid switch inhibition (src/dbus/logind.rs), which calls ManagerProxy::inhibit(HandleLidSwitch, ...) — nothing related to system suspend.

cosmic-comp does have a session pause/resume mechanism via libseat (src/backend/kms/mod.rs:101-119), which correctly handles SessionEvent::PauseSession by suspending input, DRM devices, and render surfaces (src/backend/kms/mod.rs:419-431). However, this only fires on VT switching, not on system suspend.

How other compositors handle this:

| Compositor | PrepareForSleep? | Source |
| --- | --- | --- |
| Mutter | Yes | src/backends/meta-backend.c: prepare_for_sleep_cb(), registered via g_dbus_connection_signal_subscribe() |
| KWin | Yes | src/logind.cpp: logind class handler |
| Sway | Delegated | Handled by swayidle daemon rather than compositor core |
| cosmic-comp | No | Only lid switch inhibition in src/dbus/logind.rs |

2. Race between chvt 63 and NVIDIA GPU takeover

The NVIDIA suspend mechanism is handled by /usr/bin/nvidia-sleep.sh, shipped by the xorg-x11-drv-nvidia-power RPM package (Fedora's packaging of NVIDIA's official power management scripts). This is not a local workaround — it is NVIDIA's documented suspend mechanism. The relevant excerpt:

# From /usr/bin/nvidia-sleep.sh (nvidia-sleep.sh, xorg-x11-drv-nvidia-power-580.119.02)
suspend|hibernate)
    mkdir -p "${RUN_DIR}"
    fgconsole > "${XORG_VT_FILE}"
    chvt 63
    if [[ $? -ne 0 ]]; then
        exit $?
    fi
    echo "$1" > /proc/driver/nvidia/suspend
    ;;

What chvt 63 does: Linux supports up to 63 virtual terminals (VT1–VT63). The Wayland session typically runs on a low VT (VT1 or VT2). chvt 63 switches to an unused terminal, which forces the current session to lose DRM master status; this is what leads libseat to emit SessionEvent::PauseSession. After the VT switch, the script writes "suspend" to /proc/driver/nvidia/suspend so that the driver saves VRAM and powers down the GPU. On resume, RestoreVT() switches back to the original VT, restoring DRM master.

The race: chvt 63 and the /proc write happen back-to-back in the same script with no wait for the session to actually process the VT switch. The chvt 63 triggers an async D-Bus notification chain (kernel → logind → libseat → cosmic-comp's calloop), but NVIDIA takes exclusive GPU control synchronously on the very next line. cosmic-comp never gets a chance to process the pause event.

Given enough time, chvt 63 would trigger SessionEvent::PauseSession through the libseat chain, which would call pause_session() and properly suspend all render surfaces (src/backend/kms/surface/mod.rs:666-686). In practice the write to /proc/driver/nvidia/suspend wins the race, so DRM access has already been revoked by the time the event loop could dispatch the pause event.

3. Render loop retries failed DRM commits with no backoff

When redraw() fails at src/backend/kms/surface/mod.rs:953-956:

if let Err(err) = state.redraw(estimated_presentation) {
    let name = state.output.name();
    warn!(?name, "Failed to submit rendering: {:?}", err);
    state.queue_redraw(true);  // Immediate retry, no backoff
}

Failed page flips immediately schedule another render attempt via queue_redraw(true). There is no check for EACCES/Permission denied, no exponential backoff, and no circuit breaker. The redraw() function also does not check self.active before attempting DRM operations — it only checks self.compositor.as_mut() (src/backend/kms/surface/mod.rs:992). This creates a tight loop of failing DRM commits during the window between NVIDIA taking control and the system entering suspend.

Suggested Fixes

Fix 1: Subscribe to PrepareForSleep (recommended)

Add a D-Bus signal subscription to org.freedesktop.login1.Manager.PrepareForSleep in the KMS backend initialization. When PrepareForSleep(true) is received (suspend imminent), call the existing pause_session(). When PrepareForSleep(false) is received (resume), call resume_session().

The infrastructure already exists — logind-zbus is already a dependency (optional, feature-gated as systemd), and the pause_session()/resume_session() handlers properly suspend and resume all render surfaces, input devices, and DRM devices. This is straightforward to wire up.
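
A minimal sketch of the dispatch logic for Fix 1, using only the standard library. The SessionControl trait, on_prepare_for_sleep(), and RecordingSession are hypothetical names introduced for illustration; they stand in for the real KMS state and its pause_session()/resume_session() methods. The actual D-Bus subscription (for example through the existing logind-zbus dependency) is deliberately left out, since its exact wiring in cosmic-comp is not shown in this issue.

```rust
/// Hypothetical stand-in for the compositor's KMS state. The real
/// pause_session()/resume_session() live in src/backend/kms/mod.rs.
pub trait SessionControl {
    fn pause_session(&mut self);
    fn resume_session(&mut self);
}

/// Handle the boolean payload of
/// org.freedesktop.login1.Manager.PrepareForSleep:
/// `true` is sent shortly before the system sleeps,
/// `false` is sent after it wakes up again.
pub fn on_prepare_for_sleep<S: SessionControl>(session: &mut S, going_to_sleep: bool) {
    if going_to_sleep {
        // Stop rendering before the driver takes the GPU away.
        session.pause_session();
    } else {
        // System is back: re-acquire devices and resume rendering.
        session.resume_session();
    }
}

/// Test double (assumption, not cosmic-comp code) that records
/// which session methods were called, in order.
#[derive(Default, Debug)]
pub struct RecordingSession {
    pub calls: Vec<&'static str>,
}

impl SessionControl for RecordingSession {
    fn pause_session(&mut self) {
        self.calls.push("pause");
    }

    fn resume_session(&mut self) {
        self.calls.push("resume");
    }
}
```

To guarantee that the pause completes before suspend proceeds, a compositor would probably also need to hold a logind delay inhibitor lock and release it once pause_session() has returned; that part is not sketched here.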

Fix 2: Add backoff/circuit breaker for DRM errors

When redraw() returns an error, especially EACCES, the retry should use exponential backoff or stop entirely rather than immediately rescheduling. This would prevent the tight error loop that contributes to the freeze timeout.
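
One possible shape for Fix 2, as a standard-library-only sketch. RedrawBackoff and RetryDecision are hypothetical types, not existing cosmic-comp or smithay APIs, and the sketch assumes the DRM error can be reduced to a std::io::ErrorKind. Permission errors stop retries outright; other errors back off exponentially up to a cap and give up after a configurable number of consecutive failures.

```rust
use std::io;
use std::time::Duration;

/// What the render loop should do after a failed redraw.
#[derive(Debug, PartialEq, Eq)]
pub enum RetryDecision {
    /// Schedule another redraw after this delay.
    RetryAfter(Duration),
    /// Do not reschedule; wait for an external event (e.g. session resume).
    Stop,
}

/// Hypothetical per-surface backoff state.
#[derive(Debug)]
pub struct RedrawBackoff {
    base: Duration,
    max: Duration,
    max_failures: u32,
    failures: u32,
}

impl RedrawBackoff {
    pub fn new(base: Duration, max: Duration, max_failures: u32) -> Self {
        Self { base, max, max_failures, failures: 0 }
    }

    /// Call after a successful redraw to reset the failure counter.
    pub fn on_success(&mut self) {
        self.failures = 0;
    }

    /// Call after a failed redraw with the kind of the underlying error.
    pub fn on_failure(&mut self, kind: io::ErrorKind) -> RetryDecision {
        // EACCES means DRM access was revoked; retrying cannot help.
        if kind == io::ErrorKind::PermissionDenied {
            return RetryDecision::Stop;
        }
        self.failures = self.failures.saturating_add(1);
        // Circuit breaker: give up after too many consecutive failures.
        if self.failures > self.max_failures {
            return RetryDecision::Stop;
        }
        // Delay is base * 2^(failures - 1), capped at `max`.
        let shift = (self.failures - 1).min(16);
        let delay = self.base.saturating_mul(1u32 << shift).min(self.max);
        RetryDecision::RetryAfter(delay)
    }
}
```

In the render loop, a Stop decision would replace the unconditional state.queue_redraw(true); rendering would then restart only from resume_session() or a comparable external trigger.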

Fix 3: Check session active state in render path

The redraw() function could check self.active.load(Ordering::SeqCst) before attempting DRM operations. Currently active is set to false during suspend() (src/backend/kms/surface/mod.rs:667), but redraw() does not check this flag. Similarly, queue_redraw() could skip scheduling if the surface is inactive.
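
The guard described in Fix 3 could look roughly like the following. SurfaceSketch and RedrawOutcome are hypothetical simplifications; the real surface state in src/backend/kms/surface/mod.rs has many more fields, and the counter below merely stands in for an actual DRM page flip.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Result of a redraw attempt in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum RedrawOutcome {
    Submitted,
    SkippedInactive,
}

/// Hypothetical, heavily simplified render surface.
pub struct SurfaceSketch {
    /// Shared flag, cleared when the surface is suspended.
    active: Arc<AtomicBool>,
    /// Number of commits attempted (stand-in for real DRM work).
    pub commits: u32,
}

impl SurfaceSketch {
    pub fn new(active: Arc<AtomicBool>) -> Self {
        Self { active, commits: 0 }
    }

    pub fn redraw(&mut self) -> RedrawOutcome {
        // Bail out before touching DRM if the session is paused.
        if !self.active.load(Ordering::SeqCst) {
            return RedrawOutcome::SkippedInactive;
        }
        // Stand-in for the page flip commit.
        self.commits += 1;
        RedrawOutcome::Submitted
    }
}
```

The same early return could be applied in queue_redraw() so that no redraw is scheduled while the surface is inactive.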

Current End-User Workaround

To work around this issue, I have created systemd drop-in overrides that send SIGSTOP to cosmic-comp before NVIDIA takes GPU control, and SIGCONT after resume:

/etc/systemd/system/nvidia-suspend.service.d/10-freeze-cosmic-first.conf:

# Workaround for cosmic-comp DRM permission errors during suspend
#
# Problem: nvidia-suspend.service runs BEFORE systemd-suspend freezes user
# processes. When NVIDIA takes exclusive GPU control, cosmic-comp is still
# trying to render, causing "DRM access error: Permission denied" errors
# and a 60-second freeze timeout that prevents proper S3 entry.
#
# Solution: Send SIGSTOP to cosmic-comp before NVIDIA takes control.

[Service]
ExecStartPre=/bin/sh -c 'pkill -STOP -x cosmic-comp 2>/dev/null || true'
ExecStartPre=/bin/sleep 0.3

/etc/systemd/system/nvidia-resume.service.d/10-resume-cosmic.conf:

[Service]
ExecStartPost=/bin/sh -c 'pkill -CONT -x cosmic-comp 2>/dev/null || true'

This workaround is effective but fragile: it relies on process naming, timing heuristics (sleep 0.3), and external service orchestration to work around what should be compositor-internal behavior. It may need to be re-checked after NVIDIA driver updates that change the service units, and it can silently break if the cosmic-comp binary is renamed or wrapped.

The fact that end users must create custom systemd drop-in overrides to SIGSTOP the compositor process — essentially force-freezing it from the outside because it has no internal suspend awareness — speaks to the severity of this gap. This is a fundamental lifecycle event that the compositor should handle natively.
