Skip to content

Fix black screen when login(1) calls vhangup()#1529

Open
sdirkwinkel wants to merge 1 commit intotsl0922:mainfrom
sdirkwinkel:steffend/vhangup-fix
Open

Fix black screen when login(1) calls vhangup()#1529
sdirkwinkel wants to merge 1 commit intotsl0922:mainfrom
sdirkwinkel:steffend/vhangup-fix

Conversation

@sdirkwinkel
Copy link
Copy Markdown

Related to #768

This was mostly investigated and written by claude, I only tested it and did minor cleanups.
Feel free to close / do this differently, I'm not an expert in ttyd or libuv

The C-State behaviour was on my machines with a rt kernel, It seems from the issue this is also more common without
limiting C-States.

Symptom

When ttyd's command is login (or any program that revokes its controlling terminal during startup), the browser-side terminal intermittently comes up as a black screen with the keyboard appearing dead. The bug is timing-dependent: it reproduces reliably on hosts where deeper CPU C-states are disabled, and disappears (or only happens sporadically) when the kernel is allowed to enter C3/C6.

Root cause

login(1) calls vhangup(2) to revoke the controlling terminal before re-attaching to it via open()/TIOCSCTTY. vhangup() sends SIGHUP to the foreground process group (login ignores it) and forces subsequent I/O on the existing slave fd to fail. On the master side, this surfaces as a transient read() == -1, errno == EIO for as long as no slave fd is open. Once login reopens the slave by device path, the master recovers and reads succeed again.

The window between vhangup() and the slave being reopened is short (microseconds when the CPU is hot, much longer when waking from a deep C-state). ttyd's libuv read loop only loses the race when it manages to dispatch a read inside that window:

  1. uv__read() in libuv's src/unix/stream.c calls read(), which returns -1 / EIO.
  2. libuv invokes our read_cb with nread == UV_EIO AND, critically, clears UV_HANDLE_READABLE | UV_HANDLE_WRITABLE on the stream and stops its io_watcher (stream.c, the !EAGAIN branch around the stream->flags &= ~(UV_HANDLE_READABLE | UV_HANDLE_WRITABLE) line).
  3. Any subsequent uv_read_start() on the same uv_pipe_t is a no-op
    • the flags are gone and the io_watcher will not be re-armed. The handle is permanently poisoned even though the underlying kernel master fd recovers a millisecond later.

Result: process->out is dead, no further output reaches the websocket, the browser shows a black screen. process->in is a separate uv_pipe_t over its own dup() of the master fd, so writes (keystrokes) typically still work because users only type after login has finished and the slave is back - the input side is rarely exercised inside the EIO window. This is why the symptom looks like "no output, blind input" rather than a fully dead session.

Three sub-bugs in the existing code conspired here:

  1. process_read_cb (protocol.c) stored NULL in pss->pty_buf on a transient EIO and asked LWS for a writable callback that produced nothing, so even a working pipe would have been left paused.
  2. pty_pause() / pty_resume() and read_cb() never updated process->paused, so the if (!paused) return guards in those functions made flow control depend on stale state. After a uv_read_stop in read_cb, paused was still false, so subsequent pty_resume() short-circuited.
  3. Even with (1) and (2) fixed, the poisoned uv_pipe_t cannot be re-armed - libuv has cleared the flags and a fresh handle is required.

Fix

src/pty.c read_cb: when read fails with UV_EIO and the child is still running, treat it as transient and schedule a reconnect of the read side instead of propagating EOF upstream. Other read errors fall through to the existing EOF path so we don't paper over real failures.

src/pty.c pty_reconnect_out: close the poisoned uv_pipe_t, dup() the still-valid master fd into a fresh handle, and re-arm reading. Retry every 20ms up to 20 times (400ms total), which is more than enough for login to finish reopening the slave. On the last attempt the upper layer is notified with EOF so the websocket closes cleanly.

src/pty.c pty_pause / pty_resume / read_cb: keep process->paused in sync with the actual io_watcher state so pty_resume() will re-arm reading as designed.

src/pty.h: add reconnect_timer + reconnect_attempts to pty_process. The timer is allocated in pty_spawn and freed in process_free after the wait thread is joined, so there is no UAF window. The reconnect helper is file-static; there is no public API change.

Notes / known limitations

  • Only process->out is reconnected. process->in is a separate handle and was not observed to break in practice (writes happen after the user starts typing, by which time login is done). If a future workload triggers EIO on the write side, the symmetric fix would be to attempt reconnect from pty_write() on EPIPE.
  • process_running() uses uv_kill(pid, 0), which returns success for zombies. In the narrow window between child exit and wait_cb reaping it, we may schedule one or two pointless reconnects before the next dup()+read() returns EOF and the session closes normally.

Fixes: #768

Code was written and tested by claude opus 4.6

Standalone reproducer (compile with -luv, run as root):

/*

  • test_ttyd_vhangup_fix.c - integration test for vhangup recovery.
  • Forkpty's a child that writes output, calls vhangup(), reopens the
  • slave PTY by device path, then writes more output. Parent uses the
  • same one-shot read + reconnect pattern as the fix. Passes when
  • AFTER output is received through the new pipe. */ #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <stdbool.h> #include <string.h> #include <unistd.h> #include <fcntl.h> #include <pty.h> #include <sys/wait.h> #include <sys/ioctl.h> #include <uv.h> #include <signal.h> #include <time.h> #include <errno.h>

static uv_loop_t *loop;
static uv_pipe_t *pty_out;
static uv_pipe_t pty_in;
static uv_timer_t timeout_timer;
static uv_timer_t reconnect_timer;
static uv_idle_t writable_idle;
static int master_fd = -1;
static int child_pid = 0;
static bool paused = true;
static int reconnect_attempts = 0;

#define MAX_RECONNECT_ATTEMPTS 10
#define RECONNECT_BASE_DELAY_MS 100

static char *pending_buf = NULL;
static size_t pending_len = 0;
static bool pending_close = false;
static int total_reads = 0;
static int total_eios = 0;
static bool saw_before = false;
static bool saw_after = false;
static struct timespec start_time;

static double elapsed(void) {
struct timespec now;
clock_gettime(CLOCK_MONOTONIC, &now);
return (now.tv_sec - start_time.tv_sec) +
(now.tv_nsec - start_time.tv_nsec) / 1e9;
}
static bool process_running(void) {
return child_pid > 0 && kill(child_pid, 0) == 0;
}
static void alloc_cb(uv_handle_t *h, size_t s, uv_buf_t *b) {
(void)h; b->base = malloc(s); b->len = s;
}
static void close_cb(uv_handle_t *h) { free(h); }
static bool fd_duplicate(int fd, uv_pipe_t *pipe) {
int d = dup(fd);
if (d < 0) return false;
fcntl(d, F_SETFD, FD_CLOEXEC);
if (uv_pipe_open(pipe, d)) { close(d); return false; }
return true;
}

static void read_cb(uv_stream_t *stream, ssize_t n, const uv_buf_t *buf);
static void writable_cb(uv_idle_t *idle);

static void pty_resume(void) {
if (!paused || !pty_out) return;
paused = false;
pty_out->data = NULL;
if (uv_read_start((uv_stream_t *)pty_out, alloc_cb, read_cb))
paused = true;
}
static void reconnect_timer_cb(uv_timer_t *t) {
(void)t;
if (pty_out) {
uv_read_stop((uv_stream_t *)pty_out);
uv_close((uv_handle_t *)pty_out, close_cb);
pty_out = NULL;
}
if (!process_running()) return;
pty_out = malloc(sizeof(uv_pipe_t));
uv_pipe_init(loop, pty_out, 0);
if (!fd_duplicate(master_fd, pty_out)) {
uv_close((uv_handle_t *)pty_out, close_cb);
pty_out = NULL;
return;
}
paused = true;
pty_resume();
}
static void pty_reconnect_out(void) {
reconnect_attempts++;
if (reconnect_attempts > MAX_RECONNECT_ATTEMPTS) {
pending_close = true;
uv_idle_start(&writable_idle, writable_cb);
return;
}
uint64_t delay = RECONNECT_BASE_DELAY_MS *
(1u << (reconnect_attempts - 1));
uv_timer_start(&reconnect_timer, reconnect_timer_cb, delay, 0);
}
static void read_cb(uv_stream_t *stream, ssize_t n,
const uv_buf_t *buf) {
uv_read_stop(stream);
paused = true;
if (n <= 0) {
if (n == UV_ENOBUFS || n == 0) { free(buf->base); return; }
total_eios++;
free(buf->base);
if (process_running()) pty_reconnect_out();
else {
pending_close = true;
uv_idle_start(&writable_idle, writable_cb);
}
return;
}
total_reads++;
reconnect_attempts = 0;
pending_buf = malloc(n);
memcpy(pending_buf, buf->base, n);
pending_len = n;
free(buf->base);
uv_idle_start(&writable_idle, writable_cb);
}
static void writable_cb(uv_idle_t idle) {
uv_idle_stop(idle);
if (pending_close) return;
if (pending_buf) {
if (strstr(pending_buf, "BEFORE")) saw_before = true;
if (strstr(pending_buf, "AFTER")) saw_after = true;
free(pending_buf);
pending_buf = NULL;
pending_len = 0;
pty_resume();
}
}
static void on_close_noop(uv_handle_t h) { (void)h; }
static void timeout_cb(uv_timer_t t) {
(void)t;
printf("saw_before=%d saw_after=%d reads=%d eios=%d\n",
saw_before, saw_after, total_reads, total_eios);
if (saw_before && saw_after)
printf("
FIX VERIFIED \n");
else
printf("
FIX FAILED ***\n");
if (child_pid > 0) kill(child_pid, SIGTERM);
if (pty_out) {
uv_read_stop((uv_stream_t *)pty_out);
uv_close((uv_handle_t *)pty_out, close_cb);
pty_out = NULL;
}
uv_close((uv_handle_t *)&pty_in, on_close_noop);
uv_close((uv_handle_t *)&timeout_timer, on_close_noop);
uv_close((uv_handle_t *)&reconnect_timer, on_close_noop);
uv_close((uv_handle_t *)&writable_idle, on_close_noop);
}
int main(void) {
if (geteuid() != 0) {
fprintf(stderr, "Must run as root\n");
return 1;
}
clock_gettime(CLOCK_MONOTONIC, &start_time);
struct winsize ws = { .ws_row = 24, .ws_col = 80 };
char slave_name[256];
child_pid = forkpty(&master_fd, slave_name, NULL, &ws);
if (child_pid < 0) { perror("forkpty"); return 1; }
if (child_pid == 0) {
signal(SIGHUP, SIG_IGN);
printf("BEFORE\n"); fflush(stdout); usleep(200000);
close(0); close(1); close(2);
vhangup();
signal(SIGHUP, SIG_DFL);
int fd = open(slave_name, O_RDWR);
if (fd < 0) _exit(99);
dup2(fd, 0); dup2(fd, 1); dup2(fd, 2);
if (fd > 2) close(fd);
ioctl(0, TIOCSCTTY, 1);
usleep(500000);
printf("AFTER\n"); fflush(stdout);
sleep(3);
_exit(0);
}
fcntl(master_fd, F_SETFL,
fcntl(master_fd, F_GETFL) | O_NONBLOCK);
fcntl(master_fd, F_SETFD, FD_CLOEXEC);
loop = uv_default_loop();
uv_pipe_init(loop, &pty_in, 0);
pty_out = malloc(sizeof(uv_pipe_t));
uv_pipe_init(loop, pty_out, 0);
int fd_in = dup(master_fd);
fcntl(fd_in, F_SETFD, FD_CLOEXEC);
uv_pipe_open(&pty_in, fd_in);
if (!fd_duplicate(master_fd, pty_out)) return 1;
uv_idle_init(loop, &writable_idle);
uv_timer_init(loop, &reconnect_timer);
pty_resume();
uv_timer_init(loop, &timeout_timer);
uv_timer_start(&timeout_timer, timeout_cb, 8000, 0);
uv_run(loop, UV_RUN_DEFAULT);
int status;
waitpid(child_pid, &status, WNOHANG);
uv_loop_close(loop);
return (saw_before && saw_after) ? 0 : 1;
}

Symptom
-------
When ttyd's command is `login` (or any program that revokes its
controlling terminal during startup), the browser-side terminal
intermittently comes up as a black screen with the keyboard appearing
dead.  The bug is timing-dependent: it reproduces reliably on hosts
where deeper CPU C-states are disabled, and disappears (or only happens
sporadically) when the kernel is allowed to enter C3/C6.

Root cause
----------
login(1) calls vhangup(2) to revoke the controlling terminal before
re-attaching to it via open()/TIOCSCTTY.  vhangup() sends SIGHUP to the
foreground process group (login ignores it) and forces subsequent I/O
on the existing slave fd to fail.  On the master side, this surfaces
as a transient `read() == -1, errno == EIO` for as long as no slave
fd is open.  Once login reopens the slave by device path, the master
recovers and reads succeed again.

The window between vhangup() and the slave being reopened is short
(microseconds when the CPU is hot, much longer when waking from a deep
C-state).  ttyd's libuv read loop only loses the race when it manages
to dispatch a read inside that window:

  1. uv__read() in libuv's src/unix/stream.c calls read(), which
     returns -1 / EIO.
  2. libuv invokes our read_cb with `nread == UV_EIO` AND, critically,
     clears `UV_HANDLE_READABLE | UV_HANDLE_WRITABLE` on the stream and
     stops its io_watcher (stream.c, the !EAGAIN branch around the
     `stream->flags &= ~(UV_HANDLE_READABLE | UV_HANDLE_WRITABLE)`
     line).
  3. Any subsequent uv_read_start() on the same uv_pipe_t is a no-op
     - the flags are gone and the io_watcher will not be re-armed.
     The handle is permanently poisoned even though the underlying
     kernel master fd recovers a millisecond later.

Result: process->out is dead, no further output reaches the websocket,
the browser shows a black screen.  process->in is a *separate*
uv_pipe_t over its own dup() of the master fd, so writes (keystrokes)
typically still work because users only type after login has finished
and the slave is back - the input side is rarely exercised inside the
EIO window.  This is why the symptom looks like "no output, blind
input" rather than a fully dead session.

Three sub-bugs in the existing code conspired here:

  1. process_read_cb (protocol.c) stored NULL in pss->pty_buf on a
     transient EIO and asked LWS for a writable callback that produced
     nothing, so even a working pipe would have been left paused.
  2. pty_pause() / pty_resume() and read_cb() never updated
     process->paused, so the `if (!paused) return` guards in those
     functions made flow control depend on stale state.  After a
     uv_read_stop in read_cb, paused was still false, so subsequent
     pty_resume() short-circuited.
  3. Even with (1) and (2) fixed, the poisoned uv_pipe_t cannot be
     re-armed - libuv has cleared the flags and a fresh handle is
     required.

Fix
---
src/pty.c read_cb: when read fails with UV_EIO and the child is still
running, treat it as transient and schedule a reconnect of the read
side instead of propagating EOF upstream.  Other read errors fall
through to the existing EOF path so we don't paper over real failures.

src/pty.c pty_reconnect_out: close the poisoned uv_pipe_t, dup() the
still-valid master fd into a fresh handle, and re-arm reading.  Retry
every 20ms up to 20 times (400ms total), which is more than enough
for login to finish reopening the slave.  On the last attempt the
upper layer is notified with EOF so the websocket closes cleanly.

src/pty.c pty_pause / pty_resume / read_cb: keep process->paused in
sync with the actual io_watcher state so pty_resume() will re-arm
reading as designed.

src/pty.h: add reconnect_timer + reconnect_attempts to pty_process.
The timer is allocated in pty_spawn and freed in process_free after
the wait thread is joined, so there is no UAF window.  The reconnect
helper is file-static; there is no public API change.

Notes / known limitations
-------------------------
- Only process->out is reconnected.  process->in is a separate
  handle and was not observed to break in practice (writes happen
  after the user starts typing, by which time login is done).  If a
  future workload triggers EIO on the write side, the symmetric fix
  would be to attempt reconnect from pty_write() on EPIPE.
- process_running() uses uv_kill(pid, 0), which returns success for
  zombies.  In the narrow window between child exit and wait_cb
  reaping it, we may schedule one or two pointless reconnects before
  the next dup()+read() returns EOF and the session closes normally.

Fixes: tsl0922#768

Code was written and tested by claude opus 4.6

Standalone reproducer (compile with -luv, run as root):

  /*
   * test_ttyd_vhangup_fix.c - integration test for vhangup recovery.
   * Forkpty's a child that writes output, calls vhangup(), reopens the
   * slave PTY by device path, then writes more output.  Parent uses the
   * same one-shot read + reconnect pattern as the fix.  Passes when
   * AFTER output is received through the new pipe.
   */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <stdbool.h>
  #include <string.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <pty.h>
  #include <sys/wait.h>
  #include <sys/ioctl.h>
  #include <uv.h>
  #include <signal.h>
  #include <time.h>
  #include <errno.h>

  static uv_loop_t *loop;
  static uv_pipe_t *pty_out;
  static uv_pipe_t pty_in;
  static uv_timer_t timeout_timer;
  static uv_timer_t reconnect_timer;
  static uv_idle_t writable_idle;
  static int master_fd = -1;
  static int child_pid = 0;
  static bool paused = true;
  static int reconnect_attempts = 0;

  #define MAX_RECONNECT_ATTEMPTS 10
  #define RECONNECT_BASE_DELAY_MS 100

  static char *pending_buf = NULL;
  static size_t pending_len = 0;
  static bool pending_close = false;
  static int total_reads = 0;
  static int total_eios = 0;
  static bool saw_before = false;
  static bool saw_after = false;
  static struct timespec start_time;

  static double elapsed(void) {
      struct timespec now;
      clock_gettime(CLOCK_MONOTONIC, &now);
      return (now.tv_sec - start_time.tv_sec) +
             (now.tv_nsec - start_time.tv_nsec) / 1e9;
  }
  static bool process_running(void) {
      return child_pid > 0 && kill(child_pid, 0) == 0;
  }
  static void alloc_cb(uv_handle_t *h, size_t s, uv_buf_t *b) {
      (void)h; b->base = malloc(s); b->len = s;
  }
  static void close_cb(uv_handle_t *h) { free(h); }
  static bool fd_duplicate(int fd, uv_pipe_t *pipe) {
      int d = dup(fd);
      if (d < 0) return false;
      fcntl(d, F_SETFD, FD_CLOEXEC);
      if (uv_pipe_open(pipe, d)) { close(d); return false; }
      return true;
  }

  static void read_cb(uv_stream_t *stream, ssize_t n, const uv_buf_t *buf);
  static void writable_cb(uv_idle_t *idle);

  static void pty_resume(void) {
      if (!paused || !pty_out) return;
      paused = false;
      pty_out->data = NULL;
      if (uv_read_start((uv_stream_t *)pty_out, alloc_cb, read_cb))
          paused = true;
  }
  static void reconnect_timer_cb(uv_timer_t *t) {
      (void)t;
      if (pty_out) {
          uv_read_stop((uv_stream_t *)pty_out);
          uv_close((uv_handle_t *)pty_out, close_cb);
          pty_out = NULL;
      }
      if (!process_running()) return;
      pty_out = malloc(sizeof(uv_pipe_t));
      uv_pipe_init(loop, pty_out, 0);
      if (!fd_duplicate(master_fd, pty_out)) {
          uv_close((uv_handle_t *)pty_out, close_cb);
          pty_out = NULL;
          return;
      }
      paused = true;
      pty_resume();
  }
  static void pty_reconnect_out(void) {
      reconnect_attempts++;
      if (reconnect_attempts > MAX_RECONNECT_ATTEMPTS) {
          pending_close = true;
          uv_idle_start(&writable_idle, writable_cb);
          return;
      }
      uint64_t delay = RECONNECT_BASE_DELAY_MS *
                        (1u << (reconnect_attempts - 1));
      uv_timer_start(&reconnect_timer, reconnect_timer_cb, delay, 0);
  }
  static void read_cb(uv_stream_t *stream, ssize_t n,
                       const uv_buf_t *buf) {
      uv_read_stop(stream);
      paused = true;
      if (n <= 0) {
          if (n == UV_ENOBUFS || n == 0) { free(buf->base); return; }
          total_eios++;
          free(buf->base);
          if (process_running()) pty_reconnect_out();
          else {
              pending_close = true;
              uv_idle_start(&writable_idle, writable_cb);
          }
          return;
      }
      total_reads++;
      reconnect_attempts = 0;
      pending_buf = malloc(n);
      memcpy(pending_buf, buf->base, n);
      pending_len = n;
      free(buf->base);
      uv_idle_start(&writable_idle, writable_cb);
  }
  static void writable_cb(uv_idle_t *idle) {
      uv_idle_stop(idle);
      if (pending_close) return;
      if (pending_buf) {
          if (strstr(pending_buf, "BEFORE")) saw_before = true;
          if (strstr(pending_buf, "AFTER"))  saw_after = true;
          free(pending_buf);
          pending_buf = NULL;
          pending_len = 0;
          pty_resume();
      }
  }
  static void on_close_noop(uv_handle_t *h) { (void)h; }
  static void timeout_cb(uv_timer_t *t) {
      (void)t;
      printf("saw_before=%d saw_after=%d reads=%d eios=%d\n",
             saw_before, saw_after, total_reads, total_eios);
      if (saw_before && saw_after)
          printf("*** FIX VERIFIED ***\n");
      else
          printf("*** FIX FAILED ***\n");
      if (child_pid > 0) kill(child_pid, SIGTERM);
      if (pty_out) {
          uv_read_stop((uv_stream_t *)pty_out);
          uv_close((uv_handle_t *)pty_out, close_cb);
          pty_out = NULL;
      }
      uv_close((uv_handle_t *)&pty_in, on_close_noop);
      uv_close((uv_handle_t *)&timeout_timer, on_close_noop);
      uv_close((uv_handle_t *)&reconnect_timer, on_close_noop);
      uv_close((uv_handle_t *)&writable_idle, on_close_noop);
  }
  int main(void) {
      if (geteuid() != 0) {
          fprintf(stderr, "Must run as root\n");
          return 1;
      }
      clock_gettime(CLOCK_MONOTONIC, &start_time);
      struct winsize ws = { .ws_row = 24, .ws_col = 80 };
      char slave_name[256];
      child_pid = forkpty(&master_fd, slave_name, NULL, &ws);
      if (child_pid < 0) { perror("forkpty"); return 1; }
      if (child_pid == 0) {
          signal(SIGHUP, SIG_IGN);
          printf("BEFORE\n"); fflush(stdout); usleep(200000);
          close(0); close(1); close(2);
          vhangup();
          signal(SIGHUP, SIG_DFL);
          int fd = open(slave_name, O_RDWR);
          if (fd < 0) _exit(99);
          dup2(fd, 0); dup2(fd, 1); dup2(fd, 2);
          if (fd > 2) close(fd);
          ioctl(0, TIOCSCTTY, 1);
          usleep(500000);
          printf("AFTER\n"); fflush(stdout);
          sleep(3);
          _exit(0);
      }
      fcntl(master_fd, F_SETFL,
            fcntl(master_fd, F_GETFL) | O_NONBLOCK);
      fcntl(master_fd, F_SETFD, FD_CLOEXEC);
      loop = uv_default_loop();
      uv_pipe_init(loop, &pty_in, 0);
      pty_out = malloc(sizeof(uv_pipe_t));
      uv_pipe_init(loop, pty_out, 0);
      int fd_in = dup(master_fd);
      fcntl(fd_in, F_SETFD, FD_CLOEXEC);
      uv_pipe_open(&pty_in, fd_in);
      if (!fd_duplicate(master_fd, pty_out)) return 1;
      uv_idle_init(loop, &writable_idle);
      uv_timer_init(loop, &reconnect_timer);
      pty_resume();
      uv_timer_init(loop, &timeout_timer);
      uv_timer_start(&timeout_timer, timeout_cb, 8000, 0);
      uv_run(loop, UV_RUN_DEFAULT);
      int status;
      waitpid(child_pid, &status, WNOHANG);
      uv_loop_close(loop);
      return (saw_before && saw_after) ? 0 : 1;
  }
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

disconnect with util-linux /bin/login command

1 participant