Skip to content

fix: aptos thread with a ghost file#35

Merged
fals merged 9 commits intolive-4.2from
filipe/LIVE-600-aptos
Feb 26, 2026
Merged

fix: aptos thread with a ghost file#35
fals merged 9 commits intolive-4.2from
filipe/LIVE-600-aptos

Conversation

@fals
Copy link

@fals fals commented Feb 24, 2026

  • added test to reproduce the issue
  • fixed the issue
  • added way to run criu tests on container

fals added 3 commits February 24, 2026 11:37
- added test to reproduce the issue
- fixed the issue
- added way to run criu tests on container
On restore, /proc/<pid>/task/<tid>/... paths where tid != pid fail
because non-leader threads don't exist yet at prepare_fds() time --
they are created later by the restorer blob via clone(). Previously
only dead threads (tid not in pstree) were handled.

Now fixup_thread_proc_path() handles both cases:
- Dead threads: create TASK_HELPER with vPID=tid (existing behavior)
- Live threads: rewrite path to use leader tid (proc/<pid>/task/<pid>)

Verified by live-migrating an Aptos node on GKE which previously
always failed with 'Can't open file proc/1/task/<tid>/stat'.
criu/files-reg.c Outdated
/*
* Live thread: the thread exists in the pstree but
* won't be created until the restorer blob runs
* clone(), which is after prepare_fds(). Rewrite the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as I proposed with modified test, this will cause that the file will be rewriten to the different thread proc fs. This will cause that suddently any open FD will point to the wrong thread. As for the dead thread I guess this is for now fine, for the running thread this is very incorrect behaviour.

For the solution, TBH i'm not sure, we would need to reserach how criu is handling similar cases for example with thie unix sockets, when they explicitly mark that some sockets will be available when all of the processes are in the restored state.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seem to work as well, migration happened tried multiple times but had to change in multiple places to cache the fds that need to be deferred to restore, we create a placeholder of the fd when we restore them and keep real reference but when we restore the task we close the fake one and restore the real one.

fals added 4 commits February 25, 2026 12:14
Instead of rewriting /proc/<pid>/task/<tid>/... paths to the leader
thread (which returns wrong data), defer the actual open until after
threads exist in the restorer blob.

Phase 1 (collect): Mark live-thread proc paths with deferred_thread_fd
instead of rewriting them. Dead-thread TASK_HELPER case is unchanged.

Phase 2 (placeholder): Open /dev/null to reserve the fd number, then
collect deferred fd metadata into rst_mem for the restorer.

Phase 3 (restorer): After clone() creates all threads, loop through
deferred fds and reopen via sys_openat(proc_fd, path) + sys_dup2() +
sys_lseek(). At this point /proc/<pid>/task/<tid>/ exists.

The test is expanded to verify both dead-thread (fd validity check) and
live-thread (reads tid from /proc/self/task/<tid>/stat to prove it
points to the correct thread, not the leader) cases.
The restorer blob was reopening live-thread /proc/<pid>/task/<tid>/...
fds via sys_openat(args->proc_fd, ...) where args->proc_fd pointed to
CRIU's own /proc mount (PROC_FD_OFF). This gave the resulting fds a
mount ID from CRIU's mount namespace rather than the container's.

On the next dump, lookup_mnt_id() would fail because that mount ID
doesn't exist in the container's mountinfo, producing:
  Error: Can't lookup mount=<N> for fd=<M> path=/1/task/7/stat

Fix by opening /proc fresh inside the restorer. By this point the
process has already chroot'd into the container's rootfs (done in
restore_fs() before jumping to the restorer), so sys_open("/proc")
resolves through the container's own proc mount and the fds get the
correct mount ID.
/* Live thread state */
static int live_thread_fd = -1;
static pid_t live_thread_tid;
static volatile int live_thread_ready;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This program is racy volatile is not meant to be used for inter-thread communications.
https://en.cppreference.com/w/c/language/volatile.html

Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile variable that is modified by another thread without synchronization or concurrent modification from two unsynchronized threads is undefined behavior due to a data race.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

test/Dockerfile Outdated

COPY . /criu
WORKDIR /criu
RUN make mrproper && make -j $(nproc) && make -C test/zdtm -j $(nproc)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make test/zdtm causes that test runs twice inside docker and later on in the makefile?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

- Replace volatile with pthread_mutex/cond for inter-thread
  synchronization in proc_task_comm.c test (volatile does not
  provide atomicity or memory ordering guarantees)
- Remove redundant 'make -C test/zdtm' from test/Dockerfile since
  zdtm.py compiles tests on the fly
- Split castai-test Makefile target into castai-test-build (image)
  and castai-test (run), volume-mounting test/ so test code changes
  take effect without rebuilding the Docker image
* Hence align to 16 bytes for all
*/
#define RESTORE_ALIGN_STACK(start, size) (ALIGN((start) + (size)-16, 16))
#define RESTORE_ALIGN_STACK(start, size) (ALIGN((start) + (size) - 16, 16))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't reformat this, I guess we don't need any irreelvant changes that would make updates troublesome.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted this

criu/files-reg.c Outdated
pid, tid);
}

new_path = xmalloc(5 + 20 + 6 + 20 + strlen(tid_end) + 1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this magic numbers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes dump LLM using the constant instead

- Revert unrelated formatting change in RESTORE_ALIGN_STACK macro
- Revert unrelated (void*) formatting change in get_build_id()
- Replace magic number malloc with PATH_MAX + snprintf in fixup_thread_proc_path()
@fals fals merged commit a27acde into live-4.2 Feb 26, 2026
4 checks passed
@fals fals deleted the filipe/LIVE-600-aptos branch February 26, 2026 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants