fals
commented
Feb 24, 2026
- added test to reproduce the issue
- fixed the issue
- added way to run criu tests on container
On restore, /proc/<pid>/task/<tid>/... paths where tid != pid fail because non-leader threads don't exist yet at prepare_fds() time -- they are created later by the restorer blob via clone(). Previously only dead threads (tid not in pstree) were handled.

Now fixup_thread_proc_path() handles both cases:
- Dead threads: create TASK_HELPER with vPID=tid (existing behavior)
- Live threads: rewrite path to use leader tid (proc/<pid>/task/<pid>)

Verified by live-migrating an Aptos node on GKE which previously always failed with 'Can't open file proc/1/task/<tid>/stat'.
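The dead/live distinction above first needs the pid and tid pulled out of the path. A minimal sketch of that step follows, assuming paths of the form proc/<pid>/task/<tid>/... with an optional leading slash. The helper name parse_thread_proc_path() is hypothetical and is not the code from this PR.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not CRIU's implementation): extract pid and tid
 * from "proc/<pid>/task/<tid>/...".
 * Returns 1 if the path names a non-leader thread (tid != pid),
 * 0 if it names the leader, and -1 if the path has another shape.
 */
static int parse_thread_proc_path(const char *path, pid_t *pid, pid_t *tid)
{
	int p, t, consumed = 0;

	if (*path == '/')
		path++;
	/* %n records how many characters were consumed; it is not counted in the return value. */
	if (sscanf(path, "proc/%d/task/%d%n", &p, &t, &consumed) != 2)
		return -1;
	/* The tid component must be followed by '/' or the end of the string. */
	if (path[consumed] != '/' && path[consumed] != '\0')
		return -1;
	if (p <= 0 || t <= 0)
		return -1;

	*pid = p;
	*tid = t;
	return t != p;
}
```

A caller would only need special handling when the return value is 1; whether that thread is dead or live is then a pstree lookup, which this sketch leaves out.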
criu/files-reg.c
Outdated
/*
 * Live thread: the thread exists in the pstree but
 * won't be created until the restorer blob runs
 * clone(), which is after prepare_fds(). Rewrite the
As I showed with the modified test, this rewrites the file path to a different thread's procfs entry, so any open FD will suddenly point to the wrong thread. For a dead thread I guess this is fine for now, but for a running thread this is very incorrect behaviour.
As for the solution, TBH I'm not sure. We would need to research how CRIU handles similar cases, for example with unix sockets, where it explicitly marks that some sockets will only be available once all of the processes are in the restored state.
There was a problem hiding this comment.
This seems to work as well: the migration succeeded over multiple attempts. I had to change several places to cache the fds whose restore needs to be deferred. When we restore the fds we create a placeholder for each deferred fd and keep the real reference; when we restore the task we close the fake one and restore the real one.
Instead of rewriting /proc/<pid>/task/<tid>/... paths to the leader thread (which returns wrong data), defer the actual open until after threads exist in the restorer blob.

Phase 1 (collect): Mark live-thread proc paths with deferred_thread_fd instead of rewriting them. Dead-thread TASK_HELPER case is unchanged.

Phase 2 (placeholder): Open /dev/null to reserve the fd number, then collect deferred fd metadata into rst_mem for the restorer.

Phase 3 (restorer): After clone() creates all threads, loop through deferred fds and reopen via sys_openat(proc_fd, path) + sys_dup2() + sys_lseek(). At this point /proc/<pid>/task/<tid>/ exists.

The test is expanded to verify both dead-thread (fd validity check) and live-thread (reads tid from /proc/self/task/<tid>/stat to prove it points to the correct thread, not the leader) cases.
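The placeholder and reopen phases can be illustrated in ordinary userspace C. This is only a sketch under assumptions: struct deferred_fd and both function names are hypothetical, and it uses the libc open()/dup2()/lseek() wrappers rather than the restorer's sys_* calls and rst_mem storage.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record; field names are illustrative, not CRIU's. */
struct deferred_fd {
	int target_fd;    /* fd number the restored task expects */
	int flags;        /* open flags recorded at dump time */
	off_t pos;        /* file offset recorded at dump time */
	const char *path; /* path that only exists once threads are created */
};

/* Move fd onto target_fd, closing the temporary descriptor. */
static int install_fd(int fd, int target_fd)
{
	if (fd == target_fd)
		return 0;
	/* dup2() atomically closes whatever currently occupies target_fd. */
	if (dup2(fd, target_fd) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

/* Phase 2: reserve the fd number with a /dev/null placeholder. */
static int reserve_fd(const struct deferred_fd *d)
{
	int fd = open("/dev/null", O_RDONLY);

	if (fd < 0)
		return -1;
	return install_fd(fd, d->target_fd);
}

/* Phase 3: replace the placeholder with the real file and restore its offset. */
static int reopen_deferred_fd(const struct deferred_fd *d)
{
	int fd = open(d->path, d->flags);

	if (fd < 0)
		return -1;
	if (install_fd(fd, d->target_fd) < 0)
		return -1;
	if (lseek(d->target_fd, d->pos, SEEK_SET) == (off_t)-1)
		return -1;
	return 0;
}
```

Reserving the number early matters because other fds restored in between could otherwise be assigned that number.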
The restorer blob was reopening live-thread /proc/<pid>/task/<tid>/...
fds via sys_openat(args->proc_fd, ...) where args->proc_fd pointed to
CRIU's own /proc mount (PROC_FD_OFF). This gave the resulting fds a
mount ID from CRIU's mount namespace rather than the container's.
On the next dump, lookup_mnt_id() would fail because that mount ID
doesn't exist in the container's mountinfo, producing:
Error: Can't lookup mount=<N> for fd=<M> path=/1/task/7/stat
Fix by opening /proc fresh inside the restorer. By this point the
process has already chroot'd into the container's rootfs (done in
restore_fs() before jumping to the restorer), so sys_open("/proc")
resolves through the container's own proc mount and the fds get the
correct mount ID.
test/zdtm/static/proc_task_comm.c
Outdated
/* Live thread state */
static int live_thread_fd = -1;
static pid_t live_thread_tid;
static volatile int live_thread_ready;
This program is racy: volatile is not meant to be used for inter-thread communication.
https://en.cppreference.com/w/c/language/volatile.html
Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile variable that is modified by another thread without synchronization or concurrent modification from two unsynchronized threads is undefined behavior due to a data race.
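A race-free version of the "thread is ready" handshake could use a mutex and a condition variable. The sketch below reuses the live_thread_ready and live_thread_tid names from the excerpt; the lock, the condition variable, and both function names are assumptions, not the test's actual code.

```c
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cond = PTHREAD_COND_INITIALIZER;
static int live_thread_ready;  /* protected by lock */
static pid_t live_thread_tid;  /* protected by lock */

/* Thread body: publish our tid under the lock, then wake the waiter. */
static void *live_thread(void *arg)
{
	(void)arg;

	pthread_mutex_lock(&lock);
	live_thread_tid = (pid_t)syscall(SYS_gettid);
	live_thread_ready = 1;
	pthread_cond_signal(&ready_cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Block until the thread has published its tid, then return it. */
static pid_t wait_for_live_thread(void)
{
	pid_t tid;

	pthread_mutex_lock(&lock);
	/* Loop to tolerate spurious wakeups. */
	while (!live_thread_ready)
		pthread_cond_wait(&ready_cond, &lock);
	tid = live_thread_tid;
	pthread_mutex_unlock(&lock);
	return tid;
}
```

In the real test the thread would keep running across checkpoint and restore; here it exits right after signalling to keep the example short.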
test/Dockerfile
Outdated

COPY . /criu
WORKDIR /criu
RUN make mrproper && make -j $(nproc) && make -C test/zdtm -j $(nproc)
Doesn't 'make -C test/zdtm' mean the tests get built twice, once inside Docker and again later from the Makefile?
- Replace volatile with pthread_mutex/cond for inter-thread synchronization in proc_task_comm.c test (volatile does not provide atomicity or memory ordering guarantees)
- Remove redundant 'make -C test/zdtm' from test/Dockerfile since zdtm.py compiles tests on the fly
- Split castai-test Makefile target into castai-test-build (image) and castai-test (run), volume-mounting test/ so test code changes take effect without rebuilding the Docker image
criu/include/restorer.h
Outdated
  * Hence align to 16 bytes for all
  */
-#define RESTORE_ALIGN_STACK(start, size) (ALIGN((start) + (size)-16, 16))
+#define RESTORE_ALIGN_STACK(start, size) (ALIGN((start) + (size) - 16, 16))
Don't reformat this. We don't need irrelevant changes that would make future updates troublesome.
criu/files-reg.c
Outdated
pid, tid);
}

new_path = xmalloc(5 + 20 + 6 + 20 + strlen(tid_end) + 1);
Can we get rid of these magic numbers?
There was a problem hiding this comment.
Yes, dumb LLM. Using the constant instead.
- Revert unrelated formatting change in RESTORE_ALIGN_STACK macro
- Revert unrelated (void*) formatting change in get_build_id()
- Replace magic number malloc with PATH_MAX + snprintf in fixup_thread_proc_path()