Add Docker backend, refactor plugin architecture, and fix process lifecycle bugs#264
Merged
chipsenkbeil merged 41 commits intomasterfrom Mar 5, 2026
Merged
Add Docker backend, refactor plugin architecture, and fix process lifecycle bugs#264chipsenkbeil merged 41 commits intomasterfrom
chipsenkbeil merged 41 commits intomasterfrom
Conversation
79235a9 to
eaded68
Compare
a542602 to
6004111
Compare
f8be0ec to
93877d5
Compare
Accumulated fixes from CI testing against Windows nanoserver containers: - exists: use tar-based check as primary on Windows (exec `if exist` cannot see tar-created paths due to ContainerUser permissions) - remove: split into separate rmdir/del exec calls (compound `&` operator has unreliable errorlevel propagation on nanoserver) - search: add tar-based fallback with condition_matches() for Windows where findstr/dir cannot access tar-created paths - search: fix has_any() to include has_grep in the check - search: fix findstr probe to check stdout instead of exit code (findstr /? exits 1 on nanoserver but still prints help) - process: retrieve real exit codes via inspect_exec API instead of hardcoding success=true/code=None - process: move PTY resize_exec after start_exec (Docker requires exec to be running before resize is valid) - process: fix proc_kill to not remove process map entry (let reader task cleanup handle removal to avoid race conditions)
93877d5 to
9dd79a3
Compare
Revert api.rs ContainerAdministrator changes that didn't fix the issue and replace CI with a focused diagnostic workflow that tests 8 hypotheses about nanoserver tar/exec interaction using raw docker CLI commands on windows-2022 (nanoserver:ltsc2022 + servercore:ltsc2022 comparison).
The test harness created C:\temp via the Docker tar API, which produces SYSTEM-owned directories where ContainerUser only inherits BUILTIN\Users:(RX) — insufficient for file deletion. Switch to exec-based mkdir so ContainerUser gets FILE_DELETE_CHILD permission, fixing remove and rename test failures on nanoserver. Also suppress dead_code warning on condition_matches (reserved for future tar-based search fallback) and restore the original CI workflow that was replaced by the diagnostic workflow.
b183dfa to
662699e
Compare
Install Docker on macOS via docker/setup-docker-action@v4 (Lima). Replace per-platform pre-pull steps with a single step conditioned on runner.os != 'Windows', since Windows runners cannot run Linux containers. Docker tests on Windows are skipped by skip_if_no_docker!.
Drop DockerFamily enum, nanoserver dispatch, findstr/cmd.exe paths, and all Windows-specific workarounds. All containers are now assumed to be Unix (the Docker host can still be any platform). This removes ~750 lines of platform-dispatch code, simplifies shell commands to use sh -c directly, and updates AGENTS.md technical debt entries to reflect the Unix-only container scope.
…chemes Change Plugin::connect/launch to accept &str instead of &Destination, deferring destination parsing to each plugin. This enables the Docker plugin to handle URIs like docker://ubuntu:22.04 where :22.04 is an image tag, not a port. The manager extracts the scheme for plugin routing and passes the raw string through. ManagerRequest, ConnectionInfo, and ConnectionList now carry String destinations instead of parsed Destination types.
Upgrade pull progress logging from trace! to info! and include current/total byte counts from progress_detail when available, giving users visibility into download progress.
Resolve relative paths to absolute using the container's working directory before running find. This ensures strip_prefix works correctly and entries are not silently skipped when the path is '.' or other relative paths.
Add auto_remove field to Docker struct, default LaunchOpts::auto_remove to true, and spawn cleanup tasks that stop and remove the container when the server shuts down. Add subscribe_shutdown() to ServerRef for external shutdown notification. Remove resolved Technical Debt #7 from CLAUDE.md.
Simplify read_file_text, write_file_text, and append_file_text to delegate to their binary counterparts with bytes/string conversion at the boundary, removing ~40 lines of duplicated logic.
DockerApi now stores DockerClient instead of raw bollard::Docker. All callsites in api.rs, test harness, and integration tests use DockerClient's public methods (ping, has_image, pull_image, create_and_start_container) or .inner() for crate-internal access, keeping bollard out of the public API surface.
Encapsulate all SFTP↔native path conversion logic into a proper SftpPathBuf type in distant-ssh, replacing 5 free functions and 2 methods scattered across api.rs. The new type makes correct separator handling the default: all path manipulation happens in SFTP space (using /), and to_remote_path() converts to native separators at the boundary. Key changes: - Add SftpPathBuf with from_remote, from_sftp constructors and as_str, to_remote_path, join, file_name, strip_prefix operations - Use Utf8WindowsPath explicitly (not derive()) so \ is recognized as a separator on all host platforms - Update all api.rs call sites including read_dir recursive traversal - Fix relative DirEntry paths on Windows (sub1\file2 not sub1/file2) by cfg-gating test assertions
- Change CLI destination args from Box<Destination> to String so Docker URIs like docker://ubuntu:22.04 pass through to plugins unparsed - Add extract_scheme() and ensure_scheme() helpers for raw destination strings; guard post-launch host replacement to skip Docker schemes - Add shlex-based shell_quote() and convert 11 api.rs commands from run_shell_cmd to direct argv via run_cmd, eliminating shell injection - Quote paths and patterns in search commands; add SearchCommand struct with exit code handling (grep/rg exit 1 = no matches, >=2 = error) - Add tar_copy_path() for file and directory copies via tar API, replacing the file-only tar_read_file fallback in copy() - Make tar_read_file reject directory paths instead of silently returning the first file's contents - Remove hardcoded default image from LaunchOpts; validate non-empty image in Docker::launch() - Add distant_docker to CLI logging modules - Update CLAUDE.md technical debt (remove 4 resolved items, add 1 new)
Introduce two agent prompt files in .claude/agents/ that enforce test quality standards derived from the project's original human-written tests. The implementor writes tests across unit, integration, and CLI tiers with strict naming, assertion, and cleanup rules. The validator reviews test code against 10 checks and blocks on quality issues.
Replace weak stderr assertions (is_empty().not()) with specific error message checks, fix smoke tests that only validated existence/success without checking actual values, add missing error-case tests for Docker backend operations, and broaden Docker error matching to handle platform-specific message formats.
Embeds git commit hash, dirty-tree indicator, and commit date into the CLI --version output via a build.rs script, making nightly and dev builds fully traceable without changing the semver version.
Move DistantPlugin from the binary crate (handlers.rs) to distant-host as HostPlugin for consistency with SshPlugin and DockerPlugin. All three backend crates are now optional Cargo features (default-enabled), allowing minimal builds. The version string now lists enabled backends (e.g. `[docker, host, ssh]`).
Add global --quiet/-q flag that suppresses all informational UI output (spinners, connection status, success/error messages) on stderr, making CLI output clean for scripting and pipelines. Fix missing trailing newlines in system-info, metadata (unix/windows), and directory listing output that caused zsh to display a '%' marker.
…adiness Replace the fixed 50ms sleep in ManagerCtx::start() and ManagerOnlyCtx::start() with a poll loop that waits for the Unix socket file to exist (up to 2s). The old sleep was too short — the manager sometimes hadn't bound its socket yet, causing connect_to_manager() to fail and attempt a conflicting auto-start. Also improve the launch test assertion to include the full JSON response on failure.
…tection Move SSH and Docker test-group overrides from the CI profile to the default profile so throttling applies to local runs too. Add leak detection (100ms timeout, fail on leak) and set final-status-level to leak. The CI profile now only adds retries and slow-timeout on top.
Nextest detected leaked handles (LKFAIL) in launch and SSH tests because Drop impls only killed the direct child process, leaving grandchildren (servers spawned via distant launch, sshd session handlers) alive with inherited file descriptors. Add a cross-platform kill_process_tree helper: on Unix, spawns processes in their own process group (process_group(0)) and kills the entire group via kill(-pgid, SIGKILL); on Windows, uses taskkill /T /F to kill the process tree. Apply to ManagerCtx, ManagerOnlyCtx, and Sshd Drop impls.
When sshd allocates a PTY, the sshd-session child calls setsid() to become a session leader, escaping the process group. The previous kill(-pgid, SIGKILL) did not reach it, leaking file descriptors. Add collect_descendants() which recursively walks parent-child links via pgrep -P (preserved even after setsid) and kills deepest-first, then falls back to the process group kill for any remaining members.
Windows OpenSSH sometimes never sends Eof after ExitStatus, causing execute_output() and process reader tasks to hang until the 30s timeout. Break immediately on ExitStatus since SSH protocol guarantees all data has been flushed by that point. Increase Windows CI test timeout from 30s to 90s for additional headroom.
Replace the single PowerShell-based `is_windows()` with a three-layer
strategy that avoids unnecessary shell execution:
1. SSH identification string — zero cost, captured during key exchange
2. SFTP canonicalize(".") — detects Windows drive prefix without exec
3. Parallel echo %OS% / PowerShell fallback as a safety net
This reduces SSH channel usage and speeds up family detection on both
Unix (no exec needed) and Windows (fast SSH ID match) servers.
Remove leak-timeout from nextest config to isolate whether leak detection causes Windows CI SSH test failures (19/47 tests failing since 694dfc4). Add channel-level debug logging in SSH exec and process spawn paths to trace the exact point where sshd goes silent after CHANNEL_SUCCESS. Capture sshd logs on drop for all tests (not just connection failures) and add a CI diagnostic step to record Windows sshd version, OS version, and system resources.
The Windows CI SSH failures were caused by the system sshd service, not SYSTEM file ownership or leak-timeout. Remove misleading diagnostics added during debugging (ver command, SYSTEM ownership claims, noisy pre-spawn logging) while keeping useful checks (sc query sshd, OpenSSH version, icacls on failure). Restore leak-timeout now that the real root cause is resolved.
Send SIGKILL via SSH signal request and close the channel on kill, instead of only sending EOF (which sleep ignores). Handle ExitSignal in both readers so the loop breaks immediately on signal-killed processes. Add 500ms startup delay and 5s timeout guards in PTY kill tests to avoid race conditions and catch regressions early.
wait_task now awaits stdout/stderr JoinHandles (with a 5s timeout) after child.wait() returns, ensuring all output is forwarded before ProcDone is sent. This fixes the shell test flake where `echo hello` would exit before stdout_task could forward the PTY data.
Abort stdout/stderr read tasks in SimpleProcess::wait() instead of awaiting them, matching the PtyProcess fix from 5a84ded. On Windows, cmd /C creates process trees where TerminateProcess only kills cmd.exe, leaving child processes holding pipes open indefinitely. Also add a timeout guard to the kill test to catch this regression on CI.
…ocesses Add kill_on_drop(true) to SimpleProcess spawn so Windows Job Object kills the entire process tree on drop, and spawn ping directly in the test instead of via cmd /C to avoid orphan process issues.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
distant-dockercrate — New workspace crate implementing the distant API against Docker containers via Bollard. File I/O uses the tar archive API; search detectsrg/grepat connect time; process management uses container exec with per-process streaming. Unix containers only (Docker host can be any platform).RemotePathnewtype — ReplacePathBufwithRemotePath(String)across the protocol and API layers, eliminating incorrect use ofstd::path::PathBuffor remote paths that may have different separator conventions.Plugin architecture refactoring — Move
DockerPluginintodistant-docker,SshPluginintodistant-ssh, andDistantPlugin(renamedHostPlugin) intodistant-host. ChangePlugin::connect/launchto accept raw&strdestinations instead of parsedDestination, enabling custom URI schemes likedocker://ubuntu:22.04. CLI destination args changed fromBox<Destination>toString.Optional backend features — All three backends (
host,ssh,docker) are now optional Cargo features (default-enabled).--versionoutput lists enabled backends.distant servercompiles out whenhostis disabled;distant sshcompiles out whensshis disabled.SftpPathBufindistant-ssh— Encapsulate SFTP/native path conversion into a proper type, replacing scattered free functions.Build version metadata —
build.rsembeds git hash, dirty flag, and commit date into--versionoutput (e.g.0.20.0 (abc1234+ 2026-03-03) [docker, host, ssh]).Shell injection hardening — Convert 11 Docker API commands from
run_shell_cmdto direct argv viarun_cmd; addshell_quote()for the remaining shell-dependent commands.Fix Windows SSH timeouts — Break on
ExitStatusunconditionally inexecute_output()and process reader tasks, since Windows OpenSSH may never sendEofafterExitStatus. Add layered Windows detection (SSH ID, SFTP, then exec fallback). Increase Windows CI test timeout from 30s to 90s.Fix process kill blocking on Windows — Three issues fixed in
distant-hostprocess lifecycle:wait_tasknow owns the stdout/stderr forwarding tasks and drains them (with a 5-secondOUTPUT_DRAIN_TIMEOUT) before sendingProcDone, ensuring clients receive all output before the done signal.SimpleProcess::wait()now aborts read tasks after receiving exit status instead of awaiting them, since on Windows killed processes may leave child processes alive (e.g.cmd /C→ping.exe) that hold pipes open indefinitely.kill_on_drop(true)toSimpleProcess::spawn(), which on Windows places the child in a Job Object that kills the entire process tree when theChildis dropped.PtyProcess::wait().Test infrastructure —
DockerFixtureandDockerManagerCtxin test harness; E2E CLI tests for Docker file/dir/process ops;test-implementorandtest-validatoragent prompts; strengthened assertions across all test suites. Cross-platformkill_process_treehelper usingpgrep-based descendant collection (Unix) andtaskkill /T(Windows) to prevent leaked handles in tests. Fix intermittent launch test failures by polling for manager socket readiness.Test plan
cargo test --all-features --workspacepasses (all platforms)cargo test --no-default-features -p distant --binspasses (feature-gated code compiles out cleanly)cargo test --all-features -p distant-dockerpasses on Linux with Dockercargo clippy --all-features --workspace --all-targetscleancargo fmt --all -- --checkpasseskill_should_succeed_for_running_processcompletes in <15s on Windows CI (was ~100s)proc_done_waits_for_delayed_stdout_to_drainregression test passes