Skip to content

Add Docker backend, refactor plugin architecture, and fix process lifecycle bugs#264

Merged
chipsenkbeil merged 41 commits intomasterfrom
feature/docker-backend
Mar 5, 2026
Merged

Add Docker backend, refactor plugin architecture, and fix process lifecycle bugs#264
chipsenkbeil merged 41 commits intomasterfrom
feature/docker-backend

Conversation

@chipsenkbeil
Copy link
Owner

@chipsenkbeil chipsenkbeil commented Mar 1, 2026

Summary

  • distant-docker crate — New workspace crate implementing the distant API against Docker containers via Bollard. File I/O uses the tar archive API; search detects rg/grep at connect time; process management uses container exec with per-process streaming. Unix containers only (Docker host can be any platform).

  • RemotePath newtype — Replace PathBuf with RemotePath(String) across the protocol and API layers, eliminating incorrect use of std::path::PathBuf for remote paths that may have different separator conventions.

  • Plugin architecture refactoring — Move DockerPlugin into distant-docker, SshPlugin into distant-ssh, and DistantPlugin (renamed HostPlugin) into distant-host. Change Plugin::connect/launch to accept raw &str destinations instead of parsed Destination, enabling custom URI schemes like docker://ubuntu:22.04. CLI destination args changed from Box<Destination> to String.

  • Optional backend features — All three backends (host, ssh, docker) are now optional Cargo features (default-enabled). --version output lists enabled backends. distant server compiles out when host is disabled; distant ssh compiles out when ssh is disabled.

  • SftpPathBuf in distant-ssh — Encapsulate SFTP/native path conversion into a proper type, replacing scattered free functions.

  • Build version metadatabuild.rs embeds git hash, dirty flag, and commit date into --version output (e.g. 0.20.0 (abc1234+ 2026-03-03) [docker, host, ssh]).

  • Shell injection hardening — Convert 11 Docker API commands from run_shell_cmd to direct argv via run_cmd; add shell_quote() for the remaining shell-dependent commands.

  • Fix Windows SSH timeouts — Break on ExitStatus unconditionally in execute_output() and process reader tasks, since Windows OpenSSH may never send Eof after ExitStatus. Add layered Windows detection (SSH ID, SFTP, then exec fallback). Increase Windows CI test timeout from 30s to 90s.

  • Fix process kill blocking on Windows — Three issues fixed in distant-host process lifecycle:

    1. ProcDone racing ProcStdout: wait_task now owns the stdout/stderr forwarding tasks and drains them (with a 5-second OUTPUT_DRAIN_TIMEOUT) before sending ProcDone, ensuring clients receive all output before the done signal.
    2. SimpleProcess kill blocking on orphaned pipes: SimpleProcess::wait() now aborts read tasks after receiving exit status instead of awaiting them, since on Windows killed processes may leave child processes alive (e.g. cmd /Cping.exe) that hold pipes open indefinitely.
    3. Orphaned child processes on Windows: Added kill_on_drop(true) to SimpleProcess::spawn(), which on Windows places the child in a Job Object that kills the entire process tree when the Child is dropped.
    4. PTY process kill: Same abort fix applied to PtyProcess::wait().
  • Test infrastructureDockerFixture and DockerManagerCtx in test harness; E2E CLI tests for Docker file/dir/process ops; test-implementor and test-validator agent prompts; strengthened assertions across all test suites. Cross-platform kill_process_tree helper using pgrep-based descendant collection (Unix) and taskkill /T (Windows) to prevent leaked handles in tests. Fix intermittent launch test failures by polling for manager socket readiness.

Test plan

  • cargo test --all-features --workspace passes (all platforms)
  • cargo test --no-default-features -p distant --bins passes (feature-gated code compiles out cleanly)
  • cargo test --all-features -p distant-docker passes on Linux with Docker
  • cargo clippy --all-features --workspace --all-targets clean
  • cargo fmt --all -- --check passes
  • Windows CI SSH tests no longer time out (38 tests pass)
  • kill_should_succeed_for_running_process completes in <15s on Windows CI (was ~100s)
  • proc_done_waits_for_delayed_stdout_to_drain regression test passes

@chipsenkbeil chipsenkbeil force-pushed the feature/docker-backend branch 5 times, most recently from 79235a9 to eaded68 Compare March 1, 2026 17:17
@chipsenkbeil chipsenkbeil changed the title Add distant-docker backend plugin for Docker container interaction Add distant-docker backend plugin with cross-platform container support Mar 1, 2026
@chipsenkbeil chipsenkbeil force-pushed the feature/docker-backend branch 2 times, most recently from a542602 to 6004111 Compare March 1, 2026 18:20
@chipsenkbeil chipsenkbeil force-pushed the feature/docker-backend branch 8 times, most recently from f8be0ec to 93877d5 Compare March 1, 2026 22:10
Accumulated fixes from CI testing against Windows nanoserver containers:

- exists: use tar-based check as primary on Windows (exec `if exist`
  cannot see tar-created paths due to ContainerUser permissions)
- remove: split into separate rmdir/del exec calls (compound `&`
  operator has unreliable errorlevel propagation on nanoserver)
- search: add tar-based fallback with condition_matches() for Windows
  where findstr/dir cannot access tar-created paths
- search: fix has_any() to include has_grep in the check
- search: fix findstr probe to check stdout instead of exit code
  (findstr /? exits 1 on nanoserver but still prints help)
- process: retrieve real exit codes via inspect_exec API instead of
  hardcoding success=true/code=None
- process: move PTY resize_exec after start_exec (Docker requires
  exec to be running before resize is valid)
- process: fix proc_kill to not remove process map entry (let reader
  task cleanup handle removal to avoid race conditions)
@chipsenkbeil chipsenkbeil force-pushed the feature/docker-backend branch from 93877d5 to 9dd79a3 Compare March 1, 2026 23:15
Revert api.rs ContainerAdministrator changes that didn't fix the issue
and replace CI with a focused diagnostic workflow that tests 8 hypotheses
about nanoserver tar/exec interaction using raw docker CLI commands on
windows-2022 (nanoserver:ltsc2022 + servercore:ltsc2022 comparison).
The test harness created C:\temp via the Docker tar API, which produces
SYSTEM-owned directories where ContainerUser only inherits
BUILTIN\Users:(RX) — insufficient for file deletion. Switch to
exec-based mkdir so ContainerUser gets FILE_DELETE_CHILD permission,
fixing remove and rename test failures on nanoserver.

Also suppress dead_code warning on condition_matches (reserved for
future tar-based search fallback) and restore the original CI workflow
that was replaced by the diagnostic workflow.
@chipsenkbeil chipsenkbeil force-pushed the feature/docker-backend branch from b183dfa to 662699e Compare March 2, 2026 00:40
Install Docker on macOS via docker/setup-docker-action@v4 (Lima).
Replace per-platform pre-pull steps with a single step conditioned
on runner.os != 'Windows', since Windows runners cannot run Linux
containers. Docker tests on Windows are skipped by skip_if_no_docker!.
Drop DockerFamily enum, nanoserver dispatch, findstr/cmd.exe paths,
and all Windows-specific workarounds. All containers are now assumed
to be Unix (the Docker host can still be any platform). This removes
~750 lines of platform-dispatch code, simplifies shell commands to
use sh -c directly, and updates AGENTS.md technical debt entries to
reflect the Unix-only container scope.
…chemes

Change Plugin::connect/launch to accept &str instead of &Destination,
deferring destination parsing to each plugin. This enables the Docker
plugin to handle URIs like docker://ubuntu:22.04 where :22.04 is an
image tag, not a port. The manager extracts the scheme for plugin
routing and passes the raw string through. ManagerRequest, ConnectionInfo,
and ConnectionList now carry String destinations instead of parsed
Destination types.
Upgrade pull progress logging from trace! to info! and include
current/total byte counts from progress_detail when available,
giving users visibility into download progress.
Resolve relative paths to absolute using the container's working
directory before running find. This ensures strip_prefix works
correctly and entries are not silently skipped when the path is '.'
or other relative paths.
Add auto_remove field to Docker struct, default LaunchOpts::auto_remove
to true, and spawn cleanup tasks that stop and remove the container
when the server shuts down. Add subscribe_shutdown() to ServerRef for
external shutdown notification. Remove resolved Technical Debt #7 from
CLAUDE.md.
Simplify read_file_text, write_file_text, and append_file_text to
delegate to their binary counterparts with bytes/string conversion
at the boundary, removing ~40 lines of duplicated logic.
DockerApi now stores DockerClient instead of raw bollard::Docker.
All callsites in api.rs, test harness, and integration tests use
DockerClient's public methods (ping, has_image, pull_image,
create_and_start_container) or .inner() for crate-internal access,
keeping bollard out of the public API surface.
Encapsulate all SFTP↔native path conversion logic into a proper
SftpPathBuf type in distant-ssh, replacing 5 free functions and
2 methods scattered across api.rs. The new type makes correct
separator handling the default: all path manipulation happens in
SFTP space (using /), and to_remote_path() converts to native
separators at the boundary.

Key changes:
- Add SftpPathBuf with from_remote, from_sftp constructors and
  as_str, to_remote_path, join, file_name, strip_prefix operations
- Use Utf8WindowsPath explicitly (not derive()) so \ is recognized
  as a separator on all host platforms
- Update all api.rs call sites including read_dir recursive traversal
- Fix relative DirEntry paths on Windows (sub1\file2 not sub1/file2)
  by cfg-gating test assertions
- Change CLI destination args from Box<Destination> to String so Docker
  URIs like docker://ubuntu:22.04 pass through to plugins unparsed
- Add extract_scheme() and ensure_scheme() helpers for raw destination
  strings; guard post-launch host replacement to skip Docker schemes
- Add shlex-based shell_quote() and convert 11 api.rs commands from
  run_shell_cmd to direct argv via run_cmd, eliminating shell injection
- Quote paths and patterns in search commands; add SearchCommand struct
  with exit code handling (grep/rg exit 1 = no matches, >=2 = error)
- Add tar_copy_path() for file and directory copies via tar API,
  replacing the file-only tar_read_file fallback in copy()
- Make tar_read_file reject directory paths instead of silently
  returning the first file's contents
- Remove hardcoded default image from LaunchOpts; validate non-empty
  image in Docker::launch()
- Add distant_docker to CLI logging modules
- Update CLAUDE.md technical debt (remove 4 resolved items, add 1 new)
Introduce two agent prompt files in .claude/agents/ that enforce test
quality standards derived from the project's original human-written
tests. The implementor writes tests across unit, integration, and CLI
tiers with strict naming, assertion, and cleanup rules. The validator
reviews test code against 10 checks and blocks on quality issues.
Replace weak stderr assertions (is_empty().not()) with specific error
message checks, fix smoke tests that only validated existence/success
without checking actual values, add missing error-case tests for Docker
backend operations, and broaden Docker error matching to handle
platform-specific message formats.
Embeds git commit hash, dirty-tree indicator, and commit date into
the CLI --version output via a build.rs script, making nightly and
dev builds fully traceable without changing the semver version.
Move DistantPlugin from the binary crate (handlers.rs) to distant-host
as HostPlugin for consistency with SshPlugin and DockerPlugin. All three
backend crates are now optional Cargo features (default-enabled), allowing
minimal builds. The version string now lists enabled backends (e.g.
`[docker, host, ssh]`).
@chipsenkbeil chipsenkbeil changed the title Add distant-docker backend plugin with cross-platform container support Add Docker backend, refactor plugin architecture, and make backends optional features Mar 4, 2026
Add global --quiet/-q flag that suppresses all informational UI output
(spinners, connection status, success/error messages) on stderr, making
CLI output clean for scripting and pipelines.

Fix missing trailing newlines in system-info, metadata (unix/windows),
and directory listing output that caused zsh to display a '%' marker.
…adiness

Replace the fixed 50ms sleep in ManagerCtx::start() and
ManagerOnlyCtx::start() with a poll loop that waits for the Unix socket
file to exist (up to 2s). The old sleep was too short — the manager
sometimes hadn't bound its socket yet, causing connect_to_manager() to
fail and attempt a conflicting auto-start. Also improve the launch test
assertion to include the full JSON response on failure.
…tection

Move SSH and Docker test-group overrides from the CI profile to the
default profile so throttling applies to local runs too. Add leak
detection (100ms timeout, fail on leak) and set final-status-level to
leak. The CI profile now only adds retries and slow-timeout on top.
Nextest detected leaked handles (LKFAIL) in launch and SSH tests because
Drop impls only killed the direct child process, leaving grandchildren
(servers spawned via distant launch, sshd session handlers) alive with
inherited file descriptors.

Add a cross-platform kill_process_tree helper: on Unix, spawns processes
in their own process group (process_group(0)) and kills the entire group
via kill(-pgid, SIGKILL); on Windows, uses taskkill /T /F to kill the
process tree. Apply to ManagerCtx, ManagerOnlyCtx, and Sshd Drop impls.
When sshd allocates a PTY, the sshd-session child calls setsid() to
become a session leader, escaping the process group. The previous
kill(-pgid, SIGKILL) did not reach it, leaking file descriptors.

Add collect_descendants() which recursively walks parent-child links
via pgrep -P (preserved even after setsid) and kills deepest-first,
then falls back to the process group kill for any remaining members.
Windows OpenSSH sometimes never sends Eof after ExitStatus, causing
execute_output() and process reader tasks to hang until the 30s timeout.
Break immediately on ExitStatus since SSH protocol guarantees all data
has been flushed by that point. Increase Windows CI test timeout from
30s to 90s for additional headroom.
@chipsenkbeil chipsenkbeil changed the title Add Docker backend, refactor plugin architecture, and make backends optional features Add Docker backend, refactor plugin architecture, and fix Windows SSH timeouts Mar 4, 2026
Replace the single PowerShell-based `is_windows()` with a three-layer
strategy that avoids unnecessary shell execution:

1. SSH identification string — zero cost, captured during key exchange
2. SFTP canonicalize(".") — detects Windows drive prefix without exec
3. Parallel echo %OS% / PowerShell fallback as a safety net

This reduces SSH channel usage and speeds up family detection on both
Unix (no exec needed) and Windows (fast SSH ID match) servers.
Remove leak-timeout from nextest config to isolate whether leak
detection causes Windows CI SSH test failures (19/47 tests failing
since 694dfc4). Add channel-level debug logging in SSH exec and
process spawn paths to trace the exact point where sshd goes silent
after CHANNEL_SUCCESS. Capture sshd logs on drop for all tests (not
just connection failures) and add a CI diagnostic step to record
Windows sshd version, OS version, and system resources.
The Windows CI SSH failures were caused by the system sshd service, not
SYSTEM file ownership or leak-timeout. Remove misleading diagnostics
added during debugging (ver command, SYSTEM ownership claims, noisy
pre-spawn logging) while keeping useful checks (sc query sshd, OpenSSH
version, icacls on failure). Restore leak-timeout now that the real root
cause is resolved.
Send SIGKILL via SSH signal request and close the channel on kill,
instead of only sending EOF (which sleep ignores). Handle ExitSignal
in both readers so the loop breaks immediately on signal-killed
processes. Add 500ms startup delay and 5s timeout guards in PTY
kill tests to avoid race conditions and catch regressions early.
wait_task now awaits stdout/stderr JoinHandles (with a 5s timeout)
after child.wait() returns, ensuring all output is forwarded before
ProcDone is sent. This fixes the shell test flake where `echo hello`
would exit before stdout_task could forward the PTY data.
Abort stdout/stderr read tasks in SimpleProcess::wait() instead of
awaiting them, matching the PtyProcess fix from 5a84ded. On Windows,
cmd /C creates process trees where TerminateProcess only kills cmd.exe,
leaving child processes holding pipes open indefinitely. Also add a
timeout guard to the kill test to catch this regression on CI.
…ocesses

Add kill_on_drop(true) to SimpleProcess spawn so Windows Job Object
kills the entire process tree on drop, and spawn ping directly in the
test instead of via cmd /C to avoid orphan process issues.
@chipsenkbeil chipsenkbeil changed the title Add Docker backend, refactor plugin architecture, and fix Windows SSH timeouts Add Docker backend, refactor plugin architecture, and fix process lifecycle bugs Mar 5, 2026
@chipsenkbeil chipsenkbeil merged commit 2be75a5 into master Mar 5, 2026
8 checks passed
@chipsenkbeil chipsenkbeil deleted the feature/docker-backend branch March 5, 2026 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant