
Mng/run tmr #910

Draft
qi-imbue wants to merge 94 commits into main from
mng/run-tmr

Conversation

@qi-imbue
Contributor

No description provided.

qi-imbue and others added 30 commits March 6, 2026 17:52
Introduce skitwright, a lightweight end-to-end testing framework for CLI
applications (a nod to Playwright). It provides:
- Session: runs shell commands and records a text transcript
- CommandResult: structured result with exit code, stdout, stderr
- expect(): fluent assertion API for results and strings
- Transcript: annotated text recording of all commands and outputs
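
A minimal sketch of how these four pieces could fit together. The method names (`Session.run`, `to_succeed`, `to_contain`) and the transcript line format are illustrative assumptions, not skitwright's actual API.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommandResult:
    """Structured result of one shell command."""
    command: str
    exit_code: int
    stdout: str
    stderr: str


@dataclass
class Session:
    """Runs shell commands and keeps a plain-text transcript of them."""
    transcript: list[str] = field(default_factory=list)

    def run(self, command: str, timeout: float = 30.0) -> CommandResult:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        result = CommandResult(command, proc.returncode, proc.stdout, proc.stderr)
        # Record the command followed by its annotated output and exit code.
        self.transcript.append(f"$ {command}")
        self.transcript.extend(f"  out: {line}" for line in proc.stdout.splitlines())
        self.transcript.extend(f"  err: {line}" for line in proc.stderr.splitlines())
        self.transcript.append(f"  exit: {proc.returncode}")
        return result


@dataclass(frozen=True)
class Expectation:
    """Fluent assertions over a CommandResult (hypothetical method names)."""
    result: CommandResult

    def to_succeed(self) -> "Expectation":
        assert self.result.exit_code == 0, (
            f"{self.result.command!r} exited with {self.result.exit_code}"
        )
        return self

    def to_contain(self, text: str) -> "Expectation":
        assert text in self.result.stdout, f"{text!r} not found in stdout"
        return self


def expect(result: CommandResult) -> Expectation:
    return Expectation(result)
```

A test would then read roughly as `expect(session.run("mng --help")).to_succeed().to_contain("create")`, with `session.transcript` written to a file afterwards.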

Add 10 basic e2e tests for the mng CLI that exercise it exclusively
through its CLI interface (no library imports from mng):
- Help output (mng --help, mng create --help)
- List with no agents (table and JSON formats)
- Create + list (verifies agent appears in list)
- Create with JSON output format
- Create in headless mode
- Create + destroy lifecycle
- Create + rename
- Create with labels (verified via JSON list output)

The e2e test fixture provides full isolation: separate MNG_HOST_DIR,
MNG_PREFIX, MNG_ROOT_NAME, TMUX_TMPDIR, and disabled remote providers.
Each test saves a transcript file for debugging.
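
The isolation described above could be set up along these lines. This is a sketch only: the environment variable names come from the commit message, but the directory layout and the value formats are assumptions, and the mechanism for disabling remote providers is not shown because the message does not describe it.

```python
import os
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def build_isolated_env(base_dir: Optional[Path] = None) -> dict[str, str]:
    """Return a copy of os.environ that points mng and tmux at per-test state."""
    if base_dir is None:
        base_dir = Path(tempfile.mkdtemp(prefix="mng-e2e-"))
    suffix = uuid.uuid4().hex[:8]
    host_dir = base_dir / "mng-host"
    tmux_dir = base_dir / "tmux"
    host_dir.mkdir(parents=True, exist_ok=True)
    tmux_dir.mkdir(parents=True, exist_ok=True)

    env = dict(os.environ)
    env.update(
        {
            "MNG_HOST_DIR": str(host_dir),
            "MNG_PREFIX": f"e2e-{suffix}-",  # assumed value format
            "MNG_ROOT_NAME": f"e2e-root-{suffix}",  # assumed value format
            "TMUX_TMPDIR": str(tmux_dir),
        }
    )
    return env
```

Passing the returned mapping as `env=` to every subprocess keeps one test's agents and tmux sockets invisible to another's.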

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace uuid4().hex[:8] with get_short_random_string()
- Replace relative import with absolute import
- Convert MngRunner from class with __init__ to NamedTuple
- Replace subprocess.run with skitwright run_command for tmux cleanup
- Add @pytest.mark.release to all e2e tests (subprocess-based, no coverage)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add test_ratchets.py for skitwright (required by meta ratchet)
- Add pytest config to skitwright pyproject.toml
- Replace MngRunner class with lambda-based mng fixture (avoids
  __init__, NamedTuple, dataclass, and inline function ratchets)
- Update MngRunFn type alias and test_basic.py to use function style

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ix markers

- Add unit tests for skitwright (expect, transcript, session): 33 new tests,
  100% coverage
- Factor repetitive agent creation into create_agent fixture in e2e conftest
- Change e2e test markers from @pytest.mark.release to @pytest.mark.acceptance
  so they run in CI on every PR
- Fix incorrect docstring claiming "no library imports from mng"
- Document skitwright as test-only dependency in mng pyproject.toml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The message command's CEL filter builder discarded the host/provider part
of agent addresses (e.g. agent@host.modal). Now host_name and provider_name
from parsed addresses are incorporated into the CEL filter, and the CEL
context includes host.name for matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In _partition_destroy_targets, the loop over online host agents silently
skipped matched agents that were no longer present. Now raises
AgentNotFoundError if any matched agent ID is not found in get_agents().
Also removes the redundant seen_hosts set (dict keys are already unique).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes test_every_project_has_pypi_readme CI failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cleanup

- Remove succeeded/failed computed properties from CommandResult; use
  exit_code checks directly (per user feedback)
- Replace subprocess.run with Popen+threads for real-time interleaved
  stdout/stderr capture in the transcript (line-buffered)
- Add OutputSource enum and OutputLine data type for interleaved output
- Remove uv run prefix from mng commands (already in PATH via uv run pytest)
- Add mng destroy --all --force cleanup in e2e fixture teardown
- Add MNG_E2E_KEEP_ON_FAILURE env var to keep agents running on failure
- Print transcript path to stderr on test failure
- Update README, ratchets, and tests accordingly
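
The Popen-plus-threads capture might look roughly like the following. `OutputSource` and `OutputLine` are named in the commit message, but their fields and the `run_interleaved` helper are assumptions made for illustration.

```python
import subprocess
import threading
from dataclasses import dataclass
from enum import Enum


class OutputSource(Enum):
    STDOUT = "stdout"
    STDERR = "stderr"


@dataclass(frozen=True)
class OutputLine:
    source: OutputSource
    text: str


def run_interleaved(command: str) -> tuple[int, list[OutputLine]]:
    """Run a shell command, recording stdout and stderr lines in arrival order."""
    lines: list[OutputLine] = []
    lock = threading.Lock()

    def pump(stream, source: OutputSource) -> None:
        # Each reader thread appends lines as soon as they arrive.
        for raw in stream:
            with lock:
                lines.append(OutputLine(source, raw.rstrip("\n")))
        stream.close()

    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered on our side of the pipes
    )
    threads = [
        threading.Thread(target=pump, args=(proc.stdout, OutputSource.STDOUT)),
        threading.Thread(target=pump, args=(proc.stderr, OutputSource.STDERR)),
    ]
    for thread in threads:
        thread.start()
    exit_code = proc.wait()
    for thread in threads:
        thread.join()
    return exit_code, lines
```

The ordering between the two streams reflects when each reader thread saw a line, so it is approximate rather than guaranteed; within a single stream the order is preserved.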

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert message.py, message_test.py, destroy.py changes that belong to
another branch. Fix timeout test failure in CI by using process_group=0
and os.killpg() to kill the entire process tree (not just the shell).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --cov=imbue.skitwright to root pyproject.toml addopts so coverage
is tracked in monorepo-level CI runs. Also add proc.wait() after
os.killpg() in timeout path to deterministically reap the process.
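
A sketch of the timeout path described in this commit and the previous one: start the child in its own process group, kill the whole group on timeout, then reap it. The helper name and return shape are hypothetical. `process_group` exists only on Python 3.11+, so the sketch falls back to `start_new_session=True` on older versions, which also makes the child a group leader.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(command: str, timeout: float) -> tuple[bool, int]:
    """Return (timed_out, exit_code), killing the entire process tree on timeout."""
    if sys.version_info >= (3, 11):
        group_kwargs = {"process_group": 0}
    else:
        group_kwargs = {"start_new_session": True}

    proc = subprocess.Popen(command, shell=True, **group_kwargs)
    try:
        proc.wait(timeout=timeout)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        try:
            # The child leads its own group, so its pid is also the group id.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        # Reap the child so returncode is set deterministically.
        proc.wait()
        return True, proc.returncode
```

Killing only `proc.pid` would leave grandchildren of the shell running; signalling the group is what removes them.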

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The emit tests previously had no assertions; they only verified that the
functions don't crash. Now they verify actual output: the human-format test
checks that the value appears in stdout, and the JSONL test checks the parsed
event structure. Removed the JSON-format variants since emit_event is silent
in JSON mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The timeout code path was discarding all pre-timeout stderr output,
replacing it with just the timeout message. Now reconstructs real
stderr from captured lines and appends the timeout message.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JSON mode emit tests were removed in the previous commit since
emit_event is silent in JSON mode. Restored them with assertions
that stdout is empty, verifying the code path executes without
crashing and produces no output as expected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-- splitting

- Add --provider option to select which provider to launch agents on
  (e.g. docker, modal). All code that previously hardcoded LOCAL_PROVIDER_NAME
  now uses the configurable provider. Each agent tracks its own host reference
  to support providers that create a separate host per agent.

- Add --env option to pass environment variables to agents (KEY=VALUE,
  repeatable). Uses the same resolve_env_vars utility as mng create.

- Add --label option to attach labels to all launched agents (KEY=VALUE,
  repeatable). Labels are applied to both test agents and integrator agents.

- Add --prompt-suffix option to append custom text to the agent prompt.

- Replace _split_pytest_args with _TmrCommand (same pattern as _CreateCommand
  in mng create) for robust -- separator handling at the Click parse level.
  Test collection args go before --, testing flags go after --.
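
The contract of the `--` separator can be illustrated with a plain-list helper. This is not how `_TmrCommand` is implemented (the commit says it works at the Click parse level); the function below is a hypothetical stand-in that shows only the intended split.

```python
def split_at_separator(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split argv at the first "--" into (collection_args, testing_flags)."""
    if "--" not in argv:
        return list(argv), []
    index = argv.index("--")
    # Only the first "--" is a separator; any later one is passed through as-is.
    return argv[:index], argv[index + 1 :]
```

For example, `split_at_separator(["tests/e2e", "--", "-x", "--maxfail=1"])` yields `["tests/e2e"]` as collection args and `["-x", "--maxfail=1"]` as testing flags.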

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Script parses a tutorial shell script into command blocks and matches
them against pytest functions by checking docstrings. Reports unmatched
blocks (needing tests) and unmatched tests (needing blocks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add tutorial_matcher_test.py with 12 unit tests covering all functions
- Fix redundant file reads in find_pytest_functions (read once, reuse)
- Warn on stderr when skipping files with syntax errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… setup

- Move label KEY=VALUE parsing from tmr cli.py and create.py into a shared
  resolve_labels() function in env_utils.py (alongside resolve_env_vars).
  Both create.py and tmr cli.py now call resolve_labels().

- Extract _invoke_tmr_command() helper to eliminate duplicated Click command
  setup boilerplate across 5 _TmrCommand tests.

- Rename test_cli_help_contains_new_options to
  test_cli_help_contains_provider_env_label_options for clarity.
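
A sketch of what shared KEY=VALUE label parsing might look like. The real `resolve_labels()` signature and error type are not shown in the commit message, so the name `resolve_labels_sketch`, the tuple input, and the `ValueError` are all assumptions.

```python
def resolve_labels_sketch(raw_labels: tuple[str, ...]) -> dict[str, str]:
    """Parse repeatable KEY=VALUE strings into a dict; later keys win."""
    labels: dict[str, str] = {}
    for raw in raw_labels:
        # partition splits on the first "=", so values may contain "=".
        key, separator, value = raw.partition("=")
        if not separator or not key:
            raise ValueError(f"Label must be in KEY=VALUE format, got: {raw!r}")
        labels[key] = value
    return labels
```

With a single helper like this, `create.py` and the tmr `cli.py` would reject malformed labels in the same way.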

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qi-imbue and others added 30 commits March 18, 2026 14:19
When a remote agent's host becomes unreachable (e.g. Modal sandbox
terminated), operations like reading results, stopping agents, and
pulling branches would crash the entire coordinator. Now:

- read_agent_result catches HostError and returns REMOTE_AGENT_ERROR
  with a descriptive summary instead of crashing.
- _stop_agent_on_host catches HostError (it only caught MngError before,
  but HostError extends BaseMngError, not MngError).
- pull_agent_branch catches HostError for the same reason.
- New REMOTE_AGENT_ERROR outcome is added to TestOutcome with its own
  color (purple) in the HTML report.
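
The exception-hierarchy point is easy to miss, so here is a small self-contained illustration. The classes below are stand-ins that mirror only the relationship stated above (HostError extends BaseMngError, not MngError); that MngError itself extends BaseMngError is an assumption, and the result dictionary is a hypothetical shape.

```python
class BaseMngError(Exception):
    """Stand-in for the real base class."""


class MngError(BaseMngError):
    """Stand-in; assumed to extend BaseMngError."""


class HostError(BaseMngError):
    """Stand-in; a sibling of MngError, not a subclass."""


def read_agent_result_sketch(fetch_result):
    """Return the fetched result, or a REMOTE_AGENT_ERROR record if the host is gone."""
    try:
        return fetch_result()
    except HostError as exc:
        # `except MngError` would NOT catch this, because HostError is a sibling.
        return {
            "outcome": "REMOTE_AGENT_ERROR",
            "summary": f"Host unreachable while fetching result file: {exc}",
        }


def unreachable():
    raise HostError("sandbox terminated")
```

Calling `read_agent_result_sketch(unreachable)` returns the error record instead of propagating the exception to the coordinator.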

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For remote agents (modal/docker), branches don't exist locally. To pull
changes from agents with FIX_*_SUCCEEDED outcomes:
- Save the base commit hash at the start of the run.
- Before pulling, create a local branch from the base commit.
- Then pull_git fetches the remote agent's changes into that branch.

Add --integrator-provider option (defaults to "local") so the integrator
agent runs locally. This makes sense because there is only one integrator
and it needs access to the local branches just pulled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The send_message call can fail with a raw TimeoutError from pyinfra
(SSH command timeout) in addition to SendMessageError. Broaden the
catch to handle TimeoutError and HostError as well.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix: Only pass base_commit to gather_results when using a remote provider.
For the local provider, agents use WORKTREE mode so their branches already
exist locally -- calling git branch would fail with 'already exists'.

Fix: Catch ProcessError in pull_agent_branch so git command failures
(from _create_local_branch or other git operations) are handled gracefully
instead of crashing the coordinator.

Fix: Make _create_local_branch tolerate pre-existing branches by using
is_checked_after=False and falling back to reuse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When --integrator-provider is set to a remote provider, the integrator's
branch doesn't exist locally. Pass base_commit through to
_run_integrator_phase so pull_agent_branch can create the local branch
before pulling, matching the pattern used for test agents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When --snapshot is provided, all agents are launched from that snapshot
directly, skipping the --use-snapshot build-and-snapshot flow. This is
useful when a snapshot was created in a previous run and can be reused.

When both --snapshot and --use-snapshot are provided, --snapshot takes
precedence (no need to build a new snapshot).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When recording REMOTE_AGENT_ERROR, the summary now says exactly which
stage failed (fetching result file, pulling branch, confirming message
delivery). Connection failures during agent stop are already ignored
since stopping is purely cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The list_agents call in poll_until_all_done can fail with connection
errors (e.g. Docker daemon temporarily unavailable, network blip).
Catch MngError, HostError, ConcurrencyGroupError, and OSError, log a
warning, and retry on the next polling cycle instead of crashing.

Bump the time_sleep ratchet count from 2 to 3 for the new sleep in
the error retry path.
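
The retry behaviour can be sketched as follows. `OSError` stands in for the full set of caught types (MngError, HostError, ConcurrencyGroupError, OSError), and the function signature, including the injectable `sleep`, is an assumption made so the sketch is testable.

```python
import logging
import time
from typing import Callable, Sequence

logger = logging.getLogger("tmr_sketch")

# Stand-in for (MngError, HostError, ConcurrencyGroupError, OSError).
TRANSIENT_ERRORS = (OSError,)


def poll_until_all_done_sketch(
    list_agents: Callable[[], Sequence[str]],
    is_done: Callable[[str], bool],
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Poll until every agent is done, tolerating transient listing failures."""
    while True:
        try:
            agents = list_agents()
        except TRANSIENT_ERRORS as exc:
            logger.warning("list_agents failed, retrying next cycle: %s", exc)
            # Sleep before retrying so a persistent failure cannot spin in a tight loop.
            sleep(poll_interval_seconds)
            continue
        if all(is_done(agent) for agent in agents):
            return list(agents)
        sleep(poll_interval_seconds)
```

The sleep inside the `except` branch is the one that requires the ratchet bump: without it, a failing `list_agents` would be retried immediately.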

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Apply the same transient error handling to the integrator polling loop
as poll_until_all_done: catch MngError/HostError/ConcurrencyGroupError/
OSError, log a warning, and retry on the next cycle.

Also rename unused snapshot_name to _snapshot_name in cli.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Default HTML report path is now tmr_reports/tmr-report-<timestamp>.html
instead of the current directory. The directory is created automatically
by generate_html_report (which already calls mkdir). Added tmr_reports/
to .gitignore.
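
For illustration, the default path could be built like this. The `tmr_reports/tmr-report-<timestamp>.html` shape is from the commit message; the timestamp format and both function names are assumptions.

```python
from datetime import datetime
from pathlib import Path


def default_report_path(now: datetime, base_dir: Path = Path("tmr_reports")) -> Path:
    """Build tmr_reports/tmr-report-<timestamp>.html for the given time."""
    stamp = now.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    return base_dir / f"tmr-report-{stamp}.html"


def write_report(path: Path, html: str) -> None:
    """Write the report, creating the parent directory if it is missing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
```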

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the `mng` and `create_agent` fixture wrappers so that each test
shows the exact CLI command being run as a plain string via `e2e.run()`.
Replace `create_agent` with a simple `agent_name` fixture that provides
a unique name without hiding the command construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update the --output-html help to reflect the new default path
(tmr_reports/tmr-report-<timestamp>.html). Fix duplicate Step 4
comment by renumbering steps sequentially (1-10).

The time_sleep ratchet increase (2 -> 4) in earlier commits is
justified: both new sleeps are in polling error retry paths to prevent
tight-loop retries after transient network failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When launching many agents concurrently on remote providers like Modal,
the API rate limit (25 req/s) can be hit. Add two configurable options:

- --max-parallel (default 4): max concurrent agent launches
- --launch-delay (default 2.0s): delay between submitting each launch

Agent launches are staggered by sleeping between submissions while the
executor limits concurrency. This keeps the request rate well below
provider limits.
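
A minimal sketch of staggered submission with bounded concurrency, using the defaults named above. The function name and the injectable `sleep` are assumptions; the actual launcher may be structured differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def launch_staggered(
    launch_fns: Sequence[Callable[[], T]],
    max_parallel: int = 4,
    launch_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[T]:
    """Submit launches one at a time with a delay, running at most max_parallel at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as executor:
        futures = []
        for index, launch in enumerate(launch_fns):
            if index > 0:
                # Space out submissions so requests do not arrive in a burst.
                sleep(launch_delay)
            futures.append(executor.submit(launch))
        # Results come back in submission order; a failed launch re-raises here.
        return [future.result() for future in futures]
```

The two knobs do different jobs: `max_parallel` caps how many launches are in flight, while `launch_delay` caps how quickly new ones start.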

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts by adopting the new fixture pattern from e2e-tests-deux
(e2e: Session + agent_name: str) while keeping tutorial block docstrings.
Remove 4 orphaned tests (list, destroy, rename) that have no tutorial blocks.
Update test_tutorial_create.py to use the new fixtures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- test_create_with_disabled_provider: verify error output mentions the
  provider/disabled, not just that the command failed
- test_create_plugins: add assertions on both success and failure paths
- test_create_bare: verify agent runs in a worktree (different pwd)
- test_create_different_agent_type: verify agent_type == "codex" in JSON
- test_create_source_path: verify agent's pwd differs, use unique temp path
- test_create_shallow_clone: verify git rev-list --count == 1
- test_create_from_agent: verify target has same git HEAD as source
- test_create_copy_with_branch: verify no new branch created, agent on main
- test_create_connect_command: verify connect_command in JSON output
- test_create_template: verify in_place=true applied (pwd matches main repo)
- test_create_no_git: use unique temp path with cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --no-connect and --no-ensure-clean flags were placed after the --
separator, causing them to be passed to python as arguments instead of
to mng create as flags. Move them before -- so they are correctly
handled by mng.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- test_create_connect_command: remove assertion on connect_command field
  which doesn't exist in AgentDetails JSON output; verify agent is
  RUNNING/WAITING instead
- test_create_plugins: assert deterministically that nonexistent plugin
  causes failure with plugin-related error message; also exercise
  --disable-plugin flag to match the tutorial block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ModalProxyError extends Exception directly (not MngError or HostError),
so it was not caught by the polling error handlers. Widen the catch to
include Exception alongside the specific types.

Root cause note: the Modal plugin's list_agents path calls get_tags()
per-host, each of which triggers a separate sandbox_list() API call.
With N Modal hosts, this means N sandbox_list() calls per poll cycle,
easily hitting Modal's 25/s rate limit. The fix should be in the Modal
plugin (caching sandbox_list results), but for now we tolerate the
error in the tmr polling loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Modal plugin's _list_running_host_ids() already fetches all sandboxes
and their tags during discovery. Populate _sandbox_cache_by_id and
_sandbox_cache_by_name with these results so that subsequent
get_host_tags() calls (triggered by _build_host_details_from_host in
list.py:441) hit the cache instead of making N additional sandbox_list()
API calls (one per host). This was the root cause of hitting Modal's
25/s rate limit during polling.
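
The caching idea, reduced to a self-contained sketch. The class, its methods, and the dictionary shape of a sandbox are hypothetical; only the pattern (fill the cache during discovery, serve tag lookups from it) reflects the commit message.

```python
from typing import Callable


class SandboxDirectorySketch:
    """Caches sandbox listings so per-host tag lookups need no extra API calls."""

    def __init__(self, list_sandboxes: Callable[[], list[dict]]) -> None:
        self._list_sandboxes = list_sandboxes
        self._cache_by_id: dict[str, dict] = {}

    def list_running_host_ids(self) -> list[str]:
        # One API call fetches every sandbox together with its tags.
        sandboxes = self._list_sandboxes()
        self._cache_by_id = {sandbox["id"]: sandbox for sandbox in sandboxes}
        return list(self._cache_by_id)

    def get_host_tags(self, host_id: str) -> dict[str, str]:
        cached = self._cache_by_id.get(host_id)
        if cached is None:
            # Cache miss: refresh once instead of failing outright.
            self.list_running_host_ids()
            cached = self._cache_by_id[host_id]
        return cached["tags"]
```

With N hosts, discovery followed by N tag lookups costs one listing call instead of N + 1.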

Also make wait_for_integrator's polling error catch consistent with
poll_until_all_done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
