Skip to content

fix: recover failed pool connections#33

Open
abnegate wants to merge 7 commits into
mainfrom
fix/destroy-failed-pool-connections
Open

fix: recover failed pool connections#33
abnegate wants to merge 7 commits into
mainfrom
fix/destroy-failed-pool-connections

Conversation

@abnegate

@abnegate abnegate commented Jun 18, 2026

Copy link
Copy Markdown
Member

What changed

  • Route Pool::use() cleanup through a shared release path that distinguishes successful use from failed callbacks.
  • Reclaim borrowed resources on successful use and on grouped acquisition failures before user code receives the resources.
  • After a callback failure, recover object resources only when reset() and/or reconnect() exists and succeeds; destroy and replace resources when recovery throws, returns false, or no recovery hook exists.
  • Destroy native PHP resources after callback failure because they have no generic recovery contract.
  • Preserve the original callback exception if destroy/replacement creation also fails.
  • Rework Group::use() to acquire connections explicitly, release them in reverse order, continue releasing every borrowed connection if one release throws, and record use-duration telemetry for grouped borrows.
  • Add regression coverage for failed-resource replacement, false recovery hooks, missing recovery hooks, native resources, exception preservation, grouped acquisition cleanup, grouped release failure cleanup, and grouped use telemetry.
  • Fix invalid Composer metadata by replacing the duplicate/nonstandard suggests key with a single suggest section.
  • Install the declared opentelemetry and protobuf extensions in the test workflow so Composer can install utopia-php/telemetry.
  • Apply Pint formatting required by the repository style check.

Why

A callback failure can leave a borrowed resource in poisoned state. Blindly reclaiming that resource puts the same broken connection back in rotation, which can amplify one connection-level failure into persistent errors for later borrowers.

The release path now keeps healthy acquisition-only resources, but prevents unrecoverable callback-failed object or native resources from cycling back into the pool. Grouped borrows now clean up all acquired connections even when one release path fails.

Verification

  • composer validate --strict
  • pint --preset psr12 --test using temporary Pint 1.x tooling
  • phpstan analyse -c phpstan.neon using temporary PHPStan 1.x tooling
  • phpunit --configuration phpunit.xml tests/Pools/Adapter/StackTest.php using temporary PHPUnit 11.x tooling
  • phpunit --configuration phpunit.xml tests/Pools/Adapter/SwooleTest.php --filter 'testUse|testGroupUseReclaims|testUseDestroys|testGroupUseRecords|testGroupUseReleases' using temporary PHPUnit 11.x tooling

Note: the full local composer test suite still exits with signal 139 after an existing Swoole race-condition test reports OK under this machine's PHP 8.5/Swoole runtime. The new Swoole lifecycle tests pass cleanly, and the crash reproduces with testSwooleCoroutineRaceCondition alone.

Copilot AI review requested due to automatic review settings June 18, 2026 03:58

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@github-actions

Copy link
Copy Markdown

Thanks for contributing! This repository is a read-only mirror — development for this library happens in packages/pools in the utopia-php monorepo. Please open this pull request there instead.

@github-actions github-actions Bot closed this Jun 18, 2026
@abnegate abnegate reopened this Jun 18, 2026
@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown

Greptile Summary

  • Updates Pool::use() cleanup so failed borrowed resources can be recovered or replaced separately from successful uses.
  • Reworks grouped pool borrows to acquire explicitly, release in reverse order, and record use-duration telemetry.
  • Adds regression coverage for failed recovery, replacement behavior, grouped acquisition cleanup, native resources, and exception preservation.
  • Fixes Composer metadata and CI setup for declared telemetry dependencies.

Confidence Score: 5/5

The lifecycle changes are well-scoped and covered by targeted regression tests across single-resource and grouped borrows.

No code issues were identified in the reviewed changes, and the implementation includes coverage for recovery success, failed recovery, replacement behavior, native resources, exception preservation, grouped cleanup, and telemetry recording.

T-Rex T-Rex Logs

What T-Rex did

  • I ran pool recovery tests and compared before and after states, observing how failed resources were blindly reused before and how replacements occurred after, including preserved head reuse and unrecoverable object replacements.
  • I collected proof artifacts for the group-use tests, capturing exact commands, working directories, exit codes, runtime discovery, and checked-out commits for both before and after sides.
  • I examined the opentelemetry and swoole extension suggestions, noting the before state with limited workflow extensions and the after state with explicit suggest flags and entries.
  • I reviewed the regression-coverage workflow, comparing before and after checkouts and confirming that a contract-mismatch finding was not filed due to tooling and environment blockers rather than PR failure.

View all artifacts

T-Rex Ran code and verified through T-Rex

Reviews (6): Last reviewed commit: "fix: release all grouped resources" | Re-trigger Greptile

Comment thread src/Pools/Pool.php Outdated
@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown

Greptile Summary

  • Changes Pool::use() cleanup so callback failures destroy and replace the borrowed connection instead of returning it to the pool.
  • Preserves the original callback exception when replacement creation also fails.
  • Adds regression coverage for failed-resource replacement and exception preservation.
  • Fixes Composer suggestion metadata, updates CI extension setup, and applies formatting updates.

Confidence Score: 3/5

The pool lifecycle change is well targeted but leaves edge cases around replacement failure handling and grouped acquisitions that should be addressed before merging.

Focused tests and implementation changes cover the intended callback-failure path, but the cleanup path does not handle all throwable replacement failures and grouped pool usage can churn healthy resources when later acquisition fails.

src/Pools/Pool.php and src/Pools/Group.php need attention.

T-Rex T-Rex Logs

What T-Rex did

  • T-Rex attempted to run the PHP runtime and dependency check, but the environment has no PHP binary available and no fallback like composer, docker, or /usr/bin/php* could be used; a focused repro script for the size-1 Pool replacement TypeError path was prepared but could not be executed.
  • T-Rex executed a focused PHP repro script that creates a Group with a healthy pool1 and a failing later pool3, observing pool1-resource-1 created before pool3 failed, followed by pool1-resource-2 created during the bubbled failure, and a subsequent pool1 use returning pool1-resource-2, indicating the earlier resource was replaced.
  • T-Rex analyzed the pool-destroy-replace runs by inspecting before and after logs, confirming that the after run created a second resource (resource-2) and the final state did not reclaim the first resource.
  • T-Rex captured two pool-preserve-exception runs: the before-run completed with a thrown LogicException after 1 init attempt and pool count 1, and the after-head run completed with a thrown LogicException after 2 init attempts and pool count 1.
  • T-Rex compared Composer metadata before and after the change, noting that the before state had composer validate --strict fail with an invalid property and only swoole as workflow extensions, while the after state had a valid composer.json and extensions listed as opentelemetry|protobuf|swoole.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (1)

  1. src/Pools/Pool.php, line 431-440 (link)

    P1 Handle throwable replacements

    This cleanup only catches Exception while createConnection() can also fail with other Throwable types from the user-provided init callback, such as TypeError or Error. In that case connectionsCreated has already been incremented, but it is never decremented; then Pool::use() suppresses the replacement failure to preserve the original callback exception. A size-1 pool can be left with no active or idle connection while connectionsCreated still equals the size, so later borrowers see the pool as permanently empty instead of creating a replacement.

Reviews (2): Last reviewed commit: "fix: install protobuf extension in tests" | Re-trigger Greptile

Comment thread src/Pools/Group.php Outdated
@abnegate abnegate changed the title fix: destroy failed pool connections fix: recover failed pool connections Jun 18, 2026
Comment thread src/Pools/Pool.php Outdated
Comment thread src/Pools/Pool.php
Comment thread src/Pools/Pool.php
Comment thread src/Pools/Group.php Outdated
Comment thread src/Pools/Group.php Outdated
@abnegate

Copy link
Copy Markdown
Member Author

Approach is sound — eviction belongs to the pool (only it can destroy + replace + keep capacity), and delegating the health decision to the resource via reset()/reconnect() is the right split. Two things worth resolving before merge, from debugging the nyc3 outage this is meant to backstop:

1. recover() doesn't heal the bug it targets while the Swoole proxy is in place

For the DB pools, $connection->getResource() is the concrete adapter (new MariaDB($pdoProxy) etc.), which has no reset() and a reconnect() that returns void. So recover() calls reconnect(), it succeeds, and the connection is reclaimed — but the poison is in the wrapped Swoole\Database\PDOProxy's leaked inTransaction counter, and adapter->reconnect()PDOProxy::reconnect() does not clear it. The one hook that would (PDOProxy::reset()) lives on the proxy object the pool can't see. Net: for the exact failure this PR is meant to catch, recover() reports success and recirculates the still-poisoned connection.

This isn't a bug in this PR — it's a contract violation at the resource (its reconnect() lies about being clean). The real fix is removing that proxy layer so reconnect() is honest (utopia-php/database#896). Once that lands, recover() works exactly as designed. Worth a line in the PR/commit noting that the recovery contract assumes reset()/reconnect() leaves a genuinely clean resource — and that this only fully covers the DB case after #896.

2. Every callback failure is treated as a dead connection

Pool::use() sets failed = true on any \Throwable, so a business-logic exception from the callback — DuplicateException (409), Conflict, Authorization, Restricted, Limit — now triggers recover()reconnect(), throwing away a healthy warm connection and paying a reconnect on every ordinary 4xx. Those propagate straight through Adapter::withTransaction, so under normal write/conflict load (unique slugs, upserts) this is real pool churn + reconnect latency on the error path.

The pool is generic and can't see DB exception types, so the boundary is inherently awkward. Options: make reconnect() lazy (no-op if a cheap isHealthy()/ping still passes), let the caller signal whether the failure was connection-level, or consciously accept it as safe-but-wasteful for v1. Either way it should be a deliberate decision rather than the default.

@abnegate

Copy link
Copy Markdown
Member Author

Follow-up (not a request to change this PR) — explicit discard() + flipping the default

Ship this PR's conservative version as-is. But there's a cleaner end state worth lining up, because the "treat every callback throw as a dead connection" default is what forces the churn in point 2 above.

The problem with pool-side classification

The pool is generic over mixed, so it can't tell a connection-class fault (evict) from a business-logic exception (409 duplicate, conflict — connection is perfectly fine). It currently resolves that by assuming any throw = suspect. That's safe but evicts/recovers healthy connections on ordinary 4xx.

The component that can classify is the resource's caller — in the DB stack that's the adapter, which already owns both Connection::hasError() and the benign-exception list (Duplicate/Conflict/Authorization/…). So let it make the call explicitly.

Proposed API: a per-borrow discard signal

Hand the Connection (or a small discard callable) into the use() callback so the caller can mark it:

$pool->use(function ($resource, Connection $connection) {
    try {
        return $resource->query(...);
    } catch (\Throwable $e) {
        if (Connection::hasError($e)) {
            $connection->discard();   // connection-class fault → don't recycle
        }
        throw $e;                     // business error → connection untouched, reclaimed
    }
});

finally: discarded → destroy + replace; otherwise → reclaim. BC, because calling $callback($resource, $connection) against an existing fn($resource) just drops the surplus arg (PHP ignores extra args unless the callable uses func_get_args). Group::use() is the one wrinkle — it spreads N resources, so it'd need a discard handle keyed by name (or to hand back the Connection[]).

This also lets you drop the recover()/reset()-or-destroy machinery: a dirty connection is just destroyed + replaced (for a dead DB connection, reconnect == recreate anyway), a clean one is reclaimed as-is. No middle rehabilitation state.

Why this is a follow-up, not this PR

The default flips from "evict on any throw" to "reclaim unless discarded," and that's only safe once a recirculated stale connection self-heals instead of poisoning the pool. That safety net is utopia-php/database#896 (adapter owns reconnect-on-execute + withTransaction replay; no proxy counter to desync). Sequencing:

  • Now (proxy still wrapped): this PR's conservative default is correct — a recirculated poisoned connection is catastrophic, so don't trust an under-marked "clean."
  • After #896 + cloud/CE rewire off PDOProxy: flip to default-reclaim + explicit discard(). Worst case for a missed classification becomes one transparent reconnect, not a sticky outage, and the 409 churn disappears.

Happy to open this as its own PR against pools once #896 is in and the consumers are rewired — flagging now so the API shape is on your radar while this one settles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants