feat: harden tool orchestration and policy pipeline by batmn-dev · Pull Request #43 · batmn-dev/not-a-wrapper

batmn-dev · 2026-02-23T19:39:16Z

Summary

Harden tool orchestration across chat routing and wrappers with stricter policy/capability handling, deterministic naming/cache behavior, richer metadata shaping, and safer truncation/error paths.
Expand coverage with focused tests for policy, provider/platform behavior, wrapper/error handling, naming, UI metadata, route integration, and description quality.
Include supporting updates for model/tool config plumbing and prior branch commits already in this range (including Exa path support, unified tracing/raw output behavior, and related chat UI reasoning-state polish).

Test plan

Run bun run lint
Run bun run typecheck
Run bun test lib/tools/__tests__
Smoke test chat tool invocation flow in the app UI

Made with Cursor

Summary by cubic

Hardens tool orchestration and policy in chat to make tool use safer, deterministic, and observable. Adds Convex-backed limits, retry-safe wrappers, caching, unified errors/logging, clearer UI metadata, and edge‑case fixes for limits and truncation.

New Features
- Central capability policy with early filtering and per-step gating (tier-aware; platform vs BYOK).
- Convex tool limits: domain caps for extract_content and per-tool budgets; new toolLimitBuckets and APIs.
- Safer execution: timeouts, abort propagation, retry logic using idempotent/read-only hints (MCP trusted allowlist), deterministic naming, and LRU caches.
- Unified errors/logging: consistent ToolExecutionError/ToolPolicyError; logs include requestId, budget mode/denials, retryAfter.
- Separate Exa content extraction path with shared caches for search/extraction.
- UI: humanized tool names, service/cost hints, per-call stream metadata; refined thinking/label styles.
- Platform tools: renamed to pay_purchase/pay_status, request de-dupe, abortable PayClaw client.
- Provider search tools marked open-world to shape policy behavior.
- Hardening: fix cross-tool circuit-breaker bleed when MCP server metadata is missing; normalize per-tool domain-limit codes/messages; preserve truncation metadata and semantic paragraph boundaries.

Migration
- Run Convex deploy to add toolLimitBuckets and toolLimits.
- Update any references to flowglad_pay_buy → pay_purchase.
- Optional envs: MCP_TRUSTED_RETRY_SERVER_ALLOWLIST, ANTHROPIC_TOKEN_EFFICIENT_TOOLS.

Written for commit 3a2e61e. Summary will update on new commits.

Move domain docs (architecture, api, conventions, glossary, etc.) to .agents/project/ and research files to .agents/research/. Remove stale chatgpt-logged-out.png screenshot. Co-authored-by: Cursor <cursoragent@cursor.com>

- Add keyframe animations (spinner-fade, pulse-dot, bounce-dots, typing, wave, blink, text-shimmer) to globals.css
- Update thinking states test page with ArticleWrapper, semantic HTML, mask-based action reveal, and improved composer layout
- Change reasoning label from "Reasoned" to "Thoughts"
- Bump thinking bar text from text-sm to text-base

Co-authored-by: Cursor <cursoragent@cursor.com>

Enable a dedicated Layer 2 content extraction tool path that is independent from search fallback so providers with native search can still use URL content reads. Co-authored-by: Cursor <cursoragent@cursor.com>

Improve tool routing with explicit extraction capability gating, rename purchase/extraction tool ids for consistency, and add stronger timeout/error guidance plus extraction caching for more reliable tool execution. Co-authored-by: Cursor <cursoragent@cursor.com>

Standardize MCP, third-party, and platform tools to return raw outputs while centralizing request-scoped tracing and logging so tool observability is consistent and safer across the full tool pipeline. Co-authored-by: Cursor <cursoragent@cursor.com>

Add Convex-backed tool budget/domain guardrails and unify tool error normalization so policy denials and upstream failures are surfaced consistently across route, wrappers, and tests. Co-authored-by: Cursor <cursoragent@cursor.com>

Strengthen tool routing and validation by introducing capability policy, metadata shaping, naming/cache helpers, and expanded coverage so tool behavior is safer and more deterministic across providers. Co-authored-by: Cursor <cursoragent@cursor.com>

vercel · 2026-02-23T19:39:22Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
not-a-wrapper	Ready	Preview, Comment	Feb 23, 2026 7:52pm

greptile-apps · 2026-02-23T19:43:50Z

Greptile Summary

Hardens tool orchestration with a multi-layered policy pipeline enforcing capability boundaries, budget limits, and retry safety across provider-native, third-party, MCP, and platform tool layers.

Key Improvements:

- Capability policy matrix gates tools by user tier (anonymous/authenticated), key mode (BYOK/platform), risk level (read-only/stateful/destructive/open-world), and step threshold (early/late steps)
- Centralized budget enforcement with sliding-window rate limits stored in Convex toolLimitBuckets, falling back to request-local soft caps when policy service is unavailable
- Per-domain abuse protection for extract_content with separate quota tracking
- Circuit breaker pattern for MCP tools (3 consecutive transient failures disables server for request)
- Automatic retry with exponential backoff for idempotent/read-only tools when MCP servers are explicitly trusted via MCP_TRUSTED_RETRY_SERVER_ALLOWLIST
- LRU caching for search and content extraction (15min TTL, 500 entries) to reduce redundant API calls
- Semantic truncation policy preserving high-value content during result size enforcement
- Deterministic tool naming governance with collision detection across layers
- Request-scoped tracing (requestId) linking tool calls, budget events, and PostHog telemetry
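
To make the LRU caching item concrete (15min TTL, 500 entries), here is a minimal sketch of a size-bounded cache with per-entry expiry. The class name `LruTtlCache`, its constructor parameters, and the injectable clock are assumptions made for illustration; the PR's actual cache.ts may be structured differently.

```typescript
// Hypothetical sketch: a small LRU cache with a per-entry TTL.
// Names and API are illustrative and not taken from lib/tools/cache.ts.
class LruTtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop the entry and report a miss
      return undefined;
    }
    // Re-insert so the key becomes the most recently used
    // (Map iteration follows insertion order).
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

Under these assumptions, the configuration described above would correspond to `new LruTtlCache<string>(500, 15 * 60 * 1000)`.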

Architecture:
The PR introduces 8 new production modules (policy.ts, capability-policy.ts, errors.ts, naming.ts, truncation-policy.ts, ui-metadata.ts, cache.ts, convex/toolLimits.ts) and 9 test suites with 100+ test cases covering policy enforcement, outage tolerance, retry logic, and integration scenarios.

Test Coverage:
Comprehensive unit tests for policy enforcement, capability filtering, naming governance, error normalization, retry backoff, abort signal handling, and truncation strategies. Integration tests verify route-level policy application and Convex limit store behavior.

Confidence Score: 4/5

- This PR is safe to merge after running the test plan, though the complexity warrants careful smoke testing.
- Score reflects significant architectural improvements with excellent test coverage (20+ test files added), but complexity of the policy pipeline, state management across multiple wrappers, and request-scoped circuit breakers requires thorough integration testing. The changes are well-structured with fail-safe defaults and outage tolerance, but the interaction between budget enforcement layers and dynamic tool filtering needs validation in production-like scenarios.
- Pay close attention to app/api/chat/route.ts for state management of exhausted/degraded tools across steps, and lib/tools/mcp-wrapper.ts for circuit breaker behavior under concurrent failures.

Important Files Changed

Filename	Overview
lib/tools/policy.ts	introduces comprehensive tool budget and domain-limit enforcement with sliding-window buckets, outage tolerance, and Convex persistence
lib/tools/mcp-wrapper.ts	refactored to support retry logic, circuit breaker, abort signal propagation, and centralized budget enforcement
lib/tools/capability-policy.ts	implements capability matrix with user tier, key mode, risk level classification, and early/late step filtering
lib/tools/utils.ts	extensive additions for retry logic, abort signal handling, timeout management, and semantic truncation with safety metadata
app/api/chat/route.ts	major refactor integrating capability policy, budget enforcement, request-scoped tracing, and content extraction layer separation
convex/toolLimits.ts	implements sliding-window limit enforcement with actor isolation, multi-scope batching, and retry-after calculation
convex/schema.ts	adds `toolLimitBuckets` table with composite index and enriches `toolCallLog` with policy denial metadata
lib/tools/third-party.ts	adds LRU caching for search/extraction, abort signal support, domain-limit enforcement, and separate content extraction tools
lib/mcp/load-tools.ts	adds MCP annotation hint extraction, retry trust allowlist matching, and collision detection for namespaced tools
lib/config.ts	adds constants for tool budgets, cache TTLs, domain limits, retry allowlist, and freshness windows
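
The sliding-window enforcement and retry-after calculation described for convex/toolLimits.ts can be illustrated with a small in-memory sketch. This is not the Convex implementation: the function name `checkSlidingWindow`, the timestamp-array representation, and the `consume` flag (mirroring the "consume: false" budget probe in the flowchart) are assumptions.

```typescript
// Hypothetical in-memory sliding-window check. The PR persists buckets in
// Convex; this sketch only shows the windowing and retry-after arithmetic.
interface LimitDecision {
  allowed: boolean;
  retryAfterMs: number;
}

function checkSlidingWindow(
  timestamps: number[], // call times in ascending order; mutated in place
  nowMs: number,
  windowMs: number,
  maxCalls: number,
  consume: boolean, // false = probe only, do not record the call
): LimitDecision {
  // Drop calls that have fallen out of the window.
  for (;;) {
    const oldest = timestamps[0];
    if (oldest === undefined || oldest > nowMs - windowMs) break;
    timestamps.shift();
  }
  if (timestamps.length >= maxCalls) {
    const oldest = timestamps[0];
    // The caller may retry once the oldest call leaves the window.
    const retryAfterMs = oldest === undefined ? windowMs : oldest + windowMs - nowMs;
    return { allowed: false, retryAfterMs };
  }
  if (consume) timestamps.push(nowMs);
  return { allowed: true, retryAfterMs: 0 };
}
```

A probe with `consume` set to false answers "would this call be allowed?" without spending budget, which is one way to filter tools in `prepareStep` before execution.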

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Chat Request] --> CapPolicy[Resolve Capability Policy]
    CapPolicy --> KeyMode{Key Mode<br/>Detection}
    KeyMode -->|BYOK| BYOKPath[BYOK Budget Policy]
    KeyMode -->|Platform| PlatformPath[Platform Budget Policy]
    
    BYOKPath --> LoadTools[Load Tool Layers]
    PlatformPath --> LoadTools
    
    LoadTools --> Layer1[Layer 1: Provider Native<br/>e.g. OpenAI search]
    LoadTools --> Layer2Search[Layer 2: Third-Party Search<br/>Exa fallback]
    LoadTools --> Layer2Content[Layer 2: Content Extraction<br/>Exa getContents]
    LoadTools --> Layer3[Layer 3: MCP Tools<br/>User servers]
    LoadTools --> Layer4[Layer 4: Platform Tools<br/>pay_purchase]
    
    Layer1 --> PolicyFilter[Apply Capability Policy Filter]
    Layer2Search --> PolicyFilter
    Layer2Content --> PolicyFilter
    Layer3 --> PolicyFilter
    Layer4 --> PolicyFilter
    
    PolicyFilter --> Naming[Naming Governance<br/>Collision Detection]
    Naming --> Wrap[Wrap Tools]
    
    Wrap --> WrapMCP[MCP: Timeout + Retry + Circuit Breaker]
    Wrap --> WrapThirdParty[Third-Party: Cache + Domain Limits]
    Wrap --> WrapPlatform[Platform: Tracing + Budget]
    
    WrapMCP --> StreamText[streamText with tools]
    WrapThirdParty --> StreamText
    WrapPlatform --> StreamText
    
    StreamText --> PrepareStep{prepareStep}
    PrepareStep -->|Step <= 3| EarlyTools[Early Step Tools<br/>All safe + MCP unknown]
    PrepareStep -->|Step > 3| LateTools[Late Step Tools<br/>Read-only only]
    
    EarlyTools --> BudgetProbe[Budget Probe<br/>consume: false]
    LateTools --> BudgetProbe
    
    BudgetProbe -->|Policy Available| BudgetOK{Budget OK?}
    BudgetProbe -->|Policy Unavailable| Degraded[Degraded Mode<br/>Request-local soft cap]
    
    BudgetOK -->|Yes| AllowTool[Include Tool]
    BudgetOK -->|No| BlockTool[Exclude Tool]
    Degraded --> SoftCap{Soft Cap<br/>Remaining?}
    SoftCap -->|Yes| AllowTool
    SoftCap -->|No| BlockTool
    
    AllowTool --> Execute[Tool Execution]
    Execute --> Retry{Retry Safe?}
    Retry -->|Yes + Transient Error| RetryBackoff[Exponential Backoff<br/>with Jitter]
    Retry -->|No or Non-Transient| Error[Return Error]
    RetryBackoff --> Execute
    
    Execute --> StepFinish[onStepFinish:<br/>Post-accounting]
    StepFinish --> ConvexLog[Convex toolCallLog +<br/>toolLimitBuckets]
    ConvexLog --> NextStep{More Steps?}
    NextStep -->|Yes| PrepareStep
    NextStep -->|No| Finish[onFinish:<br/>PostHog + Response]

Last reviewed commit: 3babf78
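
The "Retry Safe?" branch of the flowchart (retry only when the tool is retry-safe and the error is transient, with exponential backoff and jitter) can be sketched as below. The helper names, default delays, retry count, and the choice of "full jitter" are assumptions for illustration, not the PR's actual values.

```typescript
// Hypothetical helpers: names, defaults, and the "full jitter" strategy are
// assumptions, not the PR's implementation.

/** Delay before retry number `attempt` (0-based), capped at `maxMs`. */
function backoffDelayMs(
  attempt: number,
  baseMs: number = 200,
  maxMs: number = 5000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling); // full jitter: uniform in [0, ceiling)
}

/** Retries `run` only when the tool is retry-safe and the error is transient. */
async function runWithRetry<T>(
  run: () => Promise<T>,
  options: {
    retrySafe: boolean;
    isTransient: (error: unknown) => boolean;
    maxRetries?: number;
  },
): Promise<T> {
  const maxRetries = options.maxRetries ?? 2;
  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (error) {
      const canRetry =
        options.retrySafe && attempt < maxRetries && options.isTransient(error);
      if (!canRetry) throw error; // non-transient or unsafe: surface the error
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Passing the random source as a parameter keeps the delay calculation deterministic under test.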

cubic-dev-ai

5 issues found across 61 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="convex/toolLimits.ts">

<violation number="1" location="convex/toolLimits.ts:131">
P2: The `code` and `message` for the `"domain"` limit branch hardcode `"extract_content"` instead of using the `toolName` parameter. Any other tool passing `limitType: "domain"` will receive a misleading error code and message.</violation>
</file>

<file name="lib/tools/truncation-policy.ts">

<violation number="1" location="lib/tools/truncation-policy.ts:174">
P2: findSemanticBoundary always adds 2 for paragraph breaks, which truncates CRLF "\r\n\r\n" sequences mid-marker and yields inconsistent boundaries on Windows-style line breaks. Adjust the offset to match the marker length.</violation>
</file>
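
One possible shape for the boundary fix suggested above, assuming the function returns the offset just past the last paragraph break: use each marker's own length as the offset instead of a fixed 2. The function name `findParagraphBoundary` and its signature are hypothetical; the real `findSemanticBoundary` may handle more boundary types.

```typescript
// Hypothetical sketch; not the actual findSemanticBoundary implementation.
// Returns the offset just past the last paragraph break within the first
// `maxLength` characters, or -1 if no paragraph break is found.
function findParagraphBoundary(text: string, maxLength: number): number {
  const head = text.slice(0, maxLength);
  const markers = ["\r\n\r\n", "\n\n"];
  let best = -1;
  for (const marker of markers) {
    const index = head.lastIndexOf(marker);
    if (index !== -1) {
      // Advance by the marker's own length so CRLF breaks are not split.
      best = Math.max(best, index + marker.length);
    }
  }
  return best;
}
```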

<file name="lib/tools/mcp-wrapper.ts">

<violation number="1" location="lib/tools/mcp-wrapper.ts:140">
P2: Circuit breaker uses `"unknown"` as a shared key for all tools missing server info, causing cross-tool circuit state pollution. If multiple tools lack a `serverId`, their failure counts and resets incorrectly affect each other. Use a per-tool fallback (e.g., the tool `name` itself) to ensure isolation.</violation>
</file>
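
A minimal sketch of the per-tool fallback key suggested above, assuming a request-scoped breaker that opens after 3 consecutive failures (the threshold mentioned in the Greptile summary). The class name, method names, and key prefixes are hypothetical.

```typescript
// Hypothetical request-scoped circuit breaker; names and shape are assumptions.
const FAILURE_THRESHOLD = 3;

class RequestCircuitBreaker {
  private readonly failures = new Map<string, number>();

  // Fall back to the tool name so tools without server metadata
  // never share breaker state through a common "unknown" key.
  static keyFor(toolName: string, serverId?: string): string {
    return serverId ? `server:${serverId}` : `tool:${toolName}`;
  }

  recordFailure(key: string): void {
    this.failures.set(key, (this.failures.get(key) ?? 0) + 1);
  }

  recordSuccess(key: string): void {
    this.failures.delete(key); // a success resets the consecutive count
  }

  isOpen(key: string): boolean {
    return (this.failures.get(key) ?? 0) >= FAILURE_THRESHOLD;
  }
}
```

Tools that do report the same `serverId` still share one key, so a failing server is disabled as a unit while metadata-less tools stay isolated.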

<file name="lib/tools/utils.ts">

<violation number="1" location="lib/tools/utils.ts:131">
P2: Listener leak in `combineAbortSignals` fallback path: when one signal fires, the `{ once: true }` listeners on all other signals remain attached until those signals individually fire. For long-lived signals (e.g., a session-scoped abort), the closures capturing `controller` will never be cleaned up. Track all handlers and call `removeEventListener` on all of them when any one fires.</violation>

<violation number="2" location="lib/tools/utils.ts:610">
P2: Key collision in `truncateOversizedObject`: user-data keys named `_hint`, `_truncated`, `_originalSizeBytes`, or `_keptKeys` silently overwrite the truncation metadata sentinels. When the budget check then fails and the key is deleted, the sentinel is lost, leaving the model with no `_hint` or potentially a `_truncated: false` value. Filter the known sentinel keys out of the ranked entries before the loop.</violation>
</file>
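
For the listener leak in the first utils.ts issue, the suggested fix (track every handler and remove all of them when any signal fires) might look like the sketch below. It is a simplified stand-in for the fallback path only: the signature is an assumption and abort reasons are not forwarded.

```typescript
// Hypothetical fallback for runtimes without AbortSignal.any; the real
// combineAbortSignals may differ. Abort reasons are omitted for brevity.
function combineAbortSignals(signals: AbortSignal[]): AbortSignal {
  const controller = new AbortController();
  const registered: Array<{ signal: AbortSignal; handler: () => void }> = [];

  // Detach every listener so long-lived signals do not retain closures.
  const cleanup = (): void => {
    for (const { signal, handler } of registered) {
      signal.removeEventListener("abort", handler);
    }
    registered.length = 0;
  };

  for (const signal of signals) {
    if (signal.aborted) {
      cleanup();
      controller.abort();
      return controller.signal;
    }
    const handler = (): void => {
      cleanup();
      controller.abort();
    };
    signal.addEventListener("abort", handler);
    registered.push({ signal, handler });
  }
  return controller.signal;
}
```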


Prevent cross-tool circuit-breaker bleed when MCP server metadata is missing and normalize domain-limit codes/messages per tool for clearer failures. Preserve truncation metadata integrity and semantic paragraph boundary detection to keep truncated payloads reliable. Co-authored-by: Cursor <cursoragent@cursor.com>
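
As an illustration of preserving truncation metadata integrity, the sketch below filters the reserved sentinel keys out of user data before choosing which keys to keep, so the metadata written afterwards cannot be shadowed. It is deliberately simplified: the function name is hypothetical, and it keeps the first `maxKeys` keys rather than ranking entries by size as the reviewed code is described as doing.

```typescript
// Hypothetical simplification of truncateOversizedObject. Only the sentinel
// key names come from the review comment; everything else is an assumption.
const SENTINEL_KEYS = new Set(["_hint", "_truncated", "_originalSizeBytes", "_keptKeys"]);

function truncateWithMetadata(
  data: Record<string, unknown>,
  maxKeys: number,
): Record<string, unknown> {
  // User data must not be able to supply or overwrite the sentinels.
  const candidates = Object.keys(data).filter((key) => !SENTINEL_KEYS.has(key));
  const kept = candidates.slice(0, maxKeys);
  const result: Record<string, unknown> = {};
  for (const key of kept) {
    result[key] = data[key];
  }
  // Metadata is written last, from trusted values only.
  result["_truncated"] = kept.length < candidates.length;
  result["_keptKeys"] = kept;
  return result;
}
```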

batmn-dev and others added 7 commits February 22, 2026 17:22

feat: add Exa content extraction tool path

2ad7198

cubic-dev-ai bot reviewed Feb 23, 2026

vercel bot deployed to Preview February 23, 2026 19:52 View deployment

batmn-dev merged commit 967fa9a into main Feb 23, 2026
6 checks passed

batmn-dev merged 8 commits into main from fine-tune
