diff --git a/.agents/context/ai-sdk-v6.md b/.agents/project/ai-sdk-v6.md similarity index 100% rename from .agents/context/ai-sdk-v6.md rename to .agents/project/ai-sdk-v6.md diff --git a/.agents/context/api.md b/.agents/project/api.md similarity index 100% rename from .agents/context/api.md rename to .agents/project/api.md diff --git a/.agents/context/architecture.md b/.agents/project/architecture.md similarity index 100% rename from .agents/context/architecture.md rename to .agents/project/architecture.md diff --git a/.agents/context/conventions.md b/.agents/project/conventions.md similarity index 100% rename from .agents/context/conventions.md rename to .agents/project/conventions.md diff --git a/.agents/context/database.md b/.agents/project/database.md similarity index 100% rename from .agents/context/database.md rename to .agents/project/database.md diff --git a/.agents/context/deployment.md b/.agents/project/deployment.md similarity index 100% rename from .agents/context/deployment.md rename to .agents/project/deployment.md diff --git a/.agents/context/glossary.md b/.agents/project/glossary.md similarity index 100% rename from .agents/context/glossary.md rename to .agents/project/glossary.md diff --git a/.agents/context/testing.md b/.agents/project/testing.md similarity index 100% rename from .agents/context/testing.md rename to .agents/project/testing.md diff --git a/.agents/context/ai-context-engineering-guide.md b/.agents/research/ai-context-engineering-guide.md similarity index 100% rename from .agents/context/ai-context-engineering-guide.md rename to .agents/research/ai-context-engineering-guide.md diff --git a/.agents/context/research/browser-extraction/component-state-extraction-research.md b/.agents/research/browser-extraction/component-state-extraction-research.md similarity index 100% rename from .agents/context/research/browser-extraction/component-state-extraction-research.md rename to 
.agents/research/browser-extraction/component-state-extraction-research.md diff --git a/.agents/context/research/browser-extraction/css-animation-extraction-research.md b/.agents/research/browser-extraction/css-animation-extraction-research.md similarity index 100% rename from .agents/context/research/browser-extraction/css-animation-extraction-research.md rename to .agents/research/browser-extraction/css-animation-extraction-research.md diff --git a/.agents/research/chatgpt-logged-out.png b/.agents/research/chatgpt-logged-out.png deleted file mode 100644 index 427a92c6..00000000 Binary files a/.agents/research/chatgpt-logged-out.png and /dev/null differ diff --git a/.agents/research/url-content-fetching.md b/.agents/research/url-content-fetching.md new file mode 100644 index 00000000..353917c9 --- /dev/null +++ b/.agents/research/url-content-fetching.md @@ -0,0 +1,663 @@ +# URL Content Fetching Capability for AI Chat + +> **Status**: Research Complete +> **Date**: 2026-02-22 +> **Scope**: Architecture, libraries, security, context budgets, specialized extractors, caching +> **Related**: `.agents/plans/phase-7-future-tool-integrations.md` (Sub-Phase 7.7) + +--- + +## Table of Contents + +1. [Production Landscape Survey](#1-production-landscape-survey) +2. [HTML-to-Text Conversion](#2-html-to-text-conversion) +3. [Security Considerations](#3-security-considerations) +4. [Context Budget Management](#4-context-budget-management) +5. [Architecture Placement](#5-architecture-placement) +6. [Specialized Content Types](#6-specialized-content-types) +7. [Hosted Services and APIs](#7-hosted-services-and-apis) +8. [Caching Strategy](#8-caching-strategy) +9. [Recommendations](#9-recommendations) + +--- + +## 1. 
Production Landscape Survey + +### How Major Platforms Implement URL Fetching + +| Platform | Architecture | JS Rendering | Content Budget | URL Source Restriction | +|----------|-------------|-------------|----------------|----------------------| +| ChatGPT | Server-side, proprietary | Yes (Atlas/Chromium) | Auto-summarization | Model constructs queries | +| Perplexity | Server-side, hybrid RAG | Unknown | `max_tokens_per_page` | Model-driven | +| Claude | Server-side, API tool | No (HTML only) | `max_content_tokens` + dynamic filtering | User-provided URLs only | +| Open WebUI | Server-side, dual-mode | No | 50K char hard cap | Model-driven | +| LibreChat | Server-side, Firecrawl | Yes (via Firecrawl) | Reranker truncation | Search-driven | +| LobeChat | Serverless plugin | No | Plugin-level | Explicit URL input | + +### ChatGPT / OpenAI + +Server-side with proprietary infrastructure. Reasoning models (o3, GPT-5) get two page-level actions beyond search: `open_page` (accesses a webpage) and `find_in_page` (searches within an opened page). The model doesn't read entire pages — it fans out short sub-queries, skims titles and introductions (~500–1,000 chars), and extracts answer blocks under headings. The Atlas browser (Oct 2025) runs a full Chromium-based browser via the OWL architecture. + +### Claude / Anthropic + +Two versions of the `web_fetch` tool: + +| Version | Features | +|---------|----------| +| `web_fetch_20250910` | Basic fetch + PDF extraction | +| `web_fetch_20260209` | Adds dynamic filtering (Opus 4.6, Sonnet 4.6) | + +Dynamic filtering enables Claude to write and execute code that filters fetched content *before* it enters the context window — achieving ~24% input token reduction and ~11% quality improvement. Anti-exfiltration measure: Claude cannot dynamically construct URLs; it can only fetch URLs explicitly provided by the user or from previous search/fetch results. 
+ +Configuration surface: +- `max_content_tokens` — hard cap on content length +- `max_uses` — limits fetches per request +- `allowed_domains` / `blocked_domains` — domain restrictions + +### Perplexity + +Three-stage RAG pipeline: hybrid retrieval → content fetching → grounded generation. Agent API exposes `web_search` (with `max_tokens_per_page`) and `fetch_url` (full page content). Content is fetched on-demand per query and not stored. + +### Open WebUI (Open Source) + +Agentic mode exposes `search_web` and `fetch_url` tools. `fetch_url` retrieves full page text, hard-capped at 50,000 characters, injected directly into context (no Vector DB, no chunking). Requires frontier models (GPT-5, Claude 4.5+) for effective multi-step tool use. No JS rendering. + +### LibreChat (Open Source) + +Three-component pipeline: **Search** (Serper/SearXNG) → **Scrape** (Firecrawl) → **Rerank** (Jina/Cohere). Firecrawl handles JS rendering and markdown conversion. Scraper timeout defaults to 7,500ms. Open enhancement request for direct URL fetching beyond search results. + +### Common Patterns + +1. Content extraction is always **server-side** (never client-side) +2. Trend toward **direct context injection** of filtered content over RAG chunking +3. Token/character limits enforced to prevent context overflow +4. Modern approaches (Claude's dynamic filtering, OpenAI's `find_in_page`) extract **relevant portions** rather than full pages +5. **Markdown** is the preferred output format (token-efficient, preserves structure) + +### Vercel AI SDK + +No built-in URL fetch tool — composable approach. 
Ready-made third-party integrations: + +| Package | Tool | +|---------|------| +| `@tavily/ai-sdk` (v0.4.1) | `tavilyExtract()` — URL content extraction + search | +| `@exalabs/ai-sdk` | `webSearch()` — search + content extraction | +| `@parallel-web/ai-sdk-tools` | `searchTool` + `extractTool` | + +The `@tavily/ai-sdk` `tavilyExtract()` tool is particularly relevant — it extracts clean, structured content from URLs with configurable `format` (markdown/text) and `extractDepth` (basic/advanced). ~4.6K weekly downloads. + +--- + +## 2. HTML-to-Text Conversion + +### The Standard Pipeline + +The dominant pattern for HTML → LLM-ready text: + +``` +Raw HTML → [DOM Parser] → [Content Extraction] → [Markdown Conversion] → Clean Markdown + jsdom Readability.js Turndown +``` + +Achieves **~70–80% token reduction** vs raw HTML. + +### Article Extraction Libraries + +| Library | Version | Weekly Downloads | Bundle (min+gz) | Dependencies | Quality | +|---------|---------|-----------------|-----------------|-------------|---------| +| `@mozilla/readability` | 0.6.0 | ~500K | ~15 KB | 0 | Excellent (articles) | +| `@extractus/article-extractor` | 8.0.20 | ~11.5K | Larger | Multiple | Good (rich metadata) | +| `cheerio` | 1.0.0 | ~8M | ~50 KB | 5+ (parse5) | Flexible (manual selectors) | + +**`@mozilla/readability`** is the clear winner for general-purpose extraction: +- Battle-tested (powers Firefox Reader View on billions of page loads) +- Zero dependencies, small footprint +- Returns `{ title, content (HTML), textContent, excerpt, byline }` +- Used by Jina Reader internally +- Requires a DOM environment (`jsdom` on server side) +- Modifies DOM in-place (clone the document first) +- Optimized for articles; weaker on forums, product pages, search results + +### Markdown Conversion + +| Library | Version | Weekly Downloads | Bundle (min+gz) | Speed | +|---------|---------|-----------------|-----------------|-------| +| `turndown` | 7.2.0 | ~2.37M | 3.96 KB | Baseline | +| 
`node-html-markdown` | 1.3.0 | ~328K | ~8 KB | **1.57x faster** | + +Performance benchmarks (reused instance): + +| Input Size | `node-html-markdown` | `turndown` | +|------------|---------------------|-----------| +| 100 KB | 17 ms | 27 ms | +| 1 MB | 176 ms | 280 ms | + +**`turndown`** has 7x larger ecosystem, plugin system (GFM tables/strikethrough), and is used by Jina Reader in production. **`node-html-markdown`** is consistently faster but has fewer community integrations. + +Recommendation: **`turndown`** for ecosystem maturity. The 1.57x speed difference is negligible for single-page fetches (27ms vs 17ms at 100KB). + +### DOM Parsing (Server-Side) + +| Library | Import Time | HTML Parse | Dependencies | +|---------|------------|------------|-------------| +| `jsdom` | 333 ms | 256 ms | 20+ | +| `happy-dom` | 45 ms | 26 ms | Few | +| `linkedom` | Fast | Fast | Few | + +**`jsdom`** is required by `@mozilla/readability` and has the most complete browser emulation (~14M weekly downloads). `happy-dom` is 7.4x faster but less comprehensive. For this use case, `jsdom` is the correct choice because Readability depends on its DOM fidelity. + +### JavaScript-Rendered Content + +| Content Type | Approach | Cost | +|-------------|----------|------| +| Static HTML (articles, blogs, docs) | `fetch` + `jsdom` + Readability | Minimal | +| SPA / JS-rendered | Playwright/Puppeteer or hosted service (Jina/Firecrawl) | High | +| Known site structures | `fetch` + `cheerio` + custom selectors | Minimal | + +Static HTML covers the vast majority of URLs users share in chat (articles, documentation, blog posts). JS-rendered SPAs are an edge case that can be handled by falling back to Jina Reader or Firecrawl. 
+ +### Token Reduction Benchmarks + +| Content Type | Raw HTML Tokens | After Readability+Turndown | Reduction | +|-------------|----------------|---------------------------|-----------| +| Blog post | ~16,000 | ~3,150 | **80%** | +| E-commerce page | ~40,000 | ~2,000 | **95%** | +| News article | 15–25K | 2–5K | **75–80%** | +| Documentation | 10–30K | 3–8K | **70–75%** | +| Wikipedia | 20–80K | 5–20K | **60–75%** | + +Markdown-formatted content shows **35% better RAG accuracy** vs raw HTML. + +### Emerging Alternatives + +**ReaderLM-v2** (Jina AI, Jan 2025): 1.5B parameter model trained specifically for HTML → Markdown. Handles complex elements (code fences, nested lists, tables, LaTeX) with 512K token context. 15–20% better than GPT-4o on extraction benchmarks. Available via Jina API. Trade-off: requires model inference vs zero-cost heuristic conversion. + +**MinerU-HTML / Dripper** (ICLR 2026): 0.6B parameter model for semantic block classification. Reduces HTML to 22% of original tokens while preserving structure. 81.58% ROUGE-N F1 vs Readability's 64.91%. Requires running a small model — heavier infrastructure. + +Neither is practical for the MVP, but both indicate the direction the field is heading. + +--- + +## 3. Security Considerations + +### SSRF (Server-Side Request Forgery) + +The primary risk of server-side URL fetching. Must block: + +- **Private IPs**: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16` +- **Loopback**: `127.0.0.0/8`, `::1` +- **Link-local**: `169.254.0.0/16`, `fe80::/10` +- **Cloud metadata endpoints**: AWS `169.254.169.254`, GCP `metadata.google.internal`, Azure `169.254.169.254` +- **Alternative IP representations**: Octal (`0177.0.0.1`), hex (`0x7f000001`), IPv6-mapped IPv4 (`::ffff:127.0.0.1`), decimal integer (`2130706433`) +- **URL schemes**: Only `http:` and `https:`. Block `file:`, `ftp:`, `data:`, `javascript:`, etc. 
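A minimal gate for the scheme and IPv4 rules above (an illustrative sketch only: it skips IPv6, DNS resolution, and the alternative encodings listed, which is exactly why a vetted library is preferable in production):

```typescript
// Reject anything that is not plain http/https.
function isAllowedScheme(rawUrl: string): boolean {
  try {
    const { protocol } = new URL(rawUrl);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // unparseable URLs are rejected outright
  }
}

// Reject private, loopback, link-local, and metadata IPv4 ranges.
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return true; // not a clean dotted-quad: fail closed
  }
  const [a, b] = parts;
  return (
    a === 10 ||                          // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    a === 127 ||                         // 127.0.0.0/8 loopback
    (a === 169 && b === 254) ||          // 169.254.0.0/16 incl. metadata
    a === 0
  );
}
```

Note that failing closed on anything that is not a clean dotted-quad also rejects the octal/hex/decimal encodings, forcing them through DNS resolution where the resolved address can be checked instead.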
+ +### DNS Rebinding Prevention + +Attacker's DNS initially resolves to a public IP (passes validation), then TTL expires and resolves to `127.0.0.1` (actual request hits internal network). Mitigation requires resolving DNS *and pinning the resolved IP* for the actual request — no gap between validation and connection (TOCTOU). + +### Node.js SSRF Protection Libraries + +| Library | Weekly Downloads | DNS Rebinding | Cloud Metadata | TypeScript | +|---------|-----------------|---------------|----------------|-----------| +| `ssrf-agent-guard` (v1.1, Jan 2026) | New | Yes | AWS/GCP/Azure/Oracle/DO/K8s | Yes | +| `request-filtering-agent` | ~101K | No | Partial | No | +| `ssrf-req-filter` | ~45K | Open issue | No | No | + +**`ssrf-agent-guard`** is the most feature-complete pure-TypeScript option (MIT, Jan 2026): +- Blocks private/reserved IPs + cloud metadata endpoints +- DNS rebinding detection +- Policy-based domain filtering (allowlists, denylists, TLD blocking) +- Multiple modes (block/report/allow) +- Works with axios, node-fetch, native fetch via http.Agent wrapping +- Only 6 releases, 2 contributors — newer library, less battle-tested + +**`request-filtering-agent`** has the widest adoption (~101K weekly downloads) but lacks DNS rebinding protection and cloud metadata blocking. + +For defense-in-depth, layer: URL normalization (WHATWG URL API) + protocol restriction + DNS resolution with IP classification + redirect validation at each hop. 
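The validate/connect gap closes when the validated address itself is handed to the socket. A simplified sketch using Node's `lookup` request option (illustrative, not production-grade: IPv4 only, no redirect handling, and the `isBlocked` predicate is assumed to exist):

```typescript
import { promises as dns } from 'node:dns';
import https from 'node:https';
import type { IncomingMessage } from 'node:http';

// Resolve once, validate, then pin that exact address for the connection,
// so a second (rebinding) DNS answer can never be used.
async function pinnedGet(
  rawUrl: string,
  isBlocked: (ip: string) => boolean,
): Promise<IncomingMessage> {
  const { hostname } = new URL(rawUrl);
  const { address } = await dns.lookup(hostname, { family: 4 });
  if (isBlocked(address)) {
    throw new Error(`Blocked address for ${hostname}: ${address}`);
  }
  return new Promise((resolve, reject) => {
    const req = https.get(
      rawUrl,
      // Pin: the socket receives the already-validated IP, not a fresh lookup
      { lookup: (_host, _opts, cb) => cb(null, address, 4) },
      resolve,
    );
    req.on('error', reject);
  });
}
```

The same pattern must be re-applied at every redirect hop, since each `Location` header is a fresh hostname.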
+ +### Response Handling + +| Control | Recommended Value | Rationale | +|---------|-------------------|-----------| +| Content-type allowlist | `text/html`, `text/plain`, `application/json`, `application/xml`, `application/pdf` | Reject binary, media, executables | +| Response size limit | 5 MB raw | Generous for HTML; content will be compressed to markdown | +| Redirect hops | 3–5 maximum | Validate each destination against SSRF rules | +| Connection timeout | 5–10 seconds | Prevent hanging connections | +| Total timeout | 15 seconds | Match existing `TOOL_EXECUTION_TIMEOUT_MS` | +| Streaming cutoff | AbortController at size limit | Don't buffer the entire response before checking | + +### Rate Limiting + +| Dimension | Recommended Limit | Rationale | +|-----------|-------------------|-----------| +| Per-user per hour | 30 fetches | Prevents sustained abuse | +| Per-conversation turn | 5 fetches | Matches existing step limits | +| Per-domain per minute | 3 requests | Prevents hammering a single site | +| Anonymous users | 10 fetches per day | Daily message limit is the primary control | + +### Legal/Ethical + +- **User-initiated fetches** (user shares a URL) are analogous to a browser acting on behalf of the user — distinct from autonomous crawling. ChatGPT, Perplexity, and Claude all fetch user-provided URLs without robots.txt checks. +- **Robots.txt**: Voluntary protocol (RFC 9309). For user-initiated fetches, treat like a user agent. Not legally binding. +- **Mitigation**: Rate limit aggressively, don't cache/redistribute content long-term, attribute sources in responses. +- Set a descriptive `User-Agent` header (e.g., `NotAWrapper/1.0 (User-initiated content fetch)`). + +--- + +## 4. 
Context Budget Management + +### Token Reduction from Extraction + +The extraction pipeline (Readability + Turndown) provides massive token savings: + +| Page Type | Raw HTML Tokens | After Extraction | Reduction | +|-----------|----------------|-----------------|-----------| +| Blog post | ~16,000 | ~3,150 | 80% | +| E-commerce | ~40,000 | ~2,000 | 95% | +| News article | 15–25K | 2–5K | 75–80% | + +### Token Counting + +| Approach | Speed | Accuracy | Portability | +|----------|-------|----------|-------------| +| `characters / 4` heuristic | Instant | ±15% | All models | +| `js-tiktoken` (exact BPE) | 1,494–31,334 ops/sec | Exact for OpenAI | OpenAI only | +| `@dqbd/tiktoken` (WASM) | 1,992 ops/sec | Exact for OpenAI | OpenAI only | + +**Recommendation**: Use `content.length / 4` heuristic for budget gating at fetch time. Different providers tokenize differently — the heuristic is more portable than exact counting with one tokenizer. Reserve exact counting for when approaching hard limits. + +English prose ratios: ~176 tokens per 1,000 characters (GPT-4o), ~185 tokens per 1,000 characters (GPT-4/cl100k). + +### Budget Framework + +For a 128K context window model: + +``` +┌──────────────────────────────────────────────────┐ +│ USABLE CONTEXT: ~100K tokens │ +│ (128K window - 28K safety margin) │ +├──────────────────────────────────────────────────┤ +│ │ +│ Fixed Costs: 7–15K │ +│ ├── System prompt 2–5K │ +│ ├── Tool definitions 2–5K │ +│ └── Static instructions 3–5K │ +│ │ +│ Variable Costs: 50–70K │ +│ ├── Conversation history 10–30K │ +│ ├── Fetched web content 20–40K │ +│ └── Tool results (non-web) 5–10K │ +│ │ +│ Reserved: 8–15K │ +│ ├── Response generation 4–8K │ +│ └── Reasoning overhead 4–8K │ +│ │ +└──────────────────────────────────────────────────┘ +``` + +**Allocation rule**: Fetched content budget = `min(user_limit, model_context_window * 0.25)`. 
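The heuristic and the allocation rule above reduce to a few lines (a sketch; names are illustrative):

```typescript
const CHARS_PER_TOKEN = 4; // portable heuristic, roughly ±15%

function estimateTokens(content: string): number {
  return Math.ceil(content.length / CHARS_PER_TOKEN);
}

// Fetched-content budget = min(user_limit, model_context_window * 0.25)
function contentBudget(userLimitTokens: number, contextWindowTokens: number): number {
  return Math.min(userLimitTokens, Math.floor(contextWindowTokens * 0.25));
}

function withinBudget(
  content: string,
  userLimitTokens: number,
  contextWindowTokens: number,
): boolean {
  return estimateTokens(content) <= contentBudget(userLimitTokens, contextWindowTokens);
}
```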
+ +20–25% of the model's practical context limit is the sweet spot — enough to be useful, conservative enough to leave room for conversation history and response generation. + +### Model-Aware Defaults + +| Model Family | Context Window | Practical Limit (~65%) | Content Budget (25%) | +|-------------|---------------|----------------------|---------------------| +| GPT-4o / GPT-5 | 128K | ~83K | 20K | +| Claude Sonnet/Opus | 200K | ~130K | 30K | +| Gemini 1.5 Pro | 1M | ~650K | 100K | +| Small models (32K) | 32K | ~21K | 5K | + +Models claiming large context windows become unreliable well before the advertised limit. Performance degrades at ~65% of advertised capacity with "sudden performance drops rather than gradual degradation." + +### Smart Truncation + +Truncation priority order: +1. Section/heading boundaries (best) +2. Paragraph boundaries (good) +3. Sentence boundaries (acceptable) +4. Word boundaries (minimum viable) + +Never truncate at arbitrary character offsets — boundary-aware truncation preserves significantly more useful information at the same token count. + +### Monitoring Thresholds + +| Context Occupancy | Action | +|-------------------|--------| +| < 70% | Normal operation | +| 70% | Soft cap — trigger history summarization | +| 85–90% | Hard cap — refuse new tool calls or drop low-value chunks | +| 95%+ | Emergency compression | + +--- + +## 5. Architecture Placement + +### Options Analysis + +| Approach | Description | Pros | Cons | +|----------|-------------|------|------| +| **A. Layer 2 standalone tool** | Add `content_extract` alongside `web_search` in `lib/tools/third-party.ts` | Minimal change, follows existing patterns, model decides when to use it | New tool name in all providers | +| **B. Search enhancement** | Automatically fetch full content for top search results | Better UX for "search and read" | Wastes tokens/money on results the model doesn't need | +| **C. 
Provider-native** | Use each provider's own fetch tool where available | Highest quality (Claude's dynamic filtering) | Only Anthropic has this; inconsistent across providers | +| **D. MCP server** | Optional MCP server users install | Zero default footprint | Requires opt-in configuration, not discoverable | +| **E. Separate "browsing" mode** | Toggle between search and browse modes | Clear UX intent | Complicates the interface, ChatGPT-style complexity | + +### Recommendation: A (Layer 2 Standalone) with Exa as MVP Backend + +**Rationale**: The existing Phase 7.7 plan in `.agents/plans/phase-7-future-tool-integrations.md` already describes this approach with Exa's `getContents()`. It requires: +- Zero new dependencies (Exa SDK already installed) +- Minimal code change (~30 lines in `lib/tools/third-party.ts`) +- Same API key as search (unified BYOK billing) +- $1/1K pages (cheaper than search at $5/1K) + +**Enhancement path** (post-MVP): +1. **MVP**: Exa `getContents()` — zero new deps, immediate value +2. **V2**: Self-hosted pipeline (`fetch` + `@mozilla/readability` + `turndown`) — zero per-request cost, better for high-volume +3. **V3**: Specialized extractors (YouTube transcripts, GitHub, PDFs) — highest quality per content type +4. **V4**: Jina Reader fallback for JS-rendered pages — covers the SPA edge case + +### Access Control + +| User Type | Recommendation | Rationale | +|-----------|----------------|-----------| +| Authenticated | Full access, 30 fetches/hour | Primary user base | +| Anonymous | Allowed, 10 fetches/day | At $1/1K pages, worst case ~$0.01/day per anonymous user. Daily message limit is the real control. | +| BYOK | Full access, their own API costs | Same Exa key handles search and extraction | + +### Tool Decision: Model-Driven (`toolChoice: "auto"`) + +The model should decide when to use `content_extract` based on user intent. 
No proactive URL extraction — URLs appear in code snippets, reference links, and other contexts where fetching would be wrong. Claude, GPT-5, and Gemini all demonstrate good judgment about when URL content is needed vs. when the URL is just a reference. + +--- + +## 6. Specialized Content Types + +### Value vs Complexity Assessment + +| Extractor | Value | Complexity | Verdict | Priority | +|-----------|-------|------------|---------|----------| +| YouTube transcripts | Very High — unique content AI can't get otherwise | Low (one npm package) | **Must have** | V3 | +| PDF | High — common link type, needs specialized parsing | Low (`unpdf`, one function) | **Must have** | V3 | +| GitHub | High — structured data (issues, code, README) | Low-Medium (Octokit, REST API) | **Must have** | V3 | +| Wikipedia | Medium — cleaner than generic scraping | Very Low (REST API, no auth) | **Worth it** | V3 | +| Twitter/X | Medium — tweets are short | Low (oEmbed, free) but unreliable for threads | **Worth it** | V4 | + +### YouTube + +Neither ChatGPT nor Claude can reliably extract YouTube transcripts today. This is a real differentiation opportunity. + +| Library | Version | Weekly Downloads | Notes | +|---------|---------|-----------------|-------| +| `youtube-transcript` | 1.2.1 | 135.6K | Most popular, zero deps, MIT | +| `youtube-transcript-plus` | 1.2.0 | Growing | Fork with proxy/custom-fetch support (Feb 2026) | + +```typescript +import { YoutubeTranscript } from 'youtube-transcript'; +const transcript = await YoutubeTranscript.fetchTranscript('dQw4w9WgXcQ'); +// Returns: [{ text: string, duration: number, offset: number, lang?: string }] +``` + +Uses unofficial YouTube endpoints (timedtext API). Technically violates YouTube ToS. Widely tolerated at low volume with rate limiting and caching. The official YouTube Data API v3 cannot download transcript text. 
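Raw segments are short fragments with offsets; a small formatter can merge them into timestamped blocks before injection, which costs far fewer tokens than one timestamp per fragment. A sketch (hypothetical helper; the unit of `offset` has varied across versions of these libraries, and this assumes seconds):

```typescript
type Segment = { text: string; duration: number; offset: number };

function stamp(seconds: number): string {
  const m = Math.floor(seconds / 60);
  const s = Math.floor(seconds % 60);
  return `[${String(m).padStart(2, '0')}:${String(s).padStart(2, '0')}]`;
}

// Merge transcript fragments into ~30-second blocks with [mm:ss] markers.
function transcriptToMarkdown(segments: Segment[], blockSeconds = 30): string {
  const blocks: string[] = [];
  let current: string[] = [];
  let blockStart = 0;
  for (const seg of segments) {
    if (current.length > 0 && seg.offset - blockStart >= blockSeconds) {
      blocks.push(stamp(blockStart) + ' ' + current.join(' '));
      current = [];
    }
    if (current.length === 0) blockStart = seg.offset;
    current.push(seg.text.trim());
  }
  if (current.length > 0) blocks.push(stamp(blockStart) + ' ' + current.join(' '));
  return blocks.join('\n\n');
}
```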
+ +### PDF + +| Library | Weekly Downloads | Serverless | Notes | +|---------|-----------------|------------|-------| +| `unpdf` | 266.8K | Yes | Modern, zero deps, recommended | +| `pdf-parse` | ~2.2M | No | Most downloaded but unmaintained | +| `pdfjs-dist` | High | No | Low-level, maximum control | + +**`unpdf`** is the clear winner — zero dependencies, works in Node.js/Bun/Deno/Cloudflare Workers, bundles PDF.js v5.4: + +```typescript +import { extractText, getDocumentProxy } from 'unpdf'; +const buffer = await fetch(pdfUrl).then(r => r.arrayBuffer()); +const pdf = await getDocumentProxy(new Uint8Array(buffer)); +const { totalPages, text } = await extractText(pdf, { mergePages: true }); +``` + +Quality: excellent for text, poor for tables (loses structure), no OCR for scanned PDFs. A 100-page PDF ≈ 30–50K tokens. + +### GitHub + +GitHub REST API with `@octokit/rest` (~3.5M weekly downloads). Rate limits: 60 req/hour unauthenticated, 5,000 req/hour authenticated. Standard pattern for "summarize this repo": fetch README + repo metadata + directory tree + package.json. + +URL detection covers: +- `github.com/{owner}/{repo}` → README + metadata +- `github.com/{owner}/{repo}/issues/{number}` → issue body + comments +- `github.com/{owner}/{repo}/pull/{number}` → PR description + diff stats +- `github.com/{owner}/{repo}/blob/{branch}/{path}` → file content + +### Wikipedia + +REST API at `en.wikipedia.org/api/rest_v1/` provides clean endpoints. No auth required. `wikipedia` npm package (v2.1.2) provides `page.summary()` and `page.content()`. `wtf_wikipedia` (v10.4.1, 6.7K weekly downloads) parses wikitext into structured sections. 
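The summary endpoint needs no dependencies at all. A sketch (the URL shape follows the REST v1 summary route; error handling is minimal):

```typescript
function summaryUrl(title: string): string {
  return `https://en.wikipedia.org/api/rest_v1/page/summary/${encodeURIComponent(title)}`;
}

// Returns the article's lead-section plain-text extract.
async function wikipediaSummary(title: string): Promise<{ title: string; extract: string }> {
  const res = await fetch(summaryUrl(title), { headers: { accept: 'application/json' } });
  if (!res.ok) throw new Error(`Wikipedia REST error: ${res.status}`);
  const data = (await res.json()) as { title: string; extract: string };
  return { title: data.title, extract: data.extract };
}
```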
+ +### Architecture: URL Router Pattern + +``` +User shares URL → URL Detector (regex) → Specialized Extractor or Generic Fallback + ↓ + Normalized Output: { type, title, content (markdown), metadata, sourceUrl, tokenEstimate } +``` + +All extractors return the same `ExtractedContent` shape, keeping downstream LLM prompt logic consistent regardless of source type. The generic fallback (Exa `getContents` for MVP, Readability+Turndown for V2) handles the long tail. + +--- + +## 7. Hosted Services and APIs + +### Comparison Matrix + +| Service | Best For | JS Render | Pricing | Self-Host | Latency | +|---------|----------|-----------|---------|-----------|---------| +| **Exa `getContents`** | Search + extraction pipeline | Yes (internal) | $1/1K pages | No | Fast (cached) | +| **Jina Reader** (`r.jina.ai`) | Simple URL → Markdown | Yes (headless Chrome) | Free 1M tokens, then token-based | Yes (Apache 2.0) | Fast | +| **Tavily Extract** (`@tavily/ai-sdk`) | AI SDK integration | Unknown | Credit-based | No | Medium | +| **Firecrawl** | Structured extraction + crawling | Yes (Chromium) | Free 500 credits, $16–$599/mo | Yes (AGPL, Docker) | Medium | +| **Browserless** | Full browser automation | Yes | Free 1K units, $200–$500/mo | Yes | Variable | + +### Exa `getContents()` — MVP Backend + +Already integrated via `exa-js`. `getContents()` fetches and extracts content from specific URLs: + +```typescript +const results = await exa.getContents(urls, { + text: { maxCharacters: 10000 }, +}); +``` + +- **$1/1K pages** (vs $5/1K search requests) +- Returns from Exa's cache (instant), falls back to live crawl +- Content extraction options: text (markdown), highlights (AI excerpts), summary (LLM-generated) +- Same API key as search — unified BYOK +- No new dependency + +### Jina Reader — JS-Rendered Fallback + +Prefix any URL with `https://r.jina.ai/` → returns clean Markdown. Uses Readability + Turndown internally, with Puppeteer for JS rendering. 
Processes 100 billion tokens daily. Open source (Apache 2.0).

- Free tier: ~1M tokens (IP-based rate limits)
- Supports CSS selectors, image captions, PDF reading
- ReaderLM-v2 option for higher quality (3x token cost)
- No npm package needed — simple HTTP API

### Tavily Extract — AI SDK Native

The `@tavily/ai-sdk` (v0.4.1) provides a `tavilyExtract()` tool that plugs directly into Vercel AI SDK's `tools` parameter:

```typescript
import { tavilyExtract } from "@tavily/ai-sdk";
tools: { extract: tavilyExtract({ format: "markdown" }) }
```

Cleanest integration for AI SDK, but adds a new dependency and API key. ~4.6K weekly downloads.

### Self-Hosted Pipeline — Long-Term

For zero per-request cost:

```typescript
import { JSDOM } from 'jsdom';
import { Readability } from '@mozilla/readability';
import TurndownService from 'turndown';

async function fetchAndExtract(url: string): Promise<string> {
  const response = await fetch(url, { signal: AbortSignal.timeout(15000) });
  const html = await response.text();
  const dom = new JSDOM(html, { url });
  // Readability mutates the DOM in place; clone first if the document is reused
  const article = new Readability(dom.window.document).parse();
  if (!article) return '';
  return new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' })
    .turndown(article.content);
}
```

Dependencies: `jsdom` (~14M/wk), `@mozilla/readability` (~500K/wk), `turndown` (~2.37M/wk). Total: ~3 new dependencies. Full pipeline: ~300–500ms per page.

---

## 8. Caching Strategy

### Should Fetched Content Be Cached?

**Yes, with a short TTL and a shared, URL-keyed cache.**

Web page content changes. Caching too aggressively returns stale data; not caching at all wastes API credits and increases latency for repeated URLs (common in multi-turn conversations about the same page).
### Recommended Approach

| Dimension | Recommendation | Rationale |
|-----------|----------------|-----------|
| **TTL** | 15–30 minutes | Long enough for multi-turn conversations about the same URL; short enough that content stays fresh |
| **Scope** | Shared cache (same URL = same content) | No privacy concern — content is publicly accessible by URL. Shared cache maximizes hit rate. |
| **Storage** | In-memory (Map or LRU cache) for MVP | No infrastructure dependency. Process-level cache is fine for single-server deployments. |
| **Key** | Normalized URL (strip tracking params, normalize case) | Prevent duplicate fetches for equivalent URLs |
| **Eviction** | LRU with 500-entry cap | Prevents unbounded memory growth |
| **Privacy** | Don't log URLs fetched | URLs can reveal user interests and browsing patterns |

### Implementation Sketch

```typescript
type CacheEntry = { content: string; fetchedAt: number };

const cache = new Map<string, CacheEntry>();
const CACHE_TTL_MS = 15 * 60 * 1000; // 15 minutes
const MAX_ENTRIES = 500; // enforced by LRU eviction on insert (not shown)

// Strip fragments and tracking params so equivalent URLs share one entry
function normalizeUrl(url: string): string {
  const u = new URL(url);
  u.hash = '';
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith('utm_')) u.searchParams.delete(key);
  }
  return u.toString();
}

function getCached(url: string): string | null {
  const entry = cache.get(normalizeUrl(url));
  if (!entry) return null;
  if (Date.now() - entry.fetchedAt > CACHE_TTL_MS) {
    cache.delete(normalizeUrl(url));
    return null;
  }
  return entry.content;
}
```

### What Not To Cache

- URLs with authentication tokens or session-specific content
- Content fetched via POST or non-idempotent requests
- Error responses (cache misses, not failures)

### Scaling Beyond In-Memory

If the application scales to multiple server instances, migrate to Redis or a Convex-backed cache. Redis LangCache provides semantic caching (matches semantically similar queries to cached results) — relevant for search results but overkill for URL content caching where the key is the exact URL.

---

## 9. 
Recommendations + +### Progressive Enhancement Path + +| Phase | What | Dependencies | Per-Request Cost | Coverage | +|-------|------|-------------|-----------------|---------| +| **MVP** | Exa `getContents()` tool in Layer 2 | None (Exa already installed) | $1/1K pages | Static HTML, cached pages | +| **V2** | Self-hosted Readability + Turndown pipeline | `jsdom`, `@mozilla/readability`, `turndown` | $0 | Static HTML (no JS rendering) | +| **V3** | Specialized extractors (YouTube, PDF, GitHub, Wikipedia) | `youtube-transcript`, `unpdf`, `@octokit/rest`, `wikipedia` | $0 (mostly) | Structured content from known platforms | +| **V4** | Jina Reader fallback for JS-rendered pages | None (HTTP API) | Token-based | SPAs, JS-rendered content | +| **V5** | SSRF hardening + production rate limiting | `ssrf-agent-guard` or `request-filtering-agent` | $0 | Security | + +### MVP Specification (Phase 7.7 Alignment) + +The MVP aligns with the existing plan in `.agents/plans/phase-7-future-tool-integrations.md` Sub-Phase 7.7: + +**Tool name**: `content_extract` +**Location**: `lib/tools/third-party.ts` (alongside `web_search`) +**Backend**: Exa `getContents()` +**Input**: `{ urls: z.array(z.string().url()).min(1).max(5) }` +**Output**: `{ ok, data: [{ url, title, content }], error, meta }` +**Content limit**: 10,000 characters per URL (≈2,500 tokens) +**Cost**: $1/1K pages (`estimatedCostPer1k: 1`) +**Timeout**: 15s (existing `TOOL_EXECUTION_TIMEOUT_MS`) +**Access**: All users (authenticated and anonymous) +**Tool metadata**: `{ displayName: "Read Page", source: "third-party", serviceName: "Exa", readOnly: true }` + +### Key Design Decisions + +| Decision | Choice | Rationale | +|----------|--------|-----------| +| Model triggers fetch | `toolChoice: "auto"` | Model judges intent better than URL regex | +| Content format | Markdown | Token-efficient, preserves structure, industry standard | +| Content budget default | 10K chars (MVP), model-aware in V2 | Conservative start; can 
increase based on usage data | +| Caching | In-memory LRU, 15-min TTL | Simple, effective for multi-turn conversations | +| SSRF protection | `ssrf-agent-guard` (V5) | Most complete TypeScript library; not needed for MVP since Exa handles fetching | +| Specialized extractors | V3 (after generic works) | YouTube transcripts are the highest-value differentiator | + +### Risks and Mitigations + +| Risk | Impact | Mitigation | +|------|--------|------------| +| Exa extraction quality varies by page | Medium — some pages return poor content | V2 self-hosted pipeline as fallback; Jina Reader for JS-rendered pages | +| Exa service outage | High — tool becomes unavailable | Timeout + graceful error messaging; V2 self-hosted pipeline as backup | +| SSRF vulnerability (V2 self-hosted) | Critical — server-side URL fetching | Defer self-hosted fetching to V5 with proper SSRF protection | +| Context budget overflow | Medium — degraded model performance | `chars / 4` budget gating; smart truncation at paragraph boundaries | +| YouTube scraping breaks | Medium — unofficial API can change | Cache transcripts; degrade gracefully to generic page scraping | +| Cost at scale | Low — $1/1K pages is cheap | BYOK passes cost to user; platform key has existing billing controls | + +### Dependencies Summary + +| Phase | New Dependencies | Bundle Impact | +|-------|-----------------|---------------| +| MVP | None | None | +| V2 | `jsdom`, `@mozilla/readability`, `turndown` | ~2 MB (server-only, no client bundle impact) | +| V3 | `youtube-transcript`, `unpdf`, `wikipedia` | ~200 KB (server-only) | +| V4 | None (HTTP API to Jina) | None | +| V5 | `ssrf-agent-guard` | ~50 KB (server-only) | + +--- + +## Appendix A: Tavily as Alternative to Exa + +The `@tavily/ai-sdk` (v0.4.1) provides `tavilyExtract()` that plugs directly into AI SDK's tool system. It handles URL extraction with configurable format (markdown/text) and depth (basic/advanced). Extract API supports up to 20 URLs simultaneously. 
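The content-budget and truncation decisions recorded in the tables above (10K chars per URL, `chars / 4` token gating, truncation at paragraph boundaries) can be sketched as a small helper. This is a hedged illustration of the design decision, not project code — the 80% cut-point heuristic and the truncation marker text are assumptions, while the 10,000-char default and chars/4 estimate come from the MVP spec:

```typescript
// Sketch of the MVP content budget: clamp extracted page text to a
// character limit, preferring to cut at a paragraph break near the limit.
// The 10_000-char default and chars/4 token estimate come from the spec
// tables; the 80% cut-point and "[content truncated]" marker are assumptions.
function truncateContent(content: string, maxChars = 10_000): string {
  if (content.length <= maxChars) return content
  const slice = content.slice(0, maxChars)
  const lastBreak = slice.lastIndexOf("\n\n")
  // Only honor a paragraph break if it falls in the final 20% of the budget;
  // otherwise a hard cut loses less content than rewinding to the break.
  const truncated = lastBreak > maxChars * 0.8 ? slice.slice(0, lastBreak) : slice
  return `${truncated}\n\n[content truncated]`
}

// Rough token estimate used for budget gating (≈4 characters per token).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)
```

With the 10K default this keeps each URL's contribution near 2,500 tokens, matching the per-URL limit in the MVP specification.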
+ +**Comparison with Exa `getContents()`**: + +| Dimension | Exa | Tavily | +|-----------|-----|--------| +| Already integrated | Yes (`exa-js`) | No (new dependency) | +| AI SDK plugin | Via `@exalabs/ai-sdk` (not used due to BYOK) | Via `@tavily/ai-sdk` | +| Pricing | $1/1K pages | Credit-based (varies) | +| BYOK support | Yes (explicit key constructor) | Yes (API key param) | +| Batch size | Unlimited | 20 URLs | +| Output format | Text, highlights, summary | Markdown, text | + +**Verdict**: Exa is the correct MVP choice — already integrated, zero new dependencies, cheaper, proven in the codebase. + +## Appendix B: Full Library Reference + +| Library | npm Package | Version | Weekly Downloads | License | +|---------|-------------|---------|-----------------|---------| +| Readability | `@mozilla/readability` | 0.6.0 | ~500K | Apache 2.0 | +| Turndown | `turndown` | 7.2.0 | ~2.37M | MIT | +| jsdom | `jsdom` | latest | ~14M | MIT | +| node-html-markdown | `node-html-markdown` | 1.3.0 | ~328K | MIT | +| unpdf | `unpdf` | latest | 266.8K | MIT | +| youtube-transcript | `youtube-transcript` | 1.2.1 | 135.6K | MIT | +| youtube-transcript-plus | `youtube-transcript-plus` | 1.2.0 | Growing | MIT | +| wikipedia | `wikipedia` | 2.1.2 | Moderate | MIT | +| wtf_wikipedia | `wtf_wikipedia` | 10.4.1 | 6.7K | MIT | +| ssrf-agent-guard | `ssrf-agent-guard` | 1.1 | New | MIT | +| request-filtering-agent | `request-filtering-agent` | latest | ~101K | MIT | +| Tavily AI SDK | `@tavily/ai-sdk` | 0.4.1 | ~4.6K | MIT | +| Exa SDK | `exa-js` | latest | Moderate | MIT | diff --git a/.agents/workflows/correctness-decision-workflow.md b/.agents/workflows/correctness-decision-workflow.md new file mode 100644 index 00000000..17535ebb --- /dev/null +++ b/.agents/workflows/correctness-decision-workflow.md @@ -0,0 +1,64 @@ +# Workflow: Correctness-First Decision + +Use this workflow for medium/high-risk tasks to prioritize robust, industry-standard solutions over quick fixes. 
+ +## Goal + +- Preserve high implementation quality with explicit design decisions. +- Keep always-on context minimal by loading this workflow only when needed. + +## Step 1: Risk Triage + +Classify the task before coding: + +- **Low risk:** localized refactor, copy changes, non-behavioral cleanup. +- **Medium risk:** behavior changes in one subsystem, moderate user impact. +- **High risk:** auth, schema/data model, API contracts, persistence, concurrency, migrations, billing/payments, security-critical paths. + +If the task is medium/high risk, continue with this workflow. + +## Step 2: Evaluate Approaches (Short ADR) + +Document a brief decision note: + +1. Problem and constraints. +2. Candidate approaches (2-3 options). +3. Chosen approach and why it is safer/clearer long-term. +4. Why alternatives were rejected. +5. Failure modes and rollback plan. + +Keep it concise. This is for decision quality, not long-form documentation. + +## Step 3: Industry-Standard Gate + +Before introducing new patterns or dependencies, verify: + +- Existing project pattern cannot reasonably solve the problem. +- Proposed approach is mature and actively maintained. +- Security and operational implications are acceptable. +- Migration and maintenance cost are understood. + +If the new dependency is optional or convenience-only, prefer existing patterns. + +## Step 4: Implementation Discipline + +- Implement smallest change that satisfies the specification. +- Preserve existing architecture boundaries. +- Avoid speculative abstractions. +- Add concise comments only where logic is non-obvious. + +## Step 5: Risk-Scaled Validation + +- **Low risk:** targeted tests/checks for touched behavior. +- **Medium risk:** targeted + affected integration path tests. +- **High risk:** targeted + integration + regression/failure-mode checks. + +Compilation, lint, and type checks are necessary but not sufficient for medium/high risk. 
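The tiered validation above can be sketched as a small plan-builder. This is a hedged illustration, not tooling that exists in the repo: the `bun run` scripts are the project's documented commands, while the `<touched-area>` and `<integration-path>` filters are hypothetical placeholders.

```typescript
type Risk = "low" | "medium" | "high"

// Map a risk tier to the validation it requires. Lint and typecheck are the
// baseline for every tier; test depth grows with risk. The "<touched-area>"
// and "<integration-path>" filters are placeholders, not real paths.
function validationPlan(risk: Risk): string[] {
  const baseline = ["bun run lint", "bun run typecheck"]
  switch (risk) {
    case "low":
      return [...baseline, "bun run test -- <touched-area>"]
    case "medium":
      return [
        ...baseline,
        "bun run test -- <touched-area>",
        "bun run test -- <integration-path>",
      ]
    case "high":
      // Full suite plus a production build as a regression/failure-mode check.
      return [...baseline, "bun run test", "bun run build"]
  }
}
```

Note that every tier includes the baseline: compilation, lint, and type checks are always necessary but never sufficient on their own for medium/high risk.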
+ +## Step 6: Final Review Checklist + +- Chosen approach addresses root cause. +- Trade-offs were considered, not guessed. +- Changes match project patterns unless deviation is justified. +- Validation depth matches risk tier. +- Residual risks are explicitly noted in handoff/report. diff --git a/.env.example b/.env.example index b4ff83d3..5592b340 100644 --- a/.env.example +++ b/.env.example @@ -53,6 +53,9 @@ OPENAI_API_KEY= # Anthropic (Claude) - https://console.anthropic.com/ ANTHROPIC_API_KEY= +# Set to "false" to disable the token-efficient-tools beta header for Anthropic. +# Used for A/B benchmarking tool token usage. Default: enabled. +# ANTHROPIC_TOKEN_EFFICIENT_TOOLS=false # Google (Gemini) - https://aistudio.google.com/apikey GOOGLE_GENERATIVE_AI_API_KEY= diff --git a/AGENTS.md b/AGENTS.md index ff4f1df1..2f9b5f13 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,142 +1,70 @@ # Not A Wrapper -Open-source multi-AI chat application with unified model interface. Supports 100+ models across 8 providers with multi-model comparison, BYOK, and local model support. - -## Tech Stack - -| Layer | Technology | -|-------|------------| -| Framework | Next.js 16 (App Router), React 19, TypeScript | -| Database | Convex (reactive DB + built-in RAG) | -| Auth | Clerk | -| Payments | Flowglad | -| AI | Vercel AI SDK → Multi-provider (OpenAI, Claude, Gemini, etc.) 
| -| State | Zustand + TanStack Query | -| UI | Shadcn/Base UI + Tailwind 4 | - -## Commands - -```bash -bun install # Install deps -bun run dev # Dev server (:3000) -bun run dev:clean # Dev server with fresh .next cache -bun run lint # ESLint -bun run typecheck # tsc --noEmit -bun run build # Production build -bun run test # Vitest (critical paths) -``` - -## Context System - -This project uses a structured context system for AI assistants: - -| Location | Purpose | When Loaded | -|----------|---------|-------------| -| `AGENTS.md` | Quick reference (this file) | Always | -| `.cursor/rules/` | Cursor-specific patterns | Auto by Cursor | -| `.agents/context/` | Domain knowledge & references | On-demand | -| `.agents/context/glossary.md` | Domain terminology | On-demand | -| `.agents/research/` | Research, evaluations, analyses | On-demand | -| `.agents/troubleshooting/` | Known issues & fixes | On-demand | -| `.agents/design/` | Design references & UI research | On-demand | -| `.agents/plans/` | Implementation plans | On-demand | -| `.agents/skills/` | Multi-step task guides | On-demand | -| `.agents/workflows/` | Development procedures | On-demand | -| `.agents/archive/` | Superseded documents | On-demand | +Open-source multi-AI chat app with a unified model interface across providers. -### Key Skills +## Primary Objective -**Load skills BEFORE starting work** when a task matches the trigger. +Deliver correct, secure, maintainable changes with minimal, focused diffs. -| Skill | Use When | -|-------|----------| -| `add-ai-provider` | Integrating new AI service | -| `add-model` | Adding model to existing provider | -| `convex-function` | Creating database functions | +## Context File Contract (Paper-Aligned) -> Skills contain checklists and patterns that prevent common mistakes. Load via `@.agents/skills/[name]/SKILL.md` +- Keep this file minimal and high-signal. +- Include only mandatory constraints and critical patterns. 
+- Avoid broad repository overviews and generic checklists. +- Load deeper guidance from `.agents/` only when task-relevant. -### Workflows +## Implementation Philosophy (SHOULD) -| Workflow | Use When | -|----------|----------| -| `new-feature.md` | Implementing new features | -| `debugging.md` | Troubleshooting issues | -| `release.md` | Releasing new versions | - -## Directory Structure - -``` -app/ # Next.js App Router -├── api/ # API routes (streaming) -├── auth/ # Auth pages/actions -├── c/[chatId]/ # Chat pages -├── p/[projectId]/ # Project pages -├── share/ # Public share pages -└── components/chat/ # Chat UI - -lib/ # Shared utilities -├── chat-store/ # Chat state -├── config.ts # Constants -├── models/ # AI model definitions -└── openproviders/ # AI provider abstraction - -components/ # Shadcn UI components (Base UI primitives) -convex/ # Convex DB schema & functions - -.agents/ # AI context & knowledge base -├── context/ # Domain knowledge & references -├── research/ # Research & evaluations -├── troubleshooting/ # Known issues & fixes -├── design/ # Design references & UI research -├── plans/ # Implementation plans -├── skills/ # Multi-step task guides -├── workflows/ # Development procedures -└── archive/ # Superseded documents - -.cursor/rules/ # Cursor-specific rules -``` +- Prefer well-researched, industry-standard solutions over quick fixes. +- Extend existing project patterns instead of introducing parallel systems. +- Fix root causes instead of symptoms. +- Optimize for maintainability and clarity over short-term speed. +- If unsure, consult `.agents/research/` and document non-trivial trade-offs. 
-## Gold Standard Examples +## Correctness-First Escalation (MUST) -| Pattern | File | -|---------|------| -| API Route | `app/api/chat/route.ts` | -| Provider History Adapter Registry | `app/api/chat/adapters/index.ts` | -| Custom Hook | `app/components/chat/use-chat-core.ts` | -| Context Provider | `lib/chat-store/chats/provider.tsx` | -| Component | `app/components/chat/chat.tsx` | +- Use risk-based rigor: keep low-risk tasks lightweight, increase rigor for medium/high-risk tasks. +- Medium/high-risk changes require a brief approach decision before coding (options, trade-offs, chosen approach). +- High-risk triggers include: auth, schema/data model, API contracts, persistence, concurrency, migrations, billing/payments, and security-critical paths. +- Introducing a new dependency or architectural pattern requires explicit justification and at least one alternative considered. +- Validation depth must scale with risk; do not treat successful compilation as sufficient evidence of correctness. +- For the detailed process, load `.agents/workflows/correctness-decision-workflow.md` on demand. -## Implementation Philosophy +## Non-Negotiable Rules -**Prefer well-researched, industry-standard solutions over quick fixes.** +### Security (MUST) -When implementing features or fixing bugs: +- Never read/write `.env*` files. +- Never log or expose secrets, tokens, or credentials. +- Treat BYOK/API key data as encrypted-at-rest. -1. **Research first** — Understand the problem domain and established solutions before writing code -2. **Use proven patterns** — Prefer battle-tested approaches (design patterns, established libraries, documented techniques) over novel or ad-hoc solutions -3. **Optimize for maintainability** — Long-term code health over short-term velocity -4. **Extend existing conventions** — Follow and build upon the codebase's established patterns -5. 
**Evaluate trade-offs** — When multiple approaches exist, analyze pros/cons before committing +### Code Quality (MUST) -> When unsure, consult `.agents/research/` for prior analysis or create a new research document before implementing. +- No `// @ts-ignore`. +- No lint-rule bypassing (`eslint-disable`) without explicit documented approval. +- Do not downgrade or disable checks to "make it pass." +- Prefer source fixes over workarounds. -## Prompt Delivery Default +### Git Safety (MUST) -When the user asks to "create a prompt" (or similar), return the prompt directly in chat. -Do not create a markdown file unless the user explicitly asks for a file. -If ambiguous, prefer chat output. +- Never create branches unless explicitly asked. +- Never force-push to shared branches. +- Avoid destructive git commands unless explicitly requested. -## No Timeline Estimates +## Ask Before Making These Changes (MUST) -**Never include time estimates, durations, or effort assessments** in plans, summaries, or implementation outputs. This includes phrases like "~30 minutes", "2-3 hours", "Phase 1 (Day 1)", "Quick win", or any similar timeline/effort language. AI-generated timeline estimates are unreliable and misleading. Only include timeline or effort information if the user explicitly requests it. 
+- Adding dependencies (`bun add ...`) +- Modifying `package.json`, `tsconfig*`, `next.config.*` +- Editing auth-critical paths (`app/auth/`, `middleware.ts`) +- Changing DB schema (`convex/schema.ts`) +- Changing CI/CD (`.github/workflows/`) +- Deleting files -## Critical Patterns +## Required Project Patterns (MUST When Applicable) -### Streaming Responses (MUST) +### Streaming Responses (AI SDK v6) ```typescript -// ALWAYS use toUIMessageStreamResponse for AI chat (AI SDK v6) return result.toUIMessageStreamResponse({ sendReasoning: true, sendSources: true, @@ -144,108 +72,54 @@ return result.toUIMessageStreamResponse({ }) ``` -### Convex Auth Pattern (MUST) +### Convex Auth Pattern ```typescript -// All mutations modifying user data: const identity = await ctx.auth.getUserIdentity() if (!identity) throw new Error("Not authenticated") -// ... lookup user, verify ownership, then operate +// verify ownership before user-scoped mutations ``` -### Optimistic Updates +### Optimistic Update Pattern ```typescript -// Store previous → Update optimistic → Rollback on error let previous = null -setState((prev) => { previous = prev; return updated }) -try { await mutation() } -catch { if (previous) setState(previous) } -``` - -## AI Agent Permissions - -### ✅ Allowed - -- Read any source file -- Run: `dev`, `build`, `lint`, `typecheck`, `test` -- Create/edit in: `app/`, `lib/`, `components/`, `hooks/` -- Create/edit documentation in: `.agents/` (follow `.cursor/rules/070-documentation.mdc`) - -### ⚠️ Ask First - -- `bun add ` -- Modify: `package.json`, `tsconfig.json`, `next.config.*` -- Git operations -- Auth logic (`app/auth/`, `middleware.ts`) -- Delete files -- DB schema (`convex/schema.ts`) -- CI/CD (`.github/workflows/`) - -### 🚫 Forbidden - -- **Creating git branches** — NEVER create new branches unless the user explicitly asks for a branch to be created. Implementation plans, feature work, and all other tasks must be done on the current branch. 
Branch creation requires explicit user instruction. -- Read/write `.env*` files -- Force push or commit secrets -- `// @ts-ignore` (never acceptable) -- `eslint-disable` without documented reason -- Disabling lint rules to bypass errors - -## Security - -**Never log:** OAuth tokens, API keys, credentials, session tokens - -**Encrypt at rest:** User-provided API keys (BYOK) via AES-256-GCM - -**Rate limiting:** Check BEFORE calling `streamText()` - -## Key Terminology - -> Full glossary: `.agents/context/glossary.md` - -| Term | Meaning | -|------|---------| -| Model | Config object, ID string, or SDK instance (context-dependent) | -| providerId | Internal ID for API key lookups (`"anthropic"`) | -| baseProviderId | AI SDK identifier (`"claude"`) | -| parts | AI SDK message content array (text, tools, reasoning) | -| BYOK | Bring Your Own Key | - -## Environment Variables - -```bash -# Required -NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY= -CLERK_SECRET_KEY= -CONVEX_DEPLOYMENT= -NEXT_PUBLIC_CONVEX_URL= -CSRF_SECRET= -ENCRYPTION_KEY= # Must be 32 bytes base64 - -# AI Providers (at least one) -ANTHROPIC_API_KEY= -OPENAI_API_KEY= +setState((prev) => { + previous = prev + return updated +}) +try { + await mutation() +} catch { + if (previous) setState(previous) +} ``` -See `.env.example` for complete documentation. - -## Development Workflow +## Execution Defaults (SHOULD) -Four-phase cycle: **Research → Plan → Code & Verify → Commit** +1. Gather only the context needed for the current task. +2. Plan small, testable edits. +3. Implement focused changes. +4. Run only relevant checks (`lint`, `typecheck`, targeted tests). +5. Report key trade-offs and residual risks. -Use `ultrathink` for complex architectural decisions. +## On-Demand Context -See `.agents/workflows/development-cycle.md` for details. 
+Load only when needed: -## Pull Requests +- `.agents/context/` +- `.agents/skills/` +- `.agents/workflows/` +- `.agents/troubleshooting/` +- `.agents/context/glossary.md` -When creating a pull request: +## Output Preferences (SHOULD) -1. **Always fetch first** — Run `git fetch origin` before comparing branches -2. **Compare against remote** — Use `origin/main` (not local `main`) for diffs and commit logs. Local `main` may be stale. -3. **Verify commit count** — Run `git log origin/main..HEAD --oneline` and confirm the number matches what GitHub will show -4. **Scope the description** — The PR body must reflect only the commits unique to the branch, not the full history since an outdated local ref +- If asked to create a prompt, return it directly in chat unless a file is explicitly requested. +- Do not include timeline or effort estimates unless explicitly requested. ---- +## Pull Request Baseline (SHOULD When Preparing PRs) -*~200 lines. For detailed patterns, see `.cursor/rules/` and `.agents/skills/`.* +1. Run `git fetch origin` before branch comparisons. +2. Diff and log against `origin/main` (not local `main`). +3. Scope PR descriptions to commits in `origin/main..HEAD`. diff --git a/CLAUDE.md b/CLAUDE.md index 9954c16f..1fcc6bb5 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,226 +1,21 @@ -# Claude-Specific Context +# Claude Overlay (Minimal) -This file contains Claude-specific behaviors, preferences, and context for the Not A Wrapper project. +Claude-specific guidance for this repository. Universal rules live in `AGENTS.md`. -> See `@AGENTS.md` for universal guidelines that apply to all AI agents. +## First Principle -## Claude Preferences +Follow `AGENTS.md` as the source of truth for implementation philosophy, safety, and quality. 
-### Thinking Mode -- Use **extended thinking** for complex architectural decisions -- Use `ultrathink` trigger for multi-step refactoring or debugging sessions -- Standard thinking is fine for simple edits and additions +## Claude-Specific Deltas (Only) -### Response Style -- Be concise; avoid over-explaining obvious code -- Use code references (`startLine:endLine:filepath`) when discussing existing code -- Prefer showing small, focused diffs over full file rewrites -- **Never include timeline or effort estimates** (e.g., "~30 min", "2 hours", "Day 1") unless the user explicitly asks for them — AI time estimates are unreliable +- Use parallel tool calls when operations are independent. +- Read only task-relevant files; avoid broad exploration by default. +- If a task is medium/high risk, load `.agents/workflows/correctness-decision-workflow.md` before implementation. +- After substantive edits, run relevant validation (`lint`, `typecheck`, targeted tests). +- Be concise and direct. +- Do not provide timeline or effort estimates unless explicitly requested. 
-### Tool Usage -- **Maximize parallel tool calls** when operations are independent -- Read multiple files simultaneously when exploring a feature -- Run lint/typecheck after edits to catch issues early +## Notes -## Project-Specific Behaviors - -### Prompt Delivery Default -- When the user asks to "create a prompt" (or similar), return the prompt directly in chat -- Do not create a markdown file unless the user explicitly asks for a file -- If ambiguous, prefer chat output - -### When Working on Chat Features -- Reference `app/components/chat/use-chat-core.ts` for hook patterns -- Follow optimistic update pattern from `lib/chat-store/chats/provider.tsx` -- Streaming responses use Vercel AI SDK patterns - -### When Working on API Routes -- Follow `app/api/chat/route.ts` as the gold standard -- Always validate input with proper error handling -- Use structured error responses: `{ error: string, code?: string }` - -### When Working on UI Components -- Use Shadcn/Base UI primitives from `components/ui/` -- Follow existing patterns in `app/components/` -- Prefer composition over configuration - -## Memory Hierarchy - -This project uses the following memory structure: - -``` -CLAUDE.md (this file) → Project-level Claude context -├── app/CLAUDE.md → App-specific patterns -├── lib/CLAUDE.md → Library patterns -└── ~/.claude/CLAUDE.md → Personal user preferences -``` - -## Import Syntax for Context - -When you need additional context, use the `@` import syntax: - -```markdown -@AGENTS.md # Project overview, commands, permissions -@.agents/context/glossary.md # Domain terminology definitions -@.agents/skills/add-ai-provider/ # Adding new AI providers -@.agents/skills/add-model/ # Adding new models -@.agents/skills/convex-function/ # Creating database functions -@.agents/workflows/development-cycle.md # Development workflows (four-phase cycle, TDD) -@lib/config.ts # Centralized configuration constants -``` - -## Context System - -| Location | Purpose | 
-|----------|---------| -| `.agents/context/` | Architecture, API, database, and deployment docs | -| `.agents/context/glossary.md` | Domain terminology (Model, providerId, parts, etc.) | -| `.agents/research/` | Research, evaluations, analyses | -| `.agents/troubleshooting/` | Known issues & fixes | -| `.agents/design/` | Design references & UI research | -| `.agents/plans/` | Implementation plans | -| `.agents/skills/` | Multi-step task guides | -| `.agents/workflows/` | Development workflows and procedures | -| `.agents/archive/` | Superseded documents | -| `.cursor/rules/` | Cursor-specific patterns (auto-loaded) | - -> **Documentation rule**: All AI-generated markdown belongs in `.agents/`. See `.cursor/rules/070-documentation.mdc` for placement guide. - -## Development Workflow - -This project follows Anthropic's four-phase coding cycle. See `@.agents/workflows/development-cycle.md` for complete details. - -### Quick Reference - -**Phase 1: Research** → Gather context, read files, understand patterns -**Phase 2: Plan** → Create detailed plan, use `ultrathink` for complex problems -**Phase 3: Code & Verify** → Implement step-by-step, verify after each step -**Phase 4: Commit** → Commit incrementally with clear messages - -### Extended Thinking - -Use extended thinking (`ultrathink`) for: -- Architectural decisions -- Complex debugging sessions -- Security analysis -- Performance optimization - -Toggle "Thinking On/Off" in Claude Code, or use `ultrathink:` prefix in prompts. - -### TDD Workflow - -For critical paths (auth, data transforms, rate limiting): -1. Write tests first -2. Confirm tests fail -3. Commit tests -4. Implement to pass tests -5. 
Iterate until all pass - -### Context Management - -When sessions get long: -- Summarize older messages (keep last 10) -- Write session discoveries to `NOTES.md` (project root — scratch notes only) -- Write lasting research/analysis to `.agents/research/` -- Reference `@` files instead of pasting content -- Use context compaction strategies - -See `@.agents/workflows/development-cycle.md` and `@.agents/workflows/examples.md` for detailed workflows. - -## Sub-Agent Architecture - -When the sub-agent architecture is implemented, Claude should route tasks: - -| Task Type | Agent | Model | -|-----------|-------|-------| -| Code assistance | Code Assistant | Haiku 4.5 | -| Writing/editing | Writing Editor | Sonnet 4.5 | -| Research tasks | Research Analyst | Sonnet 4.5 | -| Data analysis | Data Analyst | Sonnet 4.5 | -| General conversation | Main Orchestrator | Opus 4.5 | - -## Context Compaction - -For long sessions, Claude should: - -1. Summarize older messages when approaching token limits -2. Write important discoveries to `NOTES.md` -3. Keep the last 10 messages in full context -4. Reference `@` files instead of keeping full content in context - -## Quality Enforcement - -**This project prioritizes well-researched, industry-standard solutions over quick fixes.** See `AGENTS.md` > Implementation Philosophy for the universal principles. - -### Implementation Decision Framework - -Before writing code, follow this sequence: - -1. **Research the domain** — Search for established patterns, prior art, and industry conventions for the problem at hand -2. **Check for existing solutions** — Look in `.agents/research/` for prior analysis; check if the codebase already solves a similar problem -3. **Evaluate approaches** — When multiple solutions exist, compare trade-offs (performance, maintainability, complexity, ecosystem alignment) -4. **Align with the codebase** — Ensure the chosen approach extends existing conventions rather than introducing a parallel pattern -5. 
**Implement and verify** — Build incrementally, verifying each step against the gold standard examples in `AGENTS.md` - -### When You're Uncertain - -```markdown -✅ DO: Research the problem first — "Let me check how this is typically handled in Next.js App Router..." -✅ DO: Reference prior art — "React's documentation recommends this pattern for..." -✅ DO: Document your reasoning — Create a research doc in .agents/research/ for non-trivial decisions -✅ DO: Propose options — "There are two established approaches here. Option A... Option B... I recommend B because..." - -❌ DON'T: Jump to the first solution that works -❌ DON'T: Invent custom patterns when standard ones exist -❌ DON'T: Optimize for fewer lines of code over clarity and maintainability -❌ DON'T: Skip research for unfamiliar problem domains -``` - -### Hierarchy of Solutions (Errors & Issues) - -1. **Fix the code properly** — Always the first choice -2. **Refactor the pattern** — If the code is fundamentally incompatible with the correct approach -3. **Document exception** — Only with explicit user approval and clear reason -4. **Never**: Disable rules, add ignore comments, or downgrade deps without approval - -### Forbidden Actions - -- **Creating git branches** — NEVER create new branches unless the user explicitly asks for a branch to be created. Implementation plans, feature work, and all other tasks must be done on the current branch. Branch creation requires explicit user instruction. 
-- Setting ESLint rules to `"off"` or `"warn"` to bypass errors -- Adding `// @ts-ignore` (never acceptable) -- Adding `eslint-disable` comments without documented reason -- Suggesting "we can disable this check" as a solution -- Downgrading packages to avoid type/lint errors -- Implementing ad-hoc workarounds when a well-documented solution exists - -### Reference for Fixes - -- `.agents/research/` — Prior research and analysis -- `.agents/workflows/react-19-lint-fixes.md` — React 19 / React Compiler patterns -- `.agents/context/conventions.md` — Quality gates and acceptable exceptions -- `.agents/troubleshooting/` — Known issues & fixes -- Gold standard examples in `AGENTS.md` - -## Debugging Workflow - -When debugging issues: - -1. **Read first**: Examine the relevant files before suggesting changes -2. **Check lints**: Run `bun run lint` and `bun run typecheck` -3. **Verify patterns**: Ensure changes follow gold standard examples -4. **Test incrementally**: Suggest running tests after each significant change -5. **Fix at source**: Never suggest disabling checks as a solution - -## Common Gotchas - -- **Streaming responses**: Must use `result.toUIMessageStreamResponse()` from Vercel AI SDK (v6) -- **Server Components**: Cannot use hooks; use Client Components wrapper with `"use client"` -- **Database**: Uses Convex for all data operations (real-time queries + mutations) -- **Auth**: Uses Clerk for authentication; avoid touching `middleware.ts` without review -- **File Storage**: Uses Convex storage for file uploads -- **Model terminology**: See `.agents/context/glossary.md` for precise definitions -- **Pull requests**: Always `git fetch origin` first and compare against `origin/main`, never local `main`. Local main can be arbitrarily stale, causing commit/diff inflation in PR descriptions. - ---- - -*This file is automatically loaded by Claude Code and Claude API tools.* +- Keep this file short and Claude-specific. 
+- Do not duplicate policy from `AGENTS.md` unless a contradiction must be resolved. diff --git a/app/api/chat/route.ts b/app/api/chat/route.ts index 6b2a7951..d585fce0 100644 --- a/app/api/chat/route.ts +++ b/app/api/chat/route.ts @@ -18,7 +18,7 @@ import { flushPostHog, getPostHogClient, } from "@/lib/posthog" -import type { Provider } from "@/lib/user-keys" +import type { Provider, ToolKeyMode } from "@/lib/user-keys" import { UIMessage as MessageAISDK, streamText, @@ -48,12 +48,27 @@ import { loadUserMcpTools, type LoadToolsResult, } from "@/lib/mcp/load-tools" -import { resolveToolCapabilities } from "@/lib/tools/types" -import { shouldInjectSearchTools } from "./search-tools" +import { + enforceToolNamingGovernance, + type ToolLayerMap, +} from "@/lib/tools/naming" +import { + filterMetadataMapByPolicy, + filterToolSetByPolicy, + getActiveToolsForStep, + resolveCapabilityPolicy, + type ToolPolicyInput, + type ToolPolicyDecision, +} from "@/lib/tools/capability-policy" +import { + buildFinishToolInvocationStreamMetadata, + buildStartToolInvocationStreamMetadata, + buildToolInvocationMetadataByName, + type ToolInvocationMetadataByCallId, +} from "@/lib/tools/ui-metadata" import { ToolTraceCollector, wrapMcpTools, - isToolResultEnvelope, } from "@/lib/tools/mcp-wrapper" import type { ShippingAddress } from "@/lib/payclaw/schemas" @@ -265,6 +280,7 @@ export async function POST(req: Request) { ) } } + const providerToolKeyMode: ToolKeyMode = apiKey ? "byok" : "platform" // enableSearch is no longer passed to the model — it controls tool injection below. // All search is now provided via visible, auditable tool calls (Layer 1 or Layer 2). 
@@ -290,8 +306,27 @@ export async function POST(req: Request) { let builtInTools: ToolSet = {} as ToolSet let builtInToolMetadata = new Map() - const capabilities = resolveToolCapabilities(modelConfig.tools) - const shouldInjectSearch = shouldInjectSearchTools(enableSearch, modelConfig.tools) + const initialCapabilityPolicy = resolveCapabilityPolicy({ + modelTools: modelConfig.tools, + isAuthenticated, + }) + const capabilities = initialCapabilityPolicy.capabilities + const shouldInjectSearch = enableSearch && capabilities.search + const requestId = crypto.randomUUID() + + console.log( + JSON.stringify({ + _tag: "tool_capability_policy", + requestId, + chatId, + userId, + model, + userTier: initialCapabilityPolicy.userTier, + capabilities: initialCapabilityPolicy.capabilities, + capabilityReasons: initialCapabilityPolicy.capabilityReasons, + keyModeReason: initialCapabilityPolicy.keyModeReason, + }) + ) if (shouldInjectSearch) { const { getProviderTools } = await import("@/lib/tools/provider") @@ -301,17 +336,137 @@ export async function POST(req: Request) { } // ----------------------------------------------------------------------- - // Third-Party Tool Loading (Layer 2) + // Exa API Key Resolution (shared by Layer 2 capabilities) + // + // Resolved once, used by both search fallback and content extraction. + // Key resolution: user BYOK key → platform env var → undefined. + // The exa-js SDK accepts keys in its constructor, so BYOK keys + // are passed directly — no env var manipulation needed. + // + // NOT gated on isAuthenticated — anonymous users get Exa-powered tools + // when the platform has an EXA_API_KEY configured (same as Layer 1). 
+ // ----------------------------------------------------------------------- + let resolvedExaKey: string | undefined + let resolvedExaKeyMode: ToolKeyMode | undefined + const { getEffectiveToolKeyWithMode } = await import("@/lib/user-keys") + const resolvedExa = await getEffectiveToolKeyWithMode("exa", convexToken) + resolvedExaKey = resolvedExa.key + resolvedExaKeyMode = resolvedExa.keyMode + + const { + createOutageTolerantToolBudgetEnforcer, + createConvexToolLimitStore, + createRequestLocalToolSoftCap, + createToolPolicyGuard, + isPolicyUnavailableError, + probeToolBudget, + } = await import("@/lib/tools/policy") + const toolLimitStore = createConvexToolLimitStore({ + convexToken, + anonymousId, + }) + const makePolicyGuard = (keyMode: ToolKeyMode) => + createToolPolicyGuard({ store: toolLimitStore, keyMode }) + + const builtInPolicyGuard = makePolicyGuard(providerToolKeyMode) + const mcpPolicyGuard = makePolicyGuard("platform") + const platformPolicyGuard = makePolicyGuard("platform") + const exaPolicyGuard = + resolvedExaKeyMode ? makePolicyGuard(resolvedExaKeyMode) : undefined + + const logOutageTolerantBudgetEvent = ( + source: "third-party" | "content" | "platform" | "mcp", + event: { + type: "recovered" | "degraded_allow" | "degraded_block" + toolName: string + keyMode: ToolKeyMode + retryAfterSeconds?: number + snapshot?: { + used: number + remaining: number + maxCalls: number + } + error?: string + } + ) => { + if (event.type === "recovered") { + console.warn( + JSON.stringify({ + _tag: "tool_budget_gate_recovered", + requestId, + tool: event.toolName, + source, + keyMode: event.keyMode, + action: "resume_policy_enforced_budgeting", + }) + ) + return + } + + console.warn( + JSON.stringify({ + _tag: "tool_budget_gate_degraded", + requestId, + tool: event.toolName, + source, + keyMode: event.keyMode, + policyUnavailable: true, + usedCalls: event.snapshot?.used ?? null, + remainingCalls: event.snapshot?.remaining ?? 
null, + maxCalls: event.snapshot?.maxCalls ?? null, + retryAfterSeconds: event.retryAfterSeconds ?? null, + error: event.error ?? null, + action: + event.type === "degraded_allow" + ? "allow_tool_with_request_local_soft_cap" + : "disable_tool_for_remaining_request", + }) + ) + } + + const thirdPartyBudgetEnforcer = + exaPolicyGuard && resolvedExaKeyMode + ? createOutageTolerantToolBudgetEnforcer({ + enforceToolBudget: (toolName) => exaPolicyGuard.enforceToolBudget(toolName), + keyMode: resolvedExaKeyMode, + maxCallsPerTool: PREPARE_STEP_THRESHOLD, + onEvent: (event) => logOutageTolerantBudgetEvent("third-party", event), + }) + : undefined + + const contentBudgetEnforcer = + exaPolicyGuard && resolvedExaKeyMode + ? createOutageTolerantToolBudgetEnforcer({ + enforceToolBudget: (toolName) => exaPolicyGuard.enforceToolBudget(toolName), + keyMode: resolvedExaKeyMode, + maxCallsPerTool: PREPARE_STEP_THRESHOLD, + onEvent: (event) => logOutageTolerantBudgetEvent("content", event), + }) + : undefined + + const platformBudgetEnforcer = createOutageTolerantToolBudgetEnforcer({ + enforceToolBudget: (toolName) => platformPolicyGuard.enforceToolBudget(toolName), + keyMode: "platform", + maxCallsPerTool: PREPARE_STEP_THRESHOLD, + onEvent: (event) => logOutageTolerantBudgetEvent("platform", event), + }) + + const mcpBudgetEnforcer = createOutageTolerantToolBudgetEnforcer({ + enforceToolBudget: (toolName) => mcpPolicyGuard.enforceToolBudget(toolName), + keyMode: "platform", + maxCallsPerTool: PREPARE_STEP_THRESHOLD, + onEvent: (event) => logOutageTolerantBudgetEvent("mcp", event), + }) + + // ----------------------------------------------------------------------- + // Third-Party Search Fallback (Layer 2 — Search) // Universal search fallback for providers without native search tools. // Only loaded when enableSearch is true AND Layer 1 didn't provide search. 
// // The coordination model is simple: // - enableSearch === true: route.ts injects search tools - // - Layer 1 provided search (builtInHasSearch): skip Layer 2 + // - Layer 1 provided search (builtInHasSearch): skip Layer 2 search // - Layer 1 did NOT provide search: load Layer 2 Exa fallback - // - // NOT gated on isAuthenticated — anonymous users get search when - // the platform has an EXA_API_KEY configured (same as Layer 1). // ----------------------------------------------------------------------- let thirdPartyTools: ToolSet = {} as ToolSet let thirdPartyToolMetadata = new Map() @@ -319,26 +474,10 @@ export async function POST(req: Request) { if (shouldInjectSearch) { const builtInHasSearch = Object.keys(builtInTools).length > 0 - // Only load Layer 2 when Layer 1 didn't provide search. - // This is the sole coordination point — third-party.ts does not - // know about providers. It just receives a skipSearch flag. if (!builtInHasSearch) { const { getThirdPartyTools } = await import("@/lib/tools/third-party") - - // Key resolution: user BYOK key → platform env var → undefined - // The exa-js SDK accepts keys in its constructor, so BYOK keys - // are passed directly — no env var manipulation needed. 
- let resolvedExaKey: string | undefined - if (convexToken) { - const { getEffectiveToolKey } = await import("@/lib/user-keys") - resolvedExaKey = await getEffectiveToolKey("exa", convexToken) - } - if (!resolvedExaKey) { - resolvedExaKey = process.env.EXA_API_KEY - } - const thirdPartyResult = await getThirdPartyTools({ - skipSearch: false, // We already know we need search (builtInHasSearch is false) + skipSearch: false, exaKey: resolvedExaKey, }) thirdPartyTools = thirdPartyResult.tools @@ -346,6 +485,26 @@ export async function POST(req: Request) { } } + // ----------------------------------------------------------------------- + // Content Extraction Tools (Layer 2 — Content) + // Independent capability — NOT gated on shouldInjectSearch or + // builtInHasSearch. Gated on capabilities.extract and Exa key. + // Available for ALL providers including those with native Layer 1 + // search (OpenAI, Anthropic, Google, xAI). + // ----------------------------------------------------------------------- + let contentTools: ToolSet = {} as ToolSet + let contentToolMetadata = new Map() + + if (resolvedExaKey && capabilities.extract) { + const { getContentExtractionTools } = await import("@/lib/tools/third-party") + const contentResult = await getContentExtractionTools({ + exaKey: resolvedExaKey, + policyGuard: exaPolicyGuard, + }) + contentTools = contentResult.tools + contentToolMetadata = contentResult.metadata + } + // ----------------------------------------------------------------------- // MCP Tool Loading // Gate on: auth + Convex token + model capability @@ -401,7 +560,7 @@ export async function POST(req: Request) { let platformToolMetadata = new Map() let userCardId: string | undefined - if (isAuthenticated) { + if (isAuthenticated && capabilities.platform) { if (convexToken) { try { userAddresses = (await fetchQuery( @@ -464,6 +623,134 @@ export async function POST(req: Request) { platformToolMetadata = platformResult.metadata } + const toolPolicyInputs: 
ToolPolicyInput[] = [ + ...Object.keys(builtInTools).map((toolName) => { + const meta = builtInToolMetadata.get(toolName) + return { + toolName, + source: meta?.source ?? "builtin", + capability: "search" as const, + readOnly: meta?.readOnly, + destructive: meta?.destructive, + idempotent: meta?.idempotent, + openWorld: meta?.openWorld, + } + }), + ...Object.keys(thirdPartyTools).map((toolName) => { + const meta = thirdPartyToolMetadata.get(toolName) + return { + toolName, + source: meta?.source ?? "third-party", + capability: "search" as const, + readOnly: meta?.readOnly, + destructive: meta?.destructive, + idempotent: meta?.idempotent, + openWorld: meta?.openWorld, + } + }), + ...Object.keys(contentTools).map((toolName) => { + const meta = contentToolMetadata.get(toolName) + return { + toolName, + source: meta?.source ?? "third-party", + capability: "extract" as const, + readOnly: meta?.readOnly, + destructive: meta?.destructive, + idempotent: meta?.idempotent, + openWorld: meta?.openWorld, + } + }), + ...Object.keys(platformTools).map((toolName) => { + const meta = platformToolMetadata.get(toolName) + return { + toolName, + source: meta?.source ?? "platform", + capability: "platform" as const, + readOnly: meta?.readOnly, + destructive: meta?.destructive, + idempotent: meta?.idempotent, + openWorld: meta?.openWorld, + } + }), + ...Object.keys(mcpTools).map((toolName) => { + const info = mcpToolServerMap.get(toolName) + const policyHintsTrusted = info?.policyHintsTrusted === true + return { + toolName, + source: "mcp" as const, + capability: "mcp" as const, + riskHintsTrusted: policyHintsTrusted, + readOnly: policyHintsTrusted ? info?.readOnly : undefined, + destructive: policyHintsTrusted ? info?.destructive : undefined, + idempotent: policyHintsTrusted ? info?.idempotent : undefined, + openWorld: policyHintsTrusted ? 
info?.openWorld : undefined, + } }), + ] + + const toolPolicy = resolveCapabilityPolicy({ + modelTools: modelConfig.tools, + isAuthenticated, + keyMode: resolvedExaKeyMode, + tools: toolPolicyInputs, + }) + + const summarizeReasonCounts = ( + decisions: ToolPolicyDecision[], + selector: (decision: ToolPolicyDecision) => string + ) => { + const counts: Record<string, number> = {} + for (const decision of decisions) { + const reason = selector(decision) + counts[reason] = (counts[reason] ?? 0) + 1 + } + return counts + } + + console.log( + JSON.stringify({ + _tag: "tool_policy_matrix", + requestId, + chatId, + userId, + model, + userTier: toolPolicy.userTier, + keyMode: toolPolicy.keyMode ?? null, + keyModeReason: toolPolicy.keyModeReason, + capabilities: toolPolicy.capabilities, + capabilityReasons: toolPolicy.capabilityReasons, + totalTools: toolPolicy.toolDecisions.length, + earlyAllowedCount: toolPolicy.earlyToolNames.length, + lateAllowedCount: toolPolicy.lateToolNames.length, + earlyReasonCounts: summarizeReasonCounts( + toolPolicy.toolDecisions, + (decision) => decision.earlyReasonCode + ), + lateReasonCounts: summarizeReasonCounts( + toolPolicy.toolDecisions, + (decision) => decision.lateReasonCode + ), + }) + ) + + builtInTools = filterToolSetByPolicy(builtInTools, toolPolicy) + thirdPartyTools = filterToolSetByPolicy(thirdPartyTools, toolPolicy) + contentTools = filterToolSetByPolicy(contentTools, toolPolicy) + platformTools = filterToolSetByPolicy(platformTools, toolPolicy) + mcpTools = filterToolSetByPolicy(mcpTools, toolPolicy) + + builtInToolMetadata = filterMetadataMapByPolicy(builtInToolMetadata, toolPolicy) + thirdPartyToolMetadata = filterMetadataMapByPolicy( + thirdPartyToolMetadata, + toolPolicy + ) + contentToolMetadata = filterMetadataMapByPolicy(contentToolMetadata, toolPolicy) + platformToolMetadata = filterMetadataMapByPolicy( + platformToolMetadata, + toolPolicy + ) + mcpToolServerMap = filterMetadataMapByPolicy(mcpToolServerMap, toolPolicy) + // Wrap MCP
tools with timeout, timing, truncation, and envelope. // Single wrapper handles all Layer 3 concerns — follows the Exa gold // standard pattern (lib/tools/third-party.ts:82-119). @@ -474,29 +761,233 @@ export async function POST(req: Request) { mcpTools = wrapMcpTools(mcpTools, { toolServerMap: mcpToolServerMap, traceCollector, + requestId, + enforceToolBudget: async (toolName) => { + await mcpBudgetEnforcer(toolName) + }, }) as ToolSet } - // Merge all tool layers: search (Layer 1 OR Layer 2) + platform (Layer 4) + MCP (Layer 3) - // Search tools are mutually exclusive: Layer 1 XOR Layer 2 (never both). - // MCP tools are always independent and additive. - // Spread order matters for conflict resolution: - // 1. Built-in/third-party search tools (lowest priority) - // 2. Platform tools (middle priority) - // 3. MCP tools (highest priority — user-configured, namespaced) - const searchTools = { ...builtInTools, ...thirdPartyTools } - const allTools = { ...searchTools, ...platformTools, ...mcpTools } as ToolSet - - // Dev-mode collision detection: warn when duplicate keys are found - if (process.env.NODE_ENV !== "production") { - const searchKeys = new Set(Object.keys(searchTools)) - for (const key of Object.keys(mcpTools)) { - if (searchKeys.has(key)) { - console.warn(`[tools] Key collision: "${key}" exists in both search and MCP tools. MCP wins.`) + // Wrap non-builtin tools with tracing (Layer 2 + Layer 4). + // Records durationMs and resultSizeBytes into traceCollector so + // onStepFinish and onFinish can read them for ALL tool types. 
+ const { wrapToolsWithTracing } = await import("@/lib/tools/utils") + if (Object.keys(thirdPartyTools).length > 0) { + thirdPartyTools = wrapToolsWithTracing( + thirdPartyTools, + traceCollector, + requestId, + async (toolName) => { + if (!thirdPartyBudgetEnforcer) return + await thirdPartyBudgetEnforcer(toolName) + }, + thirdPartyToolMetadata + ) + } + if (Object.keys(contentTools).length > 0) { + contentTools = wrapToolsWithTracing( + contentTools, + traceCollector, + requestId, + async (toolName) => { + if (!contentBudgetEnforcer) return + await contentBudgetEnforcer(toolName) + }, + contentToolMetadata + ) + } + if (Object.keys(platformTools).length > 0) { + platformTools = wrapToolsWithTracing( + platformTools, + traceCollector, + requestId, + async (toolName) => { + await platformBudgetEnforcer(toolName) + }, + platformToolMetadata + ) + } + + // Merge all tool layers: + // - Search: Layer 1 (built-in) XOR Layer 2 (Exa fallback) — never both + // - Content: Layer 2 content extraction — independent of search gating + // - Platform: Layer 4 (Flowglad Pay, etc.) + // - MCP: Layer 3 (user-configured servers) + // Spread order = conflict resolution priority (last wins): + // 1. Search tools (lowest priority) + // 2. Content extraction tools (same priority tier as search) + // 3. Platform tools (middle priority) + // 4. 
MCP tools (highest priority — user-configured, namespaced) + const toolLayers: ToolLayerMap = { + "built-in": builtInTools, + "third-party-search": thirdPartyTools, + "content-extraction": contentTools, + platform: platformTools, + mcp: mcpTools, + } + + const namingResult = enforceToolNamingGovernance(toolLayers) + if (namingResult.invalid.length > 0) { + for (const invalid of namingResult.invalid) { + console.warn( + JSON.stringify({ + _tag: "tool_name_invalid", + requestId, + tool: invalid.toolKey, + layer: invalid.layer, + reason: invalid.reason, + action: "drop_invalid_tool", + }) + ) + } + } + if (namingResult.collisions.length > 0) { + for (const collision of namingResult.collisions) { + const droppedLayers = collision.owners.filter( + (layer) => layer !== collision.winner + ) + console.warn( + JSON.stringify({ + _tag: "tool_name_collision", + requestId, + tool: collision.toolKey, + layers: collision.owners, + winner: collision.winner, + droppedLayers, + action: "keep_winner_drop_losers", + }) + ) + } + } + + builtInTools = (namingResult.sanitizedLayers["built-in"] ?? {}) as ToolSet + thirdPartyTools = (namingResult.sanitizedLayers["third-party-search"] ?? + {}) as ToolSet + contentTools = (namingResult.sanitizedLayers["content-extraction"] ?? + {}) as ToolSet + platformTools = (namingResult.sanitizedLayers.platform ?? {}) as ToolSet + mcpTools = (namingResult.sanitizedLayers.mcp ?? 
{}) as ToolSet + + const filterMetadataByTools = <T>( + metadata: ReadonlyMap<string, T>, + tools: ToolSet + ) => + new Map( + Array.from(metadata.entries()).filter(([name]) => + Object.prototype.hasOwnProperty.call(tools, name) + ) + ) + + builtInToolMetadata = filterMetadataByTools(builtInToolMetadata, builtInTools) + thirdPartyToolMetadata = filterMetadataByTools( + thirdPartyToolMetadata, + thirdPartyTools + ) + contentToolMetadata = filterMetadataByTools(contentToolMetadata, contentTools) + platformToolMetadata = filterMetadataByTools( + platformToolMetadata, + platformTools + ) + mcpToolServerMap = new Map( + Array.from(mcpToolServerMap.entries()).filter(([name]) => + Object.prototype.hasOwnProperty.call(mcpTools, name) + ) + ) + + const builtInToolNames = new Set(Object.keys(builtInTools)) + const exhaustedBuiltInTools = new Set<string>() + const degradedBuiltInTools = new Set<string>() + const degradedBuiltInSoftCap = createRequestLocalToolSoftCap({ + maxCallsPerTool: PREPARE_STEP_THRESHOLD, + }) + + // Provider-native (Layer 1) tools are provider-executed and do not expose a + // local execute() hook for per-call preflight enforcement. Compensating + // control: probe budget during prepareStep (consume:false) and account + // actual usage in onStepFinish. This preserves centralized budget policy + // semantics, with a bounded request-local soft cap when policy is unavailable.
+ const isBuiltInToolBudgetAllowed = async (toolName: string): Promise<boolean> => { + if (!builtInToolNames.has(toolName)) return true + if (exhaustedBuiltInTools.has(toolName)) return false + + try { + const probe = await probeToolBudget({ + store: toolLimitStore, + keyMode: providerToolKeyMode, + toolName, + }) + if (probe.allowed) { + if (degradedBuiltInTools.delete(toolName)) { + console.warn( + JSON.stringify({ + _tag: "tool_budget_gate_recovered", + requestId, + tool: toolName, + source: "builtin", + keyMode: providerToolKeyMode, + action: "resume_policy_enforced_budgeting", + }) + ) + } + return true } + degradedBuiltInTools.delete(toolName) + exhaustedBuiltInTools.add(toolName) + console.warn( + JSON.stringify({ + _tag: "tool_budget_gate", + requestId, + tool: toolName, + source: "builtin", + keyMode: providerToolKeyMode, + retryAfterSeconds: probe.retryAfterSeconds ?? null, + action: "disable_tool_for_remaining_steps", + }) + ) + return false + } catch (error) { + if (isPolicyUnavailableError(error)) { + degradedBuiltInTools.add(toolName) + const softCap = degradedBuiltInSoftCap.getSnapshot(toolName) + const allowed = softCap.remaining > 0 + console.warn( + JSON.stringify({ + _tag: "tool_budget_gate_degraded", + requestId, + tool: toolName, + source: "builtin", + keyMode: providerToolKeyMode, + policyUnavailable: true, + usedCalls: softCap.used, + remainingCalls: softCap.remaining, + maxCalls: softCap.maxCalls, + error: error.message, + action: allowed + ? "allow_tool_with_request_local_soft_cap" + : "disable_tool_until_policy_recovers", + }) + ) + return allowed + } + exhaustedBuiltInTools.add(toolName) + console.warn( + JSON.stringify({ + _tag: "tool_budget_gate_error", + requestId, + tool: toolName, + source: "builtin", + keyMode: providerToolKeyMode, + error: error instanceof Error ?
error.message : String(error), + action: "disable_tool_fail_closed", + }) + ) + return false } } + const searchTools = { ...builtInTools, ...thirdPartyTools } + const allTools = { ...searchTools, ...contentTools, ...platformTools, ...mcpTools } as ToolSet + const hasAnyTools = Object.keys(allTools).length > 0 // Anonymous users get a lower step count to limit tool call cost exposure. @@ -747,20 +1238,27 @@ export async function POST(req: Request) { // The header is safe to apply: @ai-sdk/anthropic@3.0.41 comma-merges // user and inferred betas (getBetasFromHeaders + Array.from(betas).join(",")). // ----------------------------------------------------------------------- + const isTokenEfficient = + process.env.ANTHROPIC_TOKEN_EFFICIENT_TOOLS !== "false" const requestHeaders: Record<string, string> = {} - if (provider === "anthropic" && hasAnyTools) { + if (provider === "anthropic" && hasAnyTools && isTokenEfficient) { requestHeaders["anthropic-beta"] = ANTHROPIC_BETA_HEADERS.tokenEfficient } // Collect all tool metadata for prepareStep tool restriction. - // Merge built-in + third-party metadata (MCP metadata not available here — - // MCP tools are conservatively included in the safe list). + // Merge built-in + third-party + content + platform metadata (MCP metadata + // not available here — MCP tools are conservatively included in the safe list).
const allToolMetadata = new Map([ ...builtInToolMetadata, ...thirdPartyToolMetadata, + ...contentToolMetadata, ...platformToolMetadata, ]) + const toolMetadataByName = buildToolInvocationMetadataByName({ + nonMcpMetadata: allToolMetadata, + mcpToolServerMap, + }) let enrichedSystemPrompt = effectiveSystemPrompt if (isAuthenticated && userAddresses.length > 0) { enrichedSystemPrompt += formatAddressContext(userAddresses) @@ -768,12 +1266,14 @@ export async function POST(req: Request) { const streamStartMs = Date.now() let stepCounter = 0 + let toolMetadataByCallId: ToolInvocationMetadataByCallId = {} // Track reasoning timing for messageMetadata persistence. // The first reasoning chunk records a start timestamp; when text-delta // arrives (reasoning is done) or onFinish fires, we compute elapsed ms. let reasoningStartMs: number | null = null let reasoningDurationMs: number | null = null + let loggedLateStepPolicy = false const result = streamText({ model: aiModel, @@ -782,30 +1282,48 @@ export async function POST(req: Request) { tools: allTools, stopWhen: stepCountIs(maxSteps), - // Restrict tools after PREPARE_STEP_THRESHOLD to prevent runaway - // tool chains. Only tools explicitly marked readOnly: true remain - // available. MCP tools are conservatively included (can't classify - // read/write yet). Unclassified non-MCP tools are restricted - // (fail closed — new tools must opt in via readOnly: true). + // Centralized step gating from the capability policy resolver. + // After PREPARE_STEP_THRESHOLD, only late-step-safe tools remain + // (currently read_only risk class). Unknown risk fails closed. prepareStep: hasAnyTools ? async ({ stepNumber }) => { - if (stepNumber <= PREPARE_STEP_THRESHOLD) return {} - - // Build safe tool list: only tools explicitly marked readOnly. - // New tools that omit readOnly default to RESTRICTED (fail closed). 
- const safeTools: string[] = [] - for (const [name, meta] of allToolMetadata) { - if (meta.readOnly === true) safeTools.push(name) + const isLateStep = stepNumber > PREPARE_STEP_THRESHOLD + const policyToolsForStep = getActiveToolsForStep( + toolPolicy, + stepNumber, + PREPARE_STEP_THRESHOLD + ) + const budgetAllowedTools: string[] = [] + for (const toolName of policyToolsForStep ?? []) { + if (!builtInToolNames.has(toolName)) { + budgetAllowedTools.push(toolName) + continue + } + if (await isBuiltInToolBudgetAllowed(toolName)) { + budgetAllowedTools.push(toolName) + } } - // Include all MCP tools (can't classify read/write yet) - for (const name of Object.keys(mcpTools)) { - if (!safeTools.includes(name)) safeTools.push(name) + + if (isLateStep && !loggedLateStepPolicy) { + loggedLateStepPolicy = true + console.log( + JSON.stringify({ + _tag: "tool_policy_step_gate", + requestId, + chatId, + userId, + model, + stepNumber, + threshold: PREPARE_STEP_THRESHOLD, + earlyToolCount: toolPolicy.earlyToolNames.length, + lateToolCount: budgetAllowedTools.length, + blockedCount: + toolPolicy.earlyToolNames.length - budgetAllowedTools.length, + }) + ) } - // Fail closed: if no safe tools found, no tools available. - // This is intentional — prevents unrestricted tool access - // if readOnly metadata is misconfigured. - return { activeTools: safeTools } + return { activeTools: budgetAllowedTools } } : undefined, @@ -813,10 +1331,65 @@ export async function POST(req: Request) { // Captures tool name, duration, token usage, and success per step. // This data feeds into the existing toolCallLog for trajectory analysis // and future trace-based evaluation. 
- onStepFinish: ({ toolCalls, toolResults, usage, finishReason }) => { + onStepFinish: async ({ toolCalls, toolResults, usage, finishReason }) => { stepCounter++ if (toolCalls.length === 0) return + for (const call of toolCalls) { + if (!builtInToolNames.has(call.toolName)) continue + try { + await builtInPolicyGuard.enforceToolBudget(call.toolName) + if (degradedBuiltInTools.delete(call.toolName)) { + console.warn( + JSON.stringify({ + _tag: "tool_budget_post_accounting_recovered", + requestId, + tool: call.toolName, + source: "builtin", + keyMode: providerToolKeyMode, + action: "resume_policy_enforced_budgeting", + }) + ) + } + } catch (error) { + if (isPolicyUnavailableError(error)) { + degradedBuiltInTools.add(call.toolName) + const softCap = degradedBuiltInSoftCap.recordCall(call.toolName) + console.warn( + JSON.stringify({ + _tag: "tool_budget_post_accounting_degraded", + requestId, + tool: call.toolName, + source: "builtin", + keyMode: providerToolKeyMode, + policyUnavailable: true, + usedCalls: softCap.used, + remainingCalls: softCap.remaining, + maxCalls: softCap.maxCalls, + error: error.message, + action: + softCap.remaining > 0 + ? "allow_tool_with_request_local_soft_cap" + : "disable_tool_until_policy_recovers", + }) + ) + continue + } + exhaustedBuiltInTools.add(call.toolName) + console.warn( + JSON.stringify({ + _tag: "tool_budget_post_accounting_denied", + requestId, + tool: call.toolName, + source: "builtin", + keyMode: providerToolKeyMode, + error: error instanceof Error ? error.message : String(error), + action: "disable_tool_for_remaining_steps", + }) + ) + } + } + for (const call of toolCalls) { const result = toolResults.find( (r) => r.toolCallId === call.toolCallId @@ -825,18 +1398,26 @@ export async function POST(req: Request) { ? !(result as { isError?: boolean }).isError : false const meta = allToolMetadata.get(call.toolName) + const trace = traceCollector.get(call.toolCallId) // Structured JSON log — parseable by Vercel log drain and grep. 
// Uses _tag for machine filtering without affecting human readability. console.log( JSON.stringify({ _tag: "tool_trace", + requestId, chatId, + userId, step: stepCounter, tool: meta?.displayName ?? call.toolName, source: meta?.source ?? "unknown", success, - durationMs: traceCollector.get(call.toolCallId)?.durationMs ?? null, + durationMs: trace?.durationMs ?? null, + estimatedCostPer1k: meta?.estimatedCostPer1k ?? null, + errorCode: trace?.errorCode ?? null, + retryAfterSeconds: trace?.retryAfterSeconds ?? null, + budgetKeyMode: trace?.budgetKeyMode ?? null, + budgetDenied: trace?.budgetDenied ?? null, tokens: { in: usage?.inputTokens ?? null, out: usage?.outputTokens ?? null, @@ -909,6 +1490,19 @@ export async function POST(req: Request) { }, onFinish: ({ text, usage, steps, finishReason }) => { + if (steps) { + const resolvedByCallId: ToolInvocationMetadataByCallId = {} + for (const step of steps) { + for (const toolCall of step.toolCalls ?? []) { + const resolved = toolMetadataByName[toolCall.toolName] + if (resolved) { + resolvedByCallId[toolCall.toolCallId] = resolved + } + } + } + toolMetadataByCallId = resolvedByCallId + } + // Freeze reasoning duration if it wasn't already frozen by text-delta // (e.g. reasoning-only responses with no text output, or errors) if (reasoningStartMs !== null && reasoningDurationMs === null) { @@ -945,7 +1539,7 @@ export async function POST(req: Request) { console.log( `[chat] Anthropic tool usage — inputTokens: ${usage?.inputTokens ?? "?"}, ` + `toolCount: ${Object.keys(allTools).length}, ` + - `tokenEfficient: true` + `tokenEfficient: ${isTokenEfficient}` ) } @@ -974,13 +1568,6 @@ export async function POST(req: Request) { // PostHog: unified tool call events — one event per tool invocation (all sources) // Replaces the previous MCP-only mcp_tool_call event. 
if (steps) { - // Combine all metadata maps for source identification - const allToolMetadata = new Map([ - ...builtInToolMetadata, - ...thirdPartyToolMetadata, - ...platformToolMetadata, - ]) - for (const step of steps) { if (step.toolCalls) { for (const toolCall of step.toolCalls) { @@ -1002,6 +1589,7 @@ export async function POST(req: Request) { const success = toolResult ? !(toolResult as { isError?: boolean }).isError : false + const trace = traceCollector.get(toolCall.toolCallId) phClient.capture({ distinctId: userId, @@ -1014,7 +1602,12 @@ export async function POST(req: Request) { success, chatId, // Phase C: Observability enrichment - durationMs: traceCollector.get(toolCall.toolCallId)?.durationMs ?? undefined, + durationMs: trace?.durationMs ?? undefined, + errorCode: trace?.errorCode, + retryAfterSeconds: trace?.retryAfterSeconds, + budgetKeyMode: trace?.budgetKeyMode, + budgetDenied: trace?.budgetDenied, + requestId, // MCP-specific (optional) ...(mcpServerInfo && { serverId: mcpServerInfo.serverId, @@ -1060,12 +1653,7 @@ export async function POST(req: Request) { ? !(toolResult as { isError?: boolean }).isError : false - // Extract data from envelope for preview — avoids wasting - // 500 chars on envelope metadata ({"ok":true,"data":...}). - const output = toolResult?.output - const previewData = isToolResultEnvelope(output) - ? 
output.data - : output + const previewData = toolResult?.output const trace = traceCollector.get(toolCall.toolCallId) void fetchMutation( @@ -1089,6 +1677,11 @@ export async function POST(req: Request) { inputTokens: step.usage?.inputTokens, outputTokens: step.usage?.outputTokens, resultSizeBytes: trace?.resultSizeBytes, + requestId, + errorCode: trace?.errorCode, + retryAfterSeconds: trace?.retryAfterSeconds, + budgetKeyMode: trace?.budgetKeyMode, + budgetDenied: trace?.budgetDenied, }, { token: convexToken } ).catch(() => { @@ -1102,14 +1695,7 @@ export async function POST(req: Request) { // Audit log: persist built-in + third-party tool calls (fire-and-forget). // Identifies non-MCP tools by checking if the tool name is NOT in mcpToolServerMap. if (convexToken && steps) { - // Combine built-in and third-party metadata maps - const nonMcpMetadata = new Map([ - ...builtInToolMetadata, - ...thirdPartyToolMetadata, - ...platformToolMetadata, - ]) - - if (nonMcpMetadata.size > 0) { + if (allToolMetadata.size > 0) { let finishStepNumber = 0 for (const step of steps) { @@ -1120,7 +1706,7 @@ export async function POST(req: Request) { // Skip MCP tools (already logged above) if (mcpToolServerMap.get(toolCall.toolName)) continue - const meta = nonMcpMetadata.get(toolCall.toolName) + const meta = allToolMetadata.get(toolCall.toolName) if (!meta) continue // Unknown tool — skip const toolResult = step.toolResults?.find( @@ -1131,11 +1717,9 @@ export async function POST(req: Request) { ? !(toolResult as { isError?: boolean }).isError : false - // For non-MCP tools, check if result is enveloped (Exa uses envelopes) - const output = toolResult?.output - const previewData = isToolResultEnvelope(output) - ? output.data - : output + const previewData = toolResult?.output + + const trace = traceCollector.get(toolCall.toolCallId) void fetchMutation( api.toolCallLog.log, @@ -1149,18 +1733,19 @@ export async function POST(req: Request) { ? 
JSON.stringify(previewData).slice(0, 500) : undefined, success, - // FIX: Use actual per-tool duration instead of total stream duration. - // For Exa, the envelope's meta.durationMs has the real timing. - // For builtin tools, we don't have per-tool timing (no trace collector). - durationMs: isToolResultEnvelope(output) - ? output.meta.durationMs - : undefined, + durationMs: trace?.durationMs, source: meta.source, serviceName: meta.serviceName, // Phase C: Observability enrichment stepNumber: finishStepNumber, inputTokens: step.usage?.inputTokens, outputTokens: step.usage?.outputTokens, + resultSizeBytes: trace?.resultSizeBytes, + requestId, + errorCode: trace?.errorCode, + retryAfterSeconds: trace?.retryAfterSeconds, + budgetKeyMode: trace?.budgetKeyMode, + budgetDenied: trace?.budgetDenied, }, { token: convexToken } ).catch(() => { @@ -1178,8 +1763,14 @@ export async function POST(req: Request) { sendReasoning: true, sendSources: true, messageMetadata: ({ part }) => { - if (part.type === "finish" && reasoningDurationMs !== null) { - return { reasoningDurationMs } + if (part.type === "start") { + return buildStartToolInvocationStreamMetadata(toolMetadataByName) + } + if (part.type === "finish") { + return buildFinishToolInvocationStreamMetadata({ + toolMetadataByCallId, + reasoningDurationMs, + }) } return {} }, diff --git a/app/components/chat/message-assistant.tsx b/app/components/chat/message-assistant.tsx index 007104d3..6fe5a17e 100644 --- a/app/components/chat/message-assistant.tsx +++ b/app/components/chat/message-assistant.tsx @@ -225,7 +225,10 @@ export function MessageAssistant({ {toolInvocationParts && toolInvocationParts.length > 0 && preferences.showToolInvocations && ( - + )} {showToolProgress && ( diff --git a/app/components/chat/tool-invocation.tsx b/app/components/chat/tool-invocation.tsx index 92fe052b..21f4bbba 100644 --- a/app/components/chat/tool-invocation.tsx +++ b/app/components/chat/tool-invocation.tsx @@ -3,8 +3,15 @@ import { cn } from 
"@/lib/utils" import type { ToolUIPart } from 'ai' import { getStaticToolName } from 'ai' +import { + humanizeToolName, + resolveToolInvocationMetadata, + type ToolInvocationDisplayMetadata, + type ToolInvocationStreamMetadata, +} from "@/lib/tools/ui-metadata" import { HugeiconsIcon } from "@hugeicons/react" import { + AlertCircleIcon, ArrowDown01Icon, CheckmarkCircle01Icon, SourceCodeIcon, @@ -12,6 +19,7 @@ import { NutIcon, Loading01Icon, Search01Icon, + FileSearchIcon, Wrench01Icon, } from "@hugeicons-pro/core-stroke-rounded" import { AnimatePresence, motion } from "framer-motion" @@ -19,6 +27,7 @@ import { useMemo, useState } from "react" type ToolInvocationProps = { toolInvocations: ToolUIPart[] + metadata?: Record className?: string defaultOpen?: boolean } @@ -32,30 +41,119 @@ const TRANSITION = { /** Maps built-in tool names to human-readable display names and icons */ const BUILTIN_TOOL_DISPLAY: Record< string, - { name: string; icon: "search" | "code" | "image" | "extract" } + { name: string; icon: "search" | "code" | "image" | "extract" | "wrench" } > = { web_search: { name: "Web Search", icon: "search" }, google_search: { name: "Web Search", icon: "search" }, + extract_content: { name: "Read Page", icon: "extract" }, + pay_purchase: { name: "Purchase", icon: "wrench" }, + pay_status: { name: "Purchase Status", icon: "wrench" }, // Future built-in tools: // code_execution: { name: "Code Execution", icon: "code" }, // image_generation: { name: "Image Generation", icon: "image" }, } -/** Resolve icon component from BUILTIN_TOOL_DISPLAY icon identifier */ -function getToolIcon(iconId: "search" | "code" | "image" | "extract") { +/** Resolve icon component from metadata icon identifier */ +function getToolIcon(iconId: NonNullable) { switch (iconId) { case "search": return Search01Icon + case "extract": + return FileSearchIcon + case "wrench": + return Wrench01Icon default: return Wrench01Icon } } +function isToolSource(value: unknown): value is 
ToolInvocationDisplayMetadata["source"] { + return ( + value === "builtin" || + value === "third-party" || + value === "mcp" || + value === "platform" + ) +} + +function isToolIcon(value: unknown): value is NonNullable { + return ( + value === "search" || + value === "code" || + value === "image" || + value === "extract" || + value === "wrench" + ) +} + +function isToolInvocationDisplayMetadata( + value: unknown +): value is ToolInvocationDisplayMetadata { + if (typeof value !== "object" || value === null) return false + const candidate = value as Record + if (typeof candidate.displayName !== "string") return false + if (!isToolSource(candidate.source)) return false + if (typeof candidate.serviceName !== "string") return false + if (candidate.icon !== undefined && !isToolIcon(candidate.icon)) return false + if ( + candidate.estimatedCostPer1k !== undefined && + typeof candidate.estimatedCostPer1k !== "number" + ) return false + if (candidate.readOnly !== undefined && typeof candidate.readOnly !== "boolean") return false + if (candidate.destructive !== undefined && typeof candidate.destructive !== "boolean") return false + if (candidate.idempotent !== undefined && typeof candidate.idempotent !== "boolean") return false + if (candidate.openWorld !== undefined && typeof candidate.openWorld !== "boolean") return false + return true +} + +function toMetadataRecord( + value: unknown +): Record { + if (typeof value !== "object" || value === null) return {} + const record = value as Record + const parsed: Record = {} + + for (const [key, candidate] of Object.entries(record)) { + if (isToolInvocationDisplayMetadata(candidate)) { + parsed[key] = candidate + } + } + + return parsed +} + +function getToolMetadataMaps(metadata?: Record) { + return { + byName: toMetadataRecord(metadata?.toolMetadataByName), + byCallId: toMetadataRecord(metadata?.toolMetadataByCallId), + } +} + +function formatSource(source: ToolInvocationDisplayMetadata["source"]): string { + switch (source) { + 
case "builtin": + return "Built-in" + case "third-party": + return "Third-party" + case "platform": + return "Platform" + case "mcp": + return "MCP" + default: + return "Unknown" + } +} + export function ToolInvocation({ toolInvocations, + metadata, defaultOpen = false, }: ToolInvocationProps) { const [isExpanded, setIsExpanded] = useState(defaultOpen) + const { byName, byCallId } = useMemo( + () => getToolMetadataMaps(metadata), + [metadata] + ) const toolInvocationsData = Array.isArray(toolInvocations) ? toolInvocations @@ -81,6 +179,8 @@ export function ToolInvocation({ return ( @@ -138,6 +238,8 @@ export function ToolInvocation({ > ) @@ -154,12 +256,16 @@ export function ToolInvocation({ type SingleToolViewProps = { toolInvocations: ToolUIPart[] + metadataByName: Record + metadataByCallId: Record defaultOpen?: boolean className?: string } function SingleToolView({ toolInvocations, + metadataByName, + metadataByCallId, defaultOpen = false, className, }: SingleToolViewProps) { @@ -201,6 +307,8 @@ function SingleToolView({ return ( @@ -215,6 +323,8 @@ function SingleToolView({ ))} @@ -226,27 +336,48 @@ function SingleToolView({ // New component to handle individual tool cards function SingleToolCard({ toolData, + metadataByName, + metadataByCallId, defaultOpen = false, className, }: { toolData: ToolUIPart + metadataByName: Record + metadataByCallId: Record defaultOpen?: boolean className?: string }) { const [isExpanded, setIsExpanded] = useState(defaultOpen) const { state, toolCallId } = toolData - // v6: Get tool name using official helper. - // NOTE: For MCP tools, this returns the namespaced name (e.g. "my_github_server_create_issue"). - // Displaying a cleaner name requires passing toolServerMap from the chat route via stream - // metadata, which is planned for v1.1. Until then, the namespaced name is shown as-is. 
const toolName = getStaticToolName(toolData) + const streamMetadata: ToolInvocationStreamMetadata = { + toolMetadataByName: metadataByName, + toolMetadataByCallId: metadataByCallId, + } + const runtimeMetadata = resolveToolInvocationMetadata({ + toolName, + toolCallId, + streamMetadata, + }) const displayInfo = BUILTIN_TOOL_DISPLAY[toolName] ?? null - const displayName = displayInfo?.name ?? toolName - const ToolIcon = displayInfo ? getToolIcon(displayInfo.icon) : Wrench01Icon + const displayName = + runtimeMetadata?.displayName ?? + displayInfo?.name ?? + humanizeToolName(toolName) + const iconId = runtimeMetadata?.icon ?? displayInfo?.icon ?? "wrench" + const ToolIcon = getToolIcon(iconId) + const source = runtimeMetadata?.source + const serviceName = runtimeMetadata?.serviceName + const estimatedCostPer1k = runtimeMetadata?.estimatedCostPer1k + const readOnly = runtimeMetadata?.readOnly + const destructive = runtimeMetadata?.destructive + const idempotent = runtimeMetadata?.idempotent + const openWorld = runtimeMetadata?.openWorld const args = toolData.input as Record | undefined const isLoading = state === "input-available" || state === "input-streaming" const isCompleted = state === "output-available" const result = isCompleted ? toolData.output : undefined + const isError = isCompleted && result != null && typeof result === "object" && "isError" in result && (result as Record).isError === true // Parse the result JSON if available const { parsedResult, parseError } = useMemo(() => { @@ -421,7 +552,18 @@ function SingleToolCard({ >
-            {displayName}
+
+              {displayName}
+
+            {(source || serviceName) && (
+
+                {[source ? formatSource(source) : null, serviceName]
+                  .filter(Boolean)
+                  .join(" · ")}
+
+            )}
+
           {isLoading ? (
+          ) : isError ? (
+
+
+
+              Failed
+
+
+
           ) : (
           )}
+        {(source ||
+          serviceName ||
+          typeof estimatedCostPer1k === "number" ||
+          typeof readOnly === "boolean" ||
+          typeof destructive === "boolean" ||
+          typeof idempotent === "boolean" ||
+          typeof openWorld === "boolean") && (
+
+
+              Tool info
+
+
+            {source && (
+
+                Source:{" "}
+                {formatSource(source)}
+
+            )}
+            {serviceName && (
+
+                Service:{" "}
+                {serviceName}
+
+            )}
+            {typeof estimatedCostPer1k === "number" && (
+
+                Estimated cost:{" "}
+                ${estimatedCostPer1k.toFixed(2)} / 1k calls
+
+            )}
+            {typeof readOnly === "boolean" && (
+
+                Read-only:{" "}
+                {readOnly ? "Yes" : "No"}
+
+            )}
+            {typeof destructive === "boolean" && (
+
+                Destructive:{" "}
+                {destructive ? "Yes" : "No"}
+
+            )}
+            {typeof idempotent === "boolean" && (
+
+                Idempotent:{" "}
+                {idempotent ? "Yes" : "No"}
+
+            )}
+            {typeof openWorld === "boolean" && (
+
+                Open-world:{" "}
+                {openWorld ? "Yes" : "No"}
+
+            )}
+
+
+        )}
+
           {/* Tool call ID */}
diff --git a/app/components/chat/use-model.ts b/app/components/chat/use-model.ts
index caec8747..8792ec52 100644
--- a/app/components/chat/use-model.ts
+++ b/app/components/chat/use-model.ts
@@ -28,16 +28,20 @@ export function useModel({
   chatId,
 }: UseModelProps) {
   // Get favorite models and last-used model from ModelProvider
-  const { favoriteModels, lastUsedModel, setLastUsedModel } =
+  const { favoriteModels, lastUsedModel, modelPrefsHydrated, setLastUsedModel } =
     useModelProvider()
 
   // Calculate the effective model based on priority: chat model > last used > first favorite > default
   const getEffectiveModel = useCallback(() => {
-    const firstFavoriteModel = favoriteModels[0]
+    const hydratedLastUsedModel = modelPrefsHydrated ? lastUsedModel : null
+    const firstFavoriteModel = modelPrefsHydrated ? favoriteModels[0] : null
     return (
-      currentChat?.model || lastUsedModel || firstFavoriteModel || MODEL_DEFAULT
+      currentChat?.model ||
+      hydratedLastUsedModel ||
+      firstFavoriteModel ||
+      MODEL_DEFAULT
     )
-  }, [currentChat?.model, lastUsedModel, favoriteModels])
+  }, [currentChat?.model, favoriteModels, lastUsedModel, modelPrefsHydrated])
 
   // Use local state only for temporary overrides, derive base value from props
   const [localSelectedModel, setLocalSelectedModel] = useState(
diff --git a/app/globals.css b/app/globals.css
index 17d258a4..ead80a14 100644
--- a/app/globals.css
+++ b/app/globals.css
@@ -157,6 +157,100 @@
       box-shadow: 0 0 8px 1px color-mix(in oklab, var(--foreground) 45%, transparent);
     }
   }
+
+  @keyframes spinner-fade {
+    0% {
+      opacity: 1;
+    }
+    100% {
+      opacity: 0;
+    }
+  }
+
+  @keyframes thin-pulse {
+    0%,
+    100% {
+      transform: scale(0.8);
+      opacity: 0.5;
+    }
+    50% {
+      transform: scale(1);
+      opacity: 1;
+    }
+  }
+
+  @keyframes pulse-dot {
+    0%,
+    100% {
+      transform: scale(0.8);
+      opacity: 0.5;
+    }
+    50% {
+      transform: scale(1.3);
+      opacity: 1;
+    }
+  }
+
+  @keyframes bounce-dots {
+    0%,
+    80%,
+    100% {
+      transform: scale(0);
+    }
+    40% {
+      transform: scale(1);
+    }
+  }
+
+  @keyframes typing {
+    0%,
+    100% {
+      opacity: 0.2;
+    }
+    50% {
+      opacity: 1;
+    }
+  }
+
+  @keyframes wave {
+    0%,
+    100% {
+      transform: scaleY(0.5);
+    }
+    50% {
+      transform: scaleY(1.2);
+    }
+  }
+
+  @keyframes wave-bars {
+    0%,
+    100% {
+      transform: scaleY(0.4);
+    }
+    50% {
+      transform: scaleY(1);
+    }
+  }
+
+  @keyframes blink {
+    0%,
+    100% {
+      opacity: 1;
+    }
+    50% {
+      opacity: 0;
+    }
+  }
+
+  @keyframes text-blink {
+    0%,
+    100% {
+      opacity: 1;
+    }
+    50% {
+      opacity: 0.3;
+    }
+  }
 }
 
 @theme inline {
diff --git a/app/test/thinking-states/page.tsx b/app/test/thinking-states/page.tsx
index 5701a0cc..a9baa64e 100644
--- a/app/test/thinking-states/page.tsx
+++ b/app/test/thinking-states/page.tsx
@@ -32,7 +32,7 @@ import {
   Copy01Icon,
 } from "@hugeicons-pro/core-stroke-rounded"
 import type { SourceUrlUIPart, ToolUIPart } from "ai"
-import { useCallback, useState, useEffect } from "react"
+import { useCallback, useState, useEffect, useId } from "react"
 
 // ─── Constants ───────────────────────────────────────────────────────────────
 
@@ -286,20 +286,51 @@ function StateAnnotation({
   )
 }
 
+function ArticleWrapper({
+  children,
+  role,
+}: {
+  children: React.ReactNode
+  role: "user" | "assistant"
+}) {
+  return (
+
+
+        {children}
+
+  )
+}
+
 function UserBubble({ children }: { children: string }) {
   const isMultiline = children.includes("\n")
 
   return (
-
-
+
-
-      {children}
-
-
+
+        You said:
+
+          {children}
+
+
+  )
 }
@@ -310,26 +341,54 @@ function AssistantShell({
   children: React.ReactNode
   isLast?: boolean
 }) {
+  const msgId = useId()
 
   return (
-
-
+
+
-
-      {children}
-
-
+
+        Assistant said:
+
+          {children}
+
+
+  )
 }
 
 function CopyRegenActions() {
   return (
-
+