| 1 | +# WebSearch — Strategy & Design Decisions |
| 2 | + |
| 3 | +A Go metasearch data source for OpenBotKit. No API keys. No external services. Just HTTP requests + HTML parsing running locally. |
| 4 | + |
| 5 | +Inspired by [ddgs](https://github.com/deedy5/ddgs) (Python), rebuilt in Go as a first-class openbotkit source. |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Problem |
| 10 | + |
| 11 | +Most open-source coding agents delegate web search to their model provider's API, leaving no standalone, self-contained, local web search capability for agents to use. We need a Go equivalent of ddgs integrated into openbotkit: search, news, and page fetching accessible via CLI and consumed by agents as skills. |
| 12 | + |
| 13 | +## Design Principles |
| 14 | + |
| 15 | +1. **Follows openbotkit patterns** — Source interface, CLI commands via `obk websearch`, skills as SKILL.md/REFERENCE.md, SQLite for caching/history. |
| 16 | +2. **Zero dependencies on external services** — No API keys, no Docker, no servers. Just HTTP scraping. |
| 17 | +3. **Composable** — Search, news, and fetch are separate skills. The agent orchestrates. |
| 18 | +4. **Resilient** — Multiple backends with concurrent dispatch, automatic fallback, per-host rate limiting, and health tracking. |
| 19 | + |
| 20 | +--- |
| 21 | + |
| 22 | +## Architecture |
| 23 | + |
| 24 | +``` |
| 25 | +Skills Layer |
| 26 | + web-search, web-fetch, web-news (SKILL.md + REFERENCE.md each) |
| 27 | + │ |
| 28 | +CLI Layer (obk websearch ...) |
| 29 | + search, news, fetch, backends, history, cache clear |
| 30 | + │ |
| 31 | +source/websearch/ (Go package) |
| 32 | + ┌─────────────────────────────────────────────┐ |
| 33 | + │ Orchestrator │ |
| 34 | + │ - Backend selection (auto / explicit) │ |
| 35 | + │ - Concurrent dispatch (errgroup) │ |
| 36 | + │ - Result ranking & deduplication │ |
| 37 | + │ - Health tracking (exponential cooldown) │ |
| 38 | + ├─────────────────────────────────────────────┤ |
| 39 | + │ Engines (each: HTTP request + HTML parse) │ |
| 40 | + │ DDG, Brave, Mojeek, Wikipedia, Yahoo, │ |
| 41 | + │ Yandex, Google, Bing │ |
| 42 | + ├─────────────────────────────────────────────┤ |
| 43 | + │ httpclient/ │ |
| 44 | + │ - Wraps internal/browser (utls transport) │ |
| 45 | + │ - UA rotation (4 browser profiles) │ |
| 46 | + │ - Per-host token bucket rate limiting │ |
| 47 | + ├─────────────────────────────────────────────┤ |
| 48 | + │ SQLite (cache + history) │ |
| 49 | + │ search_cache, fetch_cache, search_history │ |
| 50 | + └─────────────────────────────────────────────┘ |
| 51 | +``` |
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## Key Decisions |
| 56 | + |
| 57 | +### 1. Flat package over engine/ subdirectory |
| 58 | + |
| 59 | +**Decision:** Engines live directly in `source/websearch/` (e.g., `duckduckgo.go`, `brave.go`), not in a `source/websearch/engine/` subdirectory. |
| 60 | + |
| 61 | +**Why:** The engines are small (50-120 lines each) and tightly coupled to the orchestrator. A subdirectory adds import ceremony for no benefit. Each engine file is self-contained — struct, constructor, `Search()`/`News()` method, and any helper functions. |
| 62 | + |
| 63 | +### 2. No parse/ package |
| 64 | + |
| 65 | +**Decision:** HTML parsing helpers stay private and co-located with each engine. |
| 66 | + |
| 67 | +**Why:** Each engine has its own HTML structure and parsing quirks. Extracting shared "parse helpers" would create false abstractions — the parsing code between DuckDuckGo and Brave has nothing in common. Functions are small, private, and belong next to their callers. |
| 68 | + |
| 69 | +### 3. HTTPDoer interface over concrete *http.Client |
| 70 | + |
| 71 | +**Decision:** Engines accept `HTTPDoer` interface (`Do(*http.Request) (*http.Response, error)`) instead of `*http.Client`. |
| 72 | + |
| 73 | +**Why:** This lets engines work with both `*http.Client` (in tests, using httptest) and `*httpclient.Client` (in production, with UA rotation + rate limiting). Engines stay testable without mocking infrastructure. |
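
A minimal sketch of how decision 3 plays out in a test. The `HTTPDoer` interface is as described above; `exampleEngine`, `fetchBody`, and `searchAgainstStub` are hypothetical names invented for illustration and are not taken from the openbotkit codebase.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// HTTPDoer is the single-method interface described above.
// *http.Client satisfies it without any adapter.
type HTTPDoer interface {
	Do(*http.Request) (*http.Response, error)
}

// exampleEngine is a hypothetical engine used only for illustration.
type exampleEngine struct {
	client  HTTPDoer
	baseURL string
}

// fetchBody sends a GET through the injected client and returns the body.
func (e *exampleEngine) fetchBody(query string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, e.baseURL+"?q="+url.QueryEscape(query), nil)
	if err != nil {
		return "", err
	}
	resp, err := e.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	// Cap the read at 1 MiB so a misbehaving server cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	return string(body), err
}

// searchAgainstStub runs the engine against an in-process httptest server,
// passing the server's plain *http.Client as the HTTPDoer.
func searchAgainstStub(query string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "results for %s", r.URL.Query().Get("q"))
	}))
	defer srv.Close()

	e := &exampleEngine{client: srv.Client(), baseURL: srv.URL}
	return e.fetchBody(query)
}

func main() {
	body, err := searchAgainstStub("golang")
	fmt.Println(body, err)
}
```

In production the same engine would receive the rate-limited `*httpclient.Client` instead; the engine code does not change.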
| 74 | + |
| 75 | +### 4. Multi-backend fallback over exponential backoff |
| 76 | + |
| 77 | +**Decision:** No `cenkalti/backoff` dependency. Retry strategy is multi-backend fallback, not per-request retries. |
| 78 | + |
| 79 | +**Why:** This is a CLI tool, not a long-running service. When a user runs a search, they want results in seconds, not after a backoff sequence. If DuckDuckGo fails, we immediately try Brave, Mojeek, etc. The health tracker prevents repeatedly hitting a broken backend (exponential cooldown: 30s → 5min). This gives better UX than retrying the same failing backend with increasing delays. |
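
The cooldown behaviour can be sketched roughly as follows. The source gives only the bounds (30s and 5min); doubling per consecutive failure is an assumption, and `healthTracker`, `cooldownFor`, and the method names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseCooldown = 30 * time.Second // first failure
	maxCooldown  = 5 * time.Minute  // upper bound
)

// cooldownFor returns how long a backend is skipped after the given number
// of consecutive failures. Doubling per failure is an assumed growth rate.
func cooldownFor(failures int) time.Duration {
	if failures <= 0 {
		return 0
	}
	d := baseCooldown
	for i := 1; i < failures; i++ {
		d *= 2
		if d >= maxCooldown {
			return maxCooldown
		}
	}
	return d
}

// healthTracker is a hypothetical, non-thread-safe tracker for illustration.
type healthTracker struct {
	failures     map[string]int
	blockedUntil map[string]time.Time
}

func newHealthTracker() *healthTracker {
	return &healthTracker{
		failures:     map[string]int{},
		blockedUntil: map[string]time.Time{},
	}
}

// recordFailure bumps the failure count and extends the cooldown window.
func (h *healthTracker) recordFailure(name string, now time.Time) {
	h.failures[name]++
	h.blockedUntil[name] = now.Add(cooldownFor(h.failures[name]))
}

// recordSuccess clears all failure state for the backend.
func (h *healthTracker) recordSuccess(name string) {
	delete(h.failures, name)
	delete(h.blockedUntil, name)
}

// available reports whether the backend may be tried at time now.
func (h *healthTracker) available(name string, now time.Time) bool {
	return !now.Before(h.blockedUntil[name])
}

func main() {
	for n := 1; n <= 6; n++ {
		fmt.Printf("failures=%d cooldown=%s\n", n, cooldownFor(n))
	}
	h := newHealthTracker()
	now := time.Now()
	h.recordFailure("duckduckgo", now)
	fmt.Println(h.available("duckduckgo", now.Add(10*time.Second))) // still cooling down
	fmt.Println(h.available("duckduckgo", now.Add(31*time.Second))) // cooldown elapsed
}
```

A backend in cooldown is simply skipped, so the user never waits on it; the remaining backends answer the query.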
| 80 | + |
| 81 | +### 5. Concurrent dispatch with errgroup |
| 82 | + |
| 83 | +**Decision:** All backends in the auto set run concurrently via `errgroup`, not sequentially. |
| 84 | + |
| 85 | +**Why:** With 4 default backends, sequential dispatch means total latency = sum of all backend latencies. Concurrent dispatch means latency = max(backend latencies). Results are collected with a mutex, then sorted by engine priority before ranking to maintain deterministic output. |
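
A rough sketch of the collect-then-sort pattern. To stay standard-library only, it swaps in `sync.WaitGroup` for `errgroup`; the shape is otherwise the same. The `backend` and `result` types, `dispatch`, `demoDispatch`, and the name-based tie-break are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
	"sync"
)

// result and backend are hypothetical types for this sketch.
type result struct {
	Engine   string
	Priority int
	URL      string
}

type backend struct {
	name     string
	priority int
	search   func(ctx context.Context, query string) ([]string, error)
}

// dispatch runs every backend concurrently, collects results under a mutex,
// and then sorts by priority so output does not depend on goroutine timing.
func dispatch(ctx context.Context, backends []backend, query string) []result {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []result
	)
	for _, b := range backends {
		wg.Add(1)
		go func(b backend) {
			defer wg.Done()
			urls, err := b.search(ctx, query)
			if err != nil {
				return // a failed backend contributes nothing
			}
			mu.Lock()
			defer mu.Unlock()
			for _, u := range urls {
				out = append(out, result{Engine: b.name, Priority: b.priority, URL: u})
			}
		}(b)
	}
	wg.Wait()

	// Higher priority first; ties broken by engine name (an assumption).
	// The stable sort keeps each engine's own result order intact.
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority > out[j].Priority
		}
		return out[i].Engine < out[j].Engine
	})
	return out
}

// demoDispatch wires up three stub backends, one of which fails.
func demoDispatch() []result {
	stub := func(urls ...string) func(context.Context, string) ([]string, error) {
		return func(context.Context, string) ([]string, error) { return urls, nil }
	}
	failing := func(context.Context, string) ([]string, error) {
		return nil, errors.New("blocked")
	}
	backends := []backend{
		{name: "duckduckgo", priority: 1, search: stub("https://go.dev/", "https://go.dev/doc/")},
		{name: "brave", priority: 1, search: failing},
		{name: "wikipedia", priority: 2, search: stub("https://en.wikipedia.org/wiki/Go_(programming_language)")},
	}
	return dispatch(context.Background(), backends, "golang")
}

func main() {
	for _, r := range demoDispatch() {
		fmt.Printf("%-10s p=%d %s\n", r.Engine, r.Priority, r.URL)
	}
}
```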
| 86 | + |
| 87 | +### 6. Separate fetch client with SSRF protection |
| 88 | + |
| 89 | +**Decision:** `fetchClient()` (for `obk websearch fetch`) uses a raw `*http.Client` with SSRF guards. `httpClient()` (for search engines) uses `*httpclient.Client` without SSRF guards. |
| 90 | + |
| 91 | +**Why:** Search engines hit known external hosts (duckduckgo.com, bing.com, etc.) — SSRF isn't a concern. But `fetch` takes arbitrary user-provided URLs, so it must resolve DNS and block private IPs (loopback, RFC1918, link-local) to prevent SSRF. It also pins to the first resolved IP to prevent DNS rebinding (TOCTOU). |
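
The address check at the heart of this guard might look like the sketch below. It covers only the validate-and-pin step; DNS resolution and dialing the pinned address are omitted. `isBlockedIP` and `pinFirstPublic` are hypothetical names, and rejecting the request when *any* resolved address is private is an assumption (stricter than checking only the pinned one).

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isBlockedIP reports whether ip falls in a range that a fetcher taking
// arbitrary URLs should refuse: loopback, RFC 1918 / unique-local,
// link-local, or unspecified.
func isBlockedIP(ip net.IP) bool {
	return ip.IsLoopback() ||
		ip.IsPrivate() || // requires Go 1.17+
		ip.IsLinkLocalUnicast() ||
		ip.IsLinkLocalMulticast() ||
		ip.IsUnspecified()
}

// pinFirstPublic validates every resolved address and returns the first one.
// The caller would then dial exactly that address, so a second DNS lookup
// cannot swap in a private IP between check and use.
func pinFirstPublic(resolved []string) (string, error) {
	if len(resolved) == 0 {
		return "", errors.New("no addresses resolved")
	}
	for _, s := range resolved {
		ip := net.ParseIP(s)
		if ip == nil {
			return "", fmt.Errorf("invalid IP %q", s)
		}
		if isBlockedIP(ip) {
			return "", fmt.Errorf("blocked address %s", s)
		}
	}
	return resolved[0], nil
}

func main() {
	for _, addrs := range [][]string{
		{"93.184.216.34"},
		{"127.0.0.1"},
		{"10.0.0.5"},
		{"169.254.169.254"},
		{"93.184.216.34", "192.168.1.1"},
	} {
		pinned, err := pinFirstPublic(addrs)
		fmt.Printf("%v -> pinned=%q err=%v\n", addrs, pinned, err)
	}
}
```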
| 92 | + |
| 93 | +### 7. UA rotation per client instance, not per request |
| 94 | + |
| 95 | +**Decision:** A random browser profile (Chrome/Firefox/Safari/Edge) is selected when `httpclient.Client` is created and reused for all requests from that client. |
| 96 | + |
| 97 | +**Why:** A real browser sends the same UA across a session. Rotating per-request looks suspicious to anti-bot systems. The client is created once per `WebSearch` instance and reused across searches. |
| 98 | + |
| 99 | +### 8. Auto backend set: DDG + Brave + Mojeek + Wikipedia |
| 100 | + |
| 101 | +**Decision:** The default `auto` set uses DuckDuckGo, Brave, Mojeek, and Wikipedia. Yahoo, Yandex, Google, and Bing are opt-in only via `--backend <name>`. |
| 102 | + |
| 103 | +**Why:** |
| 104 | +- **DuckDuckGo** — Most reliable scraping target. No-JS HTML endpoint. |
| 105 | +- **Brave** — Good quality, less aggressive anti-bot than Google. |
| 106 | +- **Mojeek** — Very permissive, independent index (not Bing/Google derivative). |
| 107 | +- **Wikipedia** — JSON API (no scraping needed), always high-quality for factual queries. |
| 108 | +- **Google** — Most aggressive anti-bot, CAPTCHAs likely. Opt-in only. |
| 109 | +- **Bing** — Disabled in ddgs too. Opt-in only. |
| 110 | +- **Yahoo/Yandex** — Redirect URL unwrapping adds fragility. Opt-in only. |
| 111 | + |
| 112 | +Users can override via config (`websearch.backends` list) or `--backend` flag. |
| 113 | + |
| 114 | +### 9. Ranking: frequency + token scoring + Wikipedia priority |
| 115 | + |
| 116 | +**Decision:** Results are ranked by: multi-backend appearance bonus, query token scoring (title weight 2x, snippet weight 1x), Wikipedia +10 bonus. Stable sort preserves original order for equal scores. |
| 117 | + |
| 118 | +**Why:** Simple, predictable, no ML. A result returned by three backends is likely more relevant than one returned by a single backend. Title matches matter more than snippet matches. Wikipedia is almost always the best single result for factual queries; the +10 bonus lifts it to the top for typical queries without suppressing other results. |
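
One way the scoring could be sketched. The title/snippet weights and the Wikipedia bonus come from the decision above; the size of the multi-backend bonus (+1 per additional backend), URL-based deduplication, and substring token matching are assumptions. `result`, `scored`, and `rank` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// result is a hypothetical raw hit from one backend.
type result struct {
	Title, Snippet, URL, Engine string
}

// scored is a deduplicated hit with its backend count and final score.
type scored struct {
	result
	Backends int
	Score    int
}

// rank deduplicates by URL, scores each hit, and stable-sorts by score.
func rank(query string, results []result) []scored {
	tokens := strings.Fields(strings.ToLower(query))

	// Deduplicate by URL, counting how many backends returned each one.
	// The first-seen copy supplies the title, snippet, and engine.
	index := map[string]int{}
	var out []scored
	for _, r := range results {
		if i, ok := index[r.URL]; ok {
			out[i].Backends++
			continue
		}
		index[r.URL] = len(out)
		out = append(out, scored{result: r, Backends: 1})
	}

	for i := range out {
		s := &out[i]
		title, snippet := strings.ToLower(s.Title), strings.ToLower(s.Snippet)
		for _, t := range tokens {
			if strings.Contains(title, t) {
				s.Score += 2 // title weight 2x
			}
			if strings.Contains(snippet, t) {
				s.Score++ // snippet weight 1x
			}
		}
		s.Score += s.Backends - 1 // assumed: +1 per additional backend
		if s.Engine == "wikipedia" {
			s.Score += 10
		}
	}

	// Stable sort keeps the incoming order for equal scores.
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	hits := []result{
		{"Go tutorial", "learn the go language", "https://a.example/go", "duckduckgo"},
		{"Go tutorial", "learn the go language", "https://a.example/go", "brave"},
		{"Go (programming language)", "Go is a language", "https://en.wikipedia.org/wiki/Go_(programming_language)", "wikipedia"},
		{"Cooking pasta", "recipes", "https://d.example", "mojeek"},
	}
	for _, s := range rank("go language", hits) {
		fmt.Printf("%2d  backends=%d  %s\n", s.Score, s.Backends, s.URL)
	}
}
```

Substring matching is deliberately crude here (the token "go" also matches "google"); a real implementation would likely tokenize titles and snippets as well.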
| 119 | + |
| 120 | +### 10. Cache key includes page number |
| 121 | + |
| 122 | +**Decision:** Cache key = `sha256(query|category|backend|region|timeLimit|page)`. |
| 123 | + |
| 124 | +**Why:** Without page in the key, searching "golang" page 1 then "golang" page 2 returns page 1's cached results. Learned this from a bug caught in review. |
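
A small sketch of the key derivation, following the formula above. The function name `cacheKey`, the parameter types, and the hex encoding of the digest are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// cacheKey hashes every parameter that changes the result set, including
// the page number, into a fixed-length hex string.
func cacheKey(query, category, backend, region, timeLimit string, page int) string {
	raw := strings.Join([]string{
		query, category, backend, region, timeLimit, strconv.Itoa(page),
	}, "|")
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	p1 := cacheKey("golang", "text", "auto", "us-en", "", 1)
	p2 := cacheKey("golang", "text", "auto", "us-en", "", 2)
	fmt.Println(p1)
	fmt.Println(p2)
	fmt.Println("distinct:", p1 != p2)
}
```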
| 125 | + |
| 126 | +### 11. Best-effort caching and history, not transactional |
| 127 | + |
| 128 | +**Decision:** `putSearchCache`, `putFetchCache`, and `putSearchHistory` log warnings on failure but don't propagate errors. |
| 129 | + |
| 130 | +**Why:** Cache and history are convenience features. A failed cache write shouldn't make a successful search return an error. The slog.Warn ensures failures are visible for debugging. |
| 131 | + |
| 132 | +### 12. Bing URL unwrapping via base64 |
| 133 | + |
| 134 | +**Decision:** Bing wraps result URLs in `/ck/a?u=<encoded>` redirects. We decode inline: strip first 2 chars of the `u` param, base64url decode the rest. |
| 135 | + |
| 136 | +**Why:** Following Bing's real redirect URL isn't reliable (may require cookies/sessions). The base64 encoding scheme was reverse-engineered from ddgs and is stable. |
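
The decoding step can be sketched as below. The function name `unwrapBingURL` is hypothetical, and the sample URL (including its `a1` prefix and `ntb` parameter) is constructed for illustration rather than captured from a real response.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// unwrapBingURL extracts the target from a /ck/a?u=... redirect URL:
// drop the first two characters of the u parameter, then base64url-decode
// the remainder.
func unwrapBingURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	enc := u.Query().Get("u")
	if len(enc) <= 2 {
		return "", errors.New("missing or too-short u parameter")
	}
	// Strip any padding so RawURLEncoding accepts both padded and
	// unpadded payloads.
	payload := strings.TrimRight(enc[2:], "=")
	decoded, err := base64.RawURLEncoding.DecodeString(payload)
	if err != nil {
		return "", err
	}
	return string(decoded), nil
}

func main() {
	wrapped := "https://www.bing.com/ck/a?u=a1aHR0cHM6Ly9leGFtcGxlLmNvbQ&ntb=1"
	target, err := unwrapBingURL(wrapped)
	fmt.Println(target, err)
}
```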
| 137 | + |
| 138 | +--- |
| 139 | + |
| 140 | +## What We Chose Not to Build |
| 141 | + |
| 142 | +| Feature | Reason | |
| 143 | +|---------|--------| |
| 144 | +| Exponential backoff (`cenkalti/backoff`) | Multi-backend fallback is the retry strategy for a CLI tool | |
| 145 | +| `SearchError` structured error type | Plain `fmt.Errorf` is sufficient; errors flow through cobra's `RunE` | |
| 146 | +| `--proxy` CLI flag / `WEBSEARCH_PROXY` env var | Proxy works via config.yaml; CLI flag adds complexity for a rarely-used feature | |
| 147 | +| Session recovery (401/403 retry) | Multi-backend fallback handles this implicitly | |
| 148 | +| Rate limiter eviction | Not needed for CLI (process is short-lived); TODO left for library use | |
| 149 | +| Daemon cache warming | Deferred — low value until usage patterns are established | |
| 150 | + |
| 151 | +--- |
| 152 | + |
| 153 | +## Backend Reference |
| 154 | + |
| 155 | +| Backend | Method | Priority | Auto | Notes | |
| 156 | +|---------|--------|----------|------|-------| |
| 157 | +| DuckDuckGo | POST `html.duckduckgo.com/html/` | 1 | Yes | Most reliable. No-JS endpoint. Max 499 char query. | |
| 158 | +| Brave | GET `search.brave.com/search` | 1 | Yes | Good quality, moderate anti-bot. | |
| 159 | +| Mojeek | GET `www.mojeek.com/search` | 1 | Yes | Very permissive. Independent index. | |
| 160 | +| Wikipedia | GET `en.wikipedia.org/w/api.php` | 2 | Yes | JSON API. +10 ranking bonus. Filters disambiguation. | |
| 161 | +| Yahoo | GET `search.yahoo.com/search` | 1 | No | Requires redirect URL unwrapping. | |
| 162 | +| Yandex | GET `yandex.com/search/site/` | 1 | No | Uses random searchid. | |
| 163 | +| Google | GET `www.google.com/search` | 0 | No | Most aggressive anti-bot. Lowest priority. | |
| 164 | +| Bing | GET `www.bing.com/search` | 0 | No | base64 URL unwrapping. Ad filtering. | |
| 165 | + |
| 166 | +News backends: DuckDuckGo (VQD token + JSON API), Yahoo (HTML scraping). |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## Security |
| 171 | + |
| 172 | +- **SSRF protection on fetch**: DNS resolution + private IP blocking + IP pinning (anti-DNS-rebinding) |
| 173 | +- **No SSRF on search**: Engines only hit known external hosts |
| 174 | +- **SQL injection**: All queries use parameterized placeholders (`db.Rebind("... ?")`) |
| 175 | +- **Query length limit**: 2000 chars max to prevent abuse |
| 176 | +- **Response body limit**: 10MB hard cap on fetch, body size limits on news/VQD responses |
| 177 | + |
| 178 | +--- |
| 179 | + |
| 180 | +## Dependencies |
| 181 | + |
| 182 | +| Package | Purpose | |
| 183 | +|---------|---------| |
| 184 | +| `github.com/PuerkitoBio/goquery` | HTML parsing with CSS selectors | |
| 185 | +| `golang.org/x/time/rate` | Per-host token bucket rate limiting | |
| 186 | +| `golang.org/x/sync/errgroup` | Concurrent backend dispatch | |
| 187 | +| `github.com/JohannesKaufmann/html-to-markdown` | HTML → Markdown for fetch | |
| 188 | +| `github.com/refraction-networking/utls` | TLS fingerprint impersonation (via internal/browser) | |
| 189 | + |
| 190 | +--- |
| 191 | + |
| 192 | +## Prior Art |
| 193 | + |
| 194 | +| Project | Language | Gap | |
| 195 | +|---------|----------|-----| |
| 196 | +| [ddgs](https://github.com/deedy5/ddgs) | Python | Python-only. Uses `primp` (Rust) for TLS. | |
| 197 | +| [SearXNG](https://github.com/searxng/searxng) | Python | Requires Docker. Server-based. | |
| 198 | +| [Djarvur/ddg-search](https://github.com/Djarvur/ddg-search) | Go | DDG only, no multi-backend. | |
| 199 | + |
| 200 | +WebSearch fills the gap: a Go data source with multi-backend search, skills integration, and agent-first CLI design. |