Commit 4a3031f

docs(websearch): convert spec into strategy & decisions document
Replace the original implementation plan with a document capturing the key architectural decisions, trade-offs, and rationale from building websearch. Removes outdated phase checklists and proposed code snippets in favor of decisions-as-written.
docs/websearch.md (1 file changed, +200 −0)
# WebSearch — Strategy & Design Decisions

A Go metasearch data source for OpenBotKit. No API keys. No external services. Just HTTP requests and HTML parsing, running locally.

Inspired by [ddgs](https://github.com/deedy5/ddgs) (Python), rebuilt in Go as a first-class openbotkit source.

---
## Problem

Every open-source coding agent delegates web search to its model provider's API. There is no standalone, self-contained, local web search capability that agents can use. We need a Go equivalent integrated into openbotkit — search, news, and page fetching accessible via CLI, consumed by agents as skills.
## Design Principles

1. **Follows openbotkit patterns** — Source interface, CLI commands via `obk websearch`, skills as SKILL.md/REFERENCE.md, SQLite for caching/history.
2. **Zero dependencies on external services** — No API keys, no Docker, no servers. Just HTTP scraping.
3. **Composable** — Search, news, and fetch are separate skills. The agent orchestrates.
4. **Resilient** — Multiple backends with concurrent dispatch, automatic fallback, per-host rate limiting, and health tracking.

---
## Architecture

```
Skills Layer
  web-search, web-fetch, web-news (SKILL.md + REFERENCE.md each)

CLI Layer (obk websearch ...)
  search, news, fetch, backends, history, cache clear

source/websearch/ (Go package)
┌─────────────────────────────────────────────┐
│ Orchestrator                                │
│  - Backend selection (auto / explicit)      │
│  - Concurrent dispatch (errgroup)           │
│  - Result ranking & deduplication           │
│  - Health tracking (exponential cooldown)   │
├─────────────────────────────────────────────┤
│ Engines (each: HTTP request + HTML parse)   │
│  DDG, Brave, Mojeek, Wikipedia, Yahoo,      │
│  Yandex, Google, Bing                       │
├─────────────────────────────────────────────┤
│ httpclient/                                 │
│  - Wraps internal/browser (utls transport)  │
│  - UA rotation (4 browser profiles)         │
│  - Per-host token bucket rate limiting      │
├─────────────────────────────────────────────┤
│ SQLite (cache + history)                    │
│  search_cache, fetch_cache, search_history  │
└─────────────────────────────────────────────┘
```

---
## Key Decisions
### 1. Flat package over engine/ subdirectory

**Decision:** Engines live directly in `source/websearch/` (e.g., `duckduckgo.go`, `brave.go`), not in a `source/websearch/engine/` subdirectory.

**Why:** The engines are small (50–120 lines each) and tightly coupled to the orchestrator. A subdirectory adds import ceremony for no benefit. Each engine file is self-contained — struct, constructor, `Search()`/`News()` method, and any helper functions.
### 2. No parse/ package

**Decision:** HTML parsing helpers stay private and co-located with each engine.

**Why:** Each engine has its own HTML structure and parsing quirks. Extracting shared "parse helpers" would create false abstractions — the parsing code for DuckDuckGo and Brave has nothing in common. The functions are small, private, and belong next to their callers.
### 3. HTTPDoer interface over concrete *http.Client

**Decision:** Engines accept an `HTTPDoer` interface (`Do(*http.Request) (*http.Response, error)`) instead of `*http.Client`.

**Why:** This lets engines work with both `*http.Client` (in tests, using httptest) and `*httpclient.Client` (in production, with UA rotation + rate limiting). Engines stay testable without mocking infrastructure.
### 4. Multi-backend fallback over exponential backoff

**Decision:** No `cenkalti/backoff` dependency. The retry strategy is multi-backend fallback, not per-request retries.

**Why:** This is a CLI tool, not a long-running service. When a user runs a search, they want results in seconds, not after a backoff sequence. If DuckDuckGo fails, we immediately try Brave, Mojeek, etc. The health tracker prevents repeatedly hitting a broken backend (exponential cooldown: 30s → 5min). This gives better UX than retrying the same failing backend with increasing delays.
### 5. Concurrent dispatch with errgroup

**Decision:** All backends in the auto set run concurrently via `errgroup`, not sequentially.

**Why:** With 4 default backends, sequential dispatch means total latency = sum of all backend latencies; concurrent dispatch means latency = max(backend latencies). Results are collected under a mutex, then sorted by engine priority before ranking to keep output deterministic.
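The fan-out/collect shape can be sketched as follows. To keep the example dependency-free it uses the stdlib `sync.WaitGroup` in place of `errgroup`; the function and type names are illustrative, not the real API:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Result carries an engine priority so merged output stays deterministic.
type Result struct {
	Engine   string
	Priority int
	Title    string
}

// searchAll fans a query out to every backend at once and merges results
// under a mutex, so latency is max(backend latencies), not their sum.
func searchAll(query string, backends map[string]func(string) []Result) []Result {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []Result
	)
	for name, search := range backends {
		wg.Add(1)
		go func(name string, search func(string) []Result) {
			defer wg.Done()
			res := search(query)
			mu.Lock()
			out = append(out, res...)
			mu.Unlock()
		}(name, search)
	}
	wg.Wait()
	// Sort by engine priority (higher first, then name) so the order is the
	// same regardless of goroutine completion order.
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority > out[j].Priority
		}
		return out[i].Engine < out[j].Engine
	})
	return out
}

func main() {
	backends := map[string]func(string) []Result{
		"wikipedia": func(q string) []Result { return []Result{{"wikipedia", 2, "Go (programming language)"}} },
		"ddg":       func(q string) []Result { return []Result{{"ddg", 1, "The Go Programming Language"}} },
	}
	for _, r := range searchAll("golang", backends) {
		fmt.Println(r.Engine, r.Title) // wikipedia (priority 2) prints first
	}
}
```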
### 6. Separate fetch client with SSRF protection

**Decision:** `fetchClient()` (for `obk websearch fetch`) uses a raw `*http.Client` with SSRF guards. `httpClient()` (for search engines) uses `*httpclient.Client` without SSRF guards.

**Why:** Search engines hit known external hosts (duckduckgo.com, bing.com, etc.), so SSRF isn't a concern there. But `fetch` takes arbitrary user-provided URLs, so it must resolve DNS and block private IPs (loopback, RFC 1918, link-local) to prevent SSRF. It also pins to the first resolved IP to prevent DNS rebinding (TOCTOU).
### 7. UA rotation per client instance, not per request

**Decision:** A random browser profile (Chrome/Firefox/Safari/Edge) is selected when `httpclient.Client` is created and reused for all requests from that client.

**Why:** A real browser sends the same UA across a session; rotating per request looks suspicious to anti-bot systems. The client is created once per `WebSearch` instance and reused across searches.
### 8. Auto backend set: DDG + Brave + Mojeek + Wikipedia

**Decision:** The default `auto` set uses DuckDuckGo, Brave, Mojeek, and Wikipedia. Yahoo, Yandex, Google, and Bing are opt-in only via `--backend <name>`.

**Why:**

- **DuckDuckGo** — Most reliable scraping target. No-JS HTML endpoint.
- **Brave** — Good quality, less aggressive anti-bot than Google.
- **Mojeek** — Very permissive, independent index (not a Bing/Google derivative).
- **Wikipedia** — JSON API (no scraping needed), always high quality for factual queries.
- **Google** — Most aggressive anti-bot, CAPTCHAs likely. Opt-in only.
- **Bing** — Disabled in ddgs too. Opt-in only.
- **Yahoo/Yandex** — Redirect URL unwrapping adds fragility. Opt-in only.

Users can override via config (`websearch.backends` list) or the `--backend` flag.
### 9. Ranking: frequency + token scoring + Wikipedia priority

**Decision:** Results are ranked by a multi-backend appearance bonus, query token scoring (title weight 2x, snippet weight 1x), and a +10 Wikipedia bonus. A stable sort preserves the original order for equal scores.

**Why:** Simple, predictable, no ML. A result returned by 3 backends is likely more relevant than one returned by 1. Title matches matter more than snippet matches. Wikipedia is almost always the best single result for factual queries — the +10 bonus ensures it surfaces first without suppressing other results.
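The scoring heuristic can be sketched like this. The title 2x / snippet 1x weights and the +10 Wikipedia bonus come from the text; the per-extra-backend bonus value and all names are assumptions:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Result struct {
	URL, Title, Snippet string
	Engines             []string // backends that returned this URL
}

// score applies the heuristic: multi-backend bonus, token matching with
// title weighted 2x over snippet, and +10 if Wikipedia returned it.
func score(r Result, queryTokens []string) int {
	s := (len(r.Engines) - 1) * 3 // assumed bonus per extra backend
	title := strings.ToLower(r.Title)
	snippet := strings.ToLower(r.Snippet)
	for _, tok := range queryTokens {
		if strings.Contains(title, tok) {
			s += 2
		}
		if strings.Contains(snippet, tok) {
			s++
		}
	}
	for _, e := range r.Engines {
		if e == "wikipedia" {
			s += 10
			break
		}
	}
	return s
}

func rank(results []Result, query string) []Result {
	tokens := strings.Fields(strings.ToLower(query))
	// Stable sort keeps the original order for equal scores.
	sort.SliceStable(results, func(i, j int) bool {
		return score(results[i], tokens) > score(results[j], tokens)
	})
	return results
}

func main() {
	rs := rank([]Result{
		{URL: "a", Title: "random page", Engines: []string{"brave"}},
		{URL: "b", Title: "Go (programming language)", Snippet: "Go is a language", Engines: []string{"wikipedia"}},
		{URL: "c", Title: "go tutorial", Engines: []string{"ddg", "brave", "mojeek"}},
	}, "go language")
	for _, r := range rs {
		fmt.Println(r.URL) // b, then c, then a
	}
}
```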
### 10. Cache key includes page number

**Decision:** Cache key = `sha256(query|category|backend|region|timeLimit|page)`.

**Why:** Without the page in the key, searching "golang" page 1 and then "golang" page 2 would return page 1's cached results. Learned from a bug caught in review.
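A minimal sketch of the key derivation, following the `|`-joined field list above (the function name and exact formatting are assumptions):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// cacheKey hashes every parameter that changes the result set — including
// the page number, whose omission was the bug mentioned above.
func cacheKey(query, category, backend, region, timeLimit string, page int) string {
	raw := strings.Join([]string{
		query, category, backend, region, timeLimit, fmt.Sprintf("%d", page),
	}, "|")
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	k1 := cacheKey("golang", "text", "auto", "us-en", "", 1)
	k2 := cacheKey("golang", "text", "auto", "us-en", "", 2)
	fmt.Println(k1 != k2) // true: page 2 never collides with page 1
}
```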
### 11. Best-effort caching and history, not transactional

**Decision:** `putSearchCache`, `putFetchCache`, and `putSearchHistory` log warnings on failure but don't propagate errors.

**Why:** Cache and history are convenience features; a failed cache write shouldn't make a successful search return an error. The `slog.Warn` ensures failures stay visible for debugging.
### 12. Bing URL unwrapping via base64

**Decision:** Bing wraps result URLs in `/ck/a?u=<encoded>` redirects. We decode inline: strip the first 2 characters of the `u` param, then base64url-decode the rest.

**Why:** Following Bing's real redirect URL isn't reliable (it may require cookies/sessions). The base64 encoding scheme was reverse-engineered from ddgs and is stable.
---
## What We Chose Not to Build

| Feature | Reason |
|---------|--------|
| Exponential backoff (`cenkalti/backoff`) | Multi-backend fallback is the retry strategy for a CLI tool |
| `SearchError` structured error type | Plain `fmt.Errorf` is sufficient; errors flow through cobra's `RunE` |
| `--proxy` CLI flag / `WEBSEARCH_PROXY` env var | Proxy works via config.yaml; a CLI flag adds complexity for a rarely used feature |
| Session recovery (401/403 retry) | Multi-backend fallback handles this implicitly |
| Rate limiter eviction | Not needed for the CLI (process is short-lived); TODO left for library use |
| Daemon cache warming | Deferred — low value until usage patterns are established |

---
## Backend Reference

| Backend | Method | Priority | Auto | Notes |
|---------|--------|----------|------|-------|
| DuckDuckGo | POST `html.duckduckgo.com/html/` | 1 | Yes | Most reliable. No-JS endpoint. Max 499-char query. |
| Brave | GET `search.brave.com/search` | 1 | Yes | Good quality, moderate anti-bot. |
| Mojeek | GET `www.mojeek.com/search` | 1 | Yes | Very permissive. Independent index. |
| Wikipedia | GET `en.wikipedia.org/w/api.php` | 2 | Yes | JSON API. +10 ranking bonus. Filters disambiguation pages. |
| Yahoo | GET `search.yahoo.com/search` | 1 | No | Requires redirect URL unwrapping. |
| Yandex | GET `yandex.com/search/site/` | 1 | No | Uses a random searchid. |
| Google | GET `www.google.com/search` | 0 | No | Most aggressive anti-bot. Lowest priority. |
| Bing | GET `www.bing.com/search` | 0 | No | base64 URL unwrapping. Ad filtering. |

News backends: DuckDuckGo (VQD token + JSON API), Yahoo (HTML scraping).

---
## Security

- **SSRF protection on fetch**: DNS resolution + private IP blocking + IP pinning (anti-DNS-rebinding)
- **No SSRF risk on search**: Engines only hit known external hosts
- **SQL injection**: All queries use parameterized placeholders (`db.Rebind("... ?")`)
- **Query length limit**: 2000 chars max to prevent abuse
- **Response body limit**: 10MB hard cap on fetch; body size limits on news/VQD responses

---
## Dependencies

| Package | Purpose |
|---------|---------|
| `github.com/PuerkitoBio/goquery` | HTML parsing with CSS selectors |
| `golang.org/x/time/rate` | Per-host token bucket rate limiting |
| `golang.org/x/sync/errgroup` | Concurrent backend dispatch |
| `github.com/JohannesKaufmann/html-to-markdown` | HTML → Markdown conversion for fetch |
| `github.com/refraction-networking/utls` | TLS fingerprint impersonation (via internal/browser) |

---
## Prior Art

| Project | Language | Gap |
|---------|----------|-----|
| [ddgs](https://github.com/deedy5/ddgs) | Python | Python-only. Uses `primp` (Rust) for TLS. |
| [SearXNG](https://github.com/searxng/searxng) | Python | Requires Docker. Server-based. |
| [Djarvur/ddg-search](https://github.com/Djarvur/ddg-search) | Go | DDG only; no multi-backend support. |

This fills the gap: a Go data source with multi-backend search, skills integration, and agent-first CLI design.
