Commit 4704f1c

Merge pull request #57 from priyanshujain/feat-websearch-phase3
feat(websearch): phase 3 — ranking, health tracking, httpclient, bing
2 parents bd775d2 + 4a3031f commit 4704f1c

40 files changed: +1556 −127 lines

config/config.go — 1 addition, 0 deletions

```diff
@@ -143,6 +143,7 @@ type WebSearchConfig struct {
 	Proxy    string `yaml:"proxy,omitempty"`
 	Timeout  string `yaml:"timeout,omitempty"`
 	CacheTTL string `yaml:"cache_ttl,omitempty"`
+	Backends []string `yaml:"backends,omitempty"`
 }

 type ContactsConfig struct {
```

docs/websearch.md — new file, 200 additions
# WebSearch — Strategy & Design Decisions

A Go metasearch data source for OpenBotKit. No API keys. No external services. Just HTTP requests and HTML parsing, running locally.

Inspired by [ddgs](https://github.com/deedy5/ddgs) (Python), rebuilt in Go as a first-class openbotkit source.

---

## Problem

Every open-source coding agent delegates web search to its model provider's API. There is no standalone, self-contained, local web search capability that agents can use. We need a Go equivalent integrated into openbotkit — search, news, and page fetching accessible via CLI, consumed by agents as skills.

## Design Principles

1. **Follows openbotkit patterns** — Source interface, CLI commands via `obk websearch`, skills as SKILL.md/REFERENCE.md, SQLite for caching/history.
2. **Zero dependencies on external services** — No API keys, no Docker, no servers. Just HTTP scraping.
3. **Composable** — Search, news, and fetch are separate skills. The agent orchestrates.
4. **Resilient** — Multiple backends with concurrent dispatch, automatic fallback, per-host rate limiting, and health tracking.

---

## Architecture
```
Skills Layer
  web-search, web-fetch, web-news (SKILL.md + REFERENCE.md each)

CLI Layer (obk websearch ...)
  search, news, fetch, backends, history, cache clear

source/websearch/ (Go package)
┌─────────────────────────────────────────────┐
│ Orchestrator                                │
│  - Backend selection (auto / explicit)      │
│  - Concurrent dispatch (errgroup)           │
│  - Result ranking & deduplication           │
│  - Health tracking (exponential cooldown)   │
├─────────────────────────────────────────────┤
│ Engines (each: HTTP request + HTML parse)   │
│  DDG, Brave, Mojeek, Wikipedia, Yahoo,      │
│  Yandex, Google, Bing                       │
├─────────────────────────────────────────────┤
│ httpclient/                                 │
│  - Wraps internal/browser (utls transport)  │
│  - UA rotation (4 browser profiles)         │
│  - Per-host token bucket rate limiting      │
├─────────────────────────────────────────────┤
│ SQLite (cache + history)                    │
│  search_cache, fetch_cache, search_history  │
└─────────────────────────────────────────────┘
```
---

## Key Decisions

### 1. Flat package over engine/ subdirectory

**Decision:** Engines live directly in `source/websearch/` (e.g., `duckduckgo.go`, `brave.go`), not in a `source/websearch/engine/` subdirectory.

**Why:** The engines are small (50-120 lines each) and tightly coupled to the orchestrator. A subdirectory adds import ceremony for no benefit. Each engine file is self-contained — struct, constructor, `Search()`/`News()` method, and any helper functions.
### 2. No parse/ package

**Decision:** HTML parsing helpers stay private and co-located with each engine.

**Why:** Each engine has its own HTML structure and parsing quirks. Extracting shared "parse helpers" would create false abstractions — the parsing code between DuckDuckGo and Brave has nothing in common. Functions are small, private, and belong next to their callers.
### 3. HTTPDoer interface over concrete *http.Client

**Decision:** Engines accept an `HTTPDoer` interface (`Do(*http.Request) (*http.Response, error)`) instead of `*http.Client`.

**Why:** This lets engines work with both `*http.Client` (in tests, using httptest) and `*httpclient.Client` (in production, with UA rotation + rate limiting). Engines stay testable without mocking infrastructure.
### 4. Multi-backend fallback over exponential backoff

**Decision:** No `cenkalti/backoff` dependency. The retry strategy is multi-backend fallback, not per-request retries.

**Why:** This is a CLI tool, not a long-running service. When a user runs a search, they want results in seconds, not after a backoff sequence. If DuckDuckGo fails, we immediately try Brave, Mojeek, etc. The health tracker prevents repeatedly hitting a broken backend (exponential cooldown: 30s → 5min). This gives better UX than retrying the same failing backend with increasing delays.
### 5. Concurrent dispatch with errgroup

**Decision:** All backends in the auto set run concurrently via `errgroup`, not sequentially.

**Why:** With 4 default backends, sequential dispatch means total latency = sum of all backend latencies. Concurrent dispatch means latency = max(backend latencies). Results are collected with a mutex, then sorted by engine priority before ranking to maintain deterministic output.
### 6. Separate fetch client with SSRF protection

**Decision:** `fetchClient()` (for `obk websearch fetch`) uses a raw `*http.Client` with SSRF guards. `httpClient()` (for search engines) uses `*httpclient.Client` without SSRF guards.

**Why:** Search engines hit known external hosts (duckduckgo.com, bing.com, etc.) — SSRF isn't a concern. But `fetch` takes arbitrary user-provided URLs, so it must resolve DNS and block private IPs (loopback, RFC1918, link-local) to prevent SSRF. It also pins to the first resolved IP to prevent DNS rebinding (TOCTOU).
### 7. UA rotation per client instance, not per request

**Decision:** A random browser profile (Chrome/Firefox/Safari/Edge) is selected when `httpclient.Client` is created and reused for all requests from that client.

**Why:** A real browser sends the same UA across a session. Rotating per-request looks suspicious to anti-bot systems. The client is created once per `WebSearch` instance and reused across searches.
### 8. Auto backend set: DDG + Brave + Mojeek + Wikipedia

**Decision:** The default `auto` set uses DuckDuckGo, Brave, Mojeek, and Wikipedia. Yahoo, Yandex, Google, and Bing are opt-in only via `--backend <name>`.

**Why:**

- **DuckDuckGo** — Most reliable scraping target. No-JS HTML endpoint.
- **Brave** — Good quality, less aggressive anti-bot than Google.
- **Mojeek** — Very permissive, independent index (not Bing/Google derivative).
- **Wikipedia** — JSON API (no scraping needed), always high-quality for factual queries.
- **Google** — Most aggressive anti-bot, CAPTCHAs likely. Opt-in only.
- **Bing** — Disabled in ddgs too. Opt-in only.
- **Yahoo/Yandex** — Redirect URL unwrapping adds fragility. Opt-in only.

Users can override via config (`websearch.backends` list) or the `--backend` flag.
### 9. Ranking: frequency + token scoring + Wikipedia priority

**Decision:** Results are ranked by: multi-backend appearance bonus, query token scoring (title weight 2x, snippet weight 1x), Wikipedia +10 bonus. A stable sort preserves original order for equal scores.

**Why:** Simple, predictable, no ML. A result appearing from 3 backends is likely more relevant than one from 1 backend. Title matches matter more than snippet matches. Wikipedia is almost always the best single result for factual queries — the +10 bonus ensures it surfaces first without suppressing other results.
### 10. Cache key includes page number

**Decision:** Cache key = `sha256(query|category|backend|region|timeLimit|page)`.

**Why:** Without page in the key, searching "golang" page 1 then "golang" page 2 returns page 1's cached results. Learned this from a bug caught in review.
### 11. Best-effort caching and history, not transactional

**Decision:** `putSearchCache`, `putFetchCache`, and `putSearchHistory` log warnings on failure but don't propagate errors.

**Why:** Cache and history are convenience features. A failed cache write shouldn't make a successful search return an error. The `slog.Warn` call keeps failures visible for debugging.
### 12. Bing URL unwrapping via base64

**Decision:** Bing wraps result URLs in `/ck/a?u=<encoded>` redirects. We decode inline: strip the first 2 chars of the `u` param, base64url-decode the rest.

**Why:** Following Bing's real redirect URL isn't reliable (may require cookies/sessions). The base64 encoding scheme was reverse-engineered from ddgs and is stable.
---

## What We Chose Not to Build

| Feature | Reason |
|---------|--------|
| Exponential backoff (`cenkalti/backoff`) | Multi-backend fallback is the retry strategy for a CLI tool |
| `SearchError` structured error type | Plain `fmt.Errorf` is sufficient; errors flow through cobra's `RunE` |
| `--proxy` CLI flag / `WEBSEARCH_PROXY` env var | Proxy works via config.yaml; a CLI flag adds complexity for a rarely-used feature |
| Session recovery (401/403 retry) | Multi-backend fallback handles this implicitly |
| Rate limiter eviction | Not needed for CLI (process is short-lived); TODO left for library use |
| Daemon cache warming | Deferred — low value until usage patterns are established |
---

## Backend Reference

| Backend | Method | Priority | Auto | Notes |
|---------|--------|----------|------|-------|
| DuckDuckGo | POST `html.duckduckgo.com/html/` | 1 | Yes | Most reliable. No-JS endpoint. Max 499 char query. |
| Brave | GET `search.brave.com/search` | 1 | Yes | Good quality, moderate anti-bot. |
| Mojeek | GET `www.mojeek.com/search` | 1 | Yes | Very permissive. Independent index. |
| Wikipedia | GET `en.wikipedia.org/w/api.php` | 2 | Yes | JSON API. +10 ranking bonus. Filters disambiguation. |
| Yahoo | GET `search.yahoo.com/search` | 1 | No | Requires redirect URL unwrapping. |
| Yandex | GET `yandex.com/search/site/` | 1 | No | Uses random searchid. |
| Google | GET `www.google.com/search` | 0 | No | Most aggressive anti-bot. Lowest priority. |
| Bing | GET `www.bing.com/search` | 0 | No | base64 URL unwrapping. Ad filtering. |

News backends: DuckDuckGo (VQD token + JSON API), Yahoo (HTML scraping).
---

## Security

- **SSRF protection on fetch**: DNS resolution + private IP blocking + IP pinning (anti-DNS-rebinding)
- **No SSRF guard on search**: Engines only hit known external hosts
- **SQL injection**: All queries use parameterized placeholders (`db.Rebind("... ?")`)
- **Query length limit**: 2000 chars max to prevent abuse
- **Response body limit**: 10MB hard cap on fetch, body size limits on news/VQD responses
---

## Dependencies

| Package | Purpose |
|---------|---------|
| `github.com/PuerkitoBio/goquery` | HTML parsing with CSS selectors |
| `golang.org/x/time/rate` | Per-host token bucket rate limiting |
| `golang.org/x/sync/errgroup` | Concurrent backend dispatch |
| `github.com/JohannesKaufmann/html-to-markdown` | HTML → Markdown for fetch |
| `github.com/refraction-networking/utls` | TLS fingerprint impersonation (via internal/browser) |
---

## Prior Art

| Project | Language | Gap |
|---------|----------|-----|
| [ddgs](https://github.com/deedy5/ddgs) | Python | Python-only. Uses `primp` (Rust) for TLS. |
| [SearXNG](https://github.com/searxng/searxng) | Python | Requires Docker. Server-based. |
| [Djarvur/ddg-search](https://github.com/Djarvur/ddg-search) | Go | DDG only, no multi-backend. |

This fills the gap: a Go data source with multi-backend search, skills integration, and agent-first CLI design.

go.mod — 1 addition, 1 deletion (`golang.org/x/sync` promoted from indirect to direct dependency)

```diff
@@ -22,6 +22,7 @@ require (
 	go.mau.fi/whatsmeow v0.0.0-20260227112304-c9652e4448a2
 	golang.org/x/net v0.50.0
 	golang.org/x/oauth2 v0.35.0
+	golang.org/x/sync v0.19.0
 	golang.org/x/time v0.14.0
 	google.golang.org/api v0.269.0
 	google.golang.org/protobuf v1.36.11
@@ -143,7 +144,6 @@ require (
 	go.uber.org/goleak v1.3.0 // indirect
 	golang.org/x/crypto v0.48.0 // indirect
 	golang.org/x/exp v0.0.0-20260212183809-81e46e3db34a // indirect
-	golang.org/x/sync v0.19.0 // indirect
 	golang.org/x/sys v0.41.0 // indirect
 	golang.org/x/text v0.34.0 // indirect
 	google.golang.org/genproto/googleapis/api v0.0.0-20260209200024-4cfbd4190f57 // indirect
```

internal/cli/websearch/backends.go — 1 addition, 0 deletions

```diff
@@ -26,6 +26,7 @@ var backendsCmd = &cobra.Command{
 		{Name: "yandex", Priority: 1, News: false},
 		{Name: "google", Priority: 0, News: false},
 		{Name: "wikipedia", Priority: 2, News: false},
+		{Name: "bing", Priority: 0, News: false},
 	}
 	return json.NewEncoder(os.Stdout).Encode(backends)
 },
```

internal/cli/websearch/cache.go — new file, 50 additions

```go
package websearch

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/priyanshujain/openbotkit/config"
	wssrc "github.com/priyanshujain/openbotkit/source/websearch"
	"github.com/priyanshujain/openbotkit/store"
	"github.com/spf13/cobra"
)

var cacheCmd = &cobra.Command{
	Use:   "cache",
	Short: "Manage websearch cache",
}

var cacheClearCmd = &cobra.Command{
	Use:   "clear",
	Short: "Clear all cached search results and fetched pages",
	Args:  cobra.NoArgs,
	RunE: func(cmd *cobra.Command, args []string) error {
		cfg, err := config.Load()
		if err != nil {
			return fmt.Errorf("load config: %w", err)
		}

		db, err := store.Open(store.SQLiteConfig(cfg.WebSearchDataDSN()))
		if err != nil {
			return fmt.Errorf("open db: %w", err)
		}
		defer db.Close()

		if err := wssrc.Migrate(db); err != nil {
			return fmt.Errorf("migrate: %w", err)
		}

		ws := wssrc.New(wssrc.Config{WebSearch: cfg.WebSearch}, wssrc.WithDB(db))
		if err := ws.ClearCaches(); err != nil {
			return err
		}

		return json.NewEncoder(os.Stdout).Encode(map[string]string{"status": "cleared"})
	},
}

func init() {
	cacheCmd.AddCommand(cacheClearCmd)
}
```

internal/cli/websearch/history.go — new file, 69 additions

```go
package websearch

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/priyanshujain/openbotkit/config"
	wssrc "github.com/priyanshujain/openbotkit/source/websearch"
	"github.com/priyanshujain/openbotkit/store"
	"github.com/spf13/cobra"
)

type historyEntry struct {
	Query       string `json:"query"`
	Category    string `json:"category"`
	ResultCount int    `json:"result_count"`
	Backends    string `json:"backends"`
	SearchMs    int64  `json:"search_ms"`
	CreatedAt   string `json:"created_at"`
}

var historyCmd = &cobra.Command{
	Use:   "history",
	Short: "Show recent search history",
	Args:  cobra.NoArgs,
	RunE: func(cmd *cobra.Command, args []string) error {
		cfg, err := config.Load()
		if err != nil {
			return fmt.Errorf("load config: %w", err)
		}

		db, err := store.Open(store.SQLiteConfig(cfg.WebSearchDataDSN()))
		if err != nil {
			return fmt.Errorf("open db: %w", err)
		}
		defer db.Close()

		if err := wssrc.Migrate(db); err != nil {
			return fmt.Errorf("migrate: %w", err)
		}

		limit, _ := cmd.Flags().GetInt("limit")

		rows, err := db.Query("SELECT query, category, result_count, backends, search_ms, created_at FROM search_history ORDER BY created_at DESC LIMIT ?", limit)
		if err != nil {
			return fmt.Errorf("query history: %w", err)
		}
		defer rows.Close()

		var entries []historyEntry
		for rows.Next() {
			var e historyEntry
			if err := rows.Scan(&e.Query, &e.Category, &e.ResultCount, &e.Backends, &e.SearchMs, &e.CreatedAt); err != nil {
				return fmt.Errorf("scan row: %w", err)
			}
			entries = append(entries, e)
		}
		if entries == nil {
			entries = []historyEntry{}
		}

		return json.NewEncoder(os.Stdout).Encode(entries)
	},
}

func init() {
	historyCmd.Flags().Int("limit", 20, "Maximum number of entries to show")
}
```

internal/cli/websearch/news.go — 3 additions, 0 deletions

```diff
@@ -25,6 +25,7 @@ var newsCmd = &cobra.Command{
 		backend, _ := cmd.Flags().GetString("backend")
 		timeLimit, _ := cmd.Flags().GetString("time-limit")
 		region, _ := cmd.Flags().GetString("region")
+		page, _ := cmd.Flags().GetInt("page")
 		noCache, _ := cmd.Flags().GetBool("no-cache")

 		var opts []wssrc.Option
@@ -42,6 +43,7 @@ var newsCmd = &cobra.Command{
 			Backend:   backend,
 			TimeLimit: timeLimit,
 			Region:    region,
+			Page:      page,
 			NoCache:   noCache,
 		})
 		if err != nil {
@@ -57,5 +59,6 @@ func init() {
 	newsCmd.Flags().StringP("backend", "b", "auto", "News backend (auto, duckduckgo, yahoo)")
 	newsCmd.Flags().StringP("time-limit", "t", "", "Time limit (d=day, w=week, m=month)")
 	newsCmd.Flags().StringP("region", "r", "us-en", "Region for news results")
+	newsCmd.Flags().IntP("page", "p", 1, "Page number for pagination")
 	newsCmd.Flags().Bool("no-cache", false, "Bypass result cache")
 }
```
