DuckDB extension for querying Google Custom Search API directly from SQL.
-- Install from community extensions (when published)
INSTALL web_search FROM community;
LOAD web_search;Create a Google Custom Search API secret:
CREATE SECRET google_search (
TYPE google_search,
key 'YOUR_API_KEY',
cx 'YOUR_SEARCH_ENGINE_ID'
);Get your credentials at:
- API Key: https://console.cloud.google.com/apis/credentials
- Search Engine ID (cx): https://programmablesearchengine.google.com/
-- Basic search
SELECT title, link, snippet FROM google_search('duckdb database') LIMIT 10;
-- Filter by site
SELECT * FROM google_search('programming') WHERE site = 'github.com' LIMIT 20;
-- Multiple sites
SELECT * FROM google_search('rust tutorial')
WHERE site IN ('github.com', 'stackoverflow.com', 'reddit.com')
LIMIT 50;
-- Exclude sites
SELECT * FROM google_search('python')
WHERE site != 'pinterest.com' AND site != 'quora.com'
LIMIT 30;
-- Combined filters
SELECT title, link, site FROM google_search('machine learning')
WHERE site IN ('arxiv.org', 'github.com') AND site != 'gist.github.com'
LIMIT 100;
-- Filter by language (pushed to API as lr param)
SELECT title, link FROM google_search('uutiset')
WHERE language = 'fi'
LIMIT 10;
-- Filter by country (pushed to API as cr param)
SELECT title, link FROM google_search('news')
WHERE country = 'US'
LIMIT 10;
-- Combined language and country
SELECT title, link FROM google_search('actualités')
WHERE language = 'fr' AND country = 'FR'
LIMIT 10;
-- Filter by file type (pushed to API as fileType param)
SELECT title, link FROM google_search('machine learning research')
WHERE file_type = 'pdf'
LIMIT 10;
-- Exact phrase match (pushed to API as exactTerms param)
SELECT title, link FROM google_search('engineering jobs')
WHERE exact_match = 'systems engineering'
LIMIT 10;
-- OR terms (pushed to API as orTerms param)
SELECT title, link FROM google_search('jobs')
WHERE term IN ('remote', 'hybrid', 'onsite')
LIMIT 10;
-- Exclude terms (pushed to API as excludeTerms param)
SELECT title, link FROM google_search('programming tutorial')
WHERE term != 'beginner' AND term != 'intro'
LIMIT 10;SELECT title, image_url, width, height
FROM google_image_search('sunset photography')
LIMIT 20;
-- With image filters
SELECT * FROM google_image_search('logo design', img_size:='large', img_type:='clipart')
LIMIT 10;| Column | Type | Description |
|---|---|---|
| title | VARCHAR | Page title |
| link | VARCHAR | Page URL |
| snippet | VARCHAR | Text snippet |
| display_link | VARCHAR | Display URL |
| formatted_url | VARCHAR | Formatted URL |
| html_title | VARCHAR | HTML-formatted title |
| html_snippet | VARCHAR | HTML-formatted snippet |
| mime | VARCHAR | MIME type (for files) |
| file_format | VARCHAR | File format |
| pagemap | JSON | Page metadata (use pagemap->'key' to access) |
| site | VARCHAR | Domain (extracted from link) |
| date | VARCHAR | Page date (NULL, for ORDER BY pushdown) |
| language | VARCHAR | Language filter (for WHERE pushdown) |
| country | VARCHAR | Country filter (for WHERE pushdown) |
| file_type | VARCHAR | File type filter (for WHERE pushdown) |
| term | VARCHAR | Term filter (for WHERE pushdown to orTerms/excludeTerms) |
| exact_match | VARCHAR | Exact phrase filter (for WHERE pushdown to exactTerms) |
| Column | Type | Description |
|---|---|---|
| title | VARCHAR | Image title |
| link | VARCHAR | Page URL |
| image_url | VARCHAR | Direct image URL |
| thumbnail_url | VARCHAR | Thumbnail URL |
| width | INTEGER | Image width |
| height | INTEGER | Image height |
| thumbnail_width | INTEGER | Thumbnail width |
| thumbnail_height | INTEGER | Thumbnail height |
| context_link | VARCHAR | Page containing image |
| mime | VARCHAR | Image MIME type |
| snippet | VARCHAR | Description |
| Parameter | Description | Example |
|---|---|---|
| exact_terms | Exact phrase match | exact_terms:='machine learning' |
| exclude_terms | Exclude words | exclude_terms:='beginner' |
| or_terms | OR terms | or_terms:='python rust' |
| file_type | File type filter | file_type:='pdf' |
| country | Country boost (gl) | country:='us' |
| cr | Country restrict | cr:='countryUS' |
| lr | Language restrict | lr:='lang_en' |
| language | Language restrict (alias) | language:='lang_en' |
| safe | Safe search | safe:=true |
| rights | Usage rights | rights:='cc_publicdomain' |
| sort | Sort/bias by structured data | sort:='date-sdate:d' |
| structured_data | Filter by pagemap | structured_data:='more:pagemap:document-author:john' |
| Parameter | Values |
|---|---|
| img_size | huge, icon, large, medium, small, xlarge, xxlarge |
| img_type | clipart, face, lineart, stock, photo, animated |
| img_color_type | color, gray, mono, trans |
| img_dominant_color | black, blue, brown, gray, green, orange, pink, purple, red, teal, white, yellow |
Use ORDER BY date to sort results by Google's estimated page date:
-- Sort by date (newest first)
SELECT title, link FROM google_search('technology news')
ORDER BY date DESC LIMIT 10;
-- Sort by date (oldest first)
SELECT title, link FROM google_search('historical events')
ORDER BY date ASC LIMIT 10;The ORDER BY date clause is pushed down to the Google API (&sort=date), affecting which results are returned.
Note: The date column is NULL in results - Google estimates the date from page features (URL, title, byline) for sorting but doesn't return it in the API response.
For more control, use the sort parameter directly:
-- Bias strongly towards newer dates (keeps older results too)
SELECT * FROM google_search('oil spill', sort:='date:d:s') LIMIT 10;
-- Bias weakly towards older dates
SELECT * FROM google_search('history', sort:='date:a:w') LIMIT 10;
-- Date range: Jan 1 to Feb 1, 2024
SELECT * FROM google_search('news', sort:='date:r:20240101:20240201') LIMIT 10;Google Custom Search can filter results based on structured data embedded in web pages (PageMaps, meta tags, JSON-LD, Microdata, RDFa).
See: https://developers.google.com/custom-search/docs/structured_search
Use the structured_data parameter to filter by pagemap attributes:
-- Filter by author
SELECT * FROM google_search('machine learning',
structured_data:='more:pagemap:document-author:smith')
LIMIT 10;
-- Filter by metatag
SELECT * FROM google_search('news',
structured_data:='more:pagemap:metatags-og\\:type:article')
LIMIT 10;Filter syntax: more:pagemap:TYPE-NAME:VALUE
For advanced sorting beyond date, use the sort named parameter:
-- Sort by rating with strong bias (keeps results without rating)
SELECT * FROM google_search('product reviews',
sort:='review-rating:d:s')
LIMIT 20;
-- Filter by date range (YYYYMMDD format)
SELECT * FROM google_search('news',
sort:='date:r:20240101:20240331')
LIMIT 20;Sort syntax: TYPE-NAME:DIRECTION:STRENGTH
| Direction | Meaning |
|---|---|
:d |
Descending (default) |
:a |
Ascending |
:r:LOW:HIGH |
Range filter |
| Strength | Meaning |
|---|---|
:h |
Hard sort (default) |
:s |
Strong bias |
:w |
Weak bias |
The pagemap column is JSON type containing page metadata extracted by Google.
Limitation: Google only extracts limited structured data types. Product/Offer schema (JSON-LD, microdata) is not returned in API results. Available data typically includes:
metatags- OpenGraph, Twitter cards, meta tagscse_image- Main page imagecse_thumbnail- Thumbnail image
Use arrow operators to extract data:
-- Get metatag info (commonly available)
SELECT
title,
link,
pagemap->'metatags'->0->>'og:title' AS og_title,
pagemap->'metatags'->0->>'og:description' AS description,
pagemap->'cse_image'->0->>'src' AS image
FROM google_search('news')
LIMIT 5;
-- Check available keys in pagemap
SELECT title, json_keys(pagemap) as available_keys
FROM google_search('shop')
LIMIT 5;Arrow operators:
->returns JSON->>returns VARCHAR (text)->0accesses array index 0
Export URL patterns to Google Programmable Search Engine annotation XML format.
See: https://developers.google.com/custom-search/docs/annotations
-- Basic export (url_pattern, action)
COPY (
SELECT url_pattern, action FROM my_sites
) TO 'annotations.xml' (FORMAT google_pse_annotation);
-- With comments
COPY (
SELECT url_pattern, action, comment FROM my_sites
) TO 'annotations.xml' (FORMAT google_pse_annotation);
-- With score (-1.0 to 1.0 for ranking bias)
COPY (
SELECT
'*.example.com/*' as url_pattern,
'include' as action,
'Main site' as comment,
1.0 as score
) TO 'annotations.xml' (FORMAT google_pse_annotation);| Column | Type | Required | Description |
|---|---|---|---|
| url_pattern | VARCHAR | Yes | URL pattern with wildcards (e.g., *.example.com/*) |
| action | VARCHAR | Yes | 'include' or 'exclude' |
| comment | VARCHAR | No | Optional comment |
| score | DOUBLE | No | Ranking bias from -1.0 to 1.0 (default: 1.0) |
Google PSE enforces these limits:
- Max 5,000 annotations per file
- Max 30KB file size
The extension validates these limits during export and fails with an error if exceeded.
<?xml version="1.0" encoding="UTF-8"?>
<Annotations>
<Annotation about="*.example.com/*" score="1.0">
<Label name="_include_"/>
<Comment>Main site</Comment>
</Annotation>
</Annotations>- Google returns max 10 results per API call
- Extension automatically paginates to fulfill LIMIT
- Max 100 results per query (Google API limit)
- LIMIT ≤ 100: Single query with
(site:a OR site:b)syntax - LIMIT > 100: Separate queries per site (up to 100 each), round-robin
WHERE clause filters are pushed down to the API:
site = 'x.com'→siteSearch=x.comsite IN (...)→ OR syntax or per-site queriessite != 'x.com'→-site:x.comin querylanguage = 'fi'→lr=lang_ficountry = 'US'→cr=countryUSfile_type = 'pdf'→fileType=pdfexact_match = 'phrase'→exactTerms=phraseterm IN ('a', 'b')→orTerms=a bterm != 'x'→excludeTerms=x
LIMIT is pushed down to minimize API calls.
# Clone with submodules
git clone --recursive https://github.com/midwork-finds-jobs/duckdb-web-search.git
cd duckdb-web-search
# Build
make release GEN=ninja VCPKG_TOOLCHAIN_PATH=$(pwd)/vcpkg/scripts/buildsystems/vcpkg.cmake
# Test
./build/release/duckdb -c "LOAD 'build/release/extension/google_search/web_search.duckdb_extension';"- DuckDB (via git submodule)
- http_request extension (auto-loaded)
- json extension (auto-loaded, for
pagemapcolumn)
MIT