commoncrawl.cc

A search-focused web console and API proxy for exploring Common Crawl index data.


commoncrawl.cc makes Common Crawl index data easier to explore from the browser. It combines a fast web UI with a typed API proxy so you can inspect captures, timelines, and raw responses without manually stitching together index endpoints.

Screenshot: example search workspace exploring github.blog/* snapshots, timeline metadata, and capture inspection.

Why this project exists

Common Crawl is incredibly useful, but its index APIs are still fairly low-level for day-to-day exploration. commoncrawl.cc aims to provide a cleaner workflow for developers, researchers, SEO teams, archivists, and data engineers who need to:

  • search snapshot history for a URL
  • inspect capture timelines
  • fetch raw capture responses
  • experiment from a browser instead of ad-hoc scripts
  • build against a typed OpenAPI surface

Features

  • Search-focused UI for Common Crawl index exploration
  • Snapshot, timeline, and capture inspection workflows
  • Raw response preview for capture debugging
  • Cloudflare Worker API proxy for index.commoncrawl.org
  • Generated OpenAPI spec and typed web client
  • MSW-backed local mocking for frontend development
  • Cloudflare-based deployment workflow for API and web
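The proxy ultimately queries the Common Crawl CDX index, which accepts parameters such as `url`, `output=json`, and `limit`. As a hedged sketch (the crawl ID `CC-MAIN-2024-10` and the parameter set follow the public CDX server conventions, not necessarily this project's own API surface), building such a query might look like:

```typescript
// Build a Common Crawl CDX index query URL.
// NOTE: the crawl ID and parameter names here follow public CDX server
// conventions; this project's proxy may expose a different surface.
function buildCdxQuery(
  baseUrl: string,   // e.g. "https://index.commoncrawl.org"
  crawlId: string,   // e.g. "CC-MAIN-2024-10" (illustrative)
  targetUrl: string, // URL or pattern to look up
  limit = 10,
): string {
  const u = new URL(`${baseUrl}/${crawlId}-index`);
  u.searchParams.set("url", targetUrl);
  u.searchParams.set("output", "json"); // one JSON object per line
  u.searchParams.set("limit", String(limit));
  return u.toString();
}

const query = buildCdxQuery(
  "https://index.commoncrawl.org",
  "CC-MAIN-2024-10",
  "github.blog/*",
);
console.log(query);
```

The UI's snapshot search is, at its core, a friendlier front end over queries of this shape.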

Live endpoints

Sponsors

commoncrawl.cc is maintained as an independent open source project. Sponsorship helps fund ongoing maintenance, UX improvements, API hardening, documentation, and the time required to keep the project useful and free for the community.

If your company uses Common Crawl for search, SEO, archival, research, data enrichment, or LLM pipelines, sponsoring this project is a practical way to support the tooling around that ecosystem.

Sponsor commoncrawl.cc

No sponsors yet — your company can become the founding sponsor.

Sponsor visibility

  • Founding sponsor slot: top README placement
  • Project sponsor slot: sponsor section placement
  • Community sponsor slot: acknowledgement and support

A dedicated sponsor kit with tiers, logo guidelines, and company contact details can be added as the sponsorship program evolves.

Packages

  • packages/web — Preact + Vite frontend for search and capture exploration
  • packages/api — Cloudflare Worker proxy and OpenAPI source

Architecture

Browser UI (packages/web)
  -> API proxy (packages/api)
    -> index.commoncrawl.org

The web app consumes generated API clients based on the Worker's exported OpenAPI spec. That keeps the frontend and proxy contract aligned.
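The proxy hop above can be sketched as a small fetch handler that rewrites incoming request paths onto the upstream index host. This is an illustrative sketch, not the project's actual Worker code (which uses Hono); only the upstream host comes from the diagram above:

```typescript
const UPSTREAM = "https://index.commoncrawl.org";

// Map an incoming proxy request URL onto the upstream index host,
// preserving the path and query string. Pure function, easy to test.
function rewriteToUpstream(requestUrl: string): string {
  const incoming = new URL(requestUrl);
  return UPSTREAM + incoming.pathname + incoming.search;
}

// Worker-style fetch handler (sketch): forward the request upstream.
// A real proxy would also set CORS headers and cache responses.
async function handle(request: Request): Promise<Response> {
  return fetch(rewriteToUpstream(request.url), { method: request.method });
}

console.log(
  rewriteToUpstream("http://localhost:8787/CC-MAIN-2024-10-index?url=example.com"),
);
```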

Quick start

1) Install dependencies

pnpm install

2) Configure the web app

cp packages/web/.env.example packages/web/.env

3) Start the API

pnpm --filter @commoncrawl.cc/api dev

4) Start the web app

pnpm --filter @commoncrawl.cc/web dev

Then open the local URL that Vite prints. The web app expects the API at http://localhost:8787 by default.

Development

Build

pnpm --filter @commoncrawl.cc/api build
pnpm --filter @commoncrawl.cc/web build

Test

pnpm --filter @commoncrawl.cc/web test

Lint and format

pnpm lint
pnpm fmt:check

Sync OpenAPI artifacts

pnpm openapi:sync

This exports the API OpenAPI spec and regenerates the typed web client.
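A client built against the spec deserializes the index's newline-delimited JSON responses into typed records. As a minimal sketch, assuming field names from the public CDX server (`urlkey`, `timestamp`, `url`, `status`) rather than this project's actual generated types:

```typescript
// Shape of one CDX index line when queried with output=json.
// Field names follow the public CDX server; the generated client's
// types may differ.
interface CaptureRecord {
  urlkey: string;
  timestamp: string; // e.g. "20240215093011"
  url: string;
  status?: string;
  mime?: string;
  digest?: string;
}

// Parse newline-delimited JSON into typed capture records,
// skipping blank lines.
function parseCdxLines(body: string): CaptureRecord[] {
  return body
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as CaptureRecord);
}

const sample =
  '{"urlkey":"com,example)/","timestamp":"20240215093011","url":"https://example.com/","status":"200"}\n';
console.log(parseCdxLines(sample).length); // → 1
```

Generating the client from the exported spec (rather than hand-writing types like these) is what keeps the frontend and proxy contract aligned.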

Tech stack

  • Preact
  • Vite
  • preact-iso
  • Hono
  • Cloudflare Workers
  • Cloudflare Pages
  • Orval
  • MSW
  • pnpm workspace

Contributing

Issues and pull requests are welcome. If you find rough edges in the search workflow, timeline view, replay behavior, or API contract, feedback is especially valuable.

License

MIT