sudosubin/commoncrawl.cc

commoncrawl.cc

A search-focused web console and API proxy for exploring Common Crawl index data.

commoncrawl.cc makes Common Crawl index data easier to explore from the browser. It combines a fast web UI with a typed API proxy so you can inspect captures, timelines, and raw responses without manually stitching together index endpoints.

Example search workspace exploring github.blog/* snapshots, timeline metadata, and capture inspection.

Why this project exists

Common Crawl is incredibly useful, but its index APIs are low-level for day-to-day exploration. commoncrawl.cc aims to provide a cleaner workflow for developers, researchers, SEO teams, archivists, and data engineers who need to:

  • search snapshot history for a URL
  • inspect capture timelines
  • fetch raw capture responses
  • experiment from a browser instead of ad-hoc scripts
  • build against a typed OpenAPI surface
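
For context, here is a minimal sketch of what "manually stitching together index endpoints" can look like when talking to index.commoncrawl.org directly. It is not taken from this repository: the helper names `buildIndexUrl` and `parseCdxLines` and the `CdxRecord` shape are hypothetical, and the crawl ID in the usage note is only an example value.

```typescript
// Shape of one index record. Illustrative only: real records carry
// more fields than the ones listed here.
interface CdxRecord {
  urlkey: string;
  timestamp: string;
  url: string;
  status?: string;
  filename?: string;
  offset?: string;
  length?: string;
}

// Build a query URL for a single crawl's index.
// crawlId is a placeholder; available crawls are listed at /collinfo.json.
function buildIndexUrl(crawlId: string, urlPattern: string): string {
  const endpoint = new URL(`https://index.commoncrawl.org/${crawlId}-index`);
  endpoint.searchParams.set("url", urlPattern);
  endpoint.searchParams.set("output", "json");
  return endpoint.toString();
}

// With output=json the response body is newline-delimited JSON,
// one capture per line, so each non-empty line is parsed separately.
function parseCdxLines(body: string): CdxRecord[] {
  return body
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as CdxRecord);
}
```

A caller would fetch `buildIndexUrl("CC-MAIN-2024-10", "github.blog/*")`, read the body as text, and pass it to `parseCdxLines`. Repeating that per crawl is the kind of ad-hoc scripting the console is meant to replace.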

Features

  • Search-focused UI for Common Crawl index exploration
  • Snapshot, timeline, and capture inspection workflows
  • Raw response preview for capture debugging
  • Cloudflare Worker API proxy for index.commoncrawl.org
  • Generated OpenAPI spec and typed web client
  • MSW-backed local mocking for frontend development
  • Cloudflare-based deployment workflow for API and web
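
To illustrate what capture inspection involves upstream, the sketch below turns the `filename`, `offset`, and `length` fields of an index record into an HTTP Range request against data.commoncrawl.org. This is one plausible approach and not necessarily how this project implements raw response preview; `CaptureLocation` and `buildCaptureRequest` are hypothetical names.

```typescript
// The three index fields needed to locate a capture.
// They arrive as strings in the index output.
interface CaptureLocation {
  filename: string;
  offset: string;
  length: string;
}

// Compute the URL and Range header for one capture's bytes.
function buildCaptureRequest(loc: CaptureLocation): {
  url: string;
  headers: Record<string, string>;
} {
  const start = Number(loc.offset);
  const end = start + Number(loc.length) - 1; // Range end is inclusive
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end < start) {
    throw new RangeError("invalid offset or length");
  }
  return {
    url: `https://data.commoncrawl.org/${loc.filename}`,
    headers: { Range: `bytes=${start}-${end}` },
  };
}
```

The bytes returned for such a range are a gzip-compressed WARC record, so a caller would still need to decompress and parse it before showing the raw response.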

Live endpoints

Sponsors

commoncrawl.cc is maintained as an independent open source project. Sponsorship helps fund ongoing maintenance, UX improvements, API hardening, documentation, and the time required to keep the project useful and free for the community.

If your company uses Common Crawl for search, SEO, archival, research, data enrichment, or LLM pipelines, sponsoring this project is a practical way to support the tooling around that ecosystem.

Sponsor commoncrawl.cc

No sponsors yet — your company can become the founding sponsor.

Sponsor visibility

  • Founding sponsor slot: top README placement
  • Project sponsor slot: sponsor section placement
  • Community sponsor slot: acknowledgement and support

A dedicated sponsor kit with tiers, logo guidelines, and company contact details can be added as the sponsorship program evolves.

Packages

  • packages/web — Preact + Vite frontend for search and capture exploration
  • packages/api — Cloudflare Worker proxy and OpenAPI source

Architecture

Browser UI (packages/web)
  -> API proxy (packages/api)
    -> index.commoncrawl.org

The web app consumes generated API clients based on the Worker's exported OpenAPI spec. That keeps the frontend and proxy contract aligned.
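
The proxy hop in the diagram can be sketched as a plain fetch handler. This is an assumption-laden illustration, not the project's code: the real Worker is built on Hono and defines its own routes, while this version uses no framework, forwards any GET path unchanged, and adds a permissive CORS header purely as an example. `UPSTREAM_ORIGIN`, `toUpstreamUrl`, and `handleRequest` are hypothetical names.

```typescript
const UPSTREAM_ORIGIN = "https://index.commoncrawl.org";

// Map an incoming proxy URL to the upstream URL, keeping path and query.
// Path and query are assigned individually so the host can never change.
function toUpstreamUrl(requestUrl: string): string {
  const incoming = new URL(requestUrl);
  const target = new URL(UPSTREAM_ORIGIN);
  target.pathname = incoming.pathname;
  target.search = incoming.search;
  return target.toString();
}

// Fetch handler in the Workers style; fetchImpl is injectable for tests.
async function handleRequest(
  request: Request,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  if (request.method !== "GET") {
    return new Response("Method Not Allowed", { status: 405 });
  }
  const upstream = await fetchImpl(toUpstreamUrl(request.url));

  // Copy only the content type, then add a CORS header (assumed policy).
  const headers = new Headers();
  const contentType = upstream.headers.get("Content-Type");
  if (contentType) headers.set("Content-Type", contentType);
  headers.set("Access-Control-Allow-Origin", "*");

  return new Response(upstream.body, { status: upstream.status, headers });
}
```

In a Worker module, this handler would be exposed as `export default { fetch: handleRequest }`. A request to `http://localhost:8787/collinfo.json` would then be forwarded to `https://index.commoncrawl.org/collinfo.json`.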

Quick start

1) Install dependencies

pnpm install

2) Configure the web app

cp packages/web/.env.example packages/web/.env

3) Start the API

pnpm --filter @commoncrawl.cc/api dev

4) Start the web app

pnpm --filter @commoncrawl.cc/web dev

Then open:

The web app expects the API at http://localhost:8787 by default.

Development

Build

pnpm --filter @commoncrawl.cc/api build
pnpm --filter @commoncrawl.cc/web build

Test

pnpm --filter @commoncrawl.cc/web test

Lint and format

pnpm lint
pnpm fmt:check

Sync OpenAPI artifacts

pnpm openapi:sync

This exports the API OpenAPI spec and regenerates the typed web client.

Tech stack

  • Preact
  • Vite
  • preact-iso
  • Hono
  • Cloudflare Workers
  • Cloudflare Pages
  • Orval
  • MSW
  • pnpm workspace

Contributing

Issues and pull requests are welcome. If you find rough edges in the search workflow, timeline view, replay behavior, or API contract, feedback is especially valuable.

License

MIT
