
Wayback Cache Proxy


Browse the old web through a caching Wayback Machine proxy. Redis-backed two-tier cache, admin interface, prefetch crawler, modem speed throttling, and header bar overlay. Built for museums and media art exhibitions.


Why This Exists

For Choose Your Filter! at ZKM Karlsruhe (2025), we showed 30 years of artistic web browsers -- works like JODI's %WRONG Browser, I/O/D's Web Stalker, Maciej Wisniewski's netomat, and many others. These aren't static artworks. They need to fetch live web pages to function: Web Stalker strips a site down to its link structure, netomat dissolves pages into floating streams of text and image, %WRONG Browser turns any website into a JODI piece. To show them as they were meant to be experienced, you need the web pages they were built to browse -- pages from the late 1990s and early 2000s that only exist in the Wayback Machine now.

We used a non-caching Wayback proxy (richardg867/WaybackProxy) which worked well enough -- until the Internet Archive was hit by repeated DDoS attacks and a major data breach that took it offline for days. The aftermath left the Wayback Machine significantly slower and less stable well into 2025. Every artwork that depended on it was affected -- visitors saw blank screens and error pages instead of net art.

This proxy was built after that experience so it won't happen again. It fetches pages from the Wayback Machine once, stores them in Redis, and serves them locally from then on. The prefetch crawler can spider entire sites into the curated cache before an exhibition opens, so even a complete Wayback Machine outage won't take the artworks down.


Features

  • Configurable backend chain — chain multiple sources in any order: local API services, pywb WARC archives, Redis cache, and the Wayback Machine. Chain order decides priority.
  • Caching proxy — fetches archived pages and caches them in Redis for fast, offline-capable serving
  • Local API proxy — intercept requests to specific hostnames and forward them to a self-hosted replacement service, for APIs that no longer exist
  • pywb integration — serve pages from local WARC files via pywb before falling back to remote sources
  • Two-tier cache — permanent curated tier (admin/crawler managed) and auto-expiring hot tier (on-demand fetches)
  • WARC export — export cache contents as .warc.gz files, diff against existing WARCs, and produce delta exports
  • Admin interface — web UI for managing crawl seeds, cache, WARC export, and monitoring crawl progress with live updates
  • Prefetch crawler — spider URLs from seed pages into curated cache before the exhibition opens
  • Speed throttling — simulate period-accurate connection speeds (14.4k, 28.8k, 56k, ISDN, DSL) with visitor-selectable dropdown
  • Header bar overlay — injected info bar showing current URL, archive date, and speed selector
  • Landing page — styled start page with most-viewed domains
  • Custom error pages — themed 403, 404, and generic error templates
  • Allowlist mode — restrict browsing to pre-approved URLs

Quick Start

Prerequisites

  • Python 3.11+
  • Redis 7+
  • Poetry (for dependency management)

Installation

git clone https://github.com/zkmkarlsruhe/wayback-cache-proxy.git
cd wayback-cache-proxy/proxy

poetry install

Usage

# Start Redis
redis-server &

# Run the proxy with YAML config
cp config.example.yaml config.yaml
python -m wayback_proxy --config config.yaml

# Or with CLI flags
python -m wayback_proxy --port 8888 --date 20010911 --header-bar --admin

Then configure your browser's HTTP proxy to localhost:8888 and browse any URL.

Docker

cp config.example.yaml config.yaml
docker-compose up

This starts three services:

  • Proxy on port 8888 — configure your browser to use this as an HTTP proxy
  • Admin on port 8080 — open in your browser for remote management
  • Redis on port 6379 — shared cache and state

Configuration

All settings can be managed through a YAML config file. Copy the example and edit:

cp config.example.yaml config.yaml

See config.example.yaml for all available options with inline documentation.

CLI Options (Proxy)

| Flag | Default | Description |
| --- | --- | --- |
| `--config` | | Path to YAML config file |
| `--port` | `8888` | Listen port |
| `--date` | `20010101` | Wayback target date (`YYYYMMDD`) |
| `--redis` | `redis://localhost:6379/0` | Redis URL |
| `--header-bar` | off | Show overlay header bar |
| `--header-bar-position` | `top` | `top` or `bottom` |
| `--header-bar-text` | | Custom branding text |
| `--speed` | `unlimited` | Default throttle: `14.4k`, `28.8k`, `56k`, `isdn`, `dsl` |
| `--speed-selector` | off | Let visitors pick speed via dropdown |
| `--admin` | off | Enable admin at `/_admin/` |
| `--admin-password` | | Password for admin Basic Auth |
| `--allowlist` | off | Restrict to allowlisted domains |
| `--error-pages` | | Custom error page template directory |
| `--no-landing-page` | | Disable the landing page |

Backend Chain

The backend chain controls where the proxy looks for archived pages, and in what order. Configure it in config.yaml:

backends:
  chain:
    - type: pywb
      base_url: "http://localhost:8080"
      collection: "web"
    - type: cache
    - type: wayback

This tries pywb first (local WARC files), then Redis cache, then the Wayback Machine. The chain order decides priority — once a backend responds, the rest are skipped.
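The first-hit-wins walk over the chain can be sketched as follows. This is an illustrative model only; the project's actual backend classes and method names may differ.

```python
# First-hit-wins walk over the backend chain: try each backend in
# configured order, and stop at the first one that returns a response.
def fetch_via_chain(chain, url):
    for backend in chain:
        response = backend.fetch(url)
        if response is not None:
            return response  # remaining backends are skipped
    return None


# Stubs standing in for pywb, cache, and wayback backends:
class Miss:
    def fetch(self, url):
        return None

class Hit:
    def __init__(self, body):
        self.body = body

    def fetch(self, url):
        return self.body


chain = [Miss(), Hit(b"from cache"), Hit(b"from wayback")]
```

With this chain, `fetch_via_chain(chain, url)` returns `b"from cache"`: the first backend misses, the second hits, and the third is never consulted.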

| Type | Description |
| --- | --- |
| `local` | Forward matching requests to a local HTTP service. Requires `hostnames` (fnmatch patterns). Optional `base_url` (default: `http://localhost:9000`) and `timeout` (default: 30s). Responses are not cached. |
| `pywb` | WARC replay via pywb. Requires `base_url`. Optional `collection` (default: `web`) and `mode` (default: `replay`). Multiple instances allowed. |
| `cache` | Redis cache lookup (curated tier first, then hot). |
| `wayback` | Wayback Machine (live internet). Optional `base_url` override. |

If the backends section is omitted, the default chain is cache -> wayback (original behavior).

The crawler only uses live backends (Wayback Machine) regardless of chain configuration -- pywb and cache backends are excluded from crawl fetches.

pywb Modes

pywb backends support two modes:

| Mode | Description |
| --- | --- |
| `replay` (default) | Constructs `/{collection}/{timestamp}id_/{url}` URLs. For pywb instances serving local WARC files. |
| `proxy` | Sends requests through pywb as an HTTP proxy. For pywb instances configured with `enable_http_proxy` and memento proxy sequences. |
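The replay-mode URL construction is straightforward string assembly. A hypothetical helper (not the project's actual code) illustrates the pattern; the `id_` modifier asks pywb for the raw capture without pywb's own rewriting banner.

```python
def replay_url(base_url: str, collection: str, timestamp: str, url: str) -> str:
    """Build a pywb replay-mode URL: /{collection}/{timestamp}id_/{url}."""
    return f"{base_url}/{collection}/{timestamp}id_/{url}"


replay_url("http://localhost:8080", "web", "20010911", "http://example.com/")
# -> "http://localhost:8080/web/20010911id_/http://example.com/"
```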

Replay mode — local WARC collections:

- type: pywb
  base_url: "http://localhost:8080"
  collection: "web"          # pywb collection name
  # mode: replay             # default

Proxy mode — memento sequences (Rhizome, Internet Archive, LOC, etc.):

- type: pywb
  base_url: "http://localhost:8089"
  mode: proxy

In proxy mode, pywb handles timestamp and source selection via its own config (e.g. default_timestamp, sequence of memento sources). The collection parameter is not used.

Multiple pywb Instances

Each artwork or WARC collection can have its own pywb instance with different archive sources and sequence configurations. List multiple pywb entries in the chain — they are tried in order:

backends:
  chain:
    - type: pywb
      base_url: "http://localhost:8089"   # artwork A — memento proxy (Rhizome → IA → LOC)
      mode: proxy
    - type: pywb
      base_url: "http://localhost:8090"   # artwork B — memento proxy (custom sequence)
      mode: proxy
    - type: pywb
      base_url: "http://localhost:8080"   # shared local WARCs
      collection: "web"
    - type: cache
    - type: wayback

Each pywb instance runs its own process (or Docker container) with its own config.yaml specifying which WARC files or remote archives to query. This is necessary because pywb does not support per-port proxy configurations — each distinct set of archive sources requires its own instance.

Local API Proxy

Some archived web pages depend on external APIs that no longer exist or have changed. The local backend intercepts requests to specific hostnames and forwards them to a self-hosted replacement service:

backends:
  chain:
    - type: local
      base_url: "http://localhost:9000"
      hostnames:
        - "api.sounddogs.com"
        - "*.example.com"
      timeout: 30
    - type: cache
    - type: wayback

When a request matches a configured hostname pattern (using fnmatch-style wildcards), the path and query string are forwarded to the local service. For example, http://api.sounddogs.com/search?q=cat becomes http://localhost:9000/search?q=cat. The local service receives X-Original-Host and X-Original-URL headers so it can distinguish requests from different origins.

Local responses are passed through directly — they are not cached and not transformed. If no hostnames are configured, the backend acts as a catch-all for all requests.
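The matching and rewriting described above can be sketched with the standard library's `fnmatch` and `urllib.parse`. This is a simplified model of the documented behavior, not the project's actual implementation; the pattern list and base URL are taken from the example config above.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit, urlunsplit

LOCAL_HOSTNAMES = ["api.sounddogs.com", "*.example.com"]  # patterns from config
LOCAL_BASE = "http://localhost:9000"


def rewrite_for_local(url: str):
    """If the request host matches a configured pattern, return the
    rewritten local URL and the X-Original-* headers; otherwise None."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not any(fnmatch(host, pattern) for pattern in LOCAL_HOSTNAMES):
        return None
    base = urlsplit(LOCAL_BASE)
    # Keep path and query, swap in the local service's scheme and host:
    local = urlunsplit((base.scheme, base.netloc, parts.path, parts.query, ""))
    headers = {"X-Original-Host": host, "X-Original-URL": url}
    return local, headers
```

For example, `rewrite_for_local("http://api.sounddogs.com/search?q=cat")` yields `"http://localhost:9000/search?q=cat"` plus the two `X-Original-*` headers, while a non-matching host returns `None` and falls through to the rest of the chain.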

Live Config Reload

When using --config, the proxy subscribes to a Redis Pub/Sub channel for live reload signals. The admin service publishes to this channel when you save config changes, so most settings take effect immediately without restarting the proxy.


Admin Interfaces

FastAPI Admin Service (port 8080)

A separate web application for remote management with a modern dark-themed UI:

# Standalone
cd admin_service && python -m admin_service --config ../config.yaml

# Or via Docker
docker-compose up admin

Features:

  • Dashboard — cache stats, crawl status, most viewed domains
  • Configuration — edit all settings through a web form, with live reload to the proxy
  • Cache Browser — paginated list with search, delete individual entries, clear tiers
  • Crawler — seed management, start/stop/recrawl, live log with htmx auto-refresh
  • WARC Export — download cache as .warc.gz, diff against existing WARCs, export only new entries (delta)

Built-in Admin (/_admin/)

Access at http://proxy-host:port/_admin/ (with Basic Auth if configured). This is an IE4-compatible interface embedded in the proxy, suitable for local/exhibition use.

  • Crawl Seeds — add URLs with depth for prefetch crawling
  • Crawl Control — start, stop, or force-recrawl (clears hot cache first)
  • Crawl Log — live log of crawl progress
  • Cache Management — view stats, delete individual URLs, clear hot cache
  • Auto-Refresh — toggle button for live updates via XHR

How It Works

Browser  ──HTTP Proxy──>  Proxy (port 8888)  ──>  Backend Chain
                                │                   ├── Local API service(s)
                                │                   ├── pywb instance(s) (local WARCs)
                                │                   ├── Redis cache (curated/hot)
                                │                   └── Wayback Machine
                                │
                                └── chain order decides priority, rest skipped

Browser  ──HTTP──>  Admin Service (port 8080)
                         ├── config.yaml (read/write)
                         ├── Redis (cache, crawl, seeds)
                         ├── WARC export/diff
                         └── Pub/Sub reload ──> Proxy

The proxy is a raw asyncio TCP server that speaks HTTP. When a request comes in:

  1. Check the allowlist (if enabled) -- reject URLs not on the list
  2. Walk the backend chain in configured order (e.g. pywb -> cache -> wayback) -- the chain order decides priority
  3. Transform the content if needed -- pywb and cache responses are already clean; Wayback responses get toolbar/script removal and URL fixing
  4. Store in hot cache if the response came from a live backend (Wayback Machine) -- pywb and cache hits skip this
  5. Inject the header bar (if enabled) and throttle the response to simulate period-accurate connection speeds

The header bar is injected after the cache lookup, so cached pages don't need invalidation when you change header bar settings.
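Because the bar is injected on the way out, cached bodies stay pristine. A minimal sketch of such an injection, inserting before the closing `</body>` tag; the proxy's actual transformer may be more robust than this:

```python
import re


def inject_header_bar(html: str, bar_html: str) -> str:
    """Insert the overlay bar just before </body>; if the page has no
    </body> tag, append the bar at the end instead."""
    match = re.search(r"</body>", html, re.IGNORECASE)
    if match:
        return html[:match.start()] + bar_html + html[match.start():]
    return html + bar_html
```

Changing the bar's text or position then only changes this final step, never the cached entry.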

Two-Tier Cache

  • Curated -- permanent entries managed by the admin interface and prefetch crawler. These survive Redis restarts (with AOF persistence) and represent your vetted, exhibition-ready content.
  • Hot -- auto-populated on cache miss, expires after 7 days (configurable). Acts as a working cache for pages visitors discover on their own.
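The lookup order can be modeled as below. In the real proxy, Redis handles expiry via per-key TTLs; this in-memory toy is only illustrative of the curated-first semantics.

```python
import time


class TwoTierCache:
    """Toy model of the curated/hot split: curated entries are permanent,
    hot entries expire after a TTL."""

    HOT_TTL = 7 * 24 * 3600  # seconds; configurable in the real proxy

    def __init__(self):
        self.curated = {}  # url -> body (permanent)
        self.hot = {}      # url -> (body, expires_at)

    def get(self, url):
        if url in self.curated:          # curated tier is checked first
            return self.curated[url]
        entry = self.hot.get(url)
        if entry and entry[1] > time.time():
            return entry[0]
        return None                      # miss -> next backend in chain

    def store_hot(self, url, body):
        self.hot[url] = (body, time.time() + self.HOT_TTL)

    def store_curated(self, url, body):
        self.curated[url] = body
```

A curated entry always shadows a hot entry for the same URL, so promoting a page via the admin interface pins the vetted version.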

WARC Export

The admin service can export Redis cache contents as standard .warc.gz files at /warc/:

  • Export — download the full cache (or a filtered subset) as a WARC archive
  • Diff — upload an existing .warc.gz and see which URLs are only in the cache, only in the WARC, or in both
  • Delta export — upload a WARC and download only the URLs that are new in the cache

This lets you build up a WARC archive incrementally: run the proxy, let visitors browse, then export the new pages and merge them into your master WARC collection.
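The diff and delta semantics boil down to set operations over URL lists. A set-based sketch of the documented behavior (not the admin service's actual code):

```python
def warc_diff(cache_urls, warc_urls):
    """Classify URLs for the diff view: only in the cache (the delta
    export candidates), only in the uploaded WARC, or in both."""
    cache_urls, warc_urls = set(cache_urls), set(warc_urls)
    return {
        "only_in_cache": cache_urls - warc_urls,  # -> delta export
        "only_in_warc": warc_urls - cache_urls,
        "in_both": cache_urls & warc_urls,
    }
```

A delta export then writes WARC records only for the `only_in_cache` set, which is what keeps the incremental workflow cheap.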

Project Structure

proxy/                          # The proxy server
├── wayback_proxy/
│   ├── __main__.py             # CLI entry point
│   ├── config.py               # Dataclass config (YAML + env + CLI)
│   ├── server.py               # Async TCP server, request routing
│   ├── cache.py                # Redis two-tier cache
│   ├── admin.py                # Built-in /_admin/ interface
│   ├── crawler.py              # Async prefetch spider
│   ├── throttle.py             # Modem speed throttling
│   ├── warc_export.py          # WARC export, diff, delta export
│   └── wayback/
│       ├── backend.py          # Backend ABC, chain, cache backend, factory
│       ├── client.py           # Wayback Machine HTTP client
│       ├── local_client.py     # Local API proxy client
│       ├── pywb_client.py      # pywb replay client
│       └── transformer.py      # Content cleanup (toolbar, URLs, scripts)
├── error_pages/                # Error page templates
├── landing_page/               # Landing page template
└── Dockerfile

admin_service/                  # Remote admin UI (FastAPI + htmx)
├── admin_service/
│   ├── __main__.py             # Uvicorn entry point
│   ├── app.py                  # FastAPI app, auth middleware
│   ├── routes/                 # Dashboard, config, cache, crawler, WARC export
│   ├── templates/              # Jinja2 + htmx templates
│   └── static/                 # Dark theme CSS
└── Dockerfile

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.


License

MIT License — see LICENSE


Developed at

This project is developed at ZKM | Center for Art and Media Karlsruhe, a publicly funded cultural institution exploring the intersection of art, science, and technology.

ZKM

Copyright (c) 2026 ZKM | Karlsruhe
