
Wayback Cache Proxy


Browse the old web through a caching Wayback Machine proxy. Redis-backed two-tier cache, admin interface, prefetch crawler, modem speed throttling, and header bar overlay. Built for museums and media art exhibitions.


Why This Exists

For Choose Your Filter! at ZKM Karlsruhe (2025), we showed 30 years of artistic web browsers -- works like JODI's %WRONG Browser, I/O/D's Web Stalker, Maciej Wisniewski's netomat, and many others. These aren't static artworks. They need to fetch live web pages to function: Web Stalker strips a site down to its link structure, netomat dissolves pages into floating streams of text and image, %WRONG Browser turns any website into a JODI piece. To show them as they were meant to be experienced, you need the web pages they were built to browse -- pages from the late 1990s and early 2000s that only exist in the Wayback Machine now.

We used a non-caching Wayback proxy (richardg867/WaybackProxy) which worked well enough -- until the Internet Archive was hit by repeated DDoS attacks and a major data breach that took it offline for days. The aftermath left the Wayback Machine significantly slower and less stable well into 2025. Every artwork that depended on it was affected -- visitors saw blank screens and error pages instead of net art.

This proxy was built after that experience so it won't happen again. It fetches pages from the Wayback Machine once, stores them in Redis, and serves them locally from then on. The prefetch crawler can spider entire sites into the curated cache before an exhibition opens, so even a complete Wayback Machine outage won't take the artworks down.


Features

  • Configurable backend chain — chain multiple sources in any order: local API services, pywb WARC archives, Redis cache, and the Wayback Machine. Chain order decides priority.
  • Caching proxy — fetches archived pages and caches them in Redis for fast, offline-capable serving
  • Local API proxy — intercept requests to specific hostnames and forward them to a self-hosted replacement service, for APIs that no longer exist
  • pywb integration — serve pages from local WARC files via pywb before falling back to remote sources
  • Two-tier cache — permanent curated tier (admin/crawler managed) and auto-expiring hot tier (on-demand fetches)
  • WARC export — export cache contents as .warc.gz files, diff against existing WARCs, and produce delta exports
  • Admin interface — web UI for managing crawl seeds, cache, WARC export, and monitoring crawl progress with live updates
  • Prefetch crawler — spider URLs from seed pages into curated cache before the exhibition opens
  • Speed throttling — simulate period-accurate connection speeds (14.4k, 28.8k, 56k, ISDN, DSL) with visitor-selectable dropdown
  • Header bar overlay — injected info bar showing current URL, archive date, and speed selector
  • Landing page — styled start page with most-viewed domains
  • Custom error pages — themed 403, 404, and generic error templates
  • Allowlist mode — restrict browsing to pre-approved URLs

Quick Start

Prerequisites

  • Python 3.11+
  • Redis 7+
  • Poetry (for dependency management)

Installation

git clone https://github.com/zkmkarlsruhe/wayback-cache-proxy.git
cd wayback-cache-proxy/proxy

poetry install

Usage

# Start Redis
redis-server &

# Run the proxy with YAML config
cp config.example.yaml config.yaml
python -m wayback_proxy --config config.yaml

# Or with CLI flags
python -m wayback_proxy --port 8888 --date 20010911 --header-bar --admin

Then configure your browser's HTTP proxy to localhost:8888 and browse any URL.

Docker

cp config.example.yaml config.yaml
docker-compose up

This starts three services:

  • Proxy on port 8888 — configure your browser to use this as an HTTP proxy
  • Admin on port 8080 — open in your browser for remote management
  • Redis on port 6379 — shared cache and state

Configuration

All settings can be managed through a YAML config file. Copy the example and edit:

cp config.example.yaml config.yaml

See config.example.yaml for all available options with inline documentation.

CLI Options (Proxy)

| Flag | Default | Description |
| --- | --- | --- |
| `--config` | | Path to YAML config file |
| `--port` | `8888` | Listen port |
| `--date` | `20010101` | Wayback target date (`YYYYMMDD`) |
| `--redis` | `redis://localhost:6379/0` | Redis URL |
| `--header-bar` | off | Show overlay header bar |
| `--header-bar-position` | `top` | `top` or `bottom` |
| `--header-bar-text` | | Custom branding text |
| `--speed` | `unlimited` | Default throttle: `14.4k`, `28.8k`, `56k`, `isdn`, `dsl` |
| `--speed-selector` | off | Let visitors pick speed via dropdown |
| `--admin` | off | Enable admin at `/_admin/` |
| `--admin-password` | | Password for admin Basic Auth |
| `--allowlist` | off | Restrict to allowlisted domains |
| `--error-pages` | | Custom error page template directory |
| `--no-landing-page` | | Disable the landing page |

Backend Chain

The backend chain controls where the proxy looks for archived pages, and in what order. Configure it in config.yaml:

backends:
  chain:
    - type: pywb
      base_url: "http://localhost:8080"
      collection: "web"
    - type: cache
    - type: wayback

This tries pywb first (local WARC files), then Redis cache, then the Wayback Machine. The chain order decides priority — once a backend responds, the rest are skipped.
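The first-hit-wins walk over the chain can be sketched as follows. This is an illustrative model only; the project's actual backend classes and method names may differ.

```python
# First-hit-wins walk over the backend chain: try each backend in
# configured order, and stop at the first one that returns a response.
def fetch_via_chain(chain, url):
    for backend in chain:
        response = backend.fetch(url)
        if response is not None:
            return response  # remaining backends are skipped
    return None


# Stubs standing in for pywb, cache, and wayback backends:
class Miss:
    def fetch(self, url):
        return None

class Hit:
    def __init__(self, body):
        self.body = body

    def fetch(self, url):
        return self.body


chain = [Miss(), Hit(b"from cache"), Hit(b"from wayback")]
```

With this chain, `fetch_via_chain(chain, url)` returns `b"from cache"`: the first backend misses, the second hits, and the third is never consulted.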

| Type | Description |
| --- | --- |
| `local` | Forward matching requests to a local HTTP service. Requires `hostnames` (fnmatch patterns). Optional `base_url` (default: `http://localhost:9000`) and `timeout` (default: 30s). Responses are not cached. |
| `pywb` | WARC replay via pywb. Requires `base_url`. Optional `collection` (default: `web`) and `mode` (default: `replay`). Multiple instances allowed. |
| `cache` | Redis cache lookup (curated tier first, then hot). |
| `wayback` | Wayback Machine (live internet). Optional `base_url` override. |

If the backends section is omitted, the default chain is cache -> wayback (original behavior).

The crawler only uses live backends (Wayback Machine) regardless of chain configuration -- pywb and cache backends are excluded from crawl fetches.

pywb Modes

pywb backends support two modes:

| Mode | Description |
| --- | --- |
| `replay` (default) | Constructs `/{collection}/{timestamp}id_/{url}` URLs. For pywb instances serving local WARC files. |
| `proxy` | Sends requests through pywb as an HTTP proxy. For pywb instances configured with `enable_http_proxy` and memento proxy sequences. |
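The replay-mode URL construction is straightforward string assembly. A hypothetical helper (not the project's actual code) illustrates the pattern; the `id_` modifier asks pywb for the raw capture without pywb's own rewriting banner.

```python
def replay_url(base_url: str, collection: str, timestamp: str, url: str) -> str:
    """Build a pywb replay-mode URL: /{collection}/{timestamp}id_/{url}."""
    return f"{base_url}/{collection}/{timestamp}id_/{url}"


replay_url("http://localhost:8080", "web", "20010911", "http://example.com/")
# -> "http://localhost:8080/web/20010911id_/http://example.com/"
```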

Replay mode — local WARC collections:

- type: pywb
  base_url: "http://localhost:8080"
  collection: "web"          # pywb collection name
  # mode: replay             # default

Proxy mode — memento sequences (Rhizome, Internet Archive, LOC, etc.):

- type: pywb
  base_url: "http://localhost:8089"
  mode: proxy

In proxy mode, pywb handles timestamp and source selection via its own config (e.g. default_timestamp, sequence of memento sources). The collection parameter is not used.

Multiple pywb Instances

Each artwork or WARC collection can have its own pywb instance with different archive sources and sequence configurations. List multiple pywb entries in the chain — they are tried in order:

backends:
  chain:
    - type: pywb
      base_url: "http://localhost:8089"   # artwork A — memento proxy (Rhizome → IA → LOC)
      mode: proxy
    - type: pywb
      base_url: "http://localhost:8090"   # artwork B — memento proxy (custom sequence)
      mode: proxy
    - type: pywb
      base_url: "http://localhost:8080"   # shared local WARCs
      collection: "web"
    - type: cache
    - type: wayback

Each pywb instance runs its own process (or Docker container) with its own config.yaml specifying which WARC files or remote archives to query. This is necessary because pywb does not support per-port proxy configurations — each distinct set of archive sources requires its own instance.

Local API Proxy

Some archived web pages depend on external APIs that no longer exist or have changed. The local backend intercepts requests to specific hostnames and forwards them to a self-hosted replacement service:

backends:
  chain:
    - type: local
      base_url: "http://localhost:9000"
      hostnames:
        - "api.sounddogs.com"
        - "*.example.com"
      timeout: 30
    - type: cache
    - type: wayback

When a request matches a configured hostname pattern (using fnmatch-style wildcards), the path and query string are forwarded to the local service. For example, http://api.sounddogs.com/search?q=cat becomes http://localhost:9000/search?q=cat. The local service receives X-Original-Host and X-Original-URL headers so it can distinguish requests from different origins.

Local responses are passed through directly — they are not cached and not transformed. If no hostnames are configured, the backend acts as a catch-all for all requests.
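The matching and rewriting described above can be sketched with the standard library's `fnmatch` and `urllib.parse`. This is a simplified model of the documented behavior, not the project's actual implementation; the pattern list and base URL are taken from the example config above.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit, urlunsplit

LOCAL_HOSTNAMES = ["api.sounddogs.com", "*.example.com"]  # patterns from config
LOCAL_BASE = "http://localhost:9000"


def rewrite_for_local(url: str):
    """If the request host matches a configured pattern, return the
    rewritten local URL and the X-Original-* headers; otherwise None."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not any(fnmatch(host, pattern) for pattern in LOCAL_HOSTNAMES):
        return None
    base = urlsplit(LOCAL_BASE)
    # Keep path and query, swap in the local service's scheme and host:
    local = urlunsplit((base.scheme, base.netloc, parts.path, parts.query, ""))
    headers = {"X-Original-Host": host, "X-Original-URL": url}
    return local, headers
```

For example, `rewrite_for_local("http://api.sounddogs.com/search?q=cat")` yields `"http://localhost:9000/search?q=cat"` plus the two `X-Original-*` headers, while a non-matching host returns `None` and falls through to the rest of the chain.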

Live Config Reload

When using --config, the proxy subscribes to a Redis Pub/Sub channel for live reload signals. The admin service publishes to this channel when you save config changes, so most settings take effect immediately without restarting the proxy.


Admin Interfaces

FastAPI Admin Service (port 8080)

A separate web application for remote management with a modern dark-themed UI:

# Standalone
cd admin_service && python -m admin_service --config ../config.yaml

# Or via Docker
docker-compose up admin

Features:

  • Dashboard — cache stats, crawl status, most viewed domains
  • Configuration — edit all settings through a web form, with live reload to the proxy
  • Cache Browser — paginated list with search, delete individual entries, clear tiers
  • Crawler — seed management, start/stop/recrawl, live log with htmx auto-refresh
  • WARC Export — download cache as .warc.gz, diff against existing WARCs, export only new entries (delta)

Built-in Admin (/_admin/)

Access at http://proxy-host:port/_admin/ (with Basic Auth if configured). This is an IE4-compatible interface embedded in the proxy, suitable for local/exhibition use.

  • Crawl Seeds — add URLs with depth for prefetch crawling
  • Crawl Control — start, stop, or force-recrawl (clears hot cache first)
  • Crawl Log — live log of crawl progress
  • Cache Management — view stats, delete individual URLs, clear hot cache
  • Auto-Refresh — toggle button for live updates via XHR

How It Works

Browser  ──HTTP Proxy──>  Proxy (port 8888)  ──>  Backend Chain
                                │                   ├── Local API service(s)
                                │                   ├── pywb instance(s) (local WARCs)
                                │                   ├── Redis cache (curated/hot)
                                │                   └── Wayback Machine
                                │
                                └── chain order decides priority, rest skipped

Browser  ──HTTP──>  Admin Service (port 8080)
                         ├── config.yaml (read/write)
                         ├── Redis (cache, crawl, seeds)
                         ├── WARC export/diff
                         └── Pub/Sub reload ──> Proxy

The proxy is a raw asyncio TCP server that speaks HTTP. When a request comes in:

  1. Check the allowlist (if enabled) -- reject URLs not on the list
  2. Walk the backend chain in configured order (e.g. pywb -> cache -> wayback) -- the chain order decides priority
  3. Transform the content if needed -- pywb and cache responses are already clean; Wayback responses get toolbar/script removal and URL fixing
  4. Store in hot cache if the response came from a live backend (Wayback Machine) -- pywb and cache hits skip this
  5. Inject the header bar (if enabled) and throttle the response to simulate period-accurate connection speeds

The header bar is injected after the cache lookup, so cached pages don't need invalidation when you change header bar settings.
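Because the bar is injected on the way out, cached bodies stay pristine. A minimal sketch of such an injection, inserting before the closing `</body>` tag; the proxy's actual transformer may be more robust than this:

```python
import re


def inject_header_bar(html: str, bar_html: str) -> str:
    """Insert the overlay bar just before </body>; if the page has no
    </body> tag, append the bar at the end instead."""
    match = re.search(r"</body>", html, re.IGNORECASE)
    if match:
        return html[:match.start()] + bar_html + html[match.start():]
    return html + bar_html
```

Changing the bar's text or position then only changes this final step, never the cached entry.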

Two-Tier Cache

  • Curated -- permanent entries managed by the admin interface and prefetch crawler. These survive Redis restarts (with AOF persistence) and represent your vetted, exhibition-ready content.
  • Hot -- auto-populated on cache miss, expires after 7 days (configurable). Acts as a working cache for pages visitors discover on their own.
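The lookup order can be modeled as below. In the real proxy, Redis handles expiry via per-key TTLs; this in-memory toy is only illustrative of the curated-first semantics.

```python
import time


class TwoTierCache:
    """Toy model of the curated/hot split: curated entries are permanent,
    hot entries expire after a TTL."""

    HOT_TTL = 7 * 24 * 3600  # seconds; configurable in the real proxy

    def __init__(self):
        self.curated = {}  # url -> body (permanent)
        self.hot = {}      # url -> (body, expires_at)

    def get(self, url):
        if url in self.curated:          # curated tier is checked first
            return self.curated[url]
        entry = self.hot.get(url)
        if entry and entry[1] > time.time():
            return entry[0]
        return None                      # miss -> next backend in chain

    def store_hot(self, url, body):
        self.hot[url] = (body, time.time() + self.HOT_TTL)

    def store_curated(self, url, body):
        self.curated[url] = body
```

A curated entry always shadows a hot entry for the same URL, so promoting a page via the admin interface pins the vetted version.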

WARC Export

The admin service can export Redis cache contents as standard .warc.gz files at /warc/:

  • Export — download the full cache (or a filtered subset) as a WARC archive
  • Diff — upload an existing .warc.gz and see which URLs are only in the cache, only in the WARC, or in both
  • Delta export — upload a WARC and download only the URLs that are new in the cache

This lets you build up a WARC archive incrementally: run the proxy, let visitors browse, then export the new pages and merge them into your master WARC collection.
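The diff and delta semantics boil down to set operations over URL lists. A set-based sketch of the documented behavior (not the admin service's actual code):

```python
def warc_diff(cache_urls, warc_urls):
    """Classify URLs for the diff view: only in the cache (the delta
    export candidates), only in the uploaded WARC, or in both."""
    cache_urls, warc_urls = set(cache_urls), set(warc_urls)
    return {
        "only_in_cache": cache_urls - warc_urls,  # -> delta export
        "only_in_warc": warc_urls - cache_urls,
        "in_both": cache_urls & warc_urls,
    }
```

A delta export then writes WARC records only for the `only_in_cache` set, which is what keeps the incremental workflow cheap.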

Project Structure

proxy/                          # The proxy server
├── wayback_proxy/
│   ├── __main__.py             # CLI entry point
│   ├── config.py               # Dataclass config (YAML + env + CLI)
│   ├── server.py               # Async TCP server, request routing
│   ├── cache.py                # Redis two-tier cache
│   ├── admin.py                # Built-in /_admin/ interface
│   ├── crawler.py              # Async prefetch spider
│   ├── throttle.py             # Modem speed throttling
│   ├── warc_export.py          # WARC export, diff, delta export
│   └── wayback/
│       ├── backend.py          # Backend ABC, chain, cache backend, factory
│       ├── client.py           # Wayback Machine HTTP client
│       ├── local_client.py     # Local API proxy client
│       ├── pywb_client.py      # pywb replay client
│       └── transformer.py      # Content cleanup (toolbar, URLs, scripts)
├── error_pages/                # Error page templates
├── landing_page/               # Landing page template
└── Dockerfile

admin_service/                  # Remote admin UI (FastAPI + htmx)
├── admin_service/
│   ├── __main__.py             # Uvicorn entry point
│   ├── app.py                  # FastAPI app, auth middleware
│   ├── routes/                 # Dashboard, config, cache, crawler, WARC export
│   ├── templates/              # Jinja2 + htmx templates
│   └── static/                 # Dark theme CSS
└── Dockerfile

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.


License

MIT License — see LICENSE


Developed at

This project is developed at ZKM | Center for Art and Media Karlsruhe, a publicly funded cultural institution exploring the intersection of art, science, and technology.

ZKM

Copyright (c) 2026 ZKM | Karlsruhe
