Browse the old web through a caching Wayback Machine proxy. Redis-backed two-tier cache, admin interface, prefetch crawler, modem speed throttling, and header bar overlay. Built for museums and media art exhibitions.
For Choose Your Filter! at ZKM Karlsruhe (2025), we showed 30 years of artistic web browsers -- works like JODI's %WRONG Browser, I/O/D's Web Stalker, Maciej Wisniewski's netomat, and many others. These aren't static artworks. They need to fetch live web pages to function: Web Stalker strips a site down to its link structure, netomat dissolves pages into floating streams of text and image, %WRONG Browser turns any website into a JODI piece. To show them as they were meant to be experienced, you need the web pages they were built to browse -- pages from the late 1990s and early 2000s that only exist in the Wayback Machine now.
We used a non-caching Wayback proxy (richardg867/WaybackProxy) which worked well enough -- until the Internet Archive was hit by repeated DDoS attacks and a major data breach that took it offline for days. The aftermath left the Wayback Machine significantly slower and less stable well into 2025. Every artwork that depended on it was affected -- visitors saw blank screens and error pages instead of net art.
This proxy was built after that experience so it won't happen again. It fetches pages from the Wayback Machine once, stores them in Redis, and serves them locally from then on. The prefetch crawler can spider entire sites into the curated cache before an exhibition opens, so even a complete Wayback Machine outage won't take the artworks down.
- Configurable backend chain — chain multiple sources in any order: local API services, pywb WARC archives, Redis cache, and the Wayback Machine. Chain order decides priority.
- Caching proxy — fetches archived pages and caches them in Redis for fast, offline-capable serving
- Local API proxy — intercept requests to specific hostnames and forward them to a self-hosted replacement service, for APIs that no longer exist
- pywb integration — serve pages from local WARC files via pywb before falling back to remote sources
- Two-tier cache — permanent curated tier (admin/crawler managed) and auto-expiring hot tier (on-demand fetches)
- WARC export — export cache contents as .warc.gz files, diff against existing WARCs, and produce delta exports
- Admin interface — web UI for managing crawl seeds, cache, WARC export, and monitoring crawl progress with live updates
- Prefetch crawler — spider URLs from seed pages into curated cache before the exhibition opens
- Speed throttling — simulate period-accurate connection speeds (14.4k, 28.8k, 56k, ISDN, DSL) with visitor-selectable dropdown
- Header bar overlay — injected info bar showing current URL, archive date, and speed selector
- Landing page — styled start page with most-viewed domains
- Custom error pages — themed 403, 404, and generic error templates
- Allowlist mode — restrict browsing to pre-approved URLs
- Python 3.11+
- Redis 7+
- Poetry (for dependency management)
```shell
git clone https://github.com/zkmkarlsruhe/wayback-cache-proxy.git
cd wayback-cache-proxy/proxy
poetry install

# Start Redis
redis-server &

# Run the proxy with YAML config
cp config.example.yaml config.yaml
python -m wayback_proxy --config config.yaml

# Or with CLI flags
python -m wayback_proxy --port 8888 --date 20010911 --header-bar --admin
```

Then configure your browser's HTTP proxy to localhost:8888 and browse any URL.
```shell
cp config.example.yaml config.yaml
docker-compose up
```

This starts three services:
- Proxy on port 8888 — configure your browser to use this as an HTTP proxy
- Admin on port 8080 — open in your browser for remote management
- Redis on port 6379 — shared cache and state
All settings can be managed through a YAML config file. Copy the example and edit:
```shell
cp config.example.yaml config.yaml
```

See config.example.yaml for all available options with inline documentation.
| Flag | Default | Description |
|---|---|---|
| `--config` | | Path to YAML config file |
| `--port` | 8888 | Listen port |
| `--date` | 20010101 | Wayback target date (YYYYMMDD) |
| `--redis` | redis://localhost:6379/0 | Redis URL |
| `--header-bar` | off | Show overlay header bar |
| `--header-bar-position` | top | top or bottom |
| `--header-bar-text` | | Custom branding text |
| `--speed` | unlimited | Default throttle: 14.4k, 28.8k, 56k, isdn, dsl |
| `--speed-selector` | off | Let visitors pick speed via dropdown |
| `--admin` | off | Enable admin at /_admin/ |
| `--admin-password` | | Password for admin Basic Auth |
| `--allowlist` | off | Restrict to allowlisted domains |
| `--error-pages` | | Custom error page template directory |
| `--no-landing-page` | | Disable the landing page |
The backend chain controls where the proxy looks for archived pages, and in what order. Configure it in config.yaml:
```yaml
backends:
  chain:
    - type: pywb
      base_url: "http://localhost:8080"
      collection: "web"
    - type: cache
    - type: wayback
```

This tries pywb first (local WARC files), then Redis cache, then the Wayback Machine. The chain order decides priority — once a backend responds, the rest are skipped.
| Type | Description |
|---|---|
| `local` | Forward matching requests to a local HTTP service. Requires `hostnames` (fnmatch patterns). Optional `base_url` (default: http://localhost:9000) and `timeout` (default: 30s). Responses are not cached. |
| `pywb` | WARC replay via pywb. Requires `base_url`. Optional `collection` (default: web) and `mode` (default: replay). Multiple instances allowed. |
| `cache` | Redis cache lookup (curated tier first, then hot). |
| `wayback` | Wayback Machine (live internet). Optional `base_url` override. |
If the backends section is omitted, the default chain is cache -> wayback (original behavior).
The crawler only uses live backends (Wayback Machine) regardless of chain configuration -- pywb and cache backends are excluded from crawl fetches.
pywb backends support two modes:
| Mode | Description |
|---|---|
| `replay` | (default) Constructs /{collection}/{timestamp}id_/{url} URLs. For pywb instances serving local WARC files. |
| `proxy` | Sends requests through pywb as an HTTP proxy. For pywb instances configured with enable_http_proxy and memento proxy sequences. |
Replay mode — local WARC collections:

```yaml
- type: pywb
  base_url: "http://localhost:8080"
  collection: "web"  # pywb collection name
  # mode: replay     # default
```

Proxy mode — memento sequences (Rhizome, Internet Archive, LOC, etc.):
```yaml
- type: pywb
  base_url: "http://localhost:8089"
  mode: proxy
```

In proxy mode, pywb handles timestamp and source selection via its own config (e.g. default_timestamp, sequence of memento sources). The collection parameter is not used.
Each artwork or WARC collection can have its own pywb instance with different archive sources and sequence configurations. List multiple pywb entries in the chain — they are tried in order:
```yaml
backends:
  chain:
    - type: pywb
      base_url: "http://localhost:8089"  # artwork A — memento proxy (Rhizome → IA → LOC)
      mode: proxy
    - type: pywb
      base_url: "http://localhost:8090"  # artwork B — memento proxy (custom sequence)
      mode: proxy
    - type: pywb
      base_url: "http://localhost:8080"  # shared local WARCs
      collection: "web"
    - type: cache
    - type: wayback
```

Each pywb instance runs its own process (or Docker container) with its own config.yaml specifying which WARC files or remote archives to query. This is necessary because pywb does not support per-port proxy configurations — each distinct set of archive sources requires its own instance.
Some archived web pages depend on external APIs that no longer exist or have changed. The local backend intercepts requests to specific hostnames and forwards them to a self-hosted replacement service:
```yaml
backends:
  chain:
    - type: local
      base_url: "http://localhost:9000"
      hostnames:
        - "api.sounddogs.com"
        - "*.example.com"
      timeout: 30
    - type: cache
    - type: wayback
```

When a request matches a configured hostname pattern (using fnmatch-style wildcards), the path and query string are forwarded to the local service. For example, http://api.sounddogs.com/search?q=cat becomes http://localhost:9000/search?q=cat. The local service receives X-Original-Host and X-Original-URL headers so it can distinguish requests from different origins.
Local responses are passed through directly — they are not cached and not transformed. If no hostnames are configured, the backend acts as a catch-all for all requests.
When using --config, the proxy subscribes to a Redis Pub/Sub channel for live reload signals. The admin service publishes to this channel when you save config changes, so most settings take effect immediately without restarting the proxy.
A separate web application for remote management with a modern dark-themed UI:
```shell
# Standalone
cd admin_service && python -m admin_service --config ../config.yaml

# Or via Docker
docker-compose up admin
```

Features:
- Dashboard — cache stats, crawl status, most viewed domains
- Configuration — edit all settings through a web form, with live reload to the proxy
- Cache Browser — paginated list with search, delete individual entries, clear tiers
- Crawler — seed management, start/stop/recrawl, live log with htmx auto-refresh
- WARC Export — download cache as .warc.gz, diff against existing WARCs, export only new entries (delta)
Access at http://proxy-host:port/_admin/ (with Basic Auth if configured). This is an IE4-compatible interface embedded in the proxy, suitable for local/exhibition use.
- Crawl Seeds — add URLs with depth for prefetch crawling
- Crawl Control — start, stop, or force-recrawl (clears hot cache first)
- Crawl Log — live log of crawl progress
- Cache Management — view stats, delete individual URLs, clear hot cache
- Auto-Refresh — toggle button for live updates via XHR
```
Browser ──HTTP Proxy──> Proxy (port 8888) ──> Backend Chain
                          │                     ├── Local API service(s)
                          │                     ├── pywb instance(s) (local WARCs)
                          │                     ├── Redis cache (curated/hot)
                          │                     └── Wayback Machine
                          │
                          └── chain order decides priority, rest skipped

Browser ──HTTP──> Admin Service (port 8080)
                    ├── config.yaml (read/write)
                    ├── Redis (cache, crawl, seeds)
                    ├── WARC export/diff
                    └── Pub/Sub reload ──> Proxy
```
The proxy is a raw asyncio TCP server that speaks HTTP. When a request comes in:
1. Check the allowlist (if enabled) -- reject URLs not on the list
2. Walk the backend chain in configured order (e.g. pywb -> cache -> wayback) -- the chain order decides priority
3. Transform the content if needed -- pywb and cache responses are already clean; Wayback responses get toolbar/script removal and URL fixing
4. Store in hot cache if the response came from a live backend (Wayback Machine) -- pywb and cache hits skip this
5. Inject the header bar (if enabled) and throttle the response to simulate period-accurate connection speeds
The header bar is injected after the cache lookup, so cached pages don't need invalidation when you change header bar settings.
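The throttling step can be approximated by pacing writes: split the body into chunks and sleep between them so the average rate matches the selected line speed. A sketch with assumed nominal bit rates (the real logic lives in throttle.py):

```python
import asyncio

# Nominal line rates in bits per second (these values are assumptions).
SPEEDS = {"14.4k": 14_400, "28.8k": 28_800, "56k": 56_000,
          "isdn": 64_000, "dsl": 768_000}

async def throttled_send(body: bytes, speed: str, send) -> None:
    """Pass `body` to `send()` in chunks, pacing to roughly `speed`."""
    bps = SPEEDS[speed] // 8       # bytes per second
    chunk = max(1, bps // 10)      # ~10 writes per second
    for i in range(0, len(body), chunk):
        send(body[i:i + chunk])
        await asyncio.sleep(chunk / bps)
```

At 56k this streams a 10 KB page in roughly a second and a half, which is what makes period browsers feel period-accurate.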
- Curated -- permanent entries managed by the admin interface and prefetch crawler. These survive Redis restarts (with AOF persistence) and represent your vetted, exhibition-ready content.
- Hot -- auto-populated on cache miss, expires after 7 days (configurable). Acts as a working cache for pages visitors discover on their own.
The admin service can export Redis cache contents as standard .warc.gz files at /warc/:
- Export — download the full cache (or a filtered subset) as a WARC archive
- Diff — upload an existing .warc.gz and see which URLs are only in the cache, only in the WARC, or in both
- Delta export — upload a WARC and download only the URLs that are new in the cache
This lets you build up a WARC archive incrementally: run the proxy, let visitors browse, then export the new pages and merge them into your master WARC collection.
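The diff and delta steps reduce to set operations over URLs. A sketch under the assumption that entries are compared by URL (reading the actual records would go through a WARC library such as warcio):

```python
def diff_urls(cache_urls: set[str], warc_urls: set[str]) -> dict[str, set[str]]:
    """Compare cache contents against an existing WARC, keyed by URL."""
    return {
        "cache_only": cache_urls - warc_urls,  # the delta-export candidates
        "warc_only": warc_urls - cache_urls,
        "both": cache_urls & warc_urls,
    }

result = diff_urls({"/a", "/b", "/c"}, {"/b", "/d"})
print(sorted(result["cache_only"]))  # ['/a', '/c']
```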
```
proxy/                        # The proxy server
├── wayback_proxy/
│   ├── __main__.py           # CLI entry point
│   ├── config.py             # Dataclass config (YAML + env + CLI)
│   ├── server.py             # Async TCP server, request routing
│   ├── cache.py              # Redis two-tier cache
│   ├── admin.py              # Built-in /_admin/ interface
│   ├── crawler.py            # Async prefetch spider
│   ├── throttle.py           # Modem speed throttling
│   ├── warc_export.py        # WARC export, diff, delta export
│   └── wayback/
│       ├── backend.py        # Backend ABC, chain, cache backend, factory
│       ├── client.py         # Wayback Machine HTTP client
│       ├── local_client.py   # Local API proxy client
│       ├── pywb_client.py    # pywb replay client
│       └── transformer.py    # Content cleanup (toolbar, URLs, scripts)
├── error_pages/              # Error page templates
├── landing_page/             # Landing page template
└── Dockerfile

admin_service/                # Remote admin UI (FastAPI + htmx)
├── admin_service/
│   ├── __main__.py           # Uvicorn entry point
│   ├── app.py                # FastAPI app, auth middleware
│   ├── routes/               # Dashboard, config, cache, crawler, WARC export
│   ├── templates/            # Jinja2 + htmx templates
│   └── static/               # Dark theme CSS
└── Dockerfile
```
Contributions welcome! See CONTRIBUTING.md for guidelines.
MIT License — see LICENSE
This project is developed at ZKM | Center for Art and Media Karlsruhe, a publicly funded cultural institution exploring the intersection of art, science, and technology.
Copyright (c) 2026 ZKM | Karlsruhe