Reperio is a high-performance visualization utility for Apache Nutch data structures including CrawlDB, LinkDB, and HostDB.
Key Features:
- 🚀 Fast: WebGL-powered visualization using Sigma.js for millions of nodes
- 🔗 Integrates with Nutch: Works with Nutch tool exports (readdb, readlinkdb, readhostdb)
- 🌐 Flexible: Supports compressed/uncompressed SequenceFiles, Nutch exports, local filesystem, and HDFS
- 📊 Comprehensive: Built-in graph analysis (PageRank, centrality, components)
- 🎨 Interactive: Real-time filtering, search, and exploration
- 🔧 Developer-Friendly: REST API, CLI, and web interface
┌─────────────────────────────────────────────────────────────────────────┐
│ Data Source Layer │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │ Local FS │ │ Nutch Database │ │
│ │ HDFS │ │ /crawldb/ │ │
│ └─────────────┘ │ current/ │ │
│ │ │ part-r-00000/data │ │
│ │ │ part-r-00001/data │ │
│ │ │ part-r-00002/data │ │
│ │ └──────────────────────────┘ │
└─────────┼────────────────────────────────────────────┬─────────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────────────┐
│ FileSystem │◀─────────────────────────│ Partition Discovery │
│ Manager │ │ (Auto-detect all │
└──────────────┘ │ part-r-* files) │
│ └──────────────────────┘
│ │
▼ │
┌───────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Reader Layer (with Factory) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ create_nutch_reader() ← Factory Function │ │
│ │ • Detects single vs multi-partition │ │
│ │ • Returns appropriate reader │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │ │
│ Single File│ │Multi-Partition │
│ ▼ ▼ │
│ ┌───────────────────┐ ┌────────────────────────┐ │
│ │ SequenceFile │ │ NutchDatabaseReader │ │
│ │ Reader │ │ • Aggregates partitions│ │
│ │ • Single partition│ │ • Progress reporting │ │
│ │ • Compression │ │ • Sequential reading │ │
│ └───────────────────┘ └────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│ │
└──────────────┬───────────────┘
▼
┌───────────────┐
│ Data Parser │
│ (CrawlDB/ │
│ LinkDB/ │
│ HostDB) │
└───────────────┘
│
▼
┌───────────────┐
│ Graph Builder │
│ (NetworkX) │
└───────────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌──────────────┐ ┌──────────────┐
│ FastAPI │ │ CLI │ │ Exporter │
│ API │ │ Commands │ │ (JSON/GEXF) │
│ • REST API │ │ • serve │ └──────────────┘
│ • Graph data │ │ • web │
└───────────────┘ └──────────────┘
│
▼
┌───────────────┐
│ React + Vite │
│ Frontend │
│ (Sigma.js) │
└───────────────┘
- Python 3.10 or higher
- Node.js 18+ and npm (for frontend development)
- Optional: HDFS access (with pyarrow) for reading from Hadoop clusters
# Clone the repository
git clone https://github.com/lewismc/reperio.git
cd reperio
# Create a virtual environment
python3 -m venv venv
# Activate the environment
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
# Install reperio
pip install -e .
# Optional: Install with HDFS support
pip install -e ".[hdfs]"
# Upgrade Typer if needed (requires 0.21+)
pip install --upgrade 'typer>=0.21.0'
Always activate the virtual environment before running commands:
# Activate the environment
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
# Run reperio commands
reperio --help
# Load single dataset
reperio web --crawldb path/to/crawldb
# Load multiple datasets (recommended!)
reperio web --crawldb path/to/crawldb --linkdb path/to/linkdb
# Deactivate when done
deactivate
The frontend provides interactive graph visualization using React, Vite, and Sigma.js.
# Navigate to frontend directory
cd frontend
# Install dependencies
npm install
# Build the frontend for production (required for visualization)
npm run build
# This creates a 'dist' folder that the backend serves
Important: The frontend must be built before you can use interactive visualization. If you see the warning "Frontend not built" when running reperio web, run npm run build in the frontend directory.
Development mode (optional, for frontend development only):
# Run development server with hot reload
npm run dev
Nutch handles statistics:
# CrawlDB statistics
nutch readdb /path/to/crawldb -stats
# LinkDB statistics
nutch readlinkdb /path/to/linkdb -stats
# HostDB statistics
nutch readhostdb /path/to/hostdb -stats
Reperio handles visualization:
- Export Nutch data as graph files
- Interactive web-based visualization
- Graph analysis (PageRank, centrality, etc.)
Prerequisites: Build the frontend first (see Frontend Installation)
Start the web UI with your data in one command:
# Load single dataset
reperio web --crawldb /path/to/crawldb
# Load multiple datasets (recommended!)
reperio web --crawldb /path/to/crawldb --linkdb /path/to/linkdb
# Load all three datasets
reperio web --crawldb /path/to/crawldb --linkdb /path/to/linkdb --hostdb /path/to/hostdb
# From HDFS (supports multi-partition)
reperio web --crawldb hdfs://namenode:9000/nutch/crawldb --linkdb hdfs://namenode:9000/nutch/linkdb
# With limits for large datasets
reperio web --crawldb /path/to/crawldb --max-records 50000
# Without opening browser
reperio web --crawldb /path/to/crawldb --no-open
Why load multiple datasets? Loading both CrawlDB and LinkDB together provides the complete picture of your crawl: nodes (pages) from CrawlDB and links between them from LinkDB. The frontend allows you to switch between datasets interactively.
Or start just the API server:
# Server loads data and exposes REST API
reperio serve --crawldb /path/to/crawldb --port 8000
# Load multiple datasets
reperio serve --crawldb /path/to/crawldb --linkdb /path/to/linkdb --port 8000
# Access API docs at: http://localhost:8000/docs
Features:
- ✓ Multi-partition support: Automatically reads all partitions in a Nutch database directory
- ✓ Compressed files: Supports DefaultCodec, Gzip, BZip2. See Compression Guide
- ✓ Progress reporting: Shows which partition is being processed for large databases
- ✓ Interactive visualization: WebGL-based graph rendering with Sigma.js
About the Visualization:
The reperio web command provides interactive graph visualization through a React + Sigma.js frontend:
- Multiple Datasets: Load and switch between CrawlDB, LinkDB, and HostDB interactively
- Nodes: URLs from your Nutch data, sized and colored by attributes (status, score, etc.)
- Edges: Link relationships (for LinkDB data)
- Dataset Switcher: When multiple datasets are loaded, use the dropdown in the control panel to switch between them
- Interactive controls: Pan, zoom, node selection, filtering
- Layout: Force-directed positioning using ForceAtlas2 algorithm
- Node details: Click any node to see metadata (status, score, fetch time, etc.)
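The force-directed positioning mentioned above can also be previewed server-side. A small sketch using NetworkX's spring (Fruchterman-Reingold) layout as a stand-in for ForceAtlas2 (illustrative only; the frontend itself runs ForceAtlas2 in the browser):

```python
import networkx as nx

# Toy crawl graph: three pages with links between them
G = nx.DiGraph()
G.add_edges_from([
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
])

# Force-directed layout: each node gets an (x, y) position
pos = nx.spring_layout(G, seed=42)
for url, (x, y) in pos.items():
    print(f"{url}: ({x:.2f}, {y:.2f})")
```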
Important: If you see "Frontend not built" warning, the visualization won't work. Build it first:
cd frontend
npm run build
Without the built frontend, you'll only see the API documentation at http://localhost:8000/docs, but the REST API endpoints will still work for programmatic access.
The serve command loads your data and exposes it via REST API:
# Start server with data loaded
reperio serve --linkdb /path/to/linkdb --port 8000
# Or load multiple datasets
reperio serve --crawldb /path/to/crawldb --linkdb /path/to/linkdb --port 8000
API endpoints available:
- GET /api/graph - Full graph data
- GET /api/nodes - Node list
- GET /api/edges - Edge list
- GET /api/summary - Graph statistics
API documentation: http://localhost:8000/docs
Example API calls:
import requests

# First, start the server with data loaded:
# $ reperio serve --crawldb /path/to/crawldb --linkdb /path/to/linkdb --port 8000

# Check loaded datasets
datasets = requests.get('http://localhost:8000/api/datasets').json()

# Activate a specific dataset
requests.post('http://localhost:8000/api/datasets/linkdb/activate')

# Get graph summary
summary = requests.get('http://localhost:8000/api/graph/summary').json()
print(f"Nodes: {summary['num_nodes']}, Edges: {summary['num_edges']}")

# Get all nodes
nodes = requests.get('http://localhost:8000/api/graph/nodes').json()

# Get all edges
edges = requests.get('http://localhost:8000/api/graph/edges').json()

# Get full graph data
graph_data = requests.get('http://localhost:8000/api/graph').json()
print(f"Graph has {len(graph_data['nodes'])} nodes and {len(graph_data['edges'])} edges")
Nutch databases are typically split into multiple partition files for distributed processing. Reperio automatically discovers and reads all partitions, giving you a complete view of your data.
Nutch Database Structure:
crawldb/
current/
part-r-00000/
data
part-r-00001/
data
part-r-00002/
data
...
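Partition discovery amounts to globbing for part-r-* entries under the database root. A minimal, illustrative sketch of the idea (discover_partitions is a hypothetical helper, not Reperio's actual implementation):

```python
from pathlib import Path

def discover_partitions(db_path: str) -> list[Path]:
    """Find all partition data files under a Nutch database directory.

    Accepts the database root (e.g. crawldb/), the current/ directory,
    or a single partition's data file.
    """
    root = Path(db_path)
    if root.is_file():
        # pointed directly at one partition's data file
        return [root]
    if (root / "current").is_dir():
        # database root -> descend into current/
        root = root / "current"
    # each partition is a part-r-NNNNN directory containing a 'data' file
    return sorted(p / "data" for p in root.glob("part-r-*") if (p / "data").is_file())
```

Pointing at the database root, the current/ directory, or a single data file then behaves exactly like the three equivalent CLI invocations shown below.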
Usage - All are equivalent:
# Point to database root (recommended - reads all partitions)
reperio web --crawldb /path/to/crawldb
# Point to current directory
reperio web --crawldb /path/to/crawldb/current
# Point to single partition (reads only that partition)
reperio web --crawldb /path/to/crawldb/current/part-r-00000/data
Progress Reporting:
When reading multiple partitions, Reperio shows real-time progress:
Loading crawldb from: /path/to/crawldb
Found 8 partition file(s)
Building graph...
Reading partition 1/8: part-r-00000
Reading partition 2/8: part-r-00001
Reading partition 3/8: part-r-00002
...
✓ Graph built: 150,000 nodes, 2,500,000 edges
Behavior with --max-records:
The limit applies to the total across all partitions:
# Read up to 10,000 records per dataset from all partitions
reperio web --crawldb /path/to/crawldb --max-records 10000
- CrawlDB: URL-level crawl status, scores, and metadata
- LinkDB: Link graph with inbound links and anchor text
- HostDB: Host-level statistics and aggregations
- Nutch Tool Exports (Recommended): Output from readdb, readlinkdb, readhostdb
- ✅ Handles all compression formats
- ✅ Production-ready
- ✅ Well-tested
- SequenceFiles (Direct): Compressed and uncompressed SequenceFiles
- ✅ Compression support (DefaultCodec, Gzip, BZip2)
- ✅ Fast direct access
- ✅ Quick for development/testing
- Local Filesystem: Read from local disk
- HDFS: Direct access to Hadoop Distributed File System
- Auto-detection: Automatically detects format and storage type
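Format auto-detection can rely on the file header: Hadoop SequenceFiles begin with the magic bytes SEQ followed by a version byte. A minimal sketch of that check (looks_like_sequencefile is a hypothetical helper, not Reperio's actual detection code):

```python
def looks_like_sequencefile(path: str) -> bool:
    """Return True if the file starts with the Hadoop SequenceFile
    magic bytes (b'SEQ', followed by a version byte)."""
    with open(path, "rb") as f:
        return f.read(3) == b"SEQ"
```

Files that fail this check can then be treated as plain-text Nutch tool exports.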
- PageRank: Identify important pages
- Centrality: In-degree and out-degree centrality
- Connected Components: Find strongly/weakly connected subgraphs
- Host-level Analysis: Extract and analyze host relationships
- Filtering: By status, domain, score range
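Because the graph is assembled with NetworkX (see the architecture diagram), the analyses above map to standard NetworkX calls. A small, self-contained sketch on a toy link graph (illustrative, not Reperio's code):

```python
import networkx as nx

# Toy link graph: edges point from linking page to linked page
G = nx.DiGraph()
G.add_edges_from([
    ("http://a.example/", "http://b.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://a.example/"),
    ("http://d.example/", "http://a.example/"),
])

# PageRank: identify important pages
ranks = nx.pagerank(G)

# In-degree and out-degree centrality
in_c = nx.in_degree_centrality(G)
out_c = nx.out_degree_centrality(G)

# Weakly connected components
components = list(nx.weakly_connected_components(G))

print(max(ranks, key=ranks.get))  # page with the highest PageRank
```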
- Interactive Graph: Pan, zoom, and click nodes for details
- WebGL Rendering: Smooth performance with 100k+ nodes
- Customizable Layout: Force-directed, hierarchical layouts
- Real-time Search: Find nodes by URL pattern
- Export: JSON, Sigma.js format, GEXF, GraphML
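Exported GEXF files can be consumed by other tools such as Gephi or NetworkX. A round-trip sketch (the toy graph here is a stand-in for a real exported crawl graph):

```python
import networkx as nx

# Toy stand-in for an exported crawl graph
H = nx.DiGraph()
H.add_edge("http://a.example/", "http://b.example/", anchor="example link")
nx.write_gexf(H, "crawldb-graph.gexf")

# Read it back the way you would read a file produced by a GEXF export
G = nx.read_gexf("crawldb-graph.gexf")
print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```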
Reperio works best with Apache Nutch's built-in tools. See the Nutch Integration Guide for complete details.
Quick example:
# Step 1: Use Nutch for statistics
nutch readdb /path/to/crawldb -stats
# Step 2: Visualize with Reperio (one command!)
reperio web --crawldb /path/to/crawldb --linkdb /path/to/linkdb
# Alternative: Export to file for Gephi or archiving
reperio export /path/to/crawldb crawldb-graph.gexf --type crawldb --format gexf
Create a .env file in the project root:
# HDFS Configuration
HDFS_NAMENODE=namenode.example.com
HDFS_PORT=9000
HADOOP_CONF_DIR=/etc/hadoop/conf
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
# Cache Configuration
CACHE_ENABLED=true
Or use environment variables:
export HDFS_NAMENODE=namenode.example.com
export HDFS_PORT=9000
reperio serve --crawldb /path/to/crawldb
The Makefile contains targets for faster development.
Install all dependencies and pre-commit hooks
Install requirements:
make install
Pre-commit hooks can be installed after git init via:
make pre-commit-install
Codestyle and type checks
Automatic formatting uses ruff.
make polish-codestyle
# or use the alias
make formatting
Codestyle checks only, without rewriting files:
make check-codestyle
Note: check-codestyle uses the ruff and darglint libraries.
Code security
If this option was not selected during installation, the command cannot be used.
make check-safety
This command identifies security issues with Safety and Bandit.
Tests with coverage badges
Run pytest
make test
All linters
There is a single command to run all linters at once:
make lint
which is the same as:
make check-codestyle && make test && make check-safety
Docker
make docker-build
which is equivalent to:
make docker-build VERSION=latest
Remove the Docker image with:
make docker-remove
More information about Docker.
Cleanup
Delete pycache files
make pycache-remove
Remove the package build:
make build-remove
Delete .DS_STORE files:
make dsstore-remove
Remove .mypy_cache:
make mypycache-remove
Or to remove all of the above, run:
make cleanup
If you prefer to use Poetry for dependency management, see docs/POETRY.md for detailed instructions.
- Troubleshooting Guide - Solutions for common issues including frontend setup
- Compression Support - Working with compressed SequenceFiles
- Nutch Integration - Best practices for Nutch data
- API Documentation - REST API reference
- Deployment Guide - Production deployment options
- Poetry Guide - Using Poetry for dependency management
This project is licensed under the terms of the Apache Software License 2.0 license. See LICENSE for more details.