# 🌸 Miku Miku Crawler 🌸

✨ A Kawaii Web Crawler with Real-Time Visualization ✨

A real-time web crawler with a Miku-themed UI and live visualization. Watch pages get crawled in real time, analyze content quality, and export structured data, all wrapped in a cute interface.

🚀 Live SSE streaming · 📊 Content analysis · 💾 Persistent storage · 🎨 Miku-themed UI

Inspired by MikuMikuBeam by Sammwy 💜
| | Feature |
|---|---|
| 📡 | **SSE streaming** – ordered, resumable events with sequence tracking |
| 🎭 | **Playwright** – renders JavaScript-heavy pages with headless Chromium |
| ⚡ | **Cheerio** – fast HTML extraction for static pages |
| 🤖 | **robots.txt** – optional compliance with crawl rules and crawl-delay |
| 🔀 | **Concurrency** – configurable parallel fetch workers |
| 🔁 | **Retry with backoff** – automatic retries on transient failures |
| 💾 | **Session resume** – interrupted crawls persist and resume from where they stopped |
| 🚦 | **Domain throttling** – per-domain rate limiting to be a polite crawler |
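The per-domain throttling idea can be sketched as a tiny reservation helper. This is an illustrative sketch only, not the project's actual `CrawlQueue` code; the class name and API are made up:

```typescript
// Hypothetical per-domain throttle: remember when each hostname was last
// reserved and compute how long the next fetch must wait to stay polite.
class DomainThrottle {
  private lastFetch = new Map<string, number>();

  constructor(
    private minDelayMs: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  // Returns the delay (ms) to wait before fetching this URL, and records
  // the reserved slot so concurrent workers stay spaced out per domain.
  reserve(url: string): number {
    const host = new URL(url).hostname;
    const t = this.now();
    const earliest = (this.lastFetch.get(host) ?? -Infinity) + this.minDelayMs;
    const start = Math.max(t, earliest);
    this.lastFetch.set(host, start);
    return start - t;
  }
}
```

With a 1000 ms delay, two back-to-back requests to the same domain space out by a second, while a request to a different domain goes through immediately.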
Every crawled page goes through a full analysis pipeline:
| | Analysis |
|---|---|
| 🔑 | **Keywords** – frequency-based extraction, filters stop words (EN/ES/FR/DE) |
| 🌍 | **Language** – detection via franc |
| 🎭 | **Sentiment** – custom lexicon-based analyzer |
| 📖 | **Readability** – Flesch-Kincaid scoring |
| ⭐ | **Quality** – title, meta, content length, headings, alt text, links |
| 🏗️ | **Structured data** – JSON-LD, Open Graph, Twitter Cards, microdata |
| 🖼️ | **Media** – images and videos with URLs and alt text |
| 🔗 | **Links** – classified as internal, external, social, or navigation |
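As an illustration of the readability step, a minimal Flesch scorer might look like the sketch below. The project's real `analysisUtils.ts` may use different syllable heuristics or the Flesch-Kincaid grade variant; this uses the classic reading-ease formula:

```typescript
// Rough syllable count: one syllable per contiguous vowel group, minimum 1.
function countSyllables(word: string): number {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 0);
}

// Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
// Higher scores mean easier text (90+ is very easy, below 30 is very hard).
function fleschReadingEase(text: string): number {
  const sentences = Math.max(1, (text.match(/[.!?]+/g) ?? []).length);
  const words = text.split(/\s+/).filter(Boolean);
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  const wordCount = Math.max(1, words.length);
  return 206.835 - 1.015 * (wordCount / sentences) - 84.6 * (syllables / wordCount);
}
```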
| Component | Description |
|---|---|
| CrawlerForm | Configure and launch crawls |
| StatsGrid | Live counters – pages, data size, speed |
| ProgressBar | Visual crawl progress |
| CrawledPagesSection | Virtualized page list with search & filter |
| TheatreOverlay | Full page preview with processed data |
| ExportDialog | JSON / CSV export |
| ResumeSessionsPanel | Browse & resume interrupted sessions |
| LogsSection | Live crawl log stream |
| MikuBanner | ✨ Animated mascot ✨ |
Requires Bun – the fast JavaScript runtime.

```bash
git clone https://github.com/renbkna/mikumikucrawler
cd mikumikucrawler
bun install
bun run dev
```
## 🔧 Environment Variables

Copy `.env.example` → `.env`. All variables have sensible defaults. Frontend vars need the `VITE_` prefix.

```env
PORT=3000
NODE_ENV=development
FRONTEND_URL=http://localhost:5173
VITE_BACKEND_URL=http://localhost:3000
DB_PATH=./data/crawler.db
LOG_LEVEL=info
USER_AGENT=MikuCrawler/3.0.0
RENDER=false
```
## ⚙️ Crawler Options

| Setting | Default | Range |
|---|---|---|
| Crawl Depth | 2 | 1–5 |
| Max Pages | 50 | 1–200 |
| Max Pages Per Domain | 50 | 1–200 |
| Crawl Delay | 1000ms | 200–10000ms |
| Method | links | links / content / media / full |
| Concurrent Requests | 5 | 1–10 |
| Retry Limit | 3 | 0–5 |
| Dynamic Content | true | – |
| Respect Robots | true | – |
| Content Only | false | – |
| Save Media | false | – |
Full OpenAPI spec at `/openapi`.
| | Method | Endpoint | Description |
|---|---|---|---|
| 🚀 | POST | `/api/crawls` | Create a crawl run |
| 📋 | GET | `/api/crawls` | List crawl runs |
| 🔍 | GET | `/api/crawls/:id` | Get crawl state & counters |
| ⏹️ | POST | `/api/crawls/:id/stop` | Request graceful stop |
| ▶️ | POST | `/api/crawls/:id/resume` | Resume an interrupted crawl |
| 📡 | GET | `/api/crawls/:id/events` | SSE event stream |
| 📦 | GET | `/api/crawls/:id/export` | Export pages (JSON / CSV) |
| 🗑️ | DELETE | `/api/crawls/:id` | Delete a stored crawl |
| 📄 | GET | `/api/pages/:id/content` | Fetch stored page content |
| 🔍 | GET | `/api/search?q=keyword` | Full-text search (FTS5) |
| 💚 | GET | `/health` | Health check |
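For example, client URLs for the search and export endpoints can be built like this (the `format` query parameter for export is an assumption; check `/openapi` for the real parameters):

```typescript
// Hypothetical client-side URL construction for two of the endpoints above.
const base = "http://localhost:3000";

// Full-text search; only `q` is documented in the table above.
const searchUrl = `${base}/api/search?q=${encodeURIComponent("hatsune miku")}`;

// Export a crawl as CSV; the `format` parameter name is a guess.
const exportUrl = `${base}/api/crawls/some-crawl-id/export?format=csv`;

// const results = await (await fetch(searchUrl)).json();
```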
```js
const source = new EventSource(
  "http://localhost:3000/api/crawls/<crawl-id>/events"
);

source.addEventListener("crawl.progress", (event) => {
  const { sequence, payload } = JSON.parse(event.data);
  console.log(payload.counters);
});
```
| Event | When |
|---|---|
| `crawl.started` | Crawl begins processing |
| `crawl.progress` | Counter & queue stats update |
| `crawl.page` | A page was crawled |
| `crawl.log` | Runtime log message |
| `crawl.completed` | Crawl finished normally |
| `crawl.stopped` | Stopped by user |
| `crawl.failed` | Terminated due to error |
Events are sequenced – use `Last-Event-ID` for resumable connections.
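Note that `EventSource` resends `Last-Event-ID` automatically when it reconnects. If you consume the stream manually, a small tracker (a hypothetical client-side helper, not part of the project) can drop replayed events and remember the ID to resume from:

```typescript
// Hypothetical helper: assumes each event's data carries a numeric
// `sequence` field, as in the crawl.progress example above.
class SequenceTracker {
  private last = 0;

  // Returns true if the event is new (in order), false if it is a
  // duplicate or stale replay that should be ignored.
  accept(sequence: number): boolean {
    if (sequence <= this.last) return false;
    this.last = sequence;
    return true;
  }

  // Value to send as the Last-Event-ID header when reconnecting manually.
  get lastEventId(): string {
    return String(this.last);
  }
}
```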
| | Technology |
|---|---|
| ⚛️ | React 19 |
| 📘 | TypeScript |
| 🎨 | Tailwind CSS 4 |
| ⚡ | Vite |
| 🌐 | Eden Treaty |
| ✏️ | Lucide React |
| | Technology |
|---|---|
| 🔥 | Bun + bun:sqlite |
| 🦊 | Elysia + OpenAPI |
| 🎭 | Playwright |
| 📝 | Pino |
| 📈 | OpenTelemetry |
| 🔒 | IP validation + rate limiting |
## 📁 Project Structure

```
server/
├── api/                    # Elysia route handlers
├── contracts/              # OpenAPI schemas + shared type re-exports
├── domain/crawl/           # Core crawl logic
│   ├── CrawlQueue.ts       # Priority queue with domain throttling
│   ├── CrawlState.ts       # Counters, visited URLs, stop logic
│   ├── DynamicRenderer.ts  # Playwright lifecycle
│   ├── FetchService.ts     # HTTP fetching with security checks
│   ├── PagePipeline.ts     # Fetch → process → store pipeline
│   ├── RobotsService.ts    # robots.txt evaluation
│   └── UrlPolicy.ts        # URL filtering and normalization
├── runtime/                # Crawl execution layer
│   ├── CrawlRuntime.ts     # Orchestrates a single crawl run
│   ├── CrawlManager.ts     # Creates, stops, resumes, lists runs
│   ├── EventStream.ts      # Sequenced SSE publishing
│   └── RuntimeRegistry.ts  # Active runtime tracking
├── processors/             # Content analysis
│   ├── ContentProcessor.ts # Dispatch by content type
│   ├── analysisUtils.ts    # Keywords, quality scoring
│   ├── extractionUtils.ts  # Metadata, structured data, links
│   └── sentimentAnalyzer.ts
├── storage/                # SQLite persistence
│   ├── migrations/         # Schema migrations
│   └── repos/              # Query repositories
├── plugins/                # Elysia plugins (DI, security, logging)
└── config/                 # Env validation, logging setup
shared/                     # Cross-boundary contracts
├── contracts/              # Domain types (status, events, pages)
├── types.ts                # Shared domain types
└── url.ts                  # URL validation & normalization
```
```mermaid
graph TD
    A[🌐 Target URL] --> B[🎵 CrawlRuntime]
    B --> C[🔄 PagePipeline]
    C --> D{Dynamic?}
    D -->|Yes| E[🎭 Playwright]
    D -->|No| F[⚡ Fetch + Cheerio]
    E --> G[🔍 ContentProcessor]
    F --> G
    G --> C
    C --> H[💾 SQLite]
    C --> I[📡 EventStream]
    I --> J[🎨 React UI]
```
1. Client creates a crawl via `POST /api/crawls`
2. `CrawlManager` spawns a `CrawlRuntime` with its own queue and state
3. `PagePipeline` fetches each URL via `FetchService` (static) or Playwright (dynamic)
4. `ContentProcessor` analyzes the page, then `PagePipeline` stores results and enqueues discovered links
5. `EventStream` publishes sequenced SSE events to the frontend ✨
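The flow above can be condensed into a toy loop. This is a deliberately simplified sketch; the real `CrawlRuntime`/`PagePipeline` add concurrency, retries, robots.txt checks, and per-domain throttling:

```typescript
// A fetcher stands in for FetchService/Playwright; the stored array
// stands in for the SQLite write.
type Fetcher = (url: string) => { content: string; links: string[] };

function crawl(seed: string, maxPages: number, fetcher: Fetcher): string[] {
  const queue: string[] = [seed];
  const visited = new Set<string>();
  const stored: string[] = [];

  while (queue.length > 0 && stored.length < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue; // skip already-crawled URLs
    visited.add(url);

    const page = fetcher(url);       // fetch -> process
    stored.push(url);                // -> store
    for (const link of page.links) { // -> enqueue discovered links
      if (!visited.has(link)) queue.push(link);
    }
  }
  return stored;
}
```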
```bash
bun run build && bun start
```
## 🐳 Docker

```bash
docker build -t mikumikucrawler .
docker run -p 3000:3000 mikumikucrawler
```

```env
NODE_ENV=production
PORT=3000
FRONTEND_URL=https://your-domain.com
DB_PATH=/data/crawler.db
```
Format → Lint → Type-aware lint → Typecheck → Tests → Build
- ✅ Get permission before crawling
- ✅ Respect robots.txt and rate limits
- ✅ Use reasonable delays
- ❌ Don't overload servers
- ❌ Don't scrape copyrighted content without authorization
1. Fork the repo
2. Create a feature branch: `git checkout -b my-feature`
3. Commit changes: `git commit -m 'Add feature'`
4. Push: `git push origin my-feature`
5. Open a Pull Request
- renbkna – Solo Developer & Miku Enthusiast
- Sammwy – Original MikuMikuBeam inspiration

📜 MIT – see LICENSE
🌸 Miku Miku Crawler 🌸

Made with 💙 by a developer who thinks crawlers can be cute