# 🌸 Miku Miku Crawler 🌸

✨ A Kawaii Web Crawler with Real-Time Visualization ✨


A real-time web crawler with a Miku-themed UI and live visualization.
Watch pages get crawled in real time, analyze content quality, and export structured data,
all wrapped in a cute interface.

🔄 Live SSE streaming · 📊 Content analysis · 💾 Persistent storage · 🎨 Miku-themed UI

Inspired by MikuMikuBeam by Sammwy 💕

Miku Crawler Preview


## 🌟 Features

### 🕷️ Crawling

| Feature | Description |
| --- | --- |
| 📡 SSE streaming | Ordered, resumable events with sequence tracking |
| 🎭 Playwright | Renders JavaScript-heavy pages with headless Chromium |
| ⚡ Cheerio | Fast HTML extraction for static pages |
| 🤖 robots.txt | Optional compliance with crawl rules and `crawl-delay` |
| 🔀 Concurrency | Configurable parallel fetch workers |
| 🔄 Retry with backoff | Automatic retries on transient failures |
| 💾 Session resume | Interrupted crawls persist and resume from where they stopped |
| 🚦 Domain throttling | Per-domain rate limiting to keep the crawler polite |
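
The retry-with-backoff behavior can be sketched roughly like this. This is an illustrative helper, not the project's actual implementation; the function name and defaults are assumptions:

```typescript
// Sketch of retry with exponential backoff (illustrative, not the real code).
// Delays grow as baseDelayMs * 2^attempt, capped at maxDelayMs.
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 100,
  maxDelayMs = 5000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // retries exhausted
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Capping the delay keeps a long retry chain from stalling a worker indefinitely on a flaky host.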

### 📊 Content Processing

Every crawled page goes through a full analysis pipeline:

| Analysis | Description |
| --- | --- |
| 🔑 Keywords | Frequency-based extraction; filters stop words (EN/ES/FR/DE) |
| 🌐 Language | Detection via `franc` |
| 💭 Sentiment | Custom lexicon-based analyzer |
| 📖 Readability | Flesch-Kincaid scoring |
| ⭐ Quality | Title, meta, content length, headings, alt text, links |
| 🏗️ Structured data | JSON-LD, Open Graph, Twitter Cards, microdata |
| 🖼️ Media | Images and videos with URLs and alt text |
| 🔗 Links | Classified as internal, external, social, or navigation |
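
As a rough sketch of the frequency-based keyword step (illustrative only; the real pipeline also handles ES/FR/DE stop words, which are omitted here):

```typescript
// Illustrative keyword extraction: count word frequencies, skip stop words,
// return the most frequent terms. Stop-word list and thresholds are assumptions.
const STOP_WORDS = new Set(["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"]);

function extractKeywords(text: string, limit = 5): string[] {
  const counts = new Map<string, number>();
  // Only consider alphabetic tokens of 3+ letters.
  for (const word of text.toLowerCase().match(/[a-z]{3,}/g) ?? []) {
    if (!STOP_WORDS.has(word)) counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .slice(0, limit)
    .map(([word]) => word);
}
```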

### 🎨 Interface

| Component | Purpose |
| --- | --- |
| `CrawlerForm` | Configure and launch crawls |
| `StatsGrid` | Live counters: pages, data size, speed |
| `ProgressBar` | Visual crawl progress |
| `CrawledPagesSection` | Virtualized page list with search & filter |
| `TheatreOverlay` | Full page preview with processed data |
| `ExportDialog` | JSON / CSV export |
| `ResumeSessionsPanel` | Browse & resume interrupted sessions |
| `LogsSection` | Live crawl log stream |
| `MikuBanner` | ✨ Animated mascot ✨ |

## 🚀 Quick Start

Requires [Bun](https://bun.sh), the fast JavaScript runtime.

```bash
git clone https://github.com/renbkna/mikumikucrawler
cd mikumikucrawler
bun install
bun run dev
```

| Service | URL |
| --- | --- |
| 🎨 Frontend | http://localhost:5173 |
| ⚙️ Backend | http://localhost:3000 |
| 📋 OpenAPI | http://localhost:3000/openapi |

## 🔧 Environment Variables

Copy `.env.example` to `.env`. All variables have sensible defaults. Frontend variables need the `VITE_` prefix.

```bash
PORT=3000
NODE_ENV=development
FRONTEND_URL=http://localhost:5173
VITE_BACKEND_URL=http://localhost:3000
DB_PATH=./data/crawler.db
LOG_LEVEL=info
USER_AGENT=MikuCrawler/3.0.0
RENDER=false
```
โš™๏ธ Crawler Options
Setting Default Range
Crawl Depth 2 1โ€“5
Max Pages 50 1โ€“200
Max Pages Per Domain 50 1โ€“200
Crawl Delay 1000ms 200โ€“10000ms
Method links links / content / media / full
Concurrent Requests 5 1โ€“10
Retry Limit 3 0โ€“5
Dynamic Content true โ€”
Respect Robots true โ€”
Content Only false โ€”
Save Media false โ€”

## 🔌 API

Full OpenAPI spec at `/openapi`.

| Method | Endpoint | Description |
| --- | --- | --- |
| 🆕 POST | `/api/crawls` | Create a crawl run |
| 📋 GET | `/api/crawls` | List crawl runs |
| 🔍 GET | `/api/crawls/:id` | Get crawl state & counters |
| ⏹️ POST | `/api/crawls/:id/stop` | Request graceful stop |
| ▶️ POST | `/api/crawls/:id/resume` | Resume an interrupted crawl |
| 📡 GET | `/api/crawls/:id/events` | SSE event stream |
| 📦 GET | `/api/crawls/:id/export` | Export pages (JSON / CSV) |
| 🗑️ DELETE | `/api/crawls/:id` | Delete a stored crawl |
| 📄 GET | `/api/pages/:id/content` | Fetch stored page content |
| 🔎 GET | `/api/search?q=keyword` | Full-text search (FTS5) |
| 💚 GET | `/health` | Health check |
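
A minimal client sketch for creating a crawl, assuming the server from Quick Start is running on port 3000. The request body shape here is illustrative; consult `/openapi` for the real schema:

```typescript
// Hypothetical API usage: create a crawl run via POST /api/crawls.
// The body fields (url, depth, maxPages) are assumptions, not the verified schema.
async function startCrawl(url: string): Promise<{ id?: string }> {
  const res = await fetch("http://localhost:3000/api/crawls", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, depth: 2, maxPages: 50 }),
  });
  return res.json();
}
```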

## 📡 Event Stream

```js
const source = new EventSource(
  "http://localhost:3000/api/crawls/<crawl-id>/events"
);

source.addEventListener("crawl.progress", (event) => {
  const { sequence, payload } = JSON.parse(event.data);
  console.log(payload.counters);
});
```

| Event | When |
| --- | --- |
| `crawl.started` | Crawl begins processing |
| `crawl.progress` | Counter & queue stats update |
| `crawl.page` | A page was crawled |
| `crawl.log` | Runtime log message |
| `crawl.completed` | Crawl finished normally |
| `crawl.stopped` | Stopped by user |
| `crawl.failed` | Terminated due to error |

Events are sequenced; use `Last-Event-ID` for resumable connections.
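
Those sequence numbers let a client notice dropped events and reconnect with `Last-Event-ID`. A hypothetical tracker (the class name and API are assumptions, not part of this project):

```typescript
// Hypothetical helper: tracks monotonically increasing event sequence numbers
// so a consumer can detect gaps and resume from the last good event.
class SequenceTracker {
  private last = 0; // assumes sequences start at 1

  /** Returns true if `sequence` is the next expected event; false on a gap or replay. */
  accept(sequence: number): boolean {
    if (sequence !== this.last + 1) return false;
    this.last = sequence;
    return true;
  }

  /** Value to send as Last-Event-ID when reconnecting. */
  get lastEventId(): string {
    return String(this.last);
  }
}
```

Note that the browser's `EventSource` already resends `Last-Event-ID` on automatic reconnects; explicit tracking like this matters mainly for custom fetch-based consumers.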


๐Ÿ—๏ธ Tech Stack

๐ŸŽจ Frontend

Technology
โš›๏ธ React 19
๐Ÿ“˜ TypeScript
๐ŸŽจ Tailwind CSS 4
โšก Vite
๐Ÿ”— Eden Treaty
โœ๏ธ Lucide React

โš™๏ธ Backend

Technology
๐ŸฅŸ Bun + bun:sqlite
๐ŸฆŠ Elysia + OpenAPI
๐ŸŽญ Playwright
๐Ÿ“ Pino
๐Ÿ“Š OpenTelemetry
๐Ÿ”’ IP validation + rate limiting
๐Ÿ“ Project Structure
server/
โ”œโ”€โ”€ api/                    # Elysia route handlers
โ”œโ”€โ”€ contracts/              # OpenAPI schemas + shared type re-exports
โ”œโ”€โ”€ domain/crawl/           # Core crawl logic
โ”‚   โ”œโ”€โ”€ CrawlQueue.ts      #   Priority queue with domain throttling
โ”‚   โ”œโ”€โ”€ CrawlState.ts      #   Counters, visited URLs, stop logic
โ”‚   โ”œโ”€โ”€ DynamicRenderer.ts  #   Playwright lifecycle
โ”‚   โ”œโ”€โ”€ FetchService.ts     #   HTTP fetching with security checks
โ”‚   โ”œโ”€โ”€ PagePipeline.ts     #   Fetch โ†’ process โ†’ store pipeline
โ”‚   โ”œโ”€โ”€ RobotsService.ts    #   robots.txt evaluation
โ”‚   โ””โ”€โ”€ UrlPolicy.ts        #   URL filtering and normalization
โ”œโ”€โ”€ runtime/                # Crawl execution layer
โ”‚   โ”œโ”€โ”€ CrawlRuntime.ts     #   Orchestrates a single crawl run
โ”‚   โ”œโ”€โ”€ CrawlManager.ts     #   Creates, stops, resumes, lists runs
โ”‚   โ”œโ”€โ”€ EventStream.ts      #   Sequenced SSE publishing
โ”‚   โ””โ”€โ”€ RuntimeRegistry.ts  #   Active runtime tracking
โ”œโ”€โ”€ processors/             # Content analysis
โ”‚   โ”œโ”€โ”€ ContentProcessor.ts #   Dispatch by content type
โ”‚   โ”œโ”€โ”€ analysisUtils.ts    #   Keywords, quality scoring
โ”‚   โ”œโ”€โ”€ extractionUtils.ts  #   Metadata, structured data, links
โ”‚   โ””โ”€โ”€ sentimentAnalyzer.ts
โ”œโ”€โ”€ storage/                # SQLite persistence
โ”‚   โ”œโ”€โ”€ migrations/         #   Schema migrations
โ”‚   โ””โ”€โ”€ repos/              #   Query repositories
โ”œโ”€โ”€ plugins/                # Elysia plugins (DI, security, logging)
โ””โ”€โ”€ config/                 # Env validation, logging setup

shared/                     # Cross-boundary contracts
โ”œโ”€โ”€ contracts/              #   Domain types (status, events, pages)
โ”œโ”€โ”€ types.ts                #   Shared domain types
โ””โ”€โ”€ url.ts                  #   URL validation & normalization
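
As an illustration of what URL normalization typically involves (the exact rules in `shared/url.ts` are not shown here, so treat this as an assumption-laden sketch):

```typescript
// Illustrative URL normalization so duplicate URLs collapse to one visited key.
// The WHATWG URL parser already lowercases the host and drops default ports.
function normalizeUrl(input: string): string {
  const url = new URL(input);
  url.hash = ""; // fragments never change server content
  const out = url.toString();
  // "https://example.com/" and "https://example.com" are the same page.
  return url.pathname === "/" && !url.search ? out.replace(/\/$/, "") : out;
}
```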

## 🔮 How It Works

```mermaid
graph TD
    A[🌐 Target URL] --> B[🎵 CrawlRuntime]
    B --> C[📄 PagePipeline]
    C --> D{Dynamic?}
    D -->|Yes| E[🎭 Playwright]
    D -->|No| F[⚡ Fetch + Cheerio]
    E --> G[📊 ContentProcessor]
    F --> G
    G --> C
    C --> H[💾 SQLite]
    C --> I[📡 EventStream]
    I --> J[🎨 React UI]
```

1. The client creates a crawl via `POST /api/crawls`
2. `CrawlManager` spawns a `CrawlRuntime` with its own queue and state
3. `PagePipeline` fetches each URL via `FetchService` (static) or Playwright (dynamic)
4. `ContentProcessor` analyzes the page, then `PagePipeline` stores results and enqueues discovered links
5. `EventStream` publishes sequenced SSE events to the frontend ✨
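
The per-domain throttling that gates the fetching in step 3 can be sketched as a simple time gate. This is an illustrative stand-in, not `CrawlQueue`'s real logic:

```typescript
// Illustrative per-domain throttle: a URL may be fetched only if its domain
// has not been hit within the last `delayMs` milliseconds.
class DomainThrottle {
  private lastFetch = new Map<string, number>();

  constructor(private delayMs: number) {}

  /** Returns true (and records the fetch) if the domain is currently allowed. */
  tryAcquire(url: string, now = Date.now()): boolean {
    const host = new URL(url).hostname;
    const last = this.lastFetch.get(host) ?? Number.NEGATIVE_INFINITY;
    if (now - last < this.delayMs) return false; // too soon for this domain
    this.lastFetch.set(host, now);
    return true;
  }
}
```

A real queue would requeue throttled URLs rather than drop them, but the gate itself is this simple.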

## 🚢 Deployment

```bash
bun run build && bun start
```

### 🐳 Docker

```bash
docker build -t mikumikucrawler .
docker run -p 3000:3000 mikumikucrawler
```

Production environment:

```bash
NODE_ENV=production
PORT=3000
FRONTEND_URL=https://your-domain.com
DB_PATH=/data/crawler.db
```

## ✅ Verification

```bash
bun run check
```

Runs: Format → Lint → Type-aware lint → Typecheck → Tests → Build


โš ๏ธ Responsible Use

โœ… Get permission before crawling
โœ… Respect robots.txt and rate limits
โœ… Use reasonable delays
โŒ Don't overload servers
โŒ Don't scrape copyrighted content without authorization

๐Ÿค Contributing

  1. Fork the repo
  2. Create a feature branch: git checkout -b my-feature
  3. Commit changes: git commit -m 'Add feature'
  4. Push: git push origin my-feature
  5. Open a Pull Request

๐Ÿ‘จโ€๐Ÿ’ป Developer

renbkna โ€” Solo Developer & Miku Enthusiast

๐Ÿ™ Special Thanks

Sammwy โ€” Original MikuMikuBeam inspiration


📜 MIT licensed. See `LICENSE`.



🌸 Miku Miku Crawler 🌸

Made with 💖 by a developer who thinks crawlers can be cute


GitHub stars GitHub forks GitHub issues