EPIC #4

@kordless

Here's the cleaned-up, structured Wraith + Mitta unified TODO list, grouped by functional area and ready to drop into GitHub (or a Notion/Linear backlog if you're running structured sprints).


1. Content Extraction + Structure

  • Schema-based extraction via CSS selectors
  • Chunk HTML content into semantic blocks
  • Add cosine similarity scoring for block relevance
  • Support multiple extraction strategies (DOM, LLM, tag-weighted, etc.)
  • LLM-assisted data extraction using templated prompts
  • Insert summaries / extracted values into final markdown outputs
  • Link classification + filtering logic
  • Link graph generation for site structure mapping
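
As a rough illustration of the cosine similarity item above, here is a minimal sketch that scores text blocks against a query using plain bag-of-words term counts. The function names (`tokenize`, `cosine_similarity`, `rank_blocks`) are hypothetical, and a real implementation would more likely compare embedding vectors than raw word counts.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count word tokens (a simple bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty block or query matches nothing
    return dot / (norm_a * norm_b)


def rank_blocks(query: str, blocks: list[str]) -> list[tuple[float, str]]:
    """Score each block against the query and return them most relevant first."""
    q = tokenize(query)
    scored = [(cosine_similarity(q, tokenize(b)), b) for b in blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Blocks that share no terms with the query score 0.0, which gives a simple cut-off for dropping navigation or cookie-banner noise before anything is sent to an LLM.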

2. Indexing + Storage

  • Vector + Solr hybrid storage engine
  • Design schema for keyterms + vector + markdown output storage
  • Containerize backend datastore (FeatureBase/Druid/etc.)
  • Support live insertion from Wraith jobs

3. Media Processing

  • Basic support for images, video, audio metadata
  • Handle <img srcset> and <picture> variants
  • Extract alt text / surrounding context for embeddings
  • Lazy-load image reveal scripting
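
One possible shape for the `srcset` / `<picture>` handling, sketched with the standard library only. `parse_srcset` and `ImageCollector` are hypothetical names, and the comma split is a simplification that does not cover URLs containing commas.

```python
from html.parser import HTMLParser


def parse_srcset(srcset: str) -> list[tuple[str, str]]:
    """Split a srcset value into (url, descriptor) pairs.

    Simplification: candidates are split on commas, so URLs that
    themselves contain commas are not handled.
    """
    candidates = []
    for part in srcset.split(","):
        pieces = part.split()
        if not pieces:
            continue
        url = pieces[0]
        # A candidate without a descriptor is treated as 1x.
        descriptor = pieces[1] if len(pieces) > 1 else "1x"
        candidates.append((url, descriptor))
    return candidates


class ImageCollector(HTMLParser):
    """Collect image candidates and alt text from <img> and <source> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.images: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "source"):
            return
        attr = dict(attrs)
        candidates = parse_srcset(attr.get("srcset") or "")
        if attr.get("src"):
            candidates.append((attr["src"], "fallback"))
        if candidates:
            self.images.append(
                {"tag": tag, "alt": attr.get("alt") or "", "candidates": candidates}
            )
```

Feeding a page through `ImageCollector().feed(html)` leaves one entry per tag in `.images`, with the alt text alongside every variant so it can be passed to the embedding step.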

4. AI/LLM Integration

  • LLMContentFilter for filtering noise / boosting signal
  • LLMExtractionStrategy with fallback prompts
  • Unified AI wrapper to support Claude, GPT, Mistral, LLaMA, etc.
  • Prompt template system for task-specific extraction
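
The fallback-prompt and unified-wrapper ideas could fit together roughly as below. This is a sketch under assumptions: the prompt wording, the `PROMPTS` list, and the `extract` function are all hypothetical, and each provider is reduced to a plain function from prompt string to completion string rather than any real SDK call.

```python
import json
from string import Template
from typing import Callable, Optional

# Hypothetical templates: a strict primary prompt and a looser fallback.
PROMPTS = [
    Template("Extract the fields $fields from the text below. Reply with JSON only.\n\n$text"),
    Template("List $fields found in this text as a JSON object, using null for missing values.\n\n$text"),
]

# Any provider (Claude, GPT, Mistral, LLaMA, ...) is wrapped as a plain
# function from prompt string to completion string.
Completer = Callable[[str], str]


def extract(text: str, fields: list[str], complete: Completer) -> Optional[dict]:
    """Try each prompt in order; return the first reply that parses as a JSON object."""
    for template in PROMPTS:
        prompt = template.substitute(fields=", ".join(fields), text=text)
        reply = complete(prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError:
            continue  # fall through to the next prompt
        if isinstance(data, dict):
            return data
    return None  # every prompt failed to produce usable JSON
```

Keeping the provider behind a single callable means the extraction strategy can be tested with a stub and swapped between models without touching the templates.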

5. Browser Automation / Control

  • Persistent profile + cookie management
  • LLM-assisted dynamic content handling (e.g., “wait for element, click next”)
  • Reusable JS snippets per scenario (e.g., DOM flatteners, pagers)
  • Expose Playwright session controls via command/agent interface

6. Error Handling + Observability

  • Classified error types (timeouts, bad selectors, load fails, etc.)
  • Retry + fallback mechanisms
  • Structured logs for all pipeline stages (JSONL preferred)
  • Stats reporting (pages processed, blocks extracted, avg confidence, etc.)
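
A minimal sketch of how classified errors, retries, and JSONL logging might combine. The exception names (`CrawlError`, `PageTimeout`, `BadSelector`) and the log fields are assumptions made for illustration, not an existing Wraith API.

```python
import json
import sys
import time
from typing import Callable, TextIO


class CrawlError(Exception):
    """Base class for classified pipeline errors (hypothetical names)."""
    retryable = False


class PageTimeout(CrawlError):
    retryable = True  # transient: worth another attempt


class BadSelector(CrawlError):
    retryable = False  # retrying will not fix a wrong selector


def log_event(stream: TextIO, stage: str, **fields) -> None:
    """Write one JSON object per line (JSONL)."""
    record = {"ts": time.time(), "stage": stage, **fields}
    stream.write(json.dumps(record) + "\n")


def with_retry(task: Callable[[], str], attempts: int = 3, delay: float = 0.0,
               stream: TextIO = sys.stderr) -> str:
    """Run task, retrying only errors classified as retryable."""
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            log_event(stream, "fetch", status="ok", attempt=attempt)
            return result
        except CrawlError as err:
            log_event(stream, "fetch", status="error", attempt=attempt,
                      error=type(err).__name__, retryable=err.retryable)
            if not err.retryable or attempt == attempts:
                raise
            time.sleep(delay * 2 ** (attempt - 1))  # exponential backoff
```

Because every attempt emits one line, the stats in the last bullet (pages processed, failure counts by error class) can be derived later by aggregating the log rather than tracked separately.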

7. CLI Tooling

  • Comprehensive CLI (wraith or crawl)
  • Supports config files + env overrides
  • Interactive debug mode for live stepping through pages
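
For the config-file and env-override bullet, one plausible precedence order is defaults, then config file, then environment, then explicit flags. The sketch below assumes a JSON config file and hypothetical names throughout (`load_settings`, `--depth`, `--output`, `WRAITH_DEPTH`, `WRAITH_OUTPUT`).

```python
import argparse
import json
from typing import Mapping, Sequence

DEFAULTS = {"depth": 1, "output": "out.md"}  # hypothetical settings


def load_settings(argv: Sequence[str], env: Mapping[str, str]) -> dict:
    """Merge settings: defaults < config file < environment < CLI flags."""
    parser = argparse.ArgumentParser(prog="wraith")
    parser.add_argument("url")
    parser.add_argument("--config", help="path to a JSON config file")
    parser.add_argument("--depth", type=int)
    parser.add_argument("--output")
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    if args.config:
        with open(args.config, encoding="utf-8") as fh:
            settings.update(json.load(fh))
    # Hypothetical variable names: WRAITH_DEPTH, WRAITH_OUTPUT.
    if "WRAITH_DEPTH" in env:
        settings["depth"] = int(env["WRAITH_DEPTH"])
    if "WRAITH_OUTPUT" in env:
        settings["output"] = env["WRAITH_OUTPUT"]
    # Explicit flags win over everything else.
    if args.depth is not None:
        settings["depth"] = args.depth
    if args.output is not None:
        settings["output"] = args.output
    settings["url"] = args.url
    return settings
```

An entry point would call `load_settings(sys.argv[1:], os.environ)`; passing both in as arguments keeps the merge logic easy to test.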

8. Browser Extension (Uploader)

  • Mitta browser extension to snapshot DOM or send screenshots
  • Fallback upload to Wraith endpoint when JS rendering fails
  • User-triggered or automated crawling triggers from extension UI
  • Handles auth/session handoff for authenticated pages

9. Mitta Frontend (Browser AI Interface)

  • UI for uploading, crawling, searching
  • Conversational chat interface to control Wraith agents
  • Markdown report viewer
  • Real-time feedback from LLM queries
  • Image upload + analysis tools (OCR, object detection, etc.)
  • Dashboard for managing crawled docs

10. Authentication + Access Control

  • Email + SMS-based login only (passwordless)
  • 2FA for paid accounts
  • Don’t store or log credit card info
  • Future: Google Authenticator integration
  • Use Flee to review Auth.py for flow + state safety

11. Open Source Packaging

  • Make Mitta downloadable + local-run friendly (e.g., Raspberry Pi)
  • Containerized stack (Frontend, Wraith, Vector DB)
  • Docs for setup, config, and local security best practices

Would you like me to export this as a GitHub markdown checklist, or break it into separate issue files for import with GitHub Projects Beta? Also happy to generate a top-level ROADMAP.md.
