Right with you. Here's the cleaned-up, structured Wraith + Mitta unified TODO list, grouped by functional area, prioritized for clarity and ready to drop into GitHub (or a Notion/Linear backlog if you're doing structured sprints).
✅ 1. Content Extraction + Structure
- Schema-based extraction via CSS selectors
- Chunk HTML content into semantic blocks
- Add cosine similarity scoring for block relevance
- Support multiple extraction strategies (DOM, LLM, tag-weighted, etc.)
- LLM-assisted data extraction using templated prompts
- Insert summaries / extracted values into final markdown outputs
- Link classification + filtering logic
- Link graph generation for site structure mapping
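To make the cosine similarity item concrete, here is a minimal sketch of block relevance scoring. It assumes plain bag-of-words term counts as the vectors; a real pipeline would more likely use embeddings. The names `tokenize`, `cosine_similarity`, and `rank_blocks` are hypothetical, not existing Wraith functions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts (a stand-in for real embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty block or query has no direction to compare
    return dot / (norm_a * norm_b)


def rank_blocks(query: str, blocks: List[str]) -> List[Tuple[float, str]]:
    """Score every block against the query and sort most relevant first."""
    q = tokenize(query)
    scored = [(cosine_similarity(q, tokenize(block)), block) for block in blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Blocks that share no terms with the query score 0.0, so they can be dropped with a simple threshold before the LLM stages run.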
✅ 2. Indexing + Storage
- Vector + Solr hybrid storage engine
- Design schema for keyterms + vector + markdown output storage
- Containerize backend datastore (FeatureBase/Druid/etc.)
- Support live insertion from Wraith jobs
✅ 3. Media Processing
- Basic support for images, video, audio metadata
- Handle `<img srcset>` and `<picture>` variants
- Extract alt text / surrounding context for embeddings
- Lazy-load image reveal scripting
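For the `srcset` item, a rough sketch of parsing the attribute value into candidates and picking the largest one. It assumes candidates are comma-separated and URLs contain no commas, which is not true for every page (data URIs, for example). `Candidate`, `parse_srcset`, and `largest` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    url: str
    width: Optional[int] = None      # from a width descriptor such as "640w"
    density: Optional[float] = None  # from a density descriptor such as "2x"


def parse_srcset(srcset: str) -> List[Candidate]:
    """Split a srcset value into candidates (naive comma split, see assumptions)."""
    candidates = []
    for part in srcset.split(","):
        tokens = part.split()
        if not tokens:
            continue
        cand = Candidate(url=tokens[0])
        for desc in tokens[1:]:
            if desc.endswith("w") and desc[:-1].isdigit():
                cand.width = int(desc[:-1])
            elif desc.endswith("x"):
                try:
                    cand.density = float(desc[:-1])
                except ValueError:
                    pass  # ignore malformed descriptors
        candidates.append(cand)
    return candidates


def largest(candidates: List[Candidate]) -> Optional[Candidate]:
    """Prefer the widest candidate, then the highest pixel density."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.width or 0, c.density or 1.0))
```

Keeping only the largest variant avoids embedding the same image several times at different resolutions.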
✅ 4. AI/LLM Integration
- LLMContentFilter for filtering noise / boosting signal
- LLMExtractionStrategy with fallback prompts
- Unified AI wrapper to support Claude, GPT, Mistral, LLaMA, etc.
- Prompt template system for task-specific extraction
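One possible shape for the unified wrapper and prompt templates, as a sketch only. It assumes each provider can be hidden behind a callable that maps a prompt string to a completion string, and it reads "fallback prompts" as "try the next template when the answer comes back empty". `LLMWrapper`, `register`, and `extract` are hypothetical names; no real provider SDK is called here.

```python
from string import Template
from typing import Callable, Dict, List

# Assumption: a backend is any callable from prompt text to completion text.
Backend = Callable[[str], str]


class LLMWrapper:
    """Routes prompts to a named backend and walks a list of fallback prompts."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, name: str, backend: Backend) -> None:
        self._backends[name] = backend

    def extract(self, model: str, templates: List[Template], **values: str) -> str:
        backend = self._backends[model]
        for template in templates:
            answer = backend(template.substitute(**values)).strip()
            if answer:
                return answer  # first non-empty answer wins
        return ""  # every prompt came back empty


# Illustrative templates; the wording is made up for this sketch.
PRIMARY = Template("Return only the $field found in:\n$content")
FALLBACK = Template("Describe the $field mentioned in:\n$content")


def fake_backend(prompt: str) -> str:
    """Test double: answers only the fallback prompt."""
    return "" if prompt.startswith("Return only") else "19.99 USD"


wrapper = LLMWrapper()
wrapper.register("fake", fake_backend)
price = wrapper.extract("fake", [PRIMARY, FALLBACK], field="price", content="Now $$19.99")
```

Real Claude, GPT, Mistral, or LLaMA adapters would be registered the same way, each wrapping its own client behind the callable.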
✅ 5. Browser Automation / Control
- Persistent profile + cookie management
- LLM-assisted dynamic content handling (e.g., “wait for element, click next”)
- Reusable JS snippets per scenario (e.g., DOM flatteners, pagers)
- Expose Playwright session controls via command/agent interface
✅ 6. Error Handling + Observability
- Classified error types (timeouts, bad selectors, load fails, etc.)
- Retry + fallback mechanisms
- Structured logs for all pipeline stages (JSONL preferred)
- Stats reporting (pages processed, blocks extracted, avg confidence, etc.)
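The first three items here fit together naturally, so here is a small sketch combining them: classify the exception, retry only the retryable kinds, and write one JSON object per line. The mapping from Python exceptions to error kinds is an assumption for illustration (in particular `LookupError` standing in for a bad selector), and `ErrorKind`, `classify`, `log_event`, and `run_with_retry` are hypothetical names.

```python
import io
import json
import time
from enum import Enum


class ErrorKind(str, Enum):
    TIMEOUT = "timeout"
    BAD_SELECTOR = "bad_selector"
    LOAD_FAILED = "load_failed"
    UNKNOWN = "unknown"


RETRYABLE = {ErrorKind.TIMEOUT, ErrorKind.LOAD_FAILED}


def classify(exc: Exception) -> ErrorKind:
    """Map an exception to an error class (mapping is illustrative)."""
    if isinstance(exc, TimeoutError):
        return ErrorKind.TIMEOUT
    if isinstance(exc, ConnectionError):
        return ErrorKind.LOAD_FAILED
    if isinstance(exc, LookupError):
        return ErrorKind.BAD_SELECTOR
    return ErrorKind.UNKNOWN


def log_event(stream, stage: str, **fields) -> None:
    """Append one JSONL record for a pipeline stage."""
    record = {"ts": time.time(), "stage": stage, **fields}
    stream.write(json.dumps(record) + "\n")


def run_with_retry(task, stream, stage: str, attempts: int = 3, delay: float = 0.0):
    """Run task, retrying retryable failures and logging every attempt."""
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            log_event(stream, stage, status="ok", attempt=attempt)
            return result
        except Exception as exc:
            kind = classify(exc)
            log_event(stream, stage, status="error", kind=kind.value, attempt=attempt)
            if kind not in RETRYABLE or attempt == attempts:
                raise
            time.sleep(delay)


# Usage: a fetch that times out twice, then succeeds on the third attempt.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("page load timed out")
    return "done"


buffer = io.StringIO()
result = run_with_retry(flaky_fetch, buffer, stage="fetch")
```

Stats such as pages processed or error counts per kind can then be computed by reading the JSONL back, rather than tracked separately.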
✅ 7. CLI Tooling
- Comprehensive CLI (`wraith` or `crawl`)
- Supports config files + env overrides
- Interactive debug mode for live stepping through pages
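A sketch of how config files and env overrides could layer, assuming the precedence is defaults, then config file, then environment, then command-line flags. The flags (`--depth`, `--output`), the `WRAITH_*` variable names, and the JSON config format are all hypothetical; nothing here reflects an existing `wraith` CLI.

```python
import argparse
import json
from typing import Dict, List, Optional

DEFAULTS = {"depth": 2, "output": "out"}


def load_settings(argv: List[str], environ: Dict[str, str],
                  config_text: Optional[str] = None) -> dict:
    """Merge settings: defaults < config file < environment < CLI flags."""
    parser = argparse.ArgumentParser(prog="wraith")
    parser.add_argument("url")
    parser.add_argument("--depth", type=int)
    parser.add_argument("--output")
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    if config_text:
        settings.update(json.loads(config_text))  # config file contents as JSON
    if "WRAITH_DEPTH" in environ:
        settings["depth"] = int(environ["WRAITH_DEPTH"])
    if "WRAITH_OUTPUT" in environ:
        settings["output"] = environ["WRAITH_OUTPUT"]
    for key in ("depth", "output"):
        value = getattr(args, key)
        if value is not None:  # only flags the user actually passed
            settings[key] = value
    settings["url"] = args.url
    return settings
```

Passing `argv` and `environ` in explicitly (instead of reading `sys.argv` and `os.environ` inside the function) keeps the merge logic easy to test.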
✅ 8. Browser Extension (Uploader)
- Mitta browser extension to snapshot DOM or send screenshots
- Fallback upload to Wraith endpoint when JS rendering fails
- User-triggered or automated crawling triggers from extension UI
- Handles auth/session handoff for authenticated pages
✅ 9. Mitta Frontend (Browser AI Interface)
- UI for uploading, crawling, searching
- Conversational chat interface to control Wraith agents
- Markdown report viewer
- Real-time feedback from LLM queries
- Image upload + analysis tools (OCR, object detection, etc.)
- Dashboard for managing crawled docs
✅ 10. Authentication + Access Control
- Email + SMS-based login only (passwordless)
- 2FA for paid accounts
- Don’t store or log credit card info
- Future: Google Authenticator integration
- Use Flee to review Auth.py for flow + state safety
✅ 11. Open Source Packaging
- Make Mitta downloadable + local-run friendly (e.g., Raspberry Pi)
- Containerized stack (Frontend, Wraith, Vector DB)
- Docs for setup, config, and local security best practices
Would you like me to export this as a GitHub markdown checklist, or break it into separate issue files for import with GitHub Projects Beta? Also happy to generate an OP-level ROADMAP.md.