EPIC #4

Unified **Wraith + Mitta** TODO list, grouped by functional area.

---

## ✅ **1. Content Extraction + Structure**

- [ ] **Schema-based extraction via CSS selectors**
- [ ] **Chunk HTML content into semantic blocks**
- [ ] **Add cosine similarity scoring for block relevance**
- [ ] **Support multiple extraction strategies (DOM, LLM, tag-weighted, etc.)**
- [ ] **LLM-assisted data extraction using templated prompts**
- [ ] **Insert summaries / extracted values into final markdown outputs**
- [ ] **Link classification + filtering logic**
- [ ] **Link graph generation for site structure mapping**
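
For the cosine similarity item above, a minimal sketch of block scoring might look like the following. It assumes plain term-frequency vectors built from lowercase word tokens; a real implementation would more likely use embeddings. The names `tokenize`, `cosine_similarity`, and `score_blocks` are hypothetical and not part of Wraith.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def score_blocks(query: str, blocks: list[str]) -> list[tuple[float, str]]:
    """Return (score, block) pairs sorted from most to least relevant."""
    q = Counter(tokenize(query))
    scored = [(cosine_similarity(q, Counter(tokenize(b))), b) for b in blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Blocks scoring below some threshold could then be dropped before the markdown output is assembled; the threshold itself would need tuning per site.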

---

## ✅ **2. Indexing + Storage**

- [ ] **Vector + Solr hybrid storage engine**
- [ ] **Design schema for keyterms + vector + markdown output storage**
- [ ] **Containerize backend datastore (FeatureBase/Druid/etc.)**
- [ ] **Support live insertion from Wraith jobs**
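
As a starting point for the schema item above, one possible record shape is sketched below. The class name `CrawlDocument` and all field names are assumptions for illustration; the actual Solr and vector-store schemas are not specified in this epic.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CrawlDocument:
    """One stored page: keyterms for keyword search, a vector for similarity search."""

    url: str
    markdown: str
    keyterms: list[str] = field(default_factory=list)
    vector: list[float] = field(default_factory=list)

    def to_record(self) -> str:
        """Serialize to a single JSON line, e.g. for bulk or live insertion."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the markdown, keyterms, and vector in one record means a Wraith job can emit a single line per page, which each store then indexes in its own way.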

---

## ✅ **3. Media Processing**

- [ ] **Basic support for images, video, audio metadata**
- [ ] **Handle `<img srcset>` and `<picture>` variants**
- [ ] **Extract alt text / surrounding context for embeddings**
- [ ] **Lazy-load image reveal scripting**
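
The `srcset` and alt-text items above could be prototyped with the standard library alone. This is a sketch under simplifying assumptions: candidates are split on commas, so URLs that contain commas are not handled, and `<source>` tags inside `<video>` or `<audio>` are collected too. `parse_srcset` and `ImageCollector` are hypothetical names.

```python
from html.parser import HTMLParser


def parse_srcset(srcset: str) -> list[tuple[str, str]]:
    """Split a srcset value into (url, descriptor) pairs."""
    candidates = []
    for part in srcset.split(","):
        fields = part.split()
        if not fields:
            continue
        url = fields[0]
        # A candidate without a descriptor is treated as 1x.
        descriptor = fields[1] if len(fields) > 1 else "1x"
        candidates.append((url, descriptor))
    return candidates


class ImageCollector(HTMLParser):
    """Collect src, alt text, and srcset candidates from <img> and <source> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.images: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "source"):
            return
        attr = dict(attrs)
        self.images.append({
            "tag": tag,
            "src": attr.get("src"),
            "alt": attr.get("alt"),
            "candidates": parse_srcset(attr.get("srcset") or ""),
        })
```

Feeding a page through `ImageCollector().feed(html)` yields one entry per tag; the `alt` values are the natural input for the embedding step.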

---

## ✅ **4. AI/LLM Integration**

- [ ] **LLMContentFilter for filtering noise / boosting signal**
- [ ] **LLMExtractionStrategy with fallback prompts**
- [ ] **Unified AI wrapper to support Claude, GPT, Mistral, LLaMA, etc.**
- [ ] **Prompt template system for task-specific extraction**
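
One way the prompt template system and fallback prompts above might fit together is sketched here. It assumes every model backend can be wrapped as a plain callable from prompt text to answer text; `Backend`, `PromptLibrary`, and `extract_with_fallback` are hypothetical names, and no provider SDK is called.

```python
from string import Template
from typing import Callable, Optional

# A backend is any callable that maps a prompt to an answer.
# Real provider clients would be wrapped to fit this signature.
Backend = Callable[[str], str]


class PromptLibrary:
    """Named prompt templates using $placeholder syntax."""

    def __init__(self) -> None:
        self._templates: dict[str, Template] = {}

    def register(self, name: str, text: str) -> None:
        self._templates[name] = Template(text)

    def render(self, name: str, **values: str) -> str:
        return self._templates[name].substitute(**values)


def extract_with_fallback(
    backend: Backend,
    prompts: PromptLibrary,
    template_names: list[str],
    content: str,
) -> Optional[str]:
    """Try each template in order and return the first non-empty answer."""
    for name in template_names:
        answer = backend(prompts.render(name, content=content)).strip()
        if answer:
            return answer
    return None
```

Because the backend is just a callable, swapping one model for another does not touch the extraction logic, and tests can pass in a stub.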

---

## ✅ **5. Browser Automation / Control**

- [ ] **Persistent profile + cookie management**
- [ ] **LLM-assisted dynamic content handling (e.g., “wait for element, click next”)**
- [ ] **Reusable JS snippets per scenario (e.g., DOM flatteners, pagers)**
- [ ] **Expose Playwright session controls via command/agent interface**

---

## ✅ **6. Error Handling + Observability**

- [ ] **Classified error types (timeouts, bad selectors, load fails, etc.)**
- [ ] **Retry + fallback mechanisms**
- [ ] **Structured logs for all pipeline stages (JSONL preferred)**
- [ ] **Stats reporting (pages processed, blocks extracted, avg confidence, etc.)**
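
The error classification, retry, and JSONL logging items above could be combined roughly as follows. This is a sketch, not the Wraith implementation: the exception names, the `kind` labels, and the linear backoff are all assumptions.

```python
import json
import time
from typing import Callable, TextIO


class CrawlError(Exception):
    """Base class; `kind` is the label written to the log."""
    kind = "unknown"
    retryable = False


class PageTimeout(CrawlError):
    kind = "timeout"
    retryable = True


class BadSelector(CrawlError):
    kind = "bad_selector"  # retrying will not fix a wrong selector


def log_event(stream: TextIO, stage: str, **fields) -> None:
    """Write one JSON object per line (JSONL)."""
    stream.write(json.dumps({"ts": time.time(), "stage": stage, **fields}) + "\n")


def run_with_retry(task: Callable[[], str], stream: TextIO, stage: str,
                   attempts: int = 3, delay: float = 0.0) -> str:
    """Run `task`, retrying only errors that are marked retryable."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            log_event(stream, stage, status="ok", attempt=attempt)
            return result
        except CrawlError as err:
            log_event(stream, stage, status="error", kind=err.kind, attempt=attempt)
            if not err.retryable or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # linear backoff between attempts
    raise AssertionError("unreachable")
```

Since every attempt is logged as one line, stats such as pages processed or failure counts per `kind` can be computed afterwards by scanning the log.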

---

## ✅ **7. CLI Tooling**

- [ ] **Comprehensive CLI (`wraith` or `crawl`)**
- [ ] **Supports config files + env overrides**
- [ ] **Interactive debug mode for live stepping through pages**
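
A possible shape for the config-file and environment-override behavior above, using `argparse`. The flag names, the JSON config format, and the `WRAITH_*` variable names are assumptions for illustration; the precedence used here is defaults, then config file, then environment, then explicit flags.

```python
import argparse
import json
from typing import Mapping, Sequence

DEFAULTS = {"depth": 1, "output": "out"}


def load_settings(argv: Sequence[str], env: Mapping[str, str]) -> dict:
    """Merge defaults, config file, environment, and flags into one dict."""
    parser = argparse.ArgumentParser(prog="wraith")
    parser.add_argument("url")
    parser.add_argument("--config", help="path to a JSON config file")
    parser.add_argument("--depth", type=int)
    parser.add_argument("--output")
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    if args.config:
        with open(args.config, encoding="utf-8") as fh:
            settings.update(json.load(fh))
    # Environment variables (hypothetical names) override the config file.
    if "WRAITH_DEPTH" in env:
        settings["depth"] = int(env["WRAITH_DEPTH"])
    if "WRAITH_OUTPUT" in env:
        settings["output"] = env["WRAITH_OUTPUT"]
    # Explicit command-line flags win over everything else.
    if args.depth is not None:
        settings["depth"] = args.depth
    if args.output is not None:
        settings["output"] = args.output
    settings["url"] = args.url
    return settings
```

An entry point would call `load_settings(sys.argv[1:], os.environ)`; passing both in as arguments keeps the function easy to test.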

---

## ✅ **8. Browser Extension (Uploader)**

- [ ] **Mitta browser extension to snapshot DOM or send screenshots**
- [ ] **Fallback upload to Wraith endpoint when JS rendering fails**
- [ ] **User-triggered or automated crawling triggers from extension UI**
- [ ] **Handles auth/session handoff for authenticated pages**

---

## ✅ **9. Mitta Frontend (Browser AI Interface)**

- [ ] **UI for uploading, crawling, searching**
- [ ] **Conversational chat interface to control Wraith agents**
- [ ] **Markdown report viewer**
- [ ] **Real-time feedback from LLM queries**
- [ ] **Image upload + analysis tools (OCR, object detection, etc.)**
- [ ] **Dashboard for managing crawled docs**

---

## ✅ **10. Authentication + Access Control**

- [ ] **Email + SMS-based login only (passwordless)**
- [ ] **2FA for paid accounts**
- [ ] **Don’t store or log credit card info**
- [ ] **Future: Google Authenticator integration**
- [ ] **Use Flee to review Auth.py for flow + state safety**
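
For the passwordless login item above, the core of a one-time code flow can be sketched as below. This is an illustration only and not the contents of Auth.py: the class name `LoginCodes`, the 6-digit format, the 5-minute lifetime, and the in-memory store are all assumptions, and delivery by email or SMS is left to the caller.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional


class LoginCodes:
    """In-memory store of hashed one-time codes keyed by email or phone number."""

    def __init__(self, ttl_seconds: int = 300) -> None:
        self.ttl = ttl_seconds
        self._pending: dict[str, tuple[bytes, float]] = {}

    @staticmethod
    def _digest(code: str) -> bytes:
        # Only the hash is kept, so the store never holds a usable code.
        return hashlib.sha256(code.encode()).digest()

    def issue(self, contact: str, now: Optional[float] = None) -> str:
        """Create a 6-digit code; the caller sends it by email or SMS."""
        now = time.time() if now is None else now
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[contact] = (self._digest(code), now + self.ttl)
        return code

    def verify(self, contact: str, code: str, now: Optional[float] = None) -> bool:
        """Single use: the pending entry is removed on any attempt."""
        now = time.time() if now is None else now
        entry = self._pending.pop(contact, None)
        if entry is None:
            return False
        digest, expires = entry
        # Constant-time comparison avoids leaking how much of the code matched.
        return now <= expires and hmac.compare_digest(digest, self._digest(code))
```

Removing the entry on every attempt means one wrong guess forces a new code, which limits brute-force attempts at the cost of some user friction; a production flow would likely also rate-limit `issue`.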

---

## ✅ **11. Open Source Packaging**

- [ ] **Make Mitta downloadable + local-run friendly (e.g., Raspberry Pi)**
- [ ] **Containerized stack (Frontend, Wraith, Vector DB)**
- [ ] **Docs for setup, config, and local security best practices**

---

