Workback Issues Crawl Plan

Overview

Crawl 106 issues from Workback.ai and export each to a structured markdown file with associated media.

Browser Setup

Use an existing Chrome instance with remote debugging enabled on port 9222, accessed via the browser MCP. The logged-in browser session handles authentication, so no separate credentials are needed.

Output Structure

./issues/
  ├── 4226/
  │   ├── issue-4226.md
  │   └── [media files]
  ├── 4221/
  │   ├── issue-4221.md
  │   └── [media files]
  └── ...
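A small helper can pin down the layout above so every phase writes to the same paths. This is a sketch; the function name is hypothetical, but the directory shape matches the tree:

```python
from pathlib import Path

def issue_paths(base: str, issue_id: int) -> tuple[Path, Path]:
    """Return (issue directory, markdown file path) for one issue,
    matching ./issues/{ID}/issue-{ID}.md."""
    issue_dir = Path(base) / "issues" / str(issue_id)
    return issue_dir, issue_dir / f"issue-{issue_id}.md"
```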

Crawl Strategy

Phase 1: Browser Connection

  1. Connect to Chrome instance on port 9222 using browser MCP
  2. Verify connection and navigation capability
  3. Test with one issue page to understand structure
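Step 1 can be smoke-tested outside the MCP by hitting Chrome's DevTools HTTP endpoint directly. A minimal sketch, assuming Chrome was started with `--remote-debugging-port=9222`; `check_connection` is a hypothetical helper, not part of the browser MCP:

```python
import json
import urllib.request

def devtools_url(port: int = 9222) -> str:
    """DevTools metadata endpoint exposed by a remote-debugging Chrome."""
    return f"http://localhost:{port}/json/version"

def check_connection(port: int = 9222) -> dict:
    """Fetch browser metadata; raises URLError if Chrome is not listening."""
    with urllib.request.urlopen(devtools_url(port), timeout=5) as resp:
        return json.loads(resp.read())
```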

Phase 2: Page Structure Analysis

For each issue page, identify and extract:

  1. Header Section

    • Issue ID
    • Title
    • Status
    • Stage
    • Severity
    • User flow
    • WCAG information (Criterion, Criterion Name, Guideline Name, Principle Name)
    • Dates (Created, Updated)
  2. Main Content

    • Description
    • Steps to reproduce
    • Expected vs actual behavior
    • Technical details
    • Code snippets
    • References/links
  3. Media Assets

    • Screenshots
    • Images
    • Videos (if any)
    • Other attachments
  4. Metadata

    • Comments/discussions
    • Related issues
    • Tags/labels
    • Assignees
    • Pull request links
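One way to hold the fields identified above is a single record type that the extraction and markdown-generation steps share. The field names here are illustrative, not the page's actual selectors:

```python
from dataclasses import dataclass, field

@dataclass
class IssueRecord:
    """Extracted fields for one Workback issue (names are assumptions)."""
    issue_id: int
    title: str
    status: str = ""
    stage: str = ""
    severity: str = ""
    user_flow: str = ""
    wcag_criterion: str = ""
    created: str = ""
    updated: str = ""
    body_markdown: str = ""      # description, steps, code, references
    media: list[str] = field(default_factory=list)  # relative media paths
```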

Phase 3: Extraction Process (per issue)

  1. Navigate to https://app.workback.ai/issues/{ID}/
  2. Wait for page load (check for dynamic content)
  3. Take snapshot of page structure
  4. Extract text content maintaining hierarchy:
    • Use semantic HTML structure
    • Preserve headings (h1-h6)
    • Preserve lists (ul, ol)
    • Preserve code blocks
    • Preserve tables
  5. Identify and download media:
    • Find all <img> tags
    • Find all <video> tags
    • Find all <a> tags pointing to media files
    • Download each media file
    • Rename media files descriptively (e.g., screenshot-1.png, diagram-1.svg)
    • Update markdown with relative paths to media
  6. Generate markdown file:
    • Use frontmatter for metadata
    • Structure content with proper markdown syntax
    • Preserve formatting (bold, italic, code, links)
    • Include media references
  7. Save to ./issues/{ID}/issue-{ID}.md
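Step 6 can be sketched as a template function that emits YAML frontmatter followed by the extracted body. The metadata keys are assumptions mirroring the header section from Phase 2:

```python
def render_issue_md(issue_id: int, title: str, meta: dict, body: str) -> str:
    """Render one issue as markdown with YAML frontmatter."""
    lines = ["---", f"id: {issue_id}", f'title: "{title}"']
    lines += [f"{k}: {v}" for k, v in meta.items()]
    lines += ["---", "", body, ""]
    return "\n".join(lines)
```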

Phase 4: Error Handling

  • If page fails to load: log error, skip issue, continue
  • If media download fails: log error, continue with text content
  • If structure is unexpected: log warning, extract what's available
  • Create error log file: ./issues/crawl-errors.log
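The error-handling rules above reduce to "log and keep going", so a shared file logger is enough. A minimal sketch writing to the planned log path; the format and level are assumptions:

```python
import logging

def make_error_logger(path: str = "./issues/crawl-errors.log") -> logging.Logger:
    """File logger for crawl errors; callers log and continue."""
    logger = logging.getLogger("crawl")
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.WARNING)  # warnings for odd structure, errors for failures
    return logger
```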

Phase 5: Progress Tracking

  • Create progress file: ./issues/crawl-progress.json
  • Track: completed issues, failed issues, current issue
  • Allow resumption from last completed issue
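The tracking scheme above can be sketched as three small functions; the JSON schema (`completed`/`failed`/`current` keys) follows the bullets, but the exact key names are assumptions:

```python
import json
from pathlib import Path

def load_progress(path: str) -> dict:
    """Load crawl-progress.json, or start fresh if it does not exist."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {"completed": [], "failed": [], "current": None}

def save_progress(path: str, progress: dict) -> None:
    """Persist progress after each issue so the crawl can resume."""
    Path(path).write_text(json.dumps(progress, indent=2))

def remaining(all_ids: list[int], progress: dict) -> list[int]:
    """IDs not yet completed or failed, in original order."""
    done = set(progress["completed"]) | set(progress["failed"])
    return [i for i in all_ids if i not in done]
```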

Issue List

Total issues: 106

Issue IDs (sorted): 3407, 3410, 3420, 3421, 3427, 3429, 3440, 3443, 3447, 3448, 3450, 3460, 3479, 3485, 3490, 3505, 3514, 3545, 3548, 3550, 3556, 3557, 3558, 3561, 3562, 3575, 3582, 3585, 3586, 3590, 3598, 3607, 3614, 3615, 3617, 3618, 3619, 3620, 3621, 3626, 3627, 3628, 3631, 3634, 3640, 3642, 3650, 3668, 3676, 3682, 3687, 3697, 3701, 3712, 3715, 3716, 3717, 3727, 3731, 3737, 3739, 3742, 3744, 3748, 3828, 3834, 3850, 3852, 3853, 3866, 3867, 3868, 3878, 3883, 3885, 3896, 3897, 3902, 3904, 3909, 3917, 3919, 3927, 3928, 3929, 3930, 3931, 3932, 3933, 3934, 3935, 3938, 4144, 4161, 4164, 4171, 4180, 4185, 4189, 4194, 4204, 4208, 4215, 4221, 4226

Implementation Steps

  1. Setup

    • Verify browser connection on port 9222
    • Test navigation to entrypoint
    • Analyze page structure of one issue
  2. Create extraction script/logic

    • Define markdown template structure
    • Define media download logic
    • Define error handling
  3. Execute crawl

    • Process issues sequentially
    • Save progress after each issue
    • Handle rate limiting if needed
  4. Verification

    • Check all folders created
    • Verify markdown files exist
    • Verify media files downloaded
    • Review error log
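The verification bullets can be automated with a pass over the output tree. A sketch, assuming the Phase 3 layout; the report shape is an assumption:

```python
from pathlib import Path

def verify_output(base: str, ids: list[int]) -> dict:
    """Check that each issue has its folder and markdown file."""
    missing_dir, missing_md = [], []
    for issue_id in ids:
        d = Path(base) / str(issue_id)
        if not d.is_dir():
            missing_dir.append(issue_id)
        elif not (d / f"issue-{issue_id}.md").is_file():
            missing_md.append(issue_id)
    return {"missing_dir": missing_dir, "missing_md": missing_md}
```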

Notes

  • Rate limiting: Add delays between requests if needed
  • Authentication: Browser session should handle auth automatically
  • Dynamic content: Wait for JavaScript-rendered content
  • Media naming: Use descriptive names based on context or alt text
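The media-naming note can be sketched as a slug function that prefers alt text and falls back to a numbered generic name; the slug rules are assumptions:

```python
import re

def media_name(alt: str, index: int, ext: str) -> str:
    """Derive a descriptive filename from alt text, e.g. 'login-screen-1.png'."""
    slug = re.sub(r"[^a-z0-9]+", "-", (alt or "").lower()).strip("-")
    return f"{slug or 'media'}-{index}{ext}"
```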