Workback Issues Crawl Plan

Overview

Crawl 106 issues from Workback.ai and export each to a structured markdown file with associated media.

Browser Setup

Use an existing Chrome instance with remote debugging enabled on port 9222, accessed via the browser MCP. The logged-in browser session handles authentication, so no separate credentials are needed.

Output Structure

./issues/
  ├── 4226/
  │   ├── issue-4226.md
  │   └── [media files]
  ├── 4221/
  │   ├── issue-4221.md
  │   └── [media files]
  └── ...
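A small helper can pin down the layout above so every phase writes to the same paths. This is a sketch; the function name is hypothetical, but the directory shape matches the tree:

```python
from pathlib import Path

def issue_paths(base: str, issue_id: int) -> tuple[Path, Path]:
    """Return (issue directory, markdown file path) for one issue,
    matching ./issues/{ID}/issue-{ID}.md."""
    issue_dir = Path(base) / "issues" / str(issue_id)
    return issue_dir, issue_dir / f"issue-{issue_id}.md"
```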

Crawl Strategy

Phase 1: Browser Connection

  1. Connect to Chrome instance on port 9222 using browser MCP
  2. Verify connection and navigation capability
  3. Test with one issue page to understand structure
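Step 1 can be smoke-tested outside the MCP by hitting Chrome's DevTools HTTP endpoint directly. A minimal sketch, assuming Chrome was started with `--remote-debugging-port=9222`; `check_connection` is a hypothetical helper, not part of the browser MCP:

```python
import json
import urllib.request

def devtools_url(port: int = 9222) -> str:
    """DevTools metadata endpoint exposed by a remote-debugging Chrome."""
    return f"http://localhost:{port}/json/version"

def check_connection(port: int = 9222) -> dict:
    """Fetch browser metadata; raises URLError if Chrome is not listening."""
    with urllib.request.urlopen(devtools_url(port), timeout=5) as resp:
        return json.loads(resp.read())
```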

Phase 2: Page Structure Analysis

For each issue page, identify and extract:

  1. Header Section

    • Issue ID
    • Title
    • Status
    • Stage
    • Severity
    • User flow
    • WCAG information (Criterion, Criterion Name, Guideline Name, Principle Name)
    • Dates (Created, Updated)
  2. Main Content

    • Description
    • Steps to reproduce
    • Expected vs actual behavior
    • Technical details
    • Code snippets
    • References/links
  3. Media Assets

    • Screenshots
    • Images
    • Videos (if any)
    • Other attachments
  4. Metadata

    • Comments/discussions
    • Related issues
    • Tags/labels
    • Assignees
    • Pull request links
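One way to hold the fields identified above is a single record type that the extraction and markdown-generation steps share. The field names here are illustrative, not the page's actual selectors:

```python
from dataclasses import dataclass, field

@dataclass
class IssueRecord:
    """Extracted fields for one Workback issue (names are assumptions)."""
    issue_id: int
    title: str
    status: str = ""
    stage: str = ""
    severity: str = ""
    user_flow: str = ""
    wcag_criterion: str = ""
    created: str = ""
    updated: str = ""
    body_markdown: str = ""      # description, steps, code, references
    media: list[str] = field(default_factory=list)  # relative media paths
```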

Phase 3: Extraction Process (per issue)

  1. Navigate to https://app.workback.ai/issues/{ID}/
  2. Wait for page load (check for dynamic content)
  3. Take snapshot of page structure
  4. Extract text content maintaining hierarchy:
    • Use semantic HTML structure
    • Preserve headings (h1-h6)
    • Preserve lists (ul, ol)
    • Preserve code blocks
    • Preserve tables
  5. Identify and download media:
    • Find all <img> tags
    • Find all <video> tags
    • Find all <a> tags pointing to media files
    • Download each media file
    • Rename media files descriptively (e.g., screenshot-1.png, diagram-1.svg)
    • Update markdown with relative paths to media
  6. Generate markdown file:
    • Use frontmatter for metadata
    • Structure content with proper markdown syntax
    • Preserve formatting (bold, italic, code, links)
    • Include media references
  7. Save to ./issues/{ID}/issue-{ID}.md
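Step 6 can be sketched as a template function that emits YAML frontmatter followed by the extracted body. The metadata keys are assumptions mirroring the header section from Phase 2:

```python
def render_issue_md(issue_id: int, title: str, meta: dict, body: str) -> str:
    """Render one issue as markdown with YAML frontmatter."""
    lines = ["---", f"id: {issue_id}", f'title: "{title}"']
    lines += [f"{k}: {v}" for k, v in meta.items()]
    lines += ["---", "", body, ""]
    return "\n".join(lines)
```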

Phase 4: Error Handling

  • If page fails to load: log error, skip issue, continue
  • If media download fails: log error, continue with text content
  • If structure is unexpected: log warning, extract what's available
  • Create error log file: ./issues/crawl-errors.log
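The error-handling rules above reduce to "log and keep going", so a shared file logger is enough. A minimal sketch writing to the planned log path; the format and level are assumptions:

```python
import logging

def make_error_logger(path: str = "./issues/crawl-errors.log") -> logging.Logger:
    """File logger for crawl errors; callers log and continue."""
    logger = logging.getLogger("crawl")
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.WARNING)  # warnings for odd structure, errors for failures
    return logger
```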

Phase 5: Progress Tracking

  • Create progress file: ./issues/crawl-progress.json
  • Track: completed issues, failed issues, current issue
  • Allow resumption from last completed issue
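The tracking scheme above can be sketched as three small functions; the JSON schema (`completed`/`failed`/`current` keys) follows the bullets, but the exact key names are assumptions:

```python
import json
from pathlib import Path

def load_progress(path: str) -> dict:
    """Load crawl-progress.json, or start fresh if it does not exist."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {"completed": [], "failed": [], "current": None}

def save_progress(path: str, progress: dict) -> None:
    """Persist progress after each issue so the crawl can resume."""
    Path(path).write_text(json.dumps(progress, indent=2))

def remaining(all_ids: list[int], progress: dict) -> list[int]:
    """IDs not yet completed or failed, in original order."""
    done = set(progress["completed"]) | set(progress["failed"])
    return [i for i in all_ids if i not in done]
```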

Issue List

Total issues: 106

Issue IDs (sorted): 3407, 3410, 3420, 3421, 3427, 3429, 3440, 3443, 3447, 3448, 3450, 3460, 3479, 3485, 3490, 3505, 3514, 3545, 3548, 3550, 3556, 3557, 3558, 3561, 3562, 3575, 3582, 3585, 3586, 3590, 3598, 3607, 3614, 3615, 3617, 3618, 3619, 3620, 3621, 3626, 3627, 3628, 3631, 3634, 3640, 3642, 3650, 3668, 3676, 3682, 3687, 3697, 3701, 3712, 3715, 3716, 3717, 3727, 3731, 3737, 3739, 3742, 3744, 3748, 3828, 3834, 3850, 3852, 3853, 3866, 3867, 3868, 3878, 3883, 3885, 3896, 3897, 3902, 3904, 3909, 3917, 3919, 3927, 3928, 3929, 3930, 3931, 3932, 3933, 3934, 3935, 3938, 4144, 4161, 4164, 4171, 4180, 4185, 4189, 4194, 4204, 4208, 4215, 4221, 4226

Implementation Steps

  1. Setup

    • Verify browser connection on port 9222
    • Test navigation to entrypoint
    • Analyze page structure of one issue
  2. Create extraction script/logic

    • Define markdown template structure
    • Define media download logic
    • Define error handling
  3. Execute crawl

    • Process issues sequentially
    • Save progress after each issue
    • Handle rate limiting if needed
  4. Verification

    • Check all folders created
    • Verify markdown files exist
    • Verify media files downloaded
    • Review error log
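The verification bullets can be automated with a pass over the output tree. A sketch, assuming the Phase 3 layout; the report shape is an assumption:

```python
from pathlib import Path

def verify_output(base: str, ids: list[int]) -> dict:
    """Check that each issue has its folder and markdown file."""
    missing_dir, missing_md = [], []
    for issue_id in ids:
        d = Path(base) / str(issue_id)
        if not d.is_dir():
            missing_dir.append(issue_id)
        elif not (d / f"issue-{issue_id}.md").is_file():
            missing_md.append(issue_id)
    return {"missing_dir": missing_dir, "missing_md": missing_md}
```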

Notes

  • Rate limiting: Add delays between requests if needed
  • Authentication: Browser session should handle auth automatically
  • Dynamic content: Wait for JavaScript-rendered content
  • Media naming: Use descriptive names based on context or alt text
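The media-naming note can be sketched as a slug function that prefers alt text and falls back to a numbered generic name; the slug rules are assumptions:

```python
import re

def media_name(alt: str, index: int, ext: str) -> str:
    """Derive a descriptive filename from alt text, e.g. 'login-screen-1.png'."""
    slug = re.sub(r"[^a-z0-9]+", "-", (alt or "").lower()).strip("-")
    return f"{slug or 'media'}-{index}{ext}"
```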