Crawl 106 issues from Workback.ai and export each to a structured markdown file with associated media.
- Entrypoint: https://app.workback.ai/issues/
- Chrome remote debugging port: 9222
- Issue URL pattern: https://app.workback.ai/issues/{ID}/
```
./issues/
├── 4226/
│   ├── issue-4226.md
│   └── [media files]
├── 4221/
│   ├── issue-4221.md
│   └── [media files]
└── ...
```
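The URL pattern and folder layout above can be captured in two small helpers. This is a minimal sketch; the function names (`issue_url`, `issue_paths`) are hypothetical, not part of any existing tooling.

```python
from pathlib import Path

# Template taken from the issue URL pattern in the plan.
ISSUE_URL_TEMPLATE = "https://app.workback.ai/issues/{id}/"

def issue_url(issue_id: int) -> str:
    """Build the per-issue URL from the pattern."""
    return ISSUE_URL_TEMPLATE.format(id=issue_id)

def issue_paths(issue_id: int, root: str = "./issues"):
    """Return (folder, markdown file) paths matching the output layout."""
    folder = Path(root) / str(issue_id)
    return folder, folder / f"issue-{issue_id}.md"
```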
- Connect to Chrome instance on port 9222 using browser MCP
- Verify connection and navigation capability
- Test with one issue page to understand structure
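Connection verification can be done without any MCP tooling by probing Chrome's DevTools HTTP endpoint, which serves JSON at `/json/version` including a `webSocketDebuggerUrl` when the debugger is attachable. A sketch, assuming the default localhost binding:

```python
import json
from urllib.request import urlopen

DEVTOOLS_VERSION_URL = "http://localhost:9222/json/version"

def chrome_is_reachable(version_info: dict) -> bool:
    # The presence of "webSocketDebuggerUrl" in the /json/version payload
    # confirms a remote-debugging session can be attached.
    return "webSocketDebuggerUrl" in version_info

def verify_connection() -> bool:
    """Probe the DevTools endpoint; False on any network error."""
    try:
        with urlopen(DEVTOOLS_VERSION_URL, timeout=5) as resp:
            return chrome_is_reachable(json.load(resp))
    except OSError:
        return False
```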
For each issue page, identify and extract:
- **Header Section**
  - Issue ID
  - Title
  - Status
  - Stage
  - Severity
  - User flow
  - WCAG information (Criterion, Criterion Name, Guideline Name, Principle Name)
  - Dates (Created, Updated)
- **Main Content**
  - Description
  - Steps to reproduce
  - Expected vs. actual behavior
  - Technical details
  - Code snippets
  - References/links
- **Media Assets**
  - Screenshots
  - Images
  - Videos (if any)
  - Other attachments
- **Metadata**
  - Comments/discussions
  - Related issues
  - Tags/labels
  - Assignees
  - Pull request links
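The extracted fields above could be collected into a simple record type. This is a hypothetical container, shown with a subset of the fields; the actual extraction logic would populate it per page.

```python
from dataclasses import dataclass, field

@dataclass
class IssueRecord:
    """Hypothetical container for fields extracted from one issue page."""
    issue_id: int
    title: str = ""
    status: str = ""
    severity: str = ""
    wcag_criterion: str = ""
    description: str = ""
    media: list = field(default_factory=list)  # downloaded media filenames
```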
- Navigate to `https://app.workback.ai/issues/{ID}/`
- Wait for the page to load (check for dynamic content)
- Take snapshot of page structure
- Extract text content, maintaining hierarchy:
  - Use semantic HTML structure
  - Preserve headings (h1–h6)
  - Preserve lists (ul, ol)
  - Preserve code blocks
  - Preserve tables
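Hierarchy-preserving extraction can be sketched with the standard-library `html.parser`. This minimal version handles only headings, list items, and plain text; a real crawl would extend it to code blocks and tables.

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Minimal sketch: map h1-h6 and li to markdown prefixes."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)
            self.prefix = ""

def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n\n".join(parser.out)
```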
- Identify and download media:
  - Find all `<img>` tags
  - Find all `<video>` tags
  - Find all `<a>` tags pointing to media files
- Download each media file
- Rename media files descriptively (e.g., `screenshot-1.png`, `diagram-1.svg`)
- Update markdown with relative paths to media
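Descriptive renaming can be derived from the image's alt text, falling back to a generic name. A sketch; the `.png` fallback for extensionless URLs and the example CDN URL are assumptions.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

def media_filename(src_url: str, alt: str = "", index: int = 1) -> str:
    """Build a descriptive filename from alt text plus the URL's extension."""
    ext = PurePosixPath(urlparse(src_url).path).suffix or ".png"  # assumed fallback
    slug = re.sub(r"[^a-z0-9]+", "-", alt.lower()).strip("-")
    base = slug or "screenshot"
    return f"{base}-{index}{ext}"
```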
- Generate markdown file:
  - Use frontmatter for metadata
  - Structure content with proper markdown syntax
  - Preserve formatting (bold, italic, code, links)
  - Include media references
- Save to `./issues/{ID}/issue-{ID}.md`
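The frontmatter step can be sketched as a naive key/value renderer. Note this does no YAML escaping; a real implementation would use a YAML library for values containing colons or newlines.

```python
def frontmatter(meta: dict) -> str:
    """Render a naive YAML-style frontmatter block (no escaping)."""
    lines = ["---"] + [f"{k}: {v}" for k, v in meta.items()] + ["---", ""]
    return "\n".join(lines)
```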
- If a page fails to load: log the error, skip the issue, and continue
- If a media download fails: log the error and continue with the text content
- If the structure is unexpected: log a warning and extract what's available
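The log-and-continue policy can be wrapped around each issue. A sketch with hypothetical names; `process` stands in for whatever per-issue extraction logic is used.

```python
import logging

def crawl_issue_safely(issue_id, process, errors):
    """Run one issue's extraction; on failure, log, record, and move on."""
    try:
        process(issue_id)
        return True
    except Exception as exc:
        logging.error("issue %s failed: %s", issue_id, exc)
        errors.append(issue_id)
        return False
```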
- Create error log file: `./issues/crawl-errors.log`
- Create progress file: `./issues/crawl-progress.json`
  - Track: completed issues, failed issues, current issue
- Allow resumption from the last completed issue
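Progress tracking and resumption can be sketched as a JSON round-trip; the field names match the tracking list above, and a missing file means a fresh start.

```python
import json
from pathlib import Path

def save_progress(path, completed, failed, current):
    """Write the progress file after each issue."""
    Path(path).write_text(json.dumps(
        {"completed": completed, "failed": failed, "current": current},
        indent=2))

def remaining_issues(path, all_ids):
    """IDs not yet completed; the full list if no progress file exists."""
    try:
        done = set(json.loads(Path(path).read_text())["completed"])
    except FileNotFoundError:
        done = set()
    return [i for i in all_ids if i not in done]
```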
Total issues: 106
Issue IDs (sorted): 3407, 3410, 3420, 3421, 3427, 3429, 3440, 3443, 3447, 3448, 3450, 3460, 3479, 3485, 3490, 3505, 3514, 3545, 3548, 3550, 3556, 3557, 3558, 3561, 3562, 3575, 3582, 3585, 3586, 3590, 3598, 3607, 3614, 3615, 3617, 3618, 3619, 3620, 3621, 3626, 3627, 3628, 3631, 3634, 3640, 3642, 3650, 3668, 3676, 3682, 3687, 3697, 3701, 3712, 3715, 3716, 3717, 3727, 3731, 3737, 3739, 3742, 3744, 3748, 3828, 3834, 3850, 3852, 3853, 3866, 3867, 3868, 3878, 3883, 3885, 3896, 3897, 3902, 3904, 3909, 3917, 3919, 3927, 3928, 3929, 3930, 3931, 3932, 3933, 3934, 3935, 3938, 4144, 4161, 4164, 4171, 4180, 4185, 4189, 4194, 4204, 4208, 4215, 4221, 4226
- **Setup**
  - Verify browser connection on port 9222
  - Test navigation to the entrypoint
  - Analyze the page structure of one issue
- **Create extraction script/logic**
  - Define markdown template structure
  - Define media download logic
  - Define error handling
- **Execute crawl**
  - Process issues sequentially
  - Save progress after each issue
  - Handle rate limiting if needed
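The sequential loop with a politeness delay can be sketched as follows; `process_issue` is a placeholder for the actual per-issue extraction, and the one-second default delay is an assumption.

```python
import time

def crawl_all(issue_ids, process_issue, delay_seconds=1.0):
    """Process issues in order, collecting successes and failures."""
    completed, failed = [], []
    for issue_id in issue_ids:
        try:
            process_issue(issue_id)
            completed.append(issue_id)
        except Exception:
            failed.append(issue_id)
        time.sleep(delay_seconds)  # crude rate limiting between requests
    return completed, failed
```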
- **Verification**
  - Check all folders created
  - Verify markdown files exist
  - Verify media files downloaded
  - Review the error log
- Rate limiting: Add delays between requests if needed
- Authentication: Browser session should handle auth automatically
- Dynamic content: Wait for JavaScript-rendered content
- Media naming: Use descriptive names based on context or alt text