Markdown Crawler for Amplitude Docs

A Node.js-based crawler that detects unrendered markdown content in your Statamic-generated static site, with special focus on tables and other markdown syntax that may not be properly processed.

Features

✅ Sitemap-based crawling - Uses your existing sitemap as entry point
✅ Table detection - Specifically looks for unrendered table markdown (| characters)
✅ Comprehensive markdown patterns - Headers, links, code blocks, lists, emphasis
✅ False positive filtering - Smart detection to avoid false alarms
✅ Concurrent crawling - Configurable concurrency for performance
✅ Detailed reporting - JSON reports with samples and statistics
✅ CI/CD integration - Ready-to-use GitHub Actions workflow
✅ Local & production ready - Works in any environment
✅ PR comments - Automatic feedback on pull requests

Installation

Install the required dependency:

npm install fast-xml-parser

The crawler script (crawler.js) and GitHub Actions workflow are already configured.

Usage

Local Development

Quick Start

# Start your local Statamic site first
php please serve

# Run the crawler against local site
npm run crawl:local

Custom Options

# Run with custom configuration
node crawler.js \
  --base-url http://localhost:8000 \
  --sitemap-path /docs/sitemap-docs.xml \
  --output my-report.json \
  --concurrency 5 \
  --verbose

Production Check

# Check your production site
npm run crawl:production

# Or with custom production URL
node crawler.js --base-url https://your-production-site.com

Ignoring Pages

# Use ignore file (recommended)
npm run crawl:local:ignore

# Ignore specific URLs
node crawler.js --ignore-url "/docs/test-page" --ignore-url "/docs/sandbox"

# Ignore URL patterns
node crawler.js --ignore-pattern "/docs/admin/*" --ignore-pattern "*/draft/*"

# Combine ignore methods
node crawler.js --ignore-file .crawlerignore --ignore-url "/docs/temp"

Command Line Options

Option	Description	Default
`--base-url`	Base URL to crawl	`http://localhost:8000` (local), `https://amplitude.com` (CI)
`--sitemap-path`	Path to sitemap	`/docs/sitemap-docs.xml`
`--output`	Output file for report	`markdown-issues.json`
`--concurrency`	Number of concurrent requests	`10`
`--timeout`	Request timeout in milliseconds	`30000`
`--ignore-pattern`	URL pattern to ignore (supports wildcards)	none
`--ignore-url`	Specific URL to ignore	none
`--ignore-file`	File containing ignore patterns/URLs	none
`--verbose`	Enable verbose logging	`false`

Ignore Configuration

Ignore File Format

Create a .crawlerignore file (or any filename) with patterns and URLs to skip:

# Lines starting with # are comments
# Empty lines are ignored

# Ignore specific URLs (exact matches)
/docs/test-page
/docs/sandbox

# Ignore URL patterns using wildcards
/docs/admin/*          # Ignore all URLs starting with /docs/admin/
/docs/*/test-*        # Ignore test pages in any subdirectory
**/draft/**           # Ignore draft pages at any level

# Ignore file types
*.pdf
*.zip

# Ignore entire sections
/docs/jp/*            # Ignore Japanese documentation

Pattern Matching

* matches any characters
? matches single character
**/ matches any directory levels
Patterns are anchored (must match from start to end)
Both absolute URLs and relative paths work

Example Ignore Scenarios

# Ignore all test/example pages
node crawler.js --ignore-pattern "*/test-*" --ignore-pattern "*/example*"

# Ignore specific documentation sections
node crawler.js --ignore-pattern "/docs/admin/*" --ignore-pattern "/docs/internal/*"

# Ignore non-English content
node crawler.js --ignore-pattern "/docs/jp/*" --ignore-pattern "/docs/es/*"

# Use ignore file for complex rules
node crawler.js --ignore-file .crawlerignore

Detection Patterns

The crawler detects the following unrendered markdown patterns:

Tables

Pattern: |column1|column2|column3|
Description: Unrendered table markdown with pipe characters

Headers

Pattern: # Header, ## Subheader, etc.
Description: Unrendered headers with hash symbols

Code Blocks

Pattern: `code` or code blocks
Description: Unrendered inline code or code blocks

Links

Pattern: [link text](url)
Description: Unrendered markdown links

Emphasis

Pattern: **bold** or *italic*
Description: Unrendered bold or italic text

Lists

Pattern: - item or * item
Description: Unrendered bullet lists

Report Format

The crawler generates a JSON report with the following structure:

{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "summary": {
    "totalUrls": 150,
    "totalErrors": 2,
    "totalSkipped": 12,
    "totalIssues": 5,
    "urlsWithIssues": 3,
    "ignorePatterns": ["/docs/admin/*", "*/draft/*"],
    "ignoreUrls": ["/docs/test-page"],
    "issuesByType": [
      {
        "type": "tables",
        "count": 3,
        "description": "Unrendered table (pipe characters)"
      }
    ]
  },
  "markdownIssues": [
    {
      "type": "tables",
      "description": "Unrendered table (pipe characters)",
      "count": 1,
      "samples": ["|Column 1|Column 2|Column 3|"],
      "url": "https://amplitude.com/docs/some-page"
    }
  ],
  "issuesByType": {
    "tables": [/* array of table issues */]
  },
  "issuesByUrl": {
    "https://amplitude.com/docs/some-page": [/* array of issues for this URL */]
  },
  "skippedUrls": [
    "https://amplitude.com/docs/admin/settings",
    "https://amplitude.com/docs/test-page"
  ]
}

CI/CD Integration

GitHub Actions

The workflow (.github/workflows/markdown-check.yml) automatically:

Builds your site using the existing build process
Starts a local server with the static files
Runs the crawler against the local build
Comments on PRs with results
Uploads reports as artifacts
Fails the build if issues are found

Triggers

Push to main or develop branches
Pull requests to main
Daily schedule at 2 AM UTC
Manual trigger from GitHub Actions tab

Environment Variables

The workflow uses these variables:

APP_URL: Your application URL (set in GitHub repository variables)

Local Testing Workflow

Make changes to your Statamic content
Generate static site: php please ssg:generate

Start local server:

cd storage/app/static
python3 -m http.server 8000

Run crawler: npm run crawl:local
Review report: Check markdown-issues.json

Troubleshooting

Common Issues

Sitemap Not Found

Error: Sitemap request failed with status 404

Solution: Verify your sitemap path. Check if /docs/sitemap-docs.xml is accessible.

Connection Refused

Error: Request failed for http://localhost:8000: connect ECONNREFUSED

Solution: Ensure your local server is running before running the crawler.

False Positives

If you get false positives for tables that are actually rendered correctly, the issue might be:

Markdown content inside <script> tags (filtered out automatically)
Markdown inside code examples (check if it should be in a code block)

Debugging

Enable verbose logging:

node crawler.js --verbose

This will show:

Each URL being crawled
Progress updates every 50 URLs
Detailed error messages

Performance Tuning

For large sites, adjust concurrency:

# Lower concurrency for stability
node crawler.js --concurrency 3

# Higher concurrency for speed (if server can handle it)
node crawler.js --concurrency 15

Integration with Existing Workflow

The crawler integrates seamlessly with your existing deployment process:

No changes needed to your existing build process
Runs in parallel to your regular deploy workflow
Optional enforcement - you can choose whether to fail builds on issues
Non-intrusive - doesn't affect your production deployment

Customization

Adding New Patterns

Edit crawler.js and add new patterns to the markdownPatterns object:

this.markdownPatterns = {
    // ... existing patterns
    customPattern: {
        regex: /your-regex-here/g,
        description: 'Description of what this detects'
    }
};

Filtering Specific URLs

You can modify the crawler to skip certain URLs by adding filtering logic in the fetchSitemap() method:

urls = urls.filter(url => !url.includes('/skip-this-path/'));

Custom Report Format

Modify the generateReport() method to customize the output format or add additional metrics.

Support

For issues or questions:

Check the GitHub Actions logs for detailed error information
Review the generated markdown-issues.json report
Test locally with --verbose flag for debugging

FilesExpand file tree

CRAWLER_README.md

Latest commit

History