Custom JavaScript function for Screaming Frog SEO Spider that extracts main body content from web pages and converts it to clean markdown format.
- Intelligent content extraction with multi-level fallback strategies
- Primary: Extracts from detected main content containers
- Secondary: Falls back to full body extraction if content is minimal
- Tertiary: Extracts from semantic elements (headings, paragraphs, lists) as last resort
- Advanced element filtering to exclude navigation, headers, footers, sidebars, and UI overlays
- ID-based detection (header, footer, site-header, site-footer)
- Class-based detection (cookie-banners, newsletters, modals, loading animations)
- Smart ARIA role filtering (role="banner", role="navigation", role="complementary")
- Converts HTML elements to proper markdown syntax
- Supports headings, paragraphs, lists, links, images, tables, code blocks, and more
- Follows best practices from Mozilla's Readability library
- Handles text formatting (bold, italic, strikethrough)
- Preserves document structure and semantic meaning
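The multi-level fallback described above can be sketched as follows. This is an illustration only: the function name `extractMainContent` and the `MIN_CONTENT_LENGTH` threshold are assumptions, not taken from the actual script.

```javascript
// Illustrative sketch of the three-level fallback strategy.
// extractMainContent and MIN_CONTENT_LENGTH are hypothetical names.
function extractMainContent(doc) {
  var MIN_CONTENT_LENGTH = 200; // assumed threshold for "minimal" content

  // 1. Primary: a detected main content container
  var main = doc.querySelector('main article, main, article, [role="main"]');
  if (main && main.textContent.trim().length >= MIN_CONTENT_LENGTH) {
    return main;
  }

  // 2. Secondary: fall back to the full document body
  if (doc.body && doc.body.textContent.trim().length >= MIN_CONTENT_LENGTH) {
    return doc.body;
  }

  // 3. Tertiary: collect semantic elements as a last resort
  var container = doc.createElement('div');
  doc.querySelectorAll('h1, h2, h3, h4, h5, h6, p, ul, ol').forEach(function (el) {
    container.appendChild(el.cloneNode(true));
  });
  return container;
}
```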
- JavaScript rendering enabled (Crawl Config > Spider > Rendering)
- Store Rendered HTML enabled (Crawl Config > Spider > Extraction)
- Open Screaming Frog SEO Spider
- Navigate to Configuration > Custom > Custom JavaScript
- Click Add to create a new custom JavaScript function
- Paste the JavaScript code from `extract-content-convert-md.js` (or `new.js` for the enhanced version)
- Ensure JavaScript rendering is enabled (Configuration > Spider > Rendering)
- Enable Store Rendered HTML (Configuration > Spider > Extraction)
- Save the configuration
Once configured, the custom extraction will run automatically during crawls. The markdown output will appear in a custom column that you can:
- Export to CSV or Excel
- View in the Screaming Frog interface
- Use for content analysis and SEO audits
- Headings (H1-H6)
- Paragraphs and line breaks
- Text formatting (bold, italic, code)
- Links and images
- Ordered and unordered lists
- Blockquotes
- Tables
- Code blocks
- Horizontal rules
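To illustrate the kind of mapping involved, here is a regex-based sketch for a few of the inline elements listed above. The actual script walks the rendered DOM node by node rather than using regexes; `inlineHtmlToMarkdown` is a hypothetical name.

```javascript
// Hypothetical sketch: convert a handful of inline HTML elements
// to their markdown equivalents.
function inlineHtmlToMarkdown(html) {
  return html
    .replace(/<(strong|b)>(.*?)<\/\1>/g, '**$2**')
    .replace(/<(em|i)>(.*?)<\/\1>/g, '*$2*')
    .replace(/<(del|s)>(.*?)<\/\1>/g, '~~$2~~')
    .replace(/<code>(.*?)<\/code>/g, '`$1`')
    .replace(/<a href="([^"]*)">(.*?)<\/a>/g, '[$2]($1)')
    .replace(/<h([1-6])>(.*?)<\/h\1>/g, function (m, level, text) {
      return '#'.repeat(Number(level)) + ' ' + text + '\n';
    });
}
```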
The script uses multiple filtering strategies to identify and exclude non-content elements:
HTML Tag Filtering:
`<header>`, `<footer>`, `<nav>`, `<aside>`, `<noscript>`, `<script>`, `<style>`
ID-Based Filtering:
- Elements with IDs: `header`, `footer`, `site-header`, `site-footer`
Class-Based Filtering:
- Common patterns: `theme-header`, `theme-footer`, `site-header`, `site-footer`, `global-header`, `global-footer`
- UI overlays: `loading-animation`, `cookie-banner`, `cookie-consent`, `newsletter-popup`, `modal-overlay`
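The tag, ID, and class filters above could be combined as in the following sketch. The lists mirror the values documented here, but `isExcludedElement` is a hypothetical name and the real script's logic may differ.

```javascript
// Illustrative sketch of tag-, ID-, and class-based exclusion checks.
var EXCLUDED_TAGS = ['HEADER', 'FOOTER', 'NAV', 'ASIDE', 'NOSCRIPT', 'SCRIPT', 'STYLE'];
var EXCLUDED_IDS = ['header', 'footer', 'site-header', 'site-footer'];
var EXCLUDED_CLASS_PATTERNS = [
  'theme-header', 'theme-footer', 'site-header', 'site-footer',
  'global-header', 'global-footer', 'loading-animation',
  'cookie-banner', 'cookie-consent', 'newsletter-popup', 'modal-overlay'
];

function isExcludedElement(el) {
  if (EXCLUDED_TAGS.indexOf(el.tagName) !== -1) return true;
  if (el.id && EXCLUDED_IDS.indexOf(el.id) !== -1) return true;
  var className = el.className || '';
  return EXCLUDED_CLASS_PATTERNS.some(function (pattern) {
    return className.indexOf(pattern) !== -1;
  });
}
```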
ARIA Role Filtering:
role="navigation"- always excludedrole="banner"- excluded only if contains nav or is at top levelrole="complementary"- excluded only if <40% of parent width (sidebar detection)
The script searches for content in 20+ different selectors before falling back to extraction strategies:
- WordPress/theme-specific: `.entry-content`, `.wp-block-post-content`
- Semantic HTML: `main article`, `main`, `article`, `[role="main"]`
- Common class patterns: `.content`, `.main-content`, `.page-content`, `.site-content`, `.post-content`, `.article-content`, `.body-content`
- Data attributes: `[data-content]`, `[data-main-content]`
- Layout containers: `.layout-container`, `.container`
- IDs: `#content`, `#main`, `#main-content`
- Full document body as final fallback
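The ordered search over these selectors amounts to a first-match-wins loop, sketched below. The selector list mirrors the ones documented here; the exact list, ordering, and function name in the script may differ.

```javascript
// Sketch of the ordered content-container search with body fallback.
var CONTENT_SELECTORS = [
  '.entry-content', '.wp-block-post-content',
  'main article', 'main', 'article', '[role="main"]',
  '.content', '.main-content', '.page-content', '.site-content',
  '.post-content', '.article-content', '.body-content',
  '[data-content]', '[data-main-content]',
  '.layout-container', '.container',
  '#content', '#main', '#main-content'
];

function findContentContainer(doc) {
  for (var i = 0; i < CONTENT_SELECTORS.length; i++) {
    var el = doc.querySelector(CONTENT_SELECTORS[i]);
    if (el) return el; // first match wins
  }
  return doc.body; // final fallback: the full document body
}
```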
To modify the filtering rules, edit the condition checks in the `processNode()` function.
MIT License - feel free to modify and use for your own projects.
Contributions are welcome. Please open an issue or submit a pull request with improvements or bug fixes.