A powerful CLI tool that extracts and converts HTML content from URLs into clean, well-formatted Markdown files. The tool automatically extracts the main article content while removing ads, sidebars, and other clutter.
- 🌐 URL Processing: Convert single URLs, multiple URLs, or URLs from a file
- 🔄 Recursive Processing: Automatically discover and process linked articles with configurable depth limits
- 🎯 Smart Content Extraction: Uses Mozilla Readability to isolate main article content
- 🧹 Content Cleaning: Automatically removes CSS, JavaScript, ads, and navigation elements
- 🎨 Custom Selectors: Optionally specify CSS selectors for targeted content extraction
- 🌐 Domain Filtering: Option to restrict recursive processing to same domain only
- 🔗 Intelligent Link Detection: AI-powered link classification to identify article-like content
- 📁 Flexible Output: Configurable output directory with auto-generated filenames
- 📊 Progress Tracking: Real-time progress indicators with depth tracking and detailed summary reports
- ⚡ TypeScript: Built with TypeScript for reliability and type safety
- 🧪 Tested: Comprehensive unit tests covering core functionality
# Install globally to use 'lxt' command anywhere
npm install -g lextract
# Use the CLI directly
lxt https://example.com
lxt stats

# Clone the repository
git clone <repository-url>
cd lextract
# Install dependencies
npm install
# Build the project
npm run build
# Run locally (after building)
lxt [options] [urls...]
# Or use npm scripts during development
npm start -- [options] [urls...]

# Run in development mode with TypeScript
npm run dev -- [options] [urls...]

# Convert a single URL (global install)
lxt https://example.com
# Convert multiple URLs
lxt https://example.com https://another-site.com
# Or using npm scripts for development
npm start -- https://example.com
npm run dev -- https://example.com

# Use custom CSS selector for content extraction
lxt -s "article.main-content" https://example.com
# Read URLs from a file
lxt -f urls.txt
# Enable recursive processing with depth limit
lxt --recursive --max-depth 3 https://blog.example.com
# Recursive mode with domain restriction
lxt --recursive --same-domain --max-links 5 https://example.com
# Combine multiple options
lxt -f urls.txt --recursive -d 2 --max-links 10
# Or using npm scripts during development:
npm start -- -s "article.main-content" https://example.com
npm start -- -f urls.txt --recursive -d 2 --max-links 10

# Basic recursive mode (depth 2, max 10 links per page)
lxt --recursive https://blog.example.com
# Deep recursive crawling with custom limits
lxt --recursive --max-depth 4 --max-links 15 https://news.site.com
# Stay within the same domain
lxt --recursive --same-domain https://documentation.site.com
# Combine with file input
lxt -f seed-urls.txt --recursive -d 3
# Or using npm scripts:
npm start -- --recursive https://blog.example.com
npm start -- -f seed-urls.txt --recursive -d 3

Create a text file with one URL per line:
https://example.com/article1
https://example.com/article2
# Comments starting with # are ignored
https://example.com/article3
| Option | Alias | Description | Default |
|---|---|---|---|
| `--selector` | `-s` | CSS selector for main content element | Auto-detect with Readability |
| `--file` | `-f` | File containing URLs (one per line) | - |
| `--recursive` | `-r` | Recursively extract and process links from article content | `false` |
| `--max-depth` | `-d` | Maximum depth for recursive processing | `2` |
| `--same-domain` | - | Only follow links within the same domain when recursive | `false` |
| `--max-links` | - | Maximum number of links to follow per page | `10` |
| `--ingest` | `-i` | Export scraped content to JSONL format for vector database | `false` |
| `--jsonl-file` | - | Custom path for JSONL output file | `./data/vector-db/documents.jsonl` |
| `--help` | `-h` | Display help information | - |
| `--version` | `-V` | Display version number | - |
Note: The output directory is automatically managed. Each run creates a unique folder in ./output/ with a timestamp and memorable name (e.g., 2025-08-25_14-30-15_purple-turtle).
Lextract supports exporting scraped content to JSONL (JSON Lines) format for vector database ingestion and local LLM search capabilities. This feature enables you to build searchable knowledge bases from web content.
- 🗄️ JSONL Export: Convert scraped content to structured JSONL format
- 🔍 Vector Database Ready: Optimized for embedding and search workflows
- 🏷️ Automatic Tagging: AI-powered content categorization and tagging
- 📄 Metadata Rich: Includes checksums, timestamps, and source information
- 📁 Batch Processing: Process existing markdown files into JSONL format
- 🔄 Incremental Updates: Support for updating existing documents
- 🧹 Content Normalization: Whitespace normalization and content cleaning
Each JSONL record contains:
{
"docId": "a1b2c3d4e5f6",
"sourceUrl": "https://example.com/article",
"path": "/path/to/article.md",
"title": "How to Build APIs: A Complete Guide",
"createdAt": "2024-08-26T10:30:00.000Z",
"updatedAt": "2024-08-26T10:30:00.000Z",
"checksum": "sha256hash...",
"tags": ["how-to", "tutorial", "api", "javascript"],
"fullMarkdown": "# How to Build APIs...\n\nContent here..."
}

Field Descriptions:
- docId: Unique 16-character identifier generated from the source URL
- sourceUrl: Original URL where the content was scraped
- path: File path to the saved markdown file
- title: Extracted article title
- createdAt: ISO timestamp when the record was created
- updatedAt: ISO timestamp when the record was last modified
- checksum: SHA-256 hash of the full markdown content, used for deduplication
- tags: Auto-extracted tags describing content type and technologies
- fullMarkdown: Complete markdown content with normalized whitespace
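For programmatic consumers, the record shape can be expressed as a TypeScript interface. The sketch below also shows one plausible way the docId and checksum could be derived with node:crypto; the truncation length and normalization rules are assumptions based on the field descriptions above, not the tool's actual implementation.

```typescript
import { createHash } from "node:crypto";

// Shape of one JSONL record, inferred from the example above.
interface JsonlDocument {
  docId: string;        // short identifier derived from the source URL
  sourceUrl: string;
  path: string;
  title: string;
  createdAt: string;    // ISO timestamp
  updatedAt: string;    // ISO timestamp
  checksum: string;     // SHA-256 of fullMarkdown, used for deduplication
  tags: string[];
  fullMarkdown: string;
}

// Hypothetical docId derivation: hash the source URL and keep a 16-character prefix.
function makeDocId(sourceUrl: string): string {
  return createHash("sha256").update(sourceUrl).digest("hex").slice(0, 16);
}

// Checksum over the markdown content. The normalization shown (trim trailing
// whitespace, collapse runs of blank lines) is an assumption for illustration.
function makeChecksum(fullMarkdown: string): string {
  const normalized = fullMarkdown.replace(/[ \t]+$/gm, "").replace(/\n{3,}/g, "\n\n");
  return createHash("sha256").update(normalized).digest("hex");
}
```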
# Basic scraping with JSONL export
lxt --ingest https://blog.example.com/article
# Multiple URLs with custom JSONL file
lxt --ingest --jsonl-file ./my-knowledge-base.jsonl https://site1.com https://site2.com
# Recursive crawling with JSONL export
lxt --recursive --ingest --max-depth 3 https://docs.example.com
# Batch processing from file
lxt -f urls.txt --ingest

The ingest command allows you to process existing markdown files into JSONL format:
# Ingest single markdown file
lxt ingest article.md
# Ingest multiple files
lxt ingest docs/*.md
# Ingest from file list
lxt ingest --file markdown-files.txt
# Recursive directory processing
lxt ingest --recursive ./documentation/
# Custom JSONL output file
lxt ingest --jsonl-file ./custom-db.jsonl ./docs/
# Overwrite existing entries
lxt ingest --overwrite ./updated-article.md

| Option | Description | Default |
|---|---|---|
| `--file` | File containing markdown paths (one per line) | - |
| `--recursive` | Process directories recursively | `false` |
| `--jsonl-file` | Custom JSONL output file path | `./data/vector-db/documents.jsonl` |
| `--overwrite` | Overwrite existing entries with the same URL | `false` |
| `--no-extract-tags` | Disable automatic tag extraction | Tags enabled |
JSONL files are stored in the ./data/vector-db/ directory:
project/
├── data/
│ └── vector-db/
│ ├── documents.jsonl # Default JSONL file
│ ├── web-development.jsonl # Custom project file
│ └── machine-learning.jsonl # Another project file
├── output/ # Markdown files
└── src/ # Source code
Note: The ./data/vector-db/ directory is automatically excluded from version control via .gitignore.
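As a downstream example, a JSONL file can be streamed one record per line when building an embedding or local search index. This minimal Node.js reader is an assumed consumer, not part of lextract itself:

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Stream a JSONL file one record at a time (one JSON object per line).
async function* readJsonl(path: string): AsyncGenerator<Record<string, unknown>> {
  const rl = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  for await (const line of rl) {
    if (line.trim().length === 0) continue; // skip blank lines
    yield JSON.parse(line) as Record<string, unknown>;
  }
}

async function main(): Promise<void> {
  for await (const doc of readJsonl("./data/vector-db/documents.jsonl")) {
    // Titles printed for demonstration; hand doc.fullMarkdown to an embedding step here.
    console.log(doc.title);
  }
}

main().catch(console.error);
```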
The tool automatically organizes files by run:
- Base Directory: ./output/
- Run Folders: Each execution creates a unique subfolder named YYYY-MM-DD_HH-MM-SS_adjective-noun
- Examples: 2025-08-25_14-30-15_purple-turtle, 2025-08-25_15-45-22_swift-eagle
Each run folder contains:
- Individual Markdown files for each processed URL
- All files from a single run are grouped together for easy organization
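A minimal sketch of how a folder name in this format could be generated is shown below; the word lists and helper names are illustrative placeholders, not lextract's actual implementation.

```typescript
// Build a run folder name like 2025-08-25_14-30-15_purple-turtle.
// The adjective/noun lists here are placeholders for illustration.
const ADJECTIVES = ["purple", "swift", "quiet", "brave"];
const NOUNS = ["turtle", "eagle", "otter", "falcon"];

function pick<T>(items: T[]): T {
  return items[Math.floor(Math.random() * items.length)];
}

function runFolderName(now: Date = new Date()): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const date = `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}`;
  const time = `${pad(now.getHours())}-${pad(now.getMinutes())}-${pad(now.getSeconds())}`;
  return `${date}_${time}_${pick(ADJECTIVES)}-${pick(NOUNS)}`;
}

// e.g. "2025-08-25_14-30-15_swift-otter"
console.log(runFolderName());
```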
Each generated markdown file includes:
- Metadata header with source URL, extraction date, and word count
- Author information (if available)
- Article excerpt (if available)
- Clean markdown content with proper formatting
Example output:
# Article Title
**Source:** https://example.com/article
**Extracted:** 8/24/2025, 1:30:00 PM
**Word Count:** 1,250
**Author:** John Doe
**Excerpt:** This is a brief description of the article content...
---
# Main Article Heading
This is the main article content converted to clean markdown...

Lextract includes comprehensive analytics to track your extraction history:
# View summary statistics
lxt stats --summary
# List recent runs
lxt stats --runs
# View failed extractions
lxt stats --failures
# Detailed run information
lxt stats --run-id <run-id> --detailed
# Export run data
lxt stats --run-id <run-id> --export --format json

- @mozilla/readability: Intelligent article content extraction
- jsdom: HTML parsing and DOM manipulation
- turndown: HTML to Markdown conversion
- commander: CLI argument parsing
- TypeScript: Static type checking
- Jest: Unit testing framework
- ts-node: TypeScript execution for development
src/
├── index.ts # Main application entry point
├── cli.ts # Command-line interface parsing
├── fetcher.ts # HTML fetching with native fetch
├── extractor.ts # Content extraction using Readability
├── converter.ts # HTML to Markdown conversion
├── linkExtractor.ts # Link extraction and classification for recursive processing
├── recursiveQueue.ts # Queue management for recursive URL processing
└── fileOperations.ts # File I/O operations
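These modules compose a fetch → extract → convert → save pipeline on top of the dependencies listed above. A simplified sketch of that flow (illustrative only, not the actual source) might look like this:

```typescript
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";
import { writeFile } from "node:fs/promises";

// Simplified end-to-end flow: fetch HTML, isolate the article, convert to Markdown, save.
async function convertUrl(url: string, outPath: string): Promise<void> {
  const html = await (await fetch(url)).text();                      // cf. fetcher.ts
  const dom = new JSDOM(html, { url });                               // cf. extractor.ts
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error(`No extractable content at ${url}`);

  const markdown = new TurndownService().turndown(article.content);  // cf. converter.ts
  const header = `# ${article.title}\n\n**Source:** ${url}\n\n---\n\n`;
  await writeFile(outPath, header + markdown, "utf8");                // cf. fileOperations.ts
}
```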
The tool includes advanced recursive processing capabilities that can automatically discover and convert related articles (a simplified sketch of the processing loop follows the list below):
- Content Analysis: Extracts the main article content using Mozilla Readability
- Link Discovery: Scans the content for outbound links using intelligent classification
- Link Filtering: Uses AI-powered algorithms to identify article-like content:
  - Analyzes URL patterns and structures
  - Evaluates link text characteristics
  - Considers source context and placement
  - Filters out navigation, ads, and non-article links
- Queue Management: Processes links in breadth-first order with configurable depth limits
- Duplicate Prevention: Automatically prevents infinite loops and revisiting URLs
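A simplified sketch of such a breadth-first queue with duplicate prevention and a depth limit is shown below; the names and structure are illustrative assumptions, not the actual recursiveQueue.ts implementation.

```typescript
interface QueueItem {
  url: string;
  depth: number;
  foundVia?: string; // URL of the page that linked to this one
}

// Breadth-first crawl with duplicate prevention and a configurable depth limit.
async function crawl(
  seeds: string[],
  maxDepth: number,
  processPage: (item: QueueItem) => Promise<string[]>, // returns discovered links
): Promise<void> {
  const visited = new Set<string>();
  const queue: QueueItem[] = seeds.map((url) => ({ url, depth: 0 }));

  while (queue.length > 0) {
    const item = queue.shift()!;
    if (visited.has(item.url)) continue; // never revisit a URL
    visited.add(item.url);

    const links = await processPage(item); // fetch, extract, convert, save
    if (item.depth >= maxDepth) continue;  // do not enqueue beyond the depth limit

    for (const url of links) {
      if (!visited.has(url)) {
        queue.push({ url, depth: item.depth + 1, foundVia: item.url });
      }
    }
  }
}
```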
The link extractor uses sophisticated heuristics to identify high-quality article links, as illustrated in the sketch after this list:
- URL Pattern Analysis: Recognizes common article URL structures
- Link Text Evaluation: Prioritizes descriptive, substantive link text
- Context Assessment: Considers the link's position and surrounding content
- Confidence Scoring: Assigns confidence levels (0-1) to each potential link
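For illustration only, a scorer along these lines might look like the following; the specific patterns and weights are assumptions, not the actual linkExtractor.ts heuristics.

```typescript
interface CandidateLink {
  href: string;
  text: string;
}

// Assign a rough 0-1 confidence that a link points to an article.
function scoreLink(link: CandidateLink): number {
  let score = 0.5;
  const url = link.href.toLowerCase();

  // URL pattern analysis: article-like paths raise confidence.
  if (/\/(blog|news|articles?|posts?|docs)\//.test(url)) score += 0.2;
  if (/\d{4}\/\d{2}\//.test(url)) score += 0.1;            // date-based paths
  if (/\.(png|jpe?g|gif|pdf|zip)$/.test(url)) score -= 0.4; // media and downloads

  // Link text evaluation: descriptive text is a good sign, boilerplate is not.
  const words = link.text.trim().split(/\s+/).length;
  if (words >= 4) score += 0.2;
  if (/^(home|login|share|next|previous|read more)$/i.test(link.text.trim())) score -= 0.3;

  return Math.min(1, Math.max(0, score));
}

// Keep only links above a confidence threshold.
const articleLinks = (links: CandidateLink[]) => links.filter((l) => scoreLink(l) >= 0.6);
```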
- Domain Awareness: Optional same-domain restriction for focused crawling
- Depth Limiting: Configurable maximum crawl depth (default: 2 levels)
- Link Count Limits: Maximum links per page (default: 10)
- URL Deduplication: Prevents processing the same URL multiple times
- Total URL Cap: Global safety limit to prevent runaway crawling
- Domain Filtering: Optional restriction to same domain only
- Graceful Error Handling: Continues processing if individual URLs fail
- Documentation Sites: Crawl entire documentation sections
- Blog Series: Automatically discover related articles in a series
- News Archives: Process multiple related news articles
- Research Papers: Follow citation links and references
- Knowledge Bases: Extract interconnected articles and guides
lxt https://blog.example.com/my-article

Output:
🚀 Lextract v1.0.0
📋 Processing 1 URL(s)
📁 Output directory: ./output/2025-08-25_14-30-15_purple-turtle
🗂️ Run folder: 2025-08-25_14-30-15_purple-turtle
[1/1] Processing: https://blog.example.com/my-article
📥 Fetching HTML...
🎯 Extracting content...
📝 Converting to markdown...
💾 Saving file...
✅ Saved: ./output/2025-08-25_14-30-15_purple-turtle/my-article.md
📊 Use `lxt stats` to view detailed run statistics and analytics.
🎉 Conversion complete!
lxt --recursive --max-depth 2 https://blog.example.com/main-article

Output:
🚀 Lextract v1.0.0
📋 Processing 1 URL(s)
📁 Output directory: ./output/2025-08-25_15-45-22_swift-eagle
🗂️ Run folder: 2025-08-25_15-45-22_swift-eagle
🔄 Recursive mode enabled (max depth: 2, max links per page: 10)
🔄 Starting recursive processing...
[#1] [Depth 0] Processing: https://blog.example.com/main-article
📥 Fetching HTML...
🎯 Extracting content...
🔗 Extracting links...
📝 Converting to markdown...
💾 Saving file...
✅ Saved: ./output/2025-08-25_15-45-22_swift-eagle/main-article.md
🔗 Found 8 new links for processing
📈 Queue: 8 remaining, 1 processed
[#2] [Depth 1] Processing: https://blog.example.com/related-article-1
🔗 Found via: https://blog.example.com/main-article
📥 Fetching HTML...
🎯 Extracting content...
🔗 Extracting links...
📝 Converting to markdown...
💾 Saving file...
✅ Saved: ./output/2025-08-25_15-45-22_swift-eagle/related-article-1.md
🔗 Found 5 new links for processing
📈 Queue: 12 remaining, 2 processed
🏁 Recursive processing complete!
Total processed: 15
Queue ended with: 0 remaining
==================================================
📈 CONVERSION SUMMARY
==================================================
✅ Successful: 15
❌ Failed: 0
📊 Total words: 45,230
📁 Output folder: ./output/2025-08-25_15-45-22_swift-eagle
🔗 Total links extracted: 73
🌳 Depth distribution:
Depth 0: 1 pages
Depth 1: 8 pages
Depth 2: 6 pages
🎉 Conversion complete!
### Batch Processing
```bash
lxt -f articles.txt
lxt -s ".post-body" https://news.example.com/article# Deep crawling with domain restriction
lxt --recursive --same-domain --max-depth 4 --max-links 20 https://docs.example.com
# Crawl multiple seed URLs recursively
lxt --recursive -f seed-urls.txtThe tool provides comprehensive error handling for:
- Network issues: Timeout, connection errors, HTTP errors
- Invalid URLs: Malformed or inaccessible URLs
- Content extraction: Pages with no extractable content
- File system: Permission errors, disk space issues
- Parsing errors: Invalid HTML or CSS selectors
- Recursive limits: Maximum depth and URL count protection
- Domain restrictions: Same-domain filtering validation
- Link extraction: Graceful handling of pages with no extractable links
Failed conversions are reported in the summary with detailed error messages, including depth information for recursive processing.
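A minimal sketch of this per-URL isolation pattern is shown below; the names and structure are assumptions for illustration, not the actual code.

```typescript
interface FailureRecord {
  url: string;
  depth: number;
  error: string;
}

// Process each URL independently so one failure never aborts the whole run.
async function processAll(
  items: { url: string; depth: number }[],
  convert: (url: string) => Promise<void>,
): Promise<FailureRecord[]> {
  const failures: FailureRecord[] = [];
  for (const { url, depth } of items) {
    try {
      await convert(url);
    } catch (err) {
      // Record the failure (with depth for recursive runs) for the summary report.
      failures.push({ url, depth, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return failures;
}
```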
# Run unit tests
npm test
# Run tests with coverage
npm test -- --coverage

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
ISC License - see LICENSE file for details.
Network Timeouts: Some sites may take longer to respond. The tool uses a 30-second timeout by default.
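For context, a timeout like this is typically implemented by aborting the request after a deadline, roughly as in this assumed sketch (not the actual fetcher.ts code):

```typescript
// Assumed shape of the fetch step: abort requests that exceed 30 seconds.
async function fetchHtml(url: string, timeoutMs = 30_000): Promise<string> {
  const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  return response.text();
}
```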
Content Not Extracted: Try using a custom CSS selector with the -s option to target specific content areas.
Permission Errors: Ensure you have write permissions to the output directory.
Empty Output: Some pages may not have extractable content due to heavy use of JavaScript or unusual HTML structure.
Recursive Processing Stops Early: Check if --same-domain restriction is too limiting, or increase --max-links and --max-depth values.
Too Many Links: Use --max-links to limit the number of links processed per page, or --same-domain to reduce scope.
Infinite Loops: The tool automatically prevents revisiting URLs and includes safety limits to prevent infinite crawling.
Use the --help flag for quick reference:
lxt --help

Or during development:
npm start -- --help