Lextract

A powerful CLI tool that extracts and converts HTML content from URLs into clean, well-formatted Markdown files. The tool automatically extracts the main article content while removing ads, sidebars, and other clutter.

Features

  • 🌐 URL Processing: Convert single URLs, multiple URLs, or URLs from a file
  • 🔄 Recursive Processing: Automatically discover and process linked articles with configurable depth limits
  • 🎯 Smart Content Extraction: Uses Mozilla Readability to isolate main article content
  • 🧹 Content Cleaning: Automatically removes CSS, JavaScript, ads, and navigation elements
  • 🎨 Custom Selectors: Optionally specify CSS selectors for targeted content extraction
  • 🌐 Domain Filtering: Option to restrict recursive processing to same domain only
  • 🔗 Intelligent Link Detection: AI-powered link classification to identify article-like content
  • 📁 Flexible Output: Configurable output directory with auto-generated filenames
  • 📊 Progress Tracking: Real-time progress indicators with depth tracking and detailed summary reports
  • TypeScript: Built with TypeScript for reliability and type safety
  • 🧪 Tested: Comprehensive unit tests covering core functionality

Installation

Global Installation (Recommended)

# Install globally to use 'lxt' command anywhere
npm install -g lextract

# Use the CLI directly
lxt https://example.com
lxt stats

From Source

# Clone the repository
git clone <repository-url>
cd lextract

# Install dependencies
npm install

# Build the project
npm run build

# Run locally (after building)
lxt [options] [urls...]

# Or use npm scripts during development
npm start -- [options] [urls...]

Development Mode

# Run in development mode with TypeScript
npm run dev -- [options] [urls...]

Usage

Basic Usage

# Convert a single URL (global install)
lxt https://example.com

# Convert multiple URLs
lxt https://example.com https://another-site.com

# Or using npm scripts for development
npm start -- https://example.com
npm run dev -- https://example.com

Advanced Options

# Use custom CSS selector for content extraction
lxt -s "article.main-content" https://example.com

# Read URLs from a file
lxt -f urls.txt

# Enable recursive processing with depth limit
lxt --recursive --max-depth 3 https://blog.example.com

# Recursive mode with domain restriction
lxt --recursive --same-domain --max-links 5 https://example.com

# Combine multiple options
lxt -f urls.txt --recursive -d 2 --max-links 10

# Or using npm scripts during development:
npm start -- -s "article.main-content" https://example.com
npm start -- -f urls.txt --recursive -d 2 --max-links 10

Recursive Processing

# Basic recursive mode (depth 2, max 10 links per page)
lxt --recursive https://blog.example.com

# Deep recursive crawling with custom limits
lxt --recursive --max-depth 4 --max-links 15 https://news.site.com

# Stay within the same domain
lxt --recursive --same-domain https://documentation.site.com

# Combine with file input
lxt -f seed-urls.txt --recursive -d 3

# Or using npm scripts:
npm start -- --recursive https://blog.example.com
npm start -- -f seed-urls.txt --recursive -d 3

URL File Format

Create a text file with one URL per line:

https://example.com/article1
https://example.com/article2
# Comments starting with # are ignored
https://example.com/article3
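
A minimal sketch, in TypeScript, of how such a file might be parsed; the helper name is illustrative and not part of lextract's API:

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical helper: read a URL list file, dropping blank lines and
// comment lines that start with "#".
async function readUrlFile(path: string): Promise<string[]> {
  const text = await readFile(path, "utf8");
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}
```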

CLI Options

| Option | Alias | Description | Default |
| --- | --- | --- | --- |
| --selector | -s | CSS selector for main content element | Auto-detect with Readability |
| --file | -f | File containing URLs (one per line) | - |
| --recursive | -r | Recursively extract and process links from article content | false |
| --max-depth | -d | Maximum depth for recursive processing | 2 |
| --same-domain | - | Only follow links within the same domain when recursive | false |
| --max-links | - | Maximum number of links to follow per page | 10 |
| --ingest | -i | Export scraped content to JSONL format for vector database | false |
| --jsonl-file | - | Custom path for JSONL output file | ./data/vector-db/documents.jsonl |
| --help | -h | Display help information | - |
| --version | -V | Display version number | - |

Note: The output directory is automatically managed. Each run creates a unique folder in ./output/ with a timestamp and memorable name (e.g., 2025-08-25_14-30-15_purple-turtle).

JSONL Vector Database Integration

Lextract supports exporting scraped content to JSONL (JSON Lines) format for vector database ingestion and local LLM search capabilities. This feature enables you to build searchable knowledge bases from web content.

Features

  • 🗄️ JSONL Export: Convert scraped content to structured JSONL format
  • 🔍 Vector Database Ready: Optimized for embedding and search workflows
  • 🏷️ Automatic Tagging: AI-powered content categorization and tagging
  • 📄 Metadata Rich: Includes checksums, timestamps, and source information
  • 📁 Batch Processing: Process existing markdown files into JSONL format
  • 🔄 Incremental Updates: Support for updating existing documents
  • 🧹 Content Normalization: Whitespace normalization and content cleaning

JSONL Document Format

Each JSONL record contains:

{
  "docId": "a1b2c3d4e5f6",
  "sourceUrl": "https://example.com/article",
  "path": "/path/to/article.md",
  "title": "How to Build APIs: A Complete Guide",
  "createdAt": "2024-08-26T10:30:00.000Z",
  "updatedAt": "2024-08-26T10:30:00.000Z",
  "checksum": "sha256hash...",
  "tags": ["how-to", "tutorial", "api", "javascript"],
  "fullMarkdown": "# How to Build APIs...\n\nContent here..."
}

Field Descriptions:

  • docId: Unique 16-character identifier generated from source URL
  • sourceUrl: Original URL where content was scraped
  • path: File path to the saved markdown file
  • title: Extracted article title
  • createdAt: ISO timestamp when record was created
  • updatedAt: ISO timestamp when record was last modified
  • checksum: SHA-256 hash of full markdown content for deduplication
  • tags: Auto-extracted tags describing content type and technologies
  • fullMarkdown: Complete markdown content with normalized whitespace
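
In TypeScript terms, a record with these fields might be modeled as below. The docId and checksum derivations are assumptions consistent with the field descriptions (a URL-derived 16-character hash prefix and a SHA-256 content hash); lextract's exact scheme may differ.

```typescript
import { createHash } from "node:crypto";

interface JsonlDocument {
  docId: string;        // 16-character identifier derived from the source URL
  sourceUrl: string;
  path: string;
  title: string;
  createdAt: string;    // ISO timestamp
  updatedAt: string;    // ISO timestamp
  checksum: string;     // SHA-256 of fullMarkdown, used for deduplication
  tags: string[];
  fullMarkdown: string;
}

// Assumed derivations, not the tool's documented implementation:
const docIdFromUrl = (url: string): string =>
  createHash("sha256").update(url).digest("hex").slice(0, 16);

const checksumOf = (markdown: string): string =>
  createHash("sha256").update(markdown).digest("hex");
```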

Usage Examples

Scrape URLs with JSONL Export

# Basic scraping with JSONL export
lxt --ingest https://blog.example.com/article

# Multiple URLs with custom JSONL file
lxt --ingest --jsonl-file ./my-knowledge-base.jsonl https://site1.com https://site2.com

# Recursive crawling with JSONL export
lxt --recursive --ingest --max-depth 3 https://docs.example.com

# Batch processing from file
lxt -f urls.txt --ingest

Ingest Existing Markdown Files

The ingest command allows you to process existing markdown files into JSONL format:

# Ingest single markdown file
lxt ingest article.md

# Ingest multiple files
lxt ingest docs/*.md

# Ingest from file list
lxt ingest --file markdown-files.txt

# Recursive directory processing
lxt ingest --recursive ./documentation/

# Custom JSONL output file
lxt ingest --jsonl-file ./custom-db.jsonl ./docs/

# Overwrite existing entries
lxt ingest --overwrite ./updated-article.md

Ingest Command Options

| Option | Description | Default |
| --- | --- | --- |
| --file | File containing markdown paths (one per line) | - |
| --recursive | Process directories recursively | false |
| --jsonl-file | Custom JSONL output file path | ./data/vector-db/documents.jsonl |
| --overwrite | Overwrite existing entries with same URL | false |
| --no-extract-tags | Disable automatic tag extraction | Tags enabled |
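
A rough sketch of the --overwrite behavior, under the assumption that entries are matched by sourceUrl as the option description suggests; the real implementation may store and match records differently.

```typescript
import { readFile, writeFile } from "node:fs/promises";

// Hypothetical upsert: replace an existing record for the same source URL
// when overwrite is true, otherwise keep the original entry.
async function upsertRecord(
  jsonlPath: string,
  doc: { sourceUrl: string } & Record<string, unknown>,
  overwrite: boolean
): Promise<void> {
  let records: Record<string, unknown>[] = [];
  try {
    const text = await readFile(jsonlPath, "utf8");
    records = text.split("\n").filter(Boolean).map((line) => JSON.parse(line));
  } catch {
    // file does not exist yet; start with an empty set
  }

  const index = records.findIndex((r) => r.sourceUrl === doc.sourceUrl);
  if (index >= 0 && !overwrite) return;     // keep the existing entry
  if (index >= 0) records[index] = doc;     // --overwrite: replace in place
  else records.push(doc);

  await writeFile(jsonlPath, records.map((r) => JSON.stringify(r)).join("\n") + "\n", "utf8");
}
```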

File Organization

JSONL files are stored in the ./data/vector-db/ directory:

project/
├── data/
│   └── vector-db/
│       ├── documents.jsonl          # Default JSONL file
│       ├── web-development.jsonl    # Custom project file
│       └── machine-learning.jsonl   # Another project file
├── output/                          # Markdown files
└── src/                            # Source code

Note: The ./data/vector-db/ directory is automatically excluded from version control via .gitignore.
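
As a downstream example, a consumer might stream one of these files line by line and hand each record to its own embedding step. The sketch below assumes nothing about your vector database; embed() is a placeholder.

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Placeholder for your own embedding / vector-store client.
declare function embed(id: string, text: string, tags: string[]): Promise<void>;

async function loadDocuments(jsonlPath: string): Promise<void> {
  const lines = createInterface({ input: createReadStream(jsonlPath) });
  for await (const line of lines) {
    if (!line.trim()) continue;       // skip blank lines
    const doc = JSON.parse(line);     // one JSON object per line
    await embed(doc.docId, doc.fullMarkdown, doc.tags);
  }
}
```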

Output

File Organization

The tool automatically organizes files by run:

  • Base Directory: ./output/
  • Run Folders: Each execution creates a unique subfolder with format: YYYY-MM-DD_HH-MM-SS_adjective-noun
  • Examples: 2025-08-25_14-30-15_purple-turtle, 2025-08-25_15-45-22_swift-eagle
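
A sketch of how such a folder name could be produced; the adjective and noun lists here are placeholders, not lextract's actual vocabulary.

```typescript
// Illustrative generator for the documented YYYY-MM-DD_HH-MM-SS_adjective-noun format.
const ADJECTIVES = ["purple", "swift", "quiet", "bright"];
const NOUNS = ["turtle", "eagle", "fox", "otter"];

function runFolderName(now: Date = new Date()): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const date = `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}`;
  const time = `${pad(now.getHours())}-${pad(now.getMinutes())}-${pad(now.getSeconds())}`;
  const pick = (xs: string[]) => xs[Math.floor(Math.random() * xs.length)];
  return `${date}_${time}_${pick(ADJECTIVES)}-${pick(NOUNS)}`;
}
// e.g. "2025-08-25_14-30-15_purple-turtle"
```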

Generated Files

Each run folder contains an individual Markdown file for each processed URL; all files from a single run are grouped together for easy organization.

Markdown Format

Each generated markdown file includes:

  • Metadata header with source URL, extraction date, and word count
  • Author information (if available)
  • Article excerpt (if available)
  • Clean markdown content with proper formatting

Example output:

# Article Title

**Source:** https://example.com/article
**Extracted:** 8/24/2025, 1:30:00 PM
**Word Count:** 1,250
**Author:** John Doe

**Excerpt:** This is a brief description of the article content...

---

# Main Article Heading

This is the main article content converted to clean markdown...

Run Analytics

Lextract includes comprehensive analytics to track your extraction history:

# View summary statistics
lxt stats --summary

# List recent runs
lxt stats --runs

# View failed extractions
lxt stats --failures

# Detailed run information
lxt stats --run-id <run-id> --detailed

# Export run data
lxt stats --run-id <run-id> --export --format json

Dependencies

Core Dependencies

Development Dependencies

  • TypeScript: Static type checking
  • Jest: Unit testing framework
  • ts-node: TypeScript execution for development

Architecture

src/
├── index.ts          # Main application entry point
├── cli.ts           # Command-line interface parsing
├── fetcher.ts       # HTML fetching with native fetch
├── extractor.ts     # Content extraction using Readability
├── converter.ts     # HTML to Markdown conversion
├── linkExtractor.ts # Link extraction and classification for recursive processing
├── recursiveQueue.ts # Queue management for recursive URL processing
└── fileOperations.ts # File I/O operations
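
The modules suggest a straightforward per-URL pipeline. The sketch below assumes @mozilla/readability with jsdom for the extraction step (Readability is named in this README; jsdom is an assumption) and leaves the Markdown conversion abstract; the function names are illustrative, not the project's exported API.

```typescript
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";
import { writeFile } from "node:fs/promises";

// Stand-in for converter.ts (the actual HTML -> Markdown library is not shown here).
declare function convertToMarkdown(html: string): string;

async function processUrl(url: string, outPath: string): Promise<void> {
  const response = await fetch(url);                            // fetcher.ts: native fetch
  const html = await response.text();
  const dom = new JSDOM(html, { url });                         // give Readability a base URL
  const article = new Readability(dom.window.document).parse(); // extractor.ts
  if (!article?.content) throw new Error(`No extractable content at ${url}`);
  const markdown = convertToMarkdown(article.content);          // converter.ts
  await writeFile(outPath, `# ${article.title ?? "Untitled"}\n\n${markdown}`, "utf8"); // fileOperations.ts
}
```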

Recursive Processing

The tool includes advanced recursive processing capabilities that can automatically discover and convert related articles:

How It Works

  1. Content Analysis: Extracts the main article content using Mozilla Readability
  2. Link Discovery: Scans the content for outbound links using intelligent classification
  3. Link Filtering: Uses AI-powered algorithms to identify article-like content:
    • Analyzes URL patterns and structures
    • Evaluates link text characteristics
    • Considers source context and placement
    • Filters out navigation, ads, and non-article links
  4. Queue Management: Processes links in breadth-first order with configurable depth limits (see the sketch after this list)
  5. Duplicate Prevention: Automatically prevents infinite loops and revisiting URLs
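
A minimal sketch of steps 4 and 5, assuming a FIFO queue with a visited set; processPage() stands in for the fetch/extract/convert work and returns the candidate links found on a page.

```typescript
interface QueueItem { url: string; depth: number; foundVia?: string; }

// Placeholder for the per-page pipeline; returns candidate article links.
declare function processPage(url: string): Promise<string[]>;

async function crawl(seedUrl: string, maxDepth = 2, maxLinksPerPage = 10): Promise<void> {
  const queue: QueueItem[] = [{ url: seedUrl, depth: 0 }];
  const visited = new Set<string>([seedUrl]);   // prevents revisiting URLs

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;      // FIFO => breadth-first order
    const links = await processPage(url);

    if (depth >= maxDepth) continue;            // never enqueue past the depth limit
    for (const link of links.slice(0, maxLinksPerPage)) {
      if (visited.has(link)) continue;
      visited.add(link);
      queue.push({ url: link, depth: depth + 1, foundVia: url });
    }
  }
}
```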

Link Classification

The link extractor uses sophisticated heuristics to identify high-quality article links (a simplified scoring sketch follows this list):

  • URL Pattern Analysis: Recognizes common article URL structures
  • Link Text Evaluation: Prioritizes descriptive, substantive link text
  • Context Assessment: Considers the link's position and surrounding content
  • Confidence Scoring: Assigns confidence levels (0-1) to each potential link
  • Domain Awareness: Optional same-domain restriction for focused crawling
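
A simplified illustration of how these signals could combine into a 0-1 confidence score; the patterns and weights are placeholders, not lextract's actual rules.

```typescript
// Hypothetical scoring heuristic for a candidate link.
function linkConfidence(href: string, text: string, sameDomainAsSource: boolean): number {
  let score = 0;
  if (/\/(blog|articles?|posts?|docs?|news)\//i.test(href)) score += 0.35;      // URL pattern
  if (text.trim().split(/\s+/).length >= 4) score += 0.3;                       // descriptive link text
  if (!/^(home|next|previous|login|share)$/i.test(text.trim())) score += 0.2;   // not navigation chrome
  if (sameDomainAsSource) score += 0.15;                                        // domain awareness
  return Math.min(score, 1);
}
```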

Safety Features

  • Depth Limiting: Configurable maximum crawl depth (default: 2 levels)
  • Link Count Limits: Maximum links per page (default: 10)
  • URL Deduplication: Prevents processing the same URL multiple times
  • Total URL Cap: Global safety limit to prevent runaway crawling
  • Domain Filtering: Optional restriction to same domain only
  • Graceful Error Handling: Continues processing if individual URLs fail

Use Cases

  • Documentation Sites: Crawl entire documentation sections
  • Blog Series: Automatically discover related articles in a series
  • News Archives: Process multiple related news articles
  • Research Papers: Follow citation links and references
  • Knowledge Bases: Extract interconnected articles and guides

Examples

Single URL Conversion

lxt https://blog.example.com/my-article

Output:

🚀 Lextract v1.0.0

📋 Processing 1 URL(s)
📁 Output directory: ./output/2025-08-25_14-30-15_purple-turtle
🗂️  Run folder: 2025-08-25_14-30-15_purple-turtle

[1/1] Processing: https://blog.example.com/my-article
  📥 Fetching HTML...
  🎯 Extracting content...
  📝 Converting to markdown...
  💾 Saving file...
✅ Saved: ./output/2025-08-25_14-30-15_purple-turtle/my-article.md

📊 Use `lxt stats` to view detailed run statistics and analytics.
🎉 Conversion complete!

Recursive Processing Example

lxt --recursive --max-depth 2 https://blog.example.com/main-article

Output:

🚀 Lextract v1.0.0

📋 Processing 1 URL(s)
📁 Output directory: ./output/2025-08-25_15-45-22_swift-eagle
🗂️  Run folder: 2025-08-25_15-45-22_swift-eagle
🔄 Recursive mode enabled (max depth: 2, max links per page: 10)

🔄 Starting recursive processing...

[1] [Depth 0] Processing: https://blog.example.com/main-article
  📥 Fetching HTML...
  🎯 Extracting content...
  🔗 Extracting links...
  📝 Converting to markdown...
  💾 Saving file...
✅ Saved: ./output/2025-08-25_15-45-22_swift-eagle/main-article.md
  🔗 Found 8 new links for processing
  📈 Queue: 8 remaining, 1 processed

[2] [Depth 1] Processing: https://blog.example.com/related-article-1
  🔗 Found via: https://blog.example.com/main-article
  📥 Fetching HTML...
  🎯 Extracting content...
  🔗 Extracting links...
  📝 Converting to markdown...
  💾 Saving file...
✅ Saved: ./output/2025-08-25_15-45-22_swift-eagle/related-article-1.md
  🔗 Found 5 new links for processing
  📈 Queue: 12 remaining, 2 processed

🏁 Recursive processing complete!
  Total processed: 15
  Queue ended with: 0 remaining

==================================================
📈 CONVERSION SUMMARY
==================================================
✅ Successful: 15
❌ Failed: 0
📊 Total words: 45,230
📁 Output folder: ./output/2025-08-25_15-45-22_swift-eagle
🔗 Total links extracted: 73
🌳 Depth distribution:
  Depth 0: 1 pages
  Depth 1: 8 pages
  Depth 2: 6 pages

🎉 Conversion complete!

Batch Processing

lxt -f articles.txt

Using Custom CSS Selector

lxt -s ".post-body" https://news.example.com/article

Advanced Recursive Configuration

# Deep crawling with domain restriction
lxt --recursive --same-domain --max-depth 4 --max-links 20 https://docs.example.com

# Crawl multiple seed URLs recursively
lxt --recursive -f seed-urls.txt

Error Handling

The tool provides comprehensive error handling for:

  • Network issues: Timeout, connection errors, HTTP errors
  • Invalid URLs: Malformed or inaccessible URLs
  • Content extraction: Pages with no extractable content
  • File system: Permission errors, disk space issues
  • Parsing errors: Invalid HTML or CSS selectors
  • Recursive limits: Maximum depth and URL count protection
  • Domain restrictions: Same-domain filtering validation
  • Link extraction: Graceful handling of pages with no extractable links

Failed conversions are reported in the summary with detailed error messages, including depth information for recursive processing.
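
A sketch of that continue-on-failure pattern, with processUrl() standing in for the real pipeline:

```typescript
declare function processUrl(url: string): Promise<void>;

interface FailureReport { url: string; error: string; }

async function processAll(urls: string[]): Promise<FailureReport[]> {
  const failures: FailureReport[] = [];
  for (const url of urls) {
    try {
      await processUrl(url);
    } catch (err) {
      // a single bad URL never aborts the whole run
      failures.push({ url, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return failures; // reported in the conversion summary
}
```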

Testing

# Run unit tests
npm test

# Run tests with coverage
npm test -- --coverage

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass
  6. Submit a pull request

License

ISC License - see LICENSE file for details.

Troubleshooting

Common Issues

Network Timeouts: Some sites may take longer to respond. The tool uses a 30-second timeout by default.
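
With native fetch, a 30-second cutoff typically looks like the following sketch (the helper itself is illustrative):

```typescript
// Abort the request if the server has not finished responding within timeoutMs.
async function fetchWithTimeout(url: string, timeoutMs = 30_000): Promise<string> {
  const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}
```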

Content Not Extracted: Try using a custom CSS selector with the -s option to target specific content areas.

Permission Errors: Ensure you have write permissions to the output directory.

Empty Output: Some pages may not have extractable content due to heavy use of JavaScript or unusual HTML structure.

Recursive Processing Stops Early: Check if --same-domain restriction is too limiting, or increase --max-links and --max-depth values.

Too Many Links: Use --max-links to limit the number of links processed per page, or --same-domain to reduce scope.

Infinite Loops: The tool automatically prevents revisiting URLs and includes safety limits to prevent infinite crawling.

Getting Help

Use the --help flag for quick reference:

lxt --help

Or during development:

npm start -- --help
