Lextract

A powerful CLI tool that extracts and converts HTML content from URLs into clean, well-formatted Markdown files. The tool automatically extracts the main article content while removing ads, sidebars, and other clutter.

Features

  • 🌐 URL Processing: Convert single URLs, multiple URLs, or URLs from a file
  • 🔄 Recursive Processing: Automatically discover and process linked articles with configurable depth limits
  • 🎯 Smart Content Extraction: Uses Mozilla Readability to isolate main article content
  • 🧹 Content Cleaning: Automatically removes CSS, JavaScript, ads, and navigation elements
  • 🎨 Custom Selectors: Optionally specify CSS selectors for targeted content extraction
  • 🌐 Domain Filtering: Option to restrict recursive processing to same domain only
  • 🔗 Intelligent Link Detection: AI-powered link classification to identify article-like content
  • 📁 Flexible Output: Configurable output directory with auto-generated filenames
  • 📊 Progress Tracking: Real-time progress indicators with depth tracking and detailed summary reports
  • TypeScript: Built with TypeScript for reliability and type safety
  • 🧪 Tested: Comprehensive unit tests covering core functionality

Installation

Global Installation (Recommended)

# Install globally to use 'lxt' command anywhere
npm install -g lextract

# Use the CLI directly
lxt https://example.com
lxt stats

From Source

# Clone the repository
git clone <repository-url>
cd lextract

# Install dependencies
npm install

# Build the project
npm run build

# Run locally (after building)
lxt [options] [urls...]

# Or use npm scripts during development
npm start -- [options] [urls...]

Development Mode

# Run in development mode with TypeScript
npm run dev -- [options] [urls...]

Usage

Basic Usage

# Convert a single URL (global install)
lxt https://example.com

# Convert multiple URLs
lxt https://example.com https://another-site.com

# Or using npm scripts for development
npm start -- https://example.com
npm run dev -- https://example.com

Advanced Options

# Use custom CSS selector for content extraction
lxt -s "article.main-content" https://example.com

# Read URLs from a file
lxt -f urls.txt

# Enable recursive processing with depth limit
lxt --recursive --max-depth 3 https://blog.example.com

# Recursive mode with domain restriction
lxt --recursive --same-domain --max-links 5 https://example.com

# Combine multiple options
lxt -f urls.txt --recursive -d 2 --max-links 10

# Or using npm scripts during development:
npm start -- -s "article.main-content" https://example.com
npm start -- -f urls.txt --recursive -d 2 --max-links 10

Recursive Processing

# Basic recursive mode (depth 2, max 10 links per page)
lxt --recursive https://blog.example.com

# Deep recursive crawling with custom limits
lxt --recursive --max-depth 4 --max-links 15 https://news.site.com

# Stay within the same domain
lxt --recursive --same-domain https://documentation.site.com

# Combine with file input
lxt -f seed-urls.txt --recursive -d 3

# Or using npm scripts:
npm start -- --recursive https://blog.example.com
npm start -- -f seed-urls.txt --recursive -d 3

URL File Format

Create a text file with one URL per line:

https://example.com/article1
https://example.com/article2
# Comments starting with # are ignored
https://example.com/article3
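
A minimal sketch, in TypeScript, of how such a file might be parsed; the helper name is illustrative and not part of lextract's API:

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical helper: read a URL list file, dropping blank lines and
// comment lines that start with "#".
async function readUrlFile(path: string): Promise<string[]> {
  const text = await readFile(path, "utf8");
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}
```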

CLI Options

| Option | Alias | Description | Default |
| --- | --- | --- | --- |
| --selector | -s | CSS selector for main content element | Auto-detect with Readability |
| --file | -f | File containing URLs (one per line) | - |
| --recursive | -r | Recursively extract and process links from article content | false |
| --max-depth | -d | Maximum depth for recursive processing | 2 |
| --same-domain | - | Only follow links within the same domain when recursive | false |
| --max-links | - | Maximum number of links to follow per page | 10 |
| --ingest | -i | Export scraped content to JSONL format for vector database | false |
| --jsonl-file | - | Custom path for JSONL output file | ./data/vector-db/documents.jsonl |
| --help | -h | Display help information | - |
| --version | -V | Display version number | - |

Note: The output directory is automatically managed. Each run creates a unique folder in ./output/ with a timestamp and memorable name (e.g., 2025-08-25_14-30-15_purple-turtle).

JSONL Vector Database Integration

Lextract supports exporting scraped content to JSONL (JSON Lines) format for vector database ingestion and local LLM search capabilities. This feature enables you to build searchable knowledge bases from web content.

Features

  • 🗄️ JSONL Export: Convert scraped content to structured JSONL format
  • 🔍 Vector Database Ready: Optimized for embedding and search workflows
  • 🏷️ Automatic Tagging: AI-powered content categorization and tagging
  • 📄 Metadata Rich: Includes checksums, timestamps, and source information
  • 📁 Batch Processing: Process existing markdown files into JSONL format
  • 🔄 Incremental Updates: Support for updating existing documents
  • 🧹 Content Normalization: Whitespace normalization and content cleaning

JSONL Document Format

Each JSONL record contains:

{
  "docId": "a1b2c3d4e5f6",
  "sourceUrl": "https://example.com/article",
  "path": "/path/to/article.md",
  "title": "How to Build APIs: A Complete Guide",
  "createdAt": "2024-08-26T10:30:00.000Z",
  "updatedAt": "2024-08-26T10:30:00.000Z",
  "checksum": "sha256hash...",
  "tags": ["how-to", "tutorial", "api", "javascript"],
  "fullMarkdown": "# How to Build APIs...\n\nContent here..."
}

Field Descriptions:

  • docId: Unique 16-character identifier generated from source URL
  • sourceUrl: Original URL where content was scraped
  • path: File path to the saved markdown file
  • title: Extracted article title
  • createdAt: ISO timestamp when record was created
  • updatedAt: ISO timestamp when record was last modified
  • checksum: SHA-256 hash of full markdown content for deduplication
  • tags: Auto-extracted tags describing content type and technologies
  • fullMarkdown: Complete markdown content with normalized whitespace
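
In TypeScript terms, a record with these fields might be modeled as below. The docId and checksum derivations are assumptions consistent with the field descriptions (a URL-derived 16-character hash prefix and a SHA-256 content hash); lextract's exact scheme may differ.

```typescript
import { createHash } from "node:crypto";

interface JsonlDocument {
  docId: string;        // 16-character identifier derived from the source URL
  sourceUrl: string;
  path: string;
  title: string;
  createdAt: string;    // ISO timestamp
  updatedAt: string;    // ISO timestamp
  checksum: string;     // SHA-256 of fullMarkdown, used for deduplication
  tags: string[];
  fullMarkdown: string;
}

// Assumed derivations, not the tool's documented implementation:
const docIdFromUrl = (url: string): string =>
  createHash("sha256").update(url).digest("hex").slice(0, 16);

const checksumOf = (markdown: string): string =>
  createHash("sha256").update(markdown).digest("hex");
```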

Usage Examples

Scrape URLs with JSONL Export

# Basic scraping with JSONL export
lxt --ingest https://blog.example.com/article

# Multiple URLs with custom JSONL file
lxt --ingest --jsonl-file ./my-knowledge-base.jsonl https://site1.com https://site2.com

# Recursive crawling with JSONL export
lxt --recursive --ingest --max-depth 3 https://docs.example.com

# Batch processing from file
lxt -f urls.txt --ingest

Ingest Existing Markdown Files

The ingest command allows you to process existing markdown files into JSONL format:

# Ingest single markdown file
lxt ingest article.md

# Ingest multiple files
lxt ingest docs/*.md

# Ingest from file list
lxt ingest --file markdown-files.txt

# Recursive directory processing
lxt ingest --recursive ./documentation/

# Custom JSONL output file
lxt ingest --jsonl-file ./custom-db.jsonl ./docs/

# Overwrite existing entries
lxt ingest --overwrite ./updated-article.md

Ingest Command Options

| Option | Description | Default |
| --- | --- | --- |
| --file | File containing markdown paths (one per line) | - |
| --recursive | Process directories recursively | false |
| --jsonl-file | Custom JSONL output file path | ./data/vector-db/documents.jsonl |
| --overwrite | Overwrite existing entries with same URL | false |
| --no-extract-tags | Disable automatic tag extraction | Tags enabled |
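
A rough sketch of the --overwrite behavior, under the assumption that entries are matched by sourceUrl as the option description suggests; the real implementation may store and match records differently.

```typescript
import { readFile, writeFile } from "node:fs/promises";

// Hypothetical upsert: replace an existing record for the same source URL
// when overwrite is true, otherwise keep the original entry.
async function upsertRecord(
  jsonlPath: string,
  doc: { sourceUrl: string } & Record<string, unknown>,
  overwrite: boolean
): Promise<void> {
  let records: Record<string, unknown>[] = [];
  try {
    const text = await readFile(jsonlPath, "utf8");
    records = text.split("\n").filter(Boolean).map((line) => JSON.parse(line));
  } catch {
    // file does not exist yet; start with an empty set
  }

  const index = records.findIndex((r) => r.sourceUrl === doc.sourceUrl);
  if (index >= 0 && !overwrite) return;     // keep the existing entry
  if (index >= 0) records[index] = doc;     // --overwrite: replace in place
  else records.push(doc);

  await writeFile(jsonlPath, records.map((r) => JSON.stringify(r)).join("\n") + "\n", "utf8");
}
```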

File Organization

JSONL files are stored in the ./data/vector-db/ directory:

project/
├── data/
│   └── vector-db/
│       ├── documents.jsonl          # Default JSONL file
│       ├── web-development.jsonl    # Custom project file
│       └── machine-learning.jsonl   # Another project file
├── output/                          # Markdown files
└── src/                            # Source code

Note: The ./data/vector-db/ directory is automatically excluded from version control via .gitignore.
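
As a downstream example, a consumer might stream one of these files line by line and hand each record to its own embedding step. The sketch below assumes nothing about your vector database; embed() is a placeholder.

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Placeholder for your own embedding / vector-store client.
declare function embed(id: string, text: string, tags: string[]): Promise<void>;

async function loadDocuments(jsonlPath: string): Promise<void> {
  const lines = createInterface({ input: createReadStream(jsonlPath) });
  for await (const line of lines) {
    if (!line.trim()) continue;       // skip blank lines
    const doc = JSON.parse(line);     // one JSON object per line
    await embed(doc.docId, doc.fullMarkdown, doc.tags);
  }
}
```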

Output

File Organization

The tool automatically organizes files by run:

  • Base Directory: ./output/
  • Run Folders: Each execution creates a unique subfolder with format: YYYY-MM-DD_HH-MM-SS_adjective-noun
  • Examples: 2025-08-25_14-30-15_purple-turtle, 2025-08-25_15-45-22_swift-eagle
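
A sketch of how such a folder name could be produced; the adjective and noun lists here are placeholders, not lextract's actual vocabulary.

```typescript
// Illustrative generator for the documented YYYY-MM-DD_HH-MM-SS_adjective-noun format.
const ADJECTIVES = ["purple", "swift", "quiet", "bright"];
const NOUNS = ["turtle", "eagle", "fox", "otter"];

function runFolderName(now: Date = new Date()): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const date = `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}`;
  const time = `${pad(now.getHours())}-${pad(now.getMinutes())}-${pad(now.getSeconds())}`;
  const pick = (xs: string[]) => xs[Math.floor(Math.random() * xs.length)];
  return `${date}_${time}_${pick(ADJECTIVES)}-${pick(NOUNS)}`;
}
// e.g. "2025-08-25_14-30-15_purple-turtle"
```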

Generated Files

Each run folder contains an individual Markdown file for each processed URL; all files from a single run are grouped together for easy organization.

Markdown Format

Each generated markdown file includes:

  • Metadata header with source URL, extraction date, and word count
  • Author information (if available)
  • Article excerpt (if available)
  • Clean markdown content with proper formatting

Example output:

# Article Title

**Source:** https://example.com/article
**Extracted:** 8/24/2025, 1:30:00 PM
**Word Count:** 1,250
**Author:** John Doe

**Excerpt:** This is a brief description of the article content...

---

# Main Article Heading

This is the main article content converted to clean markdown...

Run Analytics

Lextract includes comprehensive analytics to track your extraction history:

# View summary statistics
lxt stats --summary

# List recent runs
lxt stats --runs

# View failed extractions
lxt stats --failures

# Detailed run information
lxt stats --run-id <run-id> --detailed

# Export run data
lxt stats --run-id <run-id> --export --format json

Dependencies

Core Dependencies

Development Dependencies

  • TypeScript: Static type checking
  • Jest: Unit testing framework
  • ts-node: TypeScript execution for development

Architecture

src/
├── index.ts          # Main application entry point
├── cli.ts           # Command-line interface parsing
├── fetcher.ts       # HTML fetching with native fetch
├── extractor.ts     # Content extraction using Readability
├── converter.ts     # HTML to Markdown conversion
├── linkExtractor.ts # Link extraction and classification for recursive processing
├── recursiveQueue.ts # Queue management for recursive URL processing
└── fileOperations.ts # File I/O operations
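
The modules suggest a straightforward per-URL pipeline. The sketch below assumes @mozilla/readability with jsdom for the extraction step (Readability is named in this README; jsdom is an assumption) and leaves the Markdown conversion abstract; the function names are illustrative, not the project's exported API.

```typescript
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";
import { writeFile } from "node:fs/promises";

// Stand-in for converter.ts (the actual HTML -> Markdown library is not shown here).
declare function convertToMarkdown(html: string): string;

async function processUrl(url: string, outPath: string): Promise<void> {
  const response = await fetch(url);                            // fetcher.ts: native fetch
  const html = await response.text();
  const dom = new JSDOM(html, { url });                         // give Readability a base URL
  const article = new Readability(dom.window.document).parse(); // extractor.ts
  if (!article?.content) throw new Error(`No extractable content at ${url}`);
  const markdown = convertToMarkdown(article.content);          // converter.ts
  await writeFile(outPath, `# ${article.title ?? "Untitled"}\n\n${markdown}`, "utf8"); // fileOperations.ts
}
```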

Recursive Processing

The tool includes advanced recursive processing capabilities that can automatically discover and convert related articles:

How It Works

  1. Content Analysis: Extracts the main article content using Mozilla Readability
  2. Link Discovery: Scans the content for outbound links using intelligent classification
  3. Link Filtering: Uses AI-powered algorithms to identify article-like content:
    • Analyzes URL patterns and structures
    • Evaluates link text characteristics
    • Considers source context and placement
    • Filters out navigation, ads, and non-article links
  4. Queue Management: Processes links in breadth-first order with configurable depth limits (see the sketch after this list)
  5. Duplicate Prevention: Automatically prevents infinite loops and revisiting URLs
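
A minimal sketch of steps 4 and 5, assuming a FIFO queue with a visited set; processPage() stands in for the fetch/extract/convert work and returns the candidate links found on a page.

```typescript
interface QueueItem { url: string; depth: number; foundVia?: string; }

// Placeholder for the per-page pipeline; returns candidate article links.
declare function processPage(url: string): Promise<string[]>;

async function crawl(seedUrl: string, maxDepth = 2, maxLinksPerPage = 10): Promise<void> {
  const queue: QueueItem[] = [{ url: seedUrl, depth: 0 }];
  const visited = new Set<string>([seedUrl]);   // prevents revisiting URLs

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;      // FIFO => breadth-first order
    const links = await processPage(url);

    if (depth >= maxDepth) continue;            // never enqueue past the depth limit
    for (const link of links.slice(0, maxLinksPerPage)) {
      if (visited.has(link)) continue;
      visited.add(link);
      queue.push({ url: link, depth: depth + 1, foundVia: url });
    }
  }
}
```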

Link Classification

The link extractor uses sophisticated heuristics to identify high-quality article links (a simplified scoring sketch follows this list):

  • URL Pattern Analysis: Recognizes common article URL structures
  • Link Text Evaluation: Prioritizes descriptive, substantive link text
  • Context Assessment: Considers the link's position and surrounding content
  • Confidence Scoring: Assigns confidence levels (0-1) to each potential link
  • Domain Awareness: Optional same-domain restriction for focused crawling
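
A simplified illustration of how these signals could combine into a 0-1 confidence score; the patterns and weights are placeholders, not lextract's actual rules.

```typescript
// Hypothetical scoring heuristic for a candidate link.
function linkConfidence(href: string, text: string, sameDomainAsSource: boolean): number {
  let score = 0;
  if (/\/(blog|articles?|posts?|docs?|news)\//i.test(href)) score += 0.35;      // URL pattern
  if (text.trim().split(/\s+/).length >= 4) score += 0.3;                       // descriptive link text
  if (!/^(home|next|previous|login|share)$/i.test(text.trim())) score += 0.2;   // not navigation chrome
  if (sameDomainAsSource) score += 0.15;                                        // domain awareness
  return Math.min(score, 1);
}
```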

Safety Features

  • Depth Limiting: Configurable maximum crawl depth (default: 2 levels)
  • Link Count Limits: Maximum links per page (default: 10)
  • URL Deduplication: Prevents processing the same URL multiple times
  • Total URL Cap: Global safety limit to prevent runaway crawling
  • Domain Filtering: Optional restriction to same domain only
  • Graceful Error Handling: Continues processing if individual URLs fail

Use Cases

  • Documentation Sites: Crawl entire documentation sections
  • Blog Series: Automatically discover related articles in a series
  • News Archives: Process multiple related news articles
  • Research Papers: Follow citation links and references
  • Knowledge Bases: Extract interconnected articles and guides

Examples

Single URL Conversion

lxt https://blog.example.com/my-article

Output:

🚀 Lextract v1.0.0

📋 Processing 1 URL(s)
📁 Output directory: ./output/2025-08-25_14-30-15_purple-turtle
🗂️  Run folder: 2025-08-25_14-30-15_purple-turtle

[1/1] Processing: https://blog.example.com/my-article
  📥 Fetching HTML...
  🎯 Extracting content...
  📝 Converting to markdown...
  💾 Saving file...
✅ Saved: ./output/2025-08-25_14-30-15_purple-turtle/my-article.md

📊 Use `lxt stats` to view detailed run statistics and analytics.
🎉 Conversion complete!

Recursive Processing Example

lxt --recursive --max-depth 2 https://blog.example.com/main-article

Output:

🚀 Lextract v1.0.0

📋 Processing 1 URL(s)
📁 Output directory: ./output/2025-08-25_15-45-22_swift-eagle
🗂️  Run folder: 2025-08-25_15-45-22_swift-eagle
🔄 Recursive mode enabled (max depth: 2, max links per page: 10)

🔄 Starting recursive processing...

[1] [Depth 0] Processing: https://blog.example.com/main-article
  📥 Fetching HTML...
  🎯 Extracting content...
  🔗 Extracting links...
  📝 Converting to markdown...
  💾 Saving file...
✅ Saved: ./output/2025-08-25_15-45-22_swift-eagle/main-article.md
  🔗 Found 8 new links for processing
  📈 Queue: 8 remaining, 1 processed

[2] [Depth 1] Processing: https://blog.example.com/related-article-1
  🔗 Found via: https://blog.example.com/main-article
  📥 Fetching HTML...
  🎯 Extracting content...
  🔗 Extracting links...
  📝 Converting to markdown...
  💾 Saving file...
✅ Saved: ./output/2025-08-25_15-45-22_swift-eagle/related-article-1.md
  🔗 Found 5 new links for processing
  📈 Queue: 12 remaining, 2 processed

🏁 Recursive processing complete!
  Total processed: 15
  Queue ended with: 0 remaining

==================================================
📈 CONVERSION SUMMARY
==================================================
✅ Successful: 15
❌ Failed: 0
📊 Total words: 45,230
📁 Output folder: ./output/2025-08-25_15-45-22_swift-eagle
🔗 Total links extracted: 73
🌳 Depth distribution:
  Depth 0: 1 pages
  Depth 1: 8 pages
  Depth 2: 6 pages

🎉 Conversion complete!

Batch Processing

lxt -f articles.txt

Using Custom CSS Selector

lxt -s ".post-body" https://news.example.com/article

Advanced Recursive Configuration

# Deep crawling with domain restriction
lxt --recursive --same-domain --max-depth 4 --max-links 20 https://docs.example.com

# Crawl multiple seed URLs recursively
lxt --recursive -f seed-urls.txt

Error Handling

The tool provides comprehensive error handling for:

  • Network issues: Timeout, connection errors, HTTP errors
  • Invalid URLs: Malformed or inaccessible URLs
  • Content extraction: Pages with no extractable content
  • File system: Permission errors, disk space issues
  • Parsing errors: Invalid HTML or CSS selectors
  • Recursive limits: Maximum depth and URL count protection
  • Domain restrictions: Same-domain filtering validation
  • Link extraction: Graceful handling of pages with no extractable links

Failed conversions are reported in the summary with detailed error messages, including depth information for recursive processing.
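
A sketch of that continue-on-failure pattern, with processUrl() standing in for the real pipeline:

```typescript
declare function processUrl(url: string): Promise<void>;

interface FailureReport { url: string; error: string; }

async function processAll(urls: string[]): Promise<FailureReport[]> {
  const failures: FailureReport[] = [];
  for (const url of urls) {
    try {
      await processUrl(url);
    } catch (err) {
      // a single bad URL never aborts the whole run
      failures.push({ url, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return failures; // reported in the conversion summary
}
```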

Testing

# Run unit tests
npm test

# Run tests with coverage
npm test -- --coverage

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass
  6. Submit a pull request

License

ISC License - see LICENSE file for details.

Troubleshooting

Common Issues

Network Timeouts: Some sites may take longer to respond. The tool uses a 30-second timeout by default.
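
With native fetch, a 30-second cutoff typically looks like the following sketch (the helper itself is illustrative):

```typescript
// Abort the request if the server has not finished responding within timeoutMs.
async function fetchWithTimeout(url: string, timeoutMs = 30_000): Promise<string> {
  const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}
```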

Content Not Extracted: Try using a custom CSS selector with the -s option to target specific content areas.

Permission Errors: Ensure you have write permissions to the output directory.

Empty Output: Some pages may not have extractable content due to heavy use of JavaScript or unusual HTML structure.

Recursive Processing Stops Early: Check if --same-domain restriction is too limiting, or increase --max-links and --max-depth values.

Too Many Links: Use --max-links to limit the number of links processed per page, or --same-domain to reduce scope.

Infinite Loops: The tool automatically prevents revisiting URLs and includes safety limits to prevent infinite crawling.

Getting Help

Use the --help flag for quick reference:

lxt --help

Or during development:

npm start -- --help
