Skip to content

dylanfeltus/site-to-md

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

site-to-md 🤖

Make any website AI-agent-readable. Generates /llms.txt + clean markdown for every page.

robots.txt told search engines what to crawl. llms.txt tells AI agents what to read. This tool generates both automatically.

Quick Start

npx site-to-md https://mysite.com

That's it. Zero config. You'll get:

site-to-md-output/
├── llms.txt           # Index file per llmstxt.org spec
├── llms-ctx.txt       # All content inline (for single-prompt ingestion)
├── index.html.md      # Homepage as markdown
├── docs/
│   ├── getting-started.html.md
│   └── api-reference.html.md
└── blog/
    ├── hello-world.html.md
    └── release-notes.html.md

What It Does

  1. Crawls your site — follows links or uses sitemap.xml if available
  2. Extracts content — strips nav, footer, ads, scripts using Mozilla Readability (same as Firefox Reader View)
  3. Converts to markdown — clean, structured markdown via Turndown
  4. Generates /llms.txt — per the llmstxt.org spec
  5. Generates per-page .html.md files — per the spec convention
  6. Generates /llms-ctx.txt — all content inline for single-prompt ingestion

Install

# Use directly with npx (no install needed)
npx site-to-md https://mysite.com

# Or install globally
npm install -g site-to-md

# Or as a project dependency
npm install site-to-md

CLI Usage

# Crawl a live website
site-to-md https://docs.mysite.com

# Process local build output
site-to-md ./dist

# Customize output
site-to-md https://mysite.com \
  --out ./public \
  --title "My Product" \
  --desc "Developer documentation for My Product"

# Filter pages
site-to-md https://mysite.com \
  --include "/docs/**" \
  --include "/blog/**" \
  --exclude "/admin/**"

# Skip context file
site-to-md https://mysite.com --no-ctx

Options

Flag Description Default
--out <dir> Output directory ./site-to-md-output
--title <name> Site title for llms.txt Auto-detected
--desc <text> Site description Auto-detected
--include <glob> Include only matching paths (repeatable) All
--exclude <glob> Exclude matching paths (repeatable) None
--no-ctx Skip generating llms-ctx.txt
--no-sitemap Don't use sitemap.xml for crawling
--max-depth <n> Max crawl depth 3
--concurrency <n> Parallel requests 5
--strip <selector> CSS selectors to strip (repeatable)
--config <path> Config file path Auto-detect

Programmatic API

import { agentReady } from 'site-to-md';

const result = await agentReady({
  url: 'https://mysite.com',
  outDir: './public',
  title: 'My Product',
  description: 'Developer documentation',
  include: ['/docs/**'],
});

console.log(`Generated ${result.pages.length} pages`);
console.log(result.llmsTxt); // Contents of llms.txt

Config File

Create site-to-md.config.js in your project root:

export default {
  url: 'https://mysite.com',
  outDir: './public',
  title: 'My Product',
  description: 'A brief description for agents',

  include: ['/docs/**', '/blog/**'],
  exclude: ['/admin/**'],

  sections: {
    'Documentation': '/docs/**',
    'Blog': '/blog/**',
    'API Reference': '/api-docs/**',
  },

  maxDepth: 3,
  concurrency: 5,
  stripSelectors: ['.cookie-banner', '.ad-wrapper'],
};

Build Pipeline

{
  "scripts": {
    "build": "next build && site-to-md ./out --out ./out"
  }
}

Output Format

/llms.txt

Per the llmstxt.org spec:

# My Product

> Developer documentation for building with My Product

## Documentation

- [Getting Started](/docs/getting-started.html.md): Quick start guide
- [API Reference](/docs/api-reference.html.md): Complete API docs

## Blog

- [Hello World](/blog/hello-world.html.md): Our launch announcement

Per-page .html.md

Clean markdown extracted from each page — no nav, footer, ads, or scripts.

/llms-ctx.txt

All page content concatenated in a single file for one-shot ingestion by AI agents.

What is llms.txt?

llms.txt is a proposed standard (by Jeremy Howard) for making websites readable by AI agents. Think of it like robots.txt but for LLMs:

  • /llms.txt — A markdown index file listing your site's key pages with descriptions. AI agents read this first to understand what's on your site.
  • *.html.md — Clean markdown versions of each page (same URL + .md). No nav, no footer, no JavaScript — just the content.
  • /llms-ctx.txt — All content concatenated in one file for single-prompt ingestion.

Sites like Anthropic, Cloudflare, and Stripe already have /llms.txt files. site-to-md generates yours automatically.

License

MIT © Stratus Labs

About

Make any website AI-agent-readable. Generates /llms.txt + clean markdown for every page.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors