# Autogenerate LLMs.txt file simply

A GitHub Action that automatically crawls websites using their sitemap and generates a single llms.txt file containing clean, markdown-formatted content from every page. It is designed to create AI-friendly content for Large Language Models (LLMs), following the proposed llms.txt standard.
## What is llms.txt?

llms.txt is a proposed web standard by Jeremy Howard (Answer.AI) that provides AI-friendly website content in a structured format. It helps Large Language Models understand and extract your website's most important information efficiently.
- AI Search Optimization: Optimizes your content for AI-powered search engines like Perplexity, ChatGPT, and Claude
- Better Attribution: Ensures proper citation when AI models reference your content
- Context Window Friendly: Overcomes LLM token limitations with curated, essential content
- Controlled AI Interaction: You decide what content AI models should prioritize
- Future-Proof: Positions your website for the AI-driven web
Early adopters of the standard include:

- Perplexity - AI search engine
- ElevenLabs - AI voice technology
- FastHTML - Web framework documentation
- Answer.AI - AI research company
- Mintlify - Documentation platform
## Features

- Jina AI (Default): Free content extraction via the Jina AI Reader API
- Firecrawl: Advanced crawling with the Firecrawl API for complex websites
- Sitemap Discovery: Auto-detects sitemaps from `robots.txt`, `/sitemap.xml`, and `/sitemap_index.xml`
- Asynchronous Processing: Parallel content extraction for improved performance (see the sketch after this list)
- Clean Markdown Output: Converts HTML content to markdown format
- Aggregated Output: Combines all pages into a single llms.txt file
- GitHub Actions Integration: Runs as a composite action
- Python 3.11: Built with modern Python and async/await
- Dependencies: Uses `httpx`, `beautifulsoup4`, `lxml`, and `firecrawl-py`
- Error Handling: Graceful handling of failed requests and missing content
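The parallel extraction pattern can be pictured with a minimal sketch, assuming the public Jina Reader endpoint and illustrative URLs (this is not the action's actual source):

```python
# Minimal sketch of parallel markdown extraction via the Jina AI Reader.
# Illustrative only: URLs are placeholders, error handling is elided.
import asyncio

import httpx

JINA_READER = "https://r.jina.ai/"  # prefix a page URL to get markdown back

async def fetch_markdown(client: httpx.AsyncClient, url: str) -> str:
    resp = await client.get(JINA_READER + url, timeout=30.0)
    resp.raise_for_status()
    return resp.text

async def fetch_all(urls: list[str]) -> list[str]:
    # Fan out all requests concurrently instead of fetching pages one by one.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        return await asyncio.gather(*(fetch_markdown(client, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(fetch_all(["https://example.com", "https://example.com/about"]))
    print(f"Fetched {len(pages)} pages")
```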
## Quick Start

Create `.github/workflows/generate-llms-txt.yml` in your repository:
```yaml
name: Generate AI-Optimized llms.txt

permissions:
  contents: write

on:
  push:
    branches: [ main ]
  schedule:
    - cron: '0 2 * * 0' # Weekly updates

jobs:
  generate-llms-txt:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Generate llms.txt with Jina AI
        uses: kevinnkansah/[email protected]
        with:
          domain: https://your-website.com
          outputFile: public/llms.txt
          # backend: jina # Default, free option

      - name: Commit and push llms.txt
        uses: EndBug/add-and-commit@v9
        with:
          author_name: 'AI Bot'
          author_email: '[email protected]'
          add: 'public/llms.txt'
          message: 'feat: update llms.txt for AI optimization'
```
For complex websites requiring advanced crawling:
```yaml
- name: Generate llms.txt with Firecrawl
  uses: kevinnkansah/[email protected]
  with:
    domain: https://your-complex-website.com
    outputFile: docs/llms.txt
    backend: firecrawl
    firecrawl_api_key: ${{ secrets.FIRECRAWL_API_KEY }}
```
## Inputs

| Name | Required | Description | Default |
|---|---|---|---|
| `domain` | Yes | The full URL of the site to crawl (e.g., `https://example.com`). | |
| `outputFile` | No | The path where the final llms.txt file will be saved. | `public/llms.txt` |
| `backend` | No | Content extraction backend: `"jina"` (free) or `"firecrawl"` (requires API key). | `jina` |
| `jina_api_key` | No | Your Jina AI Reader API key. Optional for the Jina backend; recommended for higher rate limits. Store this as a GitHub Secret. | |
| `firecrawl_api_key` | No | Your Firecrawl API key. Required when using the Firecrawl backend. Get one from firecrawl.dev. Store this as a GitHub Secret. | |
## How It Works

This action first attempts to find your sitemap by checking `/robots.txt` and common paths like `/sitemap.xml`. It then parses the sitemap(s) to get a list of all page URLs. For each URL, it uses the selected backend to fetch the content as clean markdown:

- Jina Backend (Default): Uses the free Jina AI Reader API (`https://r.jina.ai/`)
- Firecrawl Backend: Uses the Firecrawl API for more advanced crawling capabilities

Finally, it aggregates the content from all pages into the specified `outputFile`.
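As a rough sketch of that pipeline (function names and details here are illustrative assumptions, not the action's real code):

```python
# Illustrative sketch of the sitemap-discovery and aggregation steps.
# Function names and details are assumptions, not the action's real code.
import xml.etree.ElementTree as ET

import httpx

def discover_sitemap(domain: str) -> str | None:
    # 1) Prefer a "Sitemap:" directive in robots.txt.
    robots = httpx.get(f"{domain}/robots.txt")
    if robots.status_code == 200:
        for line in robots.text.splitlines():
            if line.lower().startswith("sitemap:"):
                return line.split(":", 1)[1].strip()
    # 2) Fall back to the common default locations.
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        if httpx.get(domain + path).status_code == 200:
            return domain + path
    return None

def sitemap_urls(sitemap_url: str) -> list[str]:
    # Collect every <loc> entry; a full implementation would also
    # recurse into nested sitemap indexes.
    root = ET.fromstring(httpx.get(sitemap_url).content)
    ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    return [loc.text for loc in root.iter(f"{ns}loc") if loc.text]

def aggregate(pages: dict[str, str], output_file: str) -> None:
    # Concatenate each page's markdown under a URL header.
    with open(output_file, "w", encoding="utf-8") as f:
        for url, markdown in pages.items():
            f.write(f"# {url}\n\n{markdown}\n\n")
```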
## Examples

Jina backend (default):

```yaml
- name: Generate llms.txt with Jina
  uses: kevinnkansah/[email protected]
  with:
    domain: https://dewflow.xyz
    outputFile: public/llms.txt
    # backend: jina # Optional, this is the default
```
Firecrawl backend:

```yaml
- name: Generate llms.txt with Firecrawl
  uses: kevinnkansah/[email protected]
  with:
    domain: https://dewflow.xyz
    outputFile: public/llms.txt
    backend: firecrawl
    firecrawl_api_key: ${{ secrets.FIRECRAWL_API_KEY }}
```
## Use Cases

- Documentation Sites: API docs, technical guides, knowledge bases
- E-commerce: Product catalogs, store policies, FAQ sections
- Content Publishers: Blogs, news sites, educational content
- SaaS Companies: Feature documentation, help centers
- Business Websites: Company info, services, contact details
- Educational Sites: Course materials, research papers
## Benefits

- Better AI Citations: Content gets properly attributed in AI responses
- Enhanced Discoverability: Improved visibility in AI-powered search
- Structured Content: Pre-formatted content for AI consumption
- Content Control: Curated content selection for AI models
- Standard Compliance: Follows the proposed llms.txt specification
## FAQ

### What's the difference between Jina and Firecrawl backends?

**Jina AI (Default - Free)**

- Completely free to use
- Works well for most websites
- Fast and reliable
- Limited customization options

**Firecrawl (Premium)**

- Advanced crawling capabilities
- Better handling of JavaScript-heavy sites
- More extraction options
- Requires API key and credits
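In code terms, the difference boils down to which extractor a page URL is routed through. A hedged sketch (the `firecrawl-py` call signature and return shape vary by SDK version; the `JINA_API_KEY`/`FIRECRAWL_API_KEY` environment variables are illustrative):

```python
# Hedged sketch of the two backend call paths; not the action's real code.
import os

import httpx

def scrape_jina(url: str) -> str:
    # Free path: prefix the target URL with the Jina Reader endpoint.
    headers = {}
    if key := os.getenv("JINA_API_KEY"):  # optional; raises rate limits
        headers["Authorization"] = f"Bearer {key}"
    resp = httpx.get("https://r.jina.ai/" + url, headers=headers, timeout=30.0)
    resp.raise_for_status()
    return resp.text

def scrape_firecrawl(url: str) -> str:
    # Paid path: needs an API key; better for JavaScript-heavy pages.
    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
    # NOTE: scrape_url's return shape differs across firecrawl-py versions
    # (a dict in older releases, a document object in newer ones).
    result = app.scrape_url(url)
    return result["markdown"] if isinstance(result, dict) else result.markdown
```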
### How often should I update my llms.txt file?
It depends on your content update frequency:
- Daily: For news sites or frequently updated blogs
- Weekly: For most business websites
- Monthly: For stable documentation or corporate sites
- On-demand: Trigger manually when major content changes occur
### What's the optimal llms.txt file size?
- Small sites: 10KB - 100KB
- Medium sites: 100KB - 1MB
- Large sites: 1MB+
Note: Most LLMs have context windows of 128K-200K tokens (approximately 500KB-800KB of text).
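A quick way to sanity-check your generated file against those limits, using the rough heuristic of about 4 bytes of English text per token (the file path below is an assumption):

```python
# Estimate how much of a context window public/llms.txt would consume.
# The path and the 4-bytes-per-token heuristic are assumptions.
import os

size_bytes = os.path.getsize("public/llms.txt")
approx_tokens = size_bytes / 4  # rough heuristic for English text
print(f"{size_bytes / 1024:.0f} KB ~ {approx_tokens:,.0f} tokens")
```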
### Is my content safe when using these APIs?

- Jina AI: Processes content on-demand and doesn't store your data permanently
- Firecrawl: Enterprise-grade security, GDPR compliant
Both services:
- Use HTTPS encryption
- Don't train models on your data
- Process content temporarily for extraction only
### What if my site has a robots.txt that blocks crawlers?
This action respects your robots.txt file. If you're blocking crawlers, you have options:
- Add specific allow rules for AI crawlers
- Manually create your llms.txt file
- Use this action on a staging environment
- Temporarily modify robots.txt during generation
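To check up front whether your robots.txt would block a generic crawler, the standard library's `urllib.robotparser` is enough (the URLs below are placeholders):

```python
# Check whether robots.txt allows a generic crawler to fetch a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://your-website.com/robots.txt")  # placeholder URL
rp.read()
if not rp.can_fetch("*", "https://your-website.com/some-page"):
    print("robots.txt blocks this URL for generic crawlers")
```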
### What are the costs?

- Jina AI: Completely free
- Firecrawl: Pay-per-use pricing
  - Free tier: 500 pages/month
  - Paid plans: Starting at $29/month
  - Enterprise: Custom pricing
- GitHub Actions: Free for public repos, included minutes for private repos
### Can I customize the output format?

Currently, the action generates the standard llms.txt format. For custom formatting:

- Fork this repository
- Modify the `crawler.py` aggregation logic
- Submit a PR if you think others would benefit
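For example, a forked aggregation step might prepend a table of contents. This is a hypothetical sketch, not the shape of the real `crawler.py`:

```python
# Hypothetical custom aggregation: add a table of contents before the
# per-page sections. The real crawler.py may be structured differently.
def aggregate_with_toc(pages: dict[str, str], output_file: str) -> None:
    with open(output_file, "w", encoding="utf-8") as f:
        f.write("# Table of Contents\n\n")
        for url in pages:
            f.write(f"- {url}\n")
        f.write("\n")
        for url, markdown in pages.items():
            f.write(f"# {url}\n\n{markdown}\n\n")
```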
## Advanced Examples

E-commerce product catalog:

```yaml
- name: Generate Product Catalog for AI
  uses: kevinnkansah/[email protected]
  with:
    domain: https://mystore.com
    outputFile: ai/product-catalog.txt
    backend: firecrawl # Better for complex product pages
    firecrawl_api_key: ${{ secrets.FIRECRAWL_API_KEY }}
```
API documentation site:

```yaml
- name: Generate API Docs for AI
  uses: kevinnkansah/[email protected]
  with:
    domain: https://docs.myapi.com
    outputFile: public/llms.txt
```
Multi-language sites, using a build matrix to generate one file per locale:

```yaml
strategy:
  matrix:
    locale: [en, es, fr, de]
steps:
  - name: Generate llms.txt for ${{ matrix.locale }}
    uses: kevinnkansah/[email protected]
    with:
      domain: https://mysite.com/${{ matrix.locale }}
      outputFile: public/llms-${{ matrix.locale }}.txt
```
## Development

This project uses `uv` for package management and `pre-commit` with `commitizen` to enforce conventional commit messages.

- Clone the repository
- Install dependencies: `uv pip install -e .[dev]`
- Activate pre-commit hooks: `uv run pre-commit install --hook-type commit-msg`
- Make your changes
- Commit your work using the guided prompt: `uv run cz commit`

Your contributions will be automatically versioned and released upon merging to `main`.
## License

This project is licensed under the MIT License. See the LICENSE file for details.