Merged
Changes from 4 commits
2 changes: 1 addition & 1 deletion docusaurus.config.js
@@ -287,7 +287,7 @@ const config = {
'docusaurus-plugin-llms',
{
generateLLMsTxt: true,
- generateLLMsFullTxt: true,
+ generateLLMsFullTxt: false, // Disabled. We're currently using gitingest to generate a more detailed llms-full.txt file. For details, see /scripts/README.md.
docsDir: 'docs',
version: 'latest',
title: 'ScalarDB Documentation',
5 changes: 3 additions & 2 deletions package.json
@@ -5,13 +5,14 @@
"scripts": {
"docusaurus": "docusaurus",
"start": "docusaurus start",
- "build": "docusaurus build 2>&1 | tee brokenLinks.log && node scripts/filter-broken-link-warnings.js && node scripts/generate-glossary-json.js",
+ "build": "docusaurus build 2>&1 | tee brokenLinks.log && node scripts/filter-broken-link-warnings.js && node scripts/generate-glossary-json.js && npm run generate-llms-full",
"swizzle": "docusaurus swizzle",
"deploy": "docusaurus deploy",
"clear": "docusaurus clear",
"serve": "docusaurus serve",
"write-translations": "docusaurus write-translations",
- "write-heading-ids": "docusaurus write-heading-ids"
+ "write-heading-ids": "docusaurus write-heading-ids",
+ "generate-llms-full": "(.venv/bin/python scripts/generate-llms-full.py) || (python3 scripts/generate-llms-full.py) || (python scripts/generate-llms-full.py)"
},
"dependencies": {
"@docusaurus/core": "^3.7.0",
61 changes: 61 additions & 0 deletions scripts/README.md
@@ -0,0 +1,61 @@
# Create `llms-full.txt` by Using the `generate-llms-full.py` Script

The `generate-llms-full.py` script generates an `llms-full.txt` file when the Docusaurus site is built.

> [!CAUTION]
>
> If this script stops working, the most likely cause is that [gitingest](https://github.com/coderamp-labs/gitingest) is down or has limited its API usage. If that happens, we'll need to find another approach, or host gitingest ourselves and provide it with an API key from an AI language model provider (OpenAI, Claude, etc.), to generate the `llms-full.txt` file.

## Why do we need this script?

Although the `docusaurus-plugin-llms` plugin can generate an `llms-full.txt` file, that file doesn't include front-matter metadata. Currently, this seems to be the expected behavior for the `llms.txt` standard.

However, we need to be able to tell AI language models when our documentation applies to only specific editions. That information is already specified in `tags` in the front-matter properties of each Markdown file.

By using [gitingest](https://github.com/coderamp-labs/gitingest), we can generate an `llms-full.txt` file that includes both front-matter data (like edition tags) and a directory tree. Together, these give AI language models better context about our documentation and its navigation.
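
To illustrate why the front matter matters, the following is a minimal sketch of how a consumer could read edition tags from a page once the front matter is preserved. The sample page, its tag values, and the `front_matter_tags` helper are hypothetical and are not part of this repository. The parser handles only a simple block-style `tags` list, not YAML in general.

```python
from typing import List

# Hypothetical sample page; the tag values are illustrative assumptions.
SAMPLE_MDX = """---
tags:
  - Community
  - Enterprise Standard
displayed_sidebar: docsEnglish
---

# Example page
"""


def front_matter_tags(mdx_text: str) -> List[str]:
    """Return the entries of a block-style `tags` list in the front matter."""
    lines = mdx_text.splitlines()
    # Front matter must start on the first line with a `---` delimiter.
    if not lines or lines[0].strip() != "---":
        return []

    tags: List[str] = []
    in_tags = False
    for line in lines[1:]:
        if line.strip() == "---":  # The closing delimiter ends the front matter.
            break
        if line.startswith("tags:"):
            in_tags = True
            continue
        if in_tags:
            stripped = line.strip()
            if stripped.startswith("- "):
                tags.append(stripped[2:].strip())
            else:
                in_tags = False  # Any other line ends the tags list.
    return tags


print(front_matter_tags(SAMPLE_MDX))
```

If the front matter is stripped from the file, the same helper returns an empty list, which means the edition information is no longer available.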

## Usage

The `generate-llms-full` script runs automatically at the end of `npm run build`. To run it on its own, use the following command:

```shell
npm run generate-llms-full
```

You should rarely need to run the Python script directly, except for testing:

```shell
python scripts/generate-llms-full.py
```

### Requirements

- Python 3.8+
- gitingest package (automatically installed by using `pip` if not present)

### What the `generate-llms-full.py` script does

1. Installs gitingest by using `pip` if the package isn't already available.
2. Uses gitingest to ingest the repository, starting from the project root.
3. Includes only `.mdx` files under `docs` and `src/components/en-us`, which limits the output to the latest version of the English documentation.
4. Excludes build artifacts, `node_modules`, and other irrelevant files.
5. Generates a text digest that consists of a summary, a directory tree, and the file contents.
6. Prepends a custom header that gives ScalarDB documentation context.
7. Writes the result to `build/llms-full.txt`.

### Configuration

The script includes these file patterns:

- **Include:** `docs/*.mdx`, `docs/**/*.mdx`, `src/components/en-us/*.mdx`, `src/components/en-us/**/*.mdx` (only latest English docs)
- **Exclude:** `node_modules/*`, `.git/*`, `build/*`, `*.log`, `.next/*`, `dist/*`, `.docusaurus/*`
- **Max file size:** 100 KB (102,400 bytes) per file
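
As a rough illustration of how these settings interact, the following sketch filters repository-relative paths by using `fnmatch` from the Python standard library. It is not gitingest's actual matcher, which may treat `*`, `**`, and `/` differently. The `is_included` helper is hypothetical, and the sketch assumes that a file exactly at the size limit is kept.

```python
from fnmatch import fnmatchcase

# Patterns and size limit copied from the configuration above.
INCLUDE_PATTERNS = {
    "docs/*.mdx", "docs/**/*.mdx",
    "src/components/en-us/*.mdx", "src/components/en-us/**/*.mdx",
}
EXCLUDE_PATTERNS = {
    "node_modules/*", ".git/*", "build/*",
    "*.log", ".next/*", "dist/*", ".docusaurus/*",
}
MAX_FILE_SIZE = 102400  # 100 KB


def is_included(rel_path, size_bytes,
                include=INCLUDE_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Hypothetical helper: decide whether a file would be part of the digest."""
    if size_bytes > MAX_FILE_SIZE:
        return False  # Assumption: oversized files are skipped.
    if any(fnmatchcase(rel_path, pattern) for pattern in exclude):
        return False  # Assumption: exclusions take precedence over inclusions.
    return any(fnmatchcase(rel_path, pattern) for pattern in include)


for path, size in [
    ("docs/getting-started.mdx", 2048),
    ("docs/notes.md", 2048),
    ("build/llms-full.txt", 2048),
    ("docs/big.mdx", 200000),
]:
    print(path, is_included(path, size))
```

Checking a handful of paths this way before a build can help confirm that a new directory or file extension is matched by the include patterns.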

### Benefits over `docusaurus-plugin-llms`

- Better repository understanding and context
- More comprehensive file inclusion
- Optimized format for AI language model consumption
- Active maintenance and updates
- Superior pattern matching and filtering
105 changes: 105 additions & 0 deletions scripts/generate-llms-full.py
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
"""
Generate llms-full.txt by using gitingest instead of docusaurus-plugin-llms
"""

import asyncio
import os
import subprocess
import sys
from pathlib import Path

# Add the project root to the Python path.
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

def install_gitingest():
    """Install gitingest package using pip."""
    try:
        print("📦 Installing gitingest...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "gitingest"])
        print("✅ gitingest installed successfully")
        return True
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to install gitingest: {e}")
        return False

# Try to import gitingest, install if not available
try:
    from gitingest import ingest_async
except ImportError:
    print("⚠️ gitingest not found, attempting to install...")
    if install_gitingest():
        try:
            from gitingest import ingest_async
            print("✅ gitingest imported successfully after installation")
        except ImportError:
            print("❌ Failed to import gitingest even after installation")
            print("Manual installation may be required:")
            print(" pip install gitingest")
            print(" # or")
            print(" pipx install gitingest")
            sys.exit(1)
    else:
        print("❌ gitingest installation failed")
        print("Manual installation required:")
        print(" pip install gitingest")
        print(" # or")
        print(" pipx install gitingest")
        sys.exit(1)


async def generate_llms_full():
    """Generate llms-full.txt by using gitingest."""
    try:
        print("Generating llms-full.txt by using gitingest...")

        # Current repository path
        repo_path = Path(__file__).parent.parent
        build_dir = repo_path / "build"
        build_dir.mkdir(exist_ok=True)

        # Configure the gitingest parameters.
        include_patterns = {
            "docs/*.mdx", "docs/**/*.mdx", "src/components/en-us/*.mdx", "src/components/en-us/**/*.mdx"
        }

        exclude_patterns = {
            "node_modules/*", ".git/*", "build/*",
            "*.log", ".next/*", "dist/*", ".docusaurus/*"
        }

        # Generate content by using gitingest.
        summary, tree, content = await ingest_async(
            str(repo_path),
            max_file_size=102400,  # 100KB max file size
            include_patterns=include_patterns,
            exclude_patterns=exclude_patterns,
            include_gitignored=False
        )

        # Create a header that matches your current format.
        header = """# ScalarDB Documentation - Full Repository Context
# Generated by using GitIngest for AI/LLM consumption
# Cloud-native universal transaction manager
# Website: https://scalardb.scalar-labs.com
"""

        # Combine all sections.
        full_content = header + summary + "\n\n" + tree + "\n\n" + content

        # Write to the build directory.
        output_path = build_dir / "llms-full.txt"
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(full_content)

        print(f"✅ llms-full.txt generated successfully at {output_path}")
        print(f"📊 Summary: {len(full_content)} characters, estimated tokens: {len(full_content.split())}")

    except Exception as error:
        print(f"❌ Error generating llms-full.txt: {error}")
        sys.exit(1)

if __name__ == "__main__":
    asyncio.run(generate_llms_full())