Commit 23c0ac9

Generate llms-full.txt by using gitingest script (#1418)
* Add script to generate `llms-full.txt` by using gitingest

  Introduces `scripts/generate-llms-full.py` to generate `llms-full.txt` by leveraging the gitingest package instead of `docusaurus-plugin-llms`. The script handles installation of gitingest if missing, configures include/exclude patterns, and writes the output to the build directory.

* Add `generate-llms-full` script to build process

  Introduces a new npm script `generate-llms-full` that runs a Python script for generating LLMs data. The build process now includes this step to ensure LLMs data is generated during builds.

* Disable `generateLLMsFullTxt` for `docusaurus-plugin-llms`

  Set `generateLLMsFullTxt` to false in `docusaurus-plugin-llms` configuration. This change is made because gitingest is now used to generate a more detailed llms-full.txt file.

* Add README for `generate-llms-full.py` script

  Introduces documentation explaining the purpose, usage, requirements, and configuration of the `generate-llms-full.py` script. The README details how the script uses gitingest to create an AI-friendly `llms-full.txt` file with enhanced context for documentation.

* Apply suggestions from Copilot and Gemini review

1 parent 2b60728 commit 23c0ac9

6 files changed: +150 -3 lines changed

.github/workflows/deploy.yml

Lines changed: 3 additions & 0 deletions
@@ -23,6 +23,9 @@ jobs:
 
       - name: Install dependencies
         run: npm ci
+      - name: Install Python dependencies
+        run: |
+          python3 -m pip install --user gitingest
       - name: Build website
         run: npm run build
 

.github/workflows/test-deploy.yml

Lines changed: 2 additions & 0 deletions
@@ -23,5 +23,7 @@ jobs:
 
       - name: Install dependencies
         run: npm ci
+      - name: Install Python dependencies
+        run: python3 -m pip install --user gitingest
       - name: Test build website
         run: npm run build
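
Both workflows install gitingest before `npm run build`, because the build now ends by running `scripts/generate-llms-full.py`. As a minimal sketch, assuming a preflight check were wanted before the build starts, the standard library can report which Python modules are missing. The helper name `missing_modules` is hypothetical and is not part of this commit.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "gitingest" is the package that the workflow step installs.
missing = missing_modules(["gitingest"])
if missing:
    print("Install before building: python3 -m pip install --user " + " ".join(missing))
else:
    print("All Python build dependencies are available.")
```

The check only looks the module up and does not import it, so it stays fast and has no side effects.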

docusaurus.config.js

Lines changed: 1 addition & 1 deletion
@@ -287,7 +287,7 @@ const config = {
       'docusaurus-plugin-llms',
       {
         generateLLMsTxt: true,
-        generateLLMsFullTxt: true,
+        generateLLMsFullTxt: false, // Disabled. We're currently using gitingest to generate a more detailed llms-full.txt file. For details, see /scripts/README.md.
         docsDir: 'docs',
         version: 'latest',
         title: 'ScalarDB Documentation',

package.json

Lines changed: 3 additions & 2 deletions
@@ -5,13 +5,14 @@
   "scripts": {
     "docusaurus": "docusaurus",
     "start": "docusaurus start",
-    "build": "docusaurus build 2>&1 | tee brokenLinks.log && node scripts/filter-broken-link-warnings.js && node scripts/generate-glossary-json.js",
+    "build": "docusaurus build 2>&1 | tee brokenLinks.log && node scripts/filter-broken-link-warnings.js && node scripts/generate-glossary-json.js && npm run generate-llms-full",
     "swizzle": "docusaurus swizzle",
     "deploy": "docusaurus deploy",
     "clear": "docusaurus clear",
     "serve": "docusaurus serve",
     "write-translations": "docusaurus write-translations",
-    "write-heading-ids": "docusaurus write-heading-ids"
+    "write-heading-ids": "docusaurus write-heading-ids",
+    "generate-llms-full": "python3 scripts/generate-llms-full.py"
   },
   "dependencies": {
     "@docusaurus/core": "^3.7.0",

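The `build` script in `package.json` chains its steps with `&&`, so `npm run generate-llms-full` runs only when every earlier step in the chain exits with status 0. A rough Python sketch of that short-circuit behavior follows; the function name `run_chain` and the sample commands are hypothetical and only illustrate the control flow, not how npm executes scripts.

```python
import subprocess
import sys


def run_chain(commands):
    """Run commands in order; stop at the first non-zero exit code and return it."""
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Mirrors `a && b`: later commands are skipped after a failure.
            return result.returncode
    return 0


if __name__ == "__main__":
    steps = [
        [sys.executable, "-c", "print('step 1')"],
        [sys.executable, "-c", "raise SystemExit(2)"],  # simulated failing step
        [sys.executable, "-c", "print('skipped after the failure')"],
    ]
    print("exit code:", run_chain(steps))
```
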
scripts/README.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+# Create `llms-full.txt` by Using the `generate-llms-full.py` Script
+
+The `generate-llms-full.py` script generates an `llms-full.txt` file when the Docusaurus site is built.
+
+> [!CAUTION]
+>
+> If this script stops working, it's because [gitingest](https://github.com/coderamp-labs/gitingest) is either down or has limited its API usage. If that happens, we'll need to find another way or host gitingest ourselves and provide it with an API key from an AI language model provider (OpenAI, Claude, etc.) to generate the `llms-full.txt` file.
+
+## Why do we need this script?
+
+Although the `docusaurus-plugin-llms` plugin can generate an `llms-full.txt` file, that file doesn't include front-matter metadata. Currently, this seems to be the expected behavior for the `llms.txt` standard.
+
+However, we need to be able to tell AI language models when our documentation applies to only specific editions, which is already specified in `tags` in the front-matter properties of each Markdown file.
+
+By using [gitingest](https://github.com/coderamp-labs/gitingest), we can generate an `llms-full.txt` file that includes front-matter data and a directory tree. This gives AI language models better context for our documentation, particularly front-matter metadata (like edition tags) and documentation navigation.
+
+## Usage
+
+The `generate-llms-full` script runs automatically as the last step of `npm run build`. To run it on its own:
+
+```shell
+npm run generate-llms-full
+```
+
+You should rarely need to run the Python script directly, unless you want to test it:
+
+```shell
+python3 scripts/generate-llms-full.py
+```
+
+### Requirements
+
+- Python 3.8+
+- gitingest package
+
+> [!NOTE]
+>
+> For local development, install gitingest manually by using `pip install --user gitingest` or `pipx install gitingest`. For GitHub Actions, gitingest is automatically installed in the workflow for building and deploying the docs site at `.github/workflows/deploy.yml` and in the test workflow at `.github/workflows/test-deploy.yml`.
+
+### What the `generate-llms-full.py` script does
+
+1. Uses gitingest to analyze the repository, starting from the repository root.
+2. Includes only `.mdx` documentation files under `docs` and `src/components/en-us`.
+3. Focuses on the latest version of English documentation.
+4. Excludes build artifacts, `node_modules`, and other irrelevant files.
+5. Generates a comprehensive AI-friendly text digest.
+6. Adds a custom header for ScalarDB documentation context.
+7. Outputs to `build/llms-full.txt`.
+
+### Configuration
+
+The script uses these file patterns and limits:
+
+- **Include:** `docs/*.mdx`, `docs/**/*.mdx`, `src/components/en-us/*.mdx`, `src/components/en-us/**/*.mdx` (only latest English docs)
+- **Exclude:** `node_modules/*`, `.git/*`, `build/*`, `*.log`, `.next/*`, `dist/*`, `.docusaurus/*`
+- **Max file size:** 100 KB per file
+
+### Benefits over `docusaurus-plugin-llms`
+
+- Better repository understanding and context
+- More comprehensive file inclusion
+- Optimized format for AI language model consumption
+- Active maintenance and updates
+- Superior pattern matching and filtering
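
The include and exclude patterns in the README's Configuration section act as a two-stage filter: a file is kept only if it matches an include pattern and no exclude pattern. The sketch below illustrates that idea with the standard library's `fnmatch` module. It is not gitingest's implementation, and gitingest's matching rules may differ (with `fnmatch`, `*` also matches `/`). The function name `select_paths` and the sample paths are hypothetical; the patterns are the ones configured in this commit's script.

```python
from fnmatch import fnmatchcase

INCLUDE = [
    "docs/*.mdx",
    "docs/**/*.mdx",
    "src/components/en-us/*.mdx",
    "src/components/en-us/**/*.mdx",
]
EXCLUDE = ["node_modules/*", ".git/*", "build/*", "*.log", ".next/*", "dist/*", ".docusaurus/*"]


def select_paths(paths, include=INCLUDE, exclude=EXCLUDE):
    """Keep paths that match at least one include pattern and no exclude pattern."""
    selected = []
    for path in paths:
        if any(fnmatchcase(path, pattern) for pattern in exclude):
            continue  # Excludes win over includes in this sketch.
        if any(fnmatchcase(path, pattern) for pattern in include):
            selected.append(path)
    return selected


sample = [
    "docs/index.mdx",
    "docs/api/guide.mdx",
    "docs/notes.md",
    "build/docs/index.mdx",
    "src/components/en-us/_warning.mdx",
]
print(select_paths(sample))
```

In this sketch, `docs/notes.md` is dropped because it is not an `.mdx` file, and `build/docs/index.mdx` is dropped by the `build/*` exclude pattern.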

scripts/generate-llms-full.py

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+#!/usr/bin/env python3
+"""
+Generate llms-full.txt by using gitingest instead of docusaurus-plugin-llms
+"""
+
+import asyncio
+import sys
+import textwrap
+from pathlib import Path
+
+try:
+    from gitingest import ingest_async
+except ImportError:
+    print("❌ gitingest not found. Please install it first:")
+    print(" pip install --user gitingest")
+    print(" # or")
+    print(" pipx install gitingest")
+    print("")
+    print("For GitHub Actions, this should be installed automatically in the workflow.")
+    sys.exit(1)
+
+
+async def generate_llms_full():
+    """Generate llms-full.txt by using gitingest."""
+    try:
+        print("Generating llms-full.txt by using gitingest...")
+
+        # Current repository path
+        repo_path = Path(__file__).parent.parent
+        build_dir = repo_path / "build"
+        build_dir.mkdir(exist_ok=True)
+
+        # Configure the gitingest parameters.
+        include_patterns = {
+            "docs/*.mdx", "docs/**/*.mdx", "src/components/en-us/*.mdx", "src/components/en-us/**/*.mdx"
+        }
+
+        exclude_patterns = {
+            "node_modules/*", ".git/*", "build/*",
+            "*.log", ".next/*", "dist/*", ".docusaurus/*"
+        }
+
+        # Generate content by using gitingest.
+        summary, tree, content = await ingest_async(
+            str(repo_path),
+            max_file_size=100000,  # 100 KB max file size
+            include_patterns=include_patterns,
+            exclude_patterns=exclude_patterns,
+            include_gitignored=False
+        )
+
+        # Create a header that matches your current format.
+        header = textwrap.dedent("""\
+        # ScalarDB Documentation - Full Repository Context
+        # Generated by using GitIngest for AI/LLM consumption
+        # Cloud-native universal transaction manager
+        # Website: https://scalardb.scalar-labs.com
+
+        """)
+
+        # Combine all sections.
+        full_content = header + summary + "\n\n" + tree + "\n\n" + content
+
+        # Write to the build directory.
+        output_path = build_dir / "llms-full.txt"
+        with open(output_path, 'w', encoding='utf-8') as f:
+            f.write(full_content)
+
+        print(f"✅ llms-full.txt generated successfully at {output_path}")
+        print(f"📊 Summary: {len(full_content)} characters, estimated tokens: {len(full_content.split())}")
+
+    except Exception as error:
+        print(f"❌ Error generating llms-full.txt: {error}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    asyncio.run(generate_llms_full())
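
The script joins the header, summary, tree, and content with blank lines, then reports the character count and a whitespace-separated word count that it labels "estimated tokens". The sketch below pulls those two steps into small pure functions so that they can be checked without gitingest installed. The names `assemble` and `size_report` are hypothetical, and the word count is only a rough proxy, since real tokenizers usually split text differently.

```python
def assemble(header, summary, tree, content):
    """Join the sections in the same order and with the same separators as the script."""
    return header + summary + "\n\n" + tree + "\n\n" + content


def size_report(text):
    """Return (character count, whitespace-separated word count) for the text."""
    return len(text), len(text.split())


digest = assemble("# Header\n\n", "Summary line", "docs/\n  index.mdx", "File contents here")
characters, words = size_report(digest)
print(f"{characters} characters, about {words} words")
```

Keeping the assembly separate from the gitingest call would also make it possible to unit-test the output format on its own, should that ever be needed.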
