Commit 68f574d

Overhaul LLMs (docs changes) (#969)

* adds script to output .md pages to /.ai/, adds ai_exclude flag to needed pages, adds initial batch of ai files
* adds Tutorial tag to relevant pages, updates scripts for generating new llms files (still need category files)
* adds category file generation
* updates main llms script, removes deprecated LLMs files
* adds estimated token count to category bundle outputs
* adds variable for base url for ai artifact files
* wired up existing UI, updated default file outputs
* update f string copy to match file renaming
* updates check-llms workflow to verify llms.txt as llms-full.txt is deprecated with these changes
* fresh llms
* updates urls for prod
* update scripts per feedback (remove bundle from file path)
* Fix filepaths to remove bundle
* remove log line that is no longer needed
* patch to remove source repos and optional resources from llms.txt, WIP README
* patch script to output llms-full.jsonl to root of docs repo (too big to download via browser like other /.ai/ files)
* updates README
* tweaking workflow for new full site file use
* removes Tutorial category (we can figure out how we want to manage tagging these in future ticket).
* updated llms
* remove ai_exclude flag and tags, update README
* remove blank end lines
* update generate category script to only output md format (will explore other formats in future ticket)
* updates per feedback, fresh llms
* formatting
* remove tutorial.md bundle file as not needed at this time.
* update to fix errant hyphen in front of code snippets, ran fresh llm files with the corrections in place
* remove check for 'ai_exclude' flag
* remove ai exclude tag
* remove Tutorial tags, fresh llms
* updates script to fix formatting issue in output files, fresh llms
* update code comment
* missed a file save!
* improved README
* fresh llms

1 parent 02d77c0 commit 68f574d

175 files changed

+138823
-139954
lines changed

.ai/README.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# AI Directory: LLM-related files

The `/.ai/` directory contains files generated for LLM-driven tools, such as coding assistants and agents, to improve the experience of developers who use these tools alongside our documentation.

## Generate Updated Files

### Prerequisites

Ensure you have cloned the `polkadot-mkdocs` repo and installed the MkDocs requirements; otherwise, you will not have access to the modules the scripts need. See the `polkadot-mkdocs` [README.md](./README) for instructions if needed.

### Get Started

Follow these steps to generate a fresh set of files after updating documentation content:

1. From the `polkadot-docs` directory, run:

    ```
    python scripts/generate_llms.py
    ```

2. Successful file generation will result in output similar to the following:

    ▶️ Generate AI pages<br />
    [ai-pages] processed=150 skipped=0<br />
    [ai-pages] output dir: /polkadot-mkdocs/polkadot-docs/.ai/pages<br />
    ✅ Generate AI pages complete<br />
    <br />
    ▶️ Generate llms.txt<br />
    ✅ llms.txt generated at: /polkadot-mkdocs/polkadot-docs/llms.txt<br />
    Pages listed: 150<br />
    ✅ Generate llms.txt complete<br />
    <br />
    ▶️ Generate site-index and full site content files<br />
    ✅ site-index.json written: /polkadot-mkdocs/polkadot-docs/.ai/site-index.json (pages=150)<br />
    ✅ llms-full.jsonl written: /polkadot-mkdocs/polkadot-docs/.ai/llms-full.jsonl (lines=1360)<br />
    ✅ Generate site-index and full site content files complete<br />
    <br />
    ▶️ Generate category bundles<br />
    ✅ Category bundles written to: /polkadot-mkdocs/polkadot-docs/.ai/categories<br />
    ✅ Generate category bundles complete<br />
    <br />
    🎉 All steps finished successfully.
    <br />

3. Commit the updated `/.ai/` files with your content changes and open your PR as you usually do.
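
Before committing, it can help to confirm that every artifact named in the sample output actually exists. The following is a minimal sketch, not part of the repository's scripts: the helper name `missing_outputs` is hypothetical, and the relative paths are assumptions read off the sample output above, so adjust them if your run reports different locations.

```python
from pathlib import Path

# Assumption: paths relative to the `polkadot-docs` directory,
# copied from the sample pipeline output in this README.
EXPECTED_OUTPUTS = (
    ".ai/pages",
    "llms.txt",
    ".ai/site-index.json",
    ".ai/llms-full.jsonl",
    ".ai/categories",
)


def missing_outputs(docs_root):
    """Return the expected output paths that do not exist under docs_root."""
    root = Path(docs_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).exists()]
```

Calling `missing_outputs(".")` from the `polkadot-docs` directory returns an empty list when all expected files and directories are present.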

## Guide to Scripts and Files

The scripts for generating LLM-related files are located in `polkadot-docs/scripts`, which contains the following:

- **`llms_config.json`**: Single point of configuration for the LLM files.
- **`generate_llms.py`**: Pipeline for generating updated LLM files.
- **`generate_ai_pages.py`**: Creates one resolved Markdown file per documentation page and outputs them to the `/.ai/pages` directory.
- **`generate_llms_txt.py`**: Creates the `llms.txt` site index file using the Markdown file URLs and outputs it to the `/polkadot-docs/` directory.
- **`generate_site_index.py`**: Creates two files related to full-site content:
    - **`llms-full.jsonl`**: Contains the entire documentation site, enhanced with metadata for improved indexing and chunking. It replaces the previously used `llms-full.txt` file.
    - **`site-index.json`**: A lightweight version of the full documentation site that uses content previews rather than full content bodies to keep the file size small.
- **`generate_category_bundles.py`**: Bundles pages with the same category tag together, adds context from the Basics and Reference categories, and outputs the bundles to `/.ai/categories/` as Markdown files.
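
The category bundling described in the last item can be sketched roughly as follows. This is an illustration under stated assumptions, not the logic of `generate_category_bundles.py`: the function name `build_bundles`, the page keys `"title"` and `"categories"`, and the rule that Basics and Reference pages are placed first in every other bundle are all hypothetical.

```python
from collections import defaultdict

# Assumption: pages in these categories are shared context for every other bundle.
CONTEXT_CATEGORIES = ("Basics", "Reference")


def build_bundles(pages):
    """Group page titles by category, placing shared context pages first.

    `pages` is a list of dicts with the hypothetical keys "title" and
    "categories". Returns {category: [titles]} without duplicate titles.
    """
    by_category = defaultdict(list)
    for page in pages:
        for category in page["categories"]:
            by_category[category].append(page["title"])

    # Collect the context pages once, preserving order and skipping repeats.
    context = []
    for category in CONTEXT_CATEGORIES:
        for title in by_category.get(category, []):
            if title not in context:
                context.append(title)

    bundles = {}
    for category, titles in by_category.items():
        if category in CONTEXT_CATEGORIES:
            # Context categories are bundled on their own, unchanged.
            bundles[category] = list(titles)
        else:
            own = [title for title in titles if title not in context]
            bundles[category] = context + own
    return bundles
```

With this ordering, a tool that loads a single bundle reads the foundational pages before the category-specific ones. The real script may order or select context differently.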

## FAQs

### Why are we now using Markdown instead of `.txt` files?

- Markdown marks headings, bullet lists, and other structural elements explicitly, so an LLM can readily pick up on those cues. In comparison, a `.txt` file presents as a flattened sequence of words, where the model has to work harder to identify the structure of the content.
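
As a small illustration of how much structure Markdown exposes, the sketch below recovers a page's heading outline with a few lines of Python. The function name `outline` is hypothetical, and the parser is deliberately simplified: it handles only `#`-style headings and skips fenced code blocks.

```python
import re

# Matches ATX headings such as "## Title" (levels 1 through 6).
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def outline(markdown):
    """Return (level, title) pairs for the headings in a Markdown string."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            # Toggle on fence lines so "#" comments inside code are ignored.
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings
```

The same outline cannot be recovered reliably from a flattened `.txt` file, because the heading markers are no longer present.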

### What do you mean by "resolved Markdown" files?

- Resolved Markdown files have been processed to replace all code snippet and variable placeholders with their intended contents and to strip any HTML comments.
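
The resolution step can be pictured with the following minimal sketch. It is not the code in `generate_ai_pages.py`. The placeholder syntaxes are assumptions: `{{ name }}` for variables and a standalone `--8<-- 'path'` line for snippets (the PyMdown Snippets notation). The function name `resolve_markdown` and the dictionary-based lookups are hypothetical.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumed variable placeholder syntax: {{ name }}
VARIABLE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")
# Assumed snippet placeholder syntax: a line containing only --8<-- 'path'
SNIPPET = re.compile(r"""^[ \t]*--8<--\s+['"]([^'"]+)['"][ \t]*$""", re.MULTILINE)


def resolve_markdown(text, variables, snippets):
    """Strip HTML comments, then inline snippets and variables.

    `variables` maps placeholder names to values and `snippets` maps
    snippet paths to file contents. Unknown placeholders are left as-is.
    """
    text = HTML_COMMENT.sub("", text)
    # Snippets are inlined first so that variables inside them are resolved too.
    text = SNIPPET.sub(lambda m: snippets.get(m.group(1), m.group(0)), text)
    text = VARIABLE.sub(lambda m: str(variables.get(m.group(1), m.group(0))), text)
    return text
```

Leaving unknown placeholders untouched, rather than deleting them, makes unresolved references easy to spot in the generated files.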

### Why use the `/.ai/pages` and `/.ai/categories` directories rather than outputting the files to `/llms-files/` like before?

- The Markdown files are located in a directory that is not included in the site build to prevent doubling the size of the website (one HTML file + one Markdown file for every page). This arrangement also prevents the resolved Markdown from being converted into HTML elements, which would make the files less effective for LLM consumption.
Lines changed: 5155 additions & 5181 deletions
Large diffs are not rendered by default.
Lines changed: 8392 additions & 9518 deletions
Large diffs are not rendered by default.
Lines changed: 9457 additions & 9397 deletions
Large diffs are not rendered by default.
Lines changed: 7498 additions & 7975 deletions
Large diffs are not rendered by default.
Lines changed: 12591 additions & 14538 deletions
Large diffs are not rendered by default.

.ai/categories/polkadot-protocol.md

Lines changed: 11817 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 2219 additions & 2220 deletions
Large diffs are not rendered by default.
Lines changed: 10707 additions & 11398 deletions
Large diffs are not rendered by default.
Lines changed: 16614 additions & 17893 deletions
Large diffs are not rendered by default.
