Commit cf617cc

Implement LLMs.txt feature with API route for markdown content extraction
1 parent f3d38be commit cf617cc

3 files changed

+342
-215
lines changed

LLMS_TXT_FEATURE.md

Lines changed: 126 additions & 70 deletions
@@ -2,22 +2,24 @@

 ## Overview

-This feature allows converting any page on the Sentry documentation site to a plain markdown format by simply appending `llms.txt` to the end of any URL. This is designed to make the documentation more accessible to Large Language Models (LLMs) and other automated tools that work better with plain text markdown content.
+This feature allows converting any page on the Sentry documentation site to a plain markdown format by simply appending `llms.txt` to the end of any URL. The feature extracts the actual page content from the source MDX files and converts it to clean markdown, making the documentation more accessible to Large Language Models (LLMs) and other automated tools.

 ## How It Works

-The feature is implemented using Next.js middleware that intercepts requests ending with `llms.txt` and converts the corresponding page content to markdown format.
+The feature is implemented using Next.js middleware that intercepts requests ending with `llms.txt` and rewrites them to an API route that extracts and converts the actual page content to markdown format.

 ### Implementation Details

 1. **Middleware Interception**: The middleware in `src/middleware.ts` detects URLs ending with `llms.txt`
-2. **Path Processing**: The middleware strips the `llms.txt` suffix to get the original page path
-3. **Content Generation**: A comprehensive markdown representation is generated based on the page type and content
-4. **Response**: The markdown content is returned as plain text with appropriate headers
+2. **Request Rewriting**: The middleware rewrites the request to `/api/llms-txt` with the original path as a parameter
+3. **Content Extraction**: The API route extracts the actual MDX content from source files
+4. **Markdown Conversion**: JSX components and imports are stripped to create clean markdown
+5. **Response**: The full page content is returned as plain text with appropriate headers

 ### File Changes

-- `src/middleware.ts`: Added `handleLlmsTxt` function and URL detection logic
+- `src/middleware.ts`: Added `handleLlmsTxt` function with URL detection and rewriting logic
+- `app/api/llms-txt/route.ts`: New API route that handles content extraction and conversion

 ## Usage Examples

@@ -37,47 +39,63 @@ The feature is implemented using Next.js middleware that intercepts requests end
 - Original URL: `https://docs.sentry.io/product/performance/`
 - LLMs.txt URL: `https://docs.sentry.io/product/performance/llms.txt`

-## Content Structure
+## Content Extraction Process

-The generated markdown content includes:
+The API route performs the following steps to extract content:

-1. **Page Title**: Based on the URL path structure
-2. **Section Overview**: Contextual information about the page type
-3. **Key Information**: Relevant details based on the page category
-4. **Additional Resources**: Link back to the original page
-5. **Metadata**: Generation timestamp and original URL
+1. **Path Resolution**: Determines the original page path from the request
+2. **Document Tree Lookup**: Uses `nodeForPath()` to find the page in the documentation tree
+3. **File System Access**: Searches for source MDX/MD files in multiple possible locations:
+   - Direct file paths (`docs/path/to/page.mdx`)
+   - Index files (`docs/path/to/page/index.mdx`)
+   - Common files for platform documentation
+   - Developer documentation files
+4. **Content Parsing**: Uses `gray-matter` to parse frontmatter and content
+5. **Markdown Cleanup**: Removes JSX components, imports, and expressions
+6. **Response Formatting**: Combines title, content, and metadata
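
Reviewer sketch: the file lookup in step 3 might be expressed as a pure function that lists candidate source paths, which the route could then probe in order. The helper name `candidateSourcePaths`, the `.md` fallback order, and the `root` parameter are assumptions made for illustration. The lookup for platform "common" files is not described in enough detail here to sketch, so it is left out.

```typescript
// Hypothetical helper: lists candidate source files for a page path.
// The root folder and extensions mirror the locations named in the list
// above; the actual route may probe different locations or a different order.
function candidateSourcePaths(pagePath: string, root: string = "docs"): string[] {
  // Normalise "/product/performance/" to "product/performance".
  const trimmed = pagePath.replace(/^\/+|\/+$/g, "");
  if (trimmed === "") {
    // The home page can only be an index file.
    return [`${root}/index.mdx`, `${root}/index.md`];
  }
  const candidates: string[] = [];
  for (const ext of ["mdx", "md"]) {
    candidates.push(`${root}/${trimmed}.${ext}`); // direct file
    candidates.push(`${root}/${trimmed}/index.${ext}`); // index file
  }
  return candidates;
}
```

Passing `"develop-docs"` as `root` would cover the developer documentation case with the same resolution logic.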

-### Content Types
+### Supported Content Types

-#### Home Page (`/`)
-- Welcome message and overview of Sentry documentation
-- Main sections listing (Getting Started, Platforms, Product Guides, etc.)
-- Brief description of Sentry's capabilities
+#### Regular Documentation Pages
+- Extracts content from `docs/**/*.mdx` files
+- Handles both direct files and index files
+- Supports platform-specific common files

-#### Platform Pages (`/platforms/*`)
-- Platform-specific integration guide overview
-- Key topics covered (Installation, Configuration, Error handling, etc.)
-- Step-by-step integration process
-- Link to full documentation
+#### Developer Documentation
+- Extracts content from `develop-docs/**/*.mdx` files
+- Uses the same file resolution logic

-#### API Documentation (`/api/*`)
-- API overview and description
-- Key API categories
-- Authentication information
-- Rate limiting details
-- Link to complete API reference
+#### API Documentation
+- Provides explanatory text for dynamically generated API docs
+- Explains that full API reference is available interactively

-#### Product Features (`/product/*`)
-- Product feature overview
-- Key features list (Error Monitoring, Performance Monitoring, etc.)
-- Usage guidance
-- Link to detailed feature documentation
+#### Home Page
+- Attempts to extract from `docs/index.mdx`
+- Falls back to curated home page content

-#### General Pages
-- Generic documentation page template
-- Content overview
-- Key information points
-- Additional resources section
+## Content Cleanup
+
+The `cleanupMarkdown()` function performs the following cleanup operations:
+
+```typescript
+function cleanupMarkdown(content: string): string {
+  return content
+    // Remove JSX components and their content
+    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*>[\s\S]*?<\/[A-Z][a-zA-Z0-9]*>/g, '')
+    // Remove self-closing JSX components
+    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*\/>/g, '')
+    // Remove import statements
+    .replace(/^import\s+.*$/gm, '')
+    // Remove export statements
+    .replace(/^export\s+.*$/gm, '')
+    // Remove JSX expressions
+    .replace(/\{[^}]*\}/g, '')
+    // Clean up multiple newlines
+    .replace(/\n{3,}/g, '\n\n')
+    // Remove leading/trailing whitespace
+    .trim();
+}
+```
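
Reviewer sketch: the `\{[^}]*\}` rule in `cleanupMarkdown()` runs over the whole string, so it also deletes brace-delimited text inside fenced code examples (an options object passed to an init call, for instance). One possible way to avoid that, shown here as an assumption and not as part of this commit, is to split the content on fences and clean only the text outside them. The names `cleanupOutsideFences` and `stripExpressions` are hypothetical.

```typescript
// Three backticks, built at runtime so this sketch contains no literal fence.
const FENCE = "`".repeat(3);

// Hypothetical variant: apply a cleanup function only outside fenced code blocks.
function cleanupOutsideFences(content: string, cleanup: (text: string) => string): string {
  // Splitting on a capturing group keeps the fenced blocks at odd indices.
  const fenced = new RegExp("(" + FENCE + "[\\s\\S]*?" + FENCE + ")", "g");
  return content
    .split(fenced)
    .map((part, index) => (index % 2 === 1 ? part : cleanup(part)))
    .join("");
}

// Only the JSX-expression rule from cleanupMarkdown(), for illustration.
const stripExpressions = (text: string): string => text.replace(/\{[^}]*\}/g, "");
```

Calling `cleanupOutsideFences(source, stripExpressions)` removes `{props.name}` from prose while leaving code examples untouched. An unterminated fence is treated as ordinary text in this sketch.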

 ## Technical Implementation

@@ -88,21 +106,15 @@ const handleLlmsTxt = async (request: NextRequest) => {
   try {
     // Get the original path by removing llms.txt
     const originalPath = request.nextUrl.pathname.replace(/\/llms\.txt$/, '') || '/';
-    const pathSegments = originalPath.split('/').filter(Boolean);

-    // Generate comprehensive markdown content based on path
-    let markdownContent = generateContentForPath(originalPath, pathSegments);
+    // Rewrite to the API route with the path as a parameter
+    const apiUrl = new URL('/api/llms-txt', request.url);
+    apiUrl.searchParams.set('path', originalPath);

-    return new Response(markdownContent, {
-      status: 200,
-      headers: {
-        'Content-Type': 'text/plain; charset=utf-8',
-        'Cache-Control': 'public, max-age=3600',
-      },
-    });
+    return NextResponse.rewrite(apiUrl);
   } catch (error) {
-    console.error('Error generating llms.txt:', error);
-    return new Response('Error generating markdown content', {
+    console.error('Error handling llms.txt rewrite:', error);
+    return new Response('Error processing request', {
       status: 500,
       headers: {
         'Content-Type': 'text/plain; charset=utf-8',
@@ -112,18 +124,29 @@ const handleLlmsTxt = async (request: NextRequest) => {
 };
 ```
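
Reviewer sketch: the path handling in `handleLlmsTxt` can be checked in isolation from Next.js. The helper below is hypothetical. It reuses the same `replace(...) || '/'` expression and builds the rewrite target as a string, with `encodeURIComponent` standing in for `URLSearchParams`.

```typescript
// Hypothetical helper isolating the path logic from handleLlmsTxt.
function toApiRewritePath(pathname: string): string {
  // Same expression as in the middleware: strip the suffix, fall back to "/".
  const originalPath = pathname.replace(/\/llms\.txt$/, "") || "/";
  // Both encodeURIComponent and URLSearchParams encode "/" as %2F.
  return `/api/llms-txt?path=${encodeURIComponent(originalPath)}`;
}
```

The `|| "/"` fallback matters for the home page: `/llms.txt` reduces to an empty string, which would otherwise be sent as an empty `path` parameter.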

+### API Route Structure
+
+The API route (`app/api/llms-txt/route.ts`) handles:
+- Path parameter validation
+- Document tree navigation
+- File system access for content extraction
+- Error handling for missing or inaccessible content
+- Response formatting with proper headers
+
 ### Response Headers

 - `Content-Type: text/plain; charset=utf-8`: Ensures proper text encoding
 - `Cache-Control: public, max-age=3600`: Caches responses for 1 hour for performance

 ## Benefits

-1. **LLM-Friendly**: Provides clean, structured markdown that's easy for AI models to process
-2. **Automated Access**: Enables automated tools to access documentation content
-3. **Simplified Format**: Removes complex UI elements and focuses on content
-4. **Fast Performance**: Cached responses with minimal processing overhead
-5. **Universal Access**: Works with any page on the documentation site
+1. **Authentic Content**: Extracts actual page content, not summaries
+2. **LLM-Friendly**: Provides clean, structured markdown that's easy for AI models to process
+3. **Automated Access**: Enables automated tools to access documentation content
+4. **Simplified Format**: Removes complex UI elements and focuses on content
+5. **Fast Performance**: Cached responses with efficient file system access
+6. **Universal Access**: Works with any page on the documentation site
+7. **Fallback Handling**: Graceful degradation for pages that can't be processed

 ## Testing

@@ -132,32 +155,65 @@ To test the feature:
 1. Start the development server: `yarn dev`
 2. Visit any documentation page
 3. Append `llms.txt` to the URL
-4. Verify the markdown content is returned
+4. Verify the actual page content is returned in markdown format

 ### Example Test URLs (Development)

-- `http://localhost:3000/llms.txt` - Home page
-- `http://localhost:3000/platforms/javascript/llms.txt` - JavaScript platform
-- `http://localhost:3000/api/llms.txt` - API documentation
-- `http://localhost:3000/product/performance/llms.txt` - Performance features
+- `http://localhost:3000/llms.txt` - Home page content
+- `http://localhost:3000/platforms/javascript/llms.txt` - JavaScript platform documentation
+- `http://localhost:3000/platforms/javascript/install/llms.txt` - JavaScript installation guide
+- `http://localhost:3000/product/performance/llms.txt` - Performance monitoring documentation
+
+### Expected Output Format
+
+```markdown
+# Page Title
+
+[Actual page content converted to markdown]
+
+---
+
+**Original URL**: https://docs.sentry.io/original/path
+**Generated**: 2024-01-01T12:00:00.000Z
+
+*This is the full page content converted to markdown format.*
+```
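
Reviewer sketch: the layout above could be produced by a small formatter like the one below. The function name and parameter list are assumptions. Only the output shape is taken from the documented format.

```typescript
// Hypothetical formatter matching the expected output layout shown above.
function formatLlmsTxt(
  title: string,
  body: string,
  originalUrl: string,
  generated: Date
): string {
  return [
    `# ${title}`,
    "",
    body.trim(),
    "",
    "---",
    "",
    `**Original URL**: ${originalUrl}`,
    // toISOString() yields the same timestamp shape as the documented example.
    `**Generated**: ${generated.toISOString()}`,
    "",
    "*This is the full page content converted to markdown format.*",
  ].join("\n");
}
```

Taking the timestamp as a parameter, instead of calling `new Date()` inside, keeps the function deterministic and easy to test.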
+
+## Error Handling
+
+The feature includes comprehensive error handling:
+
+- **404 Not Found**: When the requested page doesn't exist
+- **500 Internal Server Error**: When content processing fails
+- **400 Bad Request**: When path parameter is missing
+- **Graceful Fallbacks**: When source files aren't accessible
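
Reviewer sketch: the status codes listed above map onto a simple decision sequence. The code below uses a plain result object so it runs without Next.js; a real route handler would return a `Response`. The `loadMarkdown` callback, the result type, and the message strings for 400 and 404 are assumptions for illustration.

```typescript
// Plain result type so the sketch has no framework dependency.
interface LlmsTxtResult {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// `loadMarkdown` is a hypothetical loader: it returns null for an unknown
// page and may throw if reading or converting the source fails.
function respondLlmsTxt(
  path: string | null,
  loadMarkdown: (path: string) => string | null
): LlmsTxtResult {
  const headers: Record<string, string> = {
    "Content-Type": "text/plain; charset=utf-8",
  };
  if (!path) {
    return { status: 400, headers, body: "Missing path parameter" };
  }
  try {
    const markdown = loadMarkdown(path);
    if (markdown === null) {
      return { status: 404, headers, body: "Page not found" };
    }
    // Only successful responses carry the one-hour cache header in this sketch.
    return {
      status: 200,
      headers: { ...headers, "Cache-Control": "public, max-age=3600" },
      body: markdown,
    };
  } catch {
    return { status: 500, headers, body: "Error processing request" };
  }
}
```

Leaving `Cache-Control` off error responses is a design choice of this sketch, so that a transient failure is not cached for an hour. Whether the committed route does the same is not shown in this diff.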
+
+## Performance Considerations
+
+- **Caching**: Responses are cached for 1 hour to reduce server load
+- **File System Access**: Direct file system reads for optimal performance
+- **Efficient Processing**: Minimal regex operations for content cleanup
+- **Error Recovery**: Fast fallback responses when content isn't available

 ## Future Enhancements

 Potential improvements for the feature:

-1. **Real Content Extraction**: Integration with the actual MDX content processing
-2. **Enhanced Formatting**: Better markdown structure and formatting
-3. **Custom Templates**: Page-specific markdown templates
-4. **Content Optimization**: LLM-optimized content structure
-5. **Recursive Processing**: Full page content extraction and processing
+1. **Enhanced JSX Cleanup**: More sophisticated removal of React components
+2. **Code Block Preservation**: Better handling of code examples
+3. **Link Resolution**: Convert relative links to absolute URLs
+4. **Image Handling**: Process and reference images appropriately
+5. **Table of Contents**: Generate TOC from headings
+6. **Metadata Extraction**: Include more frontmatter data in output

 ## Maintenance

-- The feature is self-contained in the middleware
-- Content templates can be updated in the `handleLlmsTxt` function
+- The feature is self-contained with clear separation of concerns
+- Content extraction logic can be enhanced in the API route
+- Cleanup patterns can be updated in the `cleanupMarkdown()` function
 - Performance can be monitored through response times and caching metrics
-- Error handling is built-in with fallback responses
+- Error handling provides clear debugging information

 ---

-**Note**: This is a simplified implementation that provides structured markdown summaries. For complete content access, users should visit the original documentation pages.
+**Note**: This feature extracts the actual page content from source MDX files and converts it to clean markdown format, making it ideal for LLM consumption and automated processing.
