Commit 81d3771

Refactor llms.txt feature with dynamic path routing and improved content extraction
1 parent cf617cc commit 81d3771

3 files changed: 87 additions and 186 deletions
LLMS_TXT_FEATURE.md

Lines changed: 73 additions & 170 deletions
@@ -11,165 +11,66 @@ The feature is implemented using Next.js middleware that intercepts requests end
 ### Implementation Details
 
 1. **Middleware Interception**: The middleware in `src/middleware.ts` detects URLs ending with `llms.txt`
-2. **Request Rewriting**: The middleware rewrites the request to `/api/llms-txt` with the original path as a parameter
-3. **Content Extraction**: The API route extracts the actual MDX content from source files
-4. **Markdown Conversion**: JSX components and imports are stripped to create clean markdown
-5. **Response**: The full page content is returned as plain text with appropriate headers
+2. **Request Rewriting**: The middleware rewrites the request to `/api/llms-txt/[...path]` preserving the original path structure
+3. **Content Extraction**: The API route reads the actual MDX/markdown source files from the file system
+4. **Content Processing**: Uses `gray-matter` to parse frontmatter and extract the raw content
+5. **Markdown Cleanup**: Strips JSX components, imports, and expressions to create clean markdown
+6. **Response Generation**: Returns the cleaned content as plain text with appropriate headers
 
-### File Changes
+### Architecture
 
-- `src/middleware.ts`: Added `handleLlmsTxt` function with URL detection and rewriting logic
-- `app/api/llms-txt/route.ts`: New API route that handles content extraction and conversion
+```
+URL: /platforms/javascript/guides/react/llms.txt
+↓ (Middleware intercepts)
+Rewrite: /api/llms-txt/platforms/javascript/guides/react
+↓ (API route processes)
+1. Extract path segments: ['platforms', 'javascript', 'guides', 'react']
+2. Locate source file: docs/platforms/javascript/guides/react.mdx
+3. Read and parse with gray-matter
+4. Clean JSX/imports from markdown
+5. Return plain markdown content
+```
 
 ## Usage Examples
 
 ### Basic Usage
-- Original URL: `https://docs.sentry.io/platforms/javascript/`
-- LLMs.txt URL: `https://docs.sentry.io/platforms/javascript/llms.txt`
-
-### Home Page
-- Original URL: `https://docs.sentry.io/`
-- LLMs.txt URL: `https://docs.sentry.io/llms.txt`
-
-### API Documentation
-- Original URL: `https://docs.sentry.io/api/events/`
-- LLMs.txt URL: `https://docs.sentry.io/api/events/llms.txt`
-
-### Product Features
-- Original URL: `https://docs.sentry.io/product/performance/`
-- LLMs.txt URL: `https://docs.sentry.io/product/performance/llms.txt`
-
-## Content Extraction Process
-
-The API route performs the following steps to extract content:
-
-1. **Path Resolution**: Determines the original page path from the request
-2. **Document Tree Lookup**: Uses `nodeForPath()` to find the page in the documentation tree
-3. **File System Access**: Searches for source MDX/MD files in multiple possible locations:
-   - Direct file paths (`docs/path/to/page.mdx`)
-   - Index files (`docs/path/to/page/index.mdx`)
-   - Common files for platform documentation
-   - Developer documentation files
-4. **Content Parsing**: Uses `gray-matter` to parse frontmatter and content
-5. **Markdown Cleanup**: Removes JSX components, imports, and expressions
-6. **Response Formatting**: Combines title, content, and metadata
-
-### Supported Content Types
-
-#### Regular Documentation Pages
-- Extracts content from `docs/**/*.mdx` files
-- Handles both direct files and index files
-- Supports platform-specific common files
-
-#### Developer Documentation
-- Extracts content from `develop-docs/**/*.mdx` files
-- Uses the same file resolution logic
-
-#### API Documentation
-- Provides explanatory text for dynamically generated API docs
-- Explains that full API reference is available interactively
-
-#### Home Page
-- Attempts to extract from `docs/index.mdx`
-- Falls back to curated home page content
-
-## Content Cleanup
-
-The `cleanupMarkdown()` function performs the following cleanup operations:
-
-```typescript
-function cleanupMarkdown(content: string): string {
-  return content
-    // Remove JSX components and their content
-    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*>[\s\S]*?<\/[A-Z][a-zA-Z0-9]*>/g, '')
-    // Remove self-closing JSX components
-    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*\/>/g, '')
-    // Remove import statements
-    .replace(/^import\s+.*$/gm, '')
-    // Remove export statements
-    .replace(/^export\s+.*$/gm, '')
-    // Remove JSX expressions
-    .replace(/\{[^}]*\}/g, '')
-    // Clean up multiple newlines
-    .replace(/\n{3,}/g, '\n\n')
-    // Remove leading/trailing whitespace
-    .trim();
-}
 ```
-
-## Technical Implementation
-
-### Middleware Function
-
-```typescript
-const handleLlmsTxt = async (request: NextRequest) => {
-  try {
-    // Get the original path by removing llms.txt
-    const originalPath = request.nextUrl.pathname.replace(/\/llms\.txt$/, '') || '/';
-
-    // Rewrite to the API route with the path as a parameter
-    const apiUrl = new URL('/api/llms-txt', request.url);
-    apiUrl.searchParams.set('path', originalPath);
-
-    return NextResponse.rewrite(apiUrl);
-  } catch (error) {
-    console.error('Error handling llms.txt rewrite:', error);
-    return new Response('Error processing request', {
-      status: 500,
-      headers: {
-        'Content-Type': 'text/plain; charset=utf-8',
-      },
-    });
-  }
-};
+Original URL: https://docs.sentry.io/platforms/javascript/
+LLMs.txt URL: https://docs.sentry.io/platforms/javascript/llms.txt
 ```
 
-### API Route Structure
-
-The API route (`app/api/llms-txt/route.ts`) handles:
-- Path parameter validation
-- Document tree navigation
-- File system access for content extraction
-- Error handling for missing or inaccessible content
-- Response formatting with proper headers
-
-### Response Headers
-
-- `Content-Type: text/plain; charset=utf-8`: Ensures proper text encoding
-- `Cache-Control: public, max-age=3600`: Caches responses for 1 hour for performance
-
-## Benefits
-
-1. **Authentic Content**: Extracts actual page content, not summaries
-2. **LLM-Friendly**: Provides clean, structured markdown that's easy for AI models to process
-3. **Automated Access**: Enables automated tools to access documentation content
-4. **Simplified Format**: Removes complex UI elements and focuses on content
-5. **Fast Performance**: Cached responses with efficient file system access
-6. **Universal Access**: Works with any page on the documentation site
-7. **Fallback Handling**: Graceful degradation for pages that can't be processed
-
-## Testing
-
-To test the feature:
+### Deep Navigation
+```
+Original URL: https://docs.sentry.io/platforms/javascript/guides/react/configuration/
+LLMs.txt URL: https://docs.sentry.io/platforms/javascript/guides/react/configuration/llms.txt
+```
 
-1. Start the development server: `yarn dev`
-2. Visit any documentation page
-3. Append `llms.txt` to the URL
-4. Verify the actual page content is returned in markdown format
+### Home Page
+```
+Original URL: https://docs.sentry.io/
+LLMs.txt URL: https://docs.sentry.io/llms.txt
+```
 
-### Example Test URLs (Development)
+## Content Extraction Features
 
-- `http://localhost:3000/llms.txt` - Home page content
-- `http://localhost:3000/platforms/javascript/llms.txt` - JavaScript platform documentation
-- `http://localhost:3000/platforms/javascript/install/llms.txt` - JavaScript installation guide
-- `http://localhost:3000/product/performance/llms.txt` - Performance monitoring documentation
+### Source File Detection
+- **Primary locations**: `docs/{path}.mdx`, `docs/{path}/index.mdx`
+- **Common files**: For platform docs, also checks `docs/platforms/{sdk}/common/` directory
+- **Multiple formats**: Supports both `.mdx` and `.md` files
+- **Fallback handling**: Graceful degradation when source files aren't found
 
-### Expected Output Format
+### Content Processing
+- **Frontmatter parsing**: Extracts titles and metadata using `gray-matter`
+- **JSX removal**: Strips React components that don't translate to markdown
+- **Import cleanup**: Removes JavaScript import/export statements
+- **Expression removal**: Cleans JSX expressions `{...}`
+- **Whitespace normalization**: Removes excessive newlines and spacing
 
+### Response Format
 ```markdown
 # Page Title
 
-[Actual page content converted to markdown]
+[Full cleaned markdown content]
 
 ---
 
@@ -179,41 +80,43 @@ To test the feature:
 *This is the full page content converted to markdown format.*
 ```
 
-## Error Handling
-
-The feature includes comprehensive error handling:
+## File Structure
 
-- **404 Not Found**: When the requested page doesn't exist
-- **500 Internal Server Error**: When content processing fails
-- **400 Bad Request**: When path parameter is missing
-- **Graceful Fallbacks**: When source files aren't accessible
+```
+app/
+├── api/
+│   └── llms-txt/
+│       └── [...path]/
+│           └── route.ts          # Dynamic API route handler
+src/
+├── middleware.ts                 # URL interception and rewriting
+LLMS_TXT_FEATURE.md               # This documentation
+```
 
-## Performance Considerations
+## Error Handling
 
-- **Caching**: Responses are cached for 1 hour to reduce server load
-- **File System Access**: Direct file system reads for optimal performance
-- **Efficient Processing**: Minimal regex operations for content cleanup
-- **Error Recovery**: Fast fallback responses when content isn't available
+- **404 errors**: When pages don't exist in the document tree
+- **500 errors**: For file system or processing errors
+- **Graceful fallbacks**: Default content when source files can't be accessed
+- **Logging**: Error details logged to console for debugging
 
-## Future Enhancements
+## Performance Considerations
 
-Potential improvements for the feature:
+- **Caching**: Responses cached for 1 hour (`max-age=3600`)
+- **File system access**: Direct file reads for better performance
+- **Error boundaries**: Prevents crashes from affecting other routes
 
-1. **Enhanced JSX Cleanup**: More sophisticated removal of React components
-2. **Code Block Preservation**: Better handling of code examples
-3. **Link Resolution**: Convert relative links to absolute URLs
-4. **Image Handling**: Process and reference images appropriately
-5. **Table of Contents**: Generate TOC from headings
-6. **Metadata Extraction**: Include more frontmatter data in output
+## Testing
 
-## Maintenance
+Test the feature by appending `llms.txt` to any documentation URL:
 
-- The feature is self-contained with clear separation of concerns
-- Content extraction logic can be enhanced in the API route
-- Cleanup patterns can be updated in the `cleanupMarkdown()` function
-- Performance can be monitored through response times and caching metrics
-- Error handling provides clear debugging information
+1. Visit any docs page (e.g., `/platforms/javascript/`)
+2. Add `llms.txt` to the end: `/platforms/javascript/llms.txt`
+3. Verify you receive plain markdown content instead of HTML
 
----
+## Implementation Notes
 
-**Note**: This feature extracts the actual page content from source MDX files and converts it to clean markdown format, making it ideal for LLM consumption and automated processing.
+- The feature works with both regular documentation and developer documentation
+- API documentation (dynamically generated) gets placeholder content
+- Common platform files are automatically detected and used when appropriate
+- The middleware preserves URL structure while routing to the appropriate API endpoint
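
The "Markdown Cleanup" step described in this file can be sketched as a single chain of regex replacements. The patterns below follow the `cleanupMarkdown()` listing that this commit removes from the documentation; the import and export patterns are merged into one, and the `sample` input is made up for illustration. Whether the route handler still uses exactly these patterns is not shown in the diff, so treat this as an assumption.

```typescript
// Sketch of the cleanup step, based on the removed cleanupMarkdown() listing.
function cleanupMarkdown(content: string): string {
  return content
    // Remove paired JSX components (capitalized tags) and everything inside them
    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*>[\s\S]*?<\/[A-Z][a-zA-Z0-9]*>/g, '')
    // Remove self-closing JSX components
    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*\/>/g, '')
    // Remove import and export lines
    .replace(/^(?:import|export)\s+.*$/gm, '')
    // Remove single-level JSX expressions such as {props.name}
    .replace(/\{[^}]*\}/g, '')
    // Collapse runs of three or more newlines into one blank line
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

// Hypothetical MDX input, not taken from the Sentry docs
const sample = [
  "import {Alert} from 'components/alert';",
  '',
  '# Install',
  '',
  '<Alert level="info">Heads up</Alert>',
  '',
  'Run the installer.',
  '',
  '<PlatformContent includePath="install" />',
].join('\n');

const cleaned = cleanupMarkdown(sample);
console.log(cleaned);
```

Two limitations follow from the patterns themselves: the `{...}` rule also strips braces inside fenced code samples, and a self-closing component that appears before a paired component can be swallowed by the first pattern, together with any prose between them.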

app/api/llms-txt/route.ts renamed to app/api/llms-txt/[...path]/route.ts

Lines changed: 7 additions & 13 deletions
@@ -1,23 +1,18 @@
 import { NextRequest, NextResponse } from 'next/server';
 import { nodeForPath, getDocsRootNode } from 'sentry-docs/docTree';
-import { getFileBySlugWithCache, getDocsFrontMatter, getDevDocsFrontMatter } from 'sentry-docs/mdx';
+import { getFileBySlugWithCache } from 'sentry-docs/mdx';
 import { isDeveloperDocs } from 'sentry-docs/isDeveloperDocs';
-import { stripVersion } from 'sentry-docs/versioning';
 import matter from 'gray-matter';
 import fs from 'fs';
 import path from 'path';
 
-export async function GET(request: NextRequest) {
-  const { searchParams } = new URL(request.url);
-  const pathParam = searchParams.get('path');
-
-  if (!pathParam) {
-    return new NextResponse('Path parameter is required', { status: 400 });
-  }
-
+export async function GET(
+  request: NextRequest,
+  { params }: { params: Promise<{ path: string[] }> }
+) {
   try {
-    // Parse the path - it should be the original path without llms.txt
-    const pathSegments = pathParam.split('/').filter(Boolean);
+    const resolvedParams = await params;
+    const pathSegments = resolvedParams.path || [];
 
     // Get the document tree
     const rootNode = await getDocsRootNode();
@@ -85,7 +80,6 @@ Sentry helps developers monitor and fix crashes in real time. The platform suppo
       }
     } else if (pathSegments[0] === 'api' && pathSegments.length > 1) {
       // Handle API docs - these are generated from OpenAPI specs
-      // For now, provide a message explaining this
       pageTitle = `API Documentation: ${pathSegments.slice(1).join(' / ')}`;
       pageContent = `# ${pageTitle}
9185

src/middleware.ts

Lines changed: 7 additions & 3 deletions
@@ -71,10 +71,14 @@ const handleLlmsTxt = async (request: NextRequest) => {
   try {
     // Get the original path by removing llms.txt
     const originalPath = request.nextUrl.pathname.replace(/\/llms\.txt$/, '') || '/';
+    const pathSegments = originalPath.split('/').filter(Boolean);
 
-    // Rewrite to the API route with the path as a parameter
-    const apiUrl = new URL('/api/llms-txt', request.url);
-    apiUrl.searchParams.set('path', originalPath);
+    // Rewrite to the API route with path segments
+    const apiPath = pathSegments.length > 0
+      ? `/api/llms-txt/${pathSegments.join('/')}`
+      : '/api/llms-txt';
+
+    const apiUrl = new URL(apiPath, request.url);
 
     return NextResponse.rewrite(apiUrl);
   } catch (error) {
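
The path mapping in this hunk can be isolated as a pure function, which makes it easy to check without running Next.js. `llmsTxtRewriteTarget` is a hypothetical name, and the `/llms.txt` suffix check is an assumption, since the middleware's detection logic is outside the hunk.

```typescript
// Returns the API path a request should be rewritten to, or null when the
// pathname is not an llms.txt request. The mapping mirrors the hunk above.
function llmsTxtRewriteTarget(pathname: string): string | null {
  // Assumed detection rule: only paths ending in /llms.txt are rewritten
  if (!/\/llms\.txt$/.test(pathname)) return null;

  // Strip the suffix; the home page collapses to '/'
  const originalPath = pathname.replace(/\/llms\.txt$/, '') || '/';
  const pathSegments = originalPath.split('/').filter(Boolean);

  return pathSegments.length > 0
    ? `/api/llms-txt/${pathSegments.join('/')}`
    : '/api/llms-txt';
}

console.log(llmsTxtRewriteTarget('/platforms/javascript/guides/react/llms.txt'));
console.log(llmsTxtRewriteTarget('/llms.txt'));
```

The home-page case rewrites to `/api/llms-txt` with no segments. According to the Next.js routing documentation, a required catch-all segment such as `[...path]` does not match the bare parent path, so that URL may need an optional catch-all (`[[...path]]`) or a separate route; the commit does not show how this case is handled.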
