Refactor llms.txt feature with dynamic path routing and improved content extraction
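
The middleware now maps a request path ending in `llms.txt` onto the dynamic route instead of passing the path as a query parameter. A minimal sketch of that mapping, pulled out into a standalone function for illustration only (the name `toLlmsTxtApiPath` is hypothetical and not part of this commit; the logic mirrors the updated `handleLlmsTxt`):

```typescript
// Hypothetical helper mirroring the rewrite logic in handleLlmsTxt.
function toLlmsTxtApiPath(pathname: string): string {
  // Drop the trailing "/llms.txt"; the home page collapses to "/".
  const originalPath = pathname.replace(/\/llms\.txt$/, '') || '/';
  // Split into non-empty segments so leading and trailing slashes are ignored.
  const pathSegments = originalPath.split('/').filter(Boolean);
  return pathSegments.length > 0
    ? `/api/llms-txt/${pathSegments.join('/')}`
    : '/api/llms-txt';
}
```

For the home page this yields the bare `/api/llms-txt`. In Next.js a `[...path]` catch-all segment does not match a URL with zero segments, so that case may need an optional catch-all (`[[...path]]`) or a separate route; this is worth verifying against the deployed behavior.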

cursoragent · cursoragent · commit 81d37715056f · 2025-06-10T21:40:26.000Z
diff --git a/LLMS_TXT_FEATURE.md b/LLMS_TXT_FEATURE.md
@@ -11,165 +11,66 @@ The feature is implemented using Next.js middleware that intercepts requests end
 ### Implementation Details
 
 1. **Middleware Interception**: The middleware in `src/middleware.ts` detects URLs ending with `llms.txt`
-2. **Request Rewriting**: The middleware rewrites the request to `/api/llms-txt` with the original path as a parameter
-3. **Content Extraction**: The API route extracts the actual MDX content from source files
-4. **Markdown Conversion**: JSX components and imports are stripped to create clean markdown
-5. **Response**: The full page content is returned as plain text with appropriate headers
+2. **Request Rewriting**: The middleware rewrites the request to `/api/llms-txt/[...path]`, preserving the original path structure
+3. **Content Extraction**: The API route reads the actual MDX/markdown source files from the file system
+4. **Content Processing**: Uses `gray-matter` to parse frontmatter and extract the raw content
+5. **Markdown Cleanup**: Strips JSX components, imports, and expressions to create clean markdown
+6. **Response Generation**: Returns the cleaned content as plain text with appropriate headers
 
-### File Changes
+### Architecture
 
-- `src/middleware.ts`: Added `handleLlmsTxt` function with URL detection and rewriting logic
-- `app/api/llms-txt/route.ts`: New API route that handles content extraction and conversion
+```
+URL: /platforms/javascript/guides/react/llms.txt
+  ↓ (Middleware intercepts)
+Rewrite: /api/llms-txt/platforms/javascript/guides/react
+  ↓ (API route processes)
+1. Extract path segments: ['platforms', 'javascript', 'guides', 'react']
+2. Locate source file: docs/platforms/javascript/guides/react.mdx
+3. Read and parse with gray-matter
+4. Clean JSX/imports from markdown
+5. Return plain markdown content
+```
 
 ## Usage Examples
 
 ### Basic Usage
-- Original URL: `https://docs.sentry.io/platforms/javascript/`
-- LLMs.txt URL: `https://docs.sentry.io/platforms/javascript/llms.txt`
-
-### Home Page
-- Original URL: `https://docs.sentry.io/`
-- LLMs.txt URL: `https://docs.sentry.io/llms.txt`
-
-### API Documentation
-- Original URL: `https://docs.sentry.io/api/events/`
-- LLMs.txt URL: `https://docs.sentry.io/api/events/llms.txt`
-
-### Product Features
-- Original URL: `https://docs.sentry.io/product/performance/`
-- LLMs.txt URL: `https://docs.sentry.io/product/performance/llms.txt`
-
-## Content Extraction Process
-
-The API route performs the following steps to extract content:
-
-1. **Path Resolution**: Determines the original page path from the request
-2. **Document Tree Lookup**: Uses `nodeForPath()` to find the page in the documentation tree
-3. **File System Access**: Searches for source MDX/MD files in multiple possible locations:
-   - Direct file paths (`docs/path/to/page.mdx`)
-   - Index files (`docs/path/to/page/index.mdx`)
-   - Common files for platform documentation
-   - Developer documentation files
-4. **Content Parsing**: Uses `gray-matter` to parse frontmatter and content
-5. **Markdown Cleanup**: Removes JSX components, imports, and expressions
-6. **Response Formatting**: Combines title, content, and metadata
-
-### Supported Content Types
-
-#### Regular Documentation Pages
-- Extracts content from `docs/**/*.mdx` files
-- Handles both direct files and index files
-- Supports platform-specific common files
-
-#### Developer Documentation
-- Extracts content from `develop-docs/**/*.mdx` files
-- Uses the same file resolution logic
-
-#### API Documentation
-- Provides explanatory text for dynamically generated API docs
-- Explains that full API reference is available interactively
-
-#### Home Page
-- Attempts to extract from `docs/index.mdx`
-- Falls back to curated home page content
-
-## Content Cleanup
-
-The `cleanupMarkdown()` function performs the following cleanup operations:
-
-```typescript
-function cleanupMarkdown(content: string): string {
-  return content
-    // Remove JSX components and their content
-    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*>[\s\S]*?<\/[A-Z][a-zA-Z0-9]*>/g, '')
-    // Remove self-closing JSX components
-    .replace(/<[A-Z][a-zA-Z0-9]*[^>]*\/>/g, '')
-    // Remove import statements
-    .replace(/^import\s+.*$/gm, '')
-    // Remove export statements
-    .replace(/^export\s+.*$/gm, '')
-    // Remove JSX expressions
-    .replace(/\{[^}]*\}/g, '')
-    // Clean up multiple newlines
-    .replace(/\n{3,}/g, '\n\n')
-    // Remove leading/trailing whitespace
-    .trim();
-}
 ```
-
-## Technical Implementation
-
-### Middleware Function
-
-```typescript
-const handleLlmsTxt = async (request: NextRequest) => {
-  try {
-    // Get the original path by removing llms.txt
-    const originalPath = request.nextUrl.pathname.replace(/\/llms\.txt$/, '') || '/';
-    
-    // Rewrite to the API route with the path as a parameter
-    const apiUrl = new URL('/api/llms-txt', request.url);
-    apiUrl.searchParams.set('path', originalPath);
-    
-    return NextResponse.rewrite(apiUrl);
-  } catch (error) {
-    console.error('Error handling llms.txt rewrite:', error);
-    return new Response('Error processing request', {
-      status: 500,
-      headers: {
-        'Content-Type': 'text/plain; charset=utf-8',
-      },
-    });
-  }
-};
+Original URL: https://docs.sentry.io/platforms/javascript/
+LLMs.txt URL: https://docs.sentry.io/platforms/javascript/llms.txt
 ```
 
-### API Route Structure
-
-The API route (`app/api/llms-txt/route.ts`) handles:
-- Path parameter validation
-- Document tree navigation
-- File system access for content extraction
-- Error handling for missing or inaccessible content
-- Response formatting with proper headers
-
-### Response Headers
-
-- `Content-Type: text/plain; charset=utf-8`: Ensures proper text encoding
-- `Cache-Control: public, max-age=3600`: Caches responses for 1 hour for performance
-
-## Benefits
-
-1. **Authentic Content**: Extracts actual page content, not summaries
-2. **LLM-Friendly**: Provides clean, structured markdown that's easy for AI models to process
-3. **Automated Access**: Enables automated tools to access documentation content
-4. **Simplified Format**: Removes complex UI elements and focuses on content
-5. **Fast Performance**: Cached responses with efficient file system access
-6. **Universal Access**: Works with any page on the documentation site
-7. **Fallback Handling**: Graceful degradation for pages that can't be processed
-
-## Testing
-
-To test the feature:
+### Deep Navigation
+```
+Original URL: https://docs.sentry.io/platforms/javascript/guides/react/configuration/
+LLMs.txt URL: https://docs.sentry.io/platforms/javascript/guides/react/configuration/llms.txt
+```
 
-1. Start the development server: `yarn dev`
-2. Visit any documentation page
-3. Append `llms.txt` to the URL
-4. Verify the actual page content is returned in markdown format
+### Home Page
+```
+Original URL: https://docs.sentry.io/
+LLMs.txt URL: https://docs.sentry.io/llms.txt
+```
 
-### Example Test URLs (Development)
+## Content Extraction Features
 
-- `http://localhost:3000/llms.txt` - Home page content
-- `http://localhost:3000/platforms/javascript/llms.txt` - JavaScript platform documentation
-- `http://localhost:3000/platforms/javascript/install/llms.txt` - JavaScript installation guide
-- `http://localhost:3000/product/performance/llms.txt` - Performance monitoring documentation
+### Source File Detection
+- **Primary locations**: `docs/{path}.mdx`, `docs/{path}/index.mdx`
+- **Common files**: For platform docs, also checks `docs/platforms/{sdk}/common/` directory
+- **Multiple formats**: Supports both `.mdx` and `.md` files
+- **Fallback handling**: Graceful degradation when source files aren't found
 
-### Expected Output Format
+### Content Processing
+- **Frontmatter parsing**: Extracts titles and metadata using `gray-matter`
+- **JSX removal**: Strips React components that don't translate to markdown
+- **Import cleanup**: Removes JavaScript import/export statements
+- **Expression removal**: Cleans JSX expressions `{...}`
+- **Whitespace normalization**: Removes excessive newlines and spacing
 
+### Response Format
 ```markdown
 # Page Title
 
-[Actual page content converted to markdown]
+[Full cleaned markdown content]
 
 ---
 
@@ -179,41 +80,43 @@ To test the feature:
 *This is the full page content converted to markdown format.*
 ```
 
-## Error Handling
-
-The feature includes comprehensive error handling:
+## File Structure
 
-- **404 Not Found**: When the requested page doesn't exist
-- **500 Internal Server Error**: When content processing fails
-- **400 Bad Request**: When path parameter is missing
-- **Graceful Fallbacks**: When source files aren't accessible
+```
+app/
+├── api/
+│   └── llms-txt/
+│       └── [...path]/
+│           └── route.ts          # Dynamic API route handler
+src/
+├── middleware.ts                 # URL interception and rewriting
+LLMS_TXT_FEATURE.md              # This documentation
+```
 
-## Performance Considerations
+## Error Handling
 
-- **Caching**: Responses are cached for 1 hour to reduce server load
-- **File System Access**: Direct file system reads for optimal performance
-- **Efficient Processing**: Minimal regex operations for content cleanup
-- **Error Recovery**: Fast fallback responses when content isn't available
+- **404 errors**: When pages don't exist in the document tree
+- **500 errors**: For file system or processing errors
+- **Graceful fallbacks**: Default content when source files can't be accessed
+- **Logging**: Error details logged to console for debugging
 
-## Future Enhancements
+## Performance Considerations
 
-Potential improvements for the feature:
+- **Caching**: Responses cached for 1 hour (`max-age=3600`)
+- **File system access**: Direct file reads for better performance
+- **Error boundaries**: Prevents crashes from affecting other routes
 
-1. **Enhanced JSX Cleanup**: More sophisticated removal of React components
-2. **Code Block Preservation**: Better handling of code examples
-3. **Link Resolution**: Convert relative links to absolute URLs
-4. **Image Handling**: Process and reference images appropriately
-5. **Table of Contents**: Generate TOC from headings
-6. **Metadata Extraction**: Include more frontmatter data in output
+## Testing
 
-## Maintenance
+Test the feature by appending `llms.txt` to any documentation URL:
 
-- The feature is self-contained with clear separation of concerns
-- Content extraction logic can be enhanced in the API route
-- Cleanup patterns can be updated in the `cleanupMarkdown()` function
-- Performance can be monitored through response times and caching metrics
-- Error handling provides clear debugging information
+1. Visit any docs page (e.g., `/platforms/javascript/`)
+2. Add `llms.txt` to the end: `/platforms/javascript/llms.txt`
+3. Verify you receive plain markdown content instead of HTML
 
----
+## Implementation Notes
 
-**Note**: This feature extracts the actual page content from source MDX files and converts it to clean markdown format, making it ideal for LLM consumption and automated processing.
+- The feature works with both regular documentation and developer documentation
+- API documentation (dynamically generated) gets placeholder content
+- Common platform files are automatically detected and used when appropriate
+- The middleware preserves URL structure while routing to the appropriate API endpoint
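
The source file detection described in the documentation above can be sketched as follows. This is an illustration under stated assumptions, not the route's actual code: the helper name `candidateSourceFiles`, the `root` parameter, and the lookup order are all hypothetical, and the `common/` directory check for platform docs is left out.

```typescript
// Hypothetical helper: lists the source files a lookup might try for a page,
// following the documented patterns docs/{path}.mdx and docs/{path}/index.mdx,
// with .md as the alternative extension.
function candidateSourceFiles(pathSegments: string[], root = 'docs'): string[] {
  const base = [root, ...pathSegments].join('/');
  const candidates: string[] = [];
  for (const ext of ['.mdx', '.md']) {
    // A direct file only makes sense when there is at least one segment.
    if (pathSegments.length > 0) {
      candidates.push(`${base}${ext}`);
    }
    candidates.push(`${base}/index${ext}`);
  }
  return candidates;
}
```

With `['platforms', 'javascript']` the first candidate is `docs/platforms/javascript.mdx`; with an empty segment list only the index files under `docs/` are returned.
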
diff --git a/app/api/llms-txt/[...path]/route.ts b/app/api/llms-txt/[...path]/route.ts
@@ -1,23 +1,18 @@
 import { NextRequest, NextResponse } from 'next/server';
 import { nodeForPath, getDocsRootNode } from 'sentry-docs/docTree';
-import { getFileBySlugWithCache, getDocsFrontMatter, getDevDocsFrontMatter } from 'sentry-docs/mdx';
+import { getFileBySlugWithCache } from 'sentry-docs/mdx';
 import { isDeveloperDocs } from 'sentry-docs/isDeveloperDocs';
-import { stripVersion } from 'sentry-docs/versioning';
 import matter from 'gray-matter';
 import fs from 'fs';
 import path from 'path';
 
-export async function GET(request: NextRequest) {
-  const { searchParams } = new URL(request.url);
-  const pathParam = searchParams.get('path');
-  
-  if (!pathParam) {
-    return new NextResponse('Path parameter is required', { status: 400 });
-  }
-
+export async function GET(
+  request: NextRequest,
+  { params }: { params: Promise<{ path: string[] }> }
+) {
   try {
-    // Parse the path - it should be the original path without llms.txt
-    const pathSegments = pathParam.split('/').filter(Boolean);
+    const resolvedParams = await params;
+    const pathSegments = resolvedParams.path || [];
     
     // Get the document tree
     const rootNode = await getDocsRootNode();
@@ -85,7 +80,6 @@ Sentry helps developers monitor and fix crashes in real time. The platform suppo
       }
     } else if (pathSegments[0] === 'api' && pathSegments.length > 1) {
       // Handle API docs - these are generated from OpenAPI specs
-      // For now, provide a message explaining this
       pageTitle = `API Documentation: ${pathSegments.slice(1).join(' / ')}`;
       pageContent = `# ${pageTitle}
 
diff --git a/src/middleware.ts b/src/middleware.ts
@@ -71,10 +71,14 @@ const handleLlmsTxt = async (request: NextRequest) => {
   try {
     // Get the original path by removing llms.txt
     const originalPath = request.nextUrl.pathname.replace(/\/llms\.txt$/, '') || '/';
+    const pathSegments = originalPath.split('/').filter(Boolean);
     
-    // Rewrite to the API route with the path as a parameter
-    const apiUrl = new URL('/api/llms-txt', request.url);
-    apiUrl.searchParams.set('path', originalPath);
+    // Rewrite to the API route with path segments
+    const apiPath = pathSegments.length > 0 
+      ? `/api/llms-txt/${pathSegments.join('/')}`
+      : '/api/llms-txt';
+    
+    const apiUrl = new URL(apiPath, request.url);
     
     return NextResponse.rewrite(apiUrl);
   } catch (error) {