| 1 | +# Documentation Embeddings Generation System |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The documentation embeddings generation system processes several documentation sources and stores their content, metadata, and vector embeddings in a database to power semantic search. The system lives in `apps/docs/scripts/search/` and works by: |
| 6 | + |
| 7 | +1. **Discovering content sources** from multiple types of documentation |
| 8 | +2. **Processing content** into structured sections with checksums |
| 9 | +3. **Generating embeddings** using OpenAI's text-embedding-ada-002 model |
| 10 | +4. **Storing results** in the database alongside their vector embeddings for semantic search |
| 11 | + |
| 12 | +## Architecture |
| 13 | + |
| 14 | +### Main Entry Point |
| 15 | +- `generate-embeddings.ts` - Main script that orchestrates the entire process |
| 16 | +- Supports `--refresh` flag to force regeneration of all content |
| 17 | + |
| 18 | +### Content Sources (`sources/` directory) |
| 19 | + |
| 20 | +#### Base Classes |
| 21 | +- `BaseLoader` - Abstract class for loading content from different sources |
| 22 | +- `BaseSource` - Abstract class for processing and formatting content |
| 23 | + |
| 24 | +#### Source Types |
| 25 | +1. **Markdown Sources** (`markdown.ts`) |
| 26 | + - Processes `.mdx` files from guides and documentation |
| 27 | + - Extracts frontmatter metadata and content sections |
| 28 | + |
| 29 | +2. **Reference Documentation** (`reference-doc.ts`) |
| 30 | + - **OpenAPI References** - Management API documentation from OpenAPI specs |
| 31 | + - **Client Library References** - JavaScript, Dart, Python, C#, Swift, Kotlin SDKs |
| 32 | + - **CLI References** - Command-line interface documentation |
| 33 | + - Processes YAML/JSON specs and matches with common sections |
| 34 | + |
| 35 | +3. **GitHub Discussions** (`github-discussion.ts`) |
| 36 | +   - Fetches troubleshooting discussions from GitHub using the GraphQL API |
| 37 | + - Uses GitHub App authentication for access |
| 38 | + |
| 39 | +4. **Partner Integrations** (`partner-integrations.ts`) |
| 40 | +   - Fetches approved partner integration documentation from the Supabase database |
| 41 | + - Technology integrations only (excludes agencies) |
| 42 | + |
| 43 | +### Processing Flow |
| 44 | + |
| 45 | +1. **Content Discovery**: Each source loader discovers and loads its content files or data |
| 46 | +2. **Content Processing**: Each source processes content into: |
| 47 | + - Checksum for change detection |
| 48 | + - Metadata (title, subtitle, etc.) |
| 49 | + - Sections with headings and content |
| 50 | +3. **Change Detection**: Compares checksums against existing database records |
| 51 | +4. **Embedding Generation**: Uses OpenAI to generate embeddings for new/changed content |
| 52 | +5. **Database Storage**: Stores pages in the `page` table and their sections, with embeddings, in the `page_section` table |
| 53 | +6. **Cleanup**: Removes outdated pages using version tracking |
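
The processing, change-detection, and cleanup steps above can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the actual scripts: the `Section` and `ProcessedPage` shapes, the functions `computeChecksum`, `splitIntoSections`, `processPage`, `needsEmbedding`, and `findStalePages`, the choice of SHA-256 as the hash, and the in-memory `Map` standing in for the database. It only shows the general idea of skipping unchanged pages and finding pages not touched by the current run.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical shapes for illustration; the real types may differ.
interface Section {
  heading: string
  content: string
}

interface ProcessedPage {
  path: string
  checksum: string
  sections: Section[]
}

// Hex digest of the raw page content. SHA-256 is an assumption.
function computeChecksum(content: string): string {
  return createHash('sha256').update(content).digest('hex')
}

// Split markdown into sections at each heading line (# through ######).
// Simplification: text before the first heading is dropped, and heading-like
// lines inside fenced code blocks are not special-cased.
function splitIntoSections(markdown: string): Section[] {
  const sections: Section[] = []
  let current: Section | null = null
  for (const line of markdown.split('\n')) {
    const match = /^#{1,6}\s+(.*)$/.exec(line)
    if (match) {
      if (current) sections.push(current)
      current = { heading: (match[1] ?? '').trim(), content: '' }
    } else if (current) {
      current.content += line + '\n'
    }
  }
  if (current) sections.push(current)
  return sections.map((s) => ({ heading: s.heading, content: s.content.trim() }))
}

function processPage(path: string, markdown: string): ProcessedPage {
  return {
    path,
    checksum: computeChecksum(markdown),
    sections: splitIntoSections(markdown),
  }
}

// A page needs new embeddings when it is new, when its checksum changed,
// or when regeneration is forced (mirroring the --refresh flag).
function needsEmbedding(
  page: ProcessedPage,
  storedChecksums: Map<string, string>,
  refresh = false
): boolean {
  return refresh || storedChecksums.get(page.path) !== page.checksum
}

// Assumed cleanup model: every page seen in a run is stamped with that run's
// version, so pages still carrying an older version are stale.
function findStalePages(storedVersions: Map<string, string>, runVersion: string): string[] {
  return Array.from(storedVersions.entries())
    .filter(([, version]) => version !== runVersion)
    .map(([path]) => path)
}
```

In this sketch, a page whose checksum matches the stored value is skipped entirely, so no embedding request is made for unchanged content unless `refresh` is set.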
| 54 | + |
| 55 | +### Database Schema |
| 56 | + |
| 57 | +- **`page`** table: Stores each page's metadata, content, checksum, and version |
| 58 | +- **`page_section`** table: Stores individual sections with their embeddings and token counts |
| 59 | + |