| 1 | +# Ingest |
| 2 | + |
| 3 | +## Setup |
| 4 | + |
| 5 | +### Prerequisites |
| 6 | + |
| 7 | +* [`uv`](https://docs.astral.sh/uv/) |
| 8 | +* `icu4c` (for building the PostgreSQL docs) |
| 9 | + |
| 10 | +### Install Dependencies |
| 11 | + |
| 12 | +```bash |
| 13 | +uv sync |
| 14 | +``` |
| 15 | + |
| 16 | +## Running the ingest |
| 17 | + |
| 18 | +### PostgreSQL Documentation |
| 19 | + |
| 20 | +```text |
| 21 | +$ uv run python postgres_docs.py --help |
| 22 | +usage: postgres_docs.py [-h] version |
| 23 | + |
| 24 | +Ingest Postgres documentation into the database. |
| 25 | + |
| 26 | +positional arguments: |
| 27 | + version Postgres version to ingest |
| 28 | + |
| 29 | +options: |
| 30 | + -h, --help show this help message and exit |
| 31 | +``` |
| 32 | + |
| 33 | +### Timescale Documentation |
| 34 | + |
| 35 | +```text |
| 36 | +$ uv run python timescale_docs.py --help |
| 37 | +usage: timescale_docs.py [-h] [--domain DOMAIN] [-o OUTPUT_DIR] [-m MAX_PAGES] [--strip-images] [--no-strip-images] [--chunk] [--no-chunk] [--chunking {header,semantic}] [--storage-type {file,database}] [--database-uri DATABASE_URI] |
| 38 | + [--skip-indexes] [--delay DELAY] [--concurrent CONCURRENT] [--log-level {DEBUG,INFO,WARNING,ERROR}] [--user-agent USER_AGENT] |
| 39 | + |
| 40 | +Scrape websites using sitemaps and convert to chunked markdown for RAG applications |
| 41 | + |
| 42 | +options: |
| 43 | + -h, --help show this help message and exit |
| 44 | + --domain, -d DOMAIN Domain to scrape (e.g., docs.tigerdata.com) |
| 45 | + -o, --output-dir OUTPUT_DIR |
| 46 | + Output directory for scraped files (default: scraped_docs) |
| 47 | + -m, --max-pages MAX_PAGES |
| 48 | + Maximum number of pages to scrape (default: unlimited) |
| 49 | + --strip-images Strip data: images from content (default: True) |
| 50 | + --no-strip-images Keep data: images in content |
| 51 | + --chunk Enable content chunking (default: True) |
| 52 | + --no-chunk Disable content chunking |
| 53 | + --chunking {header,semantic} |
| 54 | + Chunking method: header (default) or semantic (requires OPENAI_API_KEY) |
| 55 | + --storage-type {file,database} |
| 56 | + Storage type: database (default) or file |
| 57 | + --database-uri DATABASE_URI |
| 58 | + PostgreSQL connection URI (default: uses DB_URL from environment) |
| 59 | + --skip-indexes Skip creating database indexes after import (for development/testing) |
| 60 | + --delay DELAY Download delay in seconds (default: 1.0) |
| 61 | + --concurrent CONCURRENT |
| 62 | + Maximum concurrent requests (default: 4) |
| 63 | + --log-level {DEBUG,INFO,WARNING,ERROR} |
| 64 | + Logging level (default: INFO) |
| 65 | + --user-agent USER_AGENT |
| 66 | + User agent string |
| 67 | + |
| 68 | +Examples: |
| 69 | + timescale_docs.py docs.tigerdata.com |
| 70 | + timescale_docs.py docs.tigerdata.com -o tiger_docs -m 50 |
| 71 | + timescale_docs.py docs.tigerdata.com -o semantic_docs -m 5 --chunking semantic |
| 72 | + timescale_docs.py docs.tigerdata.com --no-chunk --no-strip-images -m 100 |
| 73 | + timescale_docs.py docs.tigerdata.com --storage-type database --database-uri postgresql://user:pass@host:5432/dbname |
| 74 | + timescale_docs.py docs.tigerdata.com --storage-type database --chunking semantic -m 10 |
| 75 | +``` |
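When the ingest is driven from another script (a scheduled job, for example), it can help to assemble the command line in one place. The following is a minimal sketch, not part of this repository: `build_ingest_command` is a hypothetical helper, and it only uses flags that appear in the help output above. Note that the help text lists the domain as `--domain`/`-d` while its own examples pass it positionally; the sketch assumes the `--domain` form is accepted.

```python
import shlex
from typing import Optional


def build_ingest_command(
    domain: str,
    max_pages: Optional[int] = None,
    chunking: str = "header",
    storage_type: str = "database",
    database_uri: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list for timescale_docs.py (hypothetical helper)."""
    # Mirror the choices shown in the --help output.
    if chunking not in ("header", "semantic"):
        raise ValueError(f"unsupported chunking method: {chunking!r}")
    if storage_type not in ("file", "database"):
        raise ValueError(f"unsupported storage type: {storage_type!r}")

    # Assumption: the domain is passed with --domain, as the options list shows.
    cmd = ["uv", "run", "python", "timescale_docs.py", "--domain", domain]

    # Omitting --max-pages leaves the documented default (unlimited) in place.
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]

    cmd += ["--chunking", chunking, "--storage-type", storage_type]

    # Omitting --database-uri lets the script fall back to DB_URL from the environment.
    if database_uri is not None:
        cmd += ["--database-uri", database_uri]

    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command for inspection instead of running it.
    print(shlex.join(build_ingest_command("docs.tigerdata.com", max_pages=50)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Building an argument list rather than a single string avoids shell quoting problems with values such as a connection URI.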