Commit a639cc2

chore: add README and pyproject for ingest

Signed-off-by: Matthew Peveler <mpeveler@tigerdata.com>

1 parent 3fc2355 commit a639cc2

File tree

3 files changed: +1210 −0 lines


ingest/README.md

Lines changed: 75 additions & 0 deletions
# Ingest

## Setup

### Prerequisites

* [`uv`](https://docs.astral.sh/uv/)
* `icu4c` (for building postgres docs)

### Install Dependencies

```bash
uv sync
```
## Running the ingest

### PostgreSQL Documentation

```text
$ uv run python postgres_docs.py --help
usage: postgres_docs.py [-h] version

Ingest Postgres documentation into the database.

positional arguments:
  version     Postgres version to ingest

options:
  -h, --help  show this help message and exit
```
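The help output above implies an `argparse` setup roughly like the following. This is a sketch reconstructed from the `--help` text, not the actual `postgres_docs.py` source, and `"17"` is just an illustrative version number:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the --help output above (a reconstruction, not the real source)."""
    parser = argparse.ArgumentParser(
        prog="postgres_docs.py",
        description="Ingest Postgres documentation into the database.",
    )
    # Single required positional argument, as shown in the usage line.
    parser.add_argument("version", help="Postgres version to ingest")
    return parser


# "17" is a hypothetical example value for the positional argument.
args = build_parser().parse_args(["17"])
```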
### Timescale Documentation

```text
$ uv run python timescale_docs.py --help
usage: timescale_docs.py [-h] [--domain DOMAIN] [-o OUTPUT_DIR] [-m MAX_PAGES] [--strip-images] [--no-strip-images] [--chunk] [--no-chunk] [--chunking {header,semantic}] [--storage-type {file,database}] [--database-uri DATABASE_URI]
                         [--skip-indexes] [--delay DELAY] [--concurrent CONCURRENT] [--log-level {DEBUG,INFO,WARNING,ERROR}] [--user-agent USER_AGENT]

Scrape websites using sitemaps and convert to chunked markdown for RAG applications

options:
  -h, --help            show this help message and exit
  --domain, -d DOMAIN   Domain to scrape (e.g., docs.tigerdata.com)
  -o, --output-dir OUTPUT_DIR
                        Output directory for scraped files (default: scraped_docs)
  -m, --max-pages MAX_PAGES
                        Maximum number of pages to scrape (default: unlimited)
  --strip-images        Strip data: images from content (default: True)
  --no-strip-images     Keep data: images in content
  --chunk               Enable content chunking (default: True)
  --no-chunk            Disable content chunking
  --chunking {header,semantic}
                        Chunking method: header (default) or semantic (requires OPENAI_API_KEY)
  --storage-type {file,database}
                        Storage type: database (default) or file
  --database-uri DATABASE_URI
                        PostgreSQL connection URI (default: uses DB_URL from environment)
  --skip-indexes        Skip creating database indexes after import (for development/testing)
  --delay DELAY         Download delay in seconds (default: 1.0)
  --concurrent CONCURRENT
                        Maximum concurrent requests (default: 4)
  --log-level {DEBUG,INFO,WARNING,ERROR}
                        Logging level (default: INFO)
  --user-agent USER_AGENT
                        User agent string

Examples:
  timescale_docs.py docs.tigerdata.com
  timescale_docs.py docs.tigerdata.com -o tiger_docs -m 50
  timescale_docs.py docs.tigerdata.com -o semantic_docs -m 5 --chunking semantic
  timescale_docs.py docs.tigerdata.com --no-chunk --no-strip-images -m 100
  timescale_docs.py docs.tigerdata.com --storage-type database --database-uri postgresql://user:pass@host:5432/dbname
  timescale_docs.py docs.tigerdata.com --storage-type database --chunking semantic -m 10
```
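The default `--chunking header` mode can be illustrated with a minimal sketch. The real script most likely uses `langchain-text-splitters` (it is in the project's dependencies); `chunk_by_headers` below is a simplified stand-in that splits markdown on heading lines, not code from this commit:

```python
def chunk_by_headers(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading (simplified stand-in)."""
    sections: list[dict] = []
    current: dict = {"header": None, "lines": []}
    for line in markdown.splitlines():
        if line.startswith("#"):
            # A new heading closes out the previous chunk, if it had anything in it.
            if current["lines"] or current["header"]:
                sections.append(current)
            current = {"header": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [
        {"header": s["header"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


# Illustrative input, not real scraped content.
doc = "# Install\nRun uv sync.\n\n# Usage\nRun the script."
chunks = chunk_by_headers(doc)
```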

ingest/pyproject.toml

Lines changed: 16 additions & 0 deletions
[project]
name = "docs-importer"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "beautifulsoup4>=4.13.5",
    "langchain-text-splitters>=0.3.9",
    "markdownify>=1.1.0",
    "openai>=1.97.1",
    "psycopg[binary,pool]>=3.2.9",
    "python-dotenv[cli]>=1.1.1",
    "scrapy>=2.13.3",
    "tiktoken>=0.11.0",
]
