Commit 3058d0b

feat: implement dynamic PostgreSQL source with composite key support
- Add PostgreSQL table source with dynamic schema generation
- Support both single and composite primary keys with proper KTable structure
- Implement type-safe column access using name lookup instead of indices
- Add comprehensive PostgreSQL to CocoIndex type mapping
- Support KeyValue::Struct for composite keys with individual field access
- Update example with environment variable validation for key configuration
1 parent 397f38f commit 3058d0b

10 files changed: +1654 -0 lines changed


examples/postgres_embedding/.env

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Database Configuration
# CocoIndex Database (for storing embeddings)
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# Source Database (for reading data - can be different from CocoIndex DB)
SOURCE_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/source_data

# ========================================
# Configuration for test_multiple table
# ========================================
TABLE_NAME=test_multiple
KEY_COLUMNS_FOR_MULTIPLE_KEYS=product_category,product_name
INDEXING_COLUMN=description
ORDINAL_COLUMN=modified_time

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Database Configuration
# CocoIndex Database (for storing embeddings)
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# Database URLs
SOURCE_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/source_data

# ========================================
# Configuration for test_simple table
# ========================================
TABLE_NAME=test_simple
KEY_COLUMN_FOR_SINGLE_KEY=id
INDEXING_COLUMN=message
ORDINAL_COLUMN=created_at

# ========================================
# Configuration for test_multiple table
# ========================================
TABLE_NAME=test_multiple
KEY_COLUMNS_FOR_MULTIPLE_KEYS=product_category,product_name
INDEXING_COLUMN=description
ORDINAL_COLUMN=modified_time

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
# PostgreSQL Source Embedding Example 🗄️

[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

This example demonstrates the **PostgreSQL table source** feature in CocoIndex. It reads data from existing PostgreSQL tables, generates embeddings, and stores them in a separate CocoIndex database with pgvector for semantic search.

We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

## What This Example Does

### 📊 Data Flow
```
Source PostgreSQL Table (messages)
↓ [Postgres Source]
Text Processing & Embedding Generation
↓ [SentenceTransformer]
CocoIndex Database (message_embeddings) with pgvector
↓ [Semantic Search]
Query Results
```

### 🔧 Key Features
- **PostgreSQL Source**: Read from existing database tables
- **Separate Databases**: Source data and embeddings stored in different databases
- **Automatic Schema**: CocoIndex creates target tables automatically
- **pgvector Integration**: Store embeddings for semantic search

## Prerequisites

Before running the example, you need:

1. **PostgreSQL with pgvector**: Follow the [CocoIndex PostgreSQL setup guide](https://cocoindex.io/docs/getting_started/quickstart) to install and configure PostgreSQL with the pgvector extension.

2. **Two databases**: You'll need two separate databases (names can be anything you choose):
   - One database for your source table data
   - One database for storing embeddings

3. **Environment file**: Create a `.env` file with your database configuration:
   ```bash
   cp .env.example .env
   $EDITOR .env
   ```

## Installation

Install dependencies:

```bash
pip install -e .
```

## Quick Start

### Environment Variables Explained

The example uses these environment variables to configure the PostgreSQL source:

- **`SOURCE_DATABASE_URL`**: Connection string to your source database containing the table you want to index
- **`COCOINDEX_DATABASE_URL`**: Connection string to the database where CocoIndex will store embeddings
- **`TABLE_NAME`**: Name of the table in your source database to read from
- **`INDEXING_COLUMN`**: The text column to generate embeddings for (this example focuses on one column, but you can index multiple columns)
- **`KEY_COLUMN_FOR_SINGLE_KEY`**: Primary key column name (for tables with a single-column primary key)
- **`KEY_COLUMNS_FOR_MULTIPLE_KEYS`**: Comma-separated primary key columns (for tables with a composite primary key)
- **`INCLUDED_COLUMNS`**: Optional - specify which columns to include (defaults to all)
- **`ORDINAL_COLUMN`**: Optional - use for incremental updates

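As a rough illustration of how these variables might be read and validated, here is a minimal sketch that uses only the standard library. The names `SourceConfig`, `split_columns`, and `load_source_config` are hypothetical and are not taken from the example's `main.py`. The rule that exactly one of the two key variables must be set, and the comma-separated format of `INCLUDED_COLUMNS`, are assumptions made for this sketch.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class SourceConfig:
    """Hypothetical container for the source settings listed above."""
    source_database_url: str
    table_name: str
    indexing_column: str
    key_columns: List[str]
    ordinal_column: Optional[str] = None
    included_columns: Optional[List[str]] = None


def split_columns(raw: str) -> List[str]:
    """Split a comma-separated column list, dropping empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def load_source_config(env: Mapping[str, str] = os.environ) -> SourceConfig:
    """Read and validate the source settings from an environment mapping."""

    def required(name: str) -> str:
        value = env.get(name, "").strip()
        if not value:
            raise ValueError(f"Missing required environment variable: {name}")
        return value

    single = env.get("KEY_COLUMN_FOR_SINGLE_KEY", "").strip()
    multiple = split_columns(env.get("KEY_COLUMNS_FOR_MULTIPLE_KEYS", ""))
    # Assumption: exactly one of the two key settings should be present.
    if bool(single) == bool(multiple):
        raise ValueError(
            "Set exactly one of KEY_COLUMN_FOR_SINGLE_KEY "
            "or KEY_COLUMNS_FOR_MULTIPLE_KEYS"
        )

    return SourceConfig(
        source_database_url=required("SOURCE_DATABASE_URL"),
        table_name=required("TABLE_NAME"),
        indexing_column=required("INDEXING_COLUMN"),
        key_columns=[single] if single else multiple,
        # Optional settings fall back to None when unset or empty.
        ordinal_column=env.get("ORDINAL_COLUMN", "").strip() or None,
        included_columns=split_columns(env.get("INCLUDED_COLUMNS", "")) or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the validation easy to try out without touching the real environment.
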
### Option A: Test with Sample Data (Recommended for first-time users)

1. **Set up a test database with sample data**:
   ```bash
   python setup_test_database.py
   ```
   This will create both `test_simple` (single primary key) and `test_multiple` (composite primary key) tables with sample data.

2. **Copy the generated environment configuration** to your `.env` file (the script will show you exactly what to copy).

3. **Run the example**:
   ```bash
   python main.py
   ```

4. **Test semantic search** by entering queries in the interactive prompt

### Option B: Use Your Existing Database

1. **Update your `.env` file** with your database URLs and table configuration:
   ```env
   # CocoIndex Database (for storing embeddings)
   COCOINDEX_DATABASE_URL=postgresql://username:password@localhost:5432/cocoindex

   # Source Database (for reading data)
   SOURCE_DATABASE_URL=postgresql://username:password@localhost:5432/your_source_db

   # Table Configuration
   TABLE_NAME=your_table_name
   KEY_COLUMN_FOR_SINGLE_KEY=id # or KEY_COLUMNS_FOR_MULTIPLE_KEYS=col1,col2
   INDEXING_COLUMN=your_text_column
   ORDINAL_COLUMN=your_timestamp_column # optional
   ```

2. **Run the example**:
   ```bash
   python main.py
   ```

## How It Works

The example demonstrates a simple flow:

1. **Read from Source**: Uses `cocoindex.sources.PostgresDb` to read from your existing table
2. **Generate Embeddings**: Processes text and creates embeddings using SentenceTransformers
3. **Store Embeddings**: Exports to the CocoIndex database with automatic table creation
4. **Search**: Provides interactive semantic search over the stored embeddings

**Note**: This example indexes one text column for simplicity, but you can modify the flow to index multiple columns or add more complex transformations.

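To make the search step more concrete, here is a small, self-contained sketch of ranking stored embeddings against a query embedding. It is only an illustration of the idea: in the example the ranking is expected to happen inside PostgreSQL via pgvector, the vectors would come from the embedding model rather than being typed by hand, and the README does not state which similarity metric is used, so cosine similarity is an assumption. The function names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, embedding), text) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings" standing in for real model output.
stored = [
    ("red apples", [1.0, 0.0]),
    ("blue cars", [0.0, 1.0]),
    ("red cars", [1.0, 1.0]),
]
for score, text in top_k([1.0, 0.1], stored, k=2):
    print(f"{score:.3f}  {text}")
```

A database-side vector index avoids scanning every row the way this in-memory loop does, which is the main reason to store the embeddings with pgvector.
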
### Key Benefits

- **Separate Databases**: Keep your source data separate from embeddings
- **Automatic Setup**: CocoIndex creates target tables automatically
- **Real-time Updates**: The index can be kept up to date as source data changes (for example, with the `-L` server flag)
- **Interactive Search**: Built-in search interface for testing

## Database Configuration

The example uses two separate databases:

1. **Source Database**: Contains your existing data table
2. **CocoIndex Database**: Stores generated embeddings with pgvector support

This separation allows you to:
- Keep your production data unchanged
- Scale embeddings independently
- Use different database configurations for each purpose

## Advanced Usage

### Primary Key Configuration

**Single Primary Key**:
```env
KEY_COLUMN_FOR_SINGLE_KEY=id
```

**Composite Primary Key**:
```env
KEY_COLUMNS_FOR_MULTIPLE_KEYS=product_category,product_name
```

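The difference between the two key settings can be pictured with a short sketch that builds a row's key by column name. This is an illustration only, not how CocoIndex represents keys internally: `row_key` is a hypothetical helper, and returning a plain value for a single key and a dictionary of named fields for a composite key is an assumption chosen to mirror the single versus composite distinction.

```python
from typing import Any, Dict, Mapping, Sequence, Union


def row_key(
    row: Mapping[str, Any], key_columns: Sequence[str]
) -> Union[Any, Dict[str, Any]]:
    """Build a key for a row by looking up key columns by name."""
    if not key_columns:
        raise ValueError("At least one key column is required")
    missing = [column for column in key_columns if column not in row]
    if missing:
        raise KeyError(f"Row is missing key column(s): {', '.join(missing)}")
    if len(key_columns) == 1:
        # Single primary key: the key is just that column's value.
        return row[key_columns[0]]
    # Composite primary key: keep each part accessible by its column name.
    return {column: row[column] for column in key_columns}


simple_row = {"id": 7, "message": "hello"}
product_row = {
    "product_category": "toys",
    "product_name": "kite",
    "description": "A light kite for windy days",
}

print(row_key(simple_row, ["id"]))
print(row_key(product_row, ["product_category", "product_name"]))
```

Looking columns up by name rather than by position means the key stays correct even if the table's column order changes.
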
## CocoInsight
CocoInsight is in Early Access now (free) 😊 You found us! Here is a quick 3-minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).

Run CocoInsight to understand your RAG data pipeline:

```sh
cocoindex server -ci main.py
```

You can also add a `-L` flag to make the server keep updating the index to reflect source changes at the same time:

```sh
cocoindex server -ci -L main.py
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
