| 1 | +# PostgreSQL Source Embedding Example 🗄️ |
| 2 | + |
| 5 | +This example demonstrates the **PostgreSQL table source** feature in CocoIndex. It reads data from existing PostgreSQL tables, generates embeddings, and stores them in a separate CocoIndex database with pgvector for semantic search. |
| 6 | + |
| 7 | +We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful. |
| 8 | + |
| 9 | +## What This Example Does |
| 10 | + |
| 11 | +### 📊 Data Flow |
| 12 | +``` |
| 13 | +Source PostgreSQL Table (messages) |
| 14 | + ↓ [Postgres Source] |
| 15 | +Text Processing & Embedding Generation |
| 16 | + ↓ [SentenceTransformer] |
| 17 | +CocoIndex Database (message_embeddings) with pgvector |
| 18 | + ↓ [Semantic Search] |
| 19 | +Query Results |
| 20 | +``` |
| 21 | + |
| 22 | +### 🔧 Key Features |
| 23 | +- **PostgreSQL Source**: Read from existing database tables |
| 24 | +- **Separate Databases**: Source data and embeddings stored in different databases |
| 25 | +- **Automatic Schema**: CocoIndex creates target tables automatically |
| 26 | +- **pgvector Integration**: Store embeddings for semantic search |
| 27 | + |
| 28 | +## Prerequisites |
| 29 | + |
| 30 | +Before running the example, you need: |
| 31 | + |
| 32 | +1. **PostgreSQL with pgvector**: Follow the [CocoIndex PostgreSQL setup guide](https://cocoindex.io/docs/getting_started/quickstart) to install and configure PostgreSQL with the pgvector extension. |
| 33 | + |
| 34 | +2. **Two databases**: You'll need two separate databases (names can be anything you choose): |
| 35 | + - One database for your source table data |
| 36 | + - One database for storing embeddings |
| 37 | + |
| 38 | +3. **Environment file**: Create a `.env` file with your database configuration: |
| 39 | + ```bash |
| 40 | + cp .env.example .env |
| 41 | + $EDITOR .env |
| 42 | + ``` |
| 43 | + |
| 44 | +## Installation |
| 45 | + |
| 46 | +Install dependencies: |
| 47 | + |
| 48 | +```bash |
| 49 | +pip install -e . |
| 50 | +``` |
| 51 | + |
| 52 | +## Quick Start |
| 53 | + |
| 54 | +### Environment Variables Explained |
| 55 | + |
| 56 | +The example uses these environment variables to configure the PostgreSQL source: |
| 57 | + |
| 58 | +- **`SOURCE_DATABASE_URL`**: Connection string to your source database containing the table you want to index |
| 59 | +- **`COCOINDEX_DATABASE_URL`**: Connection string to the database where CocoIndex will store embeddings |
| 60 | +- **`TABLE_NAME`**: Name of the table in your source database to read from |
| 61 | +- **`INDEXING_COLUMN`**: The text column to generate embeddings for (this example focuses on one column, but you can index multiple columns) |
| 62 | +- **`KEY_COLUMN_FOR_SINGLE_KEY`**: Primary key column name (for tables with a single-column primary key) |
| 63 | +- **`KEY_COLUMNS_FOR_MULTIPLE_KEYS`**: Comma-separated primary key columns (for tables with a composite primary key) |
| 64 | +- **`INCLUDED_COLUMNS`**: Optional - specify which columns to include (defaults to all) |
| 65 | +- **`ORDINAL_COLUMN`**: Optional - use for incremental updates |
| 66 | + |
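As a rough illustration of how these variables might be read and validated, here is a minimal sketch using only the Python standard library. The `SourceConfig` class and `load_source_config` function are hypothetical names, not part of this example's code or the CocoIndex API, and treating `INCLUDED_COLUMNS` as a comma-separated list is an assumption. The key-column variables are left out here.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class SourceConfig:
    """Hypothetical container for the settings listed above."""
    source_database_url: str
    cocoindex_database_url: str
    table_name: str
    indexing_column: str
    included_columns: Optional[List[str]]  # None means "all columns"
    ordinal_column: Optional[str]          # None means "not configured"


def load_source_config(env: Mapping[str, str] = os.environ) -> SourceConfig:
    """Read the documented variables from `env`, failing fast on missing ones."""

    def required(name: str) -> str:
        value = env.get(name, "").strip()
        if not value:
            raise ValueError(f"Missing required environment variable: {name}")
        return value

    # Assumption: INCLUDED_COLUMNS is a comma-separated list of column names.
    included_raw = env.get("INCLUDED_COLUMNS", "")
    included = [c.strip() for c in included_raw.split(",") if c.strip()]

    return SourceConfig(
        source_database_url=required("SOURCE_DATABASE_URL"),
        cocoindex_database_url=required("COCOINDEX_DATABASE_URL"),
        table_name=required("TABLE_NAME"),
        indexing_column=required("INDEXING_COLUMN"),
        included_columns=included or None,
        ordinal_column=env.get("ORDINAL_COLUMN", "").strip() or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to try out without touching your real environment.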
| 67 | +### Option A: Test with Sample Data (Recommended for first-time users) |
| 68 | + |
| 69 | +1. **Setup test database with sample data**: |
| 70 | + ```bash |
| 71 | + python setup_test_database.py |
| 72 | + ``` |
| 73 | + This will create both `test_simple` (single primary key) and `test_multiple` (composite primary key) tables with sample data. |
| 74 | + |
| 75 | +2. **Copy the generated environment configuration** to your `.env` file (the script will show you exactly what to copy). |
| 76 | + |
| 77 | +3. **Run the example**: |
| 78 | + ```bash |
| 79 | + python main.py |
| 80 | + ``` |
| 81 | + |
| 82 | +4. **Test semantic search** by entering queries in the interactive prompt |
| 83 | + |
| 84 | +### Option B: Use Your Existing Database |
| 85 | + |
| 86 | +1. **Update your `.env` file** with your database URLs and table configuration: |
| 87 | + ```env |
| 88 | + # CocoIndex Database (for storing embeddings) |
| 89 | + COCOINDEX_DATABASE_URL=postgresql://username:password@localhost:5432/cocoindex |
| 90 | + |
| 91 | + # Source Database (for reading data) |
| 92 | + SOURCE_DATABASE_URL=postgresql://username:password@localhost:5432/your_source_db |
| 93 | + |
| 94 | + # Table Configuration |
| 95 | + TABLE_NAME=your_table_name |
| 96 | + KEY_COLUMN_FOR_SINGLE_KEY=id # or KEY_COLUMNS_FOR_MULTIPLE_KEYS=col1,col2 |
| 97 | + INDEXING_COLUMN=your_text_column |
| 98 | + ORDINAL_COLUMN=your_timestamp_column # optional |
| 99 | + ``` |
| 100 | + |
| 101 | +2. **Run the example**: |
| 102 | + ```bash |
| 103 | + python main.py |
| 104 | + ``` |
| 105 | + |
| 106 | +## How It Works |
| 107 | + |
| 108 | +The example demonstrates a simple flow: |
| 109 | + |
| 110 | +1. **Read from Source**: Uses `cocoindex.sources.PostgresDb` to read from your existing table |
| 111 | +2. **Generate Embeddings**: Processes text and creates embeddings using SentenceTransformers |
| 112 | +3. **Store Embeddings**: Exports to the CocoIndex database with automatic table creation |
| 113 | +4. **Search**: Provides interactive semantic search over the stored embeddings |
| 114 | + |
| 115 | +**Note**: This example indexes one text column for simplicity, but you can modify the flow to index multiple columns or add more complex transformations. |
| 116 | + |
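To make the search step more concrete, the sketch below ranks stored rows against a query vector in plain Python. It assumes cosine similarity as the distance measure, which is a common choice but is not confirmed by this README. In the actual example the comparison happens inside the CocoIndex database via pgvector, and the vectors come from SentenceTransformers rather than the toy values you would pass in here. The function names are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Each entry in `rows` pairs a piece of text with its embedding; a vector database performs the same ranking, typically with an index so that it does not have to scan every row.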
| 117 | +### Key Benefits |
| 118 | + |
| 119 | +- **Separate Databases**: Keep your source data separate from embeddings |
| 120 | +- **Automatic Setup**: CocoIndex creates target tables automatically |
| 121 | +- **Real-time Updates**: Live updates as source data changes |
| 122 | +- **Interactive Search**: Built-in search interface for testing |
| 123 | + |
| 124 | +## Database Configuration |
| 125 | + |
| 126 | +The example uses two separate databases: |
| 127 | + |
| 128 | +1. **Source Database**: Contains your existing data table |
| 129 | +2. **CocoIndex Database**: Stores generated embeddings with pgvector support |
| 130 | + |
| 131 | +This separation allows you to: |
| 132 | +- Keep your production data unchanged |
| 133 | +- Scale embeddings independently |
| 134 | +- Use different database configurations for each purpose |
| 135 | + |
| 136 | +## Advanced Usage |
| 137 | + |
| 138 | +### Primary Key Configuration |
| 139 | + |
| 140 | +**Single Primary Key**: |
| 141 | +```env |
| 142 | +KEY_COLUMN_FOR_SINGLE_KEY=id |
| 143 | +``` |
| 144 | + |
| 145 | +**Composite Primary Key**: |
| 146 | +```env |
| 147 | +KEY_COLUMNS_FOR_MULTIPLE_KEYS=product_category,product_name |
| 148 | +``` |
| 149 | + |
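One way the two key settings could be turned into a single list of key columns is sketched below. The `resolve_key_columns` helper is hypothetical and not taken from this example's code, and rejecting a configuration that sets both variables is an assumption about sensible behavior rather than documented behavior.

```python
import os
from typing import List, Mapping


def resolve_key_columns(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the primary key column names described by the environment."""
    single = env.get("KEY_COLUMN_FOR_SINGLE_KEY", "").strip()
    multiple = env.get("KEY_COLUMNS_FOR_MULTIPLE_KEYS", "").strip()

    # Assumption: the two settings are mutually exclusive.
    if single and multiple:
        raise ValueError(
            "Set only one of KEY_COLUMN_FOR_SINGLE_KEY or "
            "KEY_COLUMNS_FOR_MULTIPLE_KEYS"
        )
    if single:
        return [single]
    if multiple:
        # Split the comma-separated list and drop empty entries.
        columns = [c.strip() for c in multiple.split(",") if c.strip()]
        if columns:
            return columns
    raise ValueError("No primary key column configured")
```

With the composite setting shown above, this returns `["product_category", "product_name"]`.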
| 150 | + |
| 151 | + |
| 152 | +## CocoInsight |
| 153 | +CocoInsight is currently in Early Access (free) 😊 You found us! Here is a quick 3-minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9). |
| 154 | + |
| 155 | +Run CocoInsight to understand your RAG data pipeline: |
| 156 | + |
| 157 | +```sh |
| 158 | +cocoindex server -ci main.py |
| 159 | +``` |
| 160 | + |
| 161 | +You can also add the `-L` flag so that the server keeps updating the index as the source changes: |
| 162 | + |
| 163 | +```sh |
| 164 | +cocoindex server -ci -L main.py |
| 165 | +``` |
| 166 | + |
| 167 | +Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). |