165 changes: 165 additions & 0 deletions components/indexer/pgvector/README.md
@@ -0,0 +1,165 @@
# PGVector Indexer

An indexer for the Eino framework that stores documents and their vector embeddings in PostgreSQL using the pgvector extension, so they can later be retrieved by similarity search.

## Features

- **Type-safe vector operations** using `pgvector.Vector` from the official `pgvector-go` library
- **Batch processing** for efficient embedding and storage
- **Automatic conflict resolution** with UPSERT semantics
- **SQL injection protection** with identifier validation
- **Connection pooling** support via `pgxpool.Pool`
- **Eino callbacks** integration for observability
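
As a rough illustration of the identifier validation mentioned above, a table name can be checked against a strict pattern before it is interpolated into SQL. This is a minimal sketch, not the indexer's actual code: `validateIdentifier`, `identifierPattern`, and `maxIdentifierLen` are hypothetical names, and the accepted character set is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// identifierPattern accepts plain unquoted identifiers: a letter or underscore
// followed by letters, digits, or underscores. (Assumed rule, for illustration.)
var identifierPattern = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// maxIdentifierLen mirrors PostgreSQL's default limit of 63 bytes per identifier.
const maxIdentifierLen = 63

// validateIdentifier is a hypothetical helper that rejects any name that is
// not a plain identifier, so it is safe to interpolate into a statement.
func validateIdentifier(name string) error {
	if len(name) == 0 || len(name) > maxIdentifierLen {
		return fmt.Errorf("identifier %q must be 1-%d bytes long", name, maxIdentifierLen)
	}
	if !identifierPattern.MatchString(name) {
		return fmt.Errorf("identifier %q contains unsupported characters", name)
	}
	return nil
}
```

Table names cannot be passed as query parameters, which is why they need this kind of check, while document values can go through placeholders such as `$1`.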

## Installation

```bash
go get github.com/cloudwego/eino-ext/components/indexer/pgvector
```

## Prerequisites

1. **PostgreSQL** with the pgvector extension installed
2. **A table** created before using the indexer:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1536), -- adjust dimension based on your model
    metadata JSONB
);

-- Optional: create index for vector similarity search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
-- or
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```
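
The `vector_cosine_ops` indexes above serve queries that order by pgvector's cosine distance operator `<=>`. Querying is outside the scope of this indexer, so the following is only a sketch of what a reader-side query could look like against the `documents` table; `similarityQuery` and `vectorLiteral` are hypothetical names.

```go
package main

import (
	"strconv"
	"strings"
)

// similarityQuery orders rows by cosine distance (<=>) to the query vector.
// Table and column names follow the schema shown in this README.
const similarityQuery = `SELECT id, content, metadata
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $2`

// vectorLiteral renders a vector in pgvector's text format, e.g. "[0.1,0.2,0.3]".
func vectorLiteral(v []float32) string {
	parts := make([]string, len(v))
	for i, x := range v {
		parts[i] = strconv.FormatFloat(float64(x), 'f', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}
```

The literal would be passed as the first query parameter and the row limit as the second; using `pgvector.Vector` from `pgvector-go` instead of a text literal is the type-safe alternative.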

## Usage

### Basic Example

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/cloudwego/eino-ext/components/embedding/openai"
	"github.com/cloudwego/eino-ext/components/indexer/pgvector"
	"github.com/cloudwego/eino/schema"
	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	// Create connection pool
	pool, err := pgxpool.New(ctx, "postgres://user:pass@localhost/dbname")
	if err != nil {
		panic(err)
	}
	defer pool.Close()

	// Create an embedder; any embedding.Embedder implementation works.
	// The OpenAI config fields shown here are an assumption; check the
	// eino-ext embedding package for the exact options.
	embedder, err := openai.NewEmbedder(ctx, &openai.EmbeddingConfig{
		APIKey: os.Getenv("OPENAI_API_KEY"),
		Model:  "text-embedding-3-small",
	})
	if err != nil {
		panic(err)
	}

	// Create indexer
	indexer, err := pgvector.NewIndexer(ctx, &pgvector.IndexerConfig{
		Conn:      pool,
		TableName: "documents",
		Embedding: embedder,
		BatchSize: 10,
	})
	if err != nil {
		panic(err)
	}

	// Store documents
	docs := []*schema.Document{
		{
			ID:      "doc1",
			Content: "Hello world",
			MetaData: map[string]any{
				"category": "greeting",
			},
		},
		// ... more documents
	}

	ids, err := indexer.Store(ctx, docs)
	if err != nil {
		panic(err)
	}

	fmt.Printf("Stored %d documents\n", len(ids))
}
```

## Configuration

### IndexerConfig

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `Conn` | `PgxConn` | *required* | pgx connection or pool |
| `TableName` | `string` | `"documents"` | Table name for storing documents |
| `Embedding` | `embedding.Embedder` | *required for Store* | Embedding model for vectorization |
| `BatchSize` | `int` | `10` | Batch size for embedding operations |
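
To make the effect of `BatchSize` concrete, the sketch below splits document texts into consecutive batches, falling back to the default of 10 when the configured value is not positive. It illustrates the general batching technique only; `splitBatches` and `defaultBatchSize` are hypothetical names, and the indexer's internal batching may differ.

```go
package main

// defaultBatchSize matches the documented default for BatchSize.
const defaultBatchSize = 10

// splitBatches is a hypothetical helper that splits texts into consecutive
// slices of at most batchSize elements. A non-positive batchSize falls back
// to defaultBatchSize.
func splitBatches(texts []string, batchSize int) [][]string {
	if batchSize <= 0 {
		batchSize = defaultBatchSize
	}
	batches := make([][]string, 0, (len(texts)+batchSize-1)/batchSize)
	for start := 0; start < len(texts); start += batchSize {
		end := start + batchSize
		if end > len(texts) {
			end = len(texts)
		}
		batches = append(batches, texts[start:end])
	}
	return batches
}
```

With 25 documents and a batch size of 10, this yields three batches of 10, 10, and 5 texts, so the embedder would be called three times.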

### Table Schema

The indexer expects a table with this schema:

```sql
CREATE TABLE table_name (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(N), -- N = vector dimension
    metadata JSONB
);
```
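
Given this schema, UPSERT semantics are typically expressed in PostgreSQL with `INSERT ... ON CONFLICT (id) DO UPDATE`. The statement below is a sketch of that pattern under the assumption that `id` is the conflict target; `buildUpsertSQL` is a hypothetical helper, not necessarily the statement the indexer issues.

```go
package main

import "fmt"

// buildUpsertSQL is a hypothetical helper that returns an upsert statement for
// the documented schema. tableName must already be validated, because it is
// interpolated into the statement; row values go through $1..$4 placeholders.
func buildUpsertSQL(tableName string) string {
	return fmt.Sprintf(
		"INSERT INTO %s (id, content, embedding, metadata) VALUES ($1, $2, $3, $4) "+
			"ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content, "+
			"embedding = EXCLUDED.embedding, metadata = EXCLUDED.metadata",
		tableName,
	)
}
```

Storing a document whose `id` already exists then overwrites its content, embedding, and metadata instead of failing on the primary key.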

## Performance Tips

1. **Use connection pooling** - pass a `pgxpool.Pool` as `Conn` when the indexer is used concurrently
2. **Adjust BatchSize** - larger batches (10-100) improve throughput, within your embedding provider's per-request limits
3. **Create vector indexes** - HNSW or IVFFlat indexes speed up similarity search
4. **Tune index parameters** - for IVFFlat, scale `lists` with the number of rows in the table
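
For the IVFFlat `lists` parameter, the pgvector documentation suggests starting at `rows / 1000` for tables of up to 1M rows and `sqrt(rows)` beyond that. A small sketch of that starting point follows; `suggestedLists` is a hypothetical helper and the result is a first guess to benchmark, not a tuned value.

```go
package main

import "math"

// suggestedLists is a hypothetical helper that returns a starting value for
// IVFFlat's lists parameter: rows/1000 up to one million rows, sqrt(rows)
// above that, and never less than 1.
func suggestedLists(rows int) int {
	if rows <= 0 {
		return 1
	}
	var lists int
	if rows <= 1_000_000 {
		lists = rows / 1000
	} else {
		lists = int(math.Sqrt(float64(rows)))
	}
	if lists < 1 {
		lists = 1
	}
	return lists
}
```

For example, a table of 100,000 rows gives `lists = 100`, which matches the value used in the prerequisites section.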

## Dependencies

- `github.com/cloudwego/eino` - Eino framework
- `github.com/jackc/pgx/v5` - PostgreSQL driver (v5.5.1+)
- `github.com/pgvector/pgvector-go` - pgvector Go library (v0.3.0+)

## Compatibility

- **PostgreSQL**: 12+
- **pgvector extension**: 0.5.0+
- **Go**: 1.23+

## Error Handling

Errors are prefixed with the function that produced them and include the underlying cause where there is one:

```text
[NewIndexer] database connection not provided
[Indexer.Store] documents list is empty
[Indexer.Store] embedding failed: <cause>
[Indexer.Store] batch execution failed: <cause>
```
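
Messages in this shape are commonly produced by wrapping the cause with `fmt.Errorf` and `%w`, which lets callers still match the original error with `errors.Is`. The sketch below shows that pattern; whether the indexer wraps with `%w` is an assumption, and `wrapStoreError`, `isRetryable`, and `errRateLimited` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errRateLimited stands in for an error returned by an embedding provider.
// It is hypothetical and exists only for this illustration.
var errRateLimited = errors.New("rate limited")

// wrapStoreError prefixes cause with the function name and the failed step,
// keeping the cause reachable through errors.Is and errors.As via %w.
func wrapStoreError(step string, cause error) error {
	return fmt.Errorf("[Indexer.Store] %s failed: %w", step, cause)
}

// isRetryable reports whether err was ultimately caused by rate limiting.
func isRetryable(err error) bool {
	return errors.Is(err, errRateLimited)
}
```

A caller could use a check like `isRetryable` to decide whether to retry `Store` after a delay or to fail immediately.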

## Testing

Run tests:

```bash
go test -v ./...
```

## License

Apache License 2.0

## See Also

- [pgvector Documentation](https://github.com/pgvector/pgvector)
- [pgvector-go](https://github.com/pgvector/pgvector-go)
- [Eino Framework](https://github.com/cloudwego/eino)
22 changes: 22 additions & 0 deletions components/indexer/pgvector/consts.go
@@ -0,0 +1,22 @@
/*
* Copyright 2025 CloudWeGo Authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package pgvector

const (
// DefaultTableName is the default table name for storing documents and vectors.
DefaultTableName = "documents"
)
125 changes: 125 additions & 0 deletions components/indexer/pgvector/examples/main.go
@@ -0,0 +1,125 @@
/*
* Copyright 2025 CloudWeGo Authors
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/cloudwego/eino-ext/components/indexer/pgvector"
	"github.com/cloudwego/eino/components/embedding"
	"github.com/cloudwego/eino/schema"
	"github.com/jackc/pgx/v5/pgxpool"
)

// This example demonstrates how to use the pgvector indexer.
// Prerequisites:
//  1. PostgreSQL installed with the pgvector extension
//  2. Database created: CREATE DATABASE eino_test;
//  3. Table created (see setup.sql); the embedding column has 3 dimensions
//     to match the mock embedder below:
//     CREATE EXTENSION IF NOT EXISTS vector;
//     CREATE TABLE documents (
//         id TEXT PRIMARY KEY,
//         content TEXT NOT NULL,
//         embedding vector(3),
//         metadata JSONB
//     );
//  4. Connection string matches your database setup

func main() {
	ctx := context.Background()

	// Connect to PostgreSQL
	// Update the connection string to match your database configuration
	connString := "postgres://test_user:test_password@localhost:5433/eino_test?sslmode=disable"
	pool, err := pgxpool.New(ctx, connString)
	if err != nil {
		log.Fatalf("Failed to connect to database: %v", err)
	}
	defer pool.Close()

	// Create indexer config
	config := &pgvector.IndexerConfig{
		Conn:      pool,
		TableName: "documents",
		Embedding: &mockEmbedder{}, // In production, use real embedder
		BatchSize: 10,
	}

	// Create indexer
	idxr, err := pgvector.NewIndexer(ctx, config)
	if err != nil {
		log.Fatalf("Failed to create indexer: %v", err)
	}

	// Sample documents to index
	docs := []*schema.Document{
		{
			ID:      "doc1",
			Content: "PostgreSQL is a powerful open-source relational database.",
			MetaData: map[string]any{
				"category": "database",
				"tags":     []string{"postgresql", "sql"},
			},
		},
		{
			ID:      "doc2",
			Content: "pgvector is an extension for vector similarity search.",
			MetaData: map[string]any{
				"category": "database",
				"tags":     []string{"pgvector", "extension"},
			},
		},
		{
			ID:      "doc3",
			Content: "Machine learning models can be embedded as vectors for similarity search.",
			MetaData: map[string]any{
				"category": "ml",
				"tags":     []string{"ml", "embedding", "search"},
			},
		},
	}

	// Store documents
	ids, err := idxr.Store(ctx, docs)
	if err != nil {
		log.Fatalf("Failed to store documents: %v", err)
	}

	fmt.Printf("Successfully indexed %d documents\n", len(ids))
	for _, id := range ids {
		fmt.Printf(" - %s\n", id)
	}
}

// mockEmbedder is a mock embedding implementation for demonstration.
// In production, replace it with a real embedder such as:
//
//	import "github.com/cloudwego/eino-ext/components/embedding/openai"
//	embedder, err := openai.NewEmbedder(ctx, &openai.EmbeddingConfig{...})
type mockEmbedder struct{}

func (m *mockEmbedder) EmbedStrings(ctx context.Context, texts []string, opts ...embedding.Option) ([][]float64, error) {
	// Return mock 3-dimensional vectors, matching vector(3) in setup.sql.
	// In production, the embedder returns vectors with your model's dimension,
	// and the table's embedding column must be declared with the same dimension.
	result := make([][]float64, len(texts))
	for i := range result {
		result[i] = []float64{0.1, 0.2, 0.3}
	}
	return result, nil
}
19 changes: 19 additions & 0 deletions components/indexer/pgvector/examples/setup.sql
@@ -0,0 +1,19 @@
-- Setup script for pgvector example
-- Run this with: psql -h localhost -p 5433 -U test_user -d eino_test -f setup.sql

-- Create pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create documents table
CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(3), -- 3 dimensions for the mock embedder
    metadata JSONB
);

-- Create index for vector similarity search (optional but recommended)
CREATE INDEX IF NOT EXISTS documents_embedding_idx ON documents USING hnsw (embedding vector_cosine_ops);

-- Verify setup
\d documents
49 changes: 49 additions & 0 deletions components/indexer/pgvector/go.mod
@@ -0,0 +1,49 @@
module github.com/cloudwego/eino-ext/components/indexer/pgvector

go 1.23.0

require (
github.com/cloudwego/eino v0.6.0
github.com/jackc/pgx/v5 v5.7.2
github.com/pgvector/pgvector-go v0.3.0
github.com/stretchr/testify v1.10.0
)

require (
github.com/bahlo/generic-list-go v0.2.0 // indirect
github.com/buger/jsonparser v1.1.1 // indirect
github.com/bytedance/gopkg v0.1.3 // indirect
github.com/bytedance/sonic v1.14.1 // indirect
github.com/bytedance/sonic/loader v0.3.0 // indirect
github.com/cloudwego/base64x v0.1.6 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/eino-contrib/jsonschema v1.0.2 // indirect
github.com/goph/emperror v0.17.2 // indirect
github.com/jackc/pgpassfile v1.0.0 // indirect
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 // indirect
github.com/jackc/puddle/v2 v2.2.2 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/klauspost/cpuid/v2 v2.2.9 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/nikolalohinski/gonja v1.5.3 // indirect
github.com/pelletier/go-toml/v2 v2.0.9 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/rogpeppe/go-internal v1.14.1 // indirect
github.com/sirupsen/logrus v1.9.3 // indirect
github.com/slongfield/pyfmt v0.0.0-20220222012616-ea85ff4c361f // indirect
github.com/twitchyliquid64/golang-asm v0.15.1 // indirect
github.com/wk8/go-ordered-map/v2 v2.1.8 // indirect
github.com/x448/float16 v0.8.4 // indirect
github.com/yargevad/filepathx v1.0.0 // indirect
golang.org/x/arch v0.11.0 // indirect
golang.org/x/crypto v0.36.0 // indirect
golang.org/x/exp v0.0.0-20230713183714-613f0c0eb8a1 // indirect
golang.org/x/sync v0.12.0 // indirect
golang.org/x/sys v0.31.0 // indirect
golang.org/x/text v0.23.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)