Ultra-performant data transformation framework for AI, with a core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready from day 0.

⭐ Drop a star to help us grow!
@@ -60,9 +59,8 @@ CocoIndex makes it effortless to transform data with AI, and keep source data an

</br>

## Exceptional velocity

Just declare the transformation as a dataflow in ~100 lines of Python.
@@ -86,25 +84,30 @@ CocoIndex follows the idea of [Dataflow](https://en.wikipedia.org/wiki/Dataflow_
**In particular**, developers don't explicitly mutate data by creating, updating, or deleting it. They only need to define the transformation (formula) for a set of source data.
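
As a rough illustration of this idea, here is a minimal sketch in plain Python. It does not use the CocoIndex API; `transform` and `derive_target` are hypothetical names. The target is defined as a pure function of the source data, so re-evaluating it after the source changes yields the new target state without any hand-written create, update, or delete logic.

```python
def transform(text: str) -> dict:
    """Hypothetical per-document formula: derive fields from the source text."""
    words = text.split()
    return {"length": len(text), "first_word": words[0] if words else ""}


def derive_target(source: dict[str, str]) -> dict[str, dict]:
    """The target is always a function of the current source data."""
    return {name: transform(text) for name, text in source.items()}


source = {"a.md": "hello world", "b.md": "dataflow programming"}
target = derive_target(source)

# Change the source; no explicit insert/update/delete is written for the target.
source["a.md"] = "hello again, world"
del source["b.md"]
source["c.md"] = "new file"
target = derive_target(source)
print(sorted(target))  # ['a.md', 'c.md']
```

This sketch recomputes everything on each call; it only shows the declarative style, not how an engine would process changes incrementally.
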

## Plug-and-Play Building Blocks

Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.
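
The "one-line switch" can be pictured with a generic sketch like the one below. These are not CocoIndex classes: `Target`, `InMemoryTarget`, `JsonLinesTarget`, and `run_flow` are hypothetical names, used only to show how a shared interface keeps the flow code unchanged when a component is swapped.

```python
import io
import json
from typing import Protocol


class Target(Protocol):
    """Hypothetical common interface that every target implements."""

    def write(self, key: str, value: dict) -> None: ...


class InMemoryTarget:
    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write(self, key: str, value: dict) -> None:
        self.rows[key] = value


class JsonLinesTarget:
    def __init__(self, stream: io.TextIOBase) -> None:
        self.stream = stream

    def write(self, key: str, value: dict) -> None:
        self.stream.write(json.dumps({"key": key, **value}) + "\n")


def run_flow(docs: dict[str, str], target: Target) -> None:
    """The flow depends only on the interface, not on a concrete target."""
    for name, text in docs.items():
        target.write(name, {"chars": len(text)})


docs = {"a.md": "hello", "b.md": "building blocks"}

target = InMemoryTarget()  # swap this one line for JsonLinesTarget(io.StringIO())
run_flow(docs, target)
```
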
- 🎬 [Quick Start Video Tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT)
@@ -119,7 +122,6 @@ pip install -U cocoindex

2. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. CocoIndex uses it for incremental processing.

## Define data flow

Follow the [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow. An example flow looks like:
@@ -175,6 +177,7 @@ It defines an index flow like this:
|[Text Embedding](examples/text_embedding)| Index text documents with embeddings for semantic search |
|[Code Embedding](examples/code_embedding)| Index code embeddings for semantic search |
|[PDF Embedding](examples/pdf_embedding)| Parse PDF and index text embeddings for semantic search |
|[PDF Elements Embedding](examples/pdf_elements_embedding)| Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
|[Manuals LLM Extraction](examples/manuals_llm_extraction)| Extract structured information from a manual using LLM |
|[Amazon S3 Embedding](examples/amazon_s3_embedding)| Index text documents from Amazon S3 |
|[Azure Blob Storage Embedding](examples/azure_blob_embedding)| Index text documents from Azure Blob Storage |
@@ -191,16 +194,18 @@ It defines an index flow like this:
|[Custom Output Files](examples/custom_output_files)| Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets*|
|[Patient intake form extraction](examples/patient_intake_extraction)| Use LLM to extract structured data from patient intake forms with different formats |

More are coming, so stay tuned 👀!

## 📖 Documentation

For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).

## 🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).

## 👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:
@@ -210,8 +215,10 @@ Join our community here:
- ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
- 📜 [Read our blog posts](https://cocoindex.io/blogs/)

## Support us

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ at our [GitHub repo](https://github.com/cocoindex-io/cocoindex) to stay tuned and help us grow.
We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

## Steps

### Indexing Flow

1. Ingest PDF files from the `source_files` directory.
2. For each PDF page:
   - Extract page text and images using `pypdf`.
   - Skip very small images and create thumbnails up to 512×512 for consistency.
   - Split text into chunks with `SplitRecursively` (language="text", chunk_size=600, chunk_overlap=100).
   - Embed text chunks with SentenceTransformers (`all-MiniLM-L6-v2`).
   - Embed images with CLIP (`openai/clip-vit-large-patch14`).
3. Save embeddings and metadata in Qdrant:
   - Text collection: `PdfElementsEmbeddingText`
   - Image collection: `PdfElementsEmbeddingImage`
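
The thumbnail and chunking steps above can be sketched in plain Python to make the parameters concrete. This is not the `SplitRecursively` function itself, nor the example's actual image code: it is a simplified fixed-window splitter plus a size calculation. It assumes "up to 512×512" means scaling an image down to fit a 512×512 box while keeping its aspect ratio. `chunk_text` and `thumbnail_size` are hypothetical helper names.

```python
def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 100) -> list[str]:
    """Fixed-size sliding window: each chunk shares `chunk_overlap` chars with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def thumbnail_size(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Scale (width, height) down to fit in max_side x max_side, keeping the aspect ratio."""
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


chunks = chunk_text("x" * 1300)
print([len(c) for c in chunks])    # [600, 600, 300]
print(thumbnail_size(2048, 1024))  # (512, 256)
```

With `chunk_size=600` and `chunk_overlap=100`, each window starts 500 characters after the previous one, so neighbouring chunks share 100 characters of context.
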

## Prerequisite

[Install Qdrant](https://qdrant.tech/documentation/guides/installation/) if you don't have one running locally.

Start Qdrant with Docker (exposes HTTP 6333 and gRPC 6334):

```bash
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Note: This example connects via gRPC at `http://localhost:6334`.
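
Before running the flow, it can help to confirm that both ports are reachable. The sketch below uses only the Python standard library; `port_open` is a hypothetical helper, and the host and ports are the ones from the Docker command above. It only checks that something accepts TCP connections, not that the service is actually Qdrant.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeouts, and name resolution errors
        return False


for name, port in [("HTTP", 6333), ("gRPC", 6334)]:
    status = "reachable" if port_open("localhost", port) else "not reachable"
    print(f"Qdrant {name} port {port}: {status}")
```
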

## Input Data Preparation

Download a few sample PDFs (all are board game manuals) and put them into the `source_files` directory by running:

```bash
./fetch_manual_urls.sh
```

You can also put your favorite PDFs into the `source_files` directory.
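
To check which files are in place before indexing, a short script can list the PDFs in that directory. This is a standard-library sketch, not part of the example: `list_pdfs` is a hypothetical helper, it only looks at files directly under the directory, and it assumes the script runs from the example's root folder, next to `source_files`.

```python
from pathlib import Path


def list_pdfs(directory: str = "source_files") -> list[Path]:
    """Return PDF files directly under `directory`, matching the suffix case-insensitively."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")


for pdf in list_pdfs():
    print(f"{pdf.name}: {pdf.stat().st_size / 1024:.1f} KiB")
```
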

## Run

Install dependencies:

```bash
pip install -e .
```

Update the index, which also sets up the tables on the first run:

```bash
cocoindex update --setup main
```

## CocoInsight

I used CocoInsight (free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:

```bash
cocoindex server -ci main
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).