Commit 619d7b0

committed
Merge branch 'main' into doc3
2 parents cea0b1f + aaef433 commit 619d7b0

19 files changed

+621
-397
lines changed

README.md

Lines changed: 14 additions & 7 deletions
@@ -22,7 +22,6 @@
2222
<a href="https://trendshift.io/repositories/13939" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13939" alt="cocoindex-io%2Fcocoindex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
2323
</div>
2424

25-
2625
Ultra performant data transformation framework for AI, with core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready at day 0.
2726

2827
⭐ Drop a star to help us grow!
@@ -60,9 +59,8 @@ CocoIndex makes it effortless to transform data with AI, and keep source data an
6059

6160
</br>
6261

63-
64-
6562
## Exceptional velocity
63+
6664
Just declare the transformation as a dataflow with ~100 lines of Python
6765

6866
```python
@@ -86,25 +84,30 @@ CocoIndex follows the idea of [Dataflow](https://en.wikipedia.org/wiki/Dataflow_
8684
**In particular**, developers don't explicitly mutate data by creating, updating, and deleting records. They only need to define the transformations (formulas) over a set of source data.
8785

8886
## Plug-and-Play Building Blocks
87+
8988
Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.
9089

9190
<p align="center">
9291
<img src="https://cocoindex.io/images/components.svg" alt="CocoIndex Features">
9392
</p>
9493

9594
## Data Freshness
95+
9696
CocoIndex keeps source data and target in sync effortlessly.
9797

9898
<p align="center">
9999
<img src="https://github.com/user-attachments/assets/f4eb29b3-84ee-4fa0-a1e2-80eedeeabde6" alt="Incremental Processing" width="700">
100100
</p>
101101

102102
It has out-of-the-box support for incremental indexing:
103+
103104
- minimal recomputation on source or logic change.
104105
- (re-)processing necessary portions; reuse cache when possible
105106

106-
## Quick Start:
107+
## Quick Start
108+
107109
If you're new to CocoIndex, we recommend checking out
110+
108111
- 📖 [Documentation](https://cocoindex.io/docs)
109112
- [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart)
110113
- 🎬 [Quick Start Video Tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT)
@@ -119,7 +122,6 @@ pip install -U cocoindex
119122

120123
2. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. CocoIndex uses it for incremental processing.
121124

122-
123125
## Define data flow
124126

125127
Follow [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow. An example flow looks like:
@@ -175,6 +177,7 @@ It defines an index flow like this:
175177
| [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |
176178
| [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
177179
| [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
180+
| [PDF Elements Embedding](examples/pdf_elements_embedding) | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
178181
| [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
179182
| [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
180183
| [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |
@@ -191,16 +194,18 @@ It defines an index flow like this:
191194
| [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
192195
| [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
193196

194-
195197
More are coming, so stay tuned 👀!
196198

197199
## 📖 Documentation
200+
198201
For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).
199202

200203
## 🤝 Contributing
204+
201205
We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).
202206

203207
## 👥 Community
208+
204209
Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.
205210

206211
Join our community here:
@@ -210,8 +215,10 @@ Join our community here:
210215
- ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
211216
- 📜 [Read our blog posts](https://cocoindex.io/blogs/)
212217

213-
## Support us:
218+
## Support us
219+
214220
We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our GitHub repo [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) to stay tuned and help us grow.
215221

216222
## License
223+
217224
CocoIndex is Apache 2.0 licensed.

docs/docs/ai/llm.mdx

Lines changed: 26 additions & 0 deletions
@@ -28,6 +28,7 @@ We support the following types of LLM APIs:
2828
| [LiteLLM](#litellm) | `LlmApiType.LITE_LLM` |||
2929
| [OpenRouter](#openrouter) | `LlmApiType.OPEN_ROUTER` |||
3030
| [vLLM](#vllm) | `LlmApiType.VLLM` |||
31+
| [Bedrock](#bedrock) | `LlmApiType.BEDROCK` |||
3132

3233
## LLM Tasks
3334

@@ -440,3 +441,28 @@ cocoindex.LlmSpec(
440441

441442
</TabItem>
442443
</Tabs>

### Bedrock

To use the Bedrock API, you need to set up AWS credentials. You can do this by setting the following environment variables:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN` (optional)

A spec for Bedrock looks like this:

<Tabs>
<TabItem value="python" label="Python" default>

```python
cocoindex.LlmSpec(
    api_type=cocoindex.LlmApiType.BEDROCK,
    model="us.anthropic.claude-3-5-haiku-20241022-v1:0",
)
```

</TabItem>
</Tabs>

You can find the full list of models supported by Bedrock [here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html).
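
Before building the spec, it can help to confirm that the credentials are visible to the running process. The following is a minimal sketch that assumes only the environment variable names listed in this section; `missing_aws_credentials` is a hypothetical helper for illustration and is not part of CocoIndex or any AWS SDK. If credentials are supplied some other way, this check does not apply.

```python
import os
from typing import Mapping, Optional

# Names taken from the list in this section; AWS_SESSION_TOKEN is optional,
# so it is deliberately not checked here.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required AWS variable names that are unset or empty.

    Hypothetical helper: pass a mapping to check it, or nothing to
    check the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Missing AWS credentials:", ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```

Running the check at startup gives a clearer error than a failed request later in the flow.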

examples/manuals_llm_extraction/main.py

Lines changed: 3 additions & 0 deletions
@@ -118,6 +118,9 @@ def manual_extraction_flow(
118118
# Replace with the spec below to use an Anthropic API model
119119
# llm_spec=cocoindex.LlmSpec(
120120
# api_type=cocoindex.LlmApiType.ANTHROPIC, model="claude-3-5-sonnet-latest"),
121+
# Replace with the spec below to use a Bedrock API model
122+
# llm_spec=cocoindex.LlmSpec(
123+
# api_type=cocoindex.LlmApiType.BEDROCK, model="us.anthropic.claude-3-5-haiku-20241022-v1:0"),
121124
output_type=ModuleInfo,
122125
instruction="Please extract Python module information from the manual.",
123126
)
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# Fallback to CPU for operations not supported by MPS on Mac.
# It's a no-op on other platforms.
PYTORCH_ENABLE_MPS_FALLBACK=1
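
To see what the database URL above encodes, it can be split with the Python standard library. This is only an illustrative sketch: `describe_database_url` is a hypothetical helper, and nothing here reflects how CocoIndex itself parses the value.

```python
from urllib.parse import urlsplit

# Value copied from the .env file above.
DATABASE_URL = "postgres://cocoindex:cocoindex@localhost/cocoindex"


def describe_database_url(url: str) -> dict:
    """Split a Postgres URL into its parts using only the standard library."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL does not specify a port
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    for key, value in describe_database_url(DATABASE_URL).items():
        print(f"{key}: {value}")
```

The URL carries no explicit port, so `port` comes back as `None` and the client's default applies.
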
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
/source_files
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Extract text and images from PDFs and build multimodal search

[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

In this example, we extract text and images from PDF pages, embed them with two models, and store them in Qdrant for multimodal search:

- Text: SentenceTransformers `all-MiniLM-L6-v2`
- Images: CLIP `openai/clip-vit-large-patch14` (ViT-L/14, 768-dim)

We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

## Steps

### Indexing Flow

1. Ingest PDF files from the `source_files` directory.
2. For each PDF page:
   - Extract page text and images using `pypdf`.
   - Skip very small images and create thumbnails up to 512×512 for consistency.
   - Split text into chunks with `SplitRecursively` (language="text", chunk_size=600, chunk_overlap=100).
   - Embed text chunks with SentenceTransformers (`all-MiniLM-L6-v2`).
   - Embed images with CLIP (`openai/clip-vit-large-patch14`).
3. Save embeddings and metadata in Qdrant:
   - Text collection: `PdfElementsEmbeddingText`
   - Image collection: `PdfElementsEmbeddingImage`
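
The chunking parameters in step 2 can be illustrated with a plain sliding window. This is a simplified stand-in and not what `SplitRecursively` does: as its name suggests, that function splits along structural boundaries, while the hypothetical `sliding_window_chunks` below cuts at fixed character offsets. Treating the sizes as character counts is also an assumption. The sketch only shows how `chunk_size=600` and `chunk_overlap=100` interact.

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 600, chunk_overlap: int = 100
) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters.

    Hypothetical helper for illustration; it ignores sentence and
    paragraph boundaries entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1300))
    for i, chunk in enumerate(sliding_window_chunks(sample)):
        print(i, len(chunk))
```

With these defaults each window advances 500 characters, so neighbouring chunks share 100 characters of context.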

## Prerequisite

[Install Qdrant](https://qdrant.tech/documentation/guides/installation/) if you don't have an instance running locally.

Start Qdrant with Docker (exposes HTTP 6333 and gRPC 6334):

```bash
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Note: This example connects via gRPC at `http://localhost:6334`.
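
A quick way to check that the container is reachable before running the flow is a plain TCP connection test. This sketch uses only the Python standard library; `is_port_open` is a hypothetical helper, and a successful connection only shows that something is listening on the port, not that it is Qdrant.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the docker command above.
    for name, port in (("HTTP", 6333), ("gRPC", 6334)):
        state = "open" if is_port_open("localhost", port) else "closed"
        print(f"{name} port {port}: {state}")
```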

## Input Data Preparation

Download a few sample PDFs (all are board game manuals) and put them into the `source_files` directory by running:

```bash
./fetch_manual_urls.sh
```

You can also put your favorite PDFs into the `source_files` directory.

## Run

Install dependencies:

```bash
pip install -e .
```

Update the index, which also sets up the tables the first time it runs:

```bash
cocoindex update --setup main
```

## CocoInsight

I used CocoInsight (currently in free beta) to troubleshoot index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server with zero pipeline data retention. Run the following command to start CocoInsight:

```bash
cocoindex server -ci main
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
#!/bin/bash
# Arrays are a bash feature, so this script needs bash rather than POSIX sh.

# Sample PDFs (board game manuals) to download.
URLS=(
  https://www.catan.com/sites/default/files/2021-06/catan_base_rules_2020_200707.pdf
  https://michalskig.wordpress.com/wp-content/uploads/2010/10/manilaenglishgame_133_gamerules.pdf
  https://cdn.1j1ju.com/medias/2c/f9/7f-ticket-to-ride-rulebook.pdf
  https://cdn.1j1ju.com/medias/0c/93/d6-stone-age-the-expansion-rulebook.pdf
)

OUTPUT_DIR="source_files"
mkdir -p "$OUTPUT_DIR"
for URL in "${URLS[@]}"; do
  echo "Fetching $URL"
  wget -P "$OUTPUT_DIR" "$URL"
done
