README.md (+2 -1): 2 additions & 1 deletion
@@ -132,11 +132,12 @@ It defines an index flow like this:
 |[Code Embedding](examples/code_embedding)| Index code embeddings for semantic search |
 |[PDF Embedding](examples/pdf_embedding)| Parse PDF and index text embeddings for semantic search |
 |[Manuals LLM Extraction](examples/manuals_llm_extraction)| Extract structured information from a manual using LLM |
+|[Amazon S3 Embedding](examples/amazon_s3_embedding)| Index text documents from Amazon S3 |
 |[Google Drive Text Embedding](examples/gdrive_text_embedding)| Index text documents from Google Drive |
 |[Docs to Knowledge Graph](examples/docs_to_knowledge_graph)| Extract relationships from Markdown documents and build a knowledge graph |
 |[Embeddings to Qdrant](examples/text_embedding_qdrant)| Index documents in a Qdrant collection for semantic search |
 |[FastAPI Server with Docker](examples/fastapi_server_docker)| Run the semantic search server in a Dockerized FastAPI setup |
-|[Product_Taxonomy_Knowledge_Graph](examples/product_taxonomy_knowledge_graph)| Build knowledge graph for product recommendations |
+|[Product Recommendation](examples/product_recommendation)| Build real-time product recommendations with LLM and graph database |
 |[Image Search with Vision API](examples/image_search_example)| Generates detailed captions for images using a vision model, embeds them, and enables live-updating semantic search via FastAPI, served on a React frontend |
 description: "CocoIndex basic concepts for indexing: indexing flow, data, operations, data updates, etc."
 ---

-# CocoIndex Basics
+# CocoIndex Indexing Basics

 An **index** is a collection of data stored in a way that is easy for retrieval.

-CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. indexing. It also offers utilities for users to retrieve data from the indexes.
+CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. **indexing**. It also offers utilities for users to retrieve data from the indexes.

-## Indexing flow
+An **indexing flow** extracts data from specified data sources, applies specified transformations, and puts the transformed data into specified storage for later retrieval.

-An indexing flow extracts data from specified data sources, upon specified transformations, and puts the transformed data into specified storage for later retrieval.
+## Indexing flow elements

 An indexing flow has two aspects: data and operations on data.
@@ -42,7 +42,7 @@ An **operation** in an indexing flow defines a step in the flow. An operation is

 "import" and "transform" operations produce output data, whose data type is determined based on the operation spec and data types of input data (for "transform" operation only).

-### Example
+## An indexing flow example

 For the example shown in the [Quickstart](../getting_started/quickstart) section, the indexing flow is as follows:
@@ -60,7 +60,7 @@ This shows schema and example data for the indexing flow:

-###Life cycle of an indexing flow
+## Life cycle of an indexing flow

 An indexing flow, once set up, maintains a long-lived relationship between data source and data in target storage. This means:
@@ -95,19 +95,10 @@ CocoIndex works the same way, but with more powerful capabilities:

 This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and target data behind the scenes.

-###Internal storage
+## Internal storage

 As an indexing flow is long-lived, it needs to store intermediate data to keep track of the states.
 CocoIndex uses internal storage for this purpose.

 Currently, CocoIndex uses a Postgres database as the internal storage.
-See [Initialization](initialization) for configuring its location, and `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.
-
-## Retrieval
-
-There are two ways to retrieve data from target storage built by an indexing flow:
-
-* Query the underlying target storage directly for maximum flexibility.
-* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the target data.
-
-Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the target storage that was created by the flow.
+See [Initialization](initialization) for configuring its location. The `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.
@@ -311,6 +311,38 @@ Following metrics are supported:

 ## Miscellaneous

+### Getting App Namespace
+
+You can use the [`app_namespace` setting](initialization#app-namespace) or `COCOINDEX_APP_NAMESPACE` environment variable to specify the app namespace,
+to organize flows across different environments (e.g., dev, staging, production), team members, etc.
+
+In the code, you can call `flow.get_app_namespace()` to get the app namespace, and use it to name certain backends. It takes the following arguments:
+
+* `trailing_delimiter` (optional): a string to append to the app namespace when it's not empty.
+
+For example, when the current app namespace is `Staging`, `flow.get_app_namespace(trailing_delimiter='.')` returns `Staging.`.
+It will use `Staging__doc_embeddings` as the collection name if the current app namespace is `Staging`, and use `doc_embeddings` if the app namespace is empty.
+
 ### Target Declarations

 Most of the time, a target storage is created by calling the `export()` method on a collector, and this `export()` call comes with the configurations needed for the target storage, e.g. options for storage indexes.
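The `trailing_delimiter` behavior described under Getting App Namespace can be sketched in plain Python. This is a minimal illustration that does not call CocoIndex: `namespace_prefix` and `collection_name` are hypothetical stand-ins, and the `__` delimiter is inferred from the `Staging__doc_embeddings` example rather than confirmed by the diff.

```python
from typing import Optional


def namespace_prefix(app_namespace: str, trailing_delimiter: Optional[str] = None) -> str:
    """Hypothetical stand-in for the documented behavior (not CocoIndex code).

    The delimiter is appended only when the namespace is non-empty.
    """
    if app_namespace and trailing_delimiter:
        return app_namespace + trailing_delimiter
    return app_namespace


def collection_name(app_namespace: str, base_name: str = "doc_embeddings") -> str:
    # Assumed delimiter "__", matching the `Staging__doc_embeddings` example.
    return namespace_prefix(app_namespace, trailing_delimiter="__") + base_name


print(namespace_prefix("Staging", trailing_delimiter="."))
print(collection_name("Staging"))
print(collection_name(""))
```

Appending the delimiter only for a non-empty namespace keeps backend names free of a dangling separator when no namespace is configured.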
docs/docs/core/flow_methods.mdx (+1 -1): 1 addition & 1 deletion
@@ -105,7 +105,7 @@ A data source may enable one or multiple *change capture mechanisms*:
 * Configured with a [refresh interval](flow_def#refresh-interval), which is generally applicable to all data sources.

 * Specific data sources also provide their specific change capture mechanisms.
-For example, [`GoogleDrive` source](../ops/sources#googledrive) allows polling recent modified files.
+For example, [`AmazonS3` source](../ops/sources/#amazons3) watches the S3 bucket's change events, and [`GoogleDrive` source](../ops/sources#googledrive) allows polling recently modified files.
 See the documentation for specific data sources.

 Change capture mechanisms enable CocoIndex to continuously capture changes from the source data and update the target data accordingly, under live update mode.
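To illustrate the refresh-interval idea in generic terms: a poller can compare successive snapshots of the source and report what was added, modified, or removed since the last pass. The sketch below is plain Python and is not CocoIndex's implementation; `diff_snapshots`, `Changes`, and the snapshot shape (entry key mapped to a modification timestamp) are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Changes(NamedTuple):
    added: List[str]
    modified: List[str]
    removed: List[str]


def diff_snapshots(previous: Dict[str, float], current: Dict[str, float]) -> Changes:
    """Compare two {key: modified_time} snapshots taken one refresh interval apart."""
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    modified = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return Changes(added, modified, removed)


before = {"a.md": 100.0, "b.md": 100.0}
after = {"a.md": 100.0, "b.md": 250.0, "c.md": 300.0}
print(diff_snapshots(before, after))
```

A source-specific mechanism such as bucket change events would deliver these changes directly instead of deriving them from a full rescan.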
docs/docs/core/initialization.mdx (+13): 13 additions & 0 deletions
@@ -83,8 +83,20 @@ if __name__ == "__main__":

 `cocoindex.Settings` is used to configure the CocoIndex library. It's a dataclass that contains the following fields:

+* `app_namespace` (type: `str`, required): The namespace of the application.
 * `database` (type: `DatabaseConnectionSpec`, required): The connection to the Postgres database.

+### App Namespace
+
+The `app_namespace` field helps organize flows across different environments (e.g., dev, staging, production), team members, etc. When set, it prefixes flow names with the namespace.
+
+For example, if the namespace is `Staging`, a flow named `Flow1` in code has the full name `Staging.Flow1`.
+You can also get the current app namespace by calling `cocoindex.get_app_namespace()` (see [Getting App Namespace](flow_def#getting-app-namespace) for more details).
+
+If not set, all flows are in a default unnamed namespace.
+
+You can also set it through the `COCOINDEX_APP_NAMESPACE` environment variable.
+
 ### DatabaseConnectionSpec

 `DatabaseConnectionSpec` configures the connection to a database. Only Postgres is supported for now. It has the following fields:
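The flow-name prefixing described under App Namespace can be sketched as follows. This is an illustration only and does not use CocoIndex: `full_flow_name` is a hypothetical helper, and leaving the name unprefixed for the unnamed namespace is an assumption based on the `Staging.Flow1` example.

```python
import os


def full_flow_name(flow_name: str, app_namespace: str = "") -> str:
    """Hypothetical helper: prefix the flow name with the namespace, if any."""
    return f"{app_namespace}.{flow_name}" if app_namespace else flow_name


# Read the namespace from the documented environment variable; empty when unset.
namespace = os.environ.get("COCOINDEX_APP_NAMESPACE", "")
print(full_flow_name("Flow1", namespace))
```

With `COCOINDEX_APP_NAMESPACE=Staging` in the environment, the sketch yields `Staging.Flow1`, so the same flow definition can coexist across dev, staging, and production.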
@@ -116,6 +128,7 @@ Each setting field has a corresponding environment variable:

 | environment variable | corresponding field in `Settings` | required? |