Merged
1 change: 1 addition & 0 deletions README.md
@@ -189,6 +189,7 @@ It defines an index flow like this:
| [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
| [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |
| [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
| [Meeting Notes to Knowledge Graph](examples/meeting_notes_graph) | Extract structured meeting info from Google Drive and build a knowledge graph |
| [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph |
| [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search |
| [Embeddings to LanceDB](examples/text_embedding_lancedb) | Index documents in a LanceDB collection for semantic search |
1 change: 1 addition & 0 deletions examples/README.md
@@ -32,6 +32,7 @@ Check out our [examples documentation](https://cocoindex.io/docs/examples) for m
- 🏥 [**patient_intake_extraction_baml**](./patient_intake_extraction_baml) - Extract structured data from patient intake PDFs using BAML
- 📖 [**manuals_llm_extraction**](./manuals_llm_extraction) - Extract structured information from PDF manuals using Ollama
- 📄 [**paper_metadata**](./paper_metadata) - Extract metadata (title, authors, abstract) from research papers in PDF
- 📝 [**meeting_notes_graph**](./meeting_notes_graph) - Extract structured meeting info from Google Drive and build a knowledge graph

## Custom Sources & Targets

14 changes: 14 additions & 0 deletions examples/meeting_notes_graph/.env.example
@@ -0,0 +1,14 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# OpenAI API key.
#! PLEASE FILL IN
OPENAI_API_KEY=

# Google Drive service account credential path.
#! PLEASE FILL IN
GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account_credential.json

# Google Drive root folder IDs, comma separated.
#! PLEASE FILL IN
GOOGLE_DRIVE_ROOT_FOLDER_IDS=id1,id2
1 change: 1 addition & 0 deletions examples/meeting_notes_graph/.gitignore
@@ -0,0 +1 @@
.env
107 changes: 107 additions & 0 deletions examples/meeting_notes_graph/README.md
@@ -0,0 +1,107 @@
# Build Meeting Notes Knowledge Graph from Google Drive

We will extract structured information from meeting notes stored in Google Drive and build a knowledge graph in Neo4j. The flow ingests Markdown notes, splits them by headings into meetings, uses an LLM to parse participants, organizer, time, and tasks, and then writes nodes and relationships into a graph database.
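For reference, a hypothetical note in the shape this flow expects, with one `##` heading per meeting (the exact wording is up to the LLM extraction; this is just an illustration):

```markdown
# Team notes

## Weekly sync (2024-05-01)

Organizer: Alice. Participants: Alice, Bob.

Tasks:
- Bob: draft the Q2 roadmap
```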

Please drop [CocoIndex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us and stay tuned for more updates. Thank you so much 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

## What this builds

The pipeline defines:

- Meeting nodes: one per meeting section, keyed by source note file and meeting time
- Person nodes: people who organized or attended meetings
- Task nodes: tasks decided in meetings
- Relationships:
  - `ATTENDED` Person → Meeting (the organizer is also collected as an attendee, flagged with `is_organizer`)
  - `DECIDED` Meeting → Task
  - `ASSIGNED_TO` Person → Task

The source is Google Drive folders shared with a service account. The flow watches for recent changes and keeps the graph up to date.

## How it works

1. Ingest files from Google Drive (service account + root folder IDs)
2. Split each note by Markdown headings into meeting sections
3. Use an LLM to extract a structured `Meeting` object: time, note, organizer, participants, and tasks (with assignees)
4. Collect nodes and relationships in-memory
5. Export to Neo4j:
- Nodes: `Meeting` (explicit export), `Person` and `Task` (declared with primary keys)
- Relationships: `ATTENDED`, `DECIDED`, `ASSIGNED_TO`
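Conceptually, step 2 splits each note before every level-1/2 heading while keeping the heading attached to the section that follows it. This is a plain-Python sketch of that behavior, not the CocoIndex `SplitBySeparators` implementation:

```python
import re

def split_meetings(content: str) -> list[str]:
    # A lookahead splits at "\n\n" that precedes a "#" or "##" heading,
    # so the heading stays with its section (akin to keep_separator="RIGHT").
    parts = re.split(r"\n\n(?=##? )", content)
    return [p for p in parts if p.strip()]

notes = "# Notes\n\n## Sync 2024-05-01\nAlice, Bob\n\n## Sync 2024-05-08\nCarol"
sections = split_meetings(notes)
print(len(sections))  # prints 3: the preamble plus one section per "##" heading
```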

## Prerequisites

- Install [Neo4j](https://cocoindex.io/docs/targets/neo4j) and start it locally
- Default local browser: <http://localhost:7474>
- Default credentials used in this example: username `neo4j`, password `cocoindex`
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai)
- Prepare Google Drive:
- Create a Google Cloud service account and download its JSON credential
- Share the source folders with the service account email
- Collect the root folder IDs you want to ingest
- See [Setup for Google Drive](https://cocoindex.io/docs/sources/googledrive#setup-for-google-drive) for details
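For the Neo4j prerequisite, one way to start it locally with the credentials above is Docker (an assumption; any local Neo4j 5.x install works):

```sh
# Run Neo4j with the username/password this example expects.
docker run -d --name neo4j-cocoindex \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/cocoindex \
  neo4j:5
```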

## Environment

Set the following environment variables:

```sh
export OPENAI_API_KEY=sk-...
export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/absolute/path/to/service_account.json
export GOOGLE_DRIVE_ROOT_FOLDER_IDS=folderId1,folderId2
```

Notes:

- `GOOGLE_DRIVE_ROOT_FOLDER_IDS` accepts a comma-separated list of folder IDs
- The flow polls recent changes and refreshes periodically
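The flow reads the folder-ID variable with a bare comma split (mirroring `main.py`), so avoid spaces around the commas:

```python
import os

# Same parsing as main.py: a plain comma split, no whitespace stripping.
os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"] = "folderId1,folderId2"  # example value
root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")
print(root_folder_ids)  # prints ['folderId1', 'folderId2']
```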

## Run

### Build/update the graph

Install dependencies:

```bash
pip install -e .
```

Update the index (run the flow once to build/update the graph):

```bash
cocoindex update main
```

### Browse the knowledge graph

Open Neo4j Browser at <http://localhost:7474>.

Sample Cypher queries:

```cypher
// All relationships
MATCH p=()-->() RETURN p

// Who attended which meetings (including organizer)
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)
RETURN p, m

// Tasks decided in meetings
MATCH (m:Meeting)-[:DECIDED]->(t:Task)
RETURN m, t

// Task assignments
MATCH (p:Person)-[:ASSIGNED_TO]->(t:Task)
RETURN p, t
```

## CocoInsight

I used CocoInsight (currently in free beta) to troubleshoot the index build and understand the pipeline's data lineage. It connects to your local CocoIndex server with zero pipeline data retention.

Start CocoInsight:

```bash
cocoindex server -ci main
```

Then open the UI at <https://cocoindex.io/cocoinsight>.
203 changes: 203 additions & 0 deletions examples/meeting_notes_graph/main.py
@@ -0,0 +1,203 @@
"""
This example shows how to extract structured meeting information from notes in Google Drive and build a knowledge graph.
"""

from dataclasses import dataclass
import datetime
import cocoindex
import os

conn_spec = cocoindex.add_auth_entry(
    "Neo4jConnection",
    cocoindex.targets.Neo4jConnection(
        uri="bolt://localhost:7687",
        user="neo4j",
        password="cocoindex",
    ),
)


@dataclass
class Person:
    name: str


@dataclass
class Task:
    description: str
    assigned_to: list[Person]


@dataclass
class Meeting:
    time: datetime.date
    note: str
    organizer: Person
    participants: list[Person]
    tasks: list[Task]
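

# For illustration only (hypothetical values): the LLM fills a Meeting like
#   Meeting(time=datetime.date(2024, 5, 1), note="Weekly sync",
#           organizer=Person("Alice"),
#           participants=[Person("Alice"), Person("Bob")],
#           tasks=[Task("Draft the Q2 roadmap", assigned_to=[Person("Bob")])])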


@cocoindex.flow_def(name="MeetingNotesGraph")
def meeting_notes_graph_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    """
    Define an example flow that extracts meeting information from notes and builds a knowledge graph.
    """
    credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
    root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")

    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path=credential_path,
            root_folder_ids=root_folder_ids,
            recent_changes_poll_interval=datetime.timedelta(seconds=10),
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )

    meeting_nodes = data_scope.add_collector()
    attended_rels = data_scope.add_collector()
    decided_tasks_rels = data_scope.add_collector()
    assigned_rels = data_scope.add_collector()

    with data_scope["documents"].row() as document:
        document["meetings"] = document["content"].transform(
            cocoindex.functions.SplitBySeparators(
                separators_regex=[r"\n\n##?\ "], keep_separator="RIGHT"
            )
        )
        with document["meetings"].row() as meeting:
            parsed = meeting["parsed"] = meeting["text"].transform(
                cocoindex.functions.ExtractByLlm(
                    llm_spec=cocoindex.LlmSpec(
                        api_type=cocoindex.LlmApiType.OPENAI, model="gpt-5"
                    ),
                    output_type=Meeting,
                )
            )
            meeting_key = {"note_file": document["filename"], "time": parsed["time"]}
            meeting_nodes.collect(**meeting_key, note=parsed["note"])

            attended_rels.collect(
                id=cocoindex.GeneratedField.UUID,
                **meeting_key,
                person=parsed["organizer"]["name"],
                is_organizer=True,
            )
            with parsed["participants"].row() as participant:
                attended_rels.collect(
                    id=cocoindex.GeneratedField.UUID,
                    **meeting_key,
                    person=participant["name"],
                )

            with parsed["tasks"].row() as task:
                decided_tasks_rels.collect(
                    id=cocoindex.GeneratedField.UUID,
                    **meeting_key,
                    description=task["description"],
                )
                with task["assigned_to"].row() as assigned_to:
                    assigned_rels.collect(
                        id=cocoindex.GeneratedField.UUID,
                        **meeting_key,
                        task=task["description"],
                        person=assigned_to["name"],
                    )

    meeting_nodes.export(
        "meeting_nodes",
        cocoindex.targets.Neo4j(
            connection=conn_spec, mapping=cocoindex.targets.Nodes(label="Meeting")
        ),
        primary_key_fields=["note_file", "time"],
    )
    flow_builder.declare(
        cocoindex.targets.Neo4jDeclaration(
            connection=conn_spec,
            nodes_label="Person",
            primary_key_fields=["name"],
        )
    )
    flow_builder.declare(
        cocoindex.targets.Neo4jDeclaration(
            connection=conn_spec,
            nodes_label="Task",
            primary_key_fields=["description"],
        )
    )
    attended_rels.export(
        "attended_rels",
        cocoindex.targets.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.targets.Relationships(
                rel_type="ATTENDED",
                source=cocoindex.targets.NodeFromFields(
                    label="Person",
                    fields=[
                        cocoindex.targets.TargetFieldMapping(
                            source="person", target="name"
                        )
                    ],
                ),
                target=cocoindex.targets.NodeFromFields(
                    label="Meeting",
                    fields=[
                        cocoindex.targets.TargetFieldMapping("note_file"),
                        cocoindex.targets.TargetFieldMapping("time"),
                    ],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
    decided_tasks_rels.export(
        "decided_tasks_rels",
        cocoindex.targets.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.targets.Relationships(
                rel_type="DECIDED",
                source=cocoindex.targets.NodeFromFields(
                    label="Meeting",
                    fields=[
                        cocoindex.targets.TargetFieldMapping("note_file"),
                        cocoindex.targets.TargetFieldMapping("time"),
                    ],
                ),
                target=cocoindex.targets.NodeFromFields(
                    label="Task",
                    fields=[
                        cocoindex.targets.TargetFieldMapping("description"),
                    ],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
    assigned_rels.export(
        "assigned_rels",
        cocoindex.targets.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.targets.Relationships(
                rel_type="ASSIGNED_TO",
                source=cocoindex.targets.NodeFromFields(
                    label="Person",
                    fields=[
                        cocoindex.targets.TargetFieldMapping(
                            source="person", target="name"
                        ),
                    ],
                ),
                target=cocoindex.targets.NodeFromFields(
                    label="Task",
                    fields=[
                        cocoindex.targets.TargetFieldMapping(
                            source="task", target="description"
                        ),
                    ],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
9 changes: 9 additions & 0 deletions examples/meeting_notes_graph/pyproject.toml
@@ -0,0 +1,9 @@
[project]
name = "cocoindex-meeting-notes-graph"
version = "0.1.0"
description = "Simple example for CocoIndex: extract meeting info from Google Drive notes and build a knowledge graph."
requires-python = ">=3.11"
dependencies = ["cocoindex>=0.3.8"]

[tool.setuptools]
packages = []