Commit bc33ede

example(meeting-notes-kg): add example for meeting notes knowledge graph (#1297)
* example(meeting-notes-kg): add example for meeting notes knowledge graph
* minor cleanups
1 parent 1bf8ab2 commit bc33ede

7 files changed

+336
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -189,6 +189,7 @@ It defines an index flow like this:
 | [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
 | [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |
 | [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
+| [Meeting Notes to Knowledge Graph](examples/meeting_notes_graph) | Extract structured meeting info from Google Drive and build a knowledge graph |
 | [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph |
 | [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search |
 | [Embeddings to LanceDB](examples/text_embedding_lancedb) | Index documents in a LanceDB collection for semantic search |

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ Check out our [examples documentation](https://cocoindex.io/docs/examples) for m
 - 🏥 [**patient_intake_extraction_baml**](./patient_intake_extraction_baml) - Extract structured data from patient intake PDFs using BAML
 - 📖 [**manuals_llm_extraction**](./manuals_llm_extraction) - Extract structured information from PDF manuals using Ollama
 - 📄 [**paper_metadata**](./paper_metadata) - Extract metadata (title, authors, abstract) from research papers in PDF
+- 📝 [**meeting_notes_graph**](./meeting_notes_graph) - Extract structured meeting info from Google Drive and build a knowledge graph
 
 ## Custom Sources & Targets

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# OpenAI API key.
#! PLEASE FILL IN
OPENAI_API_KEY=

# Google Drive service account credential path.
#! PLEASE FILL IN
GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account_credential.json

# Google Drive root folder IDs, comma separated.
#! PLEASE FILL IN
GOOGLE_DRIVE_ROOT_FOLDER_IDS=id1,id2
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
.env
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# Build Meeting Notes Knowledge Graph from Google Drive

We will extract structured information from meeting notes stored in Google Drive and build a knowledge graph in Neo4j. The flow ingests Markdown notes, splits them by headings into meetings, uses an LLM to parse participants, organizer, time, and tasks, and then writes nodes and relationships into a graph database.

Please drop [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) a star to support us and stay tuned for more updates. Thank you so much 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

## What this builds

The pipeline defines:

- Meeting nodes: one per meeting section, keyed by source note file and meeting time
- Person nodes: people who organized or attended meetings
- Task nodes: tasks decided in meetings
- Relationships:
  - `ATTENDED` Person → Meeting (the organizer is included and is collected with `is_organizer=True`)
  - `DECIDED` Meeting → Task
  - `ASSIGNED_TO` Person → Task

The source is Google Drive folders shared with a service account. The flow watches for recent changes and keeps the graph up to date.
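
To make the node and relationship shapes above concrete, here is a minimal pure-Python sketch of how one extracted meeting fans out into rows. The dataclasses mirror the ones in this example's flow definition; the `fan_out` helper and the sample data are hypothetical and only illustrate the shape of the rows. The real flow uses CocoIndex collectors and also attaches a generated UUID `id` to every relationship row, which this sketch omits.

```python
from dataclasses import dataclass
import datetime


@dataclass
class Person:
    name: str


@dataclass
class Task:
    description: str
    assigned_to: list[Person]


@dataclass
class Meeting:
    time: datetime.date
    note: str
    organizer: Person
    participants: list[Person]
    tasks: list[Task]


def fan_out(note_file: str, meeting: Meeting) -> dict[str, list[dict]]:
    """Hypothetical helper: turn one meeting into node and relationship rows."""
    # A meeting is identified by its source file plus its date.
    key = {"note_file": note_file, "time": meeting.time}
    rows: dict[str, list[dict]] = {
        "meeting_nodes": [{**key, "note": meeting.note}],
        "attended": [],
        "decided": [],
        "assigned_to": [],
    }
    # The organizer also attends; only this row carries the organizer flag.
    rows["attended"].append(
        {**key, "person": meeting.organizer.name, "is_organizer": True}
    )
    for participant in meeting.participants:
        rows["attended"].append({**key, "person": participant.name})
    for task in meeting.tasks:
        rows["decided"].append({**key, "description": task.description})
        for assignee in task.assigned_to:
            rows["assigned_to"].append(
                {**key, "task": task.description, "person": assignee.name}
            )
    return rows


sample = Meeting(
    time=datetime.date(2025, 1, 6),
    note="Kickoff for the launch.",
    organizer=Person("Alice"),
    participants=[Person("Bob"), Person("Carol")],
    tasks=[
        Task("Draft the launch plan", [Person("Bob")]),
        Task("Review the budget", [Person("Alice"), Person("Carol")]),
    ],
)
rows = fan_out("notes.md", sample)
```

Each `attended` row later becomes one `ATTENDED` relationship, each `decided` row one `DECIDED` relationship, and each `assigned_to` row one `ASSIGNED_TO` relationship.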

## How it works

1. Ingest files from Google Drive (service account + root folder IDs)
2. Split each note by Markdown headings into meeting sections
3. Use an LLM to extract a structured `Meeting` object: time, note, organizer, participants, and tasks (with assignees)
4. Collect nodes and relationships in-memory
5. Export to Neo4j:
   - Nodes: `Meeting` (explicit export), `Person` and `Task` (declared with primary keys)
   - Relationships: `ATTENDED`, `DECIDED`, `ASSIGNED_TO`
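
Step 2 can be approximated in plain Python to see which headings start a new meeting section. This is a sketch, not the implementation of `SplitBySeparators`: it reuses the separator regex from the flow definition and assumes that `keep_separator="RIGHT"` attaches the matched heading to the section that follows it and that surrounding whitespace is trimmed. The `split_by_headings` name and the sample note are hypothetical.

```python
import re

# Same pattern the flow passes to SplitBySeparators: a blank line followed by
# a level-1 or level-2 Markdown heading marker and a space.
HEADING_SEPARATOR = re.compile(r"\n\n##?\ ")


def split_by_headings(text: str) -> list[str]:
    """Split text at each separator, keeping the heading with the section after it."""
    starts = [0] + [match.start() for match in HEADING_SEPARATOR.finditer(text)]
    ends = starts[1:] + [len(text)]
    # Trimming whitespace here is an assumption about the library's behavior.
    chunks = (text[start:end].strip() for start, end in zip(starts, ends))
    return [chunk for chunk in chunks if chunk]


note = (
    "# 2025-01-06 Kickoff\nAlice, Bob\n\n"
    "## 2025-01-13 Sync\nBob\n\n"
    "# 2025-01-20 Review\nCarol"
)
sections = split_by_headings(note)
```

Because the pattern requires a blank line before the heading, a heading on the very first line of the file stays with the first section, and level-3 headings (`###`) do not start a new section.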

## Prerequisites

- Install [Neo4j](https://cocoindex.io/docs/targets/neo4j) and start it locally
  - Default local browser: <http://localhost:7474>
  - Default credentials used in this example: username `neo4j`, password `cocoindex`
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai)
- Prepare Google Drive:
  - Create a Google Cloud service account and download its JSON credential
  - Share the source folders with the service account email
  - Collect the root folder IDs you want to ingest
  - See [Setup for Google Drive](https://cocoindex.io/docs/sources/googledrive#setup-for-google-drive) for details

## Environment

Set the following environment variables:

```sh
export OPENAI_API_KEY=sk-...
export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/absolute/path/to/service_account.json
export GOOGLE_DRIVE_ROOT_FOLDER_IDS=folderId1,folderId2
```

Notes:

- `GOOGLE_DRIVE_ROOT_FOLDER_IDS` accepts a comma-separated list of folder IDs
- The flow polls Google Drive for recent changes every 10 seconds and refreshes the source every minute (`recent_changes_poll_interval` and `refresh_interval` in the flow definition)
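
The flow reads the folder list with a plain `split(",")`, so a space after a comma or a trailing comma ends up inside an ID or as an empty ID. A more forgiving parser could look like the sketch below; `parse_folder_ids` is a hypothetical helper and is not part of this example or of CocoIndex.

```python
def parse_folder_ids(raw: str) -> list[str]:
    """Parse a comma-separated list of folder IDs, ignoring blanks and whitespace."""
    ids = [part.strip() for part in raw.split(",")]
    ids = [folder_id for folder_id in ids if folder_id]
    if not ids:
        raise ValueError("GOOGLE_DRIVE_ROOT_FOLDER_IDS must contain at least one folder ID")
    return ids


folder_ids = parse_folder_ids("folderId1, folderId2,")
```

To use it, the flow would call `parse_folder_ids(os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"])` instead of splitting the string directly.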

## Run

### Build/update the graph

Install dependencies:

```bash
pip install -e .
```

Update the index (run the flow once to build/update the graph):

```bash
cocoindex update main
```

### Browse the knowledge graph

Open Neo4j Browser at <http://localhost:7474>.

Sample Cypher queries:

```cypher
// All relationships
MATCH p=()-->() RETURN p

// Who attended which meetings (including organizer)
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)
RETURN p, m

// Tasks decided in meetings
MATCH (m:Meeting)-[:DECIDED]->(t:Task)
RETURN m, t

// Task assignments
MATCH (p:Person)-[:ASSIGNED_TO]->(t:Task)
RETURN p, t
```

## CocoInsight

CocoInsight (free beta for now) helps you troubleshoot index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server with zero pipeline data retention.

Start CocoInsight:

```bash
cocoindex server -ci main
```

Then open the UI at <https://cocoindex.io/cocoinsight>.
Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
"""
This example shows how to extract structured meeting information from Markdown notes and build a knowledge graph.
"""

from dataclasses import dataclass
import datetime
import cocoindex
import os

conn_spec = cocoindex.add_auth_entry(
    "Neo4jConnection",
    cocoindex.targets.Neo4jConnection(
        uri="bolt://localhost:7687",
        user="neo4j",
        password="cocoindex",
    ),
)


@dataclass
class Person:
    name: str


@dataclass
class Task:
    description: str
    assigned_to: list[Person]


@dataclass
class Meeting:
    time: datetime.date
    note: str
    organizer: Person
    participants: list[Person]
    tasks: list[Task]


@cocoindex.flow_def(name="MeetingNotesGraph")
def meeting_notes_graph_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    """
    Define an example flow that extracts meetings from note files and builds a knowledge graph.
    """
    credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
    root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")

    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path=credential_path,
            root_folder_ids=root_folder_ids,
            recent_changes_poll_interval=datetime.timedelta(seconds=10),
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )

    meeting_nodes = data_scope.add_collector()
    attended_rels = data_scope.add_collector()
    decided_tasks_rels = data_scope.add_collector()
    assigned_rels = data_scope.add_collector()

    with data_scope["documents"].row() as document:
        document["meetings"] = document["content"].transform(
            cocoindex.functions.SplitBySeparators(
                separators_regex=[r"\n\n##?\ "], keep_separator="RIGHT"
            )
        )
        with document["meetings"].row() as meeting:
            parsed = meeting["parsed"] = meeting["text"].transform(
                cocoindex.functions.ExtractByLlm(
                    llm_spec=cocoindex.LlmSpec(
                        api_type=cocoindex.LlmApiType.OPENAI, model="gpt-5"
                    ),
                    output_type=Meeting,
                )
            )
            meeting_key = {"note_file": document["filename"], "time": parsed["time"]}
            meeting_nodes.collect(**meeting_key, note=parsed["note"])

            attended_rels.collect(
                id=cocoindex.GeneratedField.UUID,
                **meeting_key,
                person=parsed["organizer"]["name"],
                is_organizer=True,
            )
            with parsed["participants"].row() as participant:
                attended_rels.collect(
                    id=cocoindex.GeneratedField.UUID,
                    **meeting_key,
                    person=participant["name"],
                )

            with parsed["tasks"].row() as task:
                decided_tasks_rels.collect(
                    id=cocoindex.GeneratedField.UUID,
                    **meeting_key,
                    description=task["description"],
                )
                with task["assigned_to"].row() as assigned_to:
                    assigned_rels.collect(
                        id=cocoindex.GeneratedField.UUID,
                        **meeting_key,
                        task=task["description"],
                        person=assigned_to["name"],
                    )

    meeting_nodes.export(
        "meeting_nodes",
        cocoindex.targets.Neo4j(
            connection=conn_spec, mapping=cocoindex.targets.Nodes(label="Meeting")
        ),
        primary_key_fields=["note_file", "time"],
    )
    flow_builder.declare(
        cocoindex.targets.Neo4jDeclaration(
            connection=conn_spec,
            nodes_label="Person",
            primary_key_fields=["name"],
        )
    )
    flow_builder.declare(
        cocoindex.targets.Neo4jDeclaration(
            connection=conn_spec,
            nodes_label="Task",
            primary_key_fields=["description"],
        )
    )
    attended_rels.export(
        "attended_rels",
        cocoindex.targets.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.targets.Relationships(
                rel_type="ATTENDED",
                source=cocoindex.targets.NodeFromFields(
                    label="Person",
                    fields=[
                        cocoindex.targets.TargetFieldMapping(
                            source="person", target="name"
                        )
                    ],
                ),
                target=cocoindex.targets.NodeFromFields(
                    label="Meeting",
                    fields=[
                        cocoindex.targets.TargetFieldMapping("note_file"),
                        cocoindex.targets.TargetFieldMapping("time"),
                    ],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
    decided_tasks_rels.export(
        "decided_tasks_rels",
        cocoindex.targets.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.targets.Relationships(
                rel_type="DECIDED",
                source=cocoindex.targets.NodeFromFields(
                    label="Meeting",
                    fields=[
                        cocoindex.targets.TargetFieldMapping("note_file"),
                        cocoindex.targets.TargetFieldMapping("time"),
                    ],
                ),
                target=cocoindex.targets.NodeFromFields(
                    label="Task",
                    fields=[
                        cocoindex.targets.TargetFieldMapping("description"),
                    ],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
    assigned_rels.export(
        "assigned_rels",
        cocoindex.targets.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.targets.Relationships(
                rel_type="ASSIGNED_TO",
                source=cocoindex.targets.NodeFromFields(
                    label="Person",
                    fields=[
                        cocoindex.targets.TargetFieldMapping(
                            source="person", target="name"
                        ),
                    ],
                ),
                target=cocoindex.targets.NodeFromFields(
                    label="Task",
                    fields=[
                        cocoindex.targets.TargetFieldMapping(
                            source="task", target="description"
                        ),
                    ],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
[project]
name = "cocoindex-meeting-notes-graph"
version = "0.1.0"
description = "Simple example for CocoIndex: extract structured meeting info from Google Drive notes and build knowledge graph."
requires-python = ">=3.11"
dependencies = ["cocoindex>=0.3.8"]

[tool.setuptools]
packages = []
