Commit c4055c4

Add knowledge graph building example docs_to_kg (#293)
* Add example `docs_to_kg`.
* Add the new example to `/README.md`.
1 parent 732831b commit c4055c4

8 files changed: +1401 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ Go to the [examples directory](examples) to try out with any of the examples, fo
 | [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
 | [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
 | [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
+| [Docs to Knowledge Graph](examples/docs_to_kg) | Extract relationships from Markdown documents and build a knowledge graph |

 More coming and stay tuned! If there's any specific examples you would like to see, please let us know in our [Discord community](https://discord.com/invite/zpA9S2DR7s) 🌱.

examples/docs_to_kg/.env

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
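
The connection URL above packs the user, password, host, and database name into one string. As a minimal sketch, using only the Python standard library, the following shows how such a URL decomposes. The helper name `describe_database_url` is hypothetical and not part of CocoIndex; the example itself only needs the variable to be set.

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a Postgres connection URL into its components (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL does not specify a port
        "database": parts.path.lstrip("/"),
    }


info = describe_database_url("postgres://cocoindex:cocoindex@localhost/cocoindex")
print(info)
```

Because the URL omits a port, `port` comes back as `None` here; clients then typically fall back to the Postgres default port.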

examples/docs_to_kg/README.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Build Knowledge Graph from Markdown Documents, with OpenAI, Neo4j and CocoIndex

In this example, we

* Extract relationships from Markdown documents.
* Build a knowledge graph from the relationships.

Please give [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) a star to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

## Prerequisites

Before running the example, you need to:

* [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
* [Install Neo4j](https://cocoindex.io/docs/getting_started/installation#-install-neo4j) if you don't have one.
* Install / configure an LLM API. This example uses OpenAI, so you need to [configure an OpenAI API key](https://cocoindex.io/docs/ai/llm#openai) before running it. Alternatively, follow the comments in the source code to switch to Ollama, which runs LLM models locally, and get it ready following [this guide](https://cocoindex.io/docs/ai/llm#ollama).

## Run

### Build the index

Install dependencies:

```bash
pip install -e .
```

Setup:

```bash
python main.py cocoindex setup
```

Update index:

```bash
python main.py cocoindex update
```

### Browse the knowledge graph

After the knowledge graph is built, you can explore it in Neo4j Browser.
You can open it at [http://localhost:7474](http://localhost:7474), and run the following Cypher query to get all relationships:

```cypher
MATCH p=()-->() RETURN p
```
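
The query above returns every relationship. To narrow the result to a single entity, a parameterized query is one option. The sketch below only builds the query text and its parameters. It assumes, based on the flow definition in `main.py`, that nodes carry the `Entity` label with a `value` property, that edges have the type `RELATIONSHIP`, and that `relationship_name` is stored as an edge property. The helper name `outgoing_relationships_query` is hypothetical.

```python
def outgoing_relationships_query(entity: str, limit: int = 25) -> tuple[str, dict]:
    """Build a parameterized Cypher query for relationships leaving one entity."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Assumed schema: (:Entity {value})-[:RELATIONSHIP {relationship_name}]->(:Entity {value})
    query = (
        "MATCH (s:Entity {value: $entity})-[r:RELATIONSHIP]->(t:Entity) "
        "RETURN s.value AS source, r.relationship_name AS relationship, t.value AS target "
        "LIMIT $limit"
    )
    return query, {"entity": entity, "limit": limit}


query, params = outgoing_relationships_query("CocoIndex")
print(query)
print(params)
```

Passing the entity name as a parameter rather than formatting it into the query string avoids quoting problems when names contain apostrophes or other special characters.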

## CocoInsight
CocoInsight is a tool to help you understand your data pipeline and data index. CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3-minute video tutorial about CocoInsight: [Watch on YouTube](https://youtu.be/ZnmyoHslBSc?si=pPLXWALztkA710r9).

Run CocoInsight to understand your data pipeline:

```bash
python main.py cocoindex server -c https://cocoindex.io
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight). It connects to your local CocoIndex server with zero data retention.

You can view the pipeline flow and the data preview in the CocoInsight UI:
![CocoInsight UI](https://cocoindex.io/blogs/assets/images/cocoinsight-edd71690dcc35b6c5cf1cb31b51b6f6f.png)

examples/docs_to_kg/main.py

Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,85 @@
1+
"""
2+
This example shows how to extract relationships from Markdown documents and build a knowledge graph.
3+
"""
4+
import dataclasses
5+
from dotenv import load_dotenv
6+
import cocoindex
7+
8+
9+
@dataclasses.dataclass
10+
class Relationship:
11+
"""Describe a relationship between two nodes."""
12+
source: str
13+
relationship_name: str
14+
target: str
15+
16+
@dataclasses.dataclass
17+
class Relationships:
18+
"""Describe a relationship between two nodes."""
19+
relationships: list[Relationship]
20+
21+
@cocoindex.flow_def(name="DocsToKG")
22+
def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
23+
"""
24+
Define an example flow that extracts triples from files and build knowledge graph.
25+
"""
26+
27+
conn_spec = cocoindex.add_auth_entry(
28+
"Neo4jConnection",
29+
cocoindex.storages.Neo4jConnectionSpec(
30+
uri="bolt://localhost:7687",
31+
user="neo4j",
32+
password="cocoindex",
33+
))
34+
35+
data_scope["documents"] = flow_builder.add_source(
36+
cocoindex.sources.LocalFile(path="../../docs/docs/core",
37+
included_patterns=["*.md", "*.mdx"]))
38+
39+
relationships = data_scope.add_collector()
40+
41+
with data_scope["documents"].row() as doc:
42+
doc["chunks"] = doc["content"].transform(
43+
cocoindex.functions.SplitRecursively(),
44+
language="markdown", chunk_size=10000)
45+
46+
with doc["chunks"].row() as chunk:
47+
chunk["relationships"] = chunk["text"].transform(
48+
cocoindex.functions.ExtractByLlm(
49+
llm_spec=cocoindex.LlmSpec(
50+
api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
51+
output_type=Relationships,
52+
instruction=(
53+
"Please extract relationships from CocoIndex documents. "
54+
"Focus on concepts and ingnore specific examples. "
55+
"Each relationship should be a tuple of (source, relationship, target).")))
56+
57+
with chunk["relationships"]["relationships"].row() as relationship:
58+
relationships.collect(
59+
id=cocoindex.GeneratedField.UUID,
60+
source=relationship["source"],
61+
relationship_name=relationship["relationship_name"],
62+
target=relationship["target"],
63+
)
64+
65+
relationships.export(
66+
"relationships",
67+
cocoindex.storages.Neo4jRelationship(
68+
connection=conn_spec,
69+
relationship="RELATIONSHIP",
70+
source=cocoindex.storages.Neo4jRelationshipEndSpec(field_name="source", label="Entity"),
71+
target=cocoindex.storages.Neo4jRelationshipEndSpec(field_name="target", label="Entity"),
72+
nodes={
73+
"Entity": cocoindex.storages.Neo4jRelationshipNodeSpec(key_field_name="value"),
74+
},
75+
),
76+
primary_key_fields=["id"],
77+
)
78+
79+
@cocoindex.main_fn()
80+
def _run():
81+
pass
82+
83+
if __name__ == "__main__":
84+
load_dotenv(override=True)
85+
_run()
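
To make the data shapes in the flow concrete, here is a minimal standard-library sketch, independent of CocoIndex, of how a structured LLM response could map onto a `Relationship` dataclass and how the resulting triples collapse into unique entity nodes plus edges. The JSON payload and the helper names `parse_relationships` and `to_graph` are made up for illustration; in the flow itself these steps are delegated to `ExtractByLlm` and the Neo4j export.

```python
import dataclasses
import json


@dataclasses.dataclass(frozen=True)
class Relationship:
    """One (source, relationship_name, target) triple."""
    source: str
    relationship_name: str
    target: str


def parse_relationships(payload: str) -> list[Relationship]:
    """Parse a JSON object shaped like the Relationships dataclass."""
    data = json.loads(payload)
    return [Relationship(**item) for item in data["relationships"]]


def to_graph(rels: list[Relationship]) -> tuple[list[str], list[tuple[str, str, str]]]:
    """Deduplicate entity names into nodes and keep one edge per triple."""
    nodes = sorted({name for r in rels for name in (r.source, r.target)})
    edges = [(r.source, r.relationship_name, r.target) for r in rels]
    return nodes, edges


# Hypothetical LLM output, for illustration only.
payload = json.dumps({"relationships": [
    {"source": "Flow", "relationship_name": "contains", "target": "Source"},
    {"source": "Flow", "relationship_name": "exports to", "target": "Target"},
]})

rels = parse_relationships(payload)
nodes, edges = to_graph(rels)
print(nodes)
print(edges)
```

The entity `Flow` appears in two triples but only once in `nodes`, which mirrors how keying `Entity` nodes by a single field lets several relationships share the same node.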
