Commit d61e131

Merge branch 'main' into add-cosmosdb-to-storage
2 parents: 231ad57 + 9836643

File tree: 111 files changed (+2351, −1418 lines)


.semversioner/1.0.0.json

Lines changed: 26 additions & 0 deletions

```diff
@@ -0,0 +1,26 @@
+{
+    "changes": [
+        {
+            "description": "Add Parent id to communities data model",
+            "type": "patch"
+        },
+        {
+            "description": "Add migration notebook.",
+            "type": "patch"
+        },
+        {
+            "description": "Create separate community workflow, collapse subflows.",
+            "type": "patch"
+        },
+        {
+            "description": "Dependency Updates",
+            "type": "patch"
+        },
+        {
+            "description": "cleanup and refactor factory classes.",
+            "type": "patch"
+        }
+    ],
+    "created_at": "2024-12-11T21:41:49+00:00",
+    "version": "1.0.0"
+}
```
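The change file added above is plain JSON, so the release bump it implies can be derived mechanically from the change types. A minimal stdlib-only sketch of that idea (this is not semversioner's actual implementation; the `release_bump` helper is hypothetical):

```python
import json

# A change file shaped like the .semversioner/1.0.0.json shown above.
change_file = json.loads("""
{
    "changes": [
        {"description": "Add Parent id to communities data model", "type": "patch"},
        {"description": "Dependency Updates", "type": "patch"}
    ],
    "created_at": "2024-12-11T21:41:49+00:00",
    "version": "1.0.0"
}
""")

# Rank change types; the most significant one decides the release bump.
SEVERITY = {"patch": 0, "minor": 1, "major": 2}

def release_bump(changes: list[dict]) -> str:
    """Return the most significant change type present in the list."""
    return max((c["type"] for c in changes), key=SEVERITY.__getitem__)

print(release_bump(change_file["changes"]))  # patch
```

With only `patch` entries, as in this commit, the aggregate bump stays a patch release.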

CHANGELOG.md

Lines changed: 8 additions & 0 deletions

```diff
@@ -1,6 +1,14 @@
 # Changelog
 Note: version releases in the 0.x.y range may introduce breaking changes.
 
+## 1.0.0
+
+- patch: Add Parent id to communities data model
+- patch: Add migration notebook.
+- patch: Create separate community workflow, collapse subflows.
+- patch: Dependency Updates
+- patch: cleanup and refactor factory classes.
+
 ## 0.9.0
 
 - minor: Refactor graph creation.
```

DEVELOPING.md

Lines changed: 40 additions & 8 deletions

````diff
@@ -10,24 +10,56 @@
 # Getting Started
 
 ## Install Dependencies
-
-```sh
-# Install Python dependencies.
+```shell
+# install python dependencies
 poetry install
 ```
 
-## Executing the Indexing Engine
-
-```sh
+## Execute the indexing engine
+```shell
 poetry run poe index <...args>
 ```
 
-## Executing Queries
+## Execute prompt tuning
+```shell
+poetry run poe prompt_tune <...args>
+```
 
-```sh
+## Execute Queries
+```shell
 poetry run poe query <...args>
 ```
 
+## Repository Structure
+An overview of the repository's top-level folder structure is provided below, detailing the overall design and purpose.
+We leverage a factory design pattern where possible, enabling a variety of implementations for each core component of graphrag.
+
+```shell
+graphrag
+├── api            # library API definitions
+├── cache          # cache module supporting several options
+│   └─ factory.py  #   └─ main entrypoint to create a cache
+├── callbacks      # a collection of commonly used callback functions
+├── cli            # library CLI
+│   └─ main.py     #   └─ primary CLI entrypoint
+├── config         # configuration management
+├── index          # indexing engine
+│   └─ run/run.py  #   main entrypoint to build an index
+├── llm            # generic llm interfaces
+├── logger         # logger module supporting several options
+│   └─ factory.py  #   └─ main entrypoint to create a logger
+├── model          # data model definitions associated with the knowledge graph
+├── prompt_tune    # prompt tuning module
+├── prompts        # a collection of all the system prompts used by graphrag
+├── query          # query engine
+├── storage        # storage module supporting several options
+│   └─ factory.py  #   └─ main entrypoint to create/load a storage endpoint
+├── utils          # helper functions used throughout the library
+└── vector_stores  # vector store module containing a few options
+    └─ factory.py  #   └─ main entrypoint to create a vector store
+```
+Where appropriate, the factories expose a registration method for users to provide their own custom implementations if desired.
+
 ## Versioning
 
 We use [semversioner](https://github.com/raulgomis/semversioner) to automate and enforce semantic versioning in the release process. Our CI/CD pipeline checks that all PR's include a json file generated by semversioner. When submitting a PR, please run:
````
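The registration hook that the new DEVELOPING.md text describes can be illustrated with a small sketch. All names here (`StorageFactory`, `MemoryStorage`, `register`, `create`) are hypothetical stand-ins for the pattern, not graphrag's actual API:

```python
from typing import Callable

class StorageFactory:
    """Registry mapping a storage type name to a constructor."""

    _registry: dict[str, Callable[..., object]] = {}

    @classmethod
    def register(cls, storage_type: str, creator: Callable[..., object]) -> None:
        """Allow users to plug in their own implementation under a name."""
        cls._registry[storage_type] = creator

    @classmethod
    def create(cls, storage_type: str, **kwargs) -> object:
        """Instantiate the implementation registered under storage_type."""
        if storage_type not in cls._registry:
            raise ValueError(f"Unknown storage type: {storage_type}")
        return cls._registry[storage_type](**kwargs)

class MemoryStorage:
    """Trivial in-memory storage used to demonstrate registration."""

    def __init__(self, **kwargs):
        self.data: dict[str, bytes] = {}

# Register a custom implementation, then create it by name.
StorageFactory.register("memory", MemoryStorage)
storage = StorageFactory.create("memory")
```

The design choice is that callers only name a backend; the factory owns construction, so adding a backend never touches call sites.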

docs/config/env_vars.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -156,7 +156,7 @@ This section controls the storage mechanism used by the pipeline used for export
 
 | Parameter | Description | Type | Required or Optional | Default |
 | --- | --- | --- | --- | --- |
-| `GRAPHRAG_STORAGE_TYPE` | The type of reporter to use. Options are `file`, `memory`, or `blob` | `str` | optional | `file` |
+| `GRAPHRAG_STORAGE_TYPE` | The type of storage to use. Options are `file`, `memory`, or `blob` | `str` | optional | `file` |
 | `GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL` | The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net` | `str` | optional | None |
 | `GRAPHRAG_STORAGE_CONNECTION_STRING` | The Azure Storage connection string to use when in `blob` mode. | `str` | optional | None |
 | `GRAPHRAG_STORAGE_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |
```
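Resolving `GRAPHRAG_STORAGE_TYPE` with the documented `file` default can be sketched as follows (the `storage_type_from_env` helper is hypothetical, for illustration only; only the variable name and the `file`/`memory`/`blob` options come from the table above):

```python
import os

VALID_STORAGE_TYPES = {"file", "memory", "blob"}

def storage_type_from_env() -> str:
    """Read GRAPHRAG_STORAGE_TYPE, defaulting to "file" as the table documents."""
    value = os.environ.get("GRAPHRAG_STORAGE_TYPE", "file")
    if value not in VALID_STORAGE_TYPES:
        raise ValueError(f"unsupported GRAPHRAG_STORAGE_TYPE: {value!r}")
    return value

# With the variable unset, the documented default applies.
os.environ.pop("GRAPHRAG_STORAGE_TYPE", None)
print(storage_type_from_env())  # file
```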

docs/examples_notebooks/drift_search.ipynb

Lines changed: 6 additions & 33 deletions

```diff
@@ -171,9 +171,6 @@
 " read_indexer_reports,\n",
 " read_indexer_text_units,\n",
 ")\n",
-"from graphrag.query.input.loaders.dfs import (\n",
-" store_entity_semantic_embeddings,\n",
-")\n",
 "from graphrag.query.llm.oai.chat_openai import ChatOpenAI\n",
 "from graphrag.query.llm.oai.embedding import OpenAIEmbedding\n",
 "from graphrag.query.llm.oai.typing import OpenaiApiType\n",
@@ -207,9 +204,6 @@
 " collection_name=\"default-entity-description\",\n",
 ")\n",
 "description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
-"entity_description_embeddings = store_entity_semantic_embeddings(\n",
-" entities=entities, vectorstore=description_embedding_store\n",
-")\n",
 "\n",
 "print(f\"Entity count: {len(entity_df)}\")\n",
 "entity_df.head()\n",
@@ -270,37 +264,16 @@
 }
 ],
 "source": [
-"def embed_community_reports(\n",
+"def read_community_reports(\n",
 " input_dir: str,\n",
-" embedder: OpenAIEmbedding,\n",
 " community_report_table: str = COMMUNITY_REPORT_TABLE,\n",
 "):\n",
 " \"\"\"Embeds the full content of the community reports and saves the DataFrame with embeddings to the output path.\"\"\"\n",
 " input_path = Path(input_dir) / f\"{community_report_table}.parquet\"\n",
-" output_path = Path(input_dir) / f\"{community_report_table}_with_embeddings.parquet\"\n",
-"\n",
-" if not Path(output_path).exists():\n",
-" print(\"Embedding file not found. Computing community report embeddings...\")\n",
-"\n",
-" report_df = pd.read_parquet(input_path)\n",
-"\n",
-" if \"full_content\" not in report_df.columns:\n",
-" error_msg = f\"'full_content' column not found in {input_path}\"\n",
-" raise ValueError(error_msg)\n",
-"\n",
-" report_df[\"full_content_embeddings\"] = report_df.loc[:, \"full_content\"].apply(\n",
-" lambda x: embedder.embed(x)\n",
-" )\n",
-"\n",
-" # Save the DataFrame with embeddings to the output path\n",
-" report_df.to_parquet(output_path)\n",
-" print(f\"Embeddings saved to {output_path}\")\n",
-" return report_df\n",
-" print(f\"Embeddings file already exists at {output_path}\")\n",
-" return pd.read_parquet(output_path)\n",
+" return pd.read_parquet(input_path)\n",
 "\n",
 "\n",
-"report_df = embed_community_reports(INPUT_DIR, text_embedder)\n",
+"report_df = read_community_reports(INPUT_DIR)\n",
 "reports = read_indexer_reports(\n",
 " report_df,\n",
 " entity_df,\n",
@@ -321,7 +294,7 @@
 " entities=entities,\n",
 " relationships=relationships,\n",
 " reports=reports,\n",
-" entity_text_embeddings=entity_description_embeddings,\n",
+" entity_text_embeddings=description_embedding_store,\n",
 " text_units=text_units,\n",
 ")\n",
 "\n",
@@ -3172,7 +3145,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "graphrag-ta_-cxM1-py3.10",
+"display_name": ".venv",
 "language": "python",
 "name": "python3"
 },
@@ -3186,7 +3159,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.12"
+"version": "3.11.9"
 }
 },
 "nbformat": 4,
```
