Commit 5ec49fd

V3 docs and cleanup (#2100)
* Remove community contrib notebooks
* Add migration notebook and breaking changes page edits
* Update/polish docs
* Make model instance name configurable
* Add vector schema updates to v3 migration notebook
* Spellcheck
* Bump smoke test runtimes
1 parent b732445 commit 5ec49fd

28 files changed: +250 −1762 lines

breaking-changes.md

Lines changed: 29 additions & 0 deletions
@@ -12,6 +12,35 @@ There are five surface areas that may be impacted on any given release. They are

> TL;DR: Always run `graphrag init --path [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so back up if necessary.

# v3

Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v3.ipynb) to convert older tables to the v3 format. Our main goal with v3 was to slim down the core library, minimizing long-term maintenance of features that are largely unused or that should have been out of scope all along.

## Data Model

We made minimal data model changes for v3. The primary breaking change: we removed a rarely-used document-grouping capability that forced the `text_units` table to carry a list-valued `document_ids` column rather than a single-valued `document_id` column. v3 fixes that, and the migration notebook applies the change so you don't need to re-index.

Most of the other changes remove fields that are no longer used or are out of scope. For example, we removed the UMAP step that generates x/y coordinates for entities: new indexes will not produce these columns, but they won't hurt anything if they remain in your existing tables.
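
The `document_ids` change above can be sketched with pandas. This mirrors what the migration notebook does, but it is a minimal illustration with made-up sample data, not the shipped migration code:

```python
import pandas as pd

# Sketch of the v2 -> v3 text_units change: collapse the list-valued
# `document_ids` column into a scalar `document_id` column.
text_units = pd.DataFrame({
    "id": ["tu-1", "tu-2"],
    "document_ids": [["doc-a"], ["doc-b"]],  # v2 shape: one-element lists
})

# Take the single entry from each list, then drop the old column.
text_units["document_id"] = text_units["document_ids"].apply(lambda ids: ids[0])
text_units = text_units.drop(columns=["document_ids"])
```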

## API

We have removed the multi-search variant from each search method in the API.

## Config

We did make several changes to the configuration model. The best way forward is to re-run `init`, which we always recommend for minor and major version bumps.

This is a summary of changes:

- Removed fnllm as the underlying model manager, so the model types "openai_chat", "azure_openai_chat", "openai_embedding", and "azure_openai_embedding" are all invalid. Use "chat" or "embedding".
- fnllm also had an experimental rate-limiting "auto" setting, which is no longer allowed. Use `null` in your config as a default, or set explicit tpm/rpm limits.
- LiteLLM does require a `model_provider`, so add yours as appropriate. For example, if you previously used "openai_chat" as your model type, this would be "openai"; for "azure_openai_chat" it would be "azure".
- Collapsed the `vector_store` dict into a single root-level object. We no longer support multi-search, and this dict required a lot of downstream complexity for that single use case.
- Removed the `outputs` block, which was also only used for multi-search.
- Most workflows had an undocumented `strategy` config dict that allowed fine-tuning of internal settings. These fine-tunings were never used and carried associated complexity, so we removed them.
- Vector store configuration now allows a custom schema per embedded field. This removes the need for the `container_name` prefix, which caused confusion anyway. Now the default container name is simply the embedded field name; if you need something custom, add the `embeddings_schema` block and populate it as needed.
- We previously supported embedding any text field in the data model. However, we only ever use `text_unit_text`, `entity_description`, and `community_full_content`, so all others have been removed.
- Removed the `umap` and `embed_graph` blocks, which were only used to add x/y fields to the entities. This fixed a long-standing dependency issue with graspologic. If you need x/y positions, see the [visualization guide](https://microsoft.github.io/graphrag/visualization_guide/) for using Gephi.
- Removed file filtering from input document loading. This was essentially unused.
- Removed the groupby ability for text chunking. This was intended to allow short documents to be grouped before chunking, but it was never used and added a lot of complexity to the chunking process.
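
Taken together, a minimal v3-style `settings.yaml` might look something like the sketch below. This is an illustrative fragment under the assumptions stated in the comments; model names, limits, and the `lancedb` paths are examples, not shipped defaults:

```yaml
# Illustrative v3-style fragment (values are examples, not defaults).
models:
  default_chat_model:
    type: chat                 # "openai_chat" etc. are no longer valid types
    model_provider: openai     # now required; LiteLLM routes on this
    model: gpt-4o
    api_key: ${GRAPHRAG_API_KEY}
    tokens_per_minute: null    # "auto" is no longer allowed
    requests_per_minute: null
  default_embedding_model:
    type: embedding
    model_provider: openai
    model: text-embedding-3-large

# vector_store is now a single root-level object, not a dict of stores
vector_store:
  type: lancedb
  db_uri: output/lancedb
  embeddings_schema:           # optional per-field overrides
    text_unit.text:
      index_name: text_unit_text
```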

# v2

Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v2.ipynb) to convert older tables to the v2 format.

docs/config/models.md

Lines changed: 2 additions & 2 deletions
@@ -31,9 +31,9 @@ To use LiteLLM one must

- Set `type` to either `chat` or `embedding`.
- Provide a `model_provider`, e.g., `openai`, `azure`, `gemini`, etc.
- Set the `model` to one supported by the `model_provider`'s API.
-- Provide a `deployment_name` if using `azure` as the `model_provider`.
+- Provide a `deployment_name` if using `azure` as the `model_provider` and your deployment name differs from the model name.

-See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
+See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
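
As a small illustration of that provider/model split (the helper function here is hypothetical, for exposition only; it is not part of GraphRAG or LiteLLM):

```python
# Hypothetical helper showing how `model_provider` and `model` combine into
# the "provider/model" string LiteLLM uses to route calls.
def litellm_model_string(model_provider: str, model: str) -> str:
    return f"{model_provider}/{model}"

# model_provider "openai" + model "gpt-4o" yields "openai/gpt-4o"
```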

## Model Selection Considerations

docs/config/overview.md

Lines changed: 0 additions & 1 deletion
@@ -8,4 +8,3 @@ The default configuration mode is the simplest way to get started with the Graph

- [Init command](init.md) (recommended first step)
- [Edit settings.yaml for deeper control](yaml.md)
-- [Purely using environment variables](env_vars.md) (not recommended)

docs/config/yaml.md

Lines changed: 9 additions & 5 deletions
@@ -11,7 +11,7 @@ For example:
GRAPHRAG_API_KEY=some_api_key

# settings.yml
-llm:
+default_chat_model:
  api_key: ${GRAPHRAG_API_KEY}
```

@@ -44,20 +44,20 @@ models:
- `api_key` **str** - The OpenAI API key to use.
- `auth_type` **api_key|azure_managed_identity** - Indicate how you want to authenticate requests.
- `type` **chat|embedding|mock_chat|mock_embeddings** - The type of LLM to use.
-- `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. Required when `type == chat|embedding`. When `type == chat|embedding`, [LiteLLM](https://docs.litellm.ai/) is used under the hood which has support for calling 100+ models. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
+- `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. [LiteLLM](https://docs.litellm.ai/) is used under the hood, which supports calling 100+ models. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
- `model` **str** - The model name.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `api_base` **str** - The API base url to use.
- `api_version` **str** - The API version.
-- `deployment_name` **str** - The deployment name to use (Azure).
+- `deployment_name` **str** - The deployment name to use if your model is hosted on Azure. Note that if your deployment name on Azure matches the model name, this is unnecessary.
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
- `model_supports_json` **bool** - Whether the model supports JSON-mode output.
- `request_timeout` **float** - The per-request timeout.
- `tokens_per_minute` **int** - Set a leaky-bucket throttle on tokens-per-minute.
- `requests_per_minute` **int** - Set a leaky-bucket throttle on requests-per-minute.
-- `retry_strategy` **str** - Retry strategy to use, "native" is the default and uses the strategy built into the OpenAI SDK. Other allowable values include "exponential_backoff", "random_wait", and "incremental_wait".
+- `retry_strategy` **str** - Retry strategy to use; "exponential_backoff" is the default. Other allowable values include "native", "random_wait", and "incremental_wait".
- `max_retries` **int** - The maximum number of retries to use.
- `max_retry_wait` **float** - The maximum backoff time.
- `concurrent_requests` **int** - The number of open requests to allow at once.
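
A hedged sketch of how a few of these fields might combine for an Azure-hosted model. All values here are illustrative, not shipped defaults, and the resource/deployment names are made up:

```yaml
# Illustrative model definition (example values only).
models:
  default_chat_model:
    type: chat
    model_provider: azure
    model: gpt-4o
    deployment_name: my-gpt4o-deployment  # only needed if it differs from `model`
    api_base: https://my-resource.openai.azure.com
    api_version: "2024-02-15-preview"
    retry_strategy: exponential_backoff   # the default
    tokens_per_minute: 100000             # leaky-bucket throttles
    requests_per_minute: 1000
    concurrent_requests: 25
```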
@@ -201,7 +201,7 @@ Supported embeddings names are:
#### Fields

- `model_id` **str** - Name of the model definition to use for text embedding.
-- `vector_store_id` **str** - Name of vector store definition to write to.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "text_embedding". This primarily affects the cache storage partitioning.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum number of tokens per batch.
- `names` **list[str]** - List of the embeddings names to run (must be in the supported list).
@@ -213,6 +213,7 @@ Tune the language model-based graph extraction process.
#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_graph". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
@@ -222,6 +223,7 @@ Tune the language model-based graph extraction process.
#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "summarize_descriptions". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).
@@ -275,6 +277,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c

- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_claims". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
@@ -284,6 +287,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c
#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "community_reporting". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright (c) 2024 Microsoft Corporation.\n",
    "# Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Index Migration (v2 to v3)\n",
    "\n",
    "This notebook is used to maintain data model parity with older indexes for version 3.0 of GraphRAG. If you have a pre-3.0 index and need to migrate without re-running the entire pipeline, you can use this notebook to update only the pieces necessary for alignment. If you have a pre-2.0 index, please run the v2 migration notebook first!\n",
    "\n",
    "NOTE: we recommend regenerating your settings.yaml with the latest version of GraphRAG using `graphrag init`. Copy your LLM settings into it before running this notebook. This ensures your config is aligned with the latest version for the migration.\n",
    "\n",
    "This notebook will also update your settings.yaml to ensure compatibility with our newer vector store collection naming scheme in order to avoid re-ingesting.\n",
    "\n",
    "WARNING: This will overwrite your parquet files; you may want to make a backup!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is the directory that has your settings.yaml\n",
    "PROJECT_DIRECTORY = \"/Users/naevans/graphrag/working/migration\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "from graphrag.config.load_config import load_config\n",
    "from graphrag.storage.factory import StorageFactory\n",
    "\n",
    "config = load_config(Path(PROJECT_DIRECTORY))\n",
    "storage_config = config.output.model_dump()\n",
    "storage = StorageFactory().create_storage(\n",
    "    storage_type=storage_config[\"type\"],\n",
    "    kwargs=storage_config,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_columns(df, columns):\n",
    "    \"\"\"Remove columns from a DataFrame, suppressing errors.\"\"\"\n",
    "    df.drop(labels=columns, axis=1, errors=\"ignore\", inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from graphrag.utils.storage import (\n",
    "    load_table_from_storage,\n",
    "    write_table_to_storage,\n",
    ")\n",
    "\n",
    "text_units = await load_table_from_storage(\"text_units\", storage)\n",
    "\n",
    "text_units[\"document_id\"] = text_units[\"document_ids\"].apply(lambda ids: ids[0])\n",
    "remove_columns(text_units, [\"document_ids\"])\n",
    "\n",
    "await write_table_to_storage(text_units, \"text_units\", storage)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Update settings.yaml\n",
    "This next section will attempt to insert index names for each vector index using our new schema structure. It depends on most settings being defaults; if you have already customized your vector store schema, it may not be necessary.\n",
    "\n",
    "The primary goal is to align v2 indexes using our old default naming scheme with the new customizability. If you don't need this done, or you have a more complicated config, comment it out and update your config manually to ensure each index name is set.\n",
    "\n",
    "Old default index names:\n",
    "- default-text_unit-text\n",
    "- default-entity-description\n",
    "- default-community-full_content\n",
    "\n",
    "v3 versions are:\n",
    "- text_unit_text\n",
    "- entity_description\n",
    "- community_full_content\n",
    "\n",
    "Therefore, with a v2 index we will explicitly set the old index names so it connects correctly.\n",
    "\n",
    "NOTE: we are also setting the default vector_size for each index, under the assumption that you are using a prior default with 1536 dimensions. Our new default of text-embedding-3-large has 3072 dimensions, which will be populated as the default if unset. Again, if you have a more complicated situation you may want to manually configure this.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import yaml\n",
    "\n",
    "EMBEDDING_DIMENSIONS = 1536\n",
    "\n",
    "settings = Path(PROJECT_DIRECTORY) / \"settings.yaml\"\n",
    "with Path.open(settings) as f:\n",
    "    conf = yaml.safe_load(f)\n",
    "\n",
    "vector_store = conf.get(\"vector_store\", {})\n",
    "container_name = vector_store.get(\"container_name\", \"default\")\n",
    "embeddings_schema = vector_store.get(\"embeddings_schema\", {})\n",
    "text_unit_schema = embeddings_schema.get(\"text_unit.text\", {})\n",
    "if \"index_name\" not in text_unit_schema:\n",
    "    text_unit_schema[\"index_name\"] = f\"{container_name}-text_unit-text\"\n",
    "if \"vector_size\" not in text_unit_schema:\n",
    "    text_unit_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
    "embeddings_schema[\"text_unit.text\"] = text_unit_schema\n",
    "entity_schema = embeddings_schema.get(\"entity.description\", {})\n",
    "if \"index_name\" not in entity_schema:\n",
    "    entity_schema[\"index_name\"] = f\"{container_name}-entity-description\"\n",
    "if \"vector_size\" not in entity_schema:\n",
    "    entity_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
    "embeddings_schema[\"entity.description\"] = entity_schema\n",
    "community_schema = embeddings_schema.get(\"community.full_content\", {})\n",
    "if \"index_name\" not in community_schema:\n",
    "    community_schema[\"index_name\"] = f\"{container_name}-community-full_content\"\n",
    "if \"vector_size\" not in community_schema:\n",
    "    community_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
    "embeddings_schema[\"community.full_content\"] = community_schema\n",
    "vector_store[\"embeddings_schema\"] = embeddings_schema\n",
    "conf[\"vector_store\"] = vector_store\n",
    "\n",
    "with Path.open(settings, \"w\") as f:\n",
    "    yaml.safe_dump(conf, f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "graphrag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

examples_notebooks/community_contrib/README.md

Lines changed: 0 additions & 5 deletions
This file was deleted.
