Commit 5ec49fd

V3 docs and cleanup (#2100)
* Remove community contrib notebooks
* Add migration notebook and breaking changes page edits
* Update/polish docs
* Make model instance name configurable
* Add vector schema updates to v3 migration notebook
* Spellcheck
* Bump smoke test runtimes
1 parent b732445 commit 5ec49fd

28 files changed: +250 −1762 lines

breaking-changes.md

Lines changed: 29 additions & 0 deletions
@@ -12,6 +12,35 @@ There are five surface areas that may be impacted on any given release. They are

> TL;DR: Always run `graphrag init --path [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so back up if necessary.

# v3

Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v3.ipynb) to convert older tables to the v3 format. Our main goal with v3 was to slim down the core library, minimizing long-term maintenance of features that are largely unused or that should have been out of scope all along.

## Data Model

We made minimal data model changes for v3. The primary breaking change: we removed a rarely-used document-grouping capability that forced the `text_units` table to carry a list-valued `document_ids` column rather than a single-valued `document_id` column. v3 fixes that, and the migration notebook applies the change so you don't need to re-index.

Most of the other changes remove fields that are no longer used or are out of scope. For example, we removed the UMAP step that generates x/y coordinates for entities: new indexes will not produce these columns, but they won't hurt anything if they remain in your existing tables.
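
The `document_ids` change above can be sketched with pandas. This mirrors what the migration notebook does, but it is a minimal illustration with made-up sample data, not the shipped migration code:

```python
import pandas as pd

# Sketch of the v2 -> v3 text_units change: collapse the list-valued
# `document_ids` column into a scalar `document_id` column.
text_units = pd.DataFrame({
    "id": ["tu-1", "tu-2"],
    "document_ids": [["doc-a"], ["doc-b"]],  # v2 shape: one-element lists
})

# Take the single entry from each list, then drop the old column.
text_units["document_id"] = text_units["document_ids"].apply(lambda ids: ids[0])
text_units = text_units.drop(columns=["document_ids"])
```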

## API

We have removed the multi-search variant from each search method in the API.

## Config

We did make several changes to the configuration model. The best way forward is to re-run `init`, which we always recommend for minor and major version bumps.

This is a summary of changes:

- Removed fnllm as the underlying model manager, so the model types "openai_chat", "azure_openai_chat", "openai_embedding", and "azure_openai_embedding" are all invalid. Use "chat" or "embedding".
- fnllm also had an experimental rate-limiting "auto" setting, which is no longer allowed. Use `null` in your config as a default, or set explicit tpm/rpm limits.
- LiteLLM does require a `model_provider`, so add yours as appropriate. For example, if you previously used "openai_chat" as your model type, this would be "openai"; for "azure_openai_chat" it would be "azure".
- Collapsed the `vector_store` dict into a single root-level object. We no longer support multi-search, and this dict required a lot of downstream complexity for that single use case.
- Removed the `outputs` block, which was also only used for multi-search.
- Most workflows had an undocumented `strategy` config dict that allowed fine-tuning of internal settings. These fine-tunings were never used and carried associated complexity, so we removed them.
- Vector store configuration now allows a custom schema per embedded field. This removes the need for the `container_name` prefix, which caused confusion anyway. Now the default container name is simply the embedded field name; if you need something custom, add the `embeddings_schema` block and populate it as needed.
- We previously supported embedding any text field in the data model. However, we only ever use `text_unit_text`, `entity_description`, and `community_full_content`, so all others have been removed.
- Removed the `umap` and `embed_graph` blocks, which were only used to add x/y fields to the entities. This fixed a long-standing dependency issue with graspologic. If you need x/y positions, see the [visualization guide](https://microsoft.github.io/graphrag/visualization_guide/) for using Gephi.
- Removed file filtering from input document loading. This was essentially unused.
- Removed the groupby ability for text chunking. This was intended to allow short documents to be grouped before chunking, but it was never used and added a lot of complexity to the chunking process.
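
Taken together, a minimal v3-style `settings.yaml` might look something like the sketch below. This is an illustrative fragment under the assumptions stated in the comments; model names, limits, and the `lancedb` paths are examples, not shipped defaults:

```yaml
# Illustrative v3-style fragment (values are examples, not defaults).
models:
  default_chat_model:
    type: chat                 # "openai_chat" etc. are no longer valid types
    model_provider: openai     # now required; LiteLLM routes on this
    model: gpt-4o
    api_key: ${GRAPHRAG_API_KEY}
    tokens_per_minute: null    # "auto" is no longer allowed
    requests_per_minute: null
  default_embedding_model:
    type: embedding
    model_provider: openai
    model: text-embedding-3-large

# vector_store is now a single root-level object, not a dict of stores
vector_store:
  type: lancedb
  db_uri: output/lancedb
  embeddings_schema:           # optional per-field overrides
    text_unit.text:
      index_name: text_unit_text
```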

# v2

Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v2.ipynb) to convert older tables to the v2 format.

docs/config/models.md

Lines changed: 2 additions & 2 deletions
@@ -31,9 +31,9 @@ To use LiteLLM one must

- Set `type` to either `chat` or `embedding`.
- Provide a `model_provider`, e.g., `openai`, `azure`, `gemini`, etc.
- Set the `model` to one supported by the `model_provider`'s API.
-- Provide a `deployment_name` if using `azure` as the `model_provider`.
+- Provide a `deployment_name` if using `azure` as the `model_provider` and your deployment name differs from the model name.

-See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
+See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
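
As a small illustration of that provider/model split (the helper function here is hypothetical, for exposition only; it is not part of GraphRAG or LiteLLM):

```python
# Hypothetical helper showing how `model_provider` and `model` combine into
# the "provider/model" string LiteLLM uses to route calls.
def litellm_model_string(model_provider: str, model: str) -> str:
    return f"{model_provider}/{model}"

# model_provider "openai" + model "gpt-4o" yields "openai/gpt-4o"
```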

## Model Selection Considerations

docs/config/overview.md

Lines changed: 0 additions & 1 deletion
@@ -8,4 +8,3 @@ The default configuration mode is the simplest way to get started with the Graph

- [Init command](init.md) (recommended first step)
- [Edit settings.yaml for deeper control](yaml.md)
-- [Purely using environment variables](env_vars.md) (not recommended)

docs/config/yaml.md

Lines changed: 9 additions & 5 deletions
@@ -11,7 +11,7 @@ For example:
GRAPHRAG_API_KEY=some_api_key

# settings.yml
-llm:
+default_chat_model:
  api_key: ${GRAPHRAG_API_KEY}
```

@@ -44,20 +44,20 @@ models:
- `api_key` **str** - The OpenAI API key to use.
- `auth_type` **api_key|azure_managed_identity** - Indicate how you want to authenticate requests.
- `type` **chat|embedding|mock_chat|mock_embeddings** - The type of LLM to use.
-- `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. Required when `type == chat|embedding`. When `type == chat|embedding`, [LiteLLM](https://docs.litellm.ai/) is used under the hood which has support for calling 100+ models. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
+- `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. [LiteLLM](https://docs.litellm.ai/) is used under the hood, which supports calling 100+ models. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
- `model` **str** - The model name.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `api_base` **str** - The API base url to use.
- `api_version` **str** - The API version.
-- `deployment_name` **str** - The deployment name to use (Azure).
+- `deployment_name` **str** - The deployment name to use if your model is hosted on Azure. Note that if your deployment name on Azure matches the model name, this is unnecessary.
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
- `model_supports_json` **bool** - Whether the model supports JSON-mode output.
- `request_timeout` **float** - The per-request timeout.
- `tokens_per_minute` **int** - Set a leaky-bucket throttle on tokens-per-minute.
- `requests_per_minute` **int** - Set a leaky-bucket throttle on requests-per-minute.
-- `retry_strategy` **str** - Retry strategy to use, "native" is the default and uses the strategy built into the OpenAI SDK. Other allowable values include "exponential_backoff", "random_wait", and "incremental_wait".
+- `retry_strategy` **str** - Retry strategy to use; "exponential_backoff" is the default. Other allowable values include "native", "random_wait", and "incremental_wait".
- `max_retries` **int** - The maximum number of retries to use.
- `max_retry_wait` **float** - The maximum backoff time.
- `concurrent_requests` **int** - The number of open requests to allow at once.
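
A hedged sketch of how a few of these fields might combine for an Azure-hosted model. All values here are illustrative, not shipped defaults, and the resource/deployment names are made up:

```yaml
# Illustrative model definition (example values only).
models:
  default_chat_model:
    type: chat
    model_provider: azure
    model: gpt-4o
    deployment_name: my-gpt4o-deployment  # only needed if it differs from `model`
    api_base: https://my-resource.openai.azure.com
    api_version: "2024-02-15-preview"
    retry_strategy: exponential_backoff   # the default
    tokens_per_minute: 100000             # leaky-bucket throttles
    requests_per_minute: 1000
    concurrent_requests: 25
```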
@@ -201,7 +201,7 @@ Supported embeddings names are:
#### Fields

- `model_id` **str** - Name of the model definition to use for text embedding.
-- `vector_store_id` **str** - Name of vector store definition to write to.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "text_embedding". This primarily affects the cache storage partitioning.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum number of tokens per batch.
- `names` **list[str]** - List of the embeddings names to run (must be in the supported list).
@@ -213,6 +213,7 @@ Tune the language model-based graph extraction process.
#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_graph". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
@@ -222,6 +223,7 @@ Tune the language model-based graph extraction process.
#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "summarize_descriptions". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).
@@ -275,6 +277,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c

- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_claims". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
@@ -284,6 +287,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c
#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "community_reporting". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright (c) 2024 Microsoft Corporation.\n",
    "# Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Index Migration (v2 to v3)\n",
    "\n",
    "This notebook is used to maintain data model parity with older indexes for version 3.0 of GraphRAG. If you have a pre-3.0 index and need to migrate without re-running the entire pipeline, you can use this notebook to update only the pieces necessary for alignment. If you have a pre-2.0 index, please run the v2 migration notebook first!\n",
    "\n",
    "NOTE: we recommend regenerating your settings.yaml with the latest version of GraphRAG using `graphrag init`. Copy your LLM settings into it before running this notebook. This ensures your config is aligned with the latest version for the migration.\n",
    "\n",
    "This notebook will also update your settings.yaml to ensure compatibility with our newer vector store collection naming scheme in order to avoid re-ingesting.\n",
    "\n",
    "WARNING: This will overwrite your parquet files; you may want to make a backup!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is the directory that has your settings.yaml\n",
    "PROJECT_DIRECTORY = \"/Users/naevans/graphrag/working/migration\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "from graphrag.config.load_config import load_config\n",
    "from graphrag.storage.factory import StorageFactory\n",
    "\n",
    "config = load_config(Path(PROJECT_DIRECTORY))\n",
    "storage_config = config.output.model_dump()\n",
    "storage = StorageFactory().create_storage(\n",
    "    storage_type=storage_config[\"type\"],\n",
    "    kwargs=storage_config,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_columns(df, columns):\n",
    "    \"\"\"Remove columns from a DataFrame, suppressing errors.\"\"\"\n",
    "    df.drop(labels=columns, axis=1, errors=\"ignore\", inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from graphrag.utils.storage import (\n",
    "    load_table_from_storage,\n",
    "    write_table_to_storage,\n",
    ")\n",
    "\n",
    "text_units = await load_table_from_storage(\"text_units\", storage)\n",
    "\n",
    "text_units[\"document_id\"] = text_units[\"document_ids\"].apply(lambda ids: ids[0])\n",
    "remove_columns(text_units, [\"document_ids\"])\n",
    "\n",
    "await write_table_to_storage(text_units, \"text_units\", storage)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Update settings.yaml\n",
    "This next section will attempt to insert index names for each vector index using our new schema structure. It depends on most settings being defaults; if you have already customized your vector store schema, it may not be necessary.\n",
    "\n",
    "The primary goal is to align v2 indexes using our old default naming scheme with the new customizability. If you don't need this done, or you have a more complicated config, comment it out and update your config manually to ensure each index name is set.\n",
    "\n",
    "Old default index names:\n",
    "- default-text_unit-text\n",
    "- default-entity-description\n",
    "- default-community-full_content\n",
    "\n",
    "v3 versions are:\n",
    "- text_unit_text\n",
    "- entity_description\n",
    "- community_full_content\n",
    "\n",
    "Therefore, with a v2 index we will explicitly set the old index names so it connects correctly.\n",
    "\n",
    "NOTE: we are also setting the default vector_size for each index, under the assumption that you are using a prior default with 1536 dimensions. Our new default of text-embedding-3-large has 3072 dimensions, which will be populated as the default if unset. Again, if you have a more complicated situation you may want to manually configure this.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import yaml\n",
    "\n",
    "EMBEDDING_DIMENSIONS = 1536\n",
    "\n",
    "settings = Path(PROJECT_DIRECTORY) / \"settings.yaml\"\n",
    "with Path.open(settings) as f:\n",
    "    conf = yaml.safe_load(f)\n",
    "\n",
    "vector_store = conf.get(\"vector_store\", {})\n",
    "container_name = vector_store.get(\"container_name\", \"default\")\n",
    "embeddings_schema = vector_store.get(\"embeddings_schema\", {})\n",
    "text_unit_schema = embeddings_schema.get(\"text_unit.text\", {})\n",
    "if \"index_name\" not in text_unit_schema:\n",
    "    text_unit_schema[\"index_name\"] = f\"{container_name}-text_unit-text\"\n",
    "if \"vector_size\" not in text_unit_schema:\n",
    "    text_unit_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
    "embeddings_schema[\"text_unit.text\"] = text_unit_schema\n",
    "entity_schema = embeddings_schema.get(\"entity.description\", {})\n",
    "if \"index_name\" not in entity_schema:\n",
    "    entity_schema[\"index_name\"] = f\"{container_name}-entity-description\"\n",
    "if \"vector_size\" not in entity_schema:\n",
    "    entity_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
    "embeddings_schema[\"entity.description\"] = entity_schema\n",
    "community_schema = embeddings_schema.get(\"community.full_content\", {})\n",
    "if \"index_name\" not in community_schema:\n",
    "    community_schema[\"index_name\"] = f\"{container_name}-community-full_content\"\n",
    "if \"vector_size\" not in community_schema:\n",
    "    community_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
    "embeddings_schema[\"community.full_content\"] = community_schema\n",
    "vector_store[\"embeddings_schema\"] = embeddings_schema\n",
    "conf[\"vector_store\"] = vector_store\n",
    "\n",
    "with Path.open(settings, \"w\") as f:\n",
    "    yaml.safe_dump(conf, f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "graphrag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

examples_notebooks/community_contrib/README.md

Lines changed: 0 additions & 5 deletions
This file was deleted.
