Commit 17658c5

gaudyb, AlonsoGuevara, jgbradley1, and natoverse authored
New workflow to generate embeddings in a single workflow (#1296)
* New workflow to generate embeddings in a single workflow
* New workflow to generate embeddings in a single workflow
* version change
* clean tests without any embeddings references
* clean tests without any embeddings references
* remove code
* feedback implemented
* changes in logic
* feedback implemented
* store in table bug fixed
* smoke test for generate_text_embeddings workflow
* smoke test fix
* add generate_text_embeddings to the list of transient workflows
* smoke tests
* fix
* ruff formatting updates
* fix
* smoke test fixed
* smoke test fixed
* fix lancedb import
* smoke test fix
* ignore sorting
* smoke test fixed
* smoke test fixed
* check smoke test
* smoke test fixed
* change config for vector store
* format fix
* vector store changes
* revert debug profile back to empty filepath
* merge conflict solved
* merge conflict solved
* format fixed
* format fixed
* fix return dataframe
* snapshot fix
* format fix
* embeddings param implemented
* validation fixes
* fix map
* fix map
* fix properties
* config updates
* smoke test fixed
* settings change
* Update collection config and rework back-compat
* Replace . with - for embedding store

---------

Co-authored-by: Alonso Guevara <[email protected]>
Co-authored-by: Josh Bradley <[email protected]>
Co-authored-by: Nathan Evans <[email protected]>
1 parent 8302920 commit 17658c5

51 files changed: +693 -804 lines changed
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
{
2+
"type": "minor",
3+
"description": "embeddings moved to a different workflow"
4+
}

docs/config/json_yaml.md

Lines changed: 3 additions & 3 deletions
@@ -85,16 +85,16 @@ This is the base LLM configuration section. Other steps may override this config
8585
- `async_mode` (see Async Mode top-level config)
8686
- `batch_size` **int** - The maximum batch size to use.
8787
- `batch_max_tokens` **int** - The maximum batch # of tokens.
88-
- `target` **required|all** - Determines which set of embeddings to emit.
89-
- `skip` **list[str]** - Which embeddings to skip.
88+
- `target` **required|all|none** - Determines which set of embeddings to emit.
89+
- `skip` **list[str]** - Which embeddings to skip. Only useful if target=all to customize the list.
9090
- `vector_store` **dict** - The vector store to use. Configured for lancedb by default.
9191
- `type` **str** - `lancedb` or `azure_ai_search`. Default=`lancedb`
9292
- `db_uri` **str** (only for lancedb) - The database uri. Default=`storage.base_dir/lancedb`
9393
- `url` **str** (only for AI Search) - AI Search endpoint
9494
- `api_key` **str** (optional - only for AI Search) - The AI Search api key to use.
9595
- `audience` **str** (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
9696
- `overwrite` **bool** (only used at index creation time) - Overwrite collection if it exists. Default=`True`
97-
- `collection_name` **str** - The name of a vector collection. Default=`entity_description_embeddings`
97+
- `container_name` **str** - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=`default`
9898
- `strategy` **dict** - Fully override the text-embedding strategy.
9999

100100
## chunks
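
A minimal sketch of how the embeddings keys documented in this hunk might fit together, written as a Python dict rather than the YAML/JSON settings file. The key names come from the bullet list above; the concrete values are illustrative assumptions, and `check_embeddings_settings` is a hypothetical helper, not part of graphrag.

```python
# Illustrative embeddings settings. Key names follow the documentation above;
# the values are assumptions chosen for the example.
embeddings_settings = {
    "target": "all",  # one of "required", "all", "none"
    "skip": [],  # only useful when target is "all"
    "vector_store": {
        "type": "lancedb",  # or "azure_ai_search"
        "db_uri": "output/lancedb",
        "container_name": "default",  # replaces the old collection_name key
        "overwrite": True,
    },
}

ALLOWED_TARGETS = {"required", "all", "none"}
ALLOWED_STORE_TYPES = {"lancedb", "azure_ai_search"}


def check_embeddings_settings(settings: dict) -> list:
    """Hypothetical helper: return a list of problems found in the settings."""
    problems = []
    target = settings.get("target")
    if target is not None and target not in ALLOWED_TARGETS:
        problems.append(f"unknown target: {target!r}")
    # The docs note that skip only customizes the list when target is "all".
    if settings.get("skip") and target != "all":
        problems.append("skip has no effect unless target is 'all'")
    store = settings.get("vector_store") or {}
    if store.get("type", "lancedb") not in ALLOWED_STORE_TYPES:
        problems.append(f"unknown vector store type: {store.get('type')!r}")
    if "collection_name" in store:
        problems.append("collection_name was replaced by container_name")
    return problems


print(check_embeddings_settings(embeddings_settings))
```

A settings file that still carries `collection_name` would be flagged by this sketch, which mirrors the rename made in the documentation.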

docs/examples_notebooks/local_search.ipynb

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@
108108
"# load description embeddings to an in-memory lancedb vectorstore\n",
109109
"# to connect to a remote db, specify url and port values.\n",
110110
"description_embedding_store = LanceDBVectorStore(\n",
111-
" collection_name=\"entity_description_embeddings\",\n",
111+
" collection_name=\"entity.description\",\n",
112112
")\n",
113113
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
114114
"entity_description_embeddings = store_entity_semantic_embeddings(\n",

examples_notebooks/community_contrib/yfiles-jupyter-graphs/graph-visualization.ipynb

Lines changed: 1 addition & 1 deletion
@@ -299,7 +299,7 @@
299299
"entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)\n",
300300
"\n",
301301
"description_embedding_store = LanceDBVectorStore(\n",
302-
" collection_name=\"entity_description_embeddings\",\n",
302+
" collection_name=\"entity.description\",\n",
303303
")\n",
304304
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
305305
"entity_description_embeddings = store_entity_semantic_embeddings(\n",

graphrag/api/index.py

Lines changed: 20 additions & 7 deletions
@@ -59,13 +59,7 @@ async def build_index(
5959
msg = "Cannot resume and update a run at the same time."
6060
raise ValueError(msg)
6161

62-
# TODO: must update filepath of lancedb (if used) until the new config engine has been implemented
63-
# TODO: remove the type ignore annotations below once the new config engine has been refactored
64-
vector_store_type = config.embeddings.vector_store["type"] # type: ignore
65-
if vector_store_type == VectorStoreType.LanceDB:
66-
db_uri = config.embeddings.vector_store["db_uri"] # type: ignore
67-
lancedb_dir = Path(config.root_dir).resolve() / db_uri
68-
config.embeddings.vector_store["db_uri"] = str(lancedb_dir) # type: ignore
62+
config = _patch_vector_config(config)
6963

7064
pipeline_config = create_pipeline_config(config)
7165
pipeline_cache = (
@@ -90,3 +84,22 @@ async def build_index(
9084
progress_reporter.success(output.workflow)
9185
progress_reporter.info(str(output.result))
9286
return outputs
87+
88+
89+
def _patch_vector_config(config: GraphRagConfig):
90+
"""Back-compat patch to ensure a default vector store configuration."""
91+
if not config.embeddings.vector_store:
92+
config.embeddings.vector_store = {
93+
"type": "lancedb",
94+
"db_uri": "output/lancedb",
95+
"container_name": "default",
96+
"overwrite": True,
97+
}
98+
# TODO: must update filepath of lancedb (if used) until the new config engine has been implemented
99+
# TODO: remove the type ignore annotations below once the new config engine has been refactored
100+
vector_store_type = config.embeddings.vector_store["type"] # type: ignore
101+
if vector_store_type == VectorStoreType.LanceDB:
102+
db_uri = config.embeddings.vector_store["db_uri"] # type: ignore
103+
lancedb_dir = Path(config.root_dir).resolve() / db_uri
104+
config.embeddings.vector_store["db_uri"] = str(lancedb_dir) # type: ignore
105+
return config
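
The behaviour of `_patch_vector_config` can be sketched in isolation as follows. `FakeConfig` and `FakeEmbeddings` are hypothetical stand-ins for `GraphRagConfig`, and the plain string `"lancedb"` stands in for `VectorStoreType.LanceDB`; only the default dict and the path handling are taken from the diff.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class FakeEmbeddings:
    """Hypothetical stand-in for the embeddings section of the config."""

    vector_store: Optional[dict] = None


@dataclass
class FakeConfig:
    """Hypothetical stand-in for GraphRagConfig."""

    root_dir: str
    embeddings: FakeEmbeddings = field(default_factory=FakeEmbeddings)


def patch_vector_config(config: FakeConfig) -> FakeConfig:
    # Inject the same default store as the diff when none is configured.
    if not config.embeddings.vector_store:
        config.embeddings.vector_store = {
            "type": "lancedb",
            "db_uri": "output/lancedb",
            "container_name": "default",
            "overwrite": True,
        }
    store = config.embeddings.vector_store
    if store["type"] == "lancedb":
        # Joining onto an already absolute db_uri returns that path unchanged,
        # so applying the patch twice does not nest the root directory.
        store["db_uri"] = str(Path(config.root_dir).resolve() / store["db_uri"])
    return config


cfg = patch_vector_config(FakeConfig(root_dir="."))
print(cfg.embeddings.vector_store["db_uri"])
```

Because pathlib discards the left operand when the right one is absolute, a second call leaves `db_uri` as it was, which matters if both the index and query paths apply the same resolution.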

graphrag/api/query.py

Lines changed: 55 additions & 81 deletions
@@ -182,56 +182,22 @@ async def local_search(
182182
------
183183
TODO: Document any exceptions to expect.
184184
"""
185-
#################################### BEGIN PATCH ####################################
186-
# TODO: remove the following patch that checks for a vector_store prior to v1 release
187-
# TODO: this is a backwards compatibility patch that injects the default vector_store settings into the config if it is not present
188-
# Only applicable in situations involving a local vector_store (lancedb). The general idea:
189-
# if vector_store not in config:
190-
# 1. assume user is running local if vector_store is not in config
191-
# 2. insert default vector_store in config
192-
# 3 .create lancedb vector_store instance
193-
# 4. upload vector embeddings from the input dataframes to the vector_store
194-
backwards_compatible = False
195-
if not config.embeddings.vector_store:
196-
backwards_compatible = True
197-
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
198-
from graphrag.vector_stores.lancedb import LanceDBVectorStore
199-
200-
config.embeddings.vector_store = {
201-
"type": "lancedb",
202-
"db_uri": f"{Path(config.storage.base_dir)}/lancedb",
203-
"collection_name": "entity_description_embeddings",
204-
"overwrite": True,
205-
}
206-
_entities = read_indexer_entities(nodes, entities, community_level)
207-
description_embedding_store = LanceDBVectorStore(
208-
db_uri=config.embeddings.vector_store["db_uri"],
209-
collection_name=config.embeddings.vector_store["collection_name"],
210-
overwrite=config.embeddings.vector_store["overwrite"],
211-
)
212-
description_embedding_store.connect(
213-
db_uri=config.embeddings.vector_store["db_uri"]
214-
)
215-
# dump embeddings from the entities list to the description_embedding_store
216-
store_entity_semantic_embeddings(
217-
entities=_entities, vectorstore=description_embedding_store
218-
)
219-
#################################### END PATCH ####################################
185+
config = _patch_vector_store(config, nodes, entities, community_level)
220186

221187
# TODO: update filepath of lancedb (if used) until the new config engine has been implemented
222188
# TODO: remove the type ignore annotations below once the new config engine has been refactored
223189
vector_store_type = config.embeddings.vector_store.get("type") # type: ignore
224190
vector_store_args = config.embeddings.vector_store
225-
if vector_store_type == VectorStoreType.LanceDB and not backwards_compatible:
191+
if vector_store_type == VectorStoreType.LanceDB:
226192
db_uri = config.embeddings.vector_store["db_uri"] # type: ignore
227193
lancedb_dir = Path(config.root_dir).resolve() / db_uri
228194
vector_store_args["db_uri"] = str(lancedb_dir) # type: ignore
229195

230196
reporter.info(f"Vector Store Args: {redact(vector_store_args)}") # type: ignore
231-
if not backwards_compatible: # can remove this check and always set the description_embedding_store before v1 release
232-
description_embedding_store = _get_embedding_description_store(
233-
config_args=vector_store_args, # type: ignore
234-
)
197+
198+
description_embedding_store = _get_embedding_description_store(
199+
config_args=vector_store_args, # type: ignore
200+
)
235201

236202
_entities = read_indexer_entities(nodes, entities, community_level)
237203
_covariates = read_indexer_covariates(covariates) if covariates is not None else []
@@ -289,56 +255,22 @@ async def local_search_streaming(
289255
------
290256
TODO: Document any exceptions to expect.
291257
"""
292-
#################################### BEGIN PATCH ####################################
293-
# TODO: remove the following patch that checks for a vector_store prior to v1 release
294-
# TODO: this is a backwards compatibility patch that injects the default vector_store settings into the config if it is not present
295-
# Only applicable in situations involving a local vector_store (lancedb). The general idea:
296-
# if vector_store not in config:
297-
# 1. assume user is running local if vector_store is not in config
298-
# 2. insert default vector_store in config
299-
# 3 .create lancedb vector_store instance
300-
# 4. upload vector embeddings from the input dataframes to the vector_store
301-
backwards_compatible = False
302-
if not config.embeddings.vector_store:
303-
backwards_compatible = True
304-
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
305-
from graphrag.vector_stores.lancedb import LanceDBVectorStore
306-
307-
config.embeddings.vector_store = {
308-
"type": "lancedb",
309-
"db_uri": f"{Path(config.storage.base_dir)}/lancedb",
310-
"collection_name": "entity_description_embeddings",
311-
"overwrite": True,
312-
}
313-
_entities = read_indexer_entities(nodes, entities, community_level)
314-
description_embedding_store = LanceDBVectorStore(
315-
db_uri=config.embeddings.vector_store["db_uri"],
316-
collection_name=config.embeddings.vector_store["collection_name"],
317-
overwrite=config.embeddings.vector_store["overwrite"],
318-
)
319-
description_embedding_store.connect(
320-
db_uri=config.embeddings.vector_store["db_uri"]
321-
)
322-
# dump embeddings from the entities list to the description_embedding_store
323-
store_entity_semantic_embeddings(
324-
entities=_entities, vectorstore=description_embedding_store
325-
)
326-
#################################### END PATCH ####################################
258+
config = _patch_vector_store(config, nodes, entities, community_level)
327259

328260
# TODO: must update filepath of lancedb (if used) until the new config engine has been implemented
329261
# TODO: remove the type ignore annotations below once the new config engine has been refactored
330262
vector_store_type = config.embeddings.vector_store.get("type") # type: ignore
331263
vector_store_args = config.embeddings.vector_store
332-
if vector_store_type == VectorStoreType.LanceDB and not backwards_compatible:
264+
if vector_store_type == VectorStoreType.LanceDB:
333265
db_uri = config.embeddings.vector_store["db_uri"] # type: ignore
334266
lancedb_dir = Path(config.root_dir).resolve() / db_uri
335267
vector_store_args["db_uri"] = str(lancedb_dir) # type: ignore
336268

337269
reporter.info(f"Vector Store Args: {redact(vector_store_args)}") # type: ignore
338-
if not backwards_compatible: # can remove this check and always set the description_embedding_store before v1 release
339-
description_embedding_store = _get_embedding_description_store(
340-
config_args=vector_store_args, # type: ignore
341-
)
270+
271+
description_embedding_store = _get_embedding_description_store(
272+
config_args=vector_store_args, # type: ignore
273+
)
342274

343275
_entities = read_indexer_entities(nodes, entities, community_level)
344276
_covariates = read_indexer_covariates(covariates) if covariates is not None else []
@@ -368,13 +300,55 @@ async def local_search_streaming(
368300
yield stream_chunk
369301

370302

303+
def _patch_vector_store(
304+
config: GraphRagConfig,
305+
nodes: pd.DataFrame,
306+
entities: pd.DataFrame,
307+
community_level: int,
308+
) -> GraphRagConfig:
309+
# TODO: remove the following patch that checks for a vector_store prior to v1 release
310+
# TODO: this is a backwards compatibility patch that injects the default vector_store settings into the config if it is not present
311+
# Only applicable in situations involving a local vector_store (lancedb). The general idea:
312+
# if vector_store not in config:
313+
# 1. assume user is running local if vector_store is not in config
314+
# 2. insert default vector_store in config
315+
# 3. create lancedb vector_store instance
316+
# 4. upload vector embeddings from the input dataframes to the vector_store
317+
if not config.embeddings.vector_store:
318+
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
319+
from graphrag.vector_stores.lancedb import LanceDBVectorStore
320+
321+
config.embeddings.vector_store = {
322+
"type": "lancedb",
323+
"db_uri": f"{Path(config.storage.base_dir)}/lancedb",
324+
"container_name": "default",
325+
"overwrite": True,
326+
}
327+
description_embedding_store = LanceDBVectorStore(
328+
db_uri=config.embeddings.vector_store["db_uri"],
329+
collection_name="default-entity-description",
330+
overwrite=config.embeddings.vector_store["overwrite"],
331+
)
332+
description_embedding_store.connect(
333+
db_uri=config.embeddings.vector_store["db_uri"]
334+
)
335+
# dump embeddings from the entities list to the description_embedding_store
336+
_entities = read_indexer_entities(nodes, entities, community_level)
337+
store_entity_semantic_embeddings(
338+
entities=_entities, vectorstore=description_embedding_store
339+
)
340+
return config
341+
342+
371343
def _get_embedding_description_store(
372344
config_args: dict,
373345
):
374346
"""Get the embedding description store."""
375347
vector_store_type = config_args["type"]
348+
collection_name = f"{config_args['container_name']}-entity-description"
376349
description_embedding_store = VectorStoreFactory.get_vector_store(
377-
vector_store_type=vector_store_type, kwargs=config_args
350+
vector_store_type=vector_store_type,
351+
kwargs={**config_args, "collection_name": collection_name},
378352
)
379353
description_embedding_store.connect(**config_args)
380354
return description_embedding_store
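
The naming rule introduced in `_get_embedding_description_store` can be illustrated on its own. This is a sketch under assumptions: `entity_description_collection` and `build_store_kwargs` are hypothetical names, and no vector store is actually created.

```python
def entity_description_collection(config_args: dict) -> str:
    """Derive the collection name from the container name, as in the diff."""
    return f"{config_args['container_name']}-entity-description"


def build_store_kwargs(config_args: dict) -> dict:
    """Return a copy of the args with collection_name added (hypothetical helper)."""
    return {
        **config_args,
        "collection_name": entity_description_collection(config_args),
    }


args = {"type": "lancedb", "db_uri": "output/lancedb", "container_name": "default"}
kwargs = build_store_kwargs(args)
print(kwargs["collection_name"])

# A config that only carries the old key has no container_name to read.
legacy_args = {"type": "lancedb", "collection_name": "entity_description_embeddings"}
try:
    build_store_kwargs(legacy_args)
    legacy_ok = True
except KeyError:
    legacy_ok = False
print(legacy_ok)
```

With the default container, the derived name matches the `default-entity-description` literal used in `_patch_vector_store`. The sketch also suggests that a settings dict without `container_name` would raise `KeyError` at this point, which may be worth guarding against.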

graphrag/cli/query.py

Lines changed: 4 additions & 1 deletion
@@ -115,6 +115,7 @@ def run_local_search(
115115
config.storage.base_dir = str(data_dir) if data_dir else config.storage.base_dir
116116
resolve_paths(config)
117117

118+
# TODO remove optional create_final_entities_description_embeddings.parquet to delete backwards compatibility
118119
dataframe_dict = _resolve_parquet_files(
119120
root_dir=root_dir,
120121
config=config,
@@ -125,7 +126,9 @@ def run_local_search(
125126
"create_final_relationships.parquet",
126127
"create_final_entities.parquet",
127128
],
128-
optional_list=["create_final_covariates.parquet"],
129+
optional_list=[
130+
"create_final_covariates.parquet",
131+
],
129132
)
130133
final_nodes: pd.DataFrame = dataframe_dict["create_final_nodes"]
131134
final_community_reports: pd.DataFrame = dataframe_dict[

graphrag/config/create_graphrag_config.py

Lines changed: 1 addition & 0 deletions
@@ -414,6 +414,7 @@ def hydrate_parallelization_params(
414414
raw_entities=reader.bool("raw_entities") or defs.SNAPSHOTS_RAW_ENTITIES,
415415
top_level_nodes=reader.bool("top_level_nodes")
416416
or defs.SNAPSHOTS_TOP_LEVEL_NODES,
417+
embeddings=reader.bool("embeddings") or defs.SNAPSHOTS_EMBEDDINGS,
417418
)
418419
with reader.envvar_prefix(Section.umap), reader.use(values.get("umap")):
419420
umap_model = UmapConfig(
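
The `reader.bool("embeddings") or defs.SNAPSHOTS_EMBEDDINGS` idiom can be sketched as below. `read_bool` is a hypothetical stand-in for `reader.bool`, and the assumption that it returns `None` for a missing key is mine, not confirmed by the diff.

```python
SNAPSHOTS_EMBEDDINGS = False  # mirrors the new default added in defaults.py


def read_bool(values: dict, key: str):
    """Hypothetical stand-in for reader.bool: None when the key is absent."""
    value = values.get(key)
    return None if value is None else bool(value)


def snapshot_embeddings(values: dict, default: bool = SNAPSHOTS_EMBEDDINGS) -> bool:
    # `or` falls through to the default for both None and False.
    return read_bool(values, "embeddings") or default


print(snapshot_embeddings({}))
print(snapshot_embeddings({"embeddings": True}))
print(snapshot_embeddings({"embeddings": False}))
print(snapshot_embeddings({"embeddings": False}, default=True))
```

The last call shows the caveat of the idiom: an explicit `False` cannot override a `True` default. That is harmless here because the default is `False`.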

graphrag/config/defaults.py

Lines changed: 2 additions & 1 deletion
@@ -82,6 +82,7 @@
8282
SNAPSHOTS_GRAPHML = False
8383
SNAPSHOTS_RAW_ENTITIES = False
8484
SNAPSHOTS_TOP_LEVEL_NODES = False
85+
SNAPSHOTS_EMBEDDINGS = False
8586
STORAGE_BASE_DIR = "output"
8687
STORAGE_TYPE = StorageType.file
8788
SUMMARIZE_DESCRIPTIONS_MAX_LENGTH = 500
@@ -91,7 +92,7 @@
9192
VECTOR_STORE = f"""
9293
type: {VectorStoreType.LanceDB.value}
9394
db_uri: '{(Path(STORAGE_BASE_DIR) / "lancedb")!s}'
94-
collection_name: entity_description_embeddings
95+
container_name: default
9596
overwrite: true\
9697
"""
9798

graphrag/config/enums.py

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ class TextEmbeddingTarget(str, Enum):
8686

8787
all = "all"
8888
required = "required"
89+
none = "none"
8990

9091
def __repr__(self):
9192
"""Get a string representation."""
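
A minimal sketch of how the new `none` member might be consumed. The enum members match the diff; `select_embeddings` and the embedding names other than `entity.description` are hypothetical placeholders, not the project's actual selection logic.

```python
from enum import Enum


class TextEmbeddingTarget(str, Enum):
    """Re-declared here so the sketch is self-contained."""

    all = "all"
    required = "required"
    none = "none"


# Placeholder sets: only "entity.description" appears in the commit itself.
REQUIRED_EMBEDDINGS = {"entity.description"}
ALL_EMBEDDINGS = REQUIRED_EMBEDDINGS | {"example.optional_a", "example.optional_b"}


def select_embeddings(target: TextEmbeddingTarget, skip=()) -> set:
    """Hypothetical selection rule following the documented semantics."""
    if target is TextEmbeddingTarget.none:
        return set()
    if target is TextEmbeddingTarget.required:
        return set(REQUIRED_EMBEDDINGS)
    # skip only customizes the list when target is "all"
    return ALL_EMBEDDINGS - set(skip)


# Members can be looked up from the raw string found in a settings file.
target = TextEmbeddingTarget("none")
print(select_embeddings(target))
print(sorted(select_embeddings(TextEmbeddingTarget.all, skip=["example.optional_a"])))
```

Because the enum mixes in `str`, a value read from YAML or JSON compares equal to the member, so `TextEmbeddingTarget.none == "none"` holds.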
