
Commit 429e1b1

Remove graph embedding and UMAP (#2048)
* Remove umap/layout operation
* Remove graph embedding
* Bump unified-search to GR 2.5.0
* Remove graph vis from unified-search
1 parent ac95c91 commit 429e1b1

33 files changed, +1189 -1955 lines changed

docs/config/yaml.md

Lines changed: 0 additions & 23 deletions
@@ -287,29 +287,6 @@ These are the settings used for Leiden hierarchical clustering of the graph to c
 - `max_length` **int** - The maximum number of output tokens per report.
 - `max_input_length` **int** - The maximum number of input tokens to use when generating reports.

-### embed_graph
-
-We use node2vec to embed the graph. This is primarily used for visualization, so it is not turned on by default.
-
-#### Fields
-
-- `enabled` **bool** - Whether to enable graph embeddings.
-- `dimensions` **int** - Number of vector dimensions to produce.
-- `num_walks` **int** - The node2vec number of walks.
-- `walk_length` **int** - The node2vec walk length.
-- `window_size` **int** - The node2vec window size.
-- `iterations` **int** - The node2vec number of iterations.
-- `random_seed` **int** - The node2vec random seed.
-- `strategy` **dict** - Fully override the embed graph strategy.
-
-### umap
-
-Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you *must* enable graph embedding as well.
-
-#### Fields
-
-- `enabled` **bool** - Whether to enable UMAP layouts.
-
 ### snapshots

 #### Fields
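
The hunk above removes the `embed_graph` and `umap` sections from the documented settings. As a minimal sketch, assuming an older `settings.yaml` has already been parsed into a plain dict, the following reports which removed sections are still present. `find_removed_sections` and `REMOVED_SECTIONS` are hypothetical names, not part of GraphRAG, and this commit does not say whether leftover keys are rejected or ignored.

```python
# Hypothetical helper; none of these names come from GraphRAG itself.
REMOVED_SECTIONS = ("embed_graph", "umap")  # sections deleted by this commit


def find_removed_sections(settings: dict) -> dict[str, bool]:
    """Map each removed top-level section found in `settings` to its `enabled` flag."""
    found = {}
    for name in REMOVED_SECTIONS:
        if name not in settings:
            continue
        section = settings[name]
        # A bare `umap:` line parses to None; treat a missing flag as disabled,
        # which matches the old default of `enabled: false`.
        found[name] = isinstance(section, dict) and bool(section.get("enabled", False))
    return found


# Illustrative input: a settings mapping as it might look after parsing an older file.
old_settings = {
    "snapshots": {"graphml": True},
    "embed_graph": {"enabled": True, "dimensions": 1536},
    "umap": {"enabled": False},
}
stale = find_removed_sections(old_settings)
```

A caller could log the returned names before upgrading, since any section reported as enabled relied on behavior this commit removes.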

docs/index/architecture.md

Lines changed: 0 additions & 1 deletion
@@ -23,7 +23,6 @@ stateDiagram-v2
 Chunk --> EmbedDocuments
 ExtractGraph --> GenerateReports
 ExtractGraph --> EmbedEntities
-ExtractGraph --> EmbedGraph
 ```

 ### LLM Caching

docs/index/default_dataflow.md

Lines changed: 2 additions & 22 deletions
@@ -46,8 +46,7 @@ flowchart TB
 end
 subgraph phase6[Phase 6: Network Visualization]
 graph_outputs --> graph_embed[Graph Embedding]
-graph_embed --> umap_entities[Umap Entities]
-umap_entities --> combine_nodes[Final Entities]
+graph_embed --> combine_nodes[Final Entities]
 end
 subgraph phase7[Phase 7: Text Embeddings]
 textUnits --> text_embed[Text Embedding]
@@ -176,27 +175,8 @@ In this step, we link each document to the text-units that were created in the f

 At this point, we can export the **Documents** table into the knowledge Model.

-## Phase 6: Network Visualization (optional)

-In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the _Entity-Relationship_ graph and the _Document_ graph.
-
-```mermaid
----
-title: Network Visualization Workflows
----
-flowchart LR
-ag[Graph Table] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Entities Table]
-```
-
-### Graph Embedding
-
-In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
-
-### Dimensionality Reduction
-
-For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are reduced to two dimensions as x/y coordinates.
-
-## Phase 7: Text Embedding
+## Phase 6: Text Embedding

 For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.

docs/index/methods.md

Lines changed: 1 addition & 1 deletion
@@ -41,4 +41,4 @@ You can install it manually by running `python -m spacy download <model_name>`,

 ## Choosing a Method

-Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive that FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high fidelity entities and graph exploration are important to your use case, we recommend staying with traditional GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG provides high quality summarization at much less LLM cost.
+Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high fidelity entities and graph exploration are important to your use case, we recommend staying with traditional GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG provides high quality summarization at much less LLM cost.

docs/index/outputs.md

Lines changed: 0 additions & 2 deletions
@@ -82,8 +82,6 @@ List of all entities found in the data by the LM.
 | text_unit_ids | str[] | List of the text units containing the entity. |
 | frequency | int | Count of text units the entity was found within. |
 | degree | int | Node degree (connectedness) in the graph. |
-| x | float | X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |
-| y | float | Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |

 ## relationships
 List of all entity-to-entity relationships found in the data by the LM. This is also the _edge list_ for the graph.

docs/visualization_guide.md

Lines changed: 0 additions & 7 deletions
@@ -8,13 +8,6 @@ Before building an index, please review your `settings.yaml` configuration file
 snapshots:
   graphml: true
 ```
-(Optional) To support other visualization tools and exploration, additional parameters can be enabled that provide access to vector embeddings.
-```yaml
-embed_graph:
-  enabled: true # will generate node2vec embeddings for nodes
-umap:
-  enabled: true # will generate UMAP embeddings for nodes, giving the entities table an x/y position to plot
-```
 After running the indexing pipeline over your data, there will be an output folder (defined by the `storage.base_dir` setting).

 - **Output Folder**: Contains artifacts from the LLM’s indexing pass.

graphrag/config/defaults.py

Lines changed: 0 additions & 23 deletions
@@ -125,20 +125,6 @@ class DriftSearchDefaults:
     embedding_model_id: str = DEFAULT_EMBEDDING_MODEL_ID


-@dataclass
-class EmbedGraphDefaults:
-    """Default values for embedding graph."""
-
-    enabled: bool = False
-    dimensions: int = 1536
-    num_walks: int = 10
-    walk_length: int = 40
-    window_size: int = 2
-    iterations: int = 3
-    random_seed: int = 597832
-    use_lcc: bool = True
-
-
 @dataclass
 class EmbedTextDefaults:
     """Default values for embedding text."""
@@ -367,13 +353,6 @@ class SummarizeDescriptionsDefaults:
     model_id: str = DEFAULT_CHAT_MODEL_ID


-@dataclass
-class UmapDefaults:
-    """Default values for UMAP."""
-
-    enabled: bool = False
-
-
 @dataclass
 class UpdateIndexOutputDefaults(StorageDefaults):
     """Default values for update index output."""
@@ -410,7 +389,6 @@ class GraphRagConfigDefaults:
     )
     cache: CacheDefaults = field(default_factory=CacheDefaults)
     input: InputDefaults = field(default_factory=InputDefaults)
-    embed_graph: EmbedGraphDefaults = field(default_factory=EmbedGraphDefaults)
     embed_text: EmbedTextDefaults = field(default_factory=EmbedTextDefaults)
     chunks: ChunksDefaults = field(default_factory=ChunksDefaults)
     snapshots: SnapshotsDefaults = field(default_factory=SnapshotsDefaults)
@@ -427,7 +405,6 @@ class GraphRagConfigDefaults:
     extract_claims: ExtractClaimsDefaults = field(default_factory=ExtractClaimsDefaults)
     prune_graph: PruneGraphDefaults = field(default_factory=PruneGraphDefaults)
     cluster_graph: ClusterGraphDefaults = field(default_factory=ClusterGraphDefaults)
-    umap: UmapDefaults = field(default_factory=UmapDefaults)
     local_search: LocalSearchDefaults = field(default_factory=LocalSearchDefaults)
     global_search: GlobalSearchDefaults = field(default_factory=GlobalSearchDefaults)
     drift_search: DriftSearchDefaults = field(default_factory=DriftSearchDefaults)
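
With `embed_graph` and `umap` gone from `GraphRagConfigDefaults`, code that reads those attributes directly would raise `AttributeError`. The sketch below shows one defensive pattern using `getattr`. The two dataclasses are stand-ins written for this example, not the real GraphRAG classes, and `graph_embedding_enabled` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotsDefaults:  # stand-in, not the real GraphRAG class
    graphml: bool = False


@dataclass
class ConfigDefaults:  # stand-in for the defaults object after this commit
    snapshots: SnapshotsDefaults = field(default_factory=SnapshotsDefaults)


def graph_embedding_enabled(defaults: object) -> bool:
    """Return the old `embed_graph.enabled` value, or False once the field is gone."""
    embed_graph = getattr(defaults, "embed_graph", None)
    # getattr on None falls through to the default, so both lookups are safe.
    return bool(getattr(embed_graph, "enabled", False))


enabled = graph_embedding_enabled(ConfigDefaults())
```

The same guard works against an older defaults object that still carries the field, so calling code does not need to branch on the installed version.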

graphrag/config/init_content.py

Lines changed: 0 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -130,12 +130,6 @@
130130
max_length: {graphrag_config_defaults.community_reports.max_length}
131131
max_input_length: {graphrag_config_defaults.community_reports.max_input_length}
132132
133-
embed_graph:
134-
enabled: false # if true, will generate node2vec embeddings for nodes
135-
136-
umap:
137-
enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)
138-
139133
snapshots:
140134
graphml: false
141135
embeddings: false
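
A `settings.yaml` generated before this commit still contains the two blocks removed from the template above. The following is a rough sketch of stripping them with a plain line filter. It assumes block-style YAML with top-level keys at column 0 and is not a YAML parser. `strip_removed_blocks` is a hypothetical name, and the `max_length` value in the sample is illustrative only.

```python
REMOVED_KEYS = ("embed_graph", "umap")  # top-level sections deleted by this commit


def strip_removed_blocks(yaml_text: str) -> str:
    """Drop each removed top-level block: its key line, children, and trailing blanks."""
    kept = []
    skipping = False
    for line in yaml_text.splitlines():
        # A non-empty line starting at column 0 begins a new top-level entry.
        is_top_level = bool(line) and not line[0].isspace()
        if is_top_level:
            key = line.split(":", 1)[0].strip()
            skipping = key in REMOVED_KEYS
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + "\n"


# Illustrative input modeled on the old template.
old_text = (
    "community_reports:\n"
    "  max_length: 2000\n"
    "\n"
    "embed_graph:\n"
    "  enabled: false\n"
    "\n"
    "umap:\n"
    "  enabled: false\n"
    "\n"
    "snapshots:\n"
    "  graphml: false\n"
)
new_text = strip_removed_blocks(old_text)
```

Blank lines that follow a removed block are dropped along with it, so the remaining sections stay separated by the single blank line that preceded the first removed block.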

graphrag/config/models/embed_graph_config.py

Lines changed: 0 additions & 45 deletions
This file was deleted.

graphrag/config/models/graph_rag_config.py

Lines changed: 0 additions & 13 deletions
@@ -19,7 +19,6 @@
 from graphrag.config.models.cluster_graph_config import ClusterGraphConfig
 from graphrag.config.models.community_reports_config import CommunityReportsConfig
 from graphrag.config.models.drift_search_config import DRIFTSearchConfig
-from graphrag.config.models.embed_graph_config import EmbedGraphConfig
 from graphrag.config.models.extract_claims_config import ClaimExtractionConfig
 from graphrag.config.models.extract_graph_config import ExtractGraphConfig
 from graphrag.config.models.extract_graph_nlp_config import ExtractGraphNLPConfig
@@ -35,7 +34,6 @@
     SummarizeDescriptionsConfig,
 )
 from graphrag.config.models.text_embedding_config import TextEmbeddingConfig
-from graphrag.config.models.umap_config import UmapConfig
 from graphrag.config.models.vector_store_config import VectorStoreConfig


@@ -254,17 +252,6 @@ def _validate_reporting_base_dir(self) -> None:
     )
     """The community reports configuration to use."""

-    embed_graph: EmbedGraphConfig = Field(
-        description="Graph embedding configuration.",
-        default=EmbedGraphConfig(),
-    )
-    """Graph Embedding configuration."""
-
-    umap: UmapConfig = Field(
-        description="The UMAP configuration to use.", default=UmapConfig()
-    )
-    """The UMAP configuration to use."""
-
     snapshots: SnapshotsConfig = Field(
         description="The snapshots configuration to use.",
         default=SnapshotsConfig(),
