Commit dad2176

Miscellaneous code cleanup procedures (#1452)
1 parent 0b2120c commit dad2176

96 files changed: +201 −452 lines changed

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+    "type": "patch",
+    "description": "miscellaneous code cleanup and minor changes for better alignment of style across the codebase."
+}

docs/config/env_vars.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@

 ## Text-Embeddings Customization

-By default, the GraphRAG indexer will only emit embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the `GRAPHRAG_EMBEDDING_TARGET` environment variable to `all`.
+By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the `GRAPHRAG_EMBEDDING_TARGET` environment variable to `all`.

 If the embedding target is `all`, and you want to only embed a subset of these fields, you may specify which embeddings to skip using the `GRAPHRAG_EMBEDDING_SKIP` argument described below.

@@ -152,7 +152,7 @@ These settings control the data input used by the pipeline. Any settings with a

 ## Storage

-This section controls the storage mechanism used by the pipeline used for emitting output tables.
+This section controls the storage mechanism used by the pipeline used for exporting output tables.

 | Parameter | Description | Type | Required or Optional | Default |
 | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----- | -------------------- | ------- |
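A minimal sketch of the embedding-target setting described in this file, assuming the environment variables are set in the same process that launches the indexer; the skip value shown is a placeholder, not a documented field name.

```python
import os

# Embed all plaintext fields rather than only the ones required for querying.
os.environ["GRAPHRAG_EMBEDDING_TARGET"] = "all"

# Optionally omit a subset of embeddings; the name below is illustrative only.
os.environ["GRAPHRAG_EMBEDDING_SKIP"] = "document.raw_content"
```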

docs/config/yaml.md

Lines changed: 7 additions & 7 deletions
@@ -67,7 +67,7 @@ This is the base LLM configuration section. Other steps may override this config
 - `async_mode` (see Async Mode top-level config)
 - `batch_size` **int** - The maximum batch size to use.
 - `batch_max_tokens` **int** - The maximum batch # of tokens.
-- `target` **required|all|none** - Determines which set of embeddings to emit.
+- `target` **required|all|none** - Determines which set of embeddings to export.
 - `skip` **list[str]** - Which embeddings to skip. Only useful if target=all to customize the list.
 - `vector_store` **dict** - The vector store to use. Configured for lancedb by default.
 - `type` **str** - `lancedb` or `azure_ai_search`. Default=`lancedb`
@@ -203,7 +203,7 @@ This is the base LLM configuration section. Other steps may override this config

 #### Fields

-- `max_cluster_size` **int** - The maximum cluster size to emit.
+- `max_cluster_size` **int** - The maximum cluster size to export.
 - `strategy` **dict** - Fully override the cluster_graph strategy.

 ### embed_graph
@@ -228,11 +228,11 @@ This is the base LLM configuration section. Other steps may override this config

 #### Fields

-- `embeddings` **bool** - Emit embeddings snapshots to parquet.
-- `graphml` **bool** - Emit graph snapshots to GraphML.
-- `raw_entities` **bool** - Emit raw entity snapshots to JSON.
-- `top_level_nodes` **bool** - Emit top-level-node snapshots to JSON.
-- `transient` **bool** - Emit transient workflow tables snapshots to parquet.
+- `embeddings` **bool** - Export embeddings snapshots to parquet.
+- `graphml` **bool** - Export graph snapshots to GraphML.
+- `raw_entities` **bool** - Export raw entity snapshots to JSON.
+- `top_level_nodes` **bool** - Export top-level-node snapshots to JSON.
+- `transient` **bool** - Export transient workflow tables snapshots to parquet.

 ### encoding_model

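The `target`/`skip` pair documented above can be easier to follow with a small sketch. This is a hypothetical illustration, not GraphRAG's actual implementation, and the embedding names are invented for the example.

```python
# Hypothetical resolution of the embeddings `target` and `skip` settings;
# the embedding names below are made up for illustration.
REQUIRED_EMBEDDINGS = {"entity.description", "community.full_content"}
ALL_EMBEDDINGS = REQUIRED_EMBEDDINGS | {"document.raw_content", "relationship.description"}

def resolve_embedding_targets(target: str, skip: list[str]) -> set[str]:
    """Return the set of embeddings to export for target in {required, all, none}."""
    if target == "none":
        return set()
    selected = ALL_EMBEDDINGS if target == "all" else REQUIRED_EMBEDDINGS
    return selected - set(skip)

# e.g. target="all" with skip=["document.raw_content"] keeps the other three fields.
```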

docs/index/default_dataflow.md

Lines changed: 6 additions & 6 deletions
@@ -105,9 +105,9 @@ Now that we have a graph of entities and relationships, each with a list of desc

 ### Claim Extraction & Emission

-Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These are emitted as a primary artifact called **Covariates**.
+Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called **Covariates**.

-Note: claim extraction is _optional_ and turned off by default. This is because claim extraction generally needs prompt tuning to be useful.
+Note: claim extraction is _optional_ and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.

 ## Phase 3: Graph Augmentation

@@ -131,7 +131,7 @@ In this step, we generate a vector representation of our graph using the Node2Ve

 ### Graph Tables Emission

-Once our graph augmentation steps are complete, the final **Entities** and **Relationships** tables are emitted after their text fields are text-embedded.
+Once our graph augmentation steps are complete, the final **Entities** and **Relationships** tables are exported after their text fields are text-embedded.

 ## Phase 4: Community Summarization

@@ -161,7 +161,7 @@ In this step, we generate a vector representation of our communities by generati

 ### Community Tables Emission

-At this point, some bookkeeping work is performed and we emit the **Communities** and **CommunityReports** tables.
+At this point, some bookkeeping work is performed and we export the **Communities** and **CommunityReports** tables.

 ## Phase 5: Document Processing

@@ -189,7 +189,7 @@ In this step, we generate a vector representation of our documents using an aver

 ### Documents Table Emission

-At this point, we can emit the **Documents** table into the knowledge Model.
+At this point, we can export the **Documents** table into the knowledge Model.

 ## Phase 6: Network Visualization

@@ -203,4 +203,4 @@ flowchart LR
     nv[Umap Documents] --> ne[Umap Entities] --> ng[Nodes Table Emission]
 ```

-For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then emitted as a table of _Nodes_. The rows of this table include a discriminator indicating whether the node is a document or an entity, and the UMAP coordinates.
+For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of _Nodes_. The rows of this table include a discriminator indicating whether the node is a document or an entity, and the UMAP coordinates.
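The UMAP step described in this file can be sketched in a few lines. This is only an illustration using the umap-learn and pandas packages, not the GraphRAG workflow itself; the embedding dimensions and node labels are invented.

```python
import numpy as np
import pandas as pd
import umap  # pip install umap-learn

rng = np.random.default_rng(42)
embeddings = rng.random((50, 128))            # pretend document/entity embeddings
kinds = ["entity"] * 30 + ["document"] * 20   # discriminator column

# Reduce to 2D so the nodes can be plotted.
coords = umap.UMAP(n_components=2).fit_transform(embeddings)

nodes = pd.DataFrame({"type": kinds, "x": coords[:, 0], "y": coords[:, 1]})
print(nodes.head())
```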

examples/use_built_in_workflows/run.py

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@

 from graphrag.index.config.input import PipelineCSVInputConfig
 from graphrag.index.config.workflow import PipelineWorkflowReference
-from graphrag.index.input.load_input import load_input
+from graphrag.index.input.factory import create_input
 from graphrag.index.run import run_pipeline, run_pipeline_with_config

 sample_data_dir = os.path.join(
@@ -14,7 +14,7 @@

 # Load our dataset once
 shared_dataset = asyncio.run(
-    load_input(
+    create_input(
         PipelineCSVInputConfig(
             file_pattern=".*\\.csv$",
             base_dir=sample_data_dir,

graphrag/api/index.py

Lines changed: 1 addition & 11 deletions
@@ -10,11 +10,10 @@

 from pathlib import Path

+from graphrag.cache.noop_pipeline_cache import NoopPipelineCache
 from graphrag.config.enums import CacheType
 from graphrag.config.models.graph_rag_config import GraphRagConfig
-from graphrag.index.cache.noop_pipeline_cache import NoopPipelineCache
 from graphrag.index.create_pipeline_config import create_pipeline_config
-from graphrag.index.emit.types import TableEmitterType
 from graphrag.index.run import run_pipeline_with_config
 from graphrag.index.typing import PipelineRunResult
 from graphrag.logging.base import ProgressReporter
@@ -27,7 +26,6 @@ async def build_index(
     is_resume_run: bool = False,
     memory_profile: bool = False,
     progress_reporter: ProgressReporter | None = None,
-    emit: list[TableEmitterType] = [TableEmitterType.Parquet],  # noqa: B006
 ) -> list[PipelineRunResult]:
     """Run the pipeline with the given configuration.

@@ -45,9 +43,6 @@
         Whether to enable memory profiling.
     progress_reporter : ProgressReporter | None default=None
         The progress reporter.
-    emit : list[str]
-        The list of emitter types to emit.
-        Accepted values {"parquet", "csv"}.

     Returns
     -------
@@ -60,10 +55,6 @@
         msg = "Cannot resume and update a run at the same time."
         raise ValueError(msg)

-    # Ensure Parquet is part of the emitters
-    if TableEmitterType.Parquet not in emit:
-        emit.append(TableEmitterType.Parquet)
-
     config = _patch_vector_config(config)

     pipeline_config = create_pipeline_config(config)
@@ -77,7 +68,6 @@
         memory_profile=memory_profile,
         cache=pipeline_cache,
         progress_reporter=progress_reporter,
-        emit=emit,
         is_resume_run=is_resume_run,
         is_update_run=is_update_run,
     ):
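With the `emit` parameter removed, a caller no longer chooses emitter types; parquet output is always produced. A minimal sketch of invoking the updated API, assuming a `GraphRagConfig` has already been built elsewhere (config loading is not shown):

```python
import asyncio

from graphrag.api.index import build_index
from graphrag.config.models.graph_rag_config import GraphRagConfig

async def run_indexing(config: GraphRagConfig) -> None:
    # No emit argument anymore; parquet tables are written by default.
    results = await build_index(
        config,
        is_resume_run=False,
        memory_profile=False,
        progress_reporter=None,
    )
    for result in results:  # list[PipelineRunResult]
        print(result)

# asyncio.run(run_indexing(config))  # `config` loaded from settings elsewhere
```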

graphrag/api/query.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
     entity_description_embedding,
 )
 from graphrag.logging.print_progress import PrintProgressReporter
-from graphrag.query.factories import (
+from graphrag.query.factory import (
     get_drift_search_engine,
     get_global_search_engine,
     get_local_search_engine,
@@ -1,4 +1,4 @@
 # Copyright (c) 2024 Microsoft Corporation.
 # Licensed under the MIT License

-"""The Indexing Engine storage package root."""
+"""A package containing cache implementations."""
Lines changed: 12 additions & 9 deletions
@@ -1,7 +1,7 @@
 # Copyright (c) 2024 Microsoft Corporation.
 # Licensed under the MIT License

-"""A module containing load_cache method definition."""
+"""A module containing create_cache method definition."""

 from __future__ import annotations

@@ -12,29 +12,32 @@
     PipelineBlobCacheConfig,
     PipelineFileCacheConfig,
 )
-from graphrag.index.storage.blob_pipeline_storage import BlobPipelineStorage
-from graphrag.index.storage.file_pipeline_storage import FilePipelineStorage
+from graphrag.storage.blob_pipeline_storage import BlobPipelineStorage
+from graphrag.storage.file_pipeline_storage import FilePipelineStorage

 if TYPE_CHECKING:
+    from graphrag.cache.pipeline_cache import PipelineCache
     from graphrag.index.config.cache import (
         PipelineCacheConfig,
     )

-from graphrag.index.cache.json_pipeline_cache import JsonPipelineCache
-from graphrag.index.cache.memory_pipeline_cache import create_memory_cache
-from graphrag.index.cache.noop_pipeline_cache import NoopPipelineCache
+from graphrag.cache.json_pipeline_cache import JsonPipelineCache
+from graphrag.cache.memory_pipeline_cache import InMemoryCache
+from graphrag.cache.noop_pipeline_cache import NoopPipelineCache


-def load_cache(config: PipelineCacheConfig | None, root_dir: str | None):
-    """Load the cache from the given config."""
+def create_cache(
+    config: PipelineCacheConfig | None, root_dir: str | None
+) -> PipelineCache:
+    """Create a cache from the given config."""
     if config is None:
         return NoopPipelineCache()

     match config.type:
         case CacheType.none:
             return NoopPipelineCache()
         case CacheType.memory:
-            return create_memory_cache()
+            return InMemoryCache()
         case CacheType.file:
             config = cast(PipelineFileCacheConfig, config)
             storage = FilePipelineStorage(root_dir).child(config.base_dir)
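A short sketch of calling the renamed factory. The module path `graphrag.cache.factory` is an assumption (the renamed file path is not shown in this excerpt), and the directory names are placeholders:

```python
from graphrag.cache.factory import create_cache  # assumed module path
from graphrag.index.config.cache import PipelineFileCacheConfig

# With no config, a NoopPipelineCache is returned.
noop_cache = create_cache(None, root_dir=None)

# A file-backed cache; base_dir/root_dir values here are illustrative.
file_cache = create_cache(
    PipelineFileCacheConfig(base_dir="cache"),
    root_dir="./output",
)
```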

graphrag/index/cache/json_pipeline_cache.py renamed to graphrag/cache/json_pipeline_cache.py

Lines changed: 2 additions & 2 deletions
@@ -6,8 +6,8 @@
 import json
 from typing import Any

-from graphrag.index.cache.pipeline_cache import PipelineCache
-from graphrag.index.storage.pipeline_storage import PipelineStorage
+from graphrag.cache.pipeline_cache import PipelineCache
+from graphrag.storage.pipeline_storage import PipelineStorage


 class JsonPipelineCache(PipelineCache):
