Commit c8c354e

Artifact cleanup (#1341)
* Add source documents for verb tests
* Remove entity_type erroneous column
* Add new test data
* Remove source/target degree columns
* Remove top_level_node_id
* Remove chunk column configs
* Rename "chunk" to "text"
* Rename "chunk" to "text" in base
* Re-map document input to use base text units
* Revert base text units as final documents dep
* Update test data
* Split/rename node source_id
* Drop node size (dup of degree)
* Drop document_ids from covariates
* Remove unused document_ids from models
* Remove n_tokens from covariate table
* Fix missed document_ids delete
* Wire base text units to final documents
* Rename relationship rank as combined_degree
* Add rank as first-class property to Relationship
* Remove split_text operation
* Fix relationships test parquet
* Update test parquets
* Add entity ids to community table
* Remove stored graph embedding columns
* Format
* Semver
* Fix JSON typo
* Spelling
* Rename lancedb
* Sort lancedb
* Fix unit test
* Fix test to account for changing period
* Update tests for separate embeddings
* Format
* Better assertion printing
* Fix unit test for windows
* Rename document.raw_content -> document.text
* Remove read_documents function
* Remove unused document summary from model
* Remove unused imports
* Format
* Add new snapshots to default init
* Use util to construct embeddings collection name
* Align inc index model with branch changes
* Update data and tests for int ids
* Clean up embedding locs
* Switch entity "name" to "title" for consistency
* Fix short_id -> human_readable_id defaults
* Format
* Rework community IDs
* Fix community size compute
* Fix unit tests
* Fix report read
* Pare down nodes table output
* Fix unit test
* Fix merge
* Fix community loading
* Format
* Fix community id report extraction
* Update tests
* Consistent short IDs and ordering
* Update ordering and tests
* Update incremental for new nodes model
* Guard document columns loc
* Match column ordering
* Fix document guard
* Update smoke tests
* Fill NA on community extract
* Logging for smoke test debug
* Add parquet schema details doc
* Fix community hierarchy guard
* Use better empty hierarchy guard
* Back-compat shims
* Semver
* Fix warning
* Format
* Remove default fallback
* Reuse key
1 parent e534223 commit c8c354e

83 files changed

+687
-681
lines changed

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+    "type": "minor",
+    "description": "Data model changes."
+}
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+    "type": "patch",
+    "description": "Cleanup of artifact outputs/schemas."
+}

docs/config/env_vars.md

Lines changed: 2 additions & 2 deletions
@@ -9,8 +9,8 @@ If the embedding target is `all`, and you want to only embed a subset of these f
 ### Embedded Fields

 - `text_unit.text`
-- `document.raw_content`
-- `entity.name`
+- `document.text`
+- `entity.title`
 - `entity.description`
 - `relationship.description`
 - `community.title`

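The renames in this hunk (`document.raw_content` to `document.text`, `entity.name` to `entity.title`) affect any configuration that refers to embedded fields by name. As a minimal sketch, assuming such a list is stored as a comma-separated string, the helper below rewrites the old names and leaves everything else untouched. `migrate_embedded_fields` and `RENAMED_FIELDS` are hypothetical names for illustration and are not part of graphrag.

```python
# Hypothetical helper, not part of graphrag: maps the embedded-field names
# renamed in this commit onto their new spellings.
RENAMED_FIELDS = {
    "document.raw_content": "document.text",
    "entity.name": "entity.title",
}


def migrate_embedded_fields(value: str) -> str:
    """Rewrite a comma-separated list of embedded fields to the new names."""
    # Split on commas, trim whitespace, and drop empty entries.
    fields = [item.strip() for item in value.split(",") if item.strip()]
    # Names that were not renamed pass through unchanged.
    return ",".join(RENAMED_FIELDS.get(field, field) for field in fields)


if __name__ == "__main__":
    print(migrate_embedded_fields("text_unit.text, entity.name, document.raw_content"))
```

Keeping the mapping in one dictionary makes it easy to extend if further fields are renamed later.
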
docs/examples_notebooks/drift_search.ipynb

Lines changed: 1 addition & 1 deletion
@@ -204,7 +204,7 @@
 "# load description embeddings to an in-memory lancedb vectorstore\n",
 "# to connect to a remote db, specify url and port values.\n",
 "description_embedding_store = LanceDBVectorStore(\n",
-"    collection_name=\"entity_description_embeddings\",\n",
+"    collection_name=\"default-entity-description\",\n",
 ")\n",
 "description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
 "entity_description_embeddings = store_entity_semantic_embeddings(\n",

docs/examples_notebooks/local_search.ipynb

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@
 "# load description embeddings to an in-memory lancedb vectorstore\n",
 "# to connect to a remote db, specify url and port values.\n",
 "description_embedding_store = LanceDBVectorStore(\n",
-"    collection_name=\"entity.description\",\n",
+"    collection_name=\"default-entity-description\",\n",
 ")\n",
 "description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
 "entity_description_embeddings = store_entity_semantic_embeddings(\n",

docs/index/default_dataflow.md

Lines changed: 2 additions & 1 deletion
@@ -9,7 +9,8 @@ The knowledge model is a specification for data outputs that conform to our data
 - `Entity` - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
 - `Relationship` - A relationship between two entities. These are generated from the covariates.
 - `Covariate` - Extracted claim information, which contains statements about entities which may be time-bound.
-- `Community Report` - Once entities are generated, we perform hierarchical community detection on them and generate reports for each community in this hierarchy.
+- `Community` - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
+- `Community Report` - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
 - `Node` - This table contains layout information for rendered graph-views of the Entities and Documents which have been embedded and clustered.

 ## The Default Configuration Workflow

docs/index/outputs.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+# Outputs
+
+The default pipeline produces a series of output tables that align with the [conceptual knowledge model](../index/default_dataflow.md). This page describes the detailed output table schemas. By default we write these tables out as parquet files on disk.
+
+## Shared fields
+All tables have two identifier fields:
+- id: str - Generated UUID, assuring global uniqueness
+- human_readable_id: int - This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually.
+
+## create_final_communities
+This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.
+- community: int - Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment.
+- level: int - Depth of the community in the hierarchy.
+- title: str - Friendly name of the community.
+- entity_ids - List of entities that are members of the community.
+- relationship_ids - List of relationships that are wholly within the community (source and target are both in the community).
+- text_unit_ids - List of text units represented within the community.
+- period - Date of ingest, used for incremental update merges.
+- size - Size of the community (entity count), used for incremental update merges.
+
+## create_final_community_reports
+This is the list of summarized reports for each community.
+- community: int - Short ID of the community this report applies to.
+- level: int - Level of the community this report applies to.
+- title: str - LM-generated title for the report.
+- summary: str - LM-generated summary of the report.
+- full_content: str - LM-generated full report.
+- rank: float - LM-derived relevance ranking of the report based on member entity salience.
+- rank_explanation - LM-derived explanation of the rank.
+- findings: dict - LM-derived list of the top 5-10 insights from the community. Contains `summary` and `explanation` values.
+- full_content_json - Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users.
+- period - Date of ingest, used for incremental update merges.
+- size - Size of the community (entity count), used for incremental update merges.
+
+## create_final_covariates
+(Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.
+- covariate_type: str - This is always "claim" with our default covariates.
+- type: str - Nature of the claim type.
+- description: str - LM-generated description of the behavior.
+- subject_id: str - Name of the source entity (that is performing the claimed behavior).
+- object_id: str - Name of the target entity (that the claimed behavior is performed on).
+- status: str [TRUE, FALSE, SUSPECTED] - LM-derived assessment of the correctness of the claim.
+- start_date: str (ISO8601) - LM-derived start of the claimed activity.
+- end_date: str (ISO8601) - LM-derived end of the claimed activity.
+- source_text: str - Short string of text containing the claimed behavior.
+- text_unit_id: str - ID of the text unit the claim text was extracted from.
+
+## create_final_documents
+List of document content after import.
+- title: str - Filename, unless otherwise configured during CSV import.
+- text: str - Full text of the document.
+- text_unit_ids: str[] - List of text units (chunks) that were parsed from the document.
+- attributes: dict (optional) - If specified during CSV import, this is a dict of attributes for the document.
+
+## create_final_entities
+List of all entities found in the data by the LM.
+- title: str - Name of the entity.
+- type: str - Type of the entity. By default this will be "organization", "person", "geo", or "event" unless configured differently or auto-tuning is used.
+- description: str - Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions.
+- text_unit_ids: str[] - List of the text units containing the entity.
+
+## create_final_nodes
+This is graph-related information for the entities. It contains only information relevant to the graph such as community. There is an entry for each entity at every community level it is found within, so you may see "duplicate" entities.
+
+Note that the ID fields match those in create_final_entities and can be used for joining if additional information about a node is required.
+- title: str - Name of the referenced entity. Duplicated from create_final_entities for convenient cross-referencing.
+- community: int - Leiden community the node is found within. Entities are not always assigned a community (they may not be close enough to any), so they may have an ID of -1.
+- level: int - Level of the community the entity is in.
+- degree: int - Node degree (connectedness) in the graph.
+- x: float - X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.
+- y: float - Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.
+
+## create_final_relationships
+List of all entity-to-entity relationships found in the data by the LM. This is also the _edge list_ for the graph.
+- source: str - Name of the source entity.
+- target: str - Name of the target entity.
+- description: str - LM-derived description of the relationship. Also see note for entity descriptions.
+- weight: float - Weight of the edge in the graph. This is summed from an LM-derived "strength" measure for each relationship instance.
+- combined_degree: int - Sum of source and target node degrees.
+- text_unit_ids: str[] - List of text units the relationship was found within.
+
+## create_final_text_units
+List of all text chunks parsed from the input documents.
+- text: str - Raw full text of the chunk.
+- n_tokens: int - Number of tokens in the chunk. This should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter.
+- document_ids: str[] - List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents.
+- entity_ids: str[] - List of entities found in the text unit.
+- relationship_ids: str[] - List of relationships found in the text unit.
+- covariate_ids: str[] - Optional list of covariates found in the text unit.

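Two of the fields documented in `docs/index/outputs.md` are derived from the edge list: `combined_degree` (sum of source and target node degrees) and a community's `relationship_ids` (relationships whose source and target are both members). The sketch below illustrates those two definitions on a toy edge list. It assumes an undirected graph and, for brevity, identifies community members by entity name rather than by ID. The data and function names are made up for illustration and are not graphrag code.

```python
from collections import Counter

# Toy edge list standing in for create_final_relationships rows.
# The ids and entity names are invented for this example.
relationships = [
    {"id": "r1", "source": "A", "target": "B"},
    {"id": "r2", "source": "B", "target": "C"},
    {"id": "r3", "source": "C", "target": "D"},
]


def node_degrees(edges):
    """Count how many edges touch each entity (undirected degree)."""
    degrees = Counter()
    for edge in edges:
        degrees[edge["source"]] += 1
        degrees[edge["target"]] += 1
    return degrees


def combined_degree(edge, degrees):
    """Sum of the source and target node degrees for one edge."""
    return degrees[edge["source"]] + degrees[edge["target"]]


def relationships_within(edges, members):
    """Ids of edges whose source and target are both community members."""
    member_set = set(members)
    return [
        edge["id"]
        for edge in edges
        if edge["source"] in member_set and edge["target"] in member_set
    ]


degrees = node_degrees(relationships)
print({edge["id"]: combined_degree(edge, degrees) for edge in relationships})
print(relationships_within(relationships, ["A", "B", "C"]))
```

In this toy graph the edge `r3` is excluded from the community `A, B, C` because its target `D` is not a member, which matches the "wholly within" wording in the schema description.
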
examples_notebooks/community_contrib/yfiles-jupyter-graphs/graph-visualization.ipynb

Lines changed: 1 addition & 1 deletion
@@ -299,7 +299,7 @@
 "entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)\n",
 "\n",
 "description_embedding_store = LanceDBVectorStore(\n",
-"    collection_name=\"entity.description\",\n",
+"    collection_name=\"default-entity-description\",\n",
 ")\n",
 "description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
 "entity_description_embeddings = store_entity_semantic_embeddings(\n",

graphrag/api/query.py

Lines changed: 20 additions & 14 deletions
@@ -25,6 +25,10 @@
 from pydantic import validate_call

 from graphrag.config import GraphRagConfig
+from graphrag.index.config.embeddings import (
+    community_full_content_embedding,
+    entity_description_embedding,
+)
 from graphrag.logging import PrintProgressReporter
 from graphrag.query.factories import (
     get_drift_search_engine,
@@ -42,6 +46,7 @@
 )
 from graphrag.query.structured_search.base import SearchResult  # noqa: TCH001
 from graphrag.utils.cli import redact
+from graphrag.utils.embeddings import create_collection_name
 from graphrag.vector_stores import VectorStoreFactory, VectorStoreType
 from graphrag.vector_stores.base import BaseVectorStore

@@ -228,7 +233,7 @@ async def local_search(

 description_embedding_store = _get_embedding_store(
     config_args=vector_store_args,  # type: ignore
-    container_suffix="entity-description",
+    embedding_name=entity_description_embedding,
 )

 _entities = read_indexer_entities(nodes, entities, community_level)
@@ -302,7 +307,7 @@ async def local_search_streaming(

 description_embedding_store = _get_embedding_store(
     config_args=vector_store_args,  # type: ignore
-    container_suffix="entity-description",
+    embedding_name=entity_description_embedding,
 )

 _entities = read_indexer_entities(nodes, entities, community_level)
@@ -385,12 +390,12 @@ async def drift_search(

 description_embedding_store = _get_embedding_store(
     config_args=vector_store_args,  # type: ignore
-    container_suffix="entity-description",
+    embedding_name=entity_description_embedding,
 )

 full_content_embedding_store = _get_embedding_store(
     config_args=vector_store_args,  # type: ignore
-    container_suffix="community-full_content",
+    embedding_name=community_full_content_embedding,
 )

 _entities = read_indexer_entities(nodes, entities, community_level)
@@ -450,7 +455,10 @@ def _patch_vector_store(
 }
 description_embedding_store = LanceDBVectorStore(
     db_uri=config.embeddings.vector_store["db_uri"],
-    collection_name="default-entity-description",
+    collection_name=create_collection_name(
+        config.embeddings.vector_store["container_name"],
+        entity_description_embedding,
+    ),
     overwrite=config.embeddings.vector_store["overwrite"],
 )
 description_embedding_store.connect(
@@ -469,11 +477,7 @@ def _patch_vector_store(
 from graphrag.vector_stores.lancedb import LanceDBVectorStore

 community_reports = with_reports
-collection_name = (
-    config.embeddings.vector_store.get("container_name", "default")
-    if config.embeddings.vector_store
-    else "default"
-)
+container_name = config.embeddings.vector_store["container_name"]
 # Store report embeddings
 _reports = read_indexer_reports(
     community_reports,
@@ -485,7 +489,9 @@ def _patch_vector_store(

 full_content_embedding_store = LanceDBVectorStore(
     db_uri=config.embeddings.vector_store["db_uri"],
-    collection_name=f"{collection_name}-community-full_content",
+    collection_name=create_collection_name(
+        container_name, community_full_content_embedding
+    ),
     overwrite=config.embeddings.vector_store["overwrite"],
 )
 full_content_embedding_store.connect(
@@ -501,12 +507,12 @@ def _patch_vector_store(

 def _get_embedding_store(
     config_args: dict,
-    container_suffix: str,
+    embedding_name: str,
 ) -> BaseVectorStore:
     """Get the embedding description store."""
     vector_store_type = config_args["type"]
-    collection_name = (
-        f"{config_args.get('container_name', 'default')}-{container_suffix}"
+    collection_name = create_collection_name(
+        config_args.get("container_name", "default"), embedding_name
     )
     embedding_store = VectorStoreFactory.get_vector_store(
         vector_store_type=vector_store_type,

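The hunks above replace hand-built strings such as `"default-entity-description"` and `f"{collection_name}-community-full_content"` with calls to `create_collection_name(container_name, embedding_name)`. The body of that utility is not shown in this part of the diff, so the following is only a sketch of a naming rule consistent with the old strings. It assumes the embedding constants hold dotted names such as `entity.description` (the form listed under "Embedded Fields" in the docs). `sketch_collection_name` is a hypothetical stand-in, not the graphrag function.

```python
def sketch_collection_name(container_name: str, embedding_name: str) -> str:
    """Build a collection name from a container prefix and an embedding name.

    Assumed rule (not confirmed by the diff): join the two parts with a
    hyphen and replace dots with hyphens.
    """
    return f"{container_name}-{embedding_name}".replace(".", "-")


if __name__ == "__main__":
    # Assumed values for the embedding-name constants.
    print(sketch_collection_name("default", "entity.description"))
    print(sketch_collection_name("default", "community.full_content"))
```

Under these assumptions the two calls reproduce the literals that the commit removed, which is why a single helper can replace every call site.
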
graphrag/index/config/__init__.py

Lines changed: 4 additions & 4 deletions
@@ -16,9 +16,9 @@
     community_full_content_embedding,
     community_summary_embedding,
     community_title_embedding,
-    document_raw_content_embedding,
+    document_text_embedding,
     entity_description_embedding,
-    entity_name_embedding,
+    entity_title_embedding,
     relationship_description_embedding,
     required_embeddings,
     text_unit_text_embedding,
@@ -82,9 +82,9 @@
     "community_full_content_embedding",
     "community_summary_embedding",
     "community_title_embedding",
-    "document_raw_content_embedding",
+    "document_text_embedding",
     "entity_description_embedding",
-    "entity_name_embedding",
+    "entity_title_embedding",
     "relationship_description_embedding",
     "required_embeddings",
     "text_unit_text_embedding",
