Commit 97704ab

Remove text unit grouping (#2052)
* Remove text unit group_by_columns
* Semver
* Fix default token split test
* Fix models in config test samples
* Fix token length in context sort test
* Fix document sort
1 parent 978e798 commit 97704ab

32 files changed: +60, -93 lines changed
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+    "type": "major",
+    "description": "Remove text unit group-by ability."
+}

docs/config/yaml.md

Lines changed: 0 additions & 1 deletion
@@ -99,7 +99,6 @@ These settings configure how we parse documents into text chunks. This is necess
 
 - `size` **int** - The max chunk size in tokens.
 - `overlap` **int** - The chunk overlap in tokens.
-- `group_by_columns` **list[str]** - Group documents by these fields before chunking.
 - `strategy` **str**[tokens|sentences] - How to chunk the text.
 - `encoding_model` **str** - The text encoding model to use for splitting on token boundaries.
 - `prepend_metadata` **bool** - Determines if metadata values should be added at the beginning of each chunk. Default=`False`.
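
The remaining keys map one-to-one onto the `ChunkingConfig` model changed later in this commit. A minimal sketch of wiring the documented settings into that model, assuming the module path mirrors the file path `graphrag/config/models/chunking_config.py` and that PyYAML is available; the sample values are illustrative, not a prescribed configuration:

```python
# Sketch: feed the documented `chunks` keys (group_by_columns removed) into
# ChunkingConfig. Import path and sample values are assumptions.
import yaml

from graphrag.config.models.chunking_config import ChunkingConfig

raw = yaml.safe_load(
    """
chunks:
  size: 1200
  overlap: 100
  encoding_model: o200k_base
  prepend_metadata: false
"""
)

chunks_config = ChunkingConfig(**raw["chunks"])
print(chunks_config.size, chunks_config.overlap)  # 1200 100
```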

docs/index/default_dataflow.md

Lines changed: 1 addition & 3 deletions
@@ -59,9 +59,7 @@ flowchart TB
 
 The first phase of the default-configuration workflow is to transform input documents into _TextUnits_. A _TextUnit_ is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to empower breadcrumbs and provenance by concepts back to their original source text.
 
-The chunk size (counted in tokens), is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single "glean" step. (A "glean" step is a follow-on extraction). Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
-
-The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log)
+The chunk size (counted in tokens), is user-configurable. By default this is set to 1200 tokens. Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
 
 ```mermaid
 ---
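
For readers who want to see what a 1200-token chunk with a 100-token overlap means in practice, here is a small standalone sketch using `tiktoken` directly. It is not GraphRAG's own `chunk_text`; the `o200k_base` encoding name is taken from the defaults change in this commit.

```python
# Standalone sketch of token-window chunking: windows of `size` tokens that
# overlap by `overlap` tokens. Illustrative only.
import tiktoken


def chunk_by_tokens(text: str, size: int = 1200, overlap: int = 100) -> list[str]:
    encoding = tiktoken.get_encoding("o200k_base")
    tokens = encoding.encode(text)
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start : start + size]))
        if start + size >= len(tokens):
            break
    return chunks


sample = "GraphRAG turns input documents into TextUnits for graph extraction. " * 400
chunks = chunk_by_tokens(sample)
encoding = tiktoken.get_encoding("o200k_base")
print(len(chunks), "chunks; first chunk is", len(encoding.encode(chunks[0])), "tokens")
```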

docs/index/outputs.md

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ List of all text chunks parsed from the input documents.
 | ----------------- | ----- | ----------- |
 | text | str | Raw full text of the chunk. |
 | n_tokens | int | Number of tokens in the chunk. This should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter. |
-| document_ids | str[] | List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents. |
+| document_id | str | ID of the document the chunk came from. |
 | entity_ids | str[] | List of entities found in the text unit. |
 | relationships_ids | str[] | List of relationships found in the text unit. |
 | covariate_ids | str[] | Optional list of covariates found in the text unit. |
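
A quick post-upgrade sanity check on the text unit table; a sketch that assumes the default parquet output location (the actual path depends on your configured output directory and file names):

```python
# Sketch: confirm the new text unit schema — a scalar document_id column
# instead of the old document_ids list column. The parquet path is an
# assumption based on the default "output" directory.
import pandas as pd

text_units = pd.read_parquet("output/text_units.parquet")

print(text_units.columns.tolist())
assert "document_id" in text_units.columns
assert "document_ids" not in text_units.columns
print(text_units[["id", "document_id", "n_tokens"]].head())
```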

graphrag/config/defaults.py

Lines changed: 4 additions & 5 deletions
@@ -27,15 +27,15 @@
 DEFAULT_OUTPUT_BASE_DIR = "output"
 DEFAULT_CHAT_MODEL_ID = "default_chat_model"
 DEFAULT_CHAT_MODEL_TYPE = ModelType.OpenAIChat
-DEFAULT_CHAT_MODEL = "gpt-4-turbo-preview"
+DEFAULT_CHAT_MODEL = "gpt-4o"
 DEFAULT_CHAT_MODEL_AUTH_TYPE = AuthType.APIKey
 DEFAULT_EMBEDDING_MODEL_ID = "default_embedding_model"
 DEFAULT_EMBEDDING_MODEL_TYPE = ModelType.OpenAIEmbedding
-DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"
+DEFAULT_EMBEDDING_MODEL = "text-embedding-ada-002"
 DEFAULT_EMBEDDING_MODEL_AUTH_TYPE = AuthType.APIKey
 DEFAULT_VECTOR_STORE_ID = "default_vector_store"
 
-ENCODING_MODEL = "cl100k_base"
+ENCODING_MODEL = "o200k_base"
 COGNITIVE_SERVICES_AUDIENCE = "https://cognitiveservices.azure.com/.default"
 
 
@@ -68,9 +68,8 @@ class ChunksDefaults:
 
     size: int = 1200
     overlap: int = 100
-    group_by_columns: list[str] = field(default_factory=lambda: ["id"])
    strategy: ClassVar[ChunkStrategyType] = ChunkStrategyType.tokens
-    encoding_model: str = "cl100k_base"
+    encoding_model: str = ENCODING_MODEL
     prepend_metadata: bool = False
     chunk_size_includes_metadata: bool = False
 
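
The `ENCODING_MODEL` switch tracks the new chat model default: `o200k_base` is the tokenizer `tiktoken` pairs with `gpt-4o`, while `cl100k_base` pairs with the `gpt-4-turbo` family. A quick check, assuming a `tiktoken` release recent enough to know about `gpt-4o`:

```python
# Sketch: show which tokenizer tiktoken associates with each chat model default.
import tiktoken

print(tiktoken.encoding_for_model("gpt-4o").name)               # o200k_base
print(tiktoken.encoding_for_model("gpt-4-turbo-preview").name)  # cl100k_base
```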

graphrag/config/init_content.py

Lines changed: 0 additions & 1 deletion
@@ -67,7 +67,6 @@
 chunks:
   size: {graphrag_config_defaults.chunks.size}
   overlap: {graphrag_config_defaults.chunks.overlap}
-  group_by_columns: [{",".join(graphrag_config_defaults.chunks.group_by_columns)}]
 
 ### Output/storage settings ###
 ## If blob storage is specified in the following four sections,

graphrag/config/models/chunking_config.py

Lines changed: 0 additions & 4 deletions
@@ -20,10 +20,6 @@ class ChunkingConfig(BaseModel):
         description="The chunk overlap to use.",
         default=graphrag_config_defaults.chunks.overlap,
     )
-    group_by_columns: list[str] = Field(
-        description="The chunk by columns to use.",
-        default=graphrag_config_defaults.chunks.group_by_columns,
-    )
     strategy: ChunkStrategyType = Field(
         description="The chunking strategy to use.",
         default=graphrag_config_defaults.chunks.strategy,
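
After this change the field is simply gone from the model, so `group_by_columns` is no longer represented in programmatic config at all. A small sketch using the pydantic v2 introspection API, with the import path assumed from the file location:

```python
# Sketch: group_by_columns is no longer a declared field; the remaining
# fields fall back to graphrag_config_defaults.chunks.
from graphrag.config.models.chunking_config import ChunkingConfig

config = ChunkingConfig()  # every remaining field has a default

print("group_by_columns" in ChunkingConfig.model_fields)  # False
print(config.size, config.overlap, config.encoding_model)
```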

graphrag/data_model/schemas.py

Lines changed: 2 additions & 2 deletions
@@ -52,7 +52,7 @@
 RELATIONSHIP_IDS = "relationship_ids"
 TEXT_UNIT_IDS = "text_unit_ids"
 COVARIATE_IDS = "covariate_ids"
-DOCUMENT_IDS = "document_ids"
+DOCUMENT_ID = "document_id"
 
 PERIOD = "period"
 SIZE = "size"
@@ -142,7 +142,7 @@
     SHORT_ID,
     TEXT,
     N_TOKENS,
-    DOCUMENT_IDS,
+    DOCUMENT_ID,
     ENTITY_IDS,
     RELATIONSHIP_IDS,
     COVARIATE_IDS,
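
Tables produced before this commit carry a `document_ids` list column; since the default grouping was by document id anyway, those lists normally hold exactly one element. A hypothetical migration sketch (not part of graphrag itself) that collapses them into the new scalar `document_id` and refuses to guess when a unit really did span multiple documents:

```python
# Hypothetical migration sketch: collapse one-element document_ids lists into
# the new scalar document_id column.
import pandas as pd

old = pd.DataFrame({
    "id": ["tu-1", "tu-2"],
    "text": ["first chunk", "second chunk"],
    "document_ids": [["doc-a"], ["doc-b"]],
})

if (old["document_ids"].str.len() > 1).any():
    msg = "text units spanning multiple documents have no 1:1 equivalent"
    raise ValueError(msg)

migrated = old.drop(columns=["document_ids"]).assign(
    document_id=old["document_ids"].str[0]
)
print(migrated)
```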

graphrag/data_model/text_unit.py

Lines changed: 4 additions & 4 deletions
@@ -28,8 +28,8 @@ class TextUnit(Identified):
     n_tokens: int | None = None
     """The number of tokens in the text (optional)."""
 
-    document_ids: list[str] | None = None
-    """List of document IDs in which the text unit appears (optional)."""
+    document_id: str | None = None
+    """ID of the document in which the text unit appears (optional)."""
 
     attributes: dict[str, Any] | None = None
     """A dictionary of additional attributes associated with the text unit (optional)."""
@@ -45,7 +45,7 @@ def from_dict(
         relationships_key: str = "relationship_ids",
         covariates_key: str = "covariate_ids",
         n_tokens_key: str = "n_tokens",
-        document_ids_key: str = "document_ids",
+        document_id_key: str = "document_id",
         attributes_key: str = "attributes",
     ) -> "TextUnit":
         """Create a new text unit from the dict data."""
@@ -57,6 +57,6 @@ def from_dict(
             relationship_ids=d.get(relationships_key),
             covariate_ids=d.get(covariates_key),
             n_tokens=d.get(n_tokens_key),
-            document_ids=d.get(document_ids_key),
+            document_id=d.get(document_id_key),
             attributes=d.get(attributes_key),
         )
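
Constructing the updated model directly; a minimal sketch in which every field requirement not visible in the diff above (for example that `short_id` must be supplied) is an assumption:

```python
# Sketch: a TextUnit now references a single source document via document_id.
# Field requirements other than those shown in the diff are assumptions.
from graphrag.data_model.text_unit import TextUnit

unit = TextUnit(
    id="tu-001",
    short_id="1",  # human-readable id; may be optional depending on version
    text="A chunk of source text.",
    document_id="doc-001",
    n_tokens=6,
)
print(unit.document_id)
```

`from_dict` reads the same information from a `document_id` key by default, per the new `document_id_key` parameter shown above.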

graphrag/index/workflows/create_base_text_units.py

Lines changed: 16 additions & 37 deletions
@@ -35,7 +35,6 @@ async def run_workflow(
     output = create_base_text_units(
         documents,
         context.callbacks,
-        chunks.group_by_columns,
         chunks.size,
         chunks.overlap,
         chunks.encoding_model,
@@ -53,7 +52,6 @@
 def create_base_text_units(
     documents: pd.DataFrame,
     callbacks: WorkflowCallbacks,
-    group_by_columns: list[str],
     size: int,
     overlap: int,
     encoding_model: str,
@@ -62,26 +60,9 @@ def create_base_text_units(
     chunk_size_includes_metadata: bool = False,
 ) -> pd.DataFrame:
     """All the steps to transform base text_units."""
-    sort = documents.sort_values(by=["id"], ascending=[True])
+    documents.sort_values(by=["id"], ascending=[True], inplace=True)
 
-    sort["text_with_ids"] = list(
-        zip(*[sort[col] for col in ["id", "text"]], strict=True)
-    )
-
-    agg_dict = {"text_with_ids": list}
-    if "metadata" in documents:
-        agg_dict["metadata"] = "first"  # type: ignore
-
-    aggregated = (
-        (
-            sort.groupby(group_by_columns, sort=False)
-            if len(group_by_columns) > 0
-            else sort.groupby(lambda _x: True)
-        )
-        .agg(agg_dict)
-        .reset_index()
-    )
-    aggregated.rename(columns={"text_with_ids": "texts"}, inplace=True)
+    encode, _ = get_encoding_fn(encoding_model)
 
     def chunker(row: pd.Series) -> Any:
         line_delimiter = ".\n"
@@ -99,15 +80,14 @@ def chunker(row: pd.Series) -> Any:
         )
 
         if chunk_size_includes_metadata:
-            encode, _ = get_encoding_fn(encoding_model)
             metadata_tokens = len(encode(metadata_str))
             if metadata_tokens >= size:
                 message = "Metadata tokens exceeds the maximum tokens per chunk. Please increase the tokens per chunk."
                 raise ValueError(message)
 
         chunked = chunk_text(
             pd.DataFrame([row]).reset_index(drop=True),
-            column="texts",
+            column="text",
             size=size - metadata_tokens,
             overlap=overlap,
             encoding_model=encoding_model,
@@ -128,7 +108,7 @@ def chunker(row: pd.Series) -> Any:
         return row
 
     # Track progress of row-wise apply operation
-    total_rows = len(aggregated)
+    total_rows = len(documents)
     logger.info("Starting chunking process for %d documents", total_rows)
 
     def chunker_with_logging(row: pd.Series, row_index: int) -> Any:
@@ -137,27 +117,26 @@ def chunker_with_logging(row: pd.Series, row_index: int) -> Any:
         logger.info("chunker progress: %d/%d", row_index + 1, total_rows)
         return result
 
-    aggregated = aggregated.apply(
+    text_units = documents.apply(
         lambda row: chunker_with_logging(row, row.name), axis=1
     )
 
-    aggregated = cast("pd.DataFrame", aggregated[[*group_by_columns, "chunks"]])
-    aggregated = aggregated.explode("chunks")
-    aggregated.rename(
+    text_units = cast("pd.DataFrame", text_units[["id", "chunks"]])
+    text_units = text_units.explode("chunks")
+    text_units.rename(
         columns={
-            "chunks": "chunk",
+            "id": "document_id",
+            "chunks": "text",
         },
         inplace=True,
     )
-    aggregated["id"] = aggregated.apply(
-        lambda row: gen_sha512_hash(row, ["chunk"]), axis=1
-    )
-    aggregated[["document_ids", "chunk", "n_tokens"]] = pd.DataFrame(
-        aggregated["chunk"].tolist(), index=aggregated.index
+
+    text_units["id"] = text_units.apply(
+        lambda row: gen_sha512_hash(row, ["text"]), axis=1
     )
-    # rename for downstream consumption
-    aggregated.rename(columns={"chunk": "text"}, inplace=True)
+    # get a final token measurement
+    text_units["n_tokens"] = text_units["text"].apply(lambda x: len(encode(x)))
 
     return cast(
-        "pd.DataFrame", aggregated[aggregated["text"].notna()].reset_index(drop=True)
+        "pd.DataFrame", text_units[text_units["text"].notna()].reset_index(drop=True)
     )