Merged

31 commits
99aea52
Update input factory to match other factories
natoverse Jan 6, 2026
efaaa1f
Move input config alongside input readers
natoverse Jan 6, 2026
2b89384
Move file pattern logic into InputReader
natoverse Jan 6, 2026
c73263d
Set encoding default
natoverse Jan 6, 2026
b265612
Clean up optional column configs
natoverse Jan 6, 2026
f066080
Combine structured data extraction
natoverse Jan 6, 2026
2b83d66
Remove pandas from input loading
natoverse Jan 6, 2026
a03df1b
Throw if empty documents
natoverse Jan 6, 2026
8b45208
Add json lines (jsonl) input support
natoverse Jan 6, 2026
6ac0b58
Store raw data
natoverse Jan 6, 2026
8e3c717
Merge branch 'v3/main' into input-factory
natoverse Jan 6, 2026
fb9a924
Fix merge imports
natoverse Jan 6, 2026
e2395e9
Move metadata handling entirely to chunking
natoverse Jan 7, 2026
36b7be7
Nicer automatic title
natoverse Jan 7, 2026
9d161bd
Typo
natoverse Jan 7, 2026
164c5e1
Add get_property utility for nested dictionary access with dot notation
natoverse Jan 8, 2026
868fde1
Update structured_file_reader to use get_property utility
natoverse Jan 8, 2026
e8e316f
Extract input module into new graphrag-input monorepo package
natoverse Jan 8, 2026
39125b2
Rename ChunkResult to TextChunk and add transformer support
natoverse Jan 8, 2026
2f6d075
Back-compat comment
natoverse Jan 8, 2026
a671aa4
Align input config type name with other factory configs
natoverse Jan 8, 2026
6d5076a
Add MarkItDown support
natoverse Jan 9, 2026
6fbf26c
Remove pattern default from MarkItDown reader
natoverse Jan 9, 2026
e19501d
Remove plugins flag (implicit disabled)
natoverse Jan 9, 2026
6fba8d0
Format
natoverse Jan 9, 2026
c974970
Update verb tests
natoverse Jan 9, 2026
7ce1030
Separate storage from input config
natoverse Jan 10, 2026
e170124
Add empty objects for NaN raw_data
natoverse Jan 10, 2026
ade3a6f
Fix smoke tests
natoverse Jan 12, 2026
89a5223
Fix BOM in csv smoke
natoverse Jan 12, 2026
ad76163
Format
natoverse Jan 12, 2026
4 changes: 2 additions & 2 deletions docs/config/yaml.md

@@ -87,9 +87,9 @@ Our pipeline can ingest .csv, .txt, or .json data from an input location. See th
 - `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
 - `storage_account_blob_url` **str** - (blob only) The storage account blob URL to use.
 - `cosmosdb_account_blob_url` **str** - (cosmosdb only) The CosmosDB account blob URL to use.
-- `file_type` **text|csv|json** - The type of input data to load. Default is `text`
+- `type` **text|csv|json** - The type of input data to load. Default is `text`
 - `encoding` **str** - The encoding of the input file. Default is `utf-8`
-- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$`, `.*\.txt$`, or `.*\.json$` depending on the specified `file_type`, but you can customize it if needed.
+- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$`, `.*\.txt$`, or `.*\.json$` depending on the specified `type`, but you can customize it if needed.
 - `text_column` **str** - (CSV/JSON only) The text column name. If unset we expect a column named `text`.
 - `title_column` **str** - (CSV/JSON only) The title column name, filename will be used if unset.
 - `metadata` **list[str]** - (CSV/JSON only) The additional document attributes fields to keep.

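For reference, a minimal `settings.yaml` sketch using the renamed `type` key (the column names here are illustrative, not defaults):

```yaml
input:
  type: csv
  encoding: utf-8
  file_pattern: ".*\\.csv$"
  text_column: content    # defaults to `text` if unset
  title_column: headline  # filename is used if unset
```
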
5 changes: 2 additions & 3 deletions docs/examples_notebooks/api_overview.ipynb

@@ -28,11 +28,10 @@
 "from pathlib import Path\n",
 "from pprint import pprint\n",
 "\n",
+"import graphrag.api as api\n",
 "import pandas as pd\n",
 "from graphrag.config.load_config import load_config\n",
-"from graphrag.index.typing.pipeline_run_result import PipelineRunResult\n",
-"\n",
-"import graphrag.api as api"
+"from graphrag.index.typing.pipeline_run_result import PipelineRunResult"
 ]
 },
 {

5 changes: 2 additions & 3 deletions docs/examples_notebooks/input_documents.ipynb

@@ -30,11 +30,10 @@
 "from pathlib import Path\n",
 "from pprint import pprint\n",
 "\n",
+"import graphrag.api as api\n",
 "import pandas as pd\n",
 "from graphrag.config.load_config import load_config\n",
-"from graphrag.index.typing.pipeline_run_result import PipelineRunResult\n",
-"\n",
-"import graphrag.api as api"
+"from graphrag.index.typing.pipeline_run_result import PipelineRunResult"
 ]
 },
 {

4 changes: 2 additions & 2 deletions docs/index/inputs.md

@@ -116,7 +116,7 @@ settings.yaml
 
 ```yaml
 input:
-  file_type: text
+  type: text
   metadata: [title]
 
 chunks:

@@ -194,7 +194,7 @@ settings.yaml
 
 ```yaml
 input:
-  file_type: json
+  type: json
   title_column: headline
   text_column: content
 

2 changes: 1 addition & 1 deletion packages/graphrag-cache/graphrag_cache/cache_factory.py

@@ -26,7 +26,7 @@ def register_cache(
     cache_initializer: Callable[..., Cache],
     scope: ServiceScope = "transient",
 ) -> None:
-    """Register a custom storage implementation.
+    """Register a custom cache implementation.
 
     Args
     ----

4 changes: 2 additions & 2 deletions packages/graphrag-cache/pyproject.toml

@@ -23,12 +23,12 @@ authors = [
 license = "MIT"
 readme = "README.md"
 license-files = ["LICENSE"]
-requires-python = ">=3.10,<3.13"
+requires-python = ">=3.11,<3.14"
 classifiers = [
     "Programming Language :: Python :: 3",
-    "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
 ]
 dependencies = [
     "graphrag-common==2.7.0",

19 changes: 0 additions & 19 deletions packages/graphrag-chunking/graphrag_chunking/add_metadata.py

This file was deleted.

17 changes: 0 additions & 17 deletions packages/graphrag-chunking/graphrag_chunking/chunk_result.py

This file was deleted.

7 changes: 5 additions & 2 deletions packages/graphrag-chunking/graphrag_chunking/chunker.py

@@ -4,9 +4,10 @@
 """A module containing the 'Chunker' class."""
 
 from abc import ABC, abstractmethod
+from collections.abc import Callable
 from typing import Any
 
-from graphrag_chunking.chunk_result import ChunkResult
+from graphrag_chunking.text_chunk import TextChunk
 
 
 class Chunker(ABC):

@@ -17,5 +17,8 @@ def __init__(self, **kwargs: Any) -> None:
         """Create a chunker instance."""
 
     @abstractmethod
-    def chunk(self, text: str) -> list[ChunkResult]:
+    def chunk(
+        self, text: str, transform: Callable[[str], str] | None = None
+    ) -> list[TextChunk]:
         """Chunk method definition."""

@@ -30,7 +30,7 @@ class ChunkingConfig(BaseModel):
         description="The chunk overlap to use.",
         default=100,
     )
-    prepend_metadata: bool = Field(
-        description="Prepend metadata into each chunk.",
-        default=False,
+    prepend_metadata: list[str] | None = Field(
+        description="Metadata fields from the source document to prepend on each chunk.",
+        default=None,
     )

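With `prepend_metadata` now a list of source-document fields rather than a boolean, a settings sketch could look like this (assuming the surrounding chunk fields keep their existing names):

```yaml
chunks:
  overlap: 100
  prepend_metadata: [title]  # fields to prepend on each chunk
```
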
@@ -5,26 +5,28 @@
 
 from collections.abc import Callable
 
-from graphrag_chunking.chunk_result import ChunkResult
+from graphrag_chunking.text_chunk import TextChunk
 
 
 def create_chunk_results(
     chunks: list[str],
+    transform: Callable[[str], str] | None = None,
     encode: Callable[[str], list[int]] | None = None,
-) -> list[ChunkResult]:
-    """Create chunk results from a list of text chunks. The index assignments are 0-based and assume chunks we not stripped relative to the source text."""
+) -> list[TextChunk]:
+    """Create chunk results from a list of text chunks. The index assignments are 0-based and assume chunks were not stripped relative to the source text."""
     results = []
     start_char = 0
     for index, chunk in enumerate(chunks):
         end_char = start_char + len(chunk) - 1  # 0-based indices
-        chunk = ChunkResult(
-            text=chunk,
+        result = TextChunk(
+            original=chunk,
+            text=transform(chunk) if transform else chunk,
             index=index,
             start_char=start_char,
             end_char=end_char,
         )
         if encode:
-            chunk.token_count = len(encode(chunk.text))
-        results.append(chunk)
+            result.token_count = len(encode(result.text))
+        results.append(result)
         start_char = end_char + 1
     return results

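A small usage sketch of the revised helper as shown above: the transform feeds `text`, while `original` and the character offsets keep indexing the untransformed source.

```python
from graphrag_chunking.create_chunk_results import create_chunk_results

chunks = create_chunk_results(["Hello ", "world"], transform=str.upper)

assert chunks[0].original == "Hello "
assert chunks[0].text == "HELLO "
# Offsets are 0-based and always refer to the source text:
assert (chunks[1].start_char, chunks[1].end_char) == (6, 10)
assert chunks[0].token_count is None  # only set when an encoder is supplied
```
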
@@ -9,9 +9,9 @@
 import nltk
 
 from graphrag_chunking.bootstrap_nltk import bootstrap
-from graphrag_chunking.chunk_result import ChunkResult
 from graphrag_chunking.chunker import Chunker
 from graphrag_chunking.create_chunk_results import create_chunk_results
+from graphrag_chunking.text_chunk import TextChunk
 
 
 class SentenceChunker(Chunker):

@@ -24,10 +24,14 @@ def __init__(
         self._encode = encode
         bootstrap()
 
-    def chunk(self, text) -> list[ChunkResult]:
+    def chunk(
+        self, text: str, transform: Callable[[str], str] | None = None
+    ) -> list[TextChunk]:
         """Chunk the text into sentence-based chunks."""
         sentences = nltk.sent_tokenize(text.strip())
-        results = create_chunk_results(sentences, encode=self._encode)
+        results = create_chunk_results(
+            sentences, transform=transform, encode=self._encode
+        )
         # nltk sentence tokenizer may trim whitespace, so we need to adjust start/end chars
         for index, result in enumerate(results):
             txt = result.text

29 changes: 29 additions & 0 deletions packages/graphrag-chunking/graphrag_chunking/text_chunk.py

@@ -0,0 +1,29 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""The TextChunk dataclass."""
+
+from dataclasses import dataclass
+
+
+@dataclass
+class TextChunk:
+    """Result of chunking a document."""
+
+    original: str
+    """Raw original text chunk before any transformation."""
+
+    text: str
+    """The final text content of this chunk."""
+
+    index: int
+    """Zero-based index of this chunk within the source document."""
+
+    start_char: int
+    """Character index where the raw chunk text begins in the source document."""
+
+    end_char: int
+    """Character index where the raw chunk text ends in the source document."""
+
+    token_count: int | None = None
+    """Number of tokens in the final chunk text, if computed."""

@@ -6,9 +6,9 @@
 from collections.abc import Callable
 from typing import Any
 
-from graphrag_chunking.chunk_result import ChunkResult
 from graphrag_chunking.chunker import Chunker
 from graphrag_chunking.create_chunk_results import create_chunk_results
+from graphrag_chunking.text_chunk import TextChunk
 
 
 class TokenChunker(Chunker):

@@ -28,7 +28,9 @@ def __init__(
         self._encode = encode
         self._decode = decode
 
-    def chunk(self, text: str) -> list[ChunkResult]:
+    def chunk(
+        self, text: str, transform: Callable[[str], str] | None = None
+    ) -> list[TextChunk]:
         """Chunk the text into token-based chunks."""
         chunks = split_text_on_tokens(
             text,

@@ -37,7 +39,7 @@
             encode=self._encode,
             decode=self._decode,
         )
-        return create_chunk_results(chunks, encode=self._encode)
+        return create_chunk_results(chunks, transform=transform, encode=self._encode)
 
 
 def split_text_on_tokens(

25 changes: 25 additions & 0 deletions packages/graphrag-chunking/graphrag_chunking/transformers.py

@@ -0,0 +1,25 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""A collection of useful built-in transformers you can use for chunking."""
+
+from collections.abc import Callable
+from typing import Any
+
+
+def add_metadata(
+    metadata: dict[str, Any],
+    delimiter: str = ": ",
+    line_delimiter: str = "\n",
+    append: bool = False,
+) -> Callable[[str], str]:
+    """Add metadata to the given text, prepending by default. This utility writes the dict as rows of key/value pairs."""
+
+    def transformer(text: str) -> str:
+        metadata_str = (
+            line_delimiter.join(f"{k}{delimiter}{v}" for k, v in metadata.items())
+            + line_delimiter
+        )
+        return text + metadata_str if append else metadata_str + text
+
+    return transformer

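The returned closure plugs into any `Chunker.chunk` call as its `transform`. A quick sketch with made-up metadata values:

```python
from graphrag_chunking.transformers import add_metadata

transform = add_metadata({"title": "Operation Dulce", "author": "Unknown"})
print(transform("The mission began at dawn."))
# title: Operation Dulce
# author: Unknown
# The mission began at dawn.

# And at chunking time, e.g.:
#   chunker.chunk(doc_text, transform=add_metadata({"title": doc_title}))
```
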
72 changes: 72 additions & 0 deletions packages/graphrag-input/README.md

@@ -0,0 +1,72 @@
+# GraphRAG Inputs
+
+This package provides input document loading utilities for GraphRAG, supporting multiple file formats including CSV, JSON, JSON Lines, and plain text.
+
+## Supported File Types
+
+The following four standard file formats are supported out of the box:
+
+- **CSV** - Tabular data with configurable column mappings
+- **JSON** - JSON files with configurable property paths
+- **JSON Lines** - Line-delimited JSON records
+- **Text** - Plain text files
+
+### MarkItDown Support
+
+Additionally, we support the `InputType.MarkItDown` format, which uses the [MarkItDown](https://github.com/microsoft/markitdown) library to import any supported file type. The MarkItDown converter can handle a wide variety of file formats including Office documents, PDFs, HTML, and more.
+
+**Note:** Additional optional dependencies may need to be installed depending on the file type you're processing. The choice of converter is determined by MarkItDown's processing logic, which primarily uses the file extension to select the appropriate converter. Please refer to the [MarkItDown repository](https://github.com/microsoft/markitdown) for installation instructions and detailed information about supported formats.
+
+## Examples
+
+Basic usage with the factory:
+```python
+from graphrag_input import create_input_reader, InputConfig, InputType
+from graphrag_storage import StorageConfig, create_storage
+
+config = InputConfig(
+    type=InputType.Csv,
+    text_column="content",
+    title_column="title",
+)
+storage = create_storage(StorageConfig(base_dir="./input"))
+reader = create_input_reader(config, storage)
+documents = await reader.read_files()
+```
+
+Import a PDF with MarkItDown:
+
+```bash
+pip install 'markitdown[pdf]' # required dependency for pdf processing
+```
+
+```python
+from graphrag_input import create_input_reader, InputConfig, InputType
+from graphrag_storage import StorageConfig, create_storage
+
+config = InputConfig(
+    type=InputType.MarkitDown,
+    file_pattern=".*\\.pdf$"
+)
+storage = create_storage(StorageConfig(base_dir="./input"))
+reader = create_input_reader(config, storage)
+documents = await reader.read_files()
+```
+
+YAML config example for above:
+```yaml
+input:
+  type: markitdown
+  file_pattern: ".*\\.pdf$$"
+input_storage:
+  type: file
+  base_dir: "input"
+```
+
+Note that when specifying column names for data extraction, we can handle nested objects (e.g., in JSON) with dot notation:
+```python
+from graphrag_input import get_property
+
+data = {"user": {"profile": {"name": "Alice"}}}
+name = get_property(data, "user.profile.name")  # Returns "Alice"
+```

20 changes: 20 additions & 0 deletions packages/graphrag-input/graphrag_input/__init__.py

@@ -0,0 +1,20 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""GraphRAG input document loading package."""
+
+from graphrag_input.get_property import get_property
+from graphrag_input.input_config import InputConfig
+from graphrag_input.input_reader import InputReader
+from graphrag_input.input_reader_factory import create_input_reader
+from graphrag_input.input_type import InputType
+from graphrag_input.text_document import TextDocument
+
+__all__ = [
+    "InputConfig",
+    "InputReader",
+    "InputType",
+    "TextDocument",
+    "create_input_reader",
+    "get_property",
+]

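Putting the new package together with chunking, a rough end-to-end sketch (this assumes an `InputType.Text` member and that `TextDocument` exposes its loaded content as a `text` attribute, neither of which the diff shows directly):

```python
from graphrag_input import InputConfig, InputType, create_input_reader
from graphrag_storage import StorageConfig, create_storage


async def load_and_chunk(chunker):
    config = InputConfig(type=InputType.Text)
    storage = create_storage(StorageConfig(base_dir="./input"))
    reader = create_input_reader(config, storage)
    documents = await reader.read_files()  # throws if no documents are found
    # `doc.text` is an assumed TextDocument attribute
    return [chunker.chunk(doc.text) for doc in documents]
```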