
Commit b001422

Update index API + a notebook that provides a general API overview (#1454)
* update index api to accept callbacks
* fix hardcoded folder name that was creating an empty folder
* add API notebook
* add semversioner file
* filename change

Co-authored-by: Alonso Guevara <[email protected]>
1 parent 10f84c9 commit b001422

File tree

6 files changed (+237, -21 lines)

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+    "type": "patch",
+    "description": "update API and add a demonstration notebook"
+}
Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Copyright (c) 2024 Microsoft Corporation.\n",
+    "# Licensed under the MIT License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API Overview\n",
+    "\n",
+    "This notebook demonstrates how to interact with graphrag as a library using the API, as opposed to the CLI. Note that graphrag's CLI actually connects to the library through this API for all operations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import graphrag.api as api\n",
+    "from graphrag.index.typing import PipelineRunResult"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisite\n",
+    "As a prerequisite to all API operations, a `GraphRagConfig` object is required. It is the primary means to control the behavior of graphrag and can be instantiated from a `settings.yaml` configuration file.\n",
+    "\n",
+    "Please refer to the [CLI docs](https://microsoft.github.io/graphrag/cli/#init) for more detailed information on how to generate the `settings.yaml` file.\n",
+    "\n",
+    "#### Load `settings.yaml` configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import yaml\n",
+    "\n",
+    "settings = yaml.safe_load(open(\"<project_directory>/settings.yaml\"))  # noqa: PTH123, SIM115"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "At this point, you can modify the imported settings to align with your application's requirements. For example, a UI application might need to change the input and/or storage destinations dynamically in order to enable users to build and query different indexes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Generate a `GraphRagConfig` object"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from graphrag.config.create_graphrag_config import create_graphrag_config\n",
+    "\n",
+    "graphrag_config = create_graphrag_config(\n",
+    "    values=settings, root_dir=\"<project_directory>\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Indexing API\n",
+    "\n",
+    "*Indexing* is the process of ingesting raw text data and constructing a knowledge graph. GraphRAG currently supports plaintext (`.txt`) and `.csv` file formats."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Build an index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index_result: list[PipelineRunResult] = await api.build_index(config=graphrag_config)\n",
+    "\n",
+    "# index_result is a list of workflows that make up the indexing pipeline that was run\n",
+    "for workflow_result in index_result:\n",
+    "    status = f\"error\\n{workflow_result.errors}\" if workflow_result.errors else \"success\"\n",
+    "    print(f\"Workflow Name: {workflow_result.workflow}\\tStatus: {status}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Query an index\n",
+    "\n",
+    "To query an index, several index files must first be read into memory and passed to the query API."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "final_nodes = pd.read_parquet(\"<project_directory>/output/create_final_nodes.parquet\")\n",
+    "final_entities = pd.read_parquet(\n",
+    "    \"<project_directory>/output/create_final_entities.parquet\"\n",
+    ")\n",
+    "final_communities = pd.read_parquet(\n",
+    "    \"<project_directory>/output/create_final_communities.parquet\"\n",
+    ")\n",
+    "final_community_reports = pd.read_parquet(\n",
+    "    \"<project_directory>/output/create_final_community_reports.parquet\"\n",
+    ")\n",
+    "\n",
+    "response, context = await api.global_search(\n",
+    "    config=graphrag_config,\n",
+    "    nodes=final_nodes,\n",
+    "    entities=final_entities,\n",
+    "    communities=final_communities,\n",
+    "    community_reports=final_community_reports,\n",
+    "    community_level=2,\n",
+    "    dynamic_community_selection=False,\n",
+    "    response_type=\"Multiple Paragraphs\",\n",
+    "    query=\"Who is Scrooge and what are his main relationships?\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The response object is the official response from graphrag, while the context object holds various metadata regarding the querying process used to obtain the final response."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(response)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Digging into the context a bit more provides users with extremely granular information, such as which sources of data (down to the level of text chunks) were ultimately retrieved and used as part of the context sent to the LLM."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "\n",
+    "pprint(context)  # noqa: T203"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "graphrag-venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.15"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

graphrag/api/index.py

Lines changed: 11 additions & 2 deletions
@@ -10,7 +10,10 @@

 from pathlib import Path

+from datashaper import WorkflowCallbacks
+
 from graphrag.cache.noop_pipeline_cache import NoopPipelineCache
+from graphrag.callbacks.factory import create_pipeline_reporter
 from graphrag.config.enums import CacheType
 from graphrag.config.models.graph_rag_config import GraphRagConfig
 from graphrag.index.create_pipeline_config import create_pipeline_config
@@ -25,6 +28,7 @@ async def build_index(
     run_id: str = "",
     is_resume_run: bool = False,
     memory_profile: bool = False,
+    callbacks: list[WorkflowCallbacks] | None = None,
     progress_reporter: ProgressReporter | None = None,
 ) -> list[PipelineRunResult]:
     """Run the pipeline with the given configuration.
@@ -37,10 +41,10 @@ async def build_index(
         The run id. Creates an output directory with this name.
     is_resume_run : bool default=False
         Whether to resume a previous index run.
-    is_update_run : bool default=False
-        Whether to update a previous index run.
     memory_profile : bool
         Whether to enable memory profiling.
+    callbacks : list[WorkflowCallbacks] | None default=None
+        A list of callbacks to register.
     progress_reporter : ProgressReporter | None default=None
         The progress reporter.

@@ -61,12 +65,17 @@
     pipeline_cache = (
         NoopPipelineCache() if config.cache.type == CacheType.none is None else None
     )
+    # TODO: remove the type ignore once the new config engine has been refactored
+    callbacks = (
+        [create_pipeline_reporter(config.reporting, None)] if config.reporting else None  # type: ignore
+    )  # type: ignore
     outputs: list[PipelineRunResult] = []
     async for output in run_pipeline_with_config(
         pipeline_config,
         run_id=run_id,
         memory_profile=memory_profile,
         cache=pipeline_cache,
+        callbacks=callbacks,
         progress_reporter=progress_reporter,
         is_resume_run=is_resume_run,
         is_update_run=is_update_run,

graphrag/index/run/run.py

Lines changed: 8 additions & 14 deletions
@@ -17,7 +17,6 @@
 from graphrag.cache.factory import create_cache
 from graphrag.cache.pipeline_cache import PipelineCache
 from graphrag.callbacks.console_workflow_callbacks import ConsoleWorkflowCallbacks
-from graphrag.callbacks.factory import create_pipeline_reporter
 from graphrag.index.config.cache import PipelineMemoryCacheConfig
 from graphrag.index.config.pipeline import (
     PipelineConfig,
@@ -67,7 +66,7 @@ async def run_pipeline_with_config(
     storage: PipelineStorage | None = None,
     update_index_storage: PipelineStorage | None = None,
     cache: PipelineCache | None = None,
-    callbacks: WorkflowCallbacks | None = None,
+    callbacks: list[WorkflowCallbacks] | None = None,
     progress_reporter: ProgressReporter | None = None,
     input_post_process_steps: list[PipelineWorkflowStep] | None = None,
     additional_verbs: VerbDefinitions | None = None,
@@ -107,18 +106,14 @@
         storage = storage = create_storage(config.storage)  # type: ignore

     if is_update_run:
+        # TODO: remove the default choice (PipelineFileStorageConfig) once the new config system enforces a correct update-index-storage config when used.
         update_index_storage = update_index_storage or create_storage(
             config.update_index_storage
             or PipelineFileStorageConfig(base_dir=str(Path(root_dir) / "output"))
         )

     # TODO: remove the default choice (PipelineMemoryCacheConfig) when the new config system guarantees the existence of a cache config
     cache = cache or create_cache(config.cache or PipelineMemoryCacheConfig(), root_dir)
-    callbacks = (
-        create_pipeline_reporter(config.reporting, root_dir)
-        if config.reporting
-        else None
-    )
     # TODO: remove the type ignore when the new config system guarantees the existence of an input config
     dataset = (
         dataset
@@ -195,7 +190,7 @@ async def run_pipeline(
     dataset: pd.DataFrame,
     storage: PipelineStorage | None = None,
     cache: PipelineCache | None = None,
-    callbacks: WorkflowCallbacks | None = None,
+    callbacks: list[WorkflowCallbacks] | None = None,
     progress_reporter: ProgressReporter | None = None,
     input_post_process_steps: list[PipelineWorkflowStep] | None = None,
     additional_verbs: VerbDefinitions | None = None,
@@ -226,13 +221,12 @@
     start_time = time.time()

     progress_reporter = progress_reporter or NullProgressReporter()
-    callbacks = callbacks or ConsoleWorkflowCallbacks()
-    callbacks = _create_callback_chain(callbacks, progress_reporter)
-
+    callbacks = callbacks or [ConsoleWorkflowCallbacks()]
+    callback_chain = _create_callback_chain(callbacks, progress_reporter)
     context = create_run_context(storage=storage, cache=cache, stats=None)
     exporter = ParquetExporter(
         context.storage,
-        lambda e, s, d: cast(WorkflowCallbacks, callbacks).on_error(
+        lambda e, s, d: cast(WorkflowCallbacks, callback_chain).on_error(
             "Error exporting table", e, s, d
         ),
     )
@@ -246,7 +240,7 @@
     workflows_to_run = loaded_workflows.workflows
     workflow_dependencies = loaded_workflows.dependencies
     dataset = await _run_post_process_steps(
-        input_post_process_steps, dataset, context, callbacks
+        input_post_process_steps, dataset, context, callback_chain
     )

     # ensure the incoming data is valid
@@ -266,7 +260,7 @@
         result = await _process_workflow(
             workflow_to_run.workflow,
             context,
-            callbacks,
+            callback_chain,
             exporter,
             workflow_dependencies,
             dataset,

graphrag/index/run/workflow.py

Lines changed: 4 additions & 4 deletions
@@ -68,12 +68,12 @@ async def _export_workflow_output(


 def _create_callback_chain(
-    callbacks: WorkflowCallbacks | None, progress: ProgressReporter | None
+    callbacks: list[WorkflowCallbacks] | None, progress: ProgressReporter | None
 ) -> WorkflowCallbacks:
-    """Create a callbacks manager."""
+    """Create a callback manager that encompasses multiple callbacks."""
     manager = WorkflowCallbacksManager()
-    if callbacks is not None:
-        manager.register(callbacks)
+    for callback in callbacks or []:
+        manager.register(callback)
     if progress is not None:
         manager.register(ProgressWorkflowCallbacks(progress))
     return manager

graphrag/storage/memory_pipeline_storage.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ class MemoryPipelineStorage(FilePipelineStorage):

     def __init__(self):
         """Init method definition."""
-        super().__init__(root_dir=".output")
+        super().__init__()
         self._storage = {}

     async def get(
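The one-line fix above stops `MemoryPipelineStorage` from passing a hard-coded `.output` root to its file-storage parent, which had been creating an empty folder on disk. A library-free sketch of the in-memory storage idea, using illustrative method names rather than graphrag's actual `PipelineStorage` interface:

```python
import asyncio


# Minimal in-memory key/value storage: values live in a dict, so no directory
# is ever created on disk. Async methods mirror the shape of a pipeline
# storage API, but the names here are stand-ins.
class MemoryStorage:
    def __init__(self) -> None:
        self._storage: dict[str, object] = {}

    async def set(self, key: str, value: object) -> None:
        self._storage[key] = value

    async def get(self, key: str, default: object = None) -> object:
        return self._storage.get(key, default)


async def demo() -> object:
    storage = MemoryStorage()
    await storage.set("stats.json", '{"workflows": 3}')
    return await storage.get("stats.json")


result = asyncio.run(demo())
print(result)
```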
