Commit f02f5c6

Merge pull request #1 from managedcode/codex/install-dotnet9-and-review-graphrag-project
Add community and covariate records for pipeline parity
2 parents 9a76e2b + 7288f4e commit f02f5c6

86 files changed: 2583 additions, 617 deletions

.editorconfig

Lines changed: 464 additions & 0 deletions
Large diffs are not rendered by default.

AGENTS.md

Lines changed: 4 additions & 1 deletion
@@ -1,9 +1,12 @@
 # Repository Guidelines

-# Rules to follow
+## Rules to follow
 - Always run `dotnet build GraphRag.slnx` (or the relevant project) before executing any `dotnet test` command.
 - Default to the latest available versions (e.g., Apache AGE `latest`) when selecting dependencies, per user request ("you need latest").
 - Do not create or rely on fake database stores (e.g., `FakePostgresGraphStore`); all tests must use real connectors/backing services.
+- Keep default prompts in static C# classes; do not rely on prompt files under `prompts/` for built-in templates.
+- Register language models through Microsoft.Extensions.AI keyed services; avoid bespoke `LanguageModelConfig` providers.
+- Always run `dotnet format GraphRag.slnx` before finishing work.

 # Conversations
 any resulting updates to agents.md should go under the section "## Rules to follow"

Directory.Build.props

Lines changed: 3 additions & 3 deletions
@@ -25,8 +25,8 @@
 <RepositoryUrl>https://github.com/managedcode/graphrag</RepositoryUrl>
 <PackageProjectUrl>https://github.com/managedcode/graphrag</PackageProjectUrl>
 <Product>Managed Code GraphRag</Product>
-<Version>0.0.2</Version>
-<PackageVersion>0.0.2</PackageVersion>
+<Version>0.0.3</Version>
+<PackageVersion>0.0.3</PackageVersion>

 </PropertyGroup>
 <PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">
@@ -42,7 +42,7 @@
 <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
 </PackageReference>
 </ItemGroup>
-
+
 <ItemGroup>
 <PackageReference Update="Microsoft.SourceLink.GitHub" Version="8.0.0" />
 </ItemGroup>

Directory.Packages.props

Lines changed: 2 additions & 1 deletion
@@ -3,6 +3,7 @@
 <PackageVersion Include="coverlet.collector" Version="6.0.4" />
 <PackageVersion Include="Microsoft.Azure.Cosmos" Version="3.54.0" />
 <PackageVersion Include="Microsoft.Extensions.Configuration" Version="9.0.10" />
+<PackageVersion Include="Microsoft.Extensions.Caching.Memory" Version="9.0.10" />
 <PackageVersion Include="Microsoft.Extensions.DependencyInjection" Version="9.0.10" />
 <PackageVersion Include="Microsoft.Extensions.Logging" Version="9.0.10" />
 <PackageVersion Include="Microsoft.Extensions.Logging.Abstractions" Version="9.0.10" />
@@ -20,4 +21,4 @@
 <PackageVersion Include="xunit" Version="2.9.3" />
 <PackageVersion Include="xunit.runner.visualstudio" Version="3.1.5" />
 </ItemGroup>
-</Project>
+</Project>

README.md

Lines changed: 65 additions & 3 deletions
@@ -86,10 +86,72 @@ graphrag/

 ## Integration Testing Strategy

-- **No fakes.** We removed the legacy fake Postgres store. Every graph operation in tests uses real services orchestrated by Testcontainers.
-- **Security coverage.** `Integration/PostgresGraphStoreIntegrationTests.cs` includes payloads that mimic SQL/Cypher injection attempts to ensure values remain literals and labels/types are strictly validated.
-- **Cross-backend validation.** `Integration/GraphStoreIntegrationTests.cs` exercises Postgres, Neo4j, and Cosmos (when available) through the shared `IGraphStore` abstraction.
+- **No fakes.** We removed the legacy fake Postgres store. Every graph operation in tests uses real services orchestrated by Testcontainers.
+- **Security coverage.** `Integration/PostgresGraphStoreIntegrationTests.cs` includes payloads that mimic SQL/Cypher injection attempts to ensure values remain literals and labels/types are strictly validated.
+- **Cross-backend validation.** `Integration/GraphStoreIntegrationTests.cs` exercises Postgres, Neo4j, and Cosmos (when available) through the shared `IGraphStore` abstraction.
 - **Workflow smoke tests.** Pipelines (e.g., `IndexingPipelineRunnerTests`) and finalization steps run end-to-end with the fixture-provisioned infrastructure.
+- **Prompt precedence.** `Integration/CommunitySummariesIntegrationTests.cs` proves manual prompt overrides win over auto-tuned assets while still falling back to auto templates when manual text is absent.
+- **Callback and stats instrumentation.** `Runtime/PipelineExecutorTests.cs` now asserts that pipeline callbacks fire and runtime statistics are captured even when workflows fail early, so custom telemetry remains reliable.
+
+---
+
+## Pipeline Cache
+
+Pipelines exchange state through the `IPipelineCache` abstraction. Every workflow step receives the same cache instance via `PipelineRunContext`, so it can reuse expensive results (LLM calls, chunk expansions, graph lookups) that were produced earlier in the run instead of recomputing them. The cache also keeps optional debug payloads per entry so you can persist trace metadata alongside the main value.
+
+To use the built-in in-memory cache, register it alongside the standard ASP.NET Core services:
+
+```csharp
+using GraphRag.Cache;
+
+builder.Services.AddMemoryCache();
+builder.Services.AddSingleton<IPipelineCache, MemoryPipelineCache>();
+```
+
+Prefer a different backend? Implement `IPipelineCache` yourself and register it through DI—the pipeline will pick up your custom cache automatically.
+
+- **Per-scope isolation.** `MemoryPipelineCache.CreateChild("stage")` scopes keys by prefix (`parent:stage:key`). Calling `ClearAsync` on the parent removes every nested key, so multi-step workflows do not leak data between stages.
+- **Debug traces.** The cache stores optional debug payloads per entry; `DeleteAsync` and `ClearAsync` always clear these traces, preventing the diagnostic dictionary from growing unbounded.
+- **Lifecycle guidance.** Create the root cache once per pipeline run (the default context factory does this for you) and spawn children inside individual workflows when you need an isolated namespace.
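Reviewer sketch (not part of the commit): the prefix scoping described in the bullets above can be illustrated with a plain dictionary. This is a minimal, synchronous approximation under stated assumptions; the `store` dictionary and the `Child`, `Set`, and `Clear` helpers are hypothetical and do not reproduce `MemoryPipelineCache`, whose real methods are asynchronous.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the shared backing store; not the library's MemoryPipelineCache.
var store = new Dictionary<string, object>();

// A child scope is modelled as a longer key prefix: "parent:" -> "parent:stage:".
string Child(string scope, string name) => scope + name + ":";

void Set(string scope, string key, object value) => store[scope + key] = value;

// Clearing a scope removes every key under its prefix, including nested child scopes.
int Clear(string scope)
{
    var doomed = store.Keys.Where(k => k.StartsWith(scope, StringComparison.Ordinal)).ToList();
    foreach (var key in doomed) store.Remove(key);
    return doomed.Count;
}

var parent = "parent:";
var stage = Child(parent, "stage");   // "parent:stage:"
Set(parent, "docs", 3);               // stored as "parent:docs"
Set(stage, "chunks", 42);             // stored as "parent:stage:chunks"
Set("other:", "kept", true);          // unrelated scope, must survive

var removed = Clear(parent);
Console.WriteLine($"{removed} removed, {store.Count} left");   // prints "2 removed, 1 left"
```

Because child keys share the parent's prefix, a single prefix match is enough to clear a whole subtree, which is the property the "Per-scope isolation" bullet relies on.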
+
+---
+
+## Language Model Registration
+
+GraphRAG delegates language-model configuration to [Microsoft.Extensions.AI](https://learn.microsoft.com/dotnet/ai/overview). Register keyed clients for every `ModelId` you reference in configuration—pick any string key that matches your config:
+
+```csharp
+using Azure;
+using Azure.AI.OpenAI;
+using GraphRag.Config;
+using Microsoft.Extensions.AI;
+
+var openAi = new OpenAIClient(new Uri(endpoint), new AzureKeyCredential(key));
+const string chatModelId = "chat_model";
+const string embeddingModelId = "embedding_model";
+
+builder.Services.AddKeyedSingleton<IChatClient>(
+    chatModelId,
+    _ => openAi.GetChatClient(chatDeployment));
+
+builder.Services.AddKeyedSingleton<IEmbeddingGenerator<string, Embedding>>(
+    embeddingModelId,
+    _ => openAi.GetEmbeddingClient(embeddingDeployment));
+```
+
+Rate limits, retries, and other policies should be configured when you create these clients (for example by wrapping them with `Polly` handlers). `GraphRagConfig.Models` simply tracks the set of model keys that have been registered so overrides can validate references.
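Reviewer sketch (not part of the commit): the snippet below shows keyed registration and lookup with `Microsoft.Extensions.DependencyInjection` (version 8 or later) in isolation. `ITextModel` and `EchoModel` are hypothetical stand-ins for `IChatClient` and a real client, so no Azure credentials are needed. As far as I can tell, the factory overload of `AddKeyedSingleton` receives both the service provider and the service key, so the lambda takes two parameters rather than the single `_` used in the README snippet above.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Keyed factories receive the provider and the service key.
services.AddKeyedSingleton<ITextModel>("chat_model", (_, key) => new EchoModel((string)key!));
services.AddKeyedSingleton<ITextModel>("embedding_model", (_, key) => new EchoModel((string)key!));

using var provider = services.BuildServiceProvider();

// A workflow would look the client up by the ModelId found in its configuration.
var chat = provider.GetRequiredKeyedService<ITextModel>("chat_model");
Console.WriteLine(chat.Name);   // prints "chat_model"

// Hypothetical stand-ins for IChatClient and a concrete model client.
public interface ITextModel { string Name { get; } }
public sealed record EchoModel(string Name) : ITextModel;
```

Resolving by key keeps workflow configuration free of provider details: the configuration only names a `ModelId`, and the host decides which client backs it.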
+
+---
+
+## Indexing, Querying, and Prompt Tuning Alignment
+
+The .NET port mirrors the [GraphRAG indexing architecture](https://microsoft.github.io/graphrag/index/overview/) and its query workflows so downstream applications retain parity with the Python reference implementation.
+
+- **Indexing overview.** Workflows such as `extract_graph`, `create_communities`, and `community_summaries` map 1:1 to the [default data flow](https://microsoft.github.io/graphrag/index/default_dataflow/) and persist the same tables (`text_units`, `entities`, `relationships`, `communities`, `community_reports`, `covariates`). The new prompt template loader honours manual or auto-tuned prompts before falling back to the stock templates in `prompts/`.
+- **Query capabilities.** The query pipeline retains global search, local search, drift search, and question generation semantics described in the [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/). Each orchestrator continues to assemble context from the indexed tables so you can reference [global](https://microsoft.github.io/graphrag/query/global_search/) or [local](https://microsoft.github.io/graphrag/query/local_search/) narratives interchangeably.
+- **Prompt tuning.** GraphRAG’s [manual](https://microsoft.github.io/graphrag/prompt_tuning/manual_prompt_tuning/) and [auto](https://microsoft.github.io/graphrag/prompt_tuning/auto_prompt_tuning/) strategies are surfaced through `GraphRagConfig.PromptTuning`. Store custom templates under `prompts/` or point `PromptTuning.Manual.Directory`/`PromptTuning.Auto.Directory` at your tuning outputs. You can also skip files entirely by assigning inline text (multi-line or prefixed with `inline:`) to workflow prompt properties. Stage keys and placeholders are documented in `docs/indexing-and-query.md`.
+
+See [`docs/indexing-and-query.md`](docs/indexing-and-query.md) for a deeper mapping between the .NET workflows and the research publications underpinning GraphRAG.

 ---

docs/indexing-and-query.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
+# Indexing, Querying, and Prompt Tuning in GraphRAG for .NET
+
+GraphRAG for .NET keeps feature parity with the Python reference project described in the [Microsoft Research blog](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/) and the [GraphRAG paper](https://arxiv.org/pdf/2404.16130). This document explains how the .NET workflows map to the concepts documented on [microsoft.github.io/graphrag](https://microsoft.github.io/graphrag/), highlights the supported query modes, and shows how to customise prompts via manual or auto tuning outputs.
+
+## Indexing Architecture
+
+- **Workflow parity.** Each indexing stage matches the Python pipeline and the [default data flow](https://microsoft.github.io/graphrag/index/default_dataflow/):
+  - `load_input_documents` → `create_base_text_units` → `summarize_descriptions`
+  - `extract_graph` persists `entities` and `relationships`
+  - `create_communities` produces `communities`
+  - `community_summaries` writes `community_reports`
+  - `extract_covariates` stores `covariates`
+- **Storage schema.** Tables share the column layout described under [index outputs](https://microsoft.github.io/graphrag/index/outputs/). The new strongly-typed records (`CommunityRecord`, `CovariateRecord`, etc.) mirror the JSON representation used by the Python implementation.
+- **Cluster configuration.** `GraphRagConfig.ClusterGraph` exposes the same knobs as the Python `cluster_graph` settings, enabling largest-component filtering and deterministic seeding.
+
+## Language Model Registration
+
+Workflows resolve language models from the DI container via [Microsoft.Extensions.AI](https://learn.microsoft.com/dotnet/ai/overview). Register keyed services for every `ModelId` you plan to reference:
+
+```csharp
+using Azure;
+using Azure.AI.OpenAI;
+using GraphRag.Config;
+using Microsoft.Extensions.AI;
+
+var openAi = new OpenAIClient(new Uri(endpoint), new AzureKeyCredential(key));
+const string chatModelId = "chat_model";
+const string embeddingModelId = "embedding_model";
+
+services.AddKeyedSingleton<IChatClient>(chatModelId, _ => openAi.GetChatClient(chatDeployment));
+services.AddKeyedSingleton<IEmbeddingGenerator<string, Embedding>>(embeddingModelId, _ => openAi.GetEmbeddingClient(embeddingDeployment));
+```
+
+Configure retries, rate limits, and logging when you construct the concrete clients. `GraphRagConfig.Models` simply records the set of registered keys so configuration overrides can validate references.
+
+## Pipeline Cache
+
+`IPipelineCache` is intentionally infrastructure-neutral. To mirror ASP.NET Core's in-memory behaviour, register the built-in cache services alongside the provided adapter:
+
+```csharp
+services.AddMemoryCache();
+services.AddSingleton<IPipelineCache, MemoryPipelineCache>();
+```
+
+Need Redis or something else? Implement `IPipelineCache` yourself and register it through DI; the pipeline will automatically consume your custom cache.
+
+## Query Capabilities
+
+The query layer ports the orchestrators documented in the [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/):
+
+- **Global search** ([docs](https://microsoft.github.io/graphrag/query/global_search/)) traverses community summaries and graph context to craft answers spanning the corpus.
+- **Local search** ([docs](https://microsoft.github.io/graphrag/query/local_search/)) anchors on a document neighbourhood when you need focused context.
+- **Drift search** ([docs](https://microsoft.github.io/graphrag/query/drift_search/)) monitors narrative changes across time slices.
+- **Question generation** ([docs](https://microsoft.github.io/graphrag/query/question_generation/)) produces follow-up questions to extend an investigation.
+
+Every orchestrator consumes the same indexed tables as the Python project, so the .NET stack interoperates with BYOG scenarios described in the [index architecture guide](https://microsoft.github.io/graphrag/index/architecture/).
+
+## Prompt Tuning
+
+Manual and auto prompt tuning are both available without code changes:
+
+1. **Manual overrides** follow the rules from [manual prompt tuning](https://microsoft.github.io/graphrag/prompt_tuning/manual_prompt_tuning/).
+   - Place custom templates under a directory referenced by `GraphRagConfig.PromptTuning.Manual.Directory` and set `Enabled = true`.
+   - Filenames follow the stage key pattern `section/workflow/kind.txt` (see table below).
+2. **Auto tuning** integrates the outputs documented in [auto prompt tuning](https://microsoft.github.io/graphrag/prompt_tuning/auto_prompt_tuning/).
+   - Point `GraphRagConfig.PromptTuning.Auto.Directory` at the folder containing the generated prompts and set `Enabled = true`.
+   - The runtime prefers explicit paths from workflow configs, then manual overrides, then auto-tuned files, and finally the built-in defaults in `prompts/`.
+3. **Inline overrides** can be injected directly from code: set `ExtractGraphConfig.SystemPrompt`, `ExtractGraphConfig.Prompt`, or the equivalent properties to either a multi-line string or a value prefixed with `inline:`. Inline values bypass template file lookups and are used as-is.
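Reviewer sketch (not part of the commit): the lookup order described in the list above (inline or explicit value, then manual directory, then auto directory, then built-in default) can be approximated as follows. `ResolvePrompt` and its parameters are hypothetical and are not the repository's loader. Stripping the `inline:` prefix before use is also an assumption.

```csharp
using System;
using System.IO;

// Hypothetical resolver; the loader in the repository may differ in names and details.
string ResolvePrompt(string? configured, string? manualDir, string? autoDir, string stageKey, string builtIn)
{
    if (!string.IsNullOrWhiteSpace(configured))
    {
        // Inline values: an "inline:" prefix (assumed to be stripped) or any multi-line string.
        if (configured.StartsWith("inline:", StringComparison.Ordinal))
            return configured["inline:".Length..];
        if (configured.Contains('\n'))
            return configured;
        // Otherwise treat the value as an explicit file path.
        if (File.Exists(configured))
            return File.ReadAllText(configured);
    }

    // Manual overrides win over auto-tuned files.
    foreach (var folder in new[] { manualDir, autoDir })
    {
        if (folder is null) continue;
        var candidate = Path.Combine(folder, stageKey);
        if (File.Exists(candidate))
            return File.ReadAllText(candidate);
    }

    return builtIn;
}

// Usage: create throwaway manual and auto directories that both contain the same stage key.
var root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
var manual = Path.Combine(root, "manual");
var auto = Path.Combine(root, "auto");
const string key = "index/extract_graph/system.txt";
foreach (var (folder, text) in new[] { (manual, "manual text"), (auto, "auto text") })
{
    var path = Path.Combine(folder, key);
    Directory.CreateDirectory(Path.GetDirectoryName(path)!);
    File.WriteAllText(path, text);
}

var fromManual = ResolvePrompt(null, manual, auto, key, "built-in");               // "manual text"
var fromAuto = ResolvePrompt(null, null, auto, key, "built-in");                   // "auto text"
var fromInline = ResolvePrompt("inline:Be brief.", manual, auto, key, "built-in"); // "Be brief."
var fromDefault = ResolvePrompt(null, null, null, key, "built-in");                // "built-in"
Console.WriteLine(string.Join(" | ", fromManual, fromAuto, fromInline, fromDefault));

Directory.Delete(root, recursive: true);
```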
+
+### Stage Keys and Placeholders
+
+| Workflow | Stage key | Purpose | Supported placeholders |
+|----------|-----------|---------|------------------------|
+| `extract_graph` (system) | `index/extract_graph/system.txt` | System prompt that instructs the extractor. | _N/A_ |
+| `extract_graph` (user) | `index/extract_graph/user.txt` | User prompt template for individual text units. | `{{max_entities}}`, `{{text}}` |
+| `community_summaries` (system) | `index/community_reports/system.txt` | System guidance for cluster summarisation. | _N/A_ |
+| `community_summaries` (user) | `index/community_reports/user.txt` | User prompt template for entity lists. | `{{max_length}}`, `{{entities}}` |
+
+Placeholders are replaced at runtime with values drawn from workflow configuration:
+
+- `{{max_entities}}` → `ExtractGraphConfig.EntityTypes.Count + 5` (minimum 1)
+- `{{text}}` → the original text unit content
+- `{{max_length}}` → `CommunityReportsConfig.MaxLength`
+- `{{entities}}` → bullet list of entity titles and descriptions
+
+If a template is omitted, the runtime falls back to the built-in prompts defined in `GraphRagPromptLibrary`.
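Reviewer sketch (not part of the commit): placeholder substitution as described above can be pictured as plain token replacement. `Render` is a hypothetical helper, the template is shortened from the `extract_graph` user prompt, and the entity types and text are made-up sample values. The `Math.Max(1, count + 5)` line follows the `{{max_entities}}` rule from the list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: replaces each {{name}} token with its value via plain string replacement.
string Render(string template, IReadOnlyDictionary<string, string> values)
{
    foreach (var (name, value) in values)
        template = template.Replace("{{" + name + "}}", value, StringComparison.Ordinal);
    return template;
}

// {{max_entities}} is derived from the configured entity types: count + 5, never below 1.
var entityTypes = new[] { "person", "organization", "location" };
var maxEntities = Math.Max(1, entityTypes.Length + 5);   // 3 + 5 = 8

var prompt = Render(
    "Extract up to {{max_entities}} of the most important entities.\n{{text}}",
    new Dictionary<string, string>
    {
        ["max_entities"] = maxEntities.ToString(),
        ["text"] = "Ada Lovelace worked with Charles Babbage.",
    });

Console.WriteLine(prompt);
```

Sequential replacement is the simplest option, but a substituted value that itself contains `{{...}}` could be expanded by a later pass. A single-pass regular-expression replacement would avoid that if text units may contain such tokens.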
+
+## Integration Tests
+
+`tests/ManagedCode.GraphRag.Tests/Integration/CommunitySummariesIntegrationTests.cs` exercises the new prompt loader end-to-end using the file-backed pipeline storage. Combined with the existing Aspire-powered suites, the tests demonstrate how indexing, community detection, and summarisation behave with tuned prompts while remaining faithful to the [GraphRAG BYOG guidance](https://microsoft.github.io/graphrag/index/byog/).
+
+## Further Reading
+
+- [GraphRAG prompt tuning overview](https://microsoft.github.io/graphrag/prompt_tuning/overview/)
+- [GraphRAG index methods](https://microsoft.github.io/graphrag/index/methods/)
+- [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/)
+- [GraphRAG default dataflow](https://microsoft.github.io/graphrag/index/default_dataflow/)
+
+These resources underpin the .NET implementation and provide broader context for customising or extending the library.

prompts/community_graph.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+You are an investigative analyst. Produce concise, neutral summaries that describe the shared theme binding the supplied entities.
+Highlight how they relate, why the cluster matters, and any notable signals the reader should know. Do not invent facts.

prompts/community_text.txt

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+Summarise the key theme that connects the following entities in no more than {{max_length}} characters. Focus on what unites them and why the group matters. Avoid bullet lists.
+
+Entities:
+{{entities}}
+
+Provide a single paragraph answer.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+You are a precise information extraction engine. Analyse the supplied text and return structured JSON describing:
+- distinct entities (people, organisations, locations, products, events, concepts, technologies, dates, other)
+- relationships between those entities
+
+Rules:
+- Only use information explicitly stated or implied in the text.
+- Prefer short, human-readable titles.
+- Use snake_case relationship types (e.g., "works_with", "located_in").
+- Always return valid JSON adhering to the response schema.
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+Extract up to {{max_entities}} of the most important entities and their relationships from the following text.
+
+Text (between <BEGIN_TEXT> and <END_TEXT> markers):
+<BEGIN_TEXT>
+{{text}}
+<END_TEXT>
+
+Respond with JSON matching this schema:
+{
+  "entities": [
+    {
+      "title": "string",
+      "type": "person | organization | location | product | event | concept | technology | date | other",
+      "description": "short description",
+      "confidence": 0.0 - 1.0
+    }
+  ],
+  "relationships": [
+    {
+      "source": "entity title",
+      "target": "entity title",
+      "type": "relationship_type",
+      "description": "short description",
+      "weight": 0.0 - 1.0,
+      "bidirectional": true | false
+    }
+  ]
+}
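Reviewer sketch (not part of the commit): a response that follows the schema above could be read with `System.Text.Json` roughly as shown here. The sample payload is made up. Dropping relationships whose endpoints are missing from the entity list is one possible policy, assumed here for illustration, and is not necessarily what the `extract_graph` workflow does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Made-up sample response that follows the schema in the prompt above.
const string response = """
{
  "entities": [
    { "title": "Ada Lovelace", "type": "person", "description": "Mathematician", "confidence": 0.9 },
    { "title": "Charles Babbage", "type": "person", "description": "Engineer", "confidence": 0.8 }
  ],
  "relationships": [
    { "source": "Ada Lovelace", "target": "Charles Babbage", "type": "works_with",
      "description": "Collaborated on the Analytical Engine", "weight": 0.7, "bidirectional": true },
    { "source": "Ada Lovelace", "target": "Unknown Person", "type": "knows",
      "description": "Endpoint missing from entities", "weight": 0.2, "bidirectional": false }
  ]
}
""";

using var doc = JsonDocument.Parse(response);
var root = doc.RootElement;

// Collect entity titles so relationship endpoints can be checked against them.
var titles = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (var entity in root.GetProperty("entities").EnumerateArray())
    titles.Add(entity.GetProperty("title").GetString() ?? "");

// Assumed policy: keep only relationships whose source and target both name an extracted entity.
var kept = new List<string>();
foreach (var rel in root.GetProperty("relationships").EnumerateArray())
{
    var source = rel.GetProperty("source").GetString() ?? "";
    var target = rel.GetProperty("target").GetString() ?? "";
    var type = rel.GetProperty("type").GetString() ?? "";
    if (titles.Contains(source) && titles.Contains(target))
        kept.Add($"{source} -[{type}]-> {target}");
}

Console.WriteLine(string.Join(Environment.NewLine, kept));   // prints "Ada Lovelace -[works_with]-> Charles Babbage"
```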
