Streaming create communities by dayesouza · Pull Request #2237 · microsoft/graphrag

dayesouza · 2026-02-19T14:55:57Z

Streaming Performance Improvement (with .txt):

run time: -15.01%
peak memory: -16.87%
memory delta: -58.26%

This pull request introduces a patch for the create_communities workflow, focusing on improving streaming, memory efficiency, and code clarity. The changes optimize how entities and relationships are processed and written, refactor the clustering logic for better performance, and update documentation and supporting files accordingly.

Workflow and Data Processing Improvements:

Refactored the create_communities workflow to stream entity and community data, reducing memory usage by avoiding loading all entities at once and writing community rows incrementally instead of as a full DataFrame. The function now returns a sample of rows instead of the full output for easier inspection.
Changed the create_communities function to accept a title_to_entity_id mapping instead of a full DataFrame, and updated its docstring for clarity. Entity and relationship aggregation logic was rewritten to be more efficient and to handle intra-community edges at each hierarchy level separately, improving performance and maintainability.
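
The streaming shape described above can be illustrated with a minimal sketch. It is not the code from this PR: the helper names (`build_title_to_entity_id`, `iter_community_rows`, `write_rows`), the row fields, and the JSON-lines sink are all assumptions made for illustration. It shows three ideas from the description: keeping only a title-to-id lookup rather than the full entity table, collecting intra-community edges separately for each hierarchy level, and writing rows one at a time while returning only a small sample.

```python
import io
import json
from collections import defaultdict
from collections.abc import Iterable, Iterator
from typing import TextIO


def build_title_to_entity_id(entity_rows: Iterable[dict]) -> dict[str, str]:
    """Consume entity rows once and keep only the title -> id lookup."""
    return {row["title"]: row["id"] for row in entity_rows}


def iter_community_rows(
    clusters: list[tuple[int, int, list[str]]],  # (level, community, titles); hypothetical shape
    relationships: list[dict],
    title_to_entity_id: dict[str, str],
) -> Iterator[dict]:
    """Yield one community row at a time, handling each hierarchy level on its own."""
    by_level: defaultdict[int, list[tuple[int, list[str]]]] = defaultdict(list)
    for level, community, titles in clusters:
        by_level[level].append((community, titles))

    for level in sorted(by_level):
        # Membership lookup for this level only.
        title_to_community = {
            title: community
            for community, titles in by_level[level]
            for title in titles
        }
        # An edge is intra-community when both endpoints share a community.
        relationship_ids: defaultdict[int, list[str]] = defaultdict(list)
        for rel in relationships:
            source_community = title_to_community.get(rel["source"])
            if (
                source_community is not None
                and source_community == title_to_community.get(rel["target"])
            ):
                relationship_ids[source_community].append(rel["id"])

        for community, titles in by_level[level]:
            yield {
                "level": level,
                "community": community,
                "entity_ids": [
                    title_to_entity_id[t] for t in titles if t in title_to_entity_id
                ],
                "relationship_ids": relationship_ids[community],
            }


def write_rows(rows: Iterable[dict], sink: TextIO, sample_size: int = 5) -> list[dict]:
    """Write rows incrementally and return only the first few for inspection."""
    sample: list[dict] = []
    for row in rows:
        sink.write(json.dumps(row) + "\n")
        if len(sample) < sample_size:
            sample.append(row)
    return sample


# Usage with made-up data; an in-memory buffer stands in for the real table writer.
entities = [
    {"id": "e1", "title": "A"},
    {"id": "e2", "title": "B"},
    {"id": "e3", "title": "C"},
]
relationships = [
    {"id": "r1", "source": "A", "target": "B"},
    {"id": "r2", "source": "B", "target": "C"},
]
clusters = [(0, 0, ["A", "B", "C"]), (1, 1, ["A", "B"]), (1, 2, ["C"])]

sink = io.StringIO()
lookup = build_title_to_entity_id(entities)
sample = write_rows(iter_community_rows(clusters, relationships, lookup), sink, sample_size=2)
```

In this sketch the relationships are still held in a list because they are scanned once per level; only the entity table and the output rows are streamed. Whether the PR treats relationships the same way is not stated in the description.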

Clustering Logic and DataFrame Handling:

Updated cluster_graph.py to use defaultdict(list) for cluster aggregation, simplifying the logic for grouping node IDs by community.
Improved DataFrame operations in _compute_leiden_communities by normalizing edge directions, deduplicating in place, and constructing the edge list with vectorized operations.
Added missing import for defaultdict to support the above refactor.
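
The two clustering changes above can be sketched as follows. This is an illustration under stated assumptions, not the contents of cluster_graph.py: the function names and input shapes are hypothetical, and the edge normalization is written in plain Python over tuples to stay dependency-free, whereas the PR describes doing it with DataFrame operations.

```python
from collections import defaultdict


def group_nodes_by_community(node_to_community: dict[str, int]) -> dict[int, list[str]]:
    """Group node IDs by community without checking for missing keys first."""
    clusters: defaultdict[int, list[str]] = defaultdict(list)
    for node, community in node_to_community.items():
        clusters[community].append(node)
    return dict(clusters)


def normalize_edges(
    edges: list[tuple[str, str, float]],
) -> list[tuple[str, str, float]]:
    """Treat edges as undirected: order the endpoints, then drop repeats.

    The first occurrence of each undirected pair wins; which duplicate the
    real implementation keeps is an assumption here.
    """
    seen: set[tuple[str, str]] = set()
    unique: list[tuple[str, str, float]] = []
    for source, target, weight in edges:
        key = (source, target) if source <= target else (target, source)
        if key not in seen:
            seen.add(key)
            unique.append((key[0], key[1], weight))
    return unique
```

Ordering the endpoints first means that an edge stored as (b, a) and one stored as (a, b) collapse to the same key, so deduplication only needs to compare normalized pairs.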

Documentation and Release Process:

Removed the RELEASE.md file, which contained detailed instructions for the release process, as it is no longer needed or is being replaced elsewhere.

Metadata:

Added a semversioner patch file to document the streaming improvements in the create_communities workflow.

dayesouza added 5 commits February 13, 2026 19:44

add manual release instructions

6dc5482

Merge remote-tracking branch 'origin/main' into create-communities

6f606c0

create streaming

cabe541

fix deleted file

76a72f6

addd file

82c5111

dayesouza requested a review from a team as a code owner February 19, 2026 14:55

dayesouza added 3 commits February 19, 2026 12:01

fix check

360c9c5

add consistency

17fbdc7

fix logic

b18ddbc

natoverse approved these changes Feb 19, 2026

dayesouza merged commit ac7ce32 into main Feb 19, 2026
18 checks passed

dayesouza deleted the create-communities branch February 19, 2026 18:40

dayesouza merged 8 commits into main from create-communities
