Streaming create communities #2237

Merged
dayesouza merged 8 commits into main from create-communities
Feb 19, 2026

@dayesouza
Contributor

Streaming Performance Improvement (with .txt):

  • run time: -15.01%
  • peak memory: -16.87%
  • memory delta: -58.26%

This pull request introduces a patch for the create_communities workflow, focusing on improving streaming, memory efficiency, and code clarity. The changes optimize how entities and relationships are processed and written, refactor the clustering logic for better performance, and update documentation and supporting files accordingly.

Workflow and Data Processing Improvements:

  • Refactored the create_communities workflow to stream entity and community data, reducing memory usage by avoiding loading all entities at once and writing community rows incrementally instead of as a full DataFrame. The function now returns a sample of rows instead of the full output for easier inspection.
  • Changed the create_communities function to accept a title_to_entity_id mapping instead of a full DataFrame, and updated its docstring for clarity. Entity and relationship aggregation logic was rewritten to be more efficient and to handle intra-community edges at each hierarchy level separately, improving performance and maintainability.
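The streaming pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names iter_community_rows and write_rows, the tuple shapes for clusters and edges, and the JSON-lines sink are all hypothetical. It only shows the three ideas together: passing a title_to_entity_id mapping instead of a full entity table, keeping only intra-community edges for each community at its level, and writing rows one at a time while returning a small sample.

```python
import io
import json
from collections.abc import Iterable, Iterator

# Hypothetical shapes, chosen for illustration only:
#   cluster = (level, community_id, member_titles)
#   edge    = (source_title, target_title, relationship_id)
Cluster = tuple[int, int, list[str]]
Edge = tuple[str, str, str]


def iter_community_rows(
    clusters: Iterable[Cluster],
    title_to_entity_id: dict[str, str],
    edges: list[Edge],
) -> Iterator[dict]:
    """Yield one community row at a time instead of building a full table."""
    for level, community_id, titles in clusters:
        members = set(titles)
        yield {
            "level": level,
            "community": community_id,
            # Only the title -> id mapping is needed, not the entity table.
            "entity_ids": [
                title_to_entity_id[t] for t in titles if t in title_to_entity_id
            ],
            # Keep only edges with both endpoints inside this community.
            # (Naive linear scan; fine for a sketch.)
            "relationship_ids": [
                rid for src, dst, rid in edges if src in members and dst in members
            ],
        }


def write_rows(
    rows: Iterable[dict], sink: io.TextIOBase, sample_size: int = 2
) -> list[dict]:
    """Write rows to `sink` as JSON lines and return only the first few."""
    sample: list[dict] = []
    for row in rows:
        sink.write(json.dumps(row) + "\n")  # written incrementally
        if len(sample) < sample_size:
            sample.append(row)
    return sample


title_to_entity_id = {"A": "e1", "B": "e2", "C": "e3"}
edges = [("A", "B", "r1"), ("B", "C", "r2")]
clusters = [(0, 0, ["A", "B", "C"]), (1, 1, ["A", "B"]), (1, 2, ["C"])]

sink = io.StringIO()
sample = write_rows(iter_community_rows(clusters, title_to_entity_id, edges), sink)
```

Because the rows come from a generator, at most one row plus the bounded sample is held in memory at a time; the sink receives every row, while the caller only sees the sample.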

Clustering Logic and DataFrame Handling:

  • Updated cluster_graph.py to use defaultdict(list) for cluster aggregation, simplifying the logic for grouping node IDs by community.
  • Improved DataFrame operations in _compute_leiden_communities by normalizing edge directions and using in-place deduplication, and by constructing the edge list using efficient vectorized operations.
  • Added missing import for defaultdict to support the above refactor.
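As a rough illustration of the two clustering changes above, the sketch below groups node IDs by community with defaultdict(list) and normalizes undirected edges before deduplicating them. It is plain Python rather than the vectorized DataFrame code in cluster_graph.py, and the function names, input shapes, and the keep-first deduplication policy are assumptions made for the example.

```python
from collections import defaultdict


def group_nodes_by_cluster(node_to_community: dict[str, int]) -> dict[int, list[str]]:
    """Group node IDs by community ID without checking for missing keys."""
    clusters: defaultdict[int, list[str]] = defaultdict(list)
    for node, community in node_to_community.items():
        # defaultdict creates the empty list on first access.
        clusters[community].append(node)
    return dict(clusters)


def normalize_edges(
    edges: list[tuple[str, str, float]],
) -> list[tuple[str, str, float]]:
    """Order each edge's endpoints, then drop repeated undirected edges.

    Assumption: the first occurrence of an edge wins.
    """
    seen: set[tuple[str, str]] = set()
    result: list[tuple[str, str, float]] = []
    for src, dst, weight in edges:
        # (b, a) and (a, b) describe the same undirected edge.
        lo, hi = (src, dst) if src <= dst else (dst, src)
        if (lo, hi) not in seen:
            seen.add((lo, hi))
            result.append((lo, hi, weight))
    return result


grouped = group_nodes_by_cluster({"a": 0, "b": 1, "c": 0})
deduped = normalize_edges([("b", "a", 1.0), ("a", "b", 2.0), ("c", "a", 0.5)])
```

Normalizing direction first is what makes the deduplication meaningful: without it, ("b", "a") and ("a", "b") would be treated as two distinct edges.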

Documentation and Release Process:

  • Removed the RELEASE.md file, which contained detailed instructions for the release process, as it is no longer needed or is being replaced elsewhere.

Metadata:

  • Added a semversioner patch file to document the streaming improvements in the create_communities workflow.

@dayesouza dayesouza requested a review from a team as a code owner February 19, 2026 14:55
@dayesouza dayesouza merged commit ac7ce32 into main Feb 19, 2026
18 checks passed
@dayesouza dayesouza deleted the create-communities branch February 19, 2026 18:40
2 participants