natoverse approved these changes on Feb 19, 2026
Streaming Performance Improvement (with .txt):
This pull request introduces a patch for the `create_communities` workflow, focusing on improving streaming, memory efficiency, and code clarity. The changes optimize how entities and relationships are processed and written, refactor the clustering logic for better performance, and update documentation and supporting files accordingly.

Workflow and Data Processing Improvements:
- Updated the `create_communities` workflow to stream entity and community data. This reduces memory usage by avoiding loading all entities at once and by writing community rows incrementally instead of as a full DataFrame. The function now returns a sample of rows instead of the full output for easier inspection.
- Changed the `create_communities` function to accept a `title_to_entity_id` mapping instead of a full DataFrame, and updated its docstring for clarity. Entity and relationship aggregation logic was rewritten to be more efficient and to handle intra-community edges at each hierarchy level separately, improving performance and maintainability.

Clustering Logic and DataFrame Handling:
- Updated `cluster_graph.py` to use `defaultdict(list)` for cluster aggregation, simplifying the logic for grouping node IDs by community.
- Improved `_compute_leiden_communities` by normalizing edge directions, using in-place deduplication, and constructing the edge list with vectorized operations.
- Imported `defaultdict` to support the above refactor.

Documentation and Release Process:
- Removed the `RELEASE.md` file, which contained detailed instructions for the release process, as it is no longer needed or is being replaced elsewhere.

Metadata:
- Recorded the change as a patch to the `create_communities` workflow.
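
For reviewers who want a concrete picture of the ideas in this description, here is a minimal, standard-library-only sketch. It is not the code from this PR. Every function name below (`normalize_edges`, `group_nodes_by_community`, `community_rows`, `write_community_rows`) and the CSV column layout are hypothetical, and plain Python loops stand in for the vectorized DataFrame operations the description mentions. It illustrates three things: ordering each edge so that `(a, b)` and `(b, a)` deduplicate to one entry, grouping node IDs per community with `defaultdict(list)`, and writing community rows one at a time while returning only a small sample.

```python
import csv
import io
from collections import defaultdict
from typing import Iterable, Iterator, TextIO


def normalize_edges(edges: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Treat edges as undirected: order each pair, then drop duplicates."""
    seen: set[tuple[str, str]] = set()
    unique: list[tuple[str, str]] = []
    for source, target in edges:
        # Put the smaller endpoint first so both directions compare equal.
        edge = (source, target) if source <= target else (target, source)
        if edge not in seen:
            seen.add(edge)
            unique.append(edge)
    return unique


def group_nodes_by_community(
    assignments: Iterable[tuple[str, int]],
) -> dict[int, list[str]]:
    """Collect node titles per community ID without pre-creating keys."""
    clusters = defaultdict(list)
    for node, community in assignments:
        clusters[community].append(node)
    return dict(clusters)


def community_rows(
    clusters: dict[int, list[str]],
    title_to_entity_id: dict[str, str],
) -> Iterator[dict]:
    """Lazily yield one output row per community (hypothetical row shape)."""
    for community, titles in sorted(clusters.items()):
        entity_ids = [title_to_entity_id[title] for title in titles]
        yield {"community": community, "entity_ids": ";".join(entity_ids)}


def write_community_rows(
    rows: Iterable[dict], out: TextIO, sample_size: int = 2
) -> list[dict]:
    """Write rows incrementally and keep only the first few as a sample."""
    writer = csv.DictWriter(out, fieldnames=["community", "entity_ids"])
    writer.writeheader()
    sample: list[dict] = []
    for row in rows:
        writer.writerow(row)  # each row is written as soon as it is produced
        if len(sample) < sample_size:
            sample.append(row)
    return sample


if __name__ == "__main__":
    edges = normalize_edges([("b", "a"), ("a", "b"), ("a", "c")])
    clusters = group_nodes_by_community([("a", 0), ("b", 0), ("c", 1)])
    buffer = io.StringIO()
    rows = community_rows(clusters, {"a": "e1", "b": "e2", "c": "e3"})
    sample = write_community_rows(rows, buffer, sample_size=1)
    print(edges, sample, buffer.getvalue(), sep="\n")
```

Because `community_rows` is a generator, the full set of rows never needs to be held in memory at once. Only the sample returned by `write_community_rows` is retained, which mirrors the "return a sample instead of the full output" behavior described above.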