Skip to content

Conversation

@BrianLusina
Copy link
Owner

@BrianLusina BrianLusina commented Nov 16, 2025

Describe your change:

Algorithm to find similar string groups

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms have a URL in its comments that points to Wikipedia or other similar explanation.
  • If this pull request resolves one or more open issues then the commit message contains Fixes: #{$ISSUE_NO}.

Summary by CodeRabbit

  • New Features

    • Added Similar String Groups feature (two implementations).
    • Added Union-Find / Disjoint Set data structures and exposed them in the datastructures package.
  • Documentation

    • Added README for Similar String Groups and updated repository navigation.
  • Tests

    • Added unit tests covering multiple scenarios to validate the group-counting implementations.

@BrianLusina BrianLusina self-assigned this Nov 16, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 16, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a Union-Find implementation and re-exports, introduces a new "Similar String Groups" module with two DSU-based implementations, tests and README, and updates package navigation in DIRECTORY.md.

Changes

Cohort / File(s) Summary
Package exports
datastructures/__init__.py, datastructures/sets/__init__.py
Re-export DisjointSetUnion and UnionFind by importing from the new union_find subpackage and updating __all__.
Union-Find implementation
datastructures/sets/union_find/__init__.py
New module adding DisjointSetUnion (union by rank, path compression, get_count) and UnionFind (path-compressed find, union). Constructors validate size > 0.
Similar String Groups module
pystrings/similar_string_groups/__init__.py, pystrings/similar_string_groups/README.md
New implementations num_similar_groups and num_similar_groups_2 with helper similarity predicates and README describing the DSU approach, complexity, and examples.
Tests
pystrings/similar_string_groups/test_similar_string_groups.py
New unittest suite exercising both implementations with multiple test cases.
Navigation
DIRECTORY.md
Added "Similar String Groups" entry under Pystrings/Reverse Vowels navigation.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as num_similar_groups()
    participant DSU as DisjointSetUnion
    participant Check as are_similar()

    Caller->>DSU: __init__(n)
    Note right of DSU: parent[], rank[], count initialized

    loop for each pair (i, j)
        Caller->>Check: are_similar(strs[i], strs[j])
        alt similar
            Check-->>Caller: True
            Caller->>DSU: union(i, j)
            Note right of DSU: merge roots, update count
        else not similar
            Check-->>Caller: False
        end
    end

    Caller->>DSU: get_count()
    DSU-->>Caller: group_count
    Caller-->>Caller: return group_count
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Review DisjointSetUnion path compression and union-by-rank correctness and edge cases (size validation, indexing).
  • Confirm get_count() updates properly on unions.
  • Validate similarity helpers handle length mismatches, early exits when diffs > 2, and correct mirrored-pair check.
  • Ensure tests import exported names (package re-exports) and exercise both implementations.

Possibly related PRs

Suggested labels

enhancement, Documentation

Poem

🐰 I hop through roots and ranks tonight,
I swap two letters, make groups tight,
With unions done and tests in play,
Similar strings find their way—
A tiny hop, a tidy byte.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.32% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(strings): similar string groups' clearly and concisely summarizes the main change: adding a similar string groups algorithm to the strings module.
Description check ✅ Passed The pull request description follows the required template with all major sections completed, including a clear description of the change and comprehensive checklist coverage.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc02f1e and 69e2998.

📒 Files selected for processing (2)
  • datastructures/sets/union_find/__init__.py (1 hunks)
  • pystrings/similar_string_groups/__init__.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
pystrings/similar_string_groups/__init__.py (1)

67-77: Implementation works but uses different counting approach.

This implementation correctly groups similar strings. However, there's a subtle difference in how it counts groups compared to num_similar_groups:

  • This version counts unique roots via set comprehension (line 76)
  • num_similar_groups uses DisjointSetUnion.get_count()

Both approaches are correct, but the set-based counting approach is less efficient as it performs O(n) find operations after the union operations complete.

Consider documenting why two implementations exist or consolidating them:

def num_similar_groups_2(strs: List[str]) -> int:
    """
    Alternative implementation that demonstrates counting groups 
    via unique roots instead of maintaining a count during unions.
    """
    n = len(strs)
    uf = DisjointSetUnion(n)
    
    for i in range(n):
        for j in range(i + 1, n):
            if are_similar(strs[i], strs[j]):
                uf.union(i, j)
    
    # Alternative: return uf.get_count() for better performance
    roots = {uf.find(i) for i in range(n)}
    return len(roots)
datastructures/sets/union_find/__init__.py (1)

25-26: Remove debug comments.

Lines 25-26 contain inline comments (# 0, # 1) that appear to be debug artifacts from development. These should be removed for cleaner code.

-        root_i = self.find(i) # 0
-        root_j = self.find(j) # 1
+        root_i = self.find(i)
+        root_j = self.find(j)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a90f0fc and aa47a80.

⛔ Files ignored due to path filters (10)
  • pystrings/similar_string_groups/images/similar_string_groups_example_1.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_example_2.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_example_3.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_solution_1.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_solution_2.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_solution_3.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_solution_4.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_solution_5.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_solution_6.png is excluded by !**/*.png
  • pystrings/similar_string_groups/images/similar_string_groups_solution_7.png is excluded by !**/*.png
📒 Files selected for processing (7)
  • DIRECTORY.md (1 hunks)
  • datastructures/__init__.py (1 hunks)
  • datastructures/sets/__init__.py (1 hunks)
  • datastructures/sets/union_find/__init__.py (1 hunks)
  • pystrings/similar_string_groups/README.md (1 hunks)
  • pystrings/similar_string_groups/__init__.py (1 hunks)
  • pystrings/similar_string_groups/test_similar_string_groups.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
pystrings/similar_string_groups/test_similar_string_groups.py (1)
pystrings/similar_string_groups/__init__.py (2)
  • num_similar_groups (5-52)
  • num_similar_groups_2 (67-77)
datastructures/sets/__init__.py (1)
datastructures/sets/union_find/__init__.py (2)
  • DisjointSetUnion (1-45)
  • UnionFind (48-62)
pystrings/similar_string_groups/__init__.py (1)
datastructures/sets/union_find/__init__.py (7)
  • DisjointSetUnion (1-45)
  • UnionFind (48-62)
  • union (20-41)
  • union (58-62)
  • get_count (43-45)
  • find (12-18)
  • find (52-56)
datastructures/__init__.py (1)
datastructures/sets/union_find/__init__.py (2)
  • DisjointSetUnion (1-45)
  • UnionFind (48-62)
🪛 LanguageTool
pystrings/similar_string_groups/README.md

[style] ~3-~3: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...e considered similar if they are either exactly the same or can be made identical by swapping at...

(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)


[style] ~38-~38: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...o strings are similar — that is, either exactly the same or differ at exactly two positions (me...

(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)

🪛 markdownlint-cli2 (0.18.1)
DIRECTORY.md

732-732: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


733-733: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

pystrings/similar_string_groups/README.md

54-54: Inconsistent indentation for list items at the same level
Expected: 3; Actual: 4

(MD005, list-indent)


56-56: Reference links and images should use a label that is defined
Missing link or image reference definition: "::-1"

(MD052, reference-links-images)

🔇 Additional comments (6)
DIRECTORY.md (1)

732-733: LGTM! Documentation entry properly placed.

The Similar String Groups entry is correctly positioned under Pystrings and follows the existing indentation pattern used throughout the file.

datastructures/__init__.py (1)

1-6: LGTM! Clean package organization.

The re-export pattern properly exposes the DSU classes at the package level, making them conveniently accessible via from datastructures import DisjointSetUnion, UnionFind.

pystrings/similar_string_groups/test_similar_string_groups.py (1)

1-60: LGTM! Comprehensive test coverage.

The test suite effectively validates both implementations with identical test cases covering:

  • Multiple distinct groups (test_1: 3 groups)
  • All strings in a single group (test_2: 1 group)
  • Partial grouping (test_3, test_4: 2 groups)
  • Longer strings for performance validation

The parallel test structure ensures both num_similar_groups and num_similar_groups_2 maintain consistent behavior.

datastructures/sets/__init__.py (1)

1-3: LGTM! Proper module organization.

Clean re-export pattern that maintains the package hierarchy while making the Union-Find classes accessible at the sets module level.

pystrings/similar_string_groups/README.md (1)

1-113: Excellent documentation!

The README provides comprehensive coverage of the problem with:

  • Clear problem statement and constraints
  • Visual examples for understanding
  • Detailed solution walkthrough
  • Accurate time complexity analysis: O(n²·m) dominated by string comparisons
  • Space complexity analysis: O(n) for Union-Find and tracking structures

The explanation of the Union-Find approach is well-articulated and helps readers understand both the algorithm and data structure choice.

pystrings/similar_string_groups/__init__.py (1)

55-65: LGTM! Correct similarity check.

The are_similar function correctly implements the similarity logic:

  • Returns True for identical strings (0 differences)
  • Returns True for strings differing at exactly 2 positions where the character pairs are symmetric
  • Early exits when more than 2 differences detected

The symmetry check diff[0] == diff[1][::-1] properly validates that swapping would make the strings equal.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
datastructures/sets/union_find/__init__.py (1)

48-74: Consider consolidating or documenting the purpose of having two Union-Find implementations.

Both UnionFind and DisjointSetUnion serve the same purpose but with different optimizations:

  • DisjointSetUnion: Uses union by rank + path compression, tracks set count → O(α(n)) amortized per operation
  • UnionFind: Uses only path compression → can degrade to O(log n) per operation

This creates potential confusion about which implementation to use. Consider one of the following:

  1. Consolidate: Remove UnionFind and use only DisjointSetUnion (recommended for best performance)
  2. Document differences: Add clear docstring guidance on when to use each (e.g., "Use UnionFind for simpler cases where minimal memory overhead is preferred; use DisjointSetUnion for optimal performance")
  3. Enhance UnionFind: Add union by rank and get_count() to match DisjointSetUnion's API and performance

Based on learnings from the past review comment, option 1 (consolidation) was suggested. If both are needed, consider adding this to the UnionFind docstring:

class UnionFind:
    """
    A minimal Union-Find data structure with path compression.
    
    Note: This is a simpler alternative to DisjointSetUnion without union by rank
    optimization. For performance-critical applications, prefer DisjointSetUnion.
    """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa47a80 and a1cdbce.

📒 Files selected for processing (1)
  • datastructures/sets/union_find/__init__.py (1 hunks)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pystrings/similar_string_groups/__init__.py (1)

17-41: Code duplication between is_similar and are_similar.

The nested is_similar function duplicates the logic of the module-level are_similar function (lines 56-66). This was flagged in previous reviews and remains unaddressed.

Consider removing the nested is_similar and using are_similar directly:

-    # All strings have the same length, per constraints
-    word_len = len(strs[0])
-
     # Initialize Union-Find with n elements, one for each string.
     # The initial count is n (each string is its own group).
     uf = DisjointSetUnion(strs_len)
 
-    def is_similar(s1: str, s2: str) -> bool:
-        """
-        Checks if two strings are similar.
-        Similar means identical (0 diffs) or 1 swap (2 diffs).
-        """
-        diff_count = 0
-        positions_that_differ = []
-        for k in range(word_len):
-            if s1[k] != s2[k]:
-                positions_that_differ.append(k)
-                diff_count += 1
-
-            # Optimization: If more than 2 differences,
-            # they can't be similar.
-            if diff_count > 2:
-                return False
-
-        if diff_count == 2:
-            i = positions_that_differ[0]
-            j = positions_that_differ[1]
-            return s1[i] == s2[j] and s1[j] == s2[i]
-
-        # At this point, diff_count is either 0 or 1
-        # Only 0 differences (identical strings) are similar
-        return diff_count == 0
-
     # Iterate over all unique pairs of strings
     for i in range(strs_len):
         for j in range(i + 1, strs_len):
             # If the strings are similar, merge their groups.
             # The union() method handles decrementing the count
             # only if they were in different groups.
-            if is_similar(strs[i], strs[j]):
+            if are_similar(strs[i], strs[j]):
                 uf.union(i, j)
🧹 Nitpick comments (3)
pystrings/similar_string_groups/__init__.py (3)

2-2: Remove unused import.

UnionFind is imported but never used in this file.

Apply this diff to remove the unused import:

-from datastructures import DisjointSetUnion, UnionFind
+from datastructures import DisjointSetUnion

56-56: Add type hints for consistency.

The are_similar function is missing type hints for parameters and return type, which is inconsistent with the rest of the codebase.

Apply this diff to add type hints:

-def are_similar(s1, s2):
+def are_similar(s1: str, s2: str) -> bool:

77-78: Consider using get_count() for consistency.

num_similar_groups_2 manually counts unique roots by creating a set, while num_similar_groups uses the more efficient uf.get_count() method. Both approaches are correct, but using get_count() is more efficient (O(1) vs O(n)) and consistent with the DSU API.

Apply this diff to use get_count():

-    roots = {uf.find(i) for i in range(n)}
-    return len(roots)
+    return uf.get_count()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1cdbce and dc02f1e.

📒 Files selected for processing (1)
  • pystrings/similar_string_groups/__init__.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pystrings/similar_string_groups/__init__.py (1)
datastructures/sets/union_find/__init__.py (7)
  • DisjointSetUnion (1-45)
  • UnionFind (48-74)
  • union (20-41)
  • union (64-74)
  • get_count (43-45)
  • find (12-18)
  • find (57-62)
🔇 Additional comments (1)
pystrings/similar_string_groups/__init__.py (1)

39-41: Bug fix verified.

The logic issue from the previous review has been correctly addressed. The function now properly rejects strings with exactly 1 difference, returning True only for identical strings when diff_count != 2.

@BrianLusina BrianLusina merged commit e43536e into main Nov 16, 2025
4 of 7 checks passed
@BrianLusina BrianLusina deleted the feat/similar-string-groups branch November 16, 2025 05:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants