feat(strings): similar string groups #104

BrianLusina · 2025-11-16T04:10:39Z

Describe your change:

Algorithm to find similar string groups

Add an algorithm?
Fix a bug or typo in an existing algorithm?
Documentation change?

Checklist:

Summary by CodeRabbit

New Features
- Added Similar String Groups feature (two implementations).
- Added Union-Find / Disjoint Set data structures and exposed them in the datastructures package.
Documentation
- Added README for Similar String Groups and updated repository navigation.
Tests
- Added unit tests covering multiple scenarios to validate the group-counting implementations.

…t union

coderabbitai · 2025-11-16T04:10:57Z

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a Union-Find implementation and re-exports, introduces a new "Similar String Groups" module with two DSU-based implementations, tests and README, and updates package navigation in DIRECTORY.md.

Changes

Cohort / File(s)	Summary
Package exports `datastructures/__init__.py`, `datastructures/sets/__init__.py`	Re-export `DisjointSetUnion` and `UnionFind` by importing from the new union_find subpackage and updating `__all__`.
Union-Find implementation `datastructures/sets/union_find/__init__.py`	New module adding `DisjointSetUnion` (union by rank, path compression, `get_count`) and `UnionFind` (path-compressed find, union). Constructors validate size > 0.
Similar String Groups module `pystrings/similar_string_groups/__init__.py`, `pystrings/similar_string_groups/README.md`	New implementations `num_similar_groups` and `num_similar_groups_2` with helper similarity predicates and README describing the DSU approach, complexity, and examples.
Tests `pystrings/similar_string_groups/test_similar_string_groups.py`	New unittest suite exercising both implementations with multiple test cases.
Navigation `DIRECTORY.md`	Added "Similar String Groups" entry under Pystrings/Reverse Vowels navigation.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as num_similar_groups()
    participant DSU as DisjointSetUnion
    participant Check as are_similar()

    Caller->>DSU: __init__(n)
    Note right of DSU: parent[], rank[], count initialized

    loop for each pair (i, j)
        Caller->>Check: are_similar(strs[i], strs[j])
        alt similar
            Check-->>Caller: True
            Caller->>DSU: union(i, j)
            Note right of DSU: merge roots, update count
        else not similar
            Check-->>Caller: False
        end
    end

    Caller->>DSU: get_count()
    DSU-->>Caller: group_count
    Caller-->>Caller: return group_count

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Review DisjointSetUnion path compression and union-by-rank correctness and edge cases (size validation, indexing).
Confirm get_count() updates properly on unions.
Validate similarity helpers handle length mismatches, early exits when diffs > 2, and correct mirrored-pair check.
Ensure tests import exported names (package re-exports) and exercise both implementations.

Possibly related PRs

feat(strings): similar string groups #104 — Adds the same Similar String Groups implementation and the Union-Find exports; likely a duplicate or closely related change.

Suggested labels

enhancement, Documentation

Poem

🐰 I hop through roots and ranks tonight,
I swap two letters, make groups tight,
With unions done and tests in play,
Similar strings find their way—
A tiny hop, a tidy byte.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.32% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(strings): similar string groups' clearly and concisely summarizes the main change: adding a similar string groups algorithm to the strings module.
Description check	✅ Passed	The pull request description follows the required template with all major sections completed, including a clear description of the change and comprehensive checklist coverage.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc02f1e and 69e2998.

📒 Files selected for processing (2)

datastructures/sets/union_find/__init__.py (1 hunks)
pystrings/similar_string_groups/__init__.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

pystrings/similar_string_groups/__init__.py (1)
67-77: Implementation works but uses different counting approach.

This implementation correctly groups similar strings. However, there's a subtle difference in how it counts groups compared to num_similar_groups:

This version counts unique roots via set comprehension (line 76)

num_similar_groups uses DisjointSetUnion.get_count()

Both approaches are correct, but the set-based counting approach is less efficient as it performs O(n) find operations after the union operations complete.

Consider documenting why two implementations exist or consolidating them:
def num_similar_groups_2(strs: List[str]) -> int:
    """
    Alternative implementation that demonstrates counting groups 
    via unique roots instead of maintaining a count during unions.
    """
    n = len(strs)
    uf = DisjointSetUnion(n)
    
    for i in range(n):
        for j in range(i + 1, n):
            if are_similar(strs[i], strs[j]):
                uf.union(i, j)
    
    # Alternative: return uf.get_count() for better performance
    roots = {uf.find(i) for i in range(n)}
    return len(roots)
datastructures/sets/union_find/__init__.py (1)
25-26: Remove debug comments.

Lines 25-26 contain inline comments (# 0, # 1) that appear to be debug artifacts from development. These should be removed for cleaner code.
-        root_i = self.find(i) # 0
-        root_j = self.find(j) # 1
+        root_i = self.find(i)
+        root_j = self.find(j)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a90f0fc and aa47a80.

⛔ Files ignored due to path filters (10)

pystrings/similar_string_groups/images/similar_string_groups_example_1.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_example_2.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_example_3.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_solution_1.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_solution_2.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_solution_3.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_solution_4.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_solution_5.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_solution_6.png is excluded by !**/*.png
pystrings/similar_string_groups/images/similar_string_groups_solution_7.png is excluded by !**/*.png

📒 Files selected for processing (7)

DIRECTORY.md (1 hunks)
datastructures/__init__.py (1 hunks)
datastructures/sets/__init__.py (1 hunks)
datastructures/sets/union_find/__init__.py (1 hunks)
pystrings/similar_string_groups/README.md (1 hunks)
pystrings/similar_string_groups/__init__.py (1 hunks)
pystrings/similar_string_groups/test_similar_string_groups.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (4)

pystrings/similar_string_groups/test_similar_string_groups.py (1)

pystrings/similar_string_groups/__init__.py (2)

num_similar_groups (5-52)

num_similar_groups_2 (67-77)

datastructures/sets/__init__.py (1)

datastructures/sets/union_find/__init__.py (2)

DisjointSetUnion (1-45)

UnionFind (48-62)

pystrings/similar_string_groups/__init__.py (1)

datastructures/sets/union_find/__init__.py (7)

DisjointSetUnion (1-45)

UnionFind (48-62)

union (20-41)

union (58-62)

get_count (43-45)

find (12-18)

find (52-56)

datastructures/__init__.py (1)

datastructures/sets/union_find/__init__.py (2)

DisjointSetUnion (1-45)

UnionFind (48-62)

🪛 LanguageTool

pystrings/similar_string_groups/README.md

[style] ~3-~3: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...e considered similar if they are either exactly the same or can be made identical by swapping at...

(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)

[style] ~38-~38: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...o strings are similar — that is, either exactly the same or differ at exactly two positions (me...

(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)

🪛 markdownlint-cli2 (0.18.1)

DIRECTORY.md

732-732: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

733-733: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

pystrings/similar_string_groups/README.md

54-54: Inconsistent indentation for list items at the same level
Expected: 3; Actual: 4

(MD005, list-indent)

56-56: Reference links and images should use a label that is defined
Missing link or image reference definition: "::-1"

(MD052, reference-links-images)

🔇 Additional comments (6)

DIRECTORY.md (1)

732-733: LGTM! Documentation entry properly placed.

The Similar String Groups entry is correctly positioned under Pystrings and follows the existing indentation pattern used throughout the file.

datastructures/__init__.py (1)

1-6: LGTM! Clean package organization.

The re-export pattern properly exposes the DSU classes at the package level, making them conveniently accessible via from datastructures import DisjointSetUnion, UnionFind.

pystrings/similar_string_groups/test_similar_string_groups.py (1)

1-60: LGTM! Comprehensive test coverage.

The test suite effectively validates both implementations with identical test cases covering:

Multiple distinct groups (test_1: 3 groups)

All strings in a single group (test_2: 1 group)

Partial grouping (test_3, test_4: 2 groups)

Longer strings for performance validation

The parallel test structure ensures both num_similar_groups and num_similar_groups_2 maintain consistent behavior.

datastructures/sets/__init__.py (1)

1-3: LGTM! Proper module organization.

Clean re-export pattern that maintains the package hierarchy while making the Union-Find classes accessible at the sets module level.

pystrings/similar_string_groups/README.md (1)

1-113: Excellent documentation!

The README provides comprehensive coverage of the problem with:

Clear problem statement and constraints

Visual examples for understanding

Detailed solution walkthrough

Accurate time complexity analysis: O(n²·m) dominated by string comparisons

Space complexity analysis: O(n) for Union-Find and tracking structures

The explanation of the Union-Find approach is well-articulated and helps readers understand both the algorithm and data structure choice.

pystrings/similar_string_groups/__init__.py (1)

55-65: LGTM! Correct similarity check.

The are_similar function correctly implements the similarity logic:

Returns True for identical strings (0 differences)

Returns True for strings differing at exactly 2 positions where the character pairs are symmetric

Early exits when more than 2 differences detected

The symmetry check diff[0] == diff[1][::-1] properly validates that swapping would make the strings equal.

datastructures/sets/union_find/__init__.py

pystrings/similar_string_groups/__init__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

datastructures/sets/union_find/__init__.py (1)
48-74: Consider consolidating or documenting the purpose of having two Union-Find implementations.

Both UnionFind and DisjointSetUnion serve the same purpose but with different optimizations:

DisjointSetUnion: Uses union by rank + path compression, tracks set count → O(α(n)) amortized per operation

UnionFind: Uses only path compression → can degrade to O(log n) per operation

This creates potential confusion about which implementation to use. Consider one of the following:

Consolidate: Remove UnionFind and use only DisjointSetUnion (recommended for best performance)

Document differences: Add clear docstring guidance on when to use each (e.g., "Use UnionFind for simpler cases where minimal memory overhead is preferred; use DisjointSetUnion for optimal performance")

Enhance UnionFind: Add union by rank and get_count() to match DisjointSetUnion's API and performance

Based on learnings from the past review comment, option 1 (consolidation) was suggested. If both are needed, consider adding this to the UnionFind docstring:
class UnionFind:
    """
    A minimal Union-Find data structure with path compression.
    
    Note: This is a simpler alternative to DisjointSetUnion without union by rank
    optimization. For performance-critical applications, prefer DisjointSetUnion.
    """

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa47a80 and a1cdbce.

📒 Files selected for processing (1)

datastructures/sets/union_find/__init__.py (1 hunks)

datastructures/sets/union_find/__init__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

pystrings/similar_string_groups/__init__.py (1)

17-41: Code duplication between is_similar and are_similar.

The nested is_similar function duplicates the logic of the module-level are_similar function (lines 56-66). This was flagged in previous reviews and remains unaddressed.

Consider removing the nested is_similar and using are_similar directly:

-    # All strings have the same length, per constraints
-    word_len = len(strs[0])
-
     # Initialize Union-Find with n elements, one for each string.
     # The initial count is n (each string is its own group).
     uf = DisjointSetUnion(strs_len)
 
-    def is_similar(s1: str, s2: str) -> bool:
-        """
-        Checks if two strings are similar.
-        Similar means identical (0 diffs) or 1 swap (2 diffs).
-        """
-        diff_count = 0
-        positions_that_differ = []
-        for k in range(word_len):
-            if s1[k] != s2[k]:
-                positions_that_differ.append(k)
-                diff_count += 1
-
-            # Optimization: If more than 2 differences,
-            # they can't be similar.
-            if diff_count > 2:
-                return False
-
-        if diff_count == 2:
-            i = positions_that_differ[0]
-            j = positions_that_differ[1]
-            return s1[i] == s2[j] and s1[j] == s2[i]
-
-        # At this point, diff_count is either 0 or 1
-        # Only 0 differences (identical strings) are similar
-        return diff_count == 0
-
     # Iterate over all unique pairs of strings
     for i in range(strs_len):
         for j in range(i + 1, strs_len):
             # If the strings are similar, merge their groups.
             # The union() method handles decrementing the count
             # only if they were in different groups.
-            if is_similar(strs[i], strs[j]):
+            if are_similar(strs[i], strs[j]):
                 uf.union(i, j)

🧹 Nitpick comments (3)

pystrings/similar_string_groups/__init__.py (3)
2-2: Remove unused import.

UnionFind is imported but never used in this file.

Apply this diff to remove the unused import:
-from datastructures import DisjointSetUnion, UnionFind
+from datastructures import DisjointSetUnion
56-56: Add type hints for consistency.

The are_similar function is missing type hints for parameters and return type, which is inconsistent with the rest of the codebase.

Apply this diff to add type hints:
-def are_similar(s1, s2):
+def are_similar(s1: str, s2: str) -> bool:
77-78: Consider using get_count() for consistency.

num_similar_groups_2 manually counts unique roots by creating a set, while num_similar_groups uses the more efficient uf.get_count() method. Both approaches are correct, but using get_count() is more efficient (O(1) vs O(n)) and consistent with the DSU API.

Apply this diff to use get_count():
-    roots = {uf.find(i) for i in range(n)}
-    return len(roots)
+    return uf.get_count()

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1cdbce and dc02f1e.

📒 Files selected for processing (1)

pystrings/similar_string_groups/__init__.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

pystrings/similar_string_groups/__init__.py (1)

datastructures/sets/union_find/__init__.py (7)

DisjointSetUnion (1-45)

UnionFind (48-74)

union (20-41)

union (64-74)

get_count (43-45)

find (12-18)

find (57-62)

🔇 Additional comments (1)

pystrings/similar_string_groups/__init__.py (1)

39-41: Bug fix verified.

The logic issue from the previous review has been correctly addressed. The function now properly rejects strings with exactly 1 difference, returning True only for identical strings when diff_count != 2.

pystrings/similar_string_groups/__init__.py

BrianLusina and others added 2 commits November 15, 2025 18:33

feat(strings, datastructures): similar string groups with disjoint se…

bc4c62e

…t union

updating DIRECTORY.md

aa47a80

BrianLusina self-assigned this Nov 16, 2025

BrianLusina added Algorithm Algorithm Problem Datastructures Datastructures Strings Disjoint Sets labels Nov 16, 2025

coderabbitai bot reviewed Nov 16, 2025

View reviewed changes

datastructures/sets/union_find/__init__.py Outdated Show resolved Hide resolved

pystrings/similar_string_groups/__init__.py Outdated Show resolved Hide resolved

refactor(datastructures, union-find): add type hints and doc comments

a1cdbce

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Nov 16, 2025

View reviewed changes

datastructures/sets/union_find/__init__.py Outdated Show resolved Hide resolved

refactor(strings, similar string groups): is similar string function

dc02f1e

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Nov 16, 2025

View reviewed changes

pystrings/similar_string_groups/__init__.py Outdated Show resolved Hide resolved

refactor(strings, similar-string-groups): add length check

69e2998

BrianLusina merged commit e43536e into main Nov 16, 2025
4 of 7 checks passed

BrianLusina deleted the feat/similar-string-groups branch November 16, 2025 05:06

coderabbitai bot mentioned this pull request Dec 11, 2025

feat(strings, heaps): longest happy string #116

Merged

14 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(strings): similar string groups #104

feat(strings): similar string groups #104

Uh oh!

BrianLusina commented Nov 16, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 16, 2025 •

edited

Loading

Review failed

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat(strings): similar string groups #104

feat(strings): similar string groups #104

Uh oh!

Conversation

BrianLusina commented Nov 16, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your change:

Checklist:

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

BrianLusina commented Nov 16, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 16, 2025 •

edited

Loading