
Bug fix: #79 and delete all existing duplicate links#82

Open
PawsFunctions wants to merge 2 commits into rtuszik:main from PawsFunctions:main

Conversation

@PawsFunctions
Contributor

Bug fix: #79 - Failed to get all the existing links when the entire response is filled with duplicates.

Edit: updated get_existing_links to use the new search API and changed the cursor logic to use nextCursor from the response.
ADD: delete_links function to delete all duplicate links (OPT_DELETE_DUPLICATE).
ADD: DEBUG environment variable to replicate -d.
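The cursor change described above can be sketched as a loop that follows `nextCursor` until the API returns a null cursor. This is a minimal illustration, assuming a search-style response of the shape `{"links": [...], "nextCursor": <id-or-null>}`; the `fetch_page` stub and the fake fetcher stand in for the real HTTP call and are not part of the PR.

```python
def paginate_links(fetch_page):
    """Yield every link dict across pages, following nextCursor."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        for link in page.get("links", []):
            yield link
        cursor = page.get("nextCursor")
        if cursor is None:  # the API signals the last page with a null cursor
            break

# Tiny in-memory stand-in for the HTTP call, for demonstration only.
def make_fake_fetcher(pages):
    def fetch_page(cursor):
        index = 0 if cursor is None else cursor
        links, next_cursor = pages[index]
        return {"links": links, "nextCursor": next_cursor}
    return fetch_page

fetch = make_fake_fetcher([
    ([{"id": 1, "url": "https://a"}], 1),
    ([{"id": 2, "url": "https://a"}], None),  # duplicate URL on page 2
])
urls = [link["url"] for link in paginate_links(fetch)]
```

Unlike offset-based paging, a `nextCursor` loop keeps returning results even when every record on a page is a duplicate, which is what bug #79 describes.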


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • Consider separating the duplicate deletion logic from get_existing_links into a dedicated function so that get_existing_links focuses purely on pagination/iteration and side effects like deletions are handled explicitly by the caller.
  • Logging the full duplicate_link_ids list can become very large and may expose internal IDs; it would be safer to log only the count and perhaps a small sample instead of the entire list.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider separating the duplicate deletion logic from `get_existing_links` into a dedicated function so that `get_existing_links` focuses purely on pagination/iteration and side effects like deletions are handled explicitly by the caller.
- Logging the full `duplicate_link_ids` list can become very large and may expose internal IDs; it would be safer to log only the count and perhaps a small sample instead of the entire list.
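Both overall comments can be sketched together: a pure duplicate-finder with no deletion side effects, and a logger that emits only a count plus a bounded sample of IDs. The function names echo the PR's vocabulary, but the bodies are illustrative assumptions, not the actual `starwarden` implementation.

```python
import logging

logger = logging.getLogger("starwarden")  # assumed logger name

def find_duplicate_ids(links):
    """Return IDs of links whose URL was already seen (pure, no deletion)."""
    seen_urls = set()
    duplicate_link_ids = []
    for link in links:
        if link["url"] in seen_urls:
            duplicate_link_ids.append(link["id"])
        else:
            seen_urls.add(link["url"])
    return duplicate_link_ids

def log_duplicates(duplicate_link_ids, sample_size=5):
    # Avoid dumping the whole ID list: log a count plus a small sample.
    logger.info(
        "Found %d duplicate links (sample IDs: %s)",
        len(duplicate_link_ids),
        duplicate_link_ids[:sample_size],
    )

dups = find_duplicate_ids([
    {"id": 1, "url": "https://a"},
    {"id": 2, "url": "https://a"},
    {"id": 3, "url": "https://b"},
])
log_duplicates(dups)
```

With this split, the caller decides explicitly whether to pass the returned IDs to a deletion routine, rather than having deletion happen as a side effect of iteration.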

## Individual Comments

### Comment 1
<location> `starwarden/linkwarden_api.py:43-56` </location>
<code_context>
-            if not links:
+            total_links_processed += len(links)
+            
+            for link in links:
+                link_url = link["url"]
+                link_id = link["id"]
+                
+                if link_url in seen_urls:
+                    # Found a duplicate
+                    logger.debug(f"Found duplicate link: {link_url} (ID: {link_id})")
+                    duplicate_link_ids.append(link_id)
+                else:
+                    seen_urls.add(link_url)
+                
+                yield link_url
+
+            if next_cursor is None:
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Consider not yielding URLs that are detected as duplicates if they are going to be deleted.

`get_existing_links` currently yields `link_url` for every record, including those already in `seen_urls` whose `link_id` is queued in `duplicate_link_ids` for deletion. A caller expecting a stream of unique, current links will still receive these soon-to-be-deleted duplicates. If the goal (especially with `delete_duplicate=True`) is to present a deduplicated view, you could avoid yielding when `link_url in seen_urls` and only record the `link_id` for deletion. If some callers depend on the existing "yield everything" behavior, consider making that distinction explicit via a parameter or clearer naming.

```suggestion
            total_links_processed += len(links)

            for link in links:
                link_url = link["url"]
                link_id = link["id"]

                if link_url in seen_urls:
                    # Found a duplicate
                    logger.debug(f"Found duplicate link: {link_url} (ID: {link_id})")
                    duplicate_link_ids.append(link_id)
                    # Do not yield duplicates since they are queued for deletion
                    continue
                else:
                    seen_urls.add(link_url)
                    # Only yield URLs that are not marked as duplicates
                    yield link_url
```
</issue_to_address>
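The suggested change above is a diff fragment; a self-contained version of the same idea looks like the generator below, which yields each URL only the first time it is seen and queues later occurrences for deletion. This is a sketch under the comment's assumptions; the real method lives on an API class and pages over HTTP.

```python
def iter_unique_urls(links, duplicate_link_ids):
    """Yield each URL once; append the IDs of repeat occurrences."""
    seen_urls = set()
    for link in links:
        if link["url"] in seen_urls:
            duplicate_link_ids.append(link["id"])
            continue  # do not yield soon-to-be-deleted duplicates
        seen_urls.add(link["url"])
        yield link["url"]

duplicate_link_ids = []
urls = list(iter_unique_urls(
    [
        {"id": 1, "url": "https://a"},
        {"id": 2, "url": "https://a"},  # duplicate, queued for deletion
        {"id": 3, "url": "https://b"},
    ],
    duplicate_link_ids,
))
```

Callers that depend on the old "yield everything" behavior would need the explicit flag or renamed function the reviewer mentions.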


Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
@rtuszik
Owner

rtuszik commented Jan 31, 2026

I can't merge a PR that has failing tests.


Development

Successfully merging this pull request may close these issues.

Bug: Failed to get all the existing links when have too many duplicate links
