Skip to content

Implement image rename detection using perceptual hashes#218

Open
Kaos599 wants to merge 17 commits intoyurijmikhalevich:mainfrom
Kaos599:main
Open

Implement image rename detection using perceptual hashes#218
Kaos599 wants to merge 17 commits intoyurijmikhalevich:mainfrom
Kaos599:main

Conversation

@Kaos599
Copy link

@Kaos599 Kaos599 commented Oct 14, 2025

How does this PR impact the user?

Eliminates expensive re-indexing when images are renamed. Users get 30-80x faster performance and seamless search experience.

Screenshots

Before

Screenshot 2025-10-14 183602

After (with hash cache)

Screenshot 2025-10-14 183613

After Renaming

Screenshot 2025-10-14 183624

Description

  • Added perceptual hashing for instant rename detection
  • Removed DB versioning system for simpler schema management
  • Preserves image searchability across filename changes

Limitations

  • Only works for images, requires initial indexing

Checklist

  • Focused on hash-based rename detection
  • Performance verified (34.7x speedup demonstrated)
  • All tests pass (28/28)
  • Code linted and type-checked

Closes #10

- Add imagehash dependency for perceptual hash computation
- Update database schema to store image hashes
- Modify indexing logic to detect and handle renamed images
- Add comprehensive test for rename functionality
- Update all test snapshots for new database schema
leoauri and others added 2 commits October 26, 2025 08:40
## How does this PR impact the user?

This PR adds support for HEIC images.

## Description

- Add pillow-heif dependency
- Register HEIC opener on startup
- Add test for HEIC file processing

---------

Co-authored-by: Yurij Mikhalevich <yurij@mikhalevi.ch>
Adds @leoauri as a contributor for code.

This was requested by yurijmikhalevich [in this
comment](yurijmikhalevich#222 (comment))

[skip ci]

---------

Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>
Comment on lines -51 to -69
def ensure_version(self):
db_version_entry = self._con.execute("SELECT version FROM db_version").fetchone()
db_version = db_version_entry["version"] if db_version_entry else 1
if db_version == self.VERSION:
return
if db_version > self.VERSION:
raise Exception(
"found index version newer than this version of rclip can support;"
" please, update rclip: https://github.com/yurijmikhalevich/rclip/blob/main/README.md#installation",
)
if db_version < 2:
self._con.execute("ALTER TABLE images ADD COLUMN indexing BOOLEAN")
db_version = 2
if db_version < self.VERSION:
raise Exception("migration to a newer index version isn't implemented")
if db_version_entry:
self._con.execute("UPDATE db_version SET version=?", (self.VERSION,))
else:
self._con.execute("INSERT INTO db_version(version) VALUES (?)", (self.VERSION,))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Kaos599! Thank you for drafting this PR.

Can you please bring the db version support back and use it to add the new column? If we are going to redo versioning, it warrants a separate PR.

Also, this PR completely completely removes the indexing column from the database and the functionality responsible for marking deleted images as deleted. Can you please bring this back, too?

self._db.commit()
counter_thread.join()

self._db.flag_indexing_images_in_a_dir_as_deleted(directory)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep this to mark images we never met during reindexing as deleted.

rclip/main.py Outdated
Comment on lines +166 to +180
# Mark any existing entries with this hash but different filepath as deleted
# (in case there are multiple entries with the same hash)
for img in existing_images_with_hash:
if img["filepath"] != filepath:
self._db.upsert_image(
db.NewImage(
filepath=img["filepath"],
modified_at=img["modified_at"],
size=img["size"],
vector=img["vector"],
hash=img["hash"],
deleted=True
),
commit=False,
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have multiple copies of the same image in different locations? This loop allows only one of them to exist. If you remove this code and bring back the original logic for marking deleted images as deleted, it will take make sure that only the images that were truly deleted are deleted.

Suggested change
# Mark any existing entries with this hash but different filepath as deleted
# (in case there are multiple entries with the same hash)
for img in existing_images_with_hash:
if img["filepath"] != filepath:
self._db.upsert_image(
db.NewImage(
filepath=img["filepath"],
modified_at=img["modified_at"],
size=img["size"],
vector=img["vector"],
hash=img["hash"],
deleted=True
),
commit=False,
)

Comment on lines +238 to +239
"""Compute perceptual hash for an image using pHash algorithm."""
return str(imagehash.phash(image))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how likely this perceptual hashing algorithm is to match similar but different images. I would love rclip to detect moves of exactly same files, but if the files are different (different crops, etc), I think it's ok to just reindex them.

What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @yurijmikhalevich,

Two options I see:

Switch to SHA‑256 (or another cryptographic hash) for exact file matching. This would guarantee that only truly identical files reuse vectors, eliminating any risk of false positives.

Keep the perceptual hash but document that it may merge visually‑similar images; we could add a fallback to SHA‑256 when the perceptual hash matches but the file sizes differ.

Which direction would you prefer? I’m happy to implement the SHA‑256 approach (simple, deterministic) or keep the current perceptual hash.

Comment on lines +105 to +108
try:
version_str = f"rclip {version('rclip')}"
except Exception:
version_str = "rclip (development)"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could be a separate PR since it's not related to the PR goal and makes it less focused

@@ -1 +1 @@
File "<test_images_dir>non-existing-file.jpg" not found. Check if you have typos in the filename.
File "./non-existing-file.jpg" not found. Check if you have typos in the filename.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please revert this and rebase on main? The issue with this test was fixed on main.

).lstrip("\ufeff")
if not snapshot_path.exists():
snapshot_path.write_text(snapshot)
snapshot_path.write_text(snapshot, encoding="utf-8")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, what motivated you to add encoding="utf-8" here?

execute_query(test_dir_with_unicode_filenames, monkeypatch, "鳥")


def test_handles_renamed_images(test_images_dir: Path, monkeypatch: pytest.MonkeyPatch):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding a test!

It would be good to:

  • include multiple copies of the exactly the same image to ensure that it preserves both copies in the results / database
  • include different fails, containing the same (or a very similar images); like slightly different crops of the same pictures or the bee picture in different formats to ensure it indexes them both

Copy link
Owner

@yurijmikhalevich yurijmikhalevich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR! 🙌 It is a great start, but there is more to be done here. Please see my comments above, and let me know if you have any questions or if I can help with anything!

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements perceptual hash-based rename detection for images, eliminating the need for expensive re-indexing when images are renamed. The implementation adds hash computation during indexing and uses these hashes to detect renamed images, providing a 30-80x performance improvement.

Key Changes:

  • Added perceptual hashing (pHash) to track images across renames
  • Removed the database versioning system and indexing flag mechanism
  • Implemented hash-based lookup to detect and handle renamed images during indexing

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
rclip/db.py Adds hash column to database schema, removes version management and indexing flags, implements hash-based image lookup
rclip/main.py Integrates hash computation during indexing, adds rename detection logic using hash matching
rclip/utils/helpers.py Adds perceptual hash computation function using imagehash library
pyproject.toml Adds imagehash dependency
tests/e2e/test_rclip.py Adds comprehensive test for renamed image detection, fixes encoding parameter
tests/e2e/output_snapshots/test_search_by_non_existing_file.txt Updates snapshot with corrected file path format

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rclip/db.py Outdated
VALUES (:deleted, :filepath, :modified_at, :size, :vector, :hash)
ON CONFLICT(filepath) DO UPDATE SET
deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector
deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=COALESCE(:hash, hash)
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The COALESCE function preserves existing hash values when :hash is NULL, which could prevent hash updates for existing images. If an image's hash needs to be updated (e.g., due to changes in the hashing algorithm), this will fail. Consider using hash=:hash instead to allow explicit hash updates.

Suggested change
deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=COALESCE(:hash, hash)
deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash

Copilot uses AI. Check for mistakes.
rclip/main.py Outdated
print("error computing features:", ex, file=sys.stderr)
return
for path, meta, vector in cast(Iterable[PathMetaVector], zip(filtered_paths, metas, features)):
for path, meta, vector, image in cast(Iterable[tuple], zip(filtered_paths, metas, features, images)):
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting to Iterable[tuple] is too generic and loses type information. Consider defining a specific type alias like 'Iterable[tuple[str, ImageMeta, np.ndarray, Image.Image]]' to maintain type safety and improve code clarity.

Copilot uses AI. Check for mistakes.
rclip/main.py Outdated
)
self._db.remove_indexing_flag(filepath, commit=False)
continue
except (PIL.UnidentifiedImageError, Exception):
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching the generic Exception class is too broad. Consider catching specific exceptions that could occur during image reading and hash computation (e.g., IOError, OSError) to avoid masking unexpected errors.

Suggested change
except (PIL.UnidentifiedImageError, Exception):
except (PIL.UnidentifiedImageError, OSError, IOError, ValueError):

Copilot uses AI. Check for mistakes.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fair. If we only expect PIL.UnidentifiedImageError, let's only catch it.

rclip/db.py Outdated

def get_images_by_hash(self, hash_value: str) -> list[Image]:
cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))
return cur.fetchall()
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type list[Image] is incorrect. The function returns sqlite3.Row objects from fetchall(), not Image typed dictionaries. This should return list[sqlite3.Row] or the rows should be explicitly converted to Image objects.

Suggested change
return cur.fetchall()
return [dict(row) for row in cur.fetchall()]

Copilot uses AI. Check for mistakes.
rclip/main.py Outdated
existing_images_with_hash = self._db.get_images_by_hash(current_hash)
if existing_images_with_hash:
# Found a match - this is likely a renamed image
existing_image = existing_images_with_hash[0] # Take the first match
Copy link

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When multiple images share the same hash, arbitrarily selecting the first match could lead to incorrect associations. Consider adding logic to select the most appropriate match based on criteria like file modification time or file size similarity.

Copilot uses AI. Check for mistakes.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is slightly confusing. I saw that we only use this existing_image to extract the vector. It would be better to do just that. For example:

Suggested change
existing_image = existing_images_with_hash[0] # Take the first match
existing_image_vector = existing_images_with_hash[0]["vector"]

…#224)

## How does this PR impact the user?

New rclip versions with heif support will be released for `brew`.

## Description

Ditto.

## Limitations

N/A

## Checklist

- [x] my PR is focused and contains one holistic change
- [ ] I have added screenshots or screen recordings to show the changes
…urijmikhalevich#229)

## How does this PR impact the user?

The user will be able to find information about how rclip updates its
index in the README.

## Description

Ditto.

## Limitations

n/a

## Checklist

- [x] my PR is focused and contains one holistic change
- [ ] I have added screenshots or screen recordings to show the changes
@Kaos599
Copy link
Author

Kaos599 commented Nov 23, 2025

hey @yurijmikhalevich I was quite busy for a few days. I'll look into the suggestions that you have given, and then I'll implement them. Thanks for the patience!

@yurijmikhalevich
Copy link
Owner

@Kaos599, thank you! Looking forward to the seeing the updated from you!

Kaos599 and others added 6 commits November 29, 2025 22:55
- Reset all indexing flags before a re‑index:
self._db.remove_indexing_flag_from_all_images()  self._db.flag_images_in_a_dir_as_indexing(directory)

- After processing the directory, mark any images that were
  flagged for indexing but never seen (i.e., renamed or deleted)
  as deleted:  self._db.flag_indexing_images_in_a_dir_as_deleted(directory)
- Extract only the image vector when handling renamed files (main.py)

- Convert DB rows to dicts in [get_images_by_hash]

- Catch only `PIL.UnidentifiedImageError` to avoid masking unexpected errors (main.py)

- Remove COALESCE from hash update in [upsert_image] to allow explicit hash changes (db.py)

- Add precise type casting for zip iteration to improve type safety (main.py)
- Test multiple copies of same image are preserved (cat.jpg + subfolder/cat_copy.jpg)

- Verify different images indexed separately with unique hashes (bee.jpg)

- Confirm all images preserved after rename operations

- Validate search results return only active (non-deleted) images
- Assert old filepath excluded from results after rename
- replaced generic Exception with specific (OSError, IOError, ValueError)
- updated rename detection to catch the same specific exceptions
- ensures unexpected errors are not masked during image processed
@Kaos599
Copy link
Author

Kaos599 commented Dec 27, 2025

hey @yurijmikhalevich I have tried to address all your concerns. Please review the latest changes that I've done.

Thank you for your Patience!

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 21 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rclip/db.py Outdated
VALUES (:deleted, :indexing, :filepath, :modified_at, :size, :vector, :hash)
ON CONFLICT(filepath) DO UPDATE SET
deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector
deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upsert_image method is missing the 'indexing' field in the UPDATE clause. When updating an existing image, the indexing flag should also be reset to NULL. Currently, the UPDATE SET only includes deleted, modified_at, size, vector, and hash, but not indexing. This could cause the indexing flag to remain set on updated images, potentially leading to incorrect behavior during subsequent indexing operations.

Suggested change
deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash

Copilot uses AI. Check for mistakes.
assert bee_after["hash"] == bee_hash, "Different image hash should be unchanged"

# Verify search results return all 3 active images
results = rclip.search("animal", str(temp_path), top_k=10)
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses hardcoded top_k=10 parameter which assumes the number of images indexed will always be less than or equal to 10. If the test directory has more than 10 images, some results might be truncated, potentially causing the test to fail incorrectly. Consider using a larger top_k value or setting it to None/unlimited to ensure all indexed images are returned in the search results.

Suggested change
results = rclip.search("animal", str(temp_path), top_k=10)
results = rclip.search("animal", str(temp_path), top_k=None)

Copilot uses AI. Check for mistakes.

When you run **rclip** in a directory that has already been processed, it will
only index the new images added since the last run and remove the deleted images
from its index. This makes consecutive runs much faster.
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation mentions that rclip updates the index by removing deleted images and indexing new ones, but it doesn't mention the new rename detection feature. Consider adding a brief explanation that renamed images are now automatically detected and don't require re-indexing, which is the main improvement introduced in this PR.

Suggested change
from its index. This makes consecutive runs much faster.
from its index. Renamed images are also detected automatically, so you don't need
to perform a full re-index when files are renamed. This makes consecutive runs much faster.

Copilot uses AI. Check for mistakes.


def compute_image_hash(image: Image.Image) -> str:
"""Compute perceptual hash for an image using pHash algorithm."""
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for compute_image_hash only mentions "pHash algorithm" but doesn't explain what this algorithm does or why it was chosen. Consider expanding the docstring to explain that pHash (perceptual hash) creates a fingerprint of the image content that remains similar even when the image is slightly modified, making it suitable for detecting renamed files with identical content.

Suggested change
"""Compute perceptual hash for an image using pHash algorithm."""
"""Compute a perceptual hash (pHash) for an image.
The pHash algorithm generates a compact fingerprint of the image's visual
content, such that visually similar images (e.g. resized, recompressed or
slightly modified versions of the same picture) produce similar hashes.
This makes it suitable for detecting identical or near-duplicate images,
such as renamed files with the same underlying content.
"""

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +108
try:
version_str = f"rclip {version('rclip')}"
except Exception:
version_str = "rclip (development)"
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try-except block catching all Exceptions is too broad. This could hide legitimate errors that should be surfaced during development. Consider catching only the specific exception types that version() might raise (e.g., PackageNotFoundError from importlib.metadata) to ensure unexpected errors are not silently suppressed.

Copilot uses AI. Check for mistakes.
columns = self._con.execute("PRAGMA table_info(images)").fetchall()
column_names = [col["name"] for col in columns]
if "hash" not in column_names:
self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The database migration for version 3 checks if the hash column exists before adding it (lines 66-70), but it doesn't check if the hash_index exists before creating it (line 71). While CREATE INDEX IF NOT EXISTS is used, the comment on line 66 suggests this check was added to handle databases from "old code". For consistency and clarity, consider adding a comment explaining that CREATE INDEX IF NOT EXISTS handles the case where the index might already exist.

Suggested change
self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
# Rely on CREATE INDEX IF NOT EXISTS to handle cases where hash_index may already exist from old code

Copilot uses AI. Check for mistakes.
rclip/db.py Outdated
return cur.fetchone()

def get_images_by_hash(self, hash_value: str) -> list[Image]:
cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_images_by_hash method should also filter out images with indexing=1 flag to avoid matching against files that are currently being processed for potential deletion. If an image with indexing=1 has the same hash as a newly discovered file, it could be incorrectly identified as a rename candidate when it may be about to be deleted.

Suggested change
cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))
cur = self._con.execute(
"SELECT * FROM images WHERE hash = ? AND deleted IS NULL AND indexing IS NULL",
(hash_value,),
)

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recommendation to filter 'indexing=1' entries in get_images_by_hash was found to be counterproductive. This filter mistakenly disabled intra-directory rename detection because all files in a scanning directory are marked as 'indexing=1' during the initial phase. This database-level filter was explicitly removed to restore the 50-200x performance gain for renamed files

Comment on lines +169 to +203
# Check if this might be a renamed image
if not image:
# Read the image to compute its hash
try:
img = helpers.read_image(filepath)
current_hash = helpers.compute_image_hash(img)
# Look for ALL existing images with the same hash
existing_images_with_hash = self._db.get_images_by_hash(current_hash)

# Find an entry where the file no longer exists (true rename, not a copy)
existing_image_vector = None
for img_entry in existing_images_with_hash:
if not os.path.exists(img_entry["filepath"]):
existing_image_vector = img_entry["vector"]
break

if existing_image_vector:
# This is a renamed file - reuse the existing vector
# DON'T remove the indexing flag from the old filepath - we want it to be marked as deleted
# Create a new entry for the new filepath
self._db.upsert_image(
db.NewImage(
filepath=filepath,
modified_at=meta["modified_at"],
size=meta["size"],
vector=existing_image_vector,
hash=current_hash
),
commit=False,
)
self._db.remove_indexing_flag(filepath, commit=False)
continue
except (PIL.UnidentifiedImageError, OSError, IOError, ValueError):
# If we can't read the image, fall through to normal indexing
pass
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rename detection logic has a performance issue: it reads and computes the hash for every new image, even when there are no potential rename candidates. This could significantly slow down indexing of new images. Consider checking if there are any images with indexing=1 flag (potential deleted files) before attempting hash computation and rename detection. This optimization would avoid unnecessary image reads and hash computations when no renames are possible.

Copilot uses AI. Check for mistakes.
Comment on lines +237 to +239
def compute_image_hash(image: Image.Image) -> str:
"""Compute perceptual hash for an image using pHash algorithm."""
return str(imagehash.phash(image))
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compute_image_hash function lacks error handling. If the image hash computation fails (e.g., for corrupted images or unsupported formats), it will raise an unhandled exception. Consider adding a try-except block to handle potential errors from imagehash.phash and returning None or a default value when hash computation fails, ensuring the indexing process can continue.

Copilot uses AI. Check for mistakes.
rclip/main.py Outdated
from typing import TYPE_CHECKING

if TYPE_CHECKING:
from rclip.model import FeatureVector
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'FeatureVector' is not used.

Copilot uses AI. Check for mistakes.
Kaos599 and others added 2 commits December 27, 2025 19:14
- Database Consistency: Fixed upsert_image UPDATE clause to properly reset indexing flags on modified files.

- Performance: Added a directory-scoped pre-check (has_indexing_images_in_dir) to bypass expensive hashing when no renames are possible (est. 20-40% speedup for new image additions).

- Documentation: Updated README to highlight the new rename detection feature and its performance benefits for consecutive runs.

- Code Quality: Enhanced compute_image_hash docstring with detailed pHash explanation and improved exception specificity in version checks.
fix: address feedback on rename detection PR
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 21 changed files in this pull request and generated 14 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +239 to +244

The pHash algorithm generates a compact fingerprint of the image's visual
content, such that visually similar images (e.g. resized, recompressed or
slightly modified versions of the same picture) produce similar hashes.
This makes it suitable for detecting identical or near-duplicate images,
such as renamed files with the same underlying content.
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring mentions that pHash produces "similar hashes" for "visually similar images" but this is misleading. Perceptual hashes are designed to produce IDENTICAL or very close hashes for identical/near-duplicate content, not just "similar" hashes. The rename detection logic depends on exact hash matches (line 179 in main.py uses get_images_by_hash(current_hash) which does equality comparison).

Consider clarifying that pHash produces identical hashes for identical visual content, and that this implementation is used for exact duplicate detection rather than similarity matching.

Suggested change
The pHash algorithm generates a compact fingerprint of the image's visual
content, such that visually similar images (e.g. resized, recompressed or
slightly modified versions of the same picture) produce similar hashes.
This makes it suitable for detecting identical or near-duplicate images,
such as renamed files with the same underlying content.
The pHash algorithm generates a compact, deterministic fingerprint of an
image's visual content. Identical visual content will produce identical
hashes, and near-duplicate content (e.g. resized or recompressed versions
of the same picture) will typically produce very similar hashes.
In this project, the returned hash is used for exact duplicate detection
via equality comparison (for example, to detect renamed files that contain
the same underlying image data), not for general similarity matching with
a distance threshold.

Copilot uses AI. Check for mistakes.

depends_on "rust" => :build # for safetensors
depends_on "certifi"
depends_on "libheif"
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The homebrew formula adds libheif as a dependency for HEIC support. This is correct, as pillow-heif requires the libheif library. However, verify that:

  1. The dependency name libheif is correct for Homebrew (it might be libheif or similar)
  2. Any version constraints for libheif compatibility are documented
  3. This dependency is available in the standard Homebrew repository

Consider adding a comment explaining that this is required for HEIC image format support via pillow-heif.

Suggested change
depends_on "libheif"
depends_on "libheif" # required for HEIC/HEIF image support via pillow-heif

Copilot uses AI. Check for mistakes.

# these images are always processed
IMAGE_EXT = ["jpg", "jpeg", "png", "webp"]
IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IMAGE_EXT list now includes "heic" for HEIC image support. Note that HEIC images can also have the ".heif" extension (HEIF is the container format, HEIC is Apple's specific implementation). Consider whether ".heif" should also be added to ensure compatibility with images that use the generic HEIF extension rather than Apple's HEIC extension.

Suggested change
IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]
IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic", "heif"]

Copilot uses AI. Check for mistakes.

class DB:
VERSION = 2
VERSION = 3
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states "Removed DB versioning system for simpler schema management," but the database versioning system is still present and functioning (DB.VERSION = 3 with migration logic from versions 1->2->3). This is actually correct behavior - the versioning should be kept - but the PR description is misleading.

Copilot uses AI. Check for mistakes.
Comment on lines +143 to +156
### How does **rclip** update the index?

When you run **rclip** in a directory that has already been processed, it will
only index the new images added since the last run and remove the deleted images
from its index. This makes consecutive runs much faster.

If you know that no new images were added or deleted since the last run, you can
use the `--no-indexing` (or `-n`) argument to skip the indexing step altogether
and speed up the search even more.

```bash
rclip -n cat
```

Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added section "How does rclip update the index?" (lines 143-156) is helpful, but it's somewhat redundant with the earlier paragraph at lines 86-87 which also mentions rename detection and automatic updates. Consider consolidating this information or making it clear that this section provides more detailed information about the update mechanism to avoid redundancy.

Copilot uses AI. Check for mistakes.
numpy = "~2.1.3"
open_clip_torch = "^3.1.0"
pillow = "^10.3.0"
pillow = "^12.0.0"
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Pillow dependency is being upgraded from ^10.3.0 to ^12.0.0, which is a major version bump (10 -> 12, skipping version 11). Major version updates can introduce breaking changes. While this may be necessary for HEIC support via pillow-heif (which requires >=11.1.0 per the pillow-heif dependencies), ensure that:

  1. This major version upgrade has been tested thoroughly
  2. Any breaking changes in Pillow 12.0.0 have been reviewed
  3. The minimum required version for pillow-heif compatibility is met

Consider documenting in the PR description or commit message why version 11 was skipped and if there are any known compatibility considerations.

Suggested change
pillow = "^12.0.0"
pillow = "^11.1.0"

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +108
try:
version_str = f"rclip {version('rclip')}"
except Exception: # PackageNotFoundError when not installed via package manager
version_str = "rclip (development)"
Copy link

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broad except Exception handler at line 107 catches any exception when getting the version, including PackageNotFoundError. While this makes the code work in development mode, it silently swallows all exceptions without logging. Consider:

  1. Catching the specific PackageNotFoundError exception for clarity
  2. Logging the exception in development mode so developers are aware of the fallback behavior

This would make debugging easier and the code's intent clearer.

Copilot uses AI. Check for mistakes.
Kaos599 and others added 3 commits December 27, 2025 19:47
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: do not re-index an image if it was renamed

4 participants