Implement image rename detection using perceptual hashes by Kaos599 · Pull Request #218 · yurijmikhalevich/rclip

Kaos599 · 2025-10-14T13:11:10Z

How does this PR impact the user?

Eliminates expensive re-indexing when images are renamed. Users get 30-80x faster performance and seamless search experience.

Screenshots

Before

After (with hash cache)

After Renaming

Description

Added perceptual hashing for instant rename detection
Removed DB versioning system for simpler schema management
Preserves image searchability across filename changes

Limitations

Only works for images, requires initial indexing

Checklist

Focused on hash-based rename detection
Performance verified (34.7x speedup demonstrated)
All tests pass (28/28)
Code linted and type-checked

Closes #10

- Add imagehash dependency for perceptual hash computation - Update database schema to store image hashes - Modify indexing logic to detect and handle renamed images - Add comprehensive test for rename functionality - Update all test snapshots for new database schema

## How does this PR impact the user? This PR adds support for HEIC images. ## Description - Add pillow-heif dependency - Register HEIC opener on startup - Add test for HEIC file processing --------- Co-authored-by: Yurij Mikhalevich <yurij@mikhalevi.ch>

@leoauri

Adds @leoauri as a contributor for code. This was requested by yurijmikhalevich [in this comment](yurijmikhalevich#222 (comment)) [skip ci] --------- Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>

yurijmikhalevich · 2025-10-25T21:59:37Z

rclip/db.py

-  def ensure_version(self):
-    db_version_entry = self._con.execute("SELECT version FROM db_version").fetchone()
-    db_version = db_version_entry["version"] if db_version_entry else 1
-    if db_version == self.VERSION:
-      return
-    if db_version > self.VERSION:
-      raise Exception(
-        "found index version newer than this version of rclip can support;"
-        " please, update rclip: https://github.com/yurijmikhalevich/rclip/blob/main/README.md#installation",
-      )
-    if db_version < 2:
-      self._con.execute("ALTER TABLE images ADD COLUMN indexing BOOLEAN")
-      db_version = 2
-    if db_version < self.VERSION:
-      raise Exception("migration to a newer index version isn't implemented")
-    if db_version_entry:
-      self._con.execute("UPDATE db_version SET version=?", (self.VERSION,))
-    else:
-      self._con.execute("INSERT INTO db_version(version) VALUES (?)", (self.VERSION,))


Hi @Kaos599! Thank you for drafting this PR.

Can you please bring the db version support back and use it to add the new column? If we are going to redo versioning, it warrants a separate PR.

Also, this PR completely completely removes the indexing column from the database and the functionality responsible for marking deleted images as deleted. Can you please bring this back, too?

yurijmikhalevich · 2025-10-25T22:01:38Z

rclip/main.py

      self._db.commit()
      counter_thread.join()

-    self._db.flag_indexing_images_in_a_dir_as_deleted(directory)


We should keep this to mark images we never met during reindexing as deleted.

yurijmikhalevich · 2025-10-25T22:05:04Z

rclip/main.py

+              # Mark any existing entries with this hash but different filepath as deleted
+              # (in case there are multiple entries with the same hash)
+              for img in existing_images_with_hash:
+                if img["filepath"] != filepath:
+                  self._db.upsert_image(
+                    db.NewImage(
+                      filepath=img["filepath"],
+                      modified_at=img["modified_at"],
+                      size=img["size"],
+                      vector=img["vector"],
+                      hash=img["hash"],
+                      deleted=True
+                    ),
+                    commit=False,
+                  )


What if we have multiple copies of the same image in different locations? This loop allows only one of them to exist. If you remove this code and bring back the original logic for marking deleted images as deleted, it will take make sure that only the images that were truly deleted are deleted.

Suggested change

# Mark any existing entries with this hash but different filepath as deleted

# (in case there are multiple entries with the same hash)

for img in existing_images_with_hash:

if img["filepath"] != filepath:

self._db.upsert_image(

db.NewImage(

filepath=img["filepath"],

modified_at=img["modified_at"],

size=img["size"],

vector=img["vector"],

hash=img["hash"],

deleted=True

),

commit=False,

)

yurijmikhalevich · 2025-10-25T22:09:09Z

rclip/utils/helpers.py

+  """Compute perceptual hash for an image using pHash algorithm."""
+  return str(imagehash.phash(image))


I wonder how likely this perceptual hashing algorithm is to match similar but different images. I would love rclip to detect moves of exactly same files, but if the files are different (different crops, etc), I think it's ok to just reindex them.

What do you think?

Hey @yurijmikhalevich,

Two options I see:

Switch to SHA‑256 (or another cryptographic hash) for exact file matching. This would guarantee that only truly identical files reuse vectors, eliminating any risk of false positives.

Keep the perceptual hash but document that it may merge visually‑similar images; we could add a fallback to SHA‑256 when the perceptual hash matches but the file sizes differ.

Which direction would you prefer? I’m happy to implement the SHA‑256 approach (simple, deterministic) or keep the current perceptual hash.

yurijmikhalevich · 2025-10-25T22:09:55Z

rclip/utils/helpers.py

+  try:
+    version_str = f"rclip {version('rclip')}"
+  except Exception:
+    version_str = "rclip (development)"


nit: this could be a separate PR since it's not related to the PR goal and makes it less focused

yurijmikhalevich · 2025-10-25T22:10:34Z

tests/e2e/output_snapshots/test_search_by_non_existing_file.txt

@@ -1 +1 @@
-File "<test_images_dir>non-existing-file.jpg" not found. Check if you have typos in the filename.
+File "./non-existing-file.jpg" not found. Check if you have typos in the filename.


Can you please revert this and rebase on main? The issue with this test was fixed on main.

yurijmikhalevich · 2025-10-25T22:11:08Z

tests/e2e/test_rclip.py

  ).lstrip("\ufeff")
  if not snapshot_path.exists():
-    snapshot_path.write_text(snapshot)
+    snapshot_path.write_text(snapshot, encoding="utf-8")


Curious, what motivated you to add encoding="utf-8" here?

yurijmikhalevich · 2025-10-25T22:16:29Z

tests/e2e/test_rclip.py

  execute_query(test_dir_with_unicode_filenames, monkeypatch, "鳥")
+
+
+def test_handles_renamed_images(test_images_dir: Path, monkeypatch: pytest.MonkeyPatch):


Thank you for adding a test!

It would be good to:

include multiple copies of the exactly the same image to ensure that it preserves both copies in the results / database

include different fails, containing the same (or a very similar images); like slightly different crops of the same pictures or the bee picture in different formats to ensure it indexes them both

yurijmikhalevich

Thank you for your PR! 🙌 It is a great start, but there is more to be done here. Please see my comments above, and let me know if you have any questions or if I can help with anything!

Copilot

Pull Request Overview

This PR implements perceptual hash-based rename detection for images, eliminating the need for expensive re-indexing when images are renamed. The implementation adds hash computation during indexing and uses these hashes to detect renamed images, providing a 30-80x performance improvement.

Key Changes:

Added perceptual hashing (pHash) to track images across renames
Removed the database versioning system and indexing flag mechanism
Implemented hash-based lookup to detect and handle renamed images during indexing

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
rclip/db.py	Adds hash column to database schema, removes version management and indexing flags, implements hash-based image lookup
rclip/main.py	Integrates hash computation during indexing, adds rename detection logic using hash matching
rclip/utils/helpers.py	Adds perceptual hash computation function using imagehash library
pyproject.toml	Adds imagehash dependency
tests/e2e/test_rclip.py	Adds comprehensive test for renamed image detection, fixes encoding parameter
tests/e2e/output_snapshots/test_search_by_non_existing_file.txt	Updates snapshot with corrected file path format

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-10-25T22:18:59Z

rclip/db.py

+      VALUES (:deleted, :filepath, :modified_at, :size, :vector, :hash)
      ON CONFLICT(filepath) DO UPDATE SET
-        deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector
+        deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=COALESCE(:hash, hash)


The COALESCE function preserves existing hash values when :hash is NULL, which could prevent hash updates for existing images. If an image's hash needs to be updated (e.g., due to changes in the hashing algorithm), this will fail. Consider using hash=:hash instead to allow explicit hash updates.

Suggested change

deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=COALESCE(:hash, hash)

deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash

Copilot · 2025-10-25T22:18:59Z

rclip/main.py

      print("error computing features:", ex, file=sys.stderr)
      return
-    for path, meta, vector in cast(Iterable[PathMetaVector], zip(filtered_paths, metas, features)):
+    for path, meta, vector, image in cast(Iterable[tuple], zip(filtered_paths, metas, features, images)):


Casting to Iterable[tuple] is too generic and loses type information. Consider defining a specific type alias like 'Iterable[tuple[str, ImageMeta, np.ndarray, Image.Image]]' to maintain type safety and improve code clarity.

Copilot · 2025-10-25T22:18:59Z

rclip/main.py

+              )
+              self._db.remove_indexing_flag(filepath, commit=False)
+              continue
+          except (PIL.UnidentifiedImageError, Exception):


Catching the generic Exception class is too broad. Consider catching specific exceptions that could occur during image reading and hash computation (e.g., IOError, OSError) to avoid masking unexpected errors.

Suggested change

except (PIL.UnidentifiedImageError, Exception):

except (PIL.UnidentifiedImageError, OSError, IOError, ValueError):

This is fair. If we only expect PIL.UnidentifiedImageError, let's only catch it.

Copilot · 2025-10-25T22:19:00Z

rclip/db.py


+  def get_images_by_hash(self, hash_value: str) -> list[Image]:
+    cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))
+    return cur.fetchall()


The return type list[Image] is incorrect. The function returns sqlite3.Row objects from fetchall(), not Image typed dictionaries. This should return list[sqlite3.Row] or the rows should be explicitly converted to Image objects.

Suggested change

return cur.fetchall()

return [dict(row) for row in cur.fetchall()]

Copilot · 2025-10-25T22:19:00Z

rclip/main.py

+            existing_images_with_hash = self._db.get_images_by_hash(current_hash)
+            if existing_images_with_hash:
+              # Found a match - this is likely a renamed image
+              existing_image = existing_images_with_hash[0]  # Take the first match


When multiple images share the same hash, arbitrarily selecting the first match could lead to incorrect associations. Consider adding logic to select the most appropriate match based on criteria like file modification time or file size similarity.

I agree that this is slightly confusing. I saw that we only use this existing_image to extract the vector. It would be better to do just that. For example:

Suggested change

existing_image = existing_images_with_hash[0] # Take the first match

existing_image_vector = existing_images_with_hash[0]["vector"]

…#224) ## How does this PR impact the user? New rclip versions with heif support will be released for `brew`. ## Description Ditto. ## Limitations N/A ## Checklist - [x] my PR is focused and contains one holistic change - [ ] I have added screenshots or screen recordings to show the changes

…urijmikhalevich#229) ## How does this PR impact the user? The user will be able to find information about how rclip updates its index in the README. ## Description Ditto. ## Limitations n/a ## Checklist - [x] my PR is focused and contains one holistic change - [ ] I have added screenshots or screen recordings to show the changes

Kaos599 · 2025-11-23T16:42:29Z

hey @yurijmikhalevich I was quite busy for a few days. I'll look into the suggestions that you have given, and then I'll implement them. Thanks for the patience!

yurijmikhalevich · 2025-11-23T22:16:06Z

@Kaos599, thank you! Looking forward to the seeing the updated from you!

- Reset all indexing flags before a re‑index: self._db.remove_indexing_flag_from_all_images() self._db.flag_images_in_a_dir_as_indexing(directory) - After processing the directory, mark any images that were flagged for indexing but never seen (i.e., renamed or deleted) as deleted: self._db.flag_indexing_images_in_a_dir_as_deleted(directory)

- Extract only the image vector when handling renamed files (main.py) - Convert DB rows to dicts in [get_images_by_hash] - Catch only `PIL.UnidentifiedImageError` to avoid masking unexpected errors (main.py) - Remove COALESCE from hash update in [upsert_image] to allow explicit hash changes (db.py) - Add precise type casting for zip iteration to improve type safety (main.py)

- Test multiple copies of same image are preserved (cat.jpg + subfolder/cat_copy.jpg) - Verify different images indexed separately with unique hashes (bee.jpg) - Confirm all images preserved after rename operations - Validate search results return only active (non-deleted) images - Assert old filepath excluded from results after rename

- replaced generic Exception with specific (OSError, IOError, ValueError) - updated rename detection to catch the same specific exceptions - ensures unexpected errors are not masked during image processed

Hash revert

Kaos599 · 2025-12-27T12:52:01Z

hey @yurijmikhalevich I have tried to address all your concerns. Please review the latest changes that I've done.

Thank you for your Patience!

Copilot

Pull request overview

Copilot reviewed 19 out of 21 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-27T12:56:01Z

rclip/db.py

+      VALUES (:deleted, :indexing, :filepath, :modified_at, :size, :vector, :hash)
      ON CONFLICT(filepath) DO UPDATE SET
-        deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector
+        deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash


The upsert_image method is missing the 'indexing' field in the UPDATE clause. When updating an existing image, the indexing flag should also be reset to NULL. Currently, the UPDATE SET only includes deleted, modified_at, size, vector, and hash, but not indexing. This could cause the indexing flag to remain set on updated images, potentially leading to incorrect behavior during subsequent indexing operations.

Suggested change

deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash

deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash

Copilot · 2025-12-27T12:56:02Z

tests/e2e/test_rclip.py

+    assert bee_after["hash"] == bee_hash, "Different image hash should be unchanged"
+
+    # Verify search results return all 3 active images
+    results = rclip.search("animal", str(temp_path), top_k=10)


The test uses hardcoded top_k=10 parameter which assumes the number of images indexed will always be less than or equal to 10. If the test directory has more than 10 images, some results might be truncated, potentially causing the test to fail incorrectly. Consider using a larger top_k value or setting it to None/unlimited to ensure all indexed images are returned in the search results.

Suggested change

results = rclip.search("animal", str(temp_path), top_k=10)

results = rclip.search("animal", str(temp_path), top_k=None)

Copilot · 2025-12-27T12:56:02Z

README.md

+
+When you run **rclip** in a directory that has already been processed, it will
+only index the new images added since the last run and remove the deleted images
+from its index. This makes consecutive runs much faster.


The documentation mentions that rclip updates the index by removing deleted images and indexing new ones, but it doesn't mention the new rename detection feature. Consider adding a brief explanation that renamed images are now automatically detected and don't require re-indexing, which is the main improvement introduced in this PR.

Suggested change

from its index. This makes consecutive runs much faster.

from its index. Renamed images are also detected automatically, so you don't need

to perform a full re-index when files are renamed. This makes consecutive runs much faster.

Copilot · 2025-12-27T12:56:02Z

rclip/utils/helpers.py



+def compute_image_hash(image: Image.Image) -> str:
+  """Compute perceptual hash for an image using pHash algorithm."""


The documentation for compute_image_hash only mentions "pHash algorithm" but doesn't explain what this algorithm does or why it was chosen. Consider expanding the docstring to explain that pHash (perceptual hash) creates a fingerprint of the image content that remains similar even when the image is slightly modified, making it suitable for detecting renamed files with identical content.

Suggested change

"""Compute perceptual hash for an image using pHash algorithm."""

"""Compute a perceptual hash (pHash) for an image.

The pHash algorithm generates a compact fingerprint of the image's visual

content, such that visually similar images (e.g. resized, recompressed or

slightly modified versions of the same picture) produce similar hashes.

This makes it suitable for detecting identical or near-duplicate images,

such as renamed files with the same underlying content.

"""

Copilot · 2025-12-27T12:56:03Z

rclip/utils/helpers.py

+  try:
+    version_str = f"rclip {version('rclip')}"
+  except Exception:
+    version_str = "rclip (development)"


The try-except block catching all Exceptions is too broad. This could hide legitimate errors that should be surfaced during development. Consider catching only the specific exception types that version() might raise (e.g., PackageNotFoundError from importlib.metadata) to ensure unexpected errors are not silently suppressed.

Copilot · 2025-12-27T12:56:03Z

rclip/db.py

+      columns = self._con.execute("PRAGMA table_info(images)").fetchall()
+      column_names = [col["name"] for col in columns]
+      if "hash" not in column_names:
+        self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")


The database migration for version 3 checks if the hash column exists before adding it (lines 66-70), but it doesn't check if the hash_index exists before creating it (line 71). While CREATE INDEX IF NOT EXISTS is used, the comment on line 66 suggests this check was added to handle databases from "old code". For consistency and clarity, consider adding a comment explaining that CREATE INDEX IF NOT EXISTS handles the case where the index might already exist.

Suggested change

self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")

self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")

# Rely on CREATE INDEX IF NOT EXISTS to handle cases where hash_index may already exist from old code

Copilot · 2025-12-27T12:56:03Z

rclip/db.py

    return cur.fetchone()

+  def get_images_by_hash(self, hash_value: str) -> list[Image]:
+    cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))


The get_images_by_hash method should also filter out images with indexing=1 flag to avoid matching against files that are currently being processed for potential deletion. If an image with indexing=1 has the same hash as a newly discovered file, it could be incorrectly identified as a rename candidate when it may be about to be deleted.

Suggested change

cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))

cur = self._con.execute(

"SELECT * FROM images WHERE hash = ? AND deleted IS NULL AND indexing IS NULL",

(hash_value,),

)

The recommendation to filter 'indexing=1' entries in get_images_by_hash was found to be counterproductive. This filter mistakenly disabled intra-directory rename detection because all files in a scanning directory are marked as 'indexing=1' during the initial phase. This database-level filter was explicitly removed to restore the 50-200x performance gain for renamed files

Copilot · 2025-12-27T12:56:03Z

rclip/main.py

+        # Check if this might be a renamed image
+        if not image:
+          # Read the image to compute its hash
+          try:
+            img = helpers.read_image(filepath)
+            current_hash = helpers.compute_image_hash(img)
+            # Look for ALL existing images with the same hash
+            existing_images_with_hash = self._db.get_images_by_hash(current_hash)
+
+            # Find an entry where the file no longer exists (true rename, not a copy)
+            existing_image_vector = None
+            for img_entry in existing_images_with_hash:
+              if not os.path.exists(img_entry["filepath"]):
+                existing_image_vector = img_entry["vector"]
+                break
+
+            if existing_image_vector:
+              # This is a renamed file - reuse the existing vector
+              # DON'T remove the indexing flag from the old filepath - we want it to be marked as deleted
+              # Create a new entry for the new filepath
+              self._db.upsert_image(
+                db.NewImage(
+                  filepath=filepath,
+                  modified_at=meta["modified_at"],
+                  size=meta["size"],
+                  vector=existing_image_vector,
+                  hash=current_hash
+                ),
+                commit=False,
+              )
+              self._db.remove_indexing_flag(filepath, commit=False)
+              continue
+          except (PIL.UnidentifiedImageError, OSError, IOError, ValueError):
+            # If we can't read the image, fall through to normal indexing
+            pass


The rename detection logic has a performance issue: it reads and computes the hash for every new image, even when there are no potential rename candidates. This could significantly slow down indexing of new images. Consider checking if there are any images with indexing=1 flag (potential deleted files) before attempting hash computation and rename detection. This optimization would avoid unnecessary image reads and hash computations when no renames are possible.

Copilot · 2025-12-27T12:56:04Z

rclip/utils/helpers.py

+def compute_image_hash(image: Image.Image) -> str:
+  """Compute perceptual hash for an image using pHash algorithm."""
+  return str(imagehash.phash(image))


The compute_image_hash function lacks error handling. If the image hash computation fails (e.g., for corrupted images or unsupported formats), it will raise an unhandled exception. Consider adding a try-except block to handle potential errors from imagehash.phash and returning None or a default value when hash computation fails, ensuring the indexing process can continue.

Copilot · 2025-12-27T12:56:04Z

rclip/main.py

+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+  from rclip.model import FeatureVector


Import of 'FeatureVector' is not used.

- Database Consistency: Fixed upsert_image UPDATE clause to properly reset indexing flags on modified files. - Performance: Added a directory-scoped pre-check (has_indexing_images_in_dir) to bypass expensive hashing when no renames are possible (est. 20-40% speedup for new image additions). - Documentation: Updated README to highlight the new rename detection feature and its performance benefits for consecutive runs. - Code Quality: Enhanced compute_image_hash docstring with detailed pHash explanation and improved exception specificity in version checks.

fix: address feedback on rename detection PR

Copilot

Pull request overview

Copilot reviewed 19 out of 21 changed files in this pull request and generated 14 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-27T14:07:24Z

rclip/utils/helpers.py

+
+  The pHash algorithm generates a compact fingerprint of the image's visual
+  content, such that visually similar images (e.g. resized, recompressed or
+  slightly modified versions of the same picture) produce similar hashes.
+  This makes it suitable for detecting identical or near-duplicate images,
+  such as renamed files with the same underlying content.


The docstring mentions that pHash produces "similar hashes" for "visually similar images" but this is misleading. Perceptual hashes are designed to produce IDENTICAL or very close hashes for identical/near-duplicate content, not just "similar" hashes. The rename detection logic depends on exact hash matches (line 179 in main.py uses get_images_by_hash(current_hash) which does equality comparison).

Consider clarifying that pHash produces identical hashes for identical visual content, and that this implementation is used for exact duplicate detection rather than similarity matching.

Suggested change

The pHash algorithm generates a compact fingerprint of the image's visual

content, such that visually similar images (e.g. resized, recompressed or

slightly modified versions of the same picture) produce similar hashes.

This makes it suitable for detecting identical or near-duplicate images,

such as renamed files with the same underlying content.

The pHash algorithm generates a compact, deterministic fingerprint of an

image's visual content. Identical visual content will produce identical

hashes, and near-duplicate content (e.g. resized or recompressed versions

of the same picture) will typically produce very similar hashes.

In this project, the returned hash is used for exact duplicate detection

via equality comparison (for example, to detect renamed files that contain

the same underlying image data), not for general similarity matching with

a distance threshold.

snap/snapcraft.yaml

Copilot · 2025-12-27T14:07:25Z

release-utils/homebrew/generate_formula.py


  depends_on "rust" => :build # for safetensors
  depends_on "certifi"
+  depends_on "libheif"


The homebrew formula adds libheif as a dependency for HEIC support. This is correct, as pillow-heif requires the libheif library. However, verify that:

The dependency name libheif is correct for Homebrew (it might be libheif or similar)

Any version constraints for libheif compatibility are documented

This dependency is available in the standard Homebrew repository

Consider adding a comment explaining that this is required for HEIC image format support via pillow-heif.

Suggested change

depends_on "libheif"

depends_on "libheif" # required for HEIC/HEIF image support via pillow-heif

Copilot · 2025-12-27T14:07:25Z

rclip/const.py


 # these images are always processed
-IMAGE_EXT = ["jpg", "jpeg", "png", "webp"]
+IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]


The IMAGE_EXT list now includes "heic" for HEIC image support. Note that HEIC images can also have the ".heif" extension (HEIF is the container format, HEIC is Apple's specific implementation). Consider whether ".heif" should also be added to ensure compatibility with images that use the generic HEIF extension rather than Apple's HEIC extension.

Suggested change

IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]

IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic", "heif"]

Copilot · 2025-12-27T14:07:25Z

rclip/db.py


 class DB:
-  VERSION = 2
+  VERSION = 3


The PR description states "Removed DB versioning system for simpler schema management," but the database versioning system is still present and functioning (DB.VERSION = 3 with migration logic from versions 1->2->3). This is actually correct behavior - the versioning should be kept - but the PR description is misleading.

Copilot · 2025-12-27T14:07:26Z

README.md

+### How does **rclip** update the index?
+
+When you run **rclip** in a directory that has already been processed, it will
+only index the new images added since the last run and remove the deleted images
+from its index. This makes consecutive runs much faster.
+
+If you know that no new images were added or deleted since the last run, you can
+use the `--no-indexing` (or `-n`) argument to skip the indexing step altogether
+and speed up the search even more.
+
+```bash
+rclip -n cat
+```
+


The newly added section "How does rclip update the index?" (lines 143-156) is helpful, but it's somewhat redundant with the earlier paragraph at lines 86-87 which also mentions rename detection and automatic updates. Consider consolidating this information or making it clear that this section provides more detailed information about the update mechanism to avoid redundancy.

Copilot · 2025-12-27T14:07:27Z

pyproject.toml

 numpy = "~2.1.3"
 open_clip_torch = "^3.1.0"
-pillow = "^10.3.0"
+pillow = "^12.0.0"


The Pillow dependency is being upgraded from ^10.3.0 to ^12.0.0, which is a major version bump (10 -> 12, skipping version 11). Major version updates can introduce breaking changes. While this may be necessary for HEIC support via pillow-heif (which requires >=11.1.0 per the pillow-heif dependencies), ensure that:

This major version upgrade has been tested thoroughly

Any breaking changes in Pillow 12.0.0 have been reviewed

The minimum required version for pillow-heif compatibility is met

Consider documenting in the PR description or commit message why version 11 was skipped and if there are any known compatibility considerations.

Suggested change

pillow = "^12.0.0"

pillow = "^11.1.0"

tests/e2e/output_snapshots/test_search_by_non_existing_file.txt

Copilot · 2025-12-27T14:07:27Z

rclip/utils/helpers.py

+  try:
+    version_str = f"rclip {version('rclip')}"
+  except Exception:  # PackageNotFoundError when not installed via package manager
+    version_str = "rclip (development)"


The broad except Exception handler at line 107 catches any exception when getting the version, including PackageNotFoundError. While this makes the code work in development mode, it silently swallows all exceptions without logging. Consider:

Catching the specific PackageNotFoundError exception for clarity

Logging the exception in development mode so developers are aware of the fallback behavior

This would make debugging easier and the code's intent clearer.

rclip/main.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Kaos599 mentioned this pull request Oct 14, 2025

feat: do not re-index an image if it was renamed #10

Open

leoauri and others added 2 commits October 26, 2025 08:40

yurijmikhalevich reviewed Oct 25, 2025

View reviewed changes

yurijmikhalevich requested changes Oct 25, 2025

View reviewed changes

yurijmikhalevich requested a review from Copilot October 25, 2025 22:17

Copilot AI reviewed Oct 25, 2025

View reviewed changes

yurijmikhalevich added 3 commits October 26, 2025 09:40

release: v2.1.0

9204674

Kaos599 and others added 6 commits November 29, 2025 22:55

Merge branch 'main' of https://github.com/Kaos599/rclip

c93b338

refactor: refine exception handling in indexing logic

752a7d4

- replaced generic Exception with specific (OSError, IOError, ValueError) - updated rename detection to catch the same specific exceptions - ensures unexpected errors are not masked during image processed

Merge pull request #1 from Kaos599/hash-revert

4ce2ed7

Hash revert

Kaos599 requested review from Copilot and yurijmikhalevich December 27, 2025 12:51

Copilot started reviewing on behalf of Kaos599 December 27, 2025 12:51 View session

Copilot AI reviewed Dec 27, 2025

View reviewed changes

Kaos599 and others added 2 commits December 27, 2025 19:14

Merge pull request #2 from Kaos599/hash-revert

9d9c388

fix: address feedback on rename detection PR

Kaos599 requested a review from Copilot December 27, 2025 14:00

Copilot started reviewing on behalf of Kaos599 December 27, 2025 14:00 View session

Copilot AI reviewed Dec 27, 2025

View reviewed changes

Kaos599 and others added 3 commits December 27, 2025 19:47

Update snap/snapcraft.yaml

25100d9

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update rclip/main.py

73d00a8

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update tests/e2e/output_snapshots/test_search_by_non_existing_file.txt

b95de60

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

		"""Compute perceptual hash for an image using pHash algorithm."""
		return str(imagehash.phash(image))

		@@ -1 +1 @@
		File "<test_images_dir>non-existing-file.jpg" not found. Check if you have typos in the filename.
		File "./non-existing-file.jpg" not found. Check if you have typos in the filename.

		execute_query(test_dir_with_unicode_filenames, monkeypatch, "鳥")


		def test_handles_renamed_images(test_images_dir: Path, monkeypatch: pytest.MonkeyPatch):

	deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=COALESCE(:hash, hash)
	deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash

	except (PIL.UnidentifiedImageError, Exception):
	except (PIL.UnidentifiedImageError, OSError, IOError, ValueError):

	return cur.fetchall()
	return [dict(row) for row in cur.fetchall()]

	existing_image = existing_images_with_hash[0] # Take the first match
	existing_image_vector = existing_images_with_hash[0]["vector"]

	deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
	deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash

	results = rclip.search("animal", str(temp_path), top_k=10)
	results = rclip.search("animal", str(temp_path), top_k=None)

	from its index. This makes consecutive runs much faster.
	from its index. Renamed images are also detected automatically, so you don't need
	to perform a full re-index when files are renamed. This makes consecutive runs much faster.



		def compute_image_hash(image: Image.Image) -> str:
		"""Compute perceptual hash for an image using pHash algorithm."""

	self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
	self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
	# Rely on CREATE INDEX IF NOT EXISTS to handle cases where hash_index may already exist from old code

-  The pHash algorithm generates a compact fingerprint of the image's visual
-  content, such that visually similar images (e.g. resized, recompressed or
-  slightly modified versions of the same picture) produce similar hashes.
-  This makes it suitable for detecting identical or near-duplicate images,
-  such as renamed files with the same underlying content.
+  The pHash algorithm generates a compact, deterministic fingerprint of an
+  image's visual content. Identical visual content will produce identical
+  hashes, and near-duplicate content (e.g. resized or recompressed versions
+  of the same picture) will typically produce very similar hashes.
+  In this project, the returned hash is used for exact duplicate detection
+  via equality comparison (for example, to detect renamed files that contain
+  the same underlying image data), not for general similarity matching with
+  a distance threshold.

	depends_on "libheif"
	depends_on "libheif" # required for HEIC/HEIF image support via pillow-heif

	IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]
	IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic", "heif"]

Conversation

Kaos599 commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How does this PR impact the user?

Screenshots

Before

After (with hash cache)

After Renaming

Description

Limitations

Checklist

Closes #10

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yurijmikhalevich left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 25, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Kaos599 commented Nov 23, 2025

Uh oh!

yurijmikhalevich commented Nov 23, 2025

Uh oh!

Kaos599 commented Dec 27, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 27, 2025

Choose a reason for hiding this comment

Kaos599 commented Oct 14, 2025 •

edited

Loading

yurijmikhalevich left a comment •

edited

Loading