Implement image rename detection using perceptual hashes #218
Kaos599 wants to merge 17 commits into yurijmikhalevich:main
Conversation
- Add imagehash dependency for perceptual hash computation
- Update database schema to store image hashes
- Modify indexing logic to detect and handle renamed images
- Add comprehensive test for rename functionality
- Update all test snapshots for new database schema
## How does this PR impact the user?

This PR adds support for HEIC images.

## Description

- Add pillow-heif dependency
- Register HEIC opener on startup
- Add test for HEIC file processing

Co-authored-by: Yurij Mikhalevich <yurij@mikhalevi.ch>
Adds @leoauri as a contributor for code. This was requested by yurijmikhalevich [in this comment](yurijmikhalevich#222 (comment)) [skip ci] --------- Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>
```python
def ensure_version(self):
  db_version_entry = self._con.execute("SELECT version FROM db_version").fetchone()
  db_version = db_version_entry["version"] if db_version_entry else 1
  if db_version == self.VERSION:
    return
  if db_version > self.VERSION:
    raise Exception(
      "found index version newer than this version of rclip can support;"
      " please, update rclip: https://github.com/yurijmikhalevich/rclip/blob/main/README.md#installation",
    )
  if db_version < 2:
    self._con.execute("ALTER TABLE images ADD COLUMN indexing BOOLEAN")
    db_version = 2
  if db_version < self.VERSION:
    raise Exception("migration to a newer index version isn't implemented")
  if db_version_entry:
    self._con.execute("UPDATE db_version SET version=?", (self.VERSION,))
  else:
    self._con.execute("INSERT INTO db_version(version) VALUES (?)", (self.VERSION,))
```
Hi @Kaos599! Thank you for drafting this PR.
Can you please bring the db version support back and use it to add the new column? If we are going to redo versioning, it warrants a separate PR.
Also, this PR completely removes the indexing column from the database and the functionality responsible for marking deleted images as deleted. Can you please bring this back, too?
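To make the request concrete, here is a minimal sketch of how the new hash column could be slotted into the existing version-based migration chain. This mirrors the `ensure_version` snippet above, but the standalone `migrate` helper and the choice of version 3 are illustrative assumptions, not the project's actual code:

```python
import sqlite3

VERSION = 3  # assumed next schema version after the existing 2

def migrate(con: sqlite3.Connection, db_version: int) -> int:
  # Each step upgrades one schema version, matching the style of ensure_version
  if db_version < 2:
    con.execute("ALTER TABLE images ADD COLUMN indexing BOOLEAN")
    db_version = 2
  if db_version < 3:
    # The new hash column and its lookup index land here instead of
    # replacing the versioning machinery
    con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
    con.execute("CREATE INDEX IF NOT EXISTS hash_index ON images(hash)")
    db_version = 3
  return db_version
```

The point is that adding a column becomes one more `if db_version < N` step; redoing versioning itself stays out of scope for this PR.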
```python
    self._db.commit()
    counter_thread.join()

    self._db.flag_indexing_images_in_a_dir_as_deleted(directory)
```
We should keep this to mark images we never met during reindexing as deleted.
rclip/main.py
```python
# Mark any existing entries with this hash but different filepath as deleted
# (in case there are multiple entries with the same hash)
for img in existing_images_with_hash:
  if img["filepath"] != filepath:
    self._db.upsert_image(
      db.NewImage(
        filepath=img["filepath"],
        modified_at=img["modified_at"],
        size=img["size"],
        vector=img["vector"],
        hash=img["hash"],
        deleted=True
      ),
      commit=False,
    )
```
What if we have multiple copies of the same image in different locations? This loop allows only one of them to exist. If you remove this code and bring back the original logic for marking deleted images as deleted, it will make sure that only the images that were truly deleted are marked as deleted.
rclip/utils/helpers.py
```python
"""Compute perceptual hash for an image using pHash algorithm."""
return str(imagehash.phash(image))
```
I wonder how likely this perceptual hashing algorithm is to match similar but different images. I would love rclip to detect moves of exactly the same files, but if the files are different (different crops, etc.), I think it's ok to just reindex them.
What do you think?
Hey @yurijmikhalevich,
Two options I see:
Switch to SHA‑256 (or another cryptographic hash) for exact file matching. This would guarantee that only truly identical files reuse vectors, eliminating any risk of false positives.
Keep the perceptual hash but document that it may merge visually‑similar images; we could add a fallback to SHA‑256 when the perceptual hash matches but the file sizes differ.
Which direction would you prefer? I’m happy to implement the SHA‑256 approach (simple, deterministic) or keep the current perceptual hash.
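For reference, the SHA-256 option discussed above would amount to hashing the raw file bytes, so only byte-identical files can ever match. A sketch, not the PR's code; `compute_file_hash` is a hypothetical helper name:

```python
import hashlib

def compute_file_hash(filepath: str) -> str:
  """Hash the file's bytes; only byte-identical files produce equal hashes."""
  h = hashlib.sha256()
  with open(filepath, "rb") as f:
    # Read in 1 MiB chunks so large images don't need to fit in memory
    for chunk in iter(lambda: f.read(1 << 20), b""):
      h.update(chunk)
  return h.hexdigest()
```

Unlike pHash, this has no false positives across different images, but it also won't match the same picture re-encoded in a different format.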
```python
try:
  version_str = f"rclip {version('rclip')}"
except Exception:
  version_str = "rclip (development)"
```
nit: this could be a separate PR since it's not related to the PR goal and makes it less focused
```diff
@@ -1 +1 @@
-File "<test_images_dir>non-existing-file.jpg" not found. Check if you have typos in the filename.
+File "./non-existing-file.jpg" not found. Check if you have typos in the filename.
```
Can you please revert this and rebase on main? The issue with this test was fixed on main.
```diff
   ).lstrip("\ufeff")
   if not snapshot_path.exists():
-    snapshot_path.write_text(snapshot)
+    snapshot_path.write_text(snapshot, encoding="utf-8")
```
Curious, what motivated you to add encoding="utf-8" here?
```python
  execute_query(test_dir_with_unicode_filenames, monkeypatch, "鳥")


def test_handles_renamed_images(test_images_dir: Path, monkeypatch: pytest.MonkeyPatch):
```
Thank you for adding a test!
It would be good to:
- include multiple copies of exactly the same image to ensure that it preserves both copies in the results / database
- include different files containing the same (or very similar) images, like slightly different crops of the same picture or the bee picture in different formats, to ensure it indexes them both
Pull Request Overview
This PR implements perceptual hash-based rename detection for images, eliminating the need for expensive re-indexing when images are renamed. The implementation adds hash computation during indexing and uses these hashes to detect renamed images, providing a 30-80x performance improvement.
Key Changes:
- Added perceptual hashing (pHash) to track images across renames
- Removed the database versioning system and indexing flag mechanism
- Implemented hash-based lookup to detect and handle renamed images during indexing
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| rclip/db.py | Adds hash column to database schema, removes version management and indexing flags, implements hash-based image lookup |
| rclip/main.py | Integrates hash computation during indexing, adds rename detection logic using hash matching |
| rclip/utils/helpers.py | Adds perceptual hash computation function using imagehash library |
| pyproject.toml | Adds imagehash dependency |
| tests/e2e/test_rclip.py | Adds comprehensive test for renamed image detection, fixes encoding parameter |
| tests/e2e/output_snapshots/test_search_by_non_existing_file.txt | Updates snapshot with corrected file path format |
rclip/db.py
```diff
   VALUES (:deleted, :filepath, :modified_at, :size, :vector, :hash)
   ON CONFLICT(filepath) DO UPDATE SET
-    deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector
+    deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=COALESCE(:hash, hash)
```
The COALESCE function preserves existing hash values when :hash is NULL, which could prevent hash updates for existing images. If an image's hash needs to be updated (e.g., due to changes in the hashing algorithm), this will fail. Consider using hash=:hash instead to allow explicit hash updates.
Suggested change:

```diff
-deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=COALESCE(:hash, hash)
+deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
```
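A tiny standalone demonstration of the behavioral difference being discussed, using an in-memory SQLite table loosely shaped like rclip's `images` table (demo code, not project code; it uses `excluded.hash` where the PR uses a named parameter):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE images (filepath TEXT PRIMARY KEY, hash TEXT)")
con.execute("INSERT INTO images VALUES ('a.jpg', 'old-hash')")

# With COALESCE, upserting a NULL hash preserves the stored value...
con.execute(
  "INSERT INTO images VALUES (?, ?) ON CONFLICT(filepath)"
  " DO UPDATE SET hash=COALESCE(excluded.hash, hash)",
  ("a.jpg", None),
)
kept = con.execute("SELECT hash FROM images WHERE filepath='a.jpg'").fetchone()[0]

# ...while a plain assignment overwrites it with NULL
con.execute(
  "INSERT INTO images VALUES (?, ?) ON CONFLICT(filepath)"
  " DO UPDATE SET hash=excluded.hash",
  ("a.jpg", None),
)
overwritten = con.execute("SELECT hash FROM images WHERE filepath='a.jpg'").fetchone()[0]
```

So `COALESCE` protects existing hashes from accidental clearing, at the cost of making an intentional hash update (e.g. after an algorithm change) impossible without extra SQL.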
rclip/main.py
```diff
   print("error computing features:", ex, file=sys.stderr)
   return
-for path, meta, vector in cast(Iterable[PathMetaVector], zip(filtered_paths, metas, features)):
+for path, meta, vector, image in cast(Iterable[tuple], zip(filtered_paths, metas, features, images)):
```
Casting to Iterable[tuple] is too generic and loses type information. Consider defining a specific type alias like 'Iterable[tuple[str, ImageMeta, np.ndarray, Image.Image]]' to maintain type safety and improve code clarity.
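A sketch of what the suggested alias could look like. Names here are illustrative, and `Any` stands in for the numpy vector and PIL image types so the sketch stays dependency-free:

```python
from typing import Any, Iterable, Tuple, TypedDict

class ImageMeta(TypedDict):
  modified_at: float
  size: int

# One named alias instead of the untyped `tuple`; in the real code the two
# `Any` slots would be the feature vector and PIL.Image.Image types
PathMetaVectorImage = Tuple[str, ImageMeta, Any, Any]

def iter_indexed(paths, metas, vectors, images) -> Iterable[PathMetaVectorImage]:
  # zip keeps each path aligned with its metadata, vector, and image
  return zip(paths, metas, vectors, images)
```

With the alias, type checkers can verify each element access instead of treating every unpacked value as untyped.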
rclip/main.py
```python
      )
      self._db.remove_indexing_flag(filepath, commit=False)
      continue
    except (PIL.UnidentifiedImageError, Exception):
```
Catching the generic Exception class is too broad. Consider catching specific exceptions that could occur during image reading and hash computation (e.g., IOError, OSError) to avoid masking unexpected errors.
Suggested change:

```diff
-except (PIL.UnidentifiedImageError, Exception):
+except (PIL.UnidentifiedImageError, OSError, IOError, ValueError):
```
This is fair. If we only expect PIL.UnidentifiedImageError, let's only catch it.
rclip/db.py
```python
def get_images_by_hash(self, hash_value: str) -> list[Image]:
  cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))
  return cur.fetchall()
```
The return type list[Image] is incorrect. The function returns sqlite3.Row objects from fetchall(), not Image typed dictionaries. This should return list[sqlite3.Row] or the rows should be explicitly converted to Image objects.
Suggested change:

```diff
-return cur.fetchall()
+return [dict(row) for row in cur.fetchall()]
```
rclip/main.py
```python
existing_images_with_hash = self._db.get_images_by_hash(current_hash)
if existing_images_with_hash:
  # Found a match - this is likely a renamed image
  existing_image = existing_images_with_hash[0]  # Take the first match
```
When multiple images share the same hash, arbitrarily selecting the first match could lead to incorrect associations. Consider adding logic to select the most appropriate match based on criteria like file modification time or file size similarity.
I agree that this is slightly confusing. I saw that we only use this existing_image to extract the vector. It would be better to do just that. For example:
Suggested change:

```diff
-existing_image = existing_images_with_hash[0]  # Take the first match
+existing_image_vector = existing_images_with_hash[0]["vector"]
```
…#224)

## How does this PR impact the user?

New rclip versions with heif support will be released for `brew`.

## Description

Ditto.

## Limitations

N/A

## Checklist

- [x] my PR is focused and contains one holistic change
- [ ] I have added screenshots or screen recordings to show the changes
…urijmikhalevich#229)

## How does this PR impact the user?

The user will be able to find information about how rclip updates its index in the README.

## Description

Ditto.

## Limitations

n/a

## Checklist

- [x] my PR is focused and contains one holistic change
- [ ] I have added screenshots or screen recordings to show the changes
hey @yurijmikhalevich I was quite busy for a few days. I'll look into the suggestions that you have given, and then I'll implement them. Thanks for the patience!

@Kaos599, thank you! Looking forward to seeing the updates from you!
- Reset all indexing flags before a re-index:
  `self._db.remove_indexing_flag_from_all_images()`
  `self._db.flag_images_in_a_dir_as_indexing(directory)`
- After processing the directory, mark any images that were flagged for indexing but never seen (i.e., renamed or deleted) as deleted:
  `self._db.flag_indexing_images_in_a_dir_as_deleted(directory)`
- Extract only the image vector when handling renamed files (main.py)
- Convert DB rows to dicts in `get_images_by_hash` (db.py)
- Catch only `PIL.UnidentifiedImageError` to avoid masking unexpected errors (main.py)
- Remove COALESCE from the hash update in `upsert_image` to allow explicit hash changes (db.py)
- Add precise type casting for zip iteration to improve type safety (main.py)
- Test multiple copies of the same image are preserved (cat.jpg + subfolder/cat_copy.jpg)
- Verify different images indexed separately with unique hashes (bee.jpg)
- Confirm all images preserved after rename operations
- Validate search results return only active (non-deleted) images
- Assert old filepath excluded from results after rename
- replaced generic Exception with specific (OSError, IOError, ValueError)
- updated rename detection to catch the same specific exceptions
- ensures unexpected errors are not masked during image processing
Hash revert
hey @yurijmikhalevich I have tried to address all your concerns. Please review the latest changes I've made. Thank you for your patience!
Pull request overview
Copilot reviewed 19 out of 21 changed files in this pull request and generated 10 comments.
rclip/db.py
```diff
   VALUES (:deleted, :indexing, :filepath, :modified_at, :size, :vector, :hash)
   ON CONFLICT(filepath) DO UPDATE SET
-    deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector
+    deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
```
The upsert_image method is missing the 'indexing' field in the UPDATE clause. When updating an existing image, the indexing flag should also be reset to NULL. Currently, the UPDATE SET only includes deleted, modified_at, size, vector, and hash, but not indexing. This could cause the indexing flag to remain set on updated images, potentially leading to incorrect behavior during subsequent indexing operations.
Suggested change:

```diff
-deleted=:deleted, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
+deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
```
```python
assert bee_after["hash"] == bee_hash, "Different image hash should be unchanged"

# Verify search results return all 3 active images
results = rclip.search("animal", str(temp_path), top_k=10)
```
The test uses hardcoded top_k=10 parameter which assumes the number of images indexed will always be less than or equal to 10. If the test directory has more than 10 images, some results might be truncated, potentially causing the test to fail incorrectly. Consider using a larger top_k value or setting it to None/unlimited to ensure all indexed images are returned in the search results.
Suggested change:

```diff
-results = rclip.search("animal", str(temp_path), top_k=10)
+results = rclip.search("animal", str(temp_path), top_k=None)
```
> When you run **rclip** in a directory that has already been processed, it will
> only index the new images added since the last run and remove the deleted images
> from its index. This makes consecutive runs much faster.
The documentation mentions that rclip updates the index by removing deleted images and indexing new ones, but it doesn't mention the new rename detection feature. Consider adding a brief explanation that renamed images are now automatically detected and don't require re-indexing, which is the main improvement introduced in this PR.
Suggested change:

```diff
-from its index. This makes consecutive runs much faster.
+from its index. Renamed images are also detected automatically, so you don't need
+to perform a full re-index when files are renamed. This makes consecutive runs much faster.
```
rclip/utils/helpers.py
```python
def compute_image_hash(image: Image.Image) -> str:
  """Compute perceptual hash for an image using pHash algorithm."""
```
The documentation for compute_image_hash only mentions "pHash algorithm" but doesn't explain what this algorithm does or why it was chosen. Consider expanding the docstring to explain that pHash (perceptual hash) creates a fingerprint of the image content that remains similar even when the image is slightly modified, making it suitable for detecting renamed files with identical content.
Suggested change:

```python
"""Compute a perceptual hash (pHash) for an image.

The pHash algorithm generates a compact fingerprint of the image's visual
content, such that visually similar images (e.g. resized, recompressed or
slightly modified versions of the same picture) produce similar hashes.
This makes it suitable for detecting identical or near-duplicate images,
such as renamed files with the same underlying content.
"""
```
```python
try:
  version_str = f"rclip {version('rclip')}"
except Exception:
  version_str = "rclip (development)"
```
The try-except block catching all Exceptions is too broad. This could hide legitimate errors that should be surfaced during development. Consider catching only the specific exception types that version() might raise (e.g., PackageNotFoundError from importlib.metadata) to ensure unexpected errors are not silently suppressed.
```python
columns = self._con.execute("PRAGMA table_info(images)").fetchall()
column_names = [col["name"] for col in columns]
if "hash" not in column_names:
  self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
```
The database migration for version 3 checks if the hash column exists before adding it (lines 66-70), but it doesn't check if the hash_index exists before creating it (line 71). While CREATE INDEX IF NOT EXISTS is used, the comment on line 66 suggests this check was added to handle databases from "old code". For consistency and clarity, consider adding a comment explaining that CREATE INDEX IF NOT EXISTS handles the case where the index might already exist.
Suggested change:

```python
self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
# Rely on CREATE INDEX IF NOT EXISTS to handle cases where hash_index may already exist from old code
```
rclip/db.py
```python
  return cur.fetchone()


def get_images_by_hash(self, hash_value: str) -> list[Image]:
  cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))
```
The get_images_by_hash method should also filter out images with indexing=1 flag to avoid matching against files that are currently being processed for potential deletion. If an image with indexing=1 has the same hash as a newly discovered file, it could be incorrectly identified as a rename candidate when it may be about to be deleted.
Suggested change:

```diff
-cur = self._con.execute("SELECT * FROM images WHERE hash = ? AND deleted IS NULL", (hash_value,))
+cur = self._con.execute(
+  "SELECT * FROM images WHERE hash = ? AND deleted IS NULL AND indexing IS NULL",
+  (hash_value,),
+)
```
The recommendation to filter `indexing=1` entries in `get_images_by_hash` was found to be counterproductive. This filter mistakenly disabled intra-directory rename detection because all files in a scanned directory are marked as `indexing=1` during the initial phase. The database-level filter was explicitly removed to restore the 50-200x performance gain for renamed files.
```python
# Check if this might be a renamed image
if not image:
  # Read the image to compute its hash
  try:
    img = helpers.read_image(filepath)
    current_hash = helpers.compute_image_hash(img)
    # Look for ALL existing images with the same hash
    existing_images_with_hash = self._db.get_images_by_hash(current_hash)

    # Find an entry where the file no longer exists (true rename, not a copy)
    existing_image_vector = None
    for img_entry in existing_images_with_hash:
      if not os.path.exists(img_entry["filepath"]):
        existing_image_vector = img_entry["vector"]
        break

    if existing_image_vector:
      # This is a renamed file - reuse the existing vector
      # DON'T remove the indexing flag from the old filepath - we want it to be marked as deleted
      # Create a new entry for the new filepath
      self._db.upsert_image(
        db.NewImage(
          filepath=filepath,
          modified_at=meta["modified_at"],
          size=meta["size"],
          vector=existing_image_vector,
          hash=current_hash
        ),
        commit=False,
      )
      self._db.remove_indexing_flag(filepath, commit=False)
      continue
  except (PIL.UnidentifiedImageError, OSError, IOError, ValueError):
    # If we can't read the image, fall through to normal indexing
    pass
```
The rename detection logic has a performance issue: it reads and computes the hash for every new image, even when there are no potential rename candidates. This could significantly slow down indexing of new images. Consider checking if there are any images with indexing=1 flag (potential deleted files) before attempting hash computation and rename detection. This optimization would avoid unnecessary image reads and hash computations when no renames are possible.
```python
def compute_image_hash(image: Image.Image) -> str:
  """Compute perceptual hash for an image using pHash algorithm."""
  return str(imagehash.phash(image))
```
The compute_image_hash function lacks error handling. If the image hash computation fails (e.g., for corrupted images or unsupported formats), it will raise an unhandled exception. Consider adding a try-except block to handle potential errors from imagehash.phash and returning None or a default value when hash computation fails, ensuring the indexing process can continue.
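One possible shape for the suggested defensive wrapper. This is a sketch: the hash function is injected as a parameter so the example is self-contained, whereas the PR calls `imagehash.phash` directly:

```python
from typing import Any, Callable, Optional

def compute_image_hash_safe(image: Any, hasher: Callable[[Any], Any]) -> Optional[str]:
  """Return the image's hash as a string, or None if hashing fails."""
  try:
    return str(hasher(image))
  except (OSError, ValueError):
    # Corrupted or unsupported image: signal "no hash" instead of
    # aborting the whole indexing run
    return None
```

Callers would then treat a `None` hash as "skip rename detection for this file" and fall through to normal indexing.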
rclip/main.py
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
  from rclip.model import FeatureVector
```
Import of 'FeatureVector' is not used.
- Database Consistency: Fixed upsert_image UPDATE clause to properly reset indexing flags on modified files.
- Performance: Added a directory-scoped pre-check (has_indexing_images_in_dir) to bypass expensive hashing when no renames are possible (est. 20-40% speedup for new image additions).
- Documentation: Updated README to highlight the new rename detection feature and its performance benefits for consecutive runs.
- Code Quality: Enhanced compute_image_hash docstring with detailed pHash explanation and improved exception specificity in version checks.
fix: address feedback on rename detection PR
Pull request overview
Copilot reviewed 19 out of 21 changed files in this pull request and generated 14 comments.
```python
The pHash algorithm generates a compact fingerprint of the image's visual
content, such that visually similar images (e.g. resized, recompressed or
slightly modified versions of the same picture) produce similar hashes.
This makes it suitable for detecting identical or near-duplicate images,
such as renamed files with the same underlying content.
```
The docstring mentions that pHash produces "similar hashes" for "visually similar images" but this is misleading. Perceptual hashes are designed to produce IDENTICAL or very close hashes for identical/near-duplicate content, not just "similar" hashes. The rename detection logic depends on exact hash matches (line 179 in main.py uses get_images_by_hash(current_hash) which does equality comparison).
Consider clarifying that pHash produces identical hashes for identical visual content, and that this implementation is used for exact duplicate detection rather than similarity matching.
Suggested change:

```python
The pHash algorithm generates a compact, deterministic fingerprint of an
image's visual content. Identical visual content will produce identical
hashes, and near-duplicate content (e.g. resized or recompressed versions
of the same picture) will typically produce very similar hashes.

In this project, the returned hash is used for exact duplicate detection
via equality comparison (for example, to detect renamed files that contain
the same underlying image data), not for general similarity matching with
a distance threshold.
```
```ruby
depends_on "rust" => :build # for safetensors
depends_on "certifi"
depends_on "libheif"
```
The homebrew formula adds libheif as a dependency for HEIC support. This is correct, as pillow-heif requires the libheif library. However, verify that:
- The dependency name `libheif` is correct for Homebrew (it might be `libheif` or similar)
- Any version constraints for libheif compatibility are documented
- This dependency is available in the standard Homebrew repository

Consider adding a comment explaining that this is required for HEIC image format support via pillow-heif.
Suggested change:

```diff
-depends_on "libheif"
+depends_on "libheif" # required for HEIC/HEIF image support via pillow-heif
```
```diff
 # these images are always processed
-IMAGE_EXT = ["jpg", "jpeg", "png", "webp"]
+IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]
```
The IMAGE_EXT list now includes "heic" for HEIC image support. Note that HEIC images can also have the ".heif" extension (HEIF is the container format, HEIC is Apple's specific implementation). Consider whether ".heif" should also be added to ensure compatibility with images that use the generic HEIF extension rather than Apple's HEIC extension.
Suggested change:

```diff
-IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]
+IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic", "heif"]
```
```diff
 class DB:
-  VERSION = 2
+  VERSION = 3
```
The PR description states "Removed DB versioning system for simpler schema management," but the database versioning system is still present and functioning (DB.VERSION = 3 with migration logic from versions 1->2->3). This is actually correct behavior - the versioning should be kept - but the PR description is misleading.
> ### How does **rclip** update the index?
>
> When you run **rclip** in a directory that has already been processed, it will
> only index the new images added since the last run and remove the deleted images
> from its index. This makes consecutive runs much faster.
>
> If you know that no new images were added or deleted since the last run, you can
> use the `--no-indexing` (or `-n`) argument to skip the indexing step altogether
> and speed up the search even more.
>
> ```bash
> rclip -n cat
> ```
The newly added section "How does rclip update the index?" (lines 143-156) is helpful, but it's somewhat redundant with the earlier paragraph at lines 86-87 which also mentions rename detection and automatic updates. Consider consolidating this information or making it clear that this section provides more detailed information about the update mechanism to avoid redundancy.
```diff
 numpy = "~2.1.3"
 open_clip_torch = "^3.1.0"
-pillow = "^10.3.0"
+pillow = "^12.0.0"
```
The Pillow dependency is being upgraded from ^10.3.0 to ^12.0.0, which is a major version bump (10 -> 12, skipping version 11). Major version updates can introduce breaking changes. While this may be necessary for HEIC support via pillow-heif (which requires >=11.1.0 per the pillow-heif dependencies), ensure that:
- This major version upgrade has been tested thoroughly
- Any breaking changes in Pillow 12.0.0 have been reviewed
- The minimum required version for pillow-heif compatibility is met
Consider documenting in the PR description or commit message why version 11 was skipped and if there are any known compatibility considerations.
Suggested change:

```diff
-pillow = "^12.0.0"
+pillow = "^11.1.0"
```
tests/e2e/output_snapshots/test_search_by_non_existing_file.txt
```python
try:
  version_str = f"rclip {version('rclip')}"
except Exception:  # PackageNotFoundError when not installed via package manager
  version_str = "rclip (development)"
```
The broad except Exception handler at line 107 catches any exception when getting the version, including PackageNotFoundError. While this makes the code work in development mode, it silently swallows all exceptions without logging. Consider:
- Catching the specific `PackageNotFoundError` exception for clarity
- Logging the exception in development mode so developers are aware of the fallback behavior

This would make debugging easier and the code's intent clearer.
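A minimal sketch of the narrower handling, using the `importlib.metadata` API the `version()` call comes from (the `get_version_str` helper name is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def get_version_str(package: str) -> str:
  """Build the version banner, falling back only when the package isn't installed."""
  try:
    return f"rclip {version(package)}"
  except PackageNotFoundError:
    # Running from a source checkout, not an installed distribution
    return "rclip (development)"
```

Any other exception would now propagate instead of being silently mapped to the development fallback.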
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
How does this PR impact the user?
Eliminates expensive re-indexing when images are renamed. Users get 30-80x faster performance and seamless search experience.
Screenshots
Before
After (with hash cache)
After Renaming
Description
Limitations
Checklist
Closes #10