Implement image rename detection using perceptual hashes #218
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| # rclip - AI-Powered Command-Line Photo Search Tool | ||
| <!-- ALL-CONTRIBUTORS-BADGE:START - Do not remove or modify this section --> | ||
| [](#contributors-) | ||
| [](#contributors-) | ||
| <!-- ALL-CONTRIBUTORS-BADGE:END --> | ||
|
|
||
| [[Blog]](https://mikhalevi.ch/rclip-an-ai-powered-command-line-photo-search-tool/) [[Demo on YouTube]](https://www.youtube.com/watch?v=tAJHXOkHidw) [[Paper]](https://www.thinkmind.org/index.php?view=article&articleid=content_2023_1_20_60011) | ||
|
|
@@ -83,6 +83,8 @@ cd photos && rclip "search query" | |
|
|
||
| When you run **rclip** for the first time in a particular directory, it will extract features from the photos, which takes time. How long it will take depends on your CPU and the number of pictures you will search through. It took about a day to process 73 thousand photos on my NAS, which runs an old-ish Intel Celeron J3455, 7 minutes to index 50 thousand images on my MacBook with an M1 Max CPU, and three hours to process 1.28 million images on the same MacBook. | ||
|
|
||
| When you run **rclip** in a directory that has already been processed, it will only index the new images added since the last run and remove the deleted images from its index. Renamed images are also detected automatically using perceptual hashing, so you don't need to perform a full re-index when files are renamed. This makes consecutive runs much faster. | ||
|
|
||
| For a detailed demonstration, watch the video: https://www.youtube.com/watch?v=tAJHXOkHidw. | ||
|
|
||
| ### Similar image search | ||
|
|
@@ -138,6 +140,20 @@ rclip -p kitty | |
| ``` | ||
| </details> | ||
|
|
||
| ### How does **rclip** update the index? | ||
|
|
||
| When you run **rclip** in a directory that has already been processed, it will | ||
| only index the new images added since the last run and remove the deleted images | ||
| from its index. This makes consecutive runs much faster. | ||
|
|
||
| If you know that no new images were added or deleted since the last run, you can | ||
| use the `--no-indexing` (or `-n`) argument to skip the indexing step altogether | ||
| and speed up the search even more. | ||
|
|
||
| ```bash | ||
| rclip -n cat | ||
| ``` | ||
|
|
||
|
Comment on lines +143 to +156
|
||
| ## Get help | ||
|
|
||
| https://github.com/yurijmikhalevich/rclip/discussions/new/choose | ||
|
|
@@ -180,6 +196,7 @@ Thanks go to these wonderful people and organizations ([emoji key](https://allco | |
| <td align="center" valign="top" width="14.28%"><a href="http://abidkhan484.github.io"><img src="https://avatars.githubusercontent.com/u/15053047?v=4?s=100" width="100px;" alt="AbId KhAn"/><br /><sub><b>AbId KhAn</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=abidkhan484" title="Code">💻</a></td> | ||
| <td align="center" valign="top" width="14.28%"><a href="https://cl4r1ty.dev"><img src="https://avatars.githubusercontent.com/u/136800640?v=4?s=100" width="100px;" alt="Ben"/><br /><sub><b>Ben</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=Cl4r1ty-1" title="Code">💻</a></td> | ||
| <td align="center" valign="top" width="14.28%"><a href="https://techtracer.pages.dev"><img src="https://avatars.githubusercontent.com/u/48885301?v=4?s=100" width="100px;" alt="Tanmay Chaudhari"/><br /><sub><b>Tanmay Chaudhari</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=tanmayc07" title="Code">💻</a></td> | ||
| <td align="center" valign="top" width="14.28%"><a href="http://leoauri.com"><img src="https://avatars.githubusercontent.com/u/10868855?v=4?s=100" width="100px;" alt="Leo Auri"/><br /><sub><b>Leo Auri</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=leoauri" title="Code">💻</a></td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
|
|
||
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,6 +1,6 @@ | ||||||
| [tool.poetry] | ||||||
| name = "rclip" | ||||||
| version = "2.0.11" | ||||||
| version = "2.1.0" | ||||||
| description = "AI-Powered Command-Line Photo Search Tool" | ||||||
| authors = ["Yurij Mikhalevich <yurij@mikhalevi.ch>"] | ||||||
| license = "MIT" | ||||||
|
|
@@ -21,7 +21,8 @@ classifiers = [ | |||||
| python = ">=3.10 <3.13" | ||||||
| numpy = "~2.1.3" | ||||||
| open_clip_torch = "^3.1.0" | ||||||
| pillow = "^10.3.0" | ||||||
| pillow = "^12.0.0" | ||||||
|
||||||
| pillow = "^12.0.0" | |
| pillow = "^11.1.0" |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -6,6 +6,6 @@ | |||||
| IS_WINDOWS = sys.platform == "win32" or sys.platform == "cygwin" | ||||||
|
|
||||||
| # these images are always processed | ||||||
| IMAGE_EXT = ["jpg", "jpeg", "png", "webp"] | ||||||
| IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"] | ||||||
|
||||||
| IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"] | |
| IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic", "heif"] |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -13,14 +13,15 @@ class NewImage(ImageOmittable): | |||||||||||||||||||||||||||||||
| modified_at: float | ||||||||||||||||||||||||||||||||
| size: int | ||||||||||||||||||||||||||||||||
| vector: bytes | ||||||||||||||||||||||||||||||||
| hash: Optional[str] = None | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| class Image(NewImage): | ||||||||||||||||||||||||||||||||
| id: int | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| class DB: | ||||||||||||||||||||||||||||||||
| VERSION = 2 | ||||||||||||||||||||||||||||||||
| VERSION = 3 | ||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| def __init__(self, filename: Union[str, pathlib.Path]): | ||||||||||||||||||||||||||||||||
| self._con = sqlite3.connect(filename) | ||||||||||||||||||||||||||||||||
|
|
@@ -61,6 +62,15 @@ def ensure_version(self): | |||||||||||||||||||||||||||||||
| if db_version < 2: | ||||||||||||||||||||||||||||||||
| self._con.execute("ALTER TABLE images ADD COLUMN indexing BOOLEAN") | ||||||||||||||||||||||||||||||||
| db_version = 2 | ||||||||||||||||||||||||||||||||
| if db_version < 3: | ||||||||||||||||||||||||||||||||
| # Check if hash column already exists (it might from old code) | ||||||||||||||||||||||||||||||||
| columns = self._con.execute("PRAGMA table_info(images)").fetchall() | ||||||||||||||||||||||||||||||||
| column_names = [col["name"] for col in columns] | ||||||||||||||||||||||||||||||||
| if "hash" not in column_names: | ||||||||||||||||||||||||||||||||
| self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT") | ||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||
| self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT") | |
| self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT") | |
| # Rely on CREATE INDEX IF NOT EXISTS to handle cases where hash_index may already exist from old code |
Copilot AI, Dec 27, 2025:
The migration code at lines 66-72 checks if the hash column exists before adding it, which is good defensive programming. However, the comment says "it might from old code" suggesting there may have been previous iterations that added this column.
If users have databases from experimental/development versions with the hash column but no index, they would get the column check passing but still need the index. The current code handles this correctly with CREATE INDEX IF NOT EXISTS, but consider adding a comment explaining this handles both fresh migrations and partial migrations from development versions.
| # Check if hash column already exists (it might from old code) | |
| columns = self._con.execute("PRAGMA table_info(images)").fetchall() | |
| column_names = [col["name"] for col in columns] | |
| if "hash" not in column_names: | |
| self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT") | |
| # CREATE INDEX IF NOT EXISTS handles cases where hash_index may already exist from old code | |
| # Check if hash column already exists (it might from old/experimental code) | |
| # so we don't attempt to add it twice on databases from earlier dev builds. | |
| columns = self._con.execute("PRAGMA table_info(images)").fetchall() | |
| column_names = [col["name"] for col in columns] | |
| if "hash" not in column_names: | |
| self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT") | |
| # Always ensure the index exists: on fresh migrations this creates it, | |
| # and on partially migrated/experimental databases it is a no-op if the | |
| # index was already created by older code. |
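The migration pattern discussed in this thread can be tried in isolation. The following is a minimal sketch, not the PR's actual code: the function name `migrate_to_v3` is hypothetical, and the index definition `hash_index ON images(hash)` is an assumption based on the index name mentioned in the comments above, since the `CREATE INDEX` statement itself is not shown in this diff.

```python
import sqlite3


def migrate_to_v3(con: sqlite3.Connection) -> None:
    """Add the hash column and its index; intended to be safe to run repeatedly."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk),
    # so index 1 is the column name regardless of the connection's row_factory.
    columns = con.execute("PRAGMA table_info(images)").fetchall()
    column_names = [col[1] for col in columns]
    if "hash" not in column_names:
        con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
    # Assumed index definition; a no-op if the index already exists,
    # for example on a database touched by an earlier development build.
    con.execute("CREATE INDEX IF NOT EXISTS hash_index ON images(hash)")
    con.commit()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, filepath TEXT)")
migrate_to_v3(con)
migrate_to_v3(con)  # a second run should neither raise nor duplicate anything
```

Running the migration twice exercises both paths: the first call adds the column and the index, the second finds both already present.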
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -4,11 +4,13 @@ | |||||||
| import sys | ||||||||
| import threading | ||||||||
| from typing import Iterable, List, NamedTuple, Optional, Tuple, TypedDict, cast | ||||||||
| from typing import TYPE_CHECKING | ||||||||
|
|
||||||||
| import numpy as np | ||||||||
| from tqdm import tqdm | ||||||||
| import PIL | ||||||||
| from PIL import Image, ImageFile | ||||||||
| from pillow_heif import register_heif_opener | ||||||||
|
|
||||||||
| from rclip import db, fs, model | ||||||||
| from rclip.const import IMAGE_EXT, IMAGE_RAW_EXT | ||||||||
|
|
@@ -18,6 +20,7 @@ | |||||||
|
|
||||||||
|
|
||||||||
| ImageFile.LOAD_TRUNCATED_IMAGES = True | ||||||||
| register_heif_opener() | ||||||||
|
|
||||||||
|
|
||||||||
| class ImageMeta(TypedDict): | ||||||||
|
|
@@ -77,17 +80,21 @@ def _index_files(self, filepaths: List[str], metas: List[ImageMeta]): | |||||||
| filtered_paths.append(path) | ||||||||
| except PIL.UnidentifiedImageError: | ||||||||
| pass | ||||||||
| except Exception as ex: | ||||||||
| except (OSError, ValueError) as ex: | ||||||||
|
||||||||
| except (OSError, ValueError) as ex: | |
| except Exception as ex: |
Copilot AI, Dec 27, 2025:
The rename detection logic at lines 179-186 has a subtle bug. The get_images_by_hash query returns ALL non-deleted images with the matching hash across the entire database, not just from the directory currently being indexed.
The code then checks if each file exists on disk (line 184) to determine if it's a "renamed" file. However, this can lead to incorrect behavior when indexing nested directories or when files with the same hash exist in different directory trees:
Example problematic scenario:
- Previously indexed `/photos/old/image.jpg` with hash X
- File was deleted manually (not yet marked as deleted in DB)
- Now indexing `/photos/new/` directory which contains `new_image.jpg` with same hash X
- The code finds the deleted `/photos/old/image.jpg` entry, sees the file doesn't exist, and incorrectly reuses its vector
The fix should ensure that only images from the current directory being indexed (those with indexing=1) are considered as candidates for rename detection. Add a filter to check img_entry.get("indexing") == 1 before considering it as a renamed file.
| if not os.path.exists(img_entry["filepath"]): | |
| # Only consider images that are part of the current indexing pass (indexing == 1) | |
| if img_entry.get("indexing") == 1 and not os.path.exists(img_entry["filepath"]): |
We should keep this so that images we never encountered during reindexing are marked as deleted.
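The filter proposed in the Copilot suggestion can be sketched on its own. This is an illustration only: `find_rename_candidates` is a hypothetical helper, the entries are plain dictionaries rather than the project's row type, and it assumes (as the suggestion does) that `indexing == 1` marks entries belonging to the current indexing pass.

```python
import os
import tempfile
from typing import Any, Dict, List


def find_rename_candidates(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep entries from the current pass whose file no longer exists on disk."""
    return [
        entry
        for entry in entries
        if entry.get("indexing") == 1 and not os.path.exists(entry["filepath"])
    ]


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "kept.jpg")
    open(existing, "wb").close()  # create an empty placeholder file
    entries = [
        {"filepath": existing, "indexing": 1},  # still on disk: not a rename
        {"filepath": os.path.join(tmp, "old.jpg"), "indexing": 1},  # missing, in pass
        {"filepath": os.path.join(tmp, "other.jpg"), "indexing": 0},  # missing, other pass
    ]
    candidates = find_rename_candidates(entries)
```

Only the second entry survives the filter; the third is missing from disk too, but it is excluded because it is not part of the current pass.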
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -10,6 +10,7 @@ | |||||||||||||||||||||||||||||||||
| import requests | ||||||||||||||||||||||||||||||||||
| import sys | ||||||||||||||||||||||||||||||||||
| from importlib.metadata import version | ||||||||||||||||||||||||||||||||||
| import imagehash | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| from rclip.const import IMAGE_RAW_EXT, IS_LINUX, IS_MACOS, IS_WINDOWS | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
@@ -101,7 +102,10 @@ def init_arg_parser() -> argparse.ArgumentParser: | |||||||||||||||||||||||||||||||||
| "get help:\n" | ||||||||||||||||||||||||||||||||||
| " https://github.com/yurijmikhalevich/rclip/discussions/new/choose\n\n", | ||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||
| version_str = f"rclip {version('rclip')}" | ||||||||||||||||||||||||||||||||||
| try: | ||||||||||||||||||||||||||||||||||
| version_str = f"rclip {version('rclip')}" | ||||||||||||||||||||||||||||||||||
| except Exception: # PackageNotFoundError when not installed via package manager | ||||||||||||||||||||||||||||||||||
| version_str = "rclip (development)" | ||||||||||||||||||||||||||||||||||
|
Comment on lines +105 to +108
|
||||||||||||||||||||||||||||||||||
| parser.add_argument("--version", "-v", action="version", version=version_str, help=f'prints "{version_str}"') | ||||||||||||||||||||||||||||||||||
| parser.add_argument("query", help="a text query or a path/URL to an image file") | ||||||||||||||||||||||||||||||||||
| parser.add_argument( | ||||||||||||||||||||||||||||||||||
|
|
@@ -230,6 +234,18 @@ def read_raw_image_file(path: str): | |||||||||||||||||||||||||||||||||
| return Image.fromarray(np.array(rgb)) | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| def compute_image_hash(image: Image.Image) -> str: | ||||||||||||||||||||||||||||||||||
| """Compute a perceptual hash (pHash) for an image. | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| The pHash algorithm generates a compact fingerprint of the image's visual | ||||||||||||||||||||||||||||||||||
| content, such that visually similar images (e.g. resized, recompressed or | ||||||||||||||||||||||||||||||||||
| slightly modified versions of the same picture) produce similar hashes. | ||||||||||||||||||||||||||||||||||
| This makes it suitable for detecting identical or near-duplicate images, | ||||||||||||||||||||||||||||||||||
| such as renamed files with the same underlying content. | ||||||||||||||||||||||||||||||||||
|
Comment on lines +239 to +244
|
||||||||||||||||||||||||||||||||||
| The pHash algorithm generates a compact fingerprint of the image's visual | |
| content, such that visually similar images (e.g. resized, recompressed or | |
| slightly modified versions of the same picture) produce similar hashes. | |
| This makes it suitable for detecting identical or near-duplicate images, | |
| such as renamed files with the same underlying content. | |
| The pHash algorithm generates a compact, deterministic fingerprint of an | |
| image's visual content. Identical visual content will produce identical | |
| hashes, and near-duplicate content (e.g. resized or recompressed versions | |
| of the same picture) will typically produce very similar hashes. | |
| In this project, the returned hash is used for exact duplicate detection | |
| via equality comparison (for example, to detect renamed files that contain | |
| the same underlying image data), not for general similarity matching with | |
| a distance threshold. |
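The distinction drawn in the suggested docstring, equality comparison versus similarity matching with a distance threshold, can be illustrated without any imaging library. The sketch below assumes hashes are stored as hex strings (the diff only shows that `compute_image_hash` returns `str`); the helper names and the hash values are made up for illustration.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def is_same_content(hex_a: str, hex_b: str) -> bool:
    """Exact match, the comparison the suggested docstring describes."""
    return hex_a == hex_b


stored = "c3d4e5f60718293a"   # made-up 64-bit hash of an indexed image
renamed = "c3d4e5f60718293a"  # same content under a new file name
resaved = "c3d4e5f60718293b"  # made-up hash that differs in a single bit

same = is_same_content(stored, renamed)        # exact match: treated as a rename
near = is_same_content(stored, resaved)        # one bit off: not treated as a rename
distance = hamming_distance(stored, resaved)   # what a threshold-based matcher would inspect
```

Under equality comparison a hash that differs by even one bit is a different image, so a slightly re-encoded copy would be indexed as new rather than matched to the old entry.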
The documentation mentions that rclip updates the index by removing deleted images and indexing new ones, but it doesn't mention the new rename detection feature. Consider adding a brief explanation that renamed images are now automatically detected and don't require re-indexing, which is the main improvement introduced in this PR.