9 changes: 9 additions & 0 deletions .all-contributorsrc
@@ -49,6 +49,15 @@
"contributions": [
"code"
]
},
{
"login": "leoauri",
"name": "Leo Auri",
"avatar_url": "https://avatars.githubusercontent.com/u/10868855?v=4",
"profile": "http://leoauri.com",
"contributions": [
"code"
]
}
],
"contributorsPerLine": 7,
19 changes: 18 additions & 1 deletion README.md
@@ -1,6 +1,6 @@
# rclip - AI-Powered Command-Line Photo Search Tool
<!-- ALL-CONTRIBUTORS-BADGE:START - Do not remove or modify this section -->
[![All Contributors](https://img.shields.io/badge/all_contributors-5-orange.svg?style=flat-square)](#contributors-)
[![All Contributors](https://img.shields.io/badge/all_contributors-6-orange.svg?style=flat-square)](#contributors-)
<!-- ALL-CONTRIBUTORS-BADGE:END -->

[[Blog]](https://mikhalevi.ch/rclip-an-ai-powered-command-line-photo-search-tool/) [[Demo on YouTube]](https://www.youtube.com/watch?v=tAJHXOkHidw) [[Paper]](https://www.thinkmind.org/index.php?view=article&articleid=content_2023_1_20_60011)
@@ -83,6 +83,8 @@ cd photos && rclip "search query"

When you run **rclip** for the first time in a particular directory, it will extract features from the photos, which takes time. How long it will take depends on your CPU and the number of pictures you will search through. It took about a day to process 73 thousand photos on my NAS, which runs an old-ish Intel Celeron J3455, 7 minutes to index 50 thousand images on my MacBook with an M1 Max CPU, and three hours to process 1.28 million images on the same MacBook.

When you run **rclip** in a directory that has already been processed, it will only index the new images added since the last run and remove the deleted images from its index. Renamed images are also detected automatically using perceptual hashing, so you don't need to perform a full re-index when files are renamed. This makes consecutive runs much faster.

For a detailed demonstration, watch the video: https://www.youtube.com/watch?v=tAJHXOkHidw.

### Similar image search
@@ -138,6 +140,20 @@ rclip -p kitty
```
</details>

### How does **rclip** update the index?

When you run **rclip** in a directory that has already been processed, it will
only index the new images added since the last run and remove the deleted images
from its index. This makes consecutive runs much faster.
Copilot AI Dec 27, 2025

The documentation mentions that rclip updates the index by removing deleted images and indexing new ones, but it doesn't mention the new rename detection feature. Consider adding a brief explanation that renamed images are now automatically detected and don't require re-indexing, which is the main improvement introduced in this PR.

Suggested change
- from its index. This makes consecutive runs much faster.
+ from its index. Renamed images are also detected automatically, so you don't need
+ to perform a full re-index when files are renamed. This makes consecutive runs much faster.


If you know that no new images were added or deleted since the last run, you can
use the `--no-indexing` (or `-n`) argument to skip the indexing step altogether
and speed up the search even more.

```bash
rclip -n cat
```

Comment on lines +143 to +156
Copilot AI Dec 27, 2025

The newly added section "How does rclip update the index?" (lines 143-156) is helpful, but it's somewhat redundant with the earlier paragraph at lines 86-87 which also mentions rename detection and automatic updates. Consider consolidating this information or making it clear that this section provides more detailed information about the update mechanism to avoid redundancy.

## Get help

https://github.com/yurijmikhalevich/rclip/discussions/new/choose
@@ -180,6 +196,7 @@ Thanks go to these wonderful people and organizations ([emoji key](https://allco
<td align="center" valign="top" width="14.28%"><a href="http://abidkhan484.github.io"><img src="https://avatars.githubusercontent.com/u/15053047?v=4?s=100" width="100px;" alt="AbId KhAn"/><br /><sub><b>AbId KhAn</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=abidkhan484" title="Code">💻</a></td>
<td align="center" valign="top" width="14.28%"><a href="https://cl4r1ty.dev"><img src="https://avatars.githubusercontent.com/u/136800640?v=4?s=100" width="100px;" alt="Ben"/><br /><sub><b>Ben</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=Cl4r1ty-1" title="Code">💻</a></td>
<td align="center" valign="top" width="14.28%"><a href="https://techtracer.pages.dev"><img src="https://avatars.githubusercontent.com/u/48885301?v=4?s=100" width="100px;" alt="Tanmay Chaudhari"/><br /><sub><b>Tanmay Chaudhari</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=tanmayc07" title="Code">💻</a></td>
<td align="center" valign="top" width="14.28%"><a href="http://leoauri.com"><img src="https://avatars.githubusercontent.com/u/10868855?v=4?s=100" width="100px;" alt="Leo Auri"/><br /><sub><b>Leo Auri</b></sub></a><br /><a href="https://github.com/yurijmikhalevich/rclip/commits?author=leoauri" title="Code">💻</a></td>
</tr>
</tbody>
</table>
257 changes: 170 additions & 87 deletions poetry.lock

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "rclip"
version = "2.0.11"
version = "2.1.0"
description = "AI-Powered Command-Line Photo Search Tool"
authors = ["Yurij Mikhalevich <yurij@mikhalevi.ch>"]
license = "MIT"
@@ -21,7 +21,8 @@ classifiers = [
python = ">=3.10 <3.13"
numpy = "~2.1.3"
open_clip_torch = "^3.1.0"
pillow = "^10.3.0"
pillow = "^12.0.0"
Copilot AI Dec 27, 2025

The Pillow dependency is being upgraded from ^10.3.0 to ^12.0.0, which is a major version bump (10 -> 12, skipping version 11). Major version updates can introduce breaking changes. While this may be necessary for HEIC support via pillow-heif (which requires >=11.1.0 per the pillow-heif dependencies), ensure that:

  1. This major version upgrade has been tested thoroughly
  2. Any breaking changes in Pillow 12.0.0 have been reviewed
  3. The minimum required version for pillow-heif compatibility is met

Consider documenting in the PR description or commit message why version 11 was skipped and if there are any known compatibility considerations.

Suggested change
- pillow = "^12.0.0"
+ pillow = "^11.1.0"

pillow-heif = "^1.1.1"
requests = "~=2.32"
torch = [
{ version = "==2.5.1", source = "pypi", markers = "sys_platform != 'linux' or platform_machine == 'aarch64'" },
@@ -33,6 +34,7 @@ torchvision = [
]
tqdm = "^4.65.0"
rawpy = "^0.24.0"
imagehash = "^4.3.1"

[tool.poetry.group.dev.dependencies]
pyright = {extras = ["nodejs"], version = "^1.1.394"}
2 changes: 1 addition & 1 deletion rclip/const.py
@@ -6,6 +6,6 @@
IS_WINDOWS = sys.platform == "win32" or sys.platform == "cygwin"

# these images are always processed
IMAGE_EXT = ["jpg", "jpeg", "png", "webp"]
IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]
Copilot AI Dec 27, 2025

The IMAGE_EXT list now includes "heic" for HEIC image support. Note that HEIC images can also have the ".heif" extension (HEIF is the container format, HEIC is Apple's specific implementation). Consider whether ".heif" should also be added to ensure compatibility with images that use the generic HEIF extension rather than Apple's HEIC extension.

Suggested change
- IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]
+ IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic", "heif"]

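As a sketch of the behavior this comment describes, the following shows how an extension allow-list treats `.heif` files with and without the suggested entry. The `is_processed` helper is illustrative only, not rclip's actual code:

```python
# Illustrative sketch (not rclip's real implementation): how
# extension-based filtering behaves with and without "heif".
import os

IMAGE_EXT = ["jpg", "jpeg", "png", "webp", "heic"]          # current list
IMAGE_EXT_WITH_HEIF = IMAGE_EXT + ["heif"]                  # reviewer's suggestion

def is_processed(filepath: str, allowed: list[str]) -> bool:
    # Compare the lowercased extension (without the dot) against the list
    ext = os.path.splitext(filepath)[1].lstrip(".").lower()
    return ext in allowed

print(is_processed("IMG_0001.HEIC", IMAGE_EXT))             # True
print(is_processed("IMG_0002.heif", IMAGE_EXT))             # False: skipped today
print(is_processed("IMG_0002.heif", IMAGE_EXT_WITH_HEIF))   # True
```

With the current list, files carrying the generic `.heif` extension would be silently skipped even though pillow-heif can decode them.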
# RAW images are processed only if there is no processed image alongside it
IMAGE_RAW_EXT = ["arw", "cr2"]
49 changes: 44 additions & 5 deletions rclip/db.py
@@ -13,14 +13,15 @@ class NewImage(ImageOmittable):
modified_at: float
size: int
vector: bytes
hash: Optional[str] = None


class Image(NewImage):
id: int


class DB:
VERSION = 2
VERSION = 3
Copilot AI Dec 27, 2025

The PR description states "Removed DB versioning system for simpler schema management," but the database versioning system is still present and functioning (DB.VERSION = 3 with migration logic from versions 1->2->3). This is actually correct behavior - the versioning should be kept - but the PR description is misleading.


def __init__(self, filename: Union[str, pathlib.Path]):
self._con = sqlite3.connect(filename)
@@ -61,6 +62,15 @@ def ensure_version(self):
if db_version < 2:
self._con.execute("ALTER TABLE images ADD COLUMN indexing BOOLEAN")
db_version = 2
if db_version < 3:
# Check if hash column already exists (it might from old code)
columns = self._con.execute("PRAGMA table_info(images)").fetchall()
column_names = [col["name"] for col in columns]
if "hash" not in column_names:
self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
Copilot AI Dec 27, 2025

The database migration for version 3 checks if the hash column exists before adding it (lines 66-70), but it doesn't check if the hash_index exists before creating it (line 71). While CREATE INDEX IF NOT EXISTS is used, the comment on line 66 suggests this check was added to handle databases from "old code". For consistency and clarity, consider adding a comment explaining that CREATE INDEX IF NOT EXISTS handles the case where the index might already exist.

Suggested change
- self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
+ self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
+ # Rely on CREATE INDEX IF NOT EXISTS to handle cases where hash_index may already exist from old code
# CREATE INDEX IF NOT EXISTS handles cases where hash_index may already exist from old code
Comment on lines +66 to +71
Copilot AI Dec 27, 2025

The migration code at lines 66-72 checks if the hash column exists before adding it, which is good defensive programming. However, the comment says "it might from old code" suggesting there may have been previous iterations that added this column.

If users have databases from experimental/development versions with the hash column but no index, they would get the column check passing but still need the index. The current code handles this correctly with CREATE INDEX IF NOT EXISTS, but consider adding a comment explaining this handles both fresh migrations and partial migrations from development versions.

Suggested change
- # Check if hash column already exists (it might from old code)
- columns = self._con.execute("PRAGMA table_info(images)").fetchall()
- column_names = [col["name"] for col in columns]
- if "hash" not in column_names:
-     self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
- # CREATE INDEX IF NOT EXISTS handles cases where hash_index may already exist from old code
+ # Check if hash column already exists (it might from old/experimental code)
+ # so we don't attempt to add it twice on databases from earlier dev builds.
+ columns = self._con.execute("PRAGMA table_info(images)").fetchall()
+ column_names = [col["name"] for col in columns]
+ if "hash" not in column_names:
+     self._con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
+ # Always ensure the index exists: on fresh migrations this creates it,
+ # and on partially migrated/experimental databases it is a no-op if the
+ # index was already created by older code.

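The idempotency being discussed can be demonstrated in isolation. The sketch below uses an in-memory SQLite database with a simplified stand-in for rclip's `images` table; running the migration twice is harmless:

```python
# Minimal sketch of the idempotent version-3 migration, against a
# simplified stand-in schema (not rclip's full table definition).
import sqlite3

def migrate_to_v3(con: sqlite3.Connection) -> None:
    # PRAGMA table_info lists existing columns, so re-running the
    # migration on a partially migrated database skips the ALTER
    columns = [row[1] for row in con.execute("PRAGMA table_info(images)")]
    if "hash" not in columns:
        con.execute("ALTER TABLE images ADD COLUMN hash TEXT")
    # IF NOT EXISTS makes the partial-index creation idempotent as well
    con.execute(
        "CREATE INDEX IF NOT EXISTS hash_index ON images(hash) WHERE deleted IS NULL"
    )
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE images (filepath TEXT PRIMARY KEY, deleted BOOLEAN)")
migrate_to_v3(con)
migrate_to_v3(con)  # safe to run twice
columns = [row[1] for row in con.execute("PRAGMA table_info(images)")]
print(columns)  # includes 'hash' exactly once
```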
self._con.execute("CREATE INDEX IF NOT EXISTS hash_index ON images(hash) WHERE deleted IS NULL")
db_version = 3
if db_version < self.VERSION:
raise Exception("migration to a newer index version isn't implemented")
if db_version_entry:
@@ -69,27 +79,35 @@ def ensure_version(self):
self._con.execute("INSERT INTO db_version(version) VALUES (?)", (self.VERSION,))
self._con.commit()


def commit(self):
self._con.commit()

def upsert_image(self, image: NewImage, commit: bool = True):
self._con.execute(
"""
INSERT INTO images(deleted, indexing, filepath, modified_at, size, vector)
VALUES (:deleted, :indexing, :filepath, :modified_at, :size, :vector)
INSERT INTO images(deleted, indexing, filepath, modified_at, size, vector, hash)
VALUES (:deleted, :indexing, :filepath, :modified_at, :size, :vector, :hash)
ON CONFLICT(filepath) DO UPDATE SET
deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector
deleted=:deleted, indexing=:indexing, modified_at=:modified_at, size=:size, vector=:vector, hash=:hash
""",
{"deleted": None, "indexing": None, **image},
)
if commit:
self._con.commit()


def remove_indexing_flag_from_all_images(self, commit: bool = True):
self._con.execute("UPDATE images SET indexing = NULL")
if commit:
self._con.commit()

def remove_indexing_flag_from_dir(self, path: str, commit: bool = True):
"""Remove indexing flag only from images within a specific directory."""
self._con.execute("UPDATE images SET indexing = NULL WHERE filepath LIKE ?", (path + f"{os.path.sep}%",))
if commit:
self._con.commit()

def flag_images_in_a_dir_as_indexing(self, path: str, commit: bool = True):
self._con.execute("UPDATE images SET indexing = 1 WHERE filepath LIKE ?", (path + f"{os.path.sep}%",))
if commit:
@@ -108,10 +126,31 @@ def remove_indexing_flag(self, filepath: str, commit: bool = True):
self._con.commit()

def get_image(self, **kwargs: Any) -> Optional[Image]:
query = " AND ".join(f"{key}=:{key}" for key in kwargs)
query_parts = [f"{key}=:{key}" for key in kwargs]
query_parts.append("deleted IS NULL")
query = " AND ".join(query_parts)
cur = self._con.execute(f"SELECT * FROM images WHERE {query} LIMIT 1", kwargs)
return cur.fetchone()

def get_images_by_hash(self, hash_value: str) -> list[Image]:
cur = self._con.execute(
"SELECT * FROM images WHERE hash = ? AND deleted IS NULL",
(hash_value,),
)
return [dict(row) for row in cur.fetchall()]

def has_indexing_images_in_dir(self, path: str) -> bool:
"""Check if there are any images with indexing=1 flag in this directory.

Used to optimize rename detection: only compute hashes when there are
potential deletions to match against.
"""
cur = self._con.execute(
"SELECT 1 FROM images WHERE filepath LIKE ? AND indexing = 1 LIMIT 1",
(path + f"{os.path.sep}%",)
)
return cur.fetchone() is not None

def get_image_vectors_by_dir_path(self, path: str) -> sqlite3.Cursor:
return self._con.execute(
"SELECT filepath, vector FROM images WHERE filepath LIKE ? AND deleted IS NULL", (path + f"{os.path.sep}%",)
64 changes: 58 additions & 6 deletions rclip/main.py
@@ -4,11 +4,13 @@
import sys
import threading
from typing import Iterable, List, NamedTuple, Optional, Tuple, TypedDict, cast
from typing import TYPE_CHECKING

import numpy as np
from tqdm import tqdm
import PIL
from PIL import Image, ImageFile
from pillow_heif import register_heif_opener

from rclip import db, fs, model
from rclip.const import IMAGE_EXT, IMAGE_RAW_EXT
@@ -18,6 +20,7 @@


ImageFile.LOAD_TRUNCATED_IMAGES = True
register_heif_opener()


class ImageMeta(TypedDict):
@@ -77,17 +80,21 @@ def _index_files(self, filepaths: List[str], metas: List[ImageMeta]):
filtered_paths.append(path)
except PIL.UnidentifiedImageError:
pass
except Exception as ex:
except (OSError, ValueError) as ex:
Copilot AI Dec 27, 2025

The exception handler catches (OSError, ValueError) but this is more restrictive than the original Exception handler. While this is generally better practice, it could miss some edge cases. For example, RuntimeError or other exceptions from image processing libraries might not be caught. Consider if this change is intentional and whether it adequately covers all error cases that can occur during image reading.

Suggested change
- except (OSError, ValueError) as ex:
+ except Exception as ex:

print(f"error loading image {path}:", ex, file=sys.stderr)

try:
features = self._model.compute_image_features(images)
except Exception as ex:
print("error computing features:", ex, file=sys.stderr)
return
for path, meta, vector in cast(Iterable[PathMetaVector], zip(filtered_paths, metas, features)):
for path, meta, vector, image in cast(
Iterable[Tuple[str, ImageMeta, 'FeatureVector', Image.Image]],
zip(filtered_paths, metas, features, images)
):
hash_value = helpers.compute_image_hash(image)
self._db.upsert_image(
db.NewImage(filepath=path, modified_at=meta["modified_at"], size=meta["size"], vector=vector.tobytes()),
db.NewImage(filepath=path, modified_at=meta["modified_at"], size=meta["size"], vector=vector.tobytes(), hash=hash_value),
commit=False,
)

@@ -110,8 +117,9 @@ def ensure_index(self, directory: str):
file=sys.stderr,
)

self._db.remove_indexing_flag_from_all_images(commit=False)
self._db.flag_images_in_a_dir_as_indexing(directory, commit=True)
# Initialize indexing workflow: reset flags for this directory, then mark it for reindexing
self._db.remove_indexing_flag_from_dir(directory)
self._db.flag_images_in_a_dir_as_indexing(directory)

with tqdm(total=None, unit="images") as pbar:

@@ -151,9 +159,49 @@ def update_total_images(count: int):

image = self._db.get_image(filepath=filepath)
if image and is_image_meta_equal(image, meta):
# Image hasn't changed, remove indexing flag to mark it as still present
self._db.remove_indexing_flag(filepath, commit=False)
continue

# Check if this might be a renamed image
# Only attempt rename detection if there are potential deletions to match against
has_potential_deletions = self._db.has_indexing_images_in_dir(directory)

if not image and has_potential_deletions:
# Read the image to compute its hash
try:
img = helpers.read_image(filepath)
current_hash = helpers.compute_image_hash(img)
# Look for ALL existing images with the same hash
existing_images_with_hash = self._db.get_images_by_hash(current_hash)

# Find an entry where the file no longer exists (true rename, not a copy)
existing_image_vector = None
for img_entry in existing_images_with_hash:
if not os.path.exists(img_entry["filepath"]):
Copilot AI Dec 27, 2025

The rename detection logic at lines 179-186 has a subtle bug. The get_images_by_hash query returns ALL non-deleted images with the matching hash across the entire database, not just from the directory currently being indexed.

The code then checks if each file exists on disk (line 184) to determine if it's a "renamed" file. However, this can lead to incorrect behavior when indexing nested directories or when files with the same hash exist in different directory trees:

Example problematic scenario:

  1. Previously indexed /photos/old/image.jpg with hash X
  2. File was deleted manually (not yet marked as deleted in DB)
  3. Now indexing /photos/new/ directory which contains new_image.jpg with same hash X
  4. The code finds the deleted /photos/old/image.jpg entry, sees the file doesn't exist, and incorrectly reuses its vector

The fix should ensure that only images from the current directory being indexed (those with indexing=1) are considered as candidates for rename detection. Add a filter to check img_entry.get("indexing") == 1 before considering it as a renamed file.

Suggested change
- if not os.path.exists(img_entry["filepath"]):
+ # Only consider images that are part of the current indexing pass (indexing == 1)
+ if img_entry.get("indexing") == 1 and not os.path.exists(img_entry["filepath"]):

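A hedged sketch of the candidate filter this comment proposes: among database rows that share the new file's hash, pick one whose old path is gone and which belongs to the current indexing pass. The `find_rename_source` helper, the `rows` shape, and the injectable `file_exists` check are illustrative stand-ins, not rclip's actual API:

```python
# Sketch of rename-candidate selection with the reviewer's extra
# indexing == 1 guard, so entries from other directory trees are ignored.
from typing import Callable, Optional

def find_rename_source(
    rows: list[dict],
    file_exists: Callable[[str], bool],
) -> Optional[dict]:
    for row in rows:
        # A true rename: the old file vanished AND it is part of this pass
        if row.get("indexing") == 1 and not file_exists(row["filepath"]):
            return row
    return None  # no candidate: treat the new file as brand new

rows = [
    {"filepath": "/photos/old/a.jpg", "indexing": 1, "vector": b"v1"},
    {"filepath": "/other/tree/b.jpg", "indexing": None, "vector": b"v2"},
]
# Simulate a disk where neither old path exists anymore
candidate = find_rename_source(rows, lambda p: False)
print(candidate["filepath"])  # /photos/old/a.jpg, the in-pass entry wins
```

With this guard, the stale `/other/tree/b.jpg` entry from a different directory can never donate its vector to a file in the directory currently being indexed.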
existing_image_vector = img_entry["vector"]
break

if existing_image_vector:
# This is a renamed file - reuse the existing vector
# DON'T remove the indexing flag from the old filepath - we want it to be marked as deleted
# Create a new entry for the new filepath
self._db.upsert_image(
db.NewImage(
filepath=filepath,
modified_at=meta["modified_at"],
size=meta["size"],
vector=existing_image_vector,
hash=current_hash
),
commit=False,
)
self._db.remove_indexing_flag(filepath, commit=False)
continue
except (PIL.UnidentifiedImageError, OSError, ValueError):
# If we can't read the image, fall through to normal indexing
pass

batch.append(filepath)
metas.append(meta)

@@ -165,10 +213,13 @@
if len(batch) != 0:
self._index_files(batch, metas)

# Finalize indexing workflow: mark any remaining indexing=1 entries as deleted
# These are files that no longer exist (e.g., old paths of renamed files)
self._db.flag_indexing_images_in_a_dir_as_deleted(directory)

self._db.commit()
counter_thread.join()

self._db.flag_indexing_images_in_a_dir_as_deleted(directory)
Owner

We should keep this to mark images we never met during reindexing as deleted.

print("", file=sys.stderr)

def search(
Expand Down Expand Up @@ -245,6 +296,7 @@ def init_rclip(

return rclip, model_instance, database


def print_results(result: List[RClip.SearchResult], args: helpers.argparse.Namespace):
# if we are not outputting to console on windows, ensure unicode encoding is correct
if not sys.stdout.isatty() and os.name == "nt":
18 changes: 17 additions & 1 deletion rclip/utils/helpers.py
@@ -10,6 +10,7 @@
import requests
import sys
from importlib.metadata import version
import imagehash

from rclip.const import IMAGE_RAW_EXT, IS_LINUX, IS_MACOS, IS_WINDOWS

@@ -101,7 +102,10 @@ def init_arg_parser() -> argparse.ArgumentParser:
"get help:\n"
" https://github.com/yurijmikhalevich/rclip/discussions/new/choose\n\n",
)
version_str = f"rclip {version('rclip')}"
try:
version_str = f"rclip {version('rclip')}"
except Exception: # PackageNotFoundError when not installed via package manager
version_str = "rclip (development)"
Comment on lines +105 to +108
Copilot AI Dec 27, 2025

The broad except Exception handler at line 107 catches any exception when getting the version, including PackageNotFoundError. While this makes the code work in development mode, it silently swallows all exceptions without logging. Consider:

  1. Catching the specific PackageNotFoundError exception for clarity
  2. Logging the exception in development mode so developers are aware of the fallback behavior

This would make debugging easier and the code's intent clearer.

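The narrower handling this comment suggests could look like the sketch below; the `get_version_str` helper is hypothetical, but `PackageNotFoundError` is the documented exception raised by `importlib.metadata.version` for uninstalled distributions:

```python
# Sketch: catch PackageNotFoundError specifically instead of bare Exception.
from importlib.metadata import version, PackageNotFoundError

def get_version_str(package: str) -> str:
    try:
        return f"rclip {version(package)}"
    except PackageNotFoundError:
        # Running from a source checkout rather than an installed package
        return "rclip (development)"

print(get_version_str("definitely-not-an-installed-package"))  # rclip (development)
```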
parser.add_argument("--version", "-v", action="version", version=version_str, help=f'prints "{version_str}"')
parser.add_argument("query", help="a text query or a path/URL to an image file")
parser.add_argument(
@@ -230,6 +234,18 @@ def read_raw_image_file(path: str):
return Image.fromarray(np.array(rgb))


def compute_image_hash(image: Image.Image) -> str:
"""Compute a perceptual hash (pHash) for an image.

The pHash algorithm generates a compact fingerprint of the image's visual
content, such that visually similar images (e.g. resized, recompressed or
slightly modified versions of the same picture) produce similar hashes.
This makes it suitable for detecting identical or near-duplicate images,
such as renamed files with the same underlying content.
Comment on lines +239 to +244
Copilot AI Dec 27, 2025

The docstring mentions that pHash produces "similar hashes" for "visually similar images" but this is misleading. Perceptual hashes are designed to produce IDENTICAL or very close hashes for identical/near-duplicate content, not just "similar" hashes. The rename detection logic depends on exact hash matches (line 179 in main.py uses get_images_by_hash(current_hash) which does equality comparison).

Consider clarifying that pHash produces identical hashes for identical visual content, and that this implementation is used for exact duplicate detection rather than similarity matching.

Suggested change
- The pHash algorithm generates a compact fingerprint of the image's visual
- content, such that visually similar images (e.g. resized, recompressed or
- slightly modified versions of the same picture) produce similar hashes.
- This makes it suitable for detecting identical or near-duplicate images,
- such as renamed files with the same underlying content.
+ The pHash algorithm generates a compact, deterministic fingerprint of an
+ image's visual content. Identical visual content will produce identical
+ hashes, and near-duplicate content (e.g. resized or recompressed versions
+ of the same picture) will typically produce very similar hashes.
+ In this project, the returned hash is used for exact duplicate detection
+ via equality comparison (for example, to detect renamed files that contain
+ the same underlying image data), not for general similarity matching with
+ a distance threshold.

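To make the equality-based matching concrete, here is a tiny average-hash (aHash) over an already-scaled grayscale grid. It is a deliberately simplified stand-in for illustration, not the pHash algorithm that `imagehash.phash` actually implements:

```python
# Toy average-hash: deterministic, so identical pixel content always
# yields an identical hash string, which is what equality matching needs.
def average_hash(gray: list[list[int]]) -> str:
    pixels = [p for row in gray for p in row]
    avg = sum(pixels) / len(pixels)
    # One bit per pixel: 1 if brighter than average, else 0
    bits = "".join("1" if p > avg else "0" for p in pixels)
    return f"{int(bits, 2):0{len(pixels) // 4}x}"  # hex string, like imagehash

grid = [[10, 200], [200, 10]]
print(average_hash(grid) == average_hash([[10, 200], [200, 10]]))  # True
```

Because the function is deterministic, re-reading the same image bytes from a renamed file reproduces the stored hash exactly, which is why a plain `WHERE hash = ?` lookup suffices.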
"""
return str(imagehash.phash(image))


def read_image(query: str) -> Image.Image:
path = str.removeprefix(query, "file://")
try: