
feat(assets): asynchronous background asset scanning #12336

Closed
luke-mino-altherr wants to merge 67 commits into master from luke-mino-altherr/asynchronous-scanning

Conversation

@luke-mino-altherr
Contributor

Summary

This PR refactors the assets package to support asynchronous background scanning, improving startup performance by not blocking on asset discovery.

Key Changes

Architecture Refactoring

  • Eliminated manager layer - routes now call services directly
  • Extracted scanner logic into dedicated service modules
  • Separated business logic from database queries
  • Split queries.py into modular query modules

New Features

  • Background asset seeder for non-blocking startup
  • Non-destructive asset pruning with is_missing flag instead of deletion
  • Decoupled orphan pruning from asset seeding

Code Quality

  • Added TypedDict types throughout scanner and bulk_ingest
  • Converted to explicit dataclasses instead of ORM objects in service layer
  • Improved function naming (verb-based convention)
  • Table-driven parametrized tests
  • Fixed path traversal validation (returns 400 instead of 500)
  • Graceful handling of missing blake3 module

Cleanup

  • Removed unused imports and re-exports
  • Sorted imports throughout assets package
  • Consolidated service imports
  • Removed self-documenting comments

@socket-security

socket-security bot commented Feb 12, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff   Package        Supply Chain Security   Vulnerability   Quality   Maintenance   License
Added  blake3@1.0.8   100                     100             98        100           100

View full report

@luke-mino-altherr force-pushed the luke-mino-altherr/asynchronous-scanning branch from beae633 to 5ff6bf7 on February 12, 2026 at 01:41
@nick-morhun

Hi! I believe a few smaller PRs will have a better chance of being merged.

@coderabbitai

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

This PR replaces the previous AssetInfo/AssetCacheState schema with a unified AssetReference model and adds an Alembic migration to create the new tables. It restructures database query code into a modular queries package, introduces a services layer for ingest/management/tagging/hashing/metadata, implements a threaded AssetSeeder for background scanning, refactors API routes and multipart upload handling, removes several legacy modules, and adds extensive unit tests for queries, services, seeding, and API behaviors.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 31.14%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Title check: ✅ Passed. The title clearly and concisely describes the main feature, asynchronous background asset scanning, which is the primary focus of this comprehensive refactoring PR.
Description check: ✅ Passed. The description is well-structured and directly related to the changeset, outlining architecture refactoring, new features (background seeder, non-destructive pruning), and code quality improvements that align with the actual changes.


coderabbitai bot left a comment
Actionable comments posted: 46

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/assets/api/routes.py (1)

246-253: ⚠️ Potential issue | 🟠 Major

Synchronous file I/O inside an async generator blocks the event loop.

open(abs_path, "rb") and f.read(chunk_size) are blocking syscalls executed directly on the event loop thread. For large files this stalls all other request handling. Use asyncio.to_thread / loop.run_in_executor, or aiohttp's built-in FileResponse which already handles this correctly.

Suggested fix using aiohttp's FileResponse
-    async def stream_file_chunks():
-        chunk_size = 64 * 1024
-        with open(abs_path, "rb") as f:
-            while True:
-                chunk = f.read(chunk_size)
-                if not chunk:
-                    break
-                yield chunk
-
-    return web.Response(
-        body=stream_file_chunks(),
-        content_type=content_type,
-        headers={
-            "Content-Disposition": cd,
-            "Content-Length": str(file_size),
-        },
-    )
+    resp = web.FileResponse(
+        abs_path,
+        headers={
+            "Content-Disposition": cd,
+            "Content-Type": content_type,
+        },
+    )
+    return resp

If custom streaming is needed (e.g., to preserve the explicit Content-Length header and Content-Type), offload reads to a thread executor:

async def stream_file_chunks():
    chunk_size = 64 * 1024
    loop = asyncio.get_running_loop()
    f = await loop.run_in_executor(None, open, abs_path, "rb")
    try:
        while True:
            chunk = await loop.run_in_executor(None, f.read, chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        await loop.run_in_executor(None, f.close)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 246 - 253, The async generator
stream_file_chunks uses synchronous file I/O (open and f.read) which blocks the
event loop; replace those blocking calls by offloading them to a thread executor
(e.g., asyncio.to_thread or asyncio.get_running_loop().run_in_executor) or
switch to aiohttp's FileResponse; specifically, open the file via
to_thread/run_in_executor, perform each f.read via to_thread/run_in_executor,
and ensure the file is closed in a finally block (or use FileResponse which
handles streaming and headers for you) so streaming no longer blocks the event
loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@alembic_db/versions/0002_merge_to_asset_references.py`:
- Around line 173-177: The downgrade() in migration
0002_merge_to_asset_references currently raises a generic NotImplementedError
with no pre-flight warning; update it to provide an explicit, actionable failure
message and signal to operators (e.g., raise RuntimeError or SystemExit with a
clear, prominent message) that this migration is destructive and cannot be
auto-rolled-back, and include a short pointer to the runbook/backup procedure
and an instruction to prevent accidental alembic downgrade attempts (or, if
Alembic hooks are used in the project, install a before_downgrade hook that
logs/prints the same prominent warning); locate and change the downgrade()
function in this migration file to implement this clearer, operator-facing
behavior.
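As a sketch, the suggested operator-facing downgrade() could look like the following; the message wording and the runbook pointer are placeholders, not project policy:

```python
# Illustrative downgrade() for 0002_merge_to_asset_references; the exact
# message text and runbook reference are placeholders.
def downgrade() -> None:
    raise RuntimeError(
        "DESTRUCTIVE MIGRATION: 0002_merge_to_asset_references cannot be "
        "rolled back automatically. Restore the database from a backup "
        "instead; see the operations runbook for the procedure."
    )
```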
- Around line 97-106: The migration is missing an index for updated_at which
causes ORDER BY updated_at DESC to full-scan; in the Alembic upgrade() add
op.create_index("ix_asset_references_updated_at", "asset_references",
["updated_at"]) (and add the corresponding op.drop_index in downgrade()), and
mirror the change in the SQLAlchemy model definition (e.g., in models.py on the
AssetReference/table definition add an Index("ix_asset_references_updated_at",
AssetReference.updated_at) or include updated_at in the model's __table_args__)
so the ORM and migration stay consistent.

In `@app/assets/api/routes.py`:
- Around line 46-58: get_query_dict currently iterates over request.query.keys()
which can yield duplicate keys from aiohttp's MultiDict, wasting work; change
the dict comprehension to iterate over a de-duplicated set of keys (e.g., use
set(request.query.keys())) so each key is processed once while keeping the
existing getall/get logic in get_query_dict.
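A minimal sketch of the de-duplicated helper, assuming an aiohttp-style MultiDict that exposes keys()/getall(); the collapse of single-value lists to a scalar is an assumption about the existing logic, not a verified detail:

```python
# Sketch only: `query` is assumed to behave like an aiohttp MultiDict.
def get_query_dict(query) -> dict:
    result = {}
    for key in set(query.keys()):      # keys() repeats duplicates; de-dupe here
        values = query.getall(key)
        result[key] = values if len(values) > 1 else values[0]
    return result
```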
- Around line 580-603: delete_asset_tags currently only catches ValueError from
remove_tags so PermissionError bubbles to the generic handler and returns 500;
update the exception handling in the delete_asset_tags route to catch
PermissionError (same as add_asset_tags and update_asset_route) and map it to a
404 using _build_error_response with the same payload shape as the ValueError
branch. Locate the try/except around the remove_tags call in delete_asset_tags
and either add PermissionError to the existing except (e.g., except (ValueError,
PermissionError) as e) or add a separate except PermissionError block that
returns _build_error_response(404, "ASSET_NOT_FOUND", str(e), {"id":
reference_id}); keep the logging and other exception handling for the generic
Exception branch unchanged.
- Around line 628-662: The handler currently calls the blocking
asset_seeder.wait() directly and also fails to re-disable the seeder on the
non-wait path; change the wait call to run off the event loop (e.g., await
asyncio.to_thread(asset_seeder.wait) or await loop.run_in_executor(None,
asset_seeder.wait)) and ensure that whenever was_disabled was True you call
asset_seeder.disable() before every return path (both after the awaited wait
branch and the non-wait branch where you currently return {"status":"started"}),
keeping the existing logic around asset_seeder.enable(),
asset_seeder.start(roots=valid_roots) and asset_seeder.get_status() intact.
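The non-blocking wait plus guaranteed re-disable could be sketched like this; only the seeder surface (wait/disable) comes from the description above, while the handler shape and return payload are illustrative:

```python
import asyncio

# Sketch: blocking wait is offloaded with asyncio.to_thread, and the
# seeder is re-disabled on every exit path when it was disabled before.
async def run_scan_and_wait(asset_seeder, was_disabled: bool) -> dict:
    try:
        await asyncio.to_thread(asset_seeder.wait)  # off the event loop
        return {"status": "completed"}
    finally:
        if was_disabled:
            asset_seeder.disable()  # runs on every return path
```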
- Around line 695-712: The handler mark_missing_assets currently calls
asset_seeder.mark_missing_outside_prefixes() before checking
asset_seeder.get_status().state.value, allowing a race; modify
mark_missing_assets to first inspect asset_seeder.get_status().state.value (or a
helper like is_running) and if the state indicates a running scan return the 409
{"status":"scan_running","marked":0} immediately, only calling
asset_seeder.mark_missing_outside_prefixes() when the seeder is idle; also
remove the combined `marked == 0` check — treat the running-scan check
independently so a zero-mark result is not conflated with an active scan.

In `@app/assets/api/schemas_in.py`:
- Around line 42-55: Consolidate the duplicate exception classes by moving the
single definitions of HashMismatchError and DependencyMissingError into one
central module (create a shared exceptions module), delete the duplicate class
definitions from both schemas_in.py and ingest.py, and update both modules to
import HashMismatchError and DependencyMissingError from that shared module so
all code catches the same exception types.

In `@app/assets/api/upload.py`:
- Around line 93-109: The current multipart handler assumes the `hash` field is
seen before `file` and only drains without writing when `provided_hash and
provided_hash_exists is True`, causing unnecessary disk writes if `file` arrives
first; change the logic so the `file` branch writes incoming chunks to a temp
file (use a secure tempfile path tracked as e.g. temp_path) and updates
`file_written` while still supporting the existing drain-only branch behavior,
then after the multipart loop check `provided_hash_exists` and if true delete
the temp file and skip further storage, otherwise move/commit the temp file to
permanent storage; ensure exceptions around `field.read_chunk` still raise
`UploadError` and clean up temp files on error.
- Around line 111-128: The try/except currently calls
delete_temp_file_if_exists(tmp_path) but leaves the parent unique_dir behind,
causing orphaned empty directories; update the cleanup so that after removing
tmp_path you also remove its parent directory (unique_dir) if it's empty: locate
the variables uploads_root, unique_dir, tmp_path and the call to
delete_temp_file_if_exists in the except block and modify cleanup to remove the
file and then attempt to remove unique_dir (e.g., os.rmdir(unique_dir) guarded
by os.path.isdir(unique_dir) and catching OSError if not empty) before raising
the UploadError so the temporary upload directory is not left behind.
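A small sketch of the fuller cleanup described above; `tmp_path` and `unique_dir` mirror the variable names in upload.py, but the helper name is hypothetical:

```python
import os

# Sketch: remove the temp file, then try to remove its now-empty parent
# directory so no orphaned upload directories are left behind.
def cleanup_upload_temp(tmp_path: str, unique_dir: str) -> None:
    if os.path.isfile(tmp_path):
        os.remove(tmp_path)
    if os.path.isdir(unique_dir):
        try:
            os.rmdir(unique_dir)   # only succeeds when the dir is empty
        except OSError:
            pass                   # not empty: leave it alone
```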

In `@app/assets/database/models.py`:
- Around line 145-162: The asset_references model defines many indexes in
__table_args__ but is missing an index on updated_at; add
Index("ix_asset_references_updated_at", "updated_at") to __table_args__ so
queries ordering by updated_at can use an index. Also add a corresponding
database migration change by calling
op.create_index("ix_asset_references_updated_at", "asset_references",
["updated_at"]) in the 0002_merge_to_asset_references.py migration (and the
reverse op.drop_index in downgrade), so the schema and migration stay
consistent.

In `@app/assets/database/queries/asset_reference.py`:
- Around line 926-942: The call to calculate_rows_per_statement uses a hardcoded
14 which can drift; in bulk_insert_references_ignore_conflicts compute the
columns-per-row dynamically (e.g., cols_per_row = len(enriched_rows[0]) after
enriching rows) and pass that to calculate_rows_per_statement instead of 14 so
the batching reflects the actual number of bind parameters for the ins execute
(references: function bulk_insert_references_ignore_conflicts, enriched_rows,
calculate_rows_per_statement, iter_chunks, AssetReference).
- Around line 815-827: The function delete_orphaned_seed_asset currently deletes
AssetReference rows for any asset_id before checking the asset's hash; update it
to first load the Asset via session.get(Asset, asset_id), verify that asset
exists and that asset.hash is None, and only then delete both the AssetReference
rows (sa.delete(AssetReference).where(AssetReference.asset_id == asset_id)) and
the Asset (session.delete(asset)); if the asset is missing or asset.hash is not
None return False without deleting references. Ensure you reference the existing
symbols: delete_orphaned_seed_asset, Asset, AssetReference, session.get, and
session.delete.
- Around line 626-667: Both functions mark_references_missing_outside_prefixes
and restore_references_by_paths currently return result.rowcount which can be
None; change each to return a definite integer by converting/casting rowcount to
int and defaulting to 0 when None (e.g., use an int-cast with a fallback of 0)
instead of returning result.rowcount directly so the functions always satisfy
the -> int annotation.
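The coercion can be captured in a one-line helper; the name `rowcount_as_int` is hypothetical, not from the codebase:

```python
# Drivers may report rowcount as None; coerce to a definite int so
# callers annotated `-> int` always get one.
def rowcount_as_int(rowcount) -> int:
    return int(rowcount or 0)   # None (and 0) both become 0
```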
- Around line 290-363: The count query duplicates the filtering logic in
list_references_page (see variables base and count_stmt) which risks divergence;
refactor by constructing a single filtered selectable (e.g., filtered_base or
filtered_subq) that applies build_visible_owner_clause, name_contains escaping,
_apply_tag_filters and _apply_metadata_filter once, then derive the paginated
query from filtered_base.order_by(...).limit(...).offset(...) and derive the
total with select(sa.func.count()).select_from(filtered_base.subquery()/alias())
so both queries share the exact same filters; keep sorting/pagination only on
the paginated query and reuse the same filtered_base when loading tags
(id_list).
- Around line 945-972: The query in get_references_by_paths_and_asset_ids uses
two independent IN clauses and causes cross-product matches; replace the IN/IN
with a disjunction of correlated conditions so each pair is matched exactly
(i.e. build an OR of (AssetReference.file_path == path AND
AssetReference.asset_id == asset_id) for each entry in the chunk), adjust
chunking to account for roughly two bind params per pair (e.g. use
MAX_BIND_PARAMS // 2 or similar), execute the select with the combined or_
condition targeting AssetReference.file_path and AssetReference.asset_id, and
continue to collect winners from result.scalars().all(); use iter_chunks and
MAX_BIND_PARAMS symbols to locate where to change logic.

In `@app/assets/database/queries/asset.py`:
- Around line 123-137: The function reassign_asset_references currently always
calls session.flush() even when no AssetReference was found; modify the control
flow to only call session.flush() when ref is not None (e.g., move the
session.flush() inside the existing if ref: block or return early when ref is
None) so you avoid an unnecessary DB round-trip; reference the ref variable,
session.flush(), and the reassign_asset_references function when applying the
change.
- Around line 93-103: get_existing_asset_ids currently passes the full asset_ids
list into a single IN(...) clause which can exceed SQLite's bind-parameter
limit; change it to iterate the asset_ids in chunks (reuse the existing
iter_chunks helper pattern used by get_reference_ids_by_ids) and accumulate
results across chunked queries against select(Asset.id). For each chunk call
session.execute(select(Asset.id).where(Asset.id.in_(chunk))).fetchall() and add
the returned ids to a set to return, while keeping the early-empty-list check
and function signature intact.

In `@app/assets/database/queries/common.py`:
- Around line 17-20: iter_chunks currently assumes seq supports len() and
slicing which breaks for generators — tighten the contract by typing seq as a
sliceable Sequence (e.g. add from typing import Sequence, TypeVar, Iterator and
use def iter_chunks(seq: Sequence[T], n: int) -> Iterator[Sequence[T]] with a
TypeVar T) so callers know the function requires a sequence; update the function
signature and imports accordingly and run type-checks to ensure compatibility
with existing callers.
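The tightened signature could look like the following sketch; the body assumes the common slice-based chunking pattern:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# iter_chunks with the sequence-only contract: callers must pass
# something that supports len() and slicing, not a generator.
def iter_chunks(seq: Sequence[T], n: int) -> Iterator[Sequence[T]]:
    for i in range(0, len(seq), n):
        yield seq[i : i + n]
```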

In `@app/assets/database/queries/tags.py`:
- Around line 291-325: The total_q subquery ignores the owner filter when
include_zero is False; update the inner select used in total_q (the
select(AssetReferenceTag.tag_name).group_by(...) expression) to include the same
owner filter applied elsewhere (i.e., add .where(AssetReferenceTag.owner_user_id
== <owner_id> or the equivalent owner predicate used in the main query) so
Tag.name.in_(...) only considers tags owned by that user), ensuring the owner
predicate matches the one used for the main query (refer to total_q,
include_zero, AssetReferenceTag, and Tag in tags.py).

In `@app/assets/helpers.py`:
- Around line 6-23: The function select_best_live_path accepts an
unparameterized Sequence and accesses duck-typed attributes via getattr; define
a lightweight Protocol (e.g., CacheStateProtocol) that declares the expected
attributes (file_path: Optional[str] and needs_verify: bool or Optional[bool])
and import Protocol and Optional from typing, then change the function signature
to states: Sequence[CacheStateProtocol] so type checkers can verify callers and
the attribute use inside select_best_live_path is explicit (refer to the
select_best_live_path function and introduce CacheStateProtocol to match its
file_path and needs_verify usage).
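A sketch of the Protocol matching the duck-typed attribute use described above; the selection rule in the function body is illustrative, not the real implementation:

```python
from typing import Optional, Protocol, Sequence

# Declares the attributes select_best_live_path actually reads, so type
# checkers can verify callers without coupling to the ORM class.
class CacheStateProtocol(Protocol):
    file_path: Optional[str]
    needs_verify: bool

def select_best_live_path(states: Sequence[CacheStateProtocol]) -> Optional[str]:
    # Illustrative rule: first entry with a path that needs no verification.
    for state in states:
        if state.file_path and not state.needs_verify:
            return state.file_path
    return None
```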

In `@app/assets/scanner.py`:
- Around line 298-306: The fast-scan loop using os.stat(abs_p,
follow_symlinks=False) is using lstat semantics that mismatch the verification
path and cause perpetual fast_ok/“missing” mismatches; update the stat calls in
the fast-scan loop (the for p in paths block where stat_p is assigned) to use
follow_symlinks=True (and make the identical change for the second occurrence
around lines 372-380) so the symlink-following policy matches
verification/build_*_specs behavior.

In `@app/assets/seeder.py`:
- Around line 250-251: Remove the trailing whitespace at the end of the
docstring line containing "Resume a paused scan." in app/assets/seeder.py (the
docstring lines shown in the diff); edit the docstring so there are no space
characters after the final text or newline, save the file, and re-run ruff to
confirm the trailing-whitespace warnings are resolved.

In `@app/assets/services/bulk_ingest.py`:
- Around line 186-193: The current code filters out reference_rows whose
r["asset_id"] weren't returned from get_existing_asset_ids, which drops
references for assets that already exist (duplicate hash); instead resolve asset
IDs by asset hash for all reference_rows and map each row to the correct
existing asset id when present. Change the logic around get_existing_asset_ids:
query the assets table by asset hash (use or add a helper like
get_asset_ids_by_hashes(session, hashes) or extend get_existing_asset_ids to
accept hashes), build a hash->asset_id map, then for each reference row replace
or set r["asset_id"] to the mapped existing id when available (keeping newly
inserted ids otherwise) and do not simply drop rows based on inserted_asset_ids;
use the symbols reference_rows, r["asset_id"], and the existing DB lookup helper
(get_existing_asset_ids or a new get_asset_ids_by_hashes) to locate and
implement the fix.

In `@app/assets/services/file_utils.py`:
- Around line 4-8: The getattr fallback in get_mtime_ns causes unnecessary
float-to-int conversion and precision loss; instead, change get_mtime_ns to
directly return stat_result.st_mtime_ns (remove the int(stat_result.st_mtime *
1_000_000_000) fallback) so the function always uses the native nanosecond
timestamp and avoids spurious file-change mismatches in verify_file_unchanged.
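The simplified helper reduces to a single attribute read, since os.stat_result always carries the native nanosecond field:

```python
import os

# Sketch of the simplified get_mtime_ns: st_mtime_ns is always present
# on os.stat_result, so the float-based fallback can be dropped.
def get_mtime_ns(stat_result: os.stat_result) -> int:
    return stat_result.st_mtime_ns
```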

In `@app/assets/services/hashing.py`:
- Around line 35-54: _hash_file_obj currently only restores the file position
when orig_pos != 0, leaving the stream at EOF for callers that started at 0;
change the finally block to always seek back to orig_pos (i.e., call
file_obj.seek(orig_pos) unconditionally) so the original position is restored
after reading, while keeping the existing pre-read seek(0) logic that runs only
when orig_pos != 0.
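A sketch of the fixed position handling; hashlib.sha256 stands in for blake3 so the example stays self-contained, and the function name is a stand-in for the private _hash_file_obj:

```python
import hashlib

# Sketch: seek to 0 before reading only when needed, but restore the
# original position unconditionally in finally, even when it was 0.
def hash_file_obj(file_obj, chunk_size: int = 65536) -> str:
    orig_pos = file_obj.tell()
    try:
        if orig_pos != 0:
            file_obj.seek(0)          # always hash from the start
        h = hashlib.sha256()
        while chunk := file_obj.read(chunk_size):
            h.update(chunk)
        return h.hexdigest()
    finally:
        file_obj.seek(orig_pos)       # restore unconditionally
```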

In `@app/assets/services/ingest.py`:
- Around line 242-326: The function upload_from_temp_path doesn't guard against
tags being None; normalize tags at the top of the function (e.g. in
upload_from_temp_path) by converting None to an empty list (tags = list(tags or
[])) before calling resolve_destination_from_tags, passing to
_register_existing_asset, and to _ingest_file_from_path so
resolve_destination_from_tags(tags) and any list(tags) usages won't raise
TypeError; update references to use this normalized variable throughout the
function (keep function name upload_from_temp_path,
resolve_destination_from_tags, _register_existing_asset, and
_ingest_file_from_path for locating the change).

In `@app/assets/services/path_utils.py`:
- Around line 53-60: The path check in validate_path_within_base uses
os.path.abspath which can be bypassed via symlinks; replace abspath with
os.path.realpath for both candidate and base before calling os.path.commonpath
(i.e., use realpath for cand_abs and base_abs), preserve the same ValueError
behavior/messages (raising "destination escapes base directory" when commonpath
differs and "invalid destination path" on other exceptions), and keep the
function signature validate_path_within_base(candidate: str, base: str) -> None
unchanged.
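The symlink-safe variant could be sketched as follows, keeping the same signature and error messages described above:

```python
import os

# Sketch: realpath resolves symlinks before containment is checked, so a
# symlinked candidate cannot escape the base directory undetected.
def validate_path_within_base(candidate: str, base: str) -> None:
    try:
        cand_real = os.path.realpath(candidate)
        base_real = os.path.realpath(base)
        common = os.path.commonpath([cand_real, base_real])
    except Exception:
        raise ValueError("invalid destination path")
    if common != base_real:
        raise ValueError("destination escapes base directory")
```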
- Around line 25-50: The function resolve_destination_from_tags currently
assumes tags is non-empty and treats any unknown root as "output"; first
validate that tags is a non-empty list and raise ValueError("empty tags") if
not. Next, after computing root = tags[0], check that root is one of the allowed
roots ("models", "input", "output") and raise ValueError(f"unknown root tag
'{root}'") for any other value instead of falling through; keep the existing
models branch but retain its existing checks for a missing category tag and
unknown category, and ensure raw_subdirs handling and path normalization remain
unchanged.

In `@app/assets/services/schemas.py`:
- Around line 76-79: The service-layer NamedTuple TagUsage uses tag_type while
the API-layer schemas_out.TagUsage uses type; make them consistent by renaming
the service field to type (change TagUsage fields to name: str, type: str,
count: int) or, if renaming is undesirable, add a clear mapping comment and
conversion helper where service TagUsage is converted to schemas_out.TagUsage
(referencing TagUsage in this file and schemas_out.TagUsage.type) so callers and
route mappers know to translate tag_type <-> type explicitly.
- Around line 11-14: AssetData.hash is declared as str but Asset.hash is typed
str | None, causing extract_asset_data to construct AssetData with None for
hash; update the AssetData class to declare hash: str | None (and adjust any
related type hints) or, alternatively, have extract_asset_data
validate/normalize asset.hash (e.g., raise or replace None with an empty string)
before constructing AssetData; modify the symbol AssetData (in
app/assets/services/schemas.py) and/or the extract_asset_data function to ensure
types align and downstream code won’t receive a None where a str is expected.
- Around line 31-43: AssetDetailResult and AssetSummaryData share identical
fields and RegisterAssetResult and UploadResult only differ by the boolean flag
name (created vs created_new); refactor by introducing a shared frozen dataclass
base (e.g., AssetResultBase or AssetSummaryBase) with fields ref, asset, tags
and have AssetDetailResult/AssetSummaryData inherit from it, and likewise unify
RegisterAssetResult and UploadResult to inherit the same base and expose a
single boolean (choose one canonical name such as created) while preserving
external API compatibility (add a small adapter/alias or property if consumers
expect the alternate name), or if the separation is intentional, add a brief
comment above AssetDetailResult, AssetSummaryData, RegisterAssetResult and
UploadResult documenting why they are distinct to avoid future confusion.
- Around line 35-36: The dataclasses in app/assets/services/schemas.py declare
mutable list fields (e.g., tags and ListAssetsResult.items) while using
frozen=True, which breaks immutability; change the type annotations from
list[str] / list[AssetSummaryData] to tuple[str, ...] / tuple[AssetSummaryData,
...] (and adjust any default_factory usages to use tuple() or None with
post-init conversion) for all affected fields (tags, ListAssetsResult.items and
the other list-typed fields called out), and update any callers/constructors to
pass tuples (or convert lists to tuples) so the dataclass truly remains
immutable.
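A sketch of the truly immutable variant; the field set shown is illustrative, not the full schema:

```python
from dataclasses import dataclass

# tuple instead of list, so frozen=True actually yields an immutable value.
@dataclass(frozen=True)
class AssetSummaryData:
    name: str
    tags: tuple[str, ...]
```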
- Around line 89-92: ListAssetsResult should own pagination logic: add an
offset: int field to the dataclass and expose has_more as a computed boolean
(e.g., a `@property` or computed field) using (offset + len(items)) < total;
update the service function that constructs ListAssetsResult (the service method
returning ListAssetsResult) to pass the offset and remove the duplicate has_more
computation in the route layer (where it’s currently computed) so callers can
read result.has_more directly.
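The pagination-owning version could be sketched like this; `items` is typed loosely to keep the example self-contained:

```python
from dataclasses import dataclass

# Sketch of ListAssetsResult computing has_more itself, so route code no
# longer duplicates the (offset + len(items)) < total arithmetic.
@dataclass(frozen=True)
class ListAssetsResult:
    items: tuple
    total: int
    offset: int

    @property
    def has_more(self) -> bool:
        return (self.offset + len(self.items)) < self.total
```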

In `@main.py`:
- Around line 262-266: The code calls asset_seeder.pause() then e.execute(...)
and asset_seeder.resume(), but if e.execute raises an exception resume() is
never called; wrap the execute call in a try/finally so asset_seeder.resume() is
guaranteed to run (i.e., call asset_seeder.pause(), then try: e.execute(item[2],
prompt_id, extra_data, item[4]) finally: asset_seeder.resume()), keeping the
same parameters and preserve any exception propagation after resume.
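The guaranteed-resume pattern in isolation; `seeder` and `execute` stand in for asset_seeder and e.execute from main.py, and the wrapper name is hypothetical:

```python
# Sketch: resume() is in a finally block, so it runs even when execute()
# raises, and the exception still propagates to the caller.
def execute_with_seeder_paused(seeder, execute, *args):
    seeder.pause()
    try:
        return execute(*args)
    finally:
        seeder.resume()
```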
- Around line 263-265: Remove trailing whitespace on the blank lines around the
call to e.execute(item[2], prompt_id, extra_data, item[4]) — the lint failure is
caused by whitespace-only characters on the blank lines near that statement;
open the block containing the e.execute call and delete any trailing spaces on
the empty lines (or enable your editor's trim-trailing-whitespace/trim on save)
so the file has truly empty lines rather than whitespace.

In `@requirements.txt`:
- Line 27: Pin the blake3 dependency to a safe minimum version to avoid silently
pulling in breaking or insecure releases; update the requirements.txt entry for
"blake3" to include a version constraint such as "blake3>=1.0.8" (or the chosen
minimum) so installs will require at least 1.0.8 which contains the PyBuffer
race-fix.

In `@server.py`:
- Around line 692-695: Wrap the call to asset_seeder.start inside
get_object_info in a try/except so any exception from the background seeding
does not bubble and fail the request; i.e., around the
asset_seeder.start(roots=("models","input","output")) call (and optionally the
surrounding with folder_paths.cache_helper block) catch Exception, log the error
via the existing logger and continue so the route returns normally even if
seeding fails.

In `@tests-unit/assets_test/queries/conftest.py`:
- Around line 8-14: The SQLite in-memory session fixture (session) currently
doesn't enable foreign key enforcement, so update the fixture to turn on SQLite
PRAGMA foreign_keys=ON on engine connect; import and use sqlalchemy.event.listen
targeting the engine's "connect" event (apply PRAGMA using the DBAPI connection
in the listener) before calling Base.metadata.create_all(engine) so tests
enforce FK constraints for the AssetReference schema.

In `@tests-unit/assets_test/queries/test_asset_info.py`:
- Around line 428-501: Add a brief inline comment in the
TestBulkInsertReferencesIgnoreConflicts tests (near the rows definitions)
explaining that file_path is intentionally omitted so it defaults to NULL and,
because the unique/conflict key is file_path and SQLite treats NULLs as
distinct, all inserts will succeed; reference the
bulk_insert_references_ignore_conflicts helper and the AssetReference
model/file_path uniqueness to make the intent explicit.
- Around line 206-207: Remove the redundant local "import time" that is
duplicated at module level; keep the existing top-level import and delete the
inline "import time" immediately before "time.sleep(0.01)" so the test uses the
module-level time and avoids a duplicate import statement.

In `@tests-unit/assets_test/queries/test_asset.py`:
- Around line 24-31: In test_exists_by_hash, the outer guard "if setup_hash is
not None or setup_hash is None" is always true and should be removed; instead
keep a single meaningful condition—replace the two-level check with a single
conditional such as "if setup_hash is not None or query_hash == '':" before
creating and committing the Asset (referencing test_exists_by_hash, setup_hash,
query_hash, and Asset) so the setup only creates an asset when intended.

In `@tests-unit/assets_test/services/conftest.py`:
- Around line 50-54: The temp_dir pytest fixture duplicates built-in tmp_path;
remove the custom temp_dir fixture from conftest.py (the function named
temp_dir) and update any tests that reference temp_dir to use pytest's tmp_path
instead (e.g., change temp_dir -> tmp_path in tests such as test_file_utils.py
if present), ensuring imports/fixtures are not left unused.

In `@tests-unit/assets_test/services/test_enrich.py`:
- Around line 161-195: The test mocks create_session via mock_cs but
enrich_asset returns ENRICHMENT_STUB before any DB session is created, so the
mock is unused; update the test_missing_file_returns_stub_level to either remove
the mock_cs setup or keep it but add a clear assertion or comment to document
intent (e.g., call mock_cs.assert_not_called() after enrich_asset) so readers
know the DB pathway wasn't exercised—refer to mock_cs, enrich_asset, and the
test_missing_file_returns_stub_level function when making the change.
- Around line 61-69: Add a reusable pytest fixture (e.g., patch_scanner_session)
that defines the contextmanager _create_session which yields Session(db_engine)
and patches app.assets.scanner.create_session with side_effect=_create_session;
place this fixture in conftest.py or at the top of the test module and have each
test that currently uses the repeated with patch(...) block accept
patch_scanner_session as a parameter and remove the duplicated patch code,
keeping references to create_session, _create_session, Session, and db_engine
consistent.

In `@tests-unit/assets_test/test_crud.py`:
- Around line 5-8: Replace the top-level import of the local helpers module with
an explicit relative import to make the intent clear and robust: in test_crud.py
change the line importing helpers to use a relative form (e.g., import the
module as from . import helpers or import the functions as from .helpers import
get_asset_filename, trigger_sync_seed_assets) so that the local helpers module
and the functions get_asset_filename and trigger_sync_seed_assets are
unambiguously loaded.

In `@tests-unit/assets_test/test_file_utils.py`:
- Around line 1-55: Add unit tests for get_mtime_ns, get_size_and_mtime_ns, and
verify_file_unchanged to cover the missing change-detection edge cases: assert
get_mtime_ns and get_size_and_mtime_ns return correct nanosecond mtime and size
for real files (use tmp_path to create files), add verify_file_unchanged tests
for mtime_db=None (should return False), size_db=None (should skip size check
and only compare mtime), explicit mtime mismatch and size mismatch cases (return
False), and a precision-sensitive test where mtime differs only at the
nanosecond level to verify the nanosecond equality path; reference the functions
get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged in your new tests
and use tmp_path and monkeypatch/os.stat stubs as needed to control mtime/size
values.

---

Outside diff comments:
In `@app/assets/api/routes.py`:
- Around line 246-253: The async generator stream_file_chunks uses synchronous
file I/O (open and f.read) which blocks the event loop; replace those blocking
calls by offloading them to a thread executor (e.g., asyncio.to_thread or
asyncio.get_running_loop().run_in_executor) or switch to aiohttp's FileResponse;
specifically, open the file via to_thread/run_in_executor, perform each f.read
via to_thread/run_in_executor, and ensure the file is closed in a finally block
(or use FileResponse which handles streaming and headers for you) so streaming
no longer blocks the event loop.
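A minimal sketch of that offloading pattern follows; the function name, chunk size, and surrounding harness are illustrative assumptions, not the actual route code:

```python
import asyncio
import os
import tempfile

async def stream_file_chunks(path: str, chunk_size: int = 1024 * 1024):
    # Hypothetical async generator: open/read/close all run in worker
    # threads via asyncio.to_thread, so the event loop never blocks on disk I/O.
    f = await asyncio.to_thread(open, path, "rb")
    try:
        while True:
            chunk = await asyncio.to_thread(f.read, chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        # Close in a finally block so the handle is released even if
        # the consumer abandons the stream mid-way.
        await asyncio.to_thread(f.close)

async def read_all(path: str) -> bytes:
    data = bytearray()
    async for chunk in stream_file_chunks(path, chunk_size=1024):
        data.extend(chunk)
    return bytes(data)

# Demo harness: write a temp file and stream it back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
    path = tmp.name
try:
    result = asyncio.run(read_all(path))
finally:
    os.remove(path)
```

For aiohttp specifically, `web.FileResponse` is the simpler alternative when no custom chunking is needed, since it handles streaming and headers itself.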

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5ae670 and 185100c.

📒 Files selected for processing (58)
  • alembic_db/versions/0002_merge_to_asset_references.py
  • app/assets/api/routes.py
  • app/assets/api/schemas_in.py
  • app/assets/api/schemas_out.py
  • app/assets/api/upload.py
  • app/assets/database/bulk_ops.py
  • app/assets/database/models.py
  • app/assets/database/queries.py
  • app/assets/database/queries/__init__.py
  • app/assets/database/queries/asset.py
  • app/assets/database/queries/asset_reference.py
  • app/assets/database/queries/common.py
  • app/assets/database/queries/tags.py
  • app/assets/database/tags.py
  • app/assets/hashing.py
  • app/assets/helpers.py
  • app/assets/manager.py
  • app/assets/scanner.py
  • app/assets/seeder.py
  • app/assets/services/__init__.py
  • app/assets/services/asset_management.py
  • app/assets/services/bulk_ingest.py
  • app/assets/services/file_utils.py
  • app/assets/services/hashing.py
  • app/assets/services/ingest.py
  • app/assets/services/metadata_extract.py
  • app/assets/services/path_utils.py
  • app/assets/services/schemas.py
  • app/assets/services/tagging.py
  • app/database/db.py
  • main.py
  • requirements.txt
  • server.py
  • tests-unit/assets_test/conftest.py
  • tests-unit/assets_test/helpers.py
  • tests-unit/assets_test/queries/conftest.py
  • tests-unit/assets_test/queries/test_asset.py
  • tests-unit/assets_test/queries/test_asset_info.py
  • tests-unit/assets_test/queries/test_cache_state.py
  • tests-unit/assets_test/queries/test_metadata.py
  • tests-unit/assets_test/queries/test_tags.py
  • tests-unit/assets_test/services/__init__.py
  • tests-unit/assets_test/services/conftest.py
  • tests-unit/assets_test/services/test_asset_management.py
  • tests-unit/assets_test/services/test_bulk_ingest.py
  • tests-unit/assets_test/services/test_enrich.py
  • tests-unit/assets_test/services/test_ingest.py
  • tests-unit/assets_test/services/test_tagging.py
  • tests-unit/assets_test/test_assets_missing_sync.py
  • tests-unit/assets_test/test_crud.py
  • tests-unit/assets_test/test_downloads.py
  • tests-unit/assets_test/test_file_utils.py
  • tests-unit/assets_test/test_list_filter.py
  • tests-unit/assets_test/test_prune_orphaned_assets.py
  • tests-unit/assets_test/test_tags_api.py
  • tests-unit/assets_test/test_uploads.py
  • tests-unit/requirements.txt
  • tests-unit/seeder_test/test_seeder.py
💤 Files with no reviewable changes (7)
  • tests-unit/requirements.txt
  • app/assets/hashing.py
  • tests-unit/assets_test/conftest.py
  • app/assets/database/tags.py
  • app/assets/database/bulk_ops.py
  • app/assets/manager.py
  • app/assets/database/queries.py

Comment on lines +97 to +106
op.create_index(
    "ix_asset_references_enrichment_level", "asset_references", ["enrichment_level"]
)
op.create_index("ix_asset_references_created_at", "asset_references", ["created_at"])
op.create_index(
    "ix_asset_references_last_access_time", "asset_references", ["last_access_time"]
)
op.create_index(
    "ix_asset_references_owner_name", "asset_references", ["owner_id", "name"]
)

⚠️ Potential issue | 🟡 Minor

Missing index on updated_at — sorting by updated_at will full-scan at scale.

created_at and last_access_time get explicit indexes, but updated_at is omitted. The integration test test_list_assets_sort_by_updated_at_desc confirms the API supports and exercises ORDER BY updated_at DESC. Without an index, every such sort on a large asset_references table requires a full table scan + sort.

🔧 Proposed fix (add to upgrade(), and mirror in models.py)
     op.create_index(
         "ix_asset_references_owner_name", "asset_references", ["owner_id", "name"]
     )
+    op.create_index(
+        "ix_asset_references_updated_at", "asset_references", ["updated_at"]
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@alembic_db/versions/0002_merge_to_asset_references.py` around lines 97 - 106,
The migration is missing an index for updated_at which causes ORDER BY
updated_at DESC to full-scan; in the Alembic upgrade() add
op.create_index("ix_asset_references_updated_at", "asset_references",
["updated_at"]) (and add the corresponding op.drop_index in downgrade()), and
mirror the change in the SQLAlchemy model definition (e.g., in models.py on the
AssetReference/table definition add an Index("ix_asset_references_updated_at",
AssetReference.updated_at) or include updated_at in the model's __table_args__)
so the ORM and migration stay consistent.

Comment on lines +42 to +55
class HashMismatchError(Exception):
    """Uploaded file hash does not match provided hash."""

    def __init__(self, message: str):
        super().__init__(message)
        self.message = message


class DependencyMissingError(Exception):
    """A required dependency is not installed."""

    def __init__(self, message: str):
        super().__init__(message)
        self.message = message

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify duplicate exception definitions
echo "=== HashMismatchError definitions ==="
rg -n 'class HashMismatchError' --type=py

echo ""
echo "=== DependencyMissingError definitions ==="
rg -n 'class DependencyMissingError' --type=py

echo ""
echo "=== Import sites ==="
rg -n 'import.*HashMismatchError' --type=py
rg -n 'import.*DependencyMissingError' --type=py

Repository: Comfy-Org/ComfyUI

Length of output: 450


Consolidate duplicate HashMismatchError and DependencyMissingError definitions.

These exception classes are defined in both app/assets/services/ingest.py (lines 232, 236) and app/assets/api/schemas_in.py (lines 42, 50). This creates two independent classes with the same name, breaking exception handling across module boundaries—code catching one won't catch the other, causing silently unhandled errors if the wrong module's exception is raised.

Define these exceptions in a single location and import them from there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/schemas_in.py` around lines 42 - 55, Consolidate the duplicate
exception classes by moving the single definitions of HashMismatchError and
DependencyMissingError into one central module (create a shared exceptions
module), delete the duplicate class definitions from both schemas_in.py and
ingest.py, and update both modules to import HashMismatchError and
DependencyMissingError from that shared module so all code catches the same
exception types.
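The cross-module breakage described above can be reproduced in a few lines; the class names here are illustrative stand-ins for the two independent definitions:

```python
# Two modules each defining "the same" exception independently.
class HashMismatchErrorA(Exception):  # stand-in for services/ingest.py's class
    pass

class HashMismatchErrorB(Exception):  # stand-in for api/schemas_in.py's class
    pass

def ingest():
    # Raises module A's exception.
    raise HashMismatchErrorA("hash mismatch")

# A caller that imported the *other* class never catches it, because
# `except` matches by class identity, not by name.
try:
    ingest()
except HashMismatchErrorB:
    caught_wrong_class = True
except HashMismatchErrorA:
    caught_wrong_class = False
```

A single shared definition (for example a dedicated exceptions module imported by both callers) eliminates this failure mode.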

Comment on lines +111 to +128
uploads_root = os.path.join(folder_paths.get_temp_directory(), "uploads")
unique_dir = os.path.join(uploads_root, uuid.uuid4().hex)
os.makedirs(unique_dir, exist_ok=True)
tmp_path = os.path.join(unique_dir, ".upload.part")

try:
    with open(tmp_path, "wb") as f:
        while True:
            chunk = await field.read_chunk(8 * 1024 * 1024)
            if not chunk:
                break
            f.write(chunk)
            file_written += len(chunk)
except Exception:
    delete_temp_file_if_exists(tmp_path)
    raise UploadError(
        500, "UPLOAD_IO_ERROR", "Failed to receive and store uploaded file."
    )

🧹 Nitpick | 🔵 Trivial

Orphaned temp directory left behind on cleanup.

delete_temp_file_if_exists removes the file but not the parent unique_dir created here. Over time, empty directories accumulate under temp/uploads/.

♻️ Proposed fix for delete_temp_file_if_exists
 def delete_temp_file_if_exists(tmp_path: str | None) -> None:
     """Safely remove a temp file if it exists."""
     if tmp_path:
         try:
             if os.path.exists(tmp_path):
                 os.remove(tmp_path)
+            parent = os.path.dirname(tmp_path)
+            if parent and os.path.isdir(parent):
+                try:
+                    os.rmdir(parent)  # only succeeds if empty
+                except OSError:
+                    pass
         except OSError as e:
             logging.debug("Failed to delete temp file %s: %s", tmp_path, e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/upload.py` around lines 111 - 128, The try/except currently
calls delete_temp_file_if_exists(tmp_path) but leaves the parent unique_dir
behind, causing orphaned empty directories; update the cleanup so that after
removing tmp_path you also remove its parent directory (unique_dir) if it's
empty: locate the variables uploads_root, unique_dir, tmp_path and the call to
delete_temp_file_if_exists in the except block and modify cleanup to remove the
file and then attempt to remove unique_dir (e.g., os.rmdir(unique_dir) guarded
by os.path.isdir(unique_dir) and catching OSError if not empty) before raising
the UploadError so the temporary upload directory is not left behind.

Comment on lines +50 to +54
@pytest.fixture
def temp_dir():
    """Temporary directory for file operations."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield Path(tmpdir)

🧹 Nitpick | 🔵 Trivial

temp_dir duplicates pytest's built-in tmp_path fixture.

tmp_path already returns a Path to a unique per-test temporary directory. test_file_utils.py uses tmp_path directly, so this fixture appears unused by the existing test suite.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/services/conftest.py` around lines 50 - 54, The
temp_dir pytest fixture duplicates built-in tmp_path; remove the custom temp_dir
fixture from conftest.py (the function named temp_dir) and update any tests that
reference temp_dir to use pytest's tmp_path instead (e.g., change temp_dir ->
tmp_path in tests such as test_file_utils.py if present), ensuring
imports/fixtures are not left unused.

Comment on lines +61 to +69
with patch("app.assets.scanner.create_session") as mock_cs:
    from contextlib import contextmanager

    @contextmanager
    def _create_session():
        with Session(db_engine) as sess:
            yield sess

    mock_cs.side_effect = _create_session

🛠️ Refactor suggestion | 🟠 Major

Extract the repeated create_session patch into a shared fixture.

The identical context-manager mock for app.assets.scanner.create_session is duplicated in every test method (5 times). A @pytest.fixture (or conftest.py helper) would remove the boilerplate and make the tests easier to maintain.

♻️ Example fixture
# In conftest.py or at top of this file
@pytest.fixture
def patch_scanner_session(db_engine):
    """Patch scanner's create_session to use the test engine."""
    from contextlib import contextmanager
    from unittest.mock import patch

    @contextmanager
    def _create_session():
        with Session(db_engine) as sess:
            yield sess

    with patch("app.assets.scanner.create_session", side_effect=_create_session):
        yield

Then each test simply requests the fixture:

def test_extracts_mime_type_and_updates_asset(
    self, patch_scanner_session, temp_dir: Path, session: Session
):
    # ... test body without the `with patch(...)` block
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/services/test_enrich.py` around lines 61 - 69, Add a
reusable pytest fixture (e.g., patch_scanner_session) that defines the
contextmanager _create_session which yields Session(db_engine) and patches
app.assets.scanner.create_session with side_effect=_create_session; place this
fixture in conftest.py or at the top of the test module and have each test that
currently uses the repeated with patch(...) block accept patch_scanner_session
as a parameter and remove the duplicated patch code, keeping references to
create_session, _create_session, Session, and db_engine consistent.

Comment on lines +161 to +195
def test_missing_file_returns_stub_level(
    self, db_engine, temp_dir: Path, session: Session
):
    """Verify missing files don't cause errors and return STUB level."""
    file_path = temp_dir / "nonexistent.bin"

    asset, ref = _create_stub_asset(
        session, str(file_path), "asset-4", "ref-4"
    )
    session.commit()

    with patch("app.assets.scanner.create_session") as mock_cs:
        from contextlib import contextmanager

        @contextmanager
        def _create_session():
            with Session(db_engine) as sess:
                yield sess

        mock_cs.side_effect = _create_session

        new_level = enrich_asset(
            file_path=str(file_path),
            reference_id=ref.id,
            asset_id=asset.id,
            extract_metadata=True,
            compute_hash=True,
        )

    assert new_level == ENRICHMENT_STUB

    session.expire_all()
    updated_asset = session.get(Asset, "asset-4")
    assert updated_asset.mime_type is None
    assert updated_asset.hash is None

🧹 Nitpick | 🔵 Trivial

Note: create_session mock is unused for the missing-file case.

enrich_asset returns ENRICHMENT_STUB immediately after the failed os.stat, before any DB call. The mock setup is harmless but unnecessary here. Consider adding a brief comment or assertion (e.g., mock_cs.assert_not_called()) to document the intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/services/test_enrich.py` around lines 161 - 195, The
test mocks create_session via mock_cs but enrich_asset returns ENRICHMENT_STUB
before any DB session is created, so the mock is unused; update the
test_missing_file_returns_stub_level to either remove the mock_cs setup or keep
it but add a clear assertion or comment to document intent (e.g., call
mock_cs.assert_not_called() after enrich_asset) so readers know the DB pathway
wasn't exercised—refer to mock_cs, enrich_asset, and the
test_missing_file_returns_stub_level function when making the change.

Comment on lines 5 to 8
 import pytest
 import requests
-from conftest import get_asset_filename, trigger_sync_seed_assets
+from helpers import get_asset_filename, trigger_sync_seed_assets


🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate all helpers.py files in the repository
echo "=== Locating helpers.py files ==="
fd -a "helpers.py"

echo -e "\n=== Searching for function definitions ==="
# Search for the functions being imported
rg "def (get_asset_filename|trigger_sync_seed_assets)" -A 2

echo -e "\n=== Directory structure around the test file ==="
# Check the directory structure
ls -la tests-unit/assets_test/ 2>/dev/null || echo "Directory not found at expected location"

echo -e "\n=== Contents of test_crud.py for import context ==="
# Show the imports and surrounding context
head -15 tests-unit/assets_test/test_crud.py

Repository: Comfy-Org/ComfyUI

Length of output: 2408


Clarify the relative import to improve maintainability.

The from helpers import ... on line 7 correctly resolves to the local tests-unit/assets_test/helpers.py since both imported functions are defined there and Python prioritizes the local module. However, using an explicit relative import (from . import helpers or from .helpers import get_asset_filename, trigger_sync_seed_assets) would make the intent clearer and is more robust to future refactoring.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/test_crud.py` around lines 5 - 8, Replace the
top-level import of the local helpers module with an explicit relative import to
make the intent clear and robust: in test_crud.py change the line importing
helpers to use a relative form (e.g., import the module as from . import helpers
or import the functions as from .helpers import get_asset_filename,
trigger_sync_seed_assets) so that the local helpers module and the functions
get_asset_filename and trigger_sync_seed_assets are unambiguously loaded.

Comment on lines +1 to +55
from app.assets.services.file_utils import is_visible, list_files_recursively


class TestIsVisible:
    def test_visible_file(self):
        assert is_visible("file.txt") is True

    def test_hidden_file(self):
        assert is_visible(".hidden") is False

    def test_hidden_directory(self):
        assert is_visible(".git") is False

    def test_visible_directory(self):
        assert is_visible("src") is True

    def test_dotdot_is_hidden(self):
        assert is_visible("..") is False

    def test_dot_is_hidden(self):
        assert is_visible(".") is False


class TestListFilesRecursively:
    def test_skips_hidden_files(self, tmp_path):
        (tmp_path / "visible.txt").write_text("a")
        (tmp_path / ".hidden").write_text("b")

        result = list_files_recursively(str(tmp_path))

        assert len(result) == 1
        assert result[0].endswith("visible.txt")

    def test_skips_hidden_directories(self, tmp_path):
        hidden_dir = tmp_path / ".hidden_dir"
        hidden_dir.mkdir()
        (hidden_dir / "file.txt").write_text("a")

        visible_dir = tmp_path / "visible_dir"
        visible_dir.mkdir()
        (visible_dir / "file.txt").write_text("b")

        result = list_files_recursively(str(tmp_path))

        assert len(result) == 1
        assert "visible_dir" in result[0]
        assert ".hidden_dir" not in result[0]

    def test_empty_directory(self, tmp_path):
        result = list_files_recursively(str(tmp_path))
        assert result == []

    def test_nonexistent_directory(self, tmp_path):
        result = list_files_recursively(str(tmp_path / "nonexistent"))
        assert result == []

🧹 Nitpick | 🔵 Trivial

get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged have no test coverage.

These three functions drive the core change-detection logic (whether an asset is re-enriched or skipped). Key verify_file_unchanged edge cases that should be covered: mtime_db=None → False, size_db=None → skip size check, mtime mismatch, size mismatch, and the precision-sensitive path where nanosecond mtime equality is evaluated.

Do you want me to draft the missing test cases, or should I open a follow-up issue to track this?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/test_file_utils.py` around lines 1 - 55, Add unit
tests for get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged to
cover the missing change-detection edge cases: assert get_mtime_ns and
get_size_and_mtime_ns return correct nanosecond mtime and size for real files
(use tmp_path to create files), add verify_file_unchanged tests for
mtime_db=None (should return False), size_db=None (should skip size check and
only compare mtime), explicit mtime mismatch and size mismatch cases (return
False), and a precision-sensitive test where mtime differs only at the
nanosecond level to verify the nanosecond equality path; reference the functions
get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged in your new tests
and use tmp_path and monkeypatch/os.stat stubs as needed to control mtime/size
values.
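As a sketch of the missing coverage, assuming a signature like `verify_file_unchanged(path, mtime_db, size_db)` with the semantics the comment describes (the real function lives in `app.assets.services.file_utils` and may differ):

```python
import os
import tempfile

def verify_file_unchanged(path: str, mtime_db, size_db) -> bool:
    # Hypothetical reference implementation of the described semantics:
    # no recorded mtime means "changed"; a recorded size must also match.
    try:
        st = os.stat(path)
    except OSError:
        return False
    if mtime_db is None:
        return False
    if st.st_mtime_ns != mtime_db:
        return False
    if size_db is not None and st.st_size != size_db:
        return False
    return True

# Exercise the edge cases against a real file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"data")
    path = f.name
st = os.stat(path)

unchanged = verify_file_unchanged(path, st.st_mtime_ns, st.st_size)      # True
no_mtime = verify_file_unchanged(path, None, st.st_size)                 # False: mtime_db=None
mtime_off = verify_file_unchanged(path, st.st_mtime_ns + 1, st.st_size)  # False: 1 ns difference
size_off = verify_file_unchanged(path, st.st_mtime_ns, st.st_size + 1)   # False: size mismatch
skip_size = verify_file_unchanged(path, st.st_mtime_ns, None)            # True: size check skipped
os.remove(path)
```

The `+ 1` nanosecond case is the precision-sensitive path: comparing `st_mtime_ns` integers directly avoids the float truncation that `st_mtime` would introduce.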

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +46 to +58
def get_query_dict(request: web.Request) -> dict[str, Any]:
    """Gets a dictionary of query parameters from the request.

    request.query is a MultiMapping[str], needs to be converted to a dict
    to be validated by Pydantic.
    """
    query_dict = {
        key: request.query.getall(key)
        if len(request.query.getall(key)) > 1
        else request.query.get(key)
        for key in request.query.keys()
    }
    return query_dict

🧹 Nitpick | 🔵 Trivial

get_query_dict may emit duplicate keys from the multidict.

request.query.keys() on aiohttp's MultiDict can yield the same key multiple times. Using it directly in a dict comprehension means earlier entries are overwritten by later ones, but the final value is always computed via getall/get, so the result itself is correct. However, iterating redundantly is wasteful. Consider using set(request.query.keys()) or just request.query iteration semantics.

Minor: deduplicate keys
     query_dict = {
         key: request.query.getall(key)
         if len(request.query.getall(key)) > 1
         else request.query.get(key)
-        for key in request.query.keys()
+        for key in set(request.query.keys())
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 46 - 58, get_query_dict currently
iterates over request.query.keys() which can yield duplicate keys from aiohttp's
MultiDict, wasting work; change the dict comprehension to iterate over a
de-duplicated set of keys (e.g., use set(request.query.keys())) so each key is
processed once while keeping the existing getall/get logic in get_query_dict.

Comment on lines 580 to +603
     try:
-        result = manager.remove_tags_from_asset(
-            asset_info_id=asset_info_id,
+        result = remove_tags(
+            reference_id=reference_id,
             tags=data.tags,
             owner_id=USER_MANAGER.get_request_user_id(request),
         )
+        payload = schemas_out.TagsRemove(
+            removed=result.removed,
+            not_present=result.not_present,
+            total_tags=result.total_tags,
+        )
     except ValueError as ve:
-        return _error_response(404, "ASSET_NOT_FOUND", str(ve), {"id": asset_info_id})
+        return _build_error_response(
+            404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
+        )
     except Exception:
         logging.exception(
-            "remove_tags_from_asset failed for asset_info_id=%s, owner_id=%s",
-            asset_info_id,
+            "remove_tags_from_asset failed for reference_id=%s, owner_id=%s",
+            reference_id,
             USER_MANAGER.get_request_user_id(request),
         )
-        return _error_response(500, "INTERNAL", "Unexpected server error.")
+        return _build_error_response(500, "INTERNAL", "Unexpected server error.")

-    return web.json_response(result.model_dump(mode="json"), status=200)
+    return web.json_response(payload.model_dump(mode="json"), status=200)

⚠️ Potential issue | 🟠 Major

delete_asset_tags doesn't catch PermissionError, unlike sibling handlers.

add_asset_tags (Line 547) and update_asset_route (Line 436) both catch (ValueError, PermissionError), but delete_asset_tags only catches ValueError (Line 591). If remove_tags raises PermissionError for a non-owner, it falls through to the generic Exception handler and returns a 500 instead of a 404.

Proposed fix
-    except ValueError as ve:
+    except (ValueError, PermissionError) as ve:
         return _build_error_response(
             404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
    result = remove_tags(
        reference_id=reference_id,
        tags=data.tags,
        owner_id=USER_MANAGER.get_request_user_id(request),
    )
    payload = schemas_out.TagsRemove(
        removed=result.removed,
        not_present=result.not_present,
        total_tags=result.total_tags,
    )
except (ValueError, PermissionError) as ve:
    return _build_error_response(
        404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
    )
except Exception:
    logging.exception(
        "remove_tags_from_asset failed for reference_id=%s, owner_id=%s",
        reference_id,
        USER_MANAGER.get_request_user_id(request),
    )
    return _build_error_response(500, "INTERNAL", "Unexpected server error.")

return web.json_response(payload.model_dump(mode="json"), status=200)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 580 - 603, delete_asset_tags currently
only catches ValueError from remove_tags so PermissionError bubbles to the
generic handler and returns 500; update the exception handling in the
delete_asset_tags route to catch PermissionError (same as add_asset_tags and
update_asset_route) and map it to a 404 using _build_error_response with the
same payload shape as the ValueError branch. Locate the try/except around the
remove_tags call in delete_asset_tags and either add PermissionError to the
existing except (e.g., except (ValueError, PermissionError) as e) or add a
separate except PermissionError block that returns _build_error_response(404,
"ASSET_NOT_FOUND", str(e), {"id": reference_id}); keep the logging and other
exception handling for the generic Exception branch unchanged.

Comment on lines +628 to +662
wait_param = request.query.get("wait", "").lower()
should_wait = wait_param in ("true", "1", "yes")

# Temporarily enable seeder for explicit API calls (--disable-assets-autoscan
# only prevents the automatic startup scan, not manual triggers)
was_disabled = asset_seeder.is_disabled()
if was_disabled:
    asset_seeder.enable()

started = asset_seeder.start(roots=valid_roots)
if not started:
    if was_disabled:
        asset_seeder.disable()
    return web.json_response({"status": "already_running"}, status=409)

if should_wait:
    asset_seeder.wait()
    if was_disabled:
        asset_seeder.disable()
    status = asset_seeder.get_status()
    return web.json_response(
        {
            "status": "completed",
            "progress": {
                "scanned": status.progress.scanned if status.progress else 0,
                "total": status.progress.total if status.progress else 0,
                "created": status.progress.created if status.progress else 0,
                "skipped": status.progress.skipped if status.progress else 0,
            },
            "errors": status.errors,
        },
        status=200,
    )

return web.json_response({"status": "started"}, status=202)

⚠️ Potential issue | 🔴 Critical

Two issues: blocking wait() freezes the event loop, and seeder is never re-disabled on the async (non-wait) path.

  1. Event-loop blocking (critical): asset_seeder.wait() (Line 644) is a synchronous blocking call (likely Thread.join()) invoked directly inside an async def handler. This freezes the entire aiohttp event loop until the scan completes. Wrap it in await asyncio.to_thread(asset_seeder.wait) or await loop.run_in_executor(None, asset_seeder.wait).

  2. Seeder stays enabled after async start (bug): When was_disabled=True and should_wait=False, the handler returns at Line 662 without calling asset_seeder.disable(). The seeder remains permanently enabled after a single manual trigger, contradicting the --disable-assets-autoscan flag.

Proposed fix
+    import asyncio
+
     if should_wait:
-        asset_seeder.wait()
+        await asyncio.to_thread(asset_seeder.wait)
         if was_disabled:
             asset_seeder.disable()
         status = asset_seeder.get_status()
         return web.json_response(
             {
                 "status": "completed",
                 "progress": {
                     "scanned": status.progress.scanned if status.progress else 0,
                     "total": status.progress.total if status.progress else 0,
                     "created": status.progress.created if status.progress else 0,
                     "skipped": status.progress.skipped if status.progress else 0,
                 },
                 "errors": status.errors,
             },
             status=200,
         )

+    # For async path: schedule re-disable after scan completes
+    if was_disabled:
+        def _restore_disabled():
+            asset_seeder.wait()
+            asset_seeder.disable()
+        threading.Thread(target=_restore_disabled, daemon=True).start()
+
     return web.json_response({"status": "started"}, status=202)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 628 - 662, The handler currently calls
the blocking asset_seeder.wait() directly and also fails to re-disable the
seeder on the non-wait path; change the wait call to run off the event loop
(e.g., await asyncio.to_thread(asset_seeder.wait) or await
loop.run_in_executor(None, asset_seeder.wait)) and ensure that whenever
was_disabled was True you call asset_seeder.disable() before every return path
(both after the awaited wait branch and the non-wait branch where you currently
return {"status":"started"}), keeping the existing logic around
asset_seeder.enable(), asset_seeder.start(roots=valid_roots) and
asset_seeder.get_status() intact.
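The off-loop wait pattern the prompt describes can be sketched in isolation. `FakeSeeder` and `handler` below are illustrative stand-ins for the real seeder API, not the actual code:

```python
import asyncio
import threading
import time

class FakeSeeder:
    """Stand-in for the asset seeder; wait() blocks until the scan is done."""
    def __init__(self):
        self._done = threading.Event()

    def start(self):
        threading.Thread(target=self._scan, daemon=True).start()

    def _scan(self):
        time.sleep(0.1)  # simulate scanning work
        self._done.set()

    def wait(self):
        self._done.wait()

async def handler(seeder: FakeSeeder) -> str:
    seeder.start()
    # The blocking wait runs in a worker thread; the event loop stays free
    # to serve other requests in the meantime.
    await asyncio.to_thread(seeder.wait)
    return "completed"

result = asyncio.run(handler(FakeSeeder()))
print(result)  # completed
```

Calling `seeder.wait()` directly inside the coroutine would freeze every other request on the loop for the duration of the scan; `asyncio.to_thread` is the smallest change that avoids that.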

Comment on lines +695 to +712
@ROUTES.post("/api/assets/prune")
async def mark_missing_assets(request: web.Request) -> web.Response:
"""Mark assets as missing when outside all known root prefixes.

This is a non-destructive soft-delete operation. Assets and metadata
are preserved, but references are flagged as missing. They can be
restored if the file reappears in a future scan.

Returns:
200 OK with count of marked assets
409 Conflict if a scan is currently running
"""
marked = asset_seeder.mark_missing_outside_prefixes()
if marked == 0 and asset_seeder.get_status().state.value != "IDLE":
return web.json_response(
{"status": "scan_running", "marked": 0},
status=409,
)

⚠️ Potential issue | 🟠 Major

Check-after-act: "scan running" guard fires only after the prune has already executed.

mark_missing_outside_prefixes() runs first (Line 707), then the handler checks whether a scan is running (Line 708). If a scan is in progress, the prune may race with the seeder. The guard should be checked before calling mark_missing_outside_prefixes().

Additionally, marked == 0 with an active scan is ambiguous — it could also mean there was genuinely nothing to mark. The condition conflates two different states.

Proposed fix: check state before pruning
 async def mark_missing_assets(request: web.Request) -> web.Response:
+    status = asset_seeder.get_status()
+    if status.state.value != "IDLE":
+        return web.json_response(
+            {"status": "scan_running", "marked": 0},
+            status=409,
+        )
     marked = asset_seeder.mark_missing_outside_prefixes()
-    if marked == 0 and asset_seeder.get_status().state.value != "IDLE":
-        return web.json_response(
-            {"status": "scan_running", "marked": 0},
-            status=409,
-        )
     return web.json_response({"status": "completed", "marked": marked}, status=200)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 695 - 712, The handler
mark_missing_assets currently calls asset_seeder.mark_missing_outside_prefixes()
before checking asset_seeder.get_status().state.value, allowing a race; modify
mark_missing_assets to first inspect asset_seeder.get_status().state.value (or a
helper like is_running) and if the state indicates a running scan return the 409
{"status":"scan_running","marked":0} immediately, only calling
asset_seeder.mark_missing_outside_prefixes() when the seeder is idle; also
remove the combined `marked == 0` check — treat the running-scan check
independently so a zero-mark result is not conflated with an active scan.
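The check-before-act ordering can be shown with a stub; `StubSeeder` and the `is_running` helper are illustrative, not the real seeder interface:

```python
class StubSeeder:
    """Minimal stand-in: tracks a running flag and counts marked refs."""
    def __init__(self, running: bool):
        self._running = running

    def is_running(self) -> bool:
        return self._running

    def mark_missing_outside_prefixes(self) -> int:
        return 3  # pretend three references were marked missing

def prune(seeder: StubSeeder) -> tuple[dict, int]:
    # Guard FIRST: never mutate reference rows while a scan is active.
    if seeder.is_running():
        return {"status": "scan_running", "marked": 0}, 409
    marked = seeder.mark_missing_outside_prefixes()
    return {"status": "completed", "marked": marked}, 200

print(prune(StubSeeder(running=True)))   # ({'status': 'scan_running', 'marked': 0}, 409)
print(prune(StubSeeder(running=False)))  # ({'status': 'completed', 'marked': 3}, 200)
```

With the guard first, a zero-mark result unambiguously means "nothing to mark" rather than "maybe a scan was running".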

Comment on lines +290 to +363
def list_references_page(
session: Session,
owner_id: str = "",
limit: int = 100,
offset: int = 0,
name_contains: str | None = None,
include_tags: Sequence[str] | None = None,
exclude_tags: Sequence[str] | None = None,
metadata_filter: dict | None = None,
sort: str | None = None,
order: str | None = None,
) -> tuple[list[AssetReference], dict[str, list[str]], int]:
"""List references with pagination, filtering, and sorting.

Returns (references, tag_map, total_count).
"""
base = (
select(AssetReference)
.join(Asset, Asset.id == AssetReference.asset_id)
.where(build_visible_owner_clause(owner_id))
.options(noload(AssetReference.tags))
)

if name_contains:
escaped, esc = escape_sql_like_string(name_contains)
base = base.where(AssetReference.name.ilike(f"%{escaped}%", escape=esc))

base = _apply_tag_filters(base, include_tags, exclude_tags)
base = _apply_metadata_filter(base, metadata_filter)

sort = (sort or "created_at").lower()
order = (order or "desc").lower()
sort_map = {
"name": AssetReference.name,
"created_at": AssetReference.created_at,
"updated_at": AssetReference.updated_at,
"last_access_time": AssetReference.last_access_time,
"size": Asset.size_bytes,
}
sort_col = sort_map.get(sort, AssetReference.created_at)
sort_exp = sort_col.desc() if order == "desc" else sort_col.asc()

base = base.order_by(sort_exp).limit(limit).offset(offset)

count_stmt = (
select(sa.func.count())
.select_from(AssetReference)
.join(Asset, Asset.id == AssetReference.asset_id)
.where(build_visible_owner_clause(owner_id))
)
if name_contains:
escaped, esc = escape_sql_like_string(name_contains)
count_stmt = count_stmt.where(
AssetReference.name.ilike(f"%{escaped}%", escape=esc)
)
count_stmt = _apply_tag_filters(count_stmt, include_tags, exclude_tags)
count_stmt = _apply_metadata_filter(count_stmt, metadata_filter)

total = int(session.execute(count_stmt).scalar_one() or 0)
refs = session.execute(base).unique().scalars().all()

id_list: list[str] = [r.id for r in refs]
tag_map: dict[str, list[str]] = defaultdict(list)
if id_list:
rows = session.execute(
select(AssetReferenceTag.asset_reference_id, Tag.name)
.join(Tag, Tag.name == AssetReferenceTag.tag_name)
.where(AssetReferenceTag.asset_reference_id.in_(id_list))
.order_by(AssetReferenceTag.added_at)
)
for ref_id, tag_name in rows.all():
tag_map[ref_id].append(tag_name)

return list(refs), tag_map, total

🧹 Nitpick | 🔵 Trivial

Count query duplicates all filter logic from the data query — divergence risk.

The count_stmt (Lines 334–346) manually re-applies name_contains, _apply_tag_filters, and _apply_metadata_filter independently from the base query. If a new filter is added to base but forgotten in count_stmt, the count will silently diverge from the actual results.

Consider building a shared "filtered base" CTE or subquery, then deriving both the paginated result and the count from it.

♻️ Sketch: derive count from the same filtered base
+    # Build shared filter base
+    filtered = (
+        select(AssetReference.id)
+        .join(Asset, Asset.id == AssetReference.asset_id)
+        .where(build_visible_owner_clause(owner_id))
+    )
+    if name_contains:
+        escaped, esc = escape_sql_like_string(name_contains)
+        filtered = filtered.where(AssetReference.name.ilike(f"%{escaped}%", escape=esc))
+    filtered = _apply_tag_filters(filtered, include_tags, exclude_tags)
+    filtered = _apply_metadata_filter(filtered, metadata_filter)
+    filtered_sub = filtered.subquery()
+
+    # Count
+    total = int(
+        session.execute(select(sa.func.count()).select_from(filtered_sub)).scalar_one() or 0
+    )
+
+    # Paginated data query
+    ...
+    base = (
+        select(AssetReference)
+        .join(Asset, Asset.id == AssetReference.asset_id)
+        .where(AssetReference.id.in_(select(filtered_sub.c.id)))
+        .options(noload(AssetReference.tags))
+        .order_by(sort_exp)
+        .limit(limit)
+        .offset(offset)
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 290 - 363, The
count query duplicates the filtering logic in list_references_page (see
variables base and count_stmt) which risks divergence; refactor by constructing
a single filtered selectable (e.g., filtered_base or filtered_subq) that applies
build_visible_owner_clause, name_contains escaping, _apply_tag_filters and
_apply_metadata_filter once, then derive the paginated query from
filtered_base.order_by(...).limit(...).offset(...) and derive the total with
select(sa.func.count()).select_from(filtered_base.subquery()/alias()) so both
queries share the exact same filters; keep sorting/pagination only on the
paginated query and reuse the same filtered_base when loading tags (id_list).
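The shared-filter idea is independent of SQLAlchemy; a minimal plain-`sqlite3` sketch (table name and filter are illustrative) shows count and page deriving from one CTE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refs (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO refs(name) VALUES ('alpha'), ('beta'), ('alphabet'), ('gamma');
""")

# One shared CTE carries every filter; the count and the page both derive
# from it, so a filter added later cannot silently diverge between the two.
FILTERED = "WITH filtered AS (SELECT id FROM refs WHERE name LIKE 'alpha%')"

total = conn.execute(f"{FILTERED} SELECT COUNT(*) FROM filtered").fetchone()[0]
page = conn.execute(
    f"{FILTERED} SELECT r.* FROM refs r JOIN filtered f ON f.id = r.id "
    "ORDER BY r.name LIMIT 1 OFFSET 0"
).fetchall()

print(total)      # 2
print(len(page))  # 1
```

In SQLAlchemy the same shape falls out of `filtered.subquery()` feeding both `select(func.count()).select_from(...)` and the paginated `select(...)`, as the diff above sketches.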

Comment on lines +945 to +972
def get_references_by_paths_and_asset_ids(
session: Session,
path_to_asset: dict[str, str],
) -> set[str]:
"""Query references to find paths where our asset_id won the insert.

Args:
path_to_asset: Mapping of file_path -> asset_id we tried to insert

Returns:
Set of file_paths where our asset_id is present
"""
if not path_to_asset:
return set()

paths = list(path_to_asset.keys())
winners: set[str] = set()

for chunk in iter_chunks(paths, MAX_BIND_PARAMS):
result = session.execute(
select(AssetReference.file_path).where(
AssetReference.file_path.in_(chunk),
AssetReference.asset_id.in_([path_to_asset[p] for p in chunk]),
)
)
winners.update(result.scalars().all())

return winners

⚠️ Potential issue | 🟠 Major

IN / IN cross-product doesn't correctly correlate path↔asset_id pairs.

The query uses two independent IN clauses (Lines 966–968):

file_path IN ('/a', '/b')  AND  asset_id IN ('1', '2')

This matches any combination, e.g. (path='/a', asset_id='2') even though path_to_asset maps /a → '1'. A reference that belongs to a different asset could be incorrectly reported as "winning" the insert.

Proposed fix: use correlated OR conditions per pair
 def get_references_by_paths_and_asset_ids(
     session: Session,
     path_to_asset: dict[str, str],
 ) -> set[str]:
     if not path_to_asset:
         return set()

     paths = list(path_to_asset.keys())
     winners: set[str] = set()

     for chunk in iter_chunks(paths, MAX_BIND_PARAMS // 2):
-        result = session.execute(
-            select(AssetReference.file_path).where(
-                AssetReference.file_path.in_(chunk),
-                AssetReference.asset_id.in_([path_to_asset[p] for p in chunk]),
-            )
-        )
+        pair_conds = [
+            sa.and_(
+                AssetReference.file_path == p,
+                AssetReference.asset_id == path_to_asset[p],
+            )
+            for p in chunk
+        ]
+        result = session.execute(
+            select(AssetReference.file_path).where(sa.or_(*pair_conds))
+        )
         winners.update(result.scalars().all())

     return winners

Note: adjust chunk size for the doubled bind params per pair.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 945 - 972, The
query in get_references_by_paths_and_asset_ids uses two independent IN clauses
and causes cross-product matches; replace the IN/IN with a disjunction of
correlated conditions so each pair is matched exactly (i.e. build an OR of
(AssetReference.file_path == path AND AssetReference.asset_id == asset_id) for
each entry in the chunk), adjust chunking to account for roughly two bind params
per pair (e.g. use MAX_BIND_PARAMS // 2 or similar), execute the select with the
combined or_ condition targeting AssetReference.file_path and
AssetReference.asset_id, and continue to collect winners from
result.scalars().all(); use iter_chunks and MAX_BIND_PARAMS symbols to locate
where to change logic.
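The cross-product failure mode is easy to reproduce with plain `sqlite3`; the table and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refs (file_path TEXT, asset_id TEXT);
-- In the DB, '/a' belongs to asset '2', but our mapping says '/a' -> '1'.
INSERT INTO refs VALUES ('/a', '2');
""")

# Independent IN clauses: the row matches because '/a' is in the path set
# AND '2' is in the asset set -- even though ('/a', '2') is not a pair we sent.
cross = conn.execute(
    "SELECT file_path FROM refs "
    "WHERE file_path IN ('/a', '/b') AND asset_id IN ('1', '2')"
).fetchall()

# Correlated pairs: only exact (file_path, asset_id) combinations match.
pairs = conn.execute(
    "SELECT file_path FROM refs WHERE (file_path = ? AND asset_id = ?) "
    "OR (file_path = ? AND asset_id = ?)",
    ("/a", "1", "/b", "2"),
).fetchall()

print(cross)  # [('/a',)] -- false positive from the cross-product
print(pairs)  # []        -- correct: no submitted pair is present
```

The correlated form costs two bind parameters per pair, which is why the suggested fix halves the chunk size.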

luke-mino-altherr and others added 20 commits February 24, 2026 11:34
Split the ~1000 line app/assets/database/queries.py into focused modules:

- queries/asset.py - Asset entity queries (asset_exists_by_hash, get_asset_by_hash)
- queries/asset_info.py - AssetInfo queries (~15 functions)
- queries/cache_state.py - AssetCacheState queries (list_cache_states_by_asset_id,
  pick_best_live_path, prune_orphaned_assets, fast_db_consistency_pass)
- queries/tags.py - Tag queries (8 functions including ensure_tags_exist,
  add/remove tag functions, list_tags_with_usage)
- queries/__init__.py - Re-exports all public functions for backward compatibility

Also adds comprehensive unit tests using in-memory SQLite:
- tests-unit/assets_test/queries/conftest.py - Session fixture
- tests-unit/assets_test/queries/test_asset.py - 5 tests
- tests-unit/assets_test/queries/test_asset_info.py - 23 tests
- tests-unit/assets_test/queries/test_cache_state.py - 8 tests
- tests-unit/assets_test/queries/test_metadata.py - 12 tests for _apply_metadata_filter
- tests-unit/assets_test/queries/test_tags.py - 23 tests

All 71 unit tests pass. Existing integration tests unaffected.

Amp-Thread-ID: https://ampcode.com/threads/T-019c24bb-475b-7442-9ff9-8288edea3345
Co-authored-by: Amp <amp@ampcode.com>
Architecture changes:
- API Routes -> manager.py (thin adapter) -> services/ (business logic) -> queries/ (atomic DB ops)
- Services own session lifecycle via create_session()
- Queries accept Session as parameter, do single-table atomic operations

New app/assets/services/ layer:
- __init__.py - exports all service functions
- ingest.py - ingest_file_from_path(), register_existing_asset()
- asset_management.py - get_asset_detail(), update_asset_metadata(), delete_asset_reference(), set_asset_preview()
- tagging.py - apply_tags(), remove_tags(), list_tags()

Removed from queries/asset_info.py:
- ingest_fs_asset (moved to services/ingest.py as ingest_file_from_path)
- update_asset_info_full (moved to services/asset_management.py as update_asset_metadata)
- create_asset_info_for_existing_asset (moved to services/ingest.py as register_existing_asset)

Updated manager.py:
- Now a thin adapter that transforms API schemas to/from service calls
- Delegates all business logic to services layer
- No longer imports sqlalchemy.orm.Session or models directly

Test updates:
- Fixed test_cache_state.py import of pick_best_live_path (moved to helpers.py)
- Added comprehensive service layer tests (41 new tests)
- All 112 query + service tests pass

Amp-Thread-ID: https://ampcode.com/threads/T-019c24e2-7ae4-707f-ad19-c775ed8b82b5
Co-authored-by: Amp <amp@ampcode.com>
…ly argument pattern

- Move imports from inside functions to module top-level in:
  - app/assets/database/queries/asset.py
  - app/assets/database/queries/asset_info.py
  - app/assets/database/queries/cache_state.py
  - app/assets/manager.py
  - app/assets/services/asset_management.py
  - app/assets/services/ingest.py

- Remove keyword-only argument markers (*,) from app/assets/ to match codebase conventions

Amp-Thread-ID: https://ampcode.com/threads/T-019c24eb-bfa2-727f-8212-8bc976048604
Co-authored-by: Amp <amp@ampcode.com>
- Move app/assets/scanner.py to app/assets/services/scanner.py
- Extract pure queries from fast_db_consistency_pass:
  - get_cache_states_for_prefixes()
  - bulk_set_needs_verify()
  - delete_cache_states_by_ids()
  - delete_orphaned_seed_asset()
- Split prune_orphaned_assets into pure queries:
  - delete_cache_states_outside_prefixes()
  - get_orphaned_seed_asset_ids()
  - delete_assets_by_ids()
- Add reconcile_cache_states_for_root() service function
- Add prune_orphaned_assets() service function
- Remove function injection pattern
- Update imports in main.py, server.py, routes.py

Amp-Thread-ID: https://ampcode.com/threads/T-019c24f1-3385-701b-87e0-8b6bc87e841b
Co-authored-by: Amp <amp@ampcode.com>
- Delete bulk_ops.py, moving logic to appropriate layers
- Add bulk insert query functions:
  - queries/asset.bulk_insert_assets
  - queries/cache_state.bulk_insert_cache_states_ignore_conflicts
  - queries/cache_state.get_cache_states_by_paths_and_asset_ids
  - queries/asset_info.bulk_insert_asset_infos_ignore_conflicts
  - queries/asset_info.get_asset_info_ids_by_ids
  - queries/tags.bulk_insert_tags_and_meta
- Move seed_from_paths_batch orchestration to scanner._seed_from_paths_batch

Amp-Thread-ID: https://ampcode.com/threads/T-019c24fd-157d-776a-ad24-4f19cf5d3afe
Co-authored-by: Amp <amp@ampcode.com>
- Move scanner-only functions to scanner.py
- Move query-only functions (is_scalar, project_kv) to asset_info.py
- Move get_query_dict to routes.py
- Create path_utils.py service for path-related functions
- Reduce helpers.py to shared utilities only

Amp-Thread-ID: https://ampcode.com/threads/T-019c2510-33fa-7199-ae4b-bc31102277a7
Co-authored-by: Amp <amp@ampcode.com>
- Add app/assets/api/upload.py with parse_multipart_upload() for HTTP parsing
- Add ParsedUpload dataclass to schemas_in.py
- Add domain exceptions (AssetValidationError, AssetNotFoundError, HashMismatchError)
- Add manager.process_upload() with domain exceptions (no HTTP status codes)
- Routes map domain exceptions to HTTP responses
- Slim down upload_asset route to ~20 lines (was ~150)

Amp-Thread-ID: https://ampcode.com/threads/T-019c2519-abe1-738a-ad2e-29ece17c0e42
Co-authored-by: Amp <amp@ampcode.com>
…t queries

- Fix unused imports in routes.py, asset.py, manager.py, asset_management.py, ingest.py
- Fix whitespace issues in upload.py, asset_info.py, ingest.py
- Fix typo in manager.py (stray character after result["asset"])
- Fix broken import in test_metadata.py (project_kv moved to asset_info.py)
- Add fixture override in queries/conftest.py for unit test isolation

Add 48 new tests covering all previously untested query functions:
- asset.py: upsert_asset, bulk_insert_assets
- cache_state.py: upsert_cache_state, delete_cache_states_outside_prefixes,
  get_orphaned_seed_asset_ids, delete_assets_by_ids, get_cache_states_for_prefixes,
  bulk_set_needs_verify, delete_cache_states_by_ids, delete_orphaned_seed_asset,
  bulk_insert_cache_states_ignore_conflicts, get_cache_states_by_paths_and_asset_ids
- asset_info.py: insert_asset_info, get_or_create_asset_info,
  update_asset_info_timestamps, replace_asset_info_metadata_projection,
  bulk_insert_asset_infos_ignore_conflicts, get_asset_info_ids_by_ids
- tags.py: bulk_insert_tags_and_meta

Total: 119 tests pass (up from 71)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused os imports in conftest.py and test_ingest.py
- Remove unused Tag import in test_asset_management.py
- Remove unused ensure_tags_exist import in test_ingest.py
- Fix unused info2 variable in test_asset_management.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename functions across app/assets/ to follow verb-based naming:
- is_scalar → check_is_scalar
- project_kv → expand_metadata_to_rows
- _visible_owner_clause → _build_visible_owner_clause
- _chunk_rows → _iter_row_chunks
- _at_least_one → _validate_at_least_one_field
- _tags_norm → _normalize_tags_field
- _ser_dt → _serialize_datetime
- _ser_updated → _serialize_updated_at
- _error_response → _build_error_response
- _validation_error_response → _build_validation_error_response
- file_sender → stream_file_chunks
- seed_assets_endpoint → seed_assets
- utcnow → get_utc_now
- _safe_sort_field → _validate_sort_field
- _safe_filename → _sanitize_filename
- fast_asset_file_check → check_asset_file_fast
- prefixes_for_root → get_prefixes_for_root
- blake3_hash → compute_blake3_hash
- blake3_hash_async → compute_blake3_hash_async
- _is_within → _check_is_within
- _rel → _compute_relative

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename functions to use clearer verb-based names:
- pick_best_live_path → select_best_live_path
- escape_like_prefix → escape_sql_like_string
- list_tree → list_files_recursively
- check_asset_file_fast → verify_asset_file_unchanged
- _seed_from_paths_batch → _batch_insert_assets_from_paths
- reconcile_cache_states_for_root → sync_cache_states_with_filesystem
- touch_asset_info_by_id → update_asset_info_access_time
- replace_asset_info_metadata_projection → set_asset_info_metadata
- expand_metadata_to_rows → convert_metadata_to_rows
- _rows_per_stmt → _calculate_rows_per_statement
- ensure_within_base → validate_path_within_base
- _cleanup_temp → _delete_temp_file_if_exists
- validate_hash_format → normalize_and_validate_hash
- get_relative_to_root_category_path_of_asset → get_asset_category_and_relative_path

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract helper functions to eliminate nested try-except blocks in scanner.py
and remove duplicated type-checking logic in asset_info.py. Simplify nested
conditionals in asset_management.py for clearer control flow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…pers

Extract focused helper functions to eliminate the try-finally block that
wrapped ~50 lines just for logging. The new helpers (_collect_paths_for_roots,
_build_asset_specs, _insert_asset_specs) make seed_assets a simple linear flow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Make blake3 an optional import that fails gracefully at import time,
with a clear error message when hashing functions are actually called.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract helper functions from conftest.py to a dedicated helpers.py module
to fix import resolution issues when pytest processes subdirectories.
Rename test_tags.py to test_tags_api.py to avoid module name collision
with queries/test_tags.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…layer

Replace dict/ORM object returns with explicit dataclasses to fix
DetachedInstanceError when accessing ORM attributes after session closes.

- Add app/assets/services/schemas.py with AssetData, AssetInfoData,
  AssetDetailResult, and RegisterAssetResult dataclasses
- Update asset_management.py and ingest.py to return dataclasses
- Update manager.py to use attribute access on dataclasses
- Fix created_new to be False in create_asset_from_hash (content exists)
- Add DependencyMissingError for better blake3 missing error handling
- Update tests to use attribute access instead of dict subscripting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…service layer

Add update_asset_info_name and update_asset_info_updated_at query functions
and update asset_management.py to use them instead of modifying ORM objects
directly. This ensures the service layer only uses explicit operations from
the queries package.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
luke-mino-altherr and others added 27 commits February 24, 2026 11:34
- Rename shorthand variables to explicit names (sp -> spec, aid -> asset_id, etc.)
- Move imports to top of file
- Add TypedDict definitions for AssetRow, CacheStateRow, AssetInfoRow, TagRow, MetadataRow
- Replace bare dict types with typed alternatives

Amp-Thread-ID: https://ampcode.com/threads/T-019c316d-13f7-77f8-b92b-ea7276c3e09c
Co-authored-by: Amp <amp@ampcode.com>
- Add bulk_update_is_missing() to efficiently update is_missing flag
- Update sync_cache_states_with_filesystem() to mark non-existent files as is_missing=True
- Call restore_cache_states_by_paths() in batch_insert_seed_assets() to restore
  previously-missing states when files reappear during scanning

Amp-Thread-ID: https://ampcode.com/threads/T-019c3177-e591-7666-ac6b-7e05c71c8ebf
Co-authored-by: Amp <amp@ampcode.com>
- Remove duplicate from routes.py
- Import from upload.py instead
- Rename to public API (remove leading underscore)

Amp-Thread-ID: https://ampcode.com/threads/T-019c3549-c245-7628-950c-dd6826185394
Co-authored-by: Amp <amp@ampcode.com>
- Fix race in mark_missing_outside_prefixes: set state to RUNNING inside
  lock before operations, restore to IDLE in finally block to prevent
  concurrent start() calls

- Fix timing consistency: capture perf_counter before _update_progress
  for consistent event timing

Amp-Thread-ID: https://ampcode.com/threads/T-019c354b-e7d7-7309-aa0e-79e5e7dff2b7
Co-authored-by: Amp <amp@ampcode.com>
Since size_bytes is declared as non-nullable (nullable=False, default=0) in
the Asset model, simplify the conditional checks:
- Use 'if item.asset else None' when the asset relationship might be None
- Access size_bytes directly when asset is guaranteed to exist (create endpoints)

Amp-Thread-ID: https://ampcode.com/threads/T-019c354e-cbfb-77d8-acdd-0d066c16006e
Co-authored-by: Amp <amp@ampcode.com>
- Add MAX_SAFETENSORS_HEADER_SIZE constant in metadata_extract.py
- Fix double 'compute' typo: compute_compute_blake3_hash_async → compute_blake3_hash_async

Amp-Thread-ID: https://ampcode.com/threads/T-019c3550-4dbc-7301-a5e8-e6e23aa2d7b1
Co-authored-by: Amp <amp@ampcode.com>
- Make blake3 import lazy in hashing.py (only imported when needed)
- Add compute_hashes parameter to AssetSeeder.start(), build_asset_specs(), and seed_assets()
- Fix missing tag clearing: include is_missing states in sync when update_missing_tags=True
- Clear is_missing flag on cache states when files are restored with matching mtime/size
- Fix validation error serialization in routes.py (use json.loads(ve.json()))

Amp-Thread-ID: https://ampcode.com/threads/T-019c3614-56d4-74a8-a717-19922d6dbbee
Co-authored-by: Amp <amp@ampcode.com>
- Add custom MIME type registrations for model files (.safetensors, .pt, .ckpt, .gguf, .yaml)
- Pass mime_type through SeedAssetSpec to bulk_ingest
- Re-register types before use since server.py mimetypes.init() resets them
- Add tests for bulk ingest mime_type handling

Amp-Thread-ID: https://ampcode.com/threads/T-019c3626-c6ad-7139-a570-62da4e656a1a
Co-authored-by: Amp <amp@ampcode.com>
- Remove debug print statements
- Remove trailing whitespace on blank lines
- Remove unused pytest import

Amp-Thread-ID: https://ampcode.com/threads/T-019c3a8d-3b4f-75b4-8513-1c77914782f7
Co-authored-by: Amp <amp@ampcode.com>
Phase 1 (FAST): Creates stub records with filesystem metadata only
- path, size, mtime - no file content reading
- Populates asset database quickly on startup

Phase 2 (ENRICH): Extracts metadata and computes hashes
- Safetensors header parsing, MIME types
- Optional blake3 hash computation
- Updates existing stub records

Changes:
- Add ScanPhase enum (FAST, ENRICH, FULL)
- Add enrichment_level column to AssetCacheState (0=stub, 1=metadata, 2=hashed)
- Add build_stub_specs() for fast scanning without metadata extraction
- Add get_unenriched_cache_states(), enrich_asset(), enrich_assets_batch()
- Add start_fast(), start_enrich() convenience methods to AssetSeeder
- Update start() to accept phase parameter (defaults to FULL)
- Split _run_scan() into _run_fast_phase() and _run_enrich_phase()
- Add migration 0003_add_enrichment_level.py
- Update tests for new architecture

Amp-Thread-ID: https://ampcode.com/threads/T-019c4eef-1568-778f-aede-38254728f848
Co-authored-by: Amp <amp@ampcode.com>
- Add PAUSED state to state machine
- Add pause() method - blocks scan at next checkpoint
- Add resume() method - unblocks paused scan
- Add stop() method - alias for cancel()
- Add restart() method - cancel + wait + start with same/overridden params
- Add _check_pause_and_cancel() helper for checkpoint locations
- Emit assets.seed.paused and assets.seed.resumed WebSocket events
- Update get_object_info to use async seeder instead of blocking seed_assets
- Scan all roots (models, input, output) on object_info, not just models

Amp-Thread-ID: https://ampcode.com/threads/T-019c4f2b-5801-711c-8d47-bd1525808d77
Co-authored-by: Amp <amp@ampcode.com>
- Add log when scanner start is requested and when skipped due to already running
- Remove noisy 'no mime_type' info log (expected during fast stub phase)

Amp-Thread-ID: https://ampcode.com/threads/T-019c4f56-3fe1-72cb-888a-3ac4ac99b3d7
Co-authored-by: Amp <amp@ampcode.com>
Log pause, resume, cancel, and restart events

Amp-Thread-ID: https://ampcode.com/threads/T-019c4f56-3fe1-72cb-888a-3ac4ac99b3d7
Co-authored-by: Amp <amp@ampcode.com>
…-autoscan flag

- Added disable(), enable(), and is_disabled() methods to AssetSeeder
- start() now checks is_disabled() and returns early if disabled
- Updated main.py to call asset_seeder.disable() when CLI flag is set
- Fixes bypass where /object_info would trigger scans regardless of flag

Amp-Thread-ID: https://ampcode.com/threads/T-019c4f66-6773-72d2-bdfe-b55f5aa76021
Co-authored-by: Amp <amp@ampcode.com>
…ence

This change solves the basename collision bug by using UNIQUE(file_path) on the
unified asset_references table. Key changes:

Database:
- Migration 0005 merges asset_cache_states and asset_infos into asset_references
- AssetReference now contains: cache state fields (file_path, mtime_ns, needs_verify,
  is_missing, enrichment_level) plus info fields (name, owner_id, preview_id, etc.)
- AssetReferenceMeta replaces AssetInfoMeta
- AssetReferenceTag replaces AssetInfoTag
- UNIQUE constraint on file_path prevents duplicate entries for same file

Code:
- New unified query module: asset_reference.py (replaces asset_info.py, cache_state.py)
- Updated scanner, seeder, and services to use AssetReference
- Updated API routes to use reference_id instead of asset_info_id

Tests:
- All 175 unit tests updated and passing
- Integration tests require server environment (not run here)

Amp-Thread-ID: https://ampcode.com/threads/T-019c4fe8-9dcb-75ce-bea8-ea786343a581
Co-authored-by: Amp <amp@ampcode.com>
Consolidate all branch migrations into 0002_merge_to_asset_references.
Drops old tables (asset_info_meta, asset_info_tags, asset_cache_state,
assets_info), truncates assets, and creates the new unified schema
(asset_references, asset_reference_tags, asset_reference_meta).

Amp-Thread-ID: https://ampcode.com/threads/T-019c5005-66e9-719f-94df-ebb4b25bcf14
Co-authored-by: Amp <amp@ampcode.com>
…autoscan disabled

Move inline/lazy imports to top-level now that blake3 is a main dependency and
circular-import concerns are resolved. Remove decorative section-header comments
in asset_reference queries. Allow the seed API endpoint to temporarily re-enable
the seeder when --disable-assets-autoscan is active so manual triggers still work.
Add retry-on-409 logic in test helpers and remove try/finally around prompt
execution in favour of sequential pause/resume.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix _hash_file_obj not restoring file position when orig_pos == 0
- Fix validate_path_within_base swallowing its own ValueError
- Guard against tags=None crash in upload_from_temp_path
- Filter is_missing references from list_references_page results
- Include Decimal in metadata filter isinstance check
- Return 403 instead of 404 for PermissionError in update/add_tags routes
- Re-disable seeder after non-blocking start with --disable-assets-autoscan
- Remove duplicate HashMismatchError/DependencyMissingError from schemas_in
- Move _emit_event outside lock in seeder resume() to prevent deadlock
- Fix total_q ignoring owner_id in list_tags_with_usage
- Capture ORM attributes before commit in resolve_asset_for_download
- Remove dead code (if mime_type is None: pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@luke-mino-altherr luke-mino-altherr force-pushed the luke-mino-altherr/asynchronous-scanning branch from 185100c to 1fa4c88 Compare February 24, 2026 19:36
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (26)
app/assets/services/file_utils.py (1)

4-8: Prefer st_mtime_ns directly to avoid float precision loss.

The fallback int(stat_result.st_mtime * 1_000_000_000) can lose precision and cause spurious “changed” detections. If your minimum Python version guarantees st_mtime_ns, return it directly.

♻️ Suggested simplification
 def get_mtime_ns(stat_result: os.stat_result) -> int:
     """Extract mtime in nanoseconds from a stat result."""
-    return getattr(
-        stat_result, "st_mtime_ns", int(stat_result.st_mtime * 1_000_000_000)
-    )
+    return stat_result.st_mtime_ns
#!/bin/bash
# Check declared minimum Python version
fd -g '{pyproject.toml,setup.cfg,setup.py,tox.ini}' \
  | xargs rg -n "requires-python|python_requires|python_version"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/file_utils.py` around lines 4 - 8, The get_mtime_ns
function should use st_mtime_ns directly to avoid float precision loss; replace
the getattr fallback with a direct return of stat_result.st_mtime_ns (i.e.,
change get_mtime_ns to return stat_result.st_mtime_ns) and remove the
int(stat_result.st_mtime * 1_000_000_000) fallback since the project's minimum
Python version guarantees st_mtime_ns.
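The precision loss can be demonstrated without touching the filesystem; the nanosecond value below is an arbitrary plausible timestamp:

```python
# st_mtime is a float of seconds; round-tripping a nanosecond timestamp
# through it cannot preserve all 19 digits (float64 carries ~15-17).
mtime_ns = 1_700_000_000_123_456_789  # illustrative st_mtime_ns value
as_float = mtime_ns / 1_000_000_000   # what st_mtime would hold
recovered = int(as_float * 1_000_000_000)

print(recovered == mtime_ns)  # False -- the float fallback is lossy
# The drift is small (hundreds of ns), but any nonzero drift makes an
# exact mtime_ns comparison flag the file as "changed".
```

Since `st_mtime_ns` has been available on `os.stat_result` since Python 3.3, the fallback is dead weight on any supported interpreter.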
tests-unit/assets_test/test_file_utils.py (1)

1-55: Add tests for the new mtime/size helpers.

Current coverage misses get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged edge cases (mtime None, size None, mismatch cases). These are new in this PR and drive change-detection behavior.

Do you want me to draft the missing test cases or open a follow-up issue?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/test_file_utils.py` around lines 1 - 55, Add unit
tests covering the new mtime/size helper functions: write tests for
get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged to exercise edge
cases (when mtime is None, when size is None, and when returned size/mtime
mismatch expected values). Specifically, create files under tmp_path and assert
get_mtime_ns returns an int (and handles missing files/None), assert
get_size_and_mtime_ns returns a tuple (size, mtime_ns) and handles None
components, and assert verify_file_unchanged returns True for matching
size/mtime and False for each mismatch scenario (size mismatch, mtime mismatch,
and when one value is None). Use the function names get_mtime_ns,
get_size_and_mtime_ns, and verify_file_unchanged to locate the helpers.
tests-unit/assets_test/queries/conftest.py (1)

2-13: ⚠️ Potential issue | 🟡 Minor

Enable SQLite foreign keys in the in-memory fixture.

SQLite disables FK enforcement by default; without PRAGMA foreign_keys=ON, tests can pass while runtime fails. Add the connect listener before create_all.

🛠️ Suggested update
-from sqlalchemy import create_engine
+from sqlalchemy import create_engine, event
@@
     engine = create_engine("sqlite:///:memory:")
+    @event.listens_for(engine, "connect")
+    def _set_sqlite_fk(dbapi_connection, _):
+        cursor = dbapi_connection.cursor()
+        cursor.execute("PRAGMA foreign_keys=ON")
+        cursor.close()
     Base.metadata.create_all(engine)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/queries/conftest.py` around lines 2 - 13, The
in-memory SQLite session fixture disables foreign key enforcement by default;
add a connection listener that runs PRAGMA foreign_keys=ON before creating
tables. In the session fixture (where engine = create_engine(...) and
Base.metadata.create_all(engine) are called), register an event listener on the
engine's "connect" event to execute "PRAGMA foreign_keys=ON" for each new DBAPI
connection, and register it before calling Base.metadata.create_all(engine) so
tests enforce FK constraints as in production.
tests-unit/assets_test/queries/test_asset_info.py (1)

206-207: Drop the redundant local import of time.
The module already imports time at the top.

♻️ Suggested refactor
-        import time
         time.sleep(0.01)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/queries/test_asset_info.py` around lines 206 - 207,
Remove the redundant local import of time in the test (the lines "import time"
near the sleep) since time is already imported at module top; keep the existing
time.sleep(0.01) call but delete the duplicate import so there are no duplicate
imports in tests-unit/assets_test/queries/test_asset_info.py.
main.py (1)

261-265: ⚠️ Potential issue | 🟠 Major

Ensure asset_seeder.resume() always runs.
If e.execute(...) raises, the seeder stays paused for the rest of the process.

🐛 Suggested fix
             asset_seeder.pause()
-
-            e.execute(item[2], prompt_id, extra_data, item[4])
-
-            asset_seeder.resume()
+            try:
+                e.execute(item[2], prompt_id, extra_data, item[4])
+            finally:
+                asset_seeder.resume()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 261 - 265, The asset_seeder is paused then resumed but
if e.execute(item[2], prompt_id, extra_data, item[4]) throws the seeder remains
paused; wrap the call in a try/finally so asset_seeder.resume() always runs
(call asset_seeder.pause() before the try, put e.execute(...) in the try block,
and asset_seeder.resume() in the finally block) referencing the existing
asset_seeder.pause, e.execute, and asset_seeder.resume symbols.
app/assets/services/schemas.py (1)

10-14: ⚠️ Potential issue | 🟠 Major

AssetData.hash should allow None to match Asset.hash.
Asset.hash is str | None, and extract_asset_data passes it through; keeping AssetData.hash as str can leak a None into a non-optional field.

🐛 Suggested fix
 @dataclass(frozen=True)
 class AssetData:
-    hash: str
+    hash: str | None
     size_bytes: int | None
     mime_type: str | None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/schemas.py` around lines 10 - 14, AssetData.hash is
currently typed as str but Asset.hash (and values passed through
extract_asset_data) can be None; update the AssetData dataclass definition
(AssetData) to make hash Optional by changing its type to str | None so it
matches Asset.hash and prevents None being stored in a non-optional field;
verify extract_asset_data and any callers still type-check with the new nullable
signature.
tests-unit/assets_test/queries/test_asset.py (1)

24-31: Simplify the always-true guard in setup.
This conditional is redundant and obscures intent.

♻️ Suggested refactor
-        if setup_hash is not None or setup_hash is None:
-            # Create asset with given hash (including None for null hash test)
-            if setup_hash is not None or query_hash == "":
-                asset = Asset(hash=setup_hash, size_bytes=100)
-                session.add(asset)
-                session.commit()
+        # Create asset with given hash (including None for null hash test)
+        if setup_hash is not None or query_hash == "":
+            asset = Asset(hash=setup_hash, size_bytes=100)
+            session.add(asset)
+            session.commit()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/queries/test_asset.py` around lines 24 - 31, The outer
guard in test_exists_by_hash (if setup_hash is not None or setup_hash is None)
is always true and should be removed; keep the inner condition that actually
controls creation (if setup_hash is not None or query_hash == "") so the test
only creates the Asset when intended—update the test_exists_by_hash method to
eliminate the redundant outer if and retain the inner conditional and asset
creation logic (Asset(hash=setup_hash, size_bytes=100), session.add,
session.commit()).
server.py (1)

698-701: ⚠️ Potential issue | 🟠 Major

Guard asset seeder startup so /object_info stays reliable.
A seeding exception here will bubble and fail the request.

🛡️ Suggested fix
-            asset_seeder.start(roots=("models", "input", "output"))
+            try:
+                asset_seeder.start(roots=("models", "input", "output"))
+            except Exception as e:
+                logging.warning("Asset seeder start failed: %s", e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server.py` around lines 698 - 701, The call to asset_seeder.start in the
get_object_info handler can raise and abort the request; wrap
asset_seeder.start(roots=("models","input","output")) in a try/except that
catches Exception, logs the error (using the existing logger or processLogger)
with context, and allows the request to continue (or return a safe fallback) so
a seeding failure does not bubble up; update the get_object_info function
(identified by the routes.get("/object_info") decorator and asset_seeder.start
usage) to perform this guarded startup before entering the with
folder_paths.cache_helper block.
app/assets/database/queries/asset.py (2)

133-137: session.flush() called unconditionally even when ref is None.

When the reference isn't found, the flush is a no-op but unnecessary. Move it inside the if ref: block.

♻️ Proposed fix
     ref = session.get(AssetReference, reference_id)
     if ref:
         ref.asset_id = to_asset_id
-
-    session.flush()
+        session.flush()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset.py` around lines 133 - 137, The call to
session.flush() is executed even when no AssetReference is found; update the
logic in the code handling AssetReference retrieval (where
session.get(AssetReference, reference_id) is assigned to ref) so that
session.flush() is moved inside the if ref: block (after setting ref.asset_id =
to_asset_id) and is not called when ref is None; locate the block referencing
AssetReference, reference_id, ref.asset_id and adjust the control flow
accordingly.

93-103: ⚠️ Potential issue | 🟠 Major

Missing chunking — IN clause can exceed SQLite bind-parameter limit.

get_existing_asset_ids passes the entire asset_ids list into a single IN(...) clause. Large lists will hit SQLite's bind-parameter ceiling. The sibling function bulk_insert_assets already uses iter_chunks — the same pattern should be applied here.

🐛 Proposed fix
 def get_existing_asset_ids(
     session: Session,
     asset_ids: list[str],
 ) -> set[str]:
     """Return the subset of asset_ids that exist in the database."""
     if not asset_ids:
         return set()
-    rows = session.execute(
-        select(Asset.id).where(Asset.id.in_(asset_ids))
-    ).fetchall()
-    return {row[0] for row in rows}
+    found: set[str] = set()
+    for chunk in iter_chunks(asset_ids, calculate_rows_per_statement(1)):
+        rows = session.execute(
+            select(Asset.id).where(Asset.id.in_(chunk))
+        ).fetchall()
+        found.update(row[0] for row in rows)
+    return found
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset.py` around lines 93 - 103,
get_existing_asset_ids currently runs a single SELECT ... WHERE Asset.id IN
(asset_ids) which will break on large lists due to SQLite bind-parameter limits;
change it to iterate the input asset_ids in chunks using the existing
iter_chunks helper (same pattern as bulk_insert_assets), run a
select(Asset.id).where(Asset.id.in_(chunk)) for each chunk against the Session,
collect IDs from each rows batch into a single set, and return that set so large
inputs are handled safely.
app/assets/api/upload.py (2)

163-170: ⚠️ Potential issue | 🟡 Minor

Orphaned temp directory left behind on cleanup.

delete_temp_file_if_exists removes the file but not the parent unique_dir created at line 112. Over time, empty directories accumulate under temp/uploads/.

♻️ Proposed fix
 def delete_temp_file_if_exists(tmp_path: str | None) -> None:
     """Safely remove a temp file if it exists."""
     if tmp_path:
         try:
             if os.path.exists(tmp_path):
                 os.remove(tmp_path)
+            parent = os.path.dirname(tmp_path)
+            if parent and os.path.isdir(parent):
+                try:
+                    os.rmdir(parent)  # only succeeds if empty
+                except OSError:
+                    pass
         except OSError as e:
             logging.debug("Failed to delete temp file %s: %s", tmp_path, e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/upload.py` around lines 163 - 170, delete_temp_file_if_exists
currently only removes the temp file but leaves its parent unique_dir (created
earlier) behind; update delete_temp_file_if_exists to after removing tmp_path
compute the parent directory (e.g., parent = os.path.dirname(tmp_path)) and
attempt to remove it if empty (os.rmdir or Path(parent).rmdir()) while catching
and logging OSError/Exception so non-empty or permission errors are ignored;
reference the function delete_temp_file_if_exists and ensure the cleanup targets
the parent directory created for uploads (the unique_dir under temp/uploads) to
prevent accumulating empty directories.

93-109: ⚠️ Potential issue | 🟡 Minor

Hash-before-file ordering dependency in multipart parsing.

The drain-without-write optimization on line 97 only activates if the hash field arrives before file. If the client sends file first, data is written to disk even when the hash is already known. Multipart field order isn't guaranteed by the HTTP spec.

Consider buffering the decision: accept the file to a temp path, then delete the temp file post-loop if the hash turned out to be known. Alternatively, document the expected field order in the API contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/upload.py` around lines 93 - 109, The current branch inside
the multipart handling (when fname == "file") only drains without writing if
provided_hash_exists is already true, creating a dependency on field order;
instead, always stream the incoming file to a temp path (use the existing
file_client_name/file_written logic and field.read_chunk) and record the path,
then after the overall multipart loop check provided_hash/provided_hash_exists
and, if the hash indicates the upload is a duplicate, delete the temp file and
adjust any tracking rather than relying on hash arriving before file; keep the
same UploadError handling around read/write operations and ensure cleanup on
exceptions.
app/assets/helpers.py (1)

6-23: select_best_live_path uses an unparameterized Sequence.

The function accesses file_path and needs_verify via getattr, which makes the expected interface invisible to type checkers. A lightweight Protocol would make the contract explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/helpers.py` around lines 6 - 23, The function
select_best_live_path currently types its parameter as an unparameterized
Sequence and relies on getattr for file_path and needs_verify; define a
lightweight Protocol (e.g., class HasFilePath(Protocol): file_path: str | None;
needs_verify: bool) and change the signature of select_best_live_path to accept
Sequence[HasFilePath] so type checkers see the expected attributes, then replace
getattr usage with direct attribute access (s.file_path, s.needs_verify) inside
select_best_live_path.
app/assets/services/path_utils.py (2)

25-50: ⚠️ Potential issue | 🟠 Major

Missing validation for empty or unknown root tags.

tags[0] on line 27 will raise IndexError on an empty list, and any unknown root (e.g., "custom") silently falls through to the output directory on line 43.

🐛 Proposed fix
 def resolve_destination_from_tags(tags: list[str]) -> tuple[str, list[str]]:
     """Validates and maps tags -> (base_dir, subdirs_for_fs)"""
+    if not tags:
+        raise ValueError("at least one tag required")
     root = tags[0]
+    if root not in ("models", "input", "output"):
+        raise ValueError(f"unknown root tag '{root}'")
     if root == "models":
         if len(tags) < 2:
             raise ValueError("at least two tags required for model asset")
@@
-    else:
-        base_dir = os.path.abspath(
-            folder_paths.get_input_directory()
-            if root == "input"
-            else folder_paths.get_output_directory()
-        )
+    elif root == "input":
+        base_dir = os.path.abspath(folder_paths.get_input_directory())
+        raw_subdirs = tags[1:]
+    else:  # root == "output"
+        base_dir = os.path.abspath(folder_paths.get_output_directory())
         raw_subdirs = tags[1:]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/path_utils.py` around lines 25 - 50, In
resolve_destination_from_tags, first validate that the tags list is not empty
(guard against tags[0] IndexError) and raise a clear ValueError if it is empty;
next replace the implicit fallback that treats any unknown root as "output" with
an explicit allowed-root check (only allow "models", "input", or "output") and
raise ValueError for unknown roots (e.g., "custom"); keep the existing
models-specific logic (folder_paths.folder_names_and_paths lookup and bases
emptiness check) and preserve the path component validation for raw_subdirs.

53-61: ⚠️ Potential issue | 🟠 Major

Symlink bypass in path containment check.

os.path.abspath doesn't resolve symlinks. A symlinked subdirectory can escape the base path while passing this check. Use os.path.realpath instead.

🐛 Proposed fix
 def validate_path_within_base(candidate: str, base: str) -> None:
-    cand_abs = os.path.abspath(candidate)
-    base_abs = os.path.abspath(base)
+    cand_abs = os.path.realpath(candidate)
+    base_abs = os.path.realpath(base)
     try:
         common = os.path.commonpath([cand_abs, base_abs])
     except Exception:
         raise ValueError("invalid destination path")
     if common != base_abs:
         raise ValueError("destination escapes base directory")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/path_utils.py` around lines 53 - 61, The
validate_path_within_base function uses os.path.abspath which does not resolve
symlinks and allows symlinked paths to bypass the containment check; change both
cand_abs and base_abs to use os.path.realpath (resolve symlinks) before
computing os.path.commonpath and keep the existing exception handling and
ValueError messages so the function properly rejects paths that escape the base
via symlinks.
app/assets/api/routes.py (3)

651-652: ⚠️ Potential issue | 🔴 Critical

Blocking asset_seeder.wait() call inside async handler freezes the event loop.

asset_seeder.wait() calls Thread.join() synchronously, blocking the entire aiohttp event loop until the scan completes. This should be offloaded to a thread executor.

Proposed fix
+    import asyncio
+
     if should_wait:
-        asset_seeder.wait()
+        await asyncio.to_thread(asset_seeder.wait)
         if was_disabled:
             asset_seeder.disable()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 651 - 652, The handler is calling
asset_seeder.wait(), which blocks via Thread.join() and freezes the aiohttp
event loop; change the call to run the blocking wait in a thread executor (e.g.,
use asyncio.get_running_loop().run_in_executor(...) or asyncio.to_thread(...))
so the join runs off the event loop; update the async handler where
asset_seeder.wait() is invoked to await the executor call instead of calling
wait() directly (referencing asset_seeder.wait and the surrounding async request
handler).

720-725: ⚠️ Potential issue | 🟡 Minor

Check-after-act: mark_missing_outside_prefixes() executes before the state check.

The seeder's own mark_missing_outside_prefixes method does check for IDLE internally (returning 0 if not IDLE), so the prune won't actually race with a scan. However, marked == 0 is ambiguous — it could mean "nothing to mark" or "seeder was busy". Consider checking state before calling the method for a cleaner 409 response.

Proposed fix: check state first
 async def mark_missing_assets(request: web.Request) -> web.Response:
+    status = asset_seeder.get_status()
+    if status.state.value != "IDLE":
+        return web.json_response(
+            {"status": "scan_running", "marked": 0},
+            status=409,
+        )
     marked = asset_seeder.mark_missing_outside_prefixes()
-    if marked == 0 and asset_seeder.get_status().state.value != "IDLE":
-        return web.json_response(
-            {"status": "scan_running", "marked": 0},
-            status=409,
-        )
     return web.json_response({"status": "completed", "marked": marked}, status=200)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 720 - 725, The code calls
asset_seeder.mark_missing_outside_prefixes() before checking
asset_seeder.get_status().state.value, making a return value of 0 ambiguous;
update the logic to first check asset_seeder.get_status().state.value != "IDLE"
and immediately return the 409 json_response when a scan is running, and only
then call asset_seeder.mark_missing_outside_prefixes(); after calling
mark_missing_outside_prefixes() return the normal response based on the actual
marked count so a 0 means “nothing to mark” not “seeder busy.”

588-611: ⚠️ Potential issue | 🟠 Major

delete_asset_tags doesn't catch PermissionError, unlike add_asset_tags and update_asset_route.

If remove_tags raises PermissionError for a non-owner, it falls through to the generic Exception handler and returns 500 instead of 403.

Proposed fix
-    except ValueError as ve:
+    except (ValueError, PermissionError) as ve:
         return _build_error_response(
             404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
         )

Note: If you want PermissionError to return 403 (like other handlers), you may want separate except clauses:

+    except PermissionError as pe:
+        return _build_error_response(
+            403, "FORBIDDEN", str(pe), {"id": reference_id}
+        )
     except ValueError as ve:
         return _build_error_response(
             404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 588 - 611, delete_asset_tags currently
catches ValueError and a generic Exception but not PermissionError, so
PermissionError raised by remove_tags is treated as a 500; add a dedicated
except PermissionError block in the delete_asset_tags handler to mirror
add_asset_tags/update_asset_route behavior: catch PermissionError (from
remove_tags) and return _build_error_response(403, "FORBIDDEN", str(pe) or a
suitable message, and include {"id": reference_id}) while leaving the existing
ValueError and generic Exception handlers intact; locate the try/except around
remove_tags in delete_asset_tags to implement this.
app/assets/database/models.py (1)

145-162: ⚠️ Potential issue | 🟡 Minor

Missing index on updated_at — API sort queries on this column will require a full table scan.

The _validate_sort_field function in routes.py (Line 92) permits sorting by updated_at, but no index exists for it in __table_args__. Both created_at and last_access_time have indexes, so updated_at should too.

Proposed fix
         Index("ix_asset_references_created_at", "created_at"),
         Index("ix_asset_references_last_access_time", "last_access_time"),
+        Index("ix_asset_references_updated_at", "updated_at"),
         Index("ix_asset_references_owner_name", "owner_id", "name"),

Also add a corresponding op.create_index in the Alembic migration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/models.py` around lines 145 - 162, Add an index on the
updated_at column to avoid full table scans by updating the model's
__table_args__ (add an Index like ix_asset_references_updated_at on
"updated_at") and create the matching Alembic migration that calls
op.create_index for the same index name on the asset_references table; ensure
the index name matches between the SQLAlchemy model (in models.py) and the
Alembic migration so API sort queries using updated_at (per _validate_sort_field
in routes.py) can use the index.
app/assets/services/bulk_ingest.py (1)

186-193: ⚠️ Potential issue | 🟠 Major

Duplicate-hash assets silently lose their references.

When bulk_insert_assets skips an asset due to hash conflict (ON CONFLICT DO NOTHING), the new asset_id is never inserted, so get_existing_asset_ids won't return it. The corresponding reference row is then filtered out at Line 191-193, meaning a new file path pointing to an already-known hash never gets a reference row.

Consider resolving the existing asset ID by hash for skipped specs, and remapping reference_rows to point to the existing asset rather than dropping them.
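The remapping step can be sketched without a database; all parameter names are hypothetical, and the real code would build `existing_id_by_hash` with a SELECT on Asset.hash for the skipped specs:

```python
def remap_skipped_references(reference_rows, spec_hash_by_asset_id,
                             inserted_ids, existing_id_by_hash):
    """Remap references whose asset insert was skipped by ON CONFLICT DO NOTHING."""
    remapped = []
    for row in reference_rows:
        asset_id = row["asset_id"]
        if asset_id not in inserted_ids:
            # Insert was skipped: point the reference at the asset that
            # already owns this hash instead of dropping the row.
            existing = existing_id_by_hash.get(spec_hash_by_asset_id.get(asset_id))
            if existing is None:
                continue  # truly unknown asset — drop as before
            row = {**row, "asset_id": existing}
        remapped.append(row)
    return remapped
```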

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/bulk_ingest.py` around lines 186 - 193,
bulk_insert_assets currently drops reference_rows whose asset_id wasn't returned
by get_existing_asset_ids, losing references for assets skipped due to hash
conflicts; fix by resolving skipped assets by hash and remapping reference_rows
to the existing asset IDs instead of dropping them. Specifically: after calling
get_existing_asset_ids(session, [...]) and before filtering reference_rows,
collect the hashes from the original import specs (the same hash field used by
bulk_insert_assets), query assets by hash to build a map hash ->
existing_asset_id, then for any reference_row whose asset_id is missing from
inserted_asset_ids replace its asset_id with the existing_asset_id from that
map; finally filter reference_rows to only those whose (possibly remapped)
asset_id exists. Use the existing functions bulk_insert_assets and
get_existing_asset_ids and the reference_rows list to locate where to insert the
resolver and remapping logic.
app/assets/scanner.py (1)

298-306: ⚠️ Potential issue | 🟠 Major

follow_symlinks=False in spec builders mismatches follow_symlinks=True in sync — perpetual needs_verify for symlinks.

Both build_asset_specs (Line 304) and build_stub_specs (Line 376) use os.stat(abs_p, follow_symlinks=False) (lstat), while sync_references_with_filesystem (Line 147) uses follow_symlinks=True (stat). For symlinked files, the recorded mtime/size will differ from the verification check, causing verify_file_unchanged to return False every scan cycle.

Proposed fix: align follow_symlinks policy
-            stat_p = os.stat(abs_p, follow_symlinks=False)
+            stat_p = os.stat(abs_p, follow_symlinks=True)

Apply to both build_asset_specs (Line 304) and build_stub_specs (Line 376).

Also applies to: 370-380

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/scanner.py` around lines 298 - 306, The spec builders
build_asset_specs and build_stub_specs currently call os.stat(...,
follow_symlinks=False) (lstat), which records symlink metadata and mismatches
sync_references_with_filesystem/verify_file_unchanged that use os.stat(...,
follow_symlinks=True), causing symlinked files to always appear changed; update
the os.stat calls in build_asset_specs and build_stub_specs to use
follow_symlinks=True so recorded mtime/size match the verification path and
remove perpetual needs_verify for symlinks.
app/assets/database/queries/asset_reference.py (5)

628-669: result.rowcount may be None at lines 652 and 669.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 628 - 669, The
SQLAlchemy ExecutionResult.rowcount can be None, so ensure both
mark_references_missing_outside_prefixes and restore_references_by_paths return
an int by normalizing None to 0; after executing the update in
mark_references_missing_outside_prefixes (result) and
restore_references_by_paths (result) convert result.rowcount to an integer
(e.g., use result.rowcount or 0 or int(result.rowcount or 0)) before returning
so the functions always return a concrete int.

947-974: IN/IN cross-product doesn't correctly correlate file_path ↔ asset_id pairs.
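The pair-building side of the fix can be sketched in plain Python; the real query would then pass each chunk to `sqlalchemy.tuple_(AssetReference.file_path, AssetReference.asset_id).in_(pairs)` (names per the prompt below, chunk math assumed):

```python
from itertools import islice

def iter_chunks(items, size):
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def correlated_pairs(paths, path_to_asset, max_bind_params=900):
    """Yield (file_path, asset_id) pair lists per chunk for a tuple-IN query.
    Each pair consumes two bind parameters, hence size // 2."""
    for chunk in iter_chunks(paths, max_bind_params // 2):
        yield [(p, path_to_asset[p]) for p in chunk if p in path_to_asset]
```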

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 947 - 974, The
WHERE currently uses two independent IN clauses causing a cross-product; change
it to match pairs by using a tuple IN comparison: build a list of (file_path,
asset_id) tuples for each chunk (using path_to_asset to map paths) and use
sqlalchemy.sql.expression.tuple_(AssetReference.file_path,
AssetReference.asset_id).in_(list_of_tuples) in the select. Update the loop in
get_references_by_paths_and_asset_ids to create chunk_pairs from
iter_chunks(paths, MAX_BIND_PARAMS) and pass those paired tuples to the query so
file_path ↔ asset_id are correlated.

817-829: delete_orphaned_seed_asset doesn't enforce the "hash is None" precondition.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 817 - 829, The
function delete_orphaned_seed_asset currently deletes AssetReference rows
unconditionally; change it to first load the Asset (session.get(Asset,
asset_id)), verify that asset is not None and asset.hash is None, and only then
execute the deletion of AssetReference records
(sa.delete(AssetReference).where(AssetReference.asset_id == asset_id)) and
session.delete(asset); if the asset is missing or its hash is not None, do not
delete references and return False. Use the existing symbols Asset,
AssetReference, and delete_orphaned_seed_asset to locate and update the logic so
the "hash is None" precondition is enforced and the return value reflects
whether deletion occurred.

928-944: Hardcoded column count 14 in calculate_rows_per_statement(14) may drift if the model changes.
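Deriving the count from the payload keeps it in sync automatically; a sketch, assuming the rows are uniform dicts as built for the bulk INSERT and using SQLite's historical 999 bind-parameter default:

```python
def rows_per_statement(rows, max_bind_params=999):
    """Chunk size for a bulk INSERT, derived from the actual payload width
    instead of a hardcoded column count."""
    if not rows:
        return 0
    columns = len(rows[0])  # tracks the model if columns are added/removed
    return max(1, max_bind_params // columns)
```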

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 928 - 944, The
hardcoded column count (14) in bulk_insert_references_ignore_conflicts when
calling calculate_rows_per_statement can drift; replace the literal with a
dynamic column count computed from the actual insert payload or model (e.g., use
the number of keys in enriched_rows[0] or AssetReference.__table__.columns) and
pass that value to calculate_rows_per_statement so the chunk sizing stays
correct; update the call at the iter_chunks(...) line inside
bulk_insert_references_ignore_conflicts accordingly.

290-365: Count query duplicates all filter logic from the data query — divergence risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 290 - 365, The
count query duplicates the filter/join logic in list_references_page (risking
divergence); refactor to build a single filtered Select and reuse it for both
data and count. Concretely, inside list_references_page extract the
select(AssetReference).join(...).where(...).options(...) construction (and the
subsequent name_contains, _apply_tag_filters, _apply_metadata_filter steps) into
a single variable (e.g. filtered_stmt) and then derive the paginated query from
it by applying order_by/limit/offset, while deriving the count by converting
filtered_stmt to a count select (e.g.
filtered_stmt.with_only_columns(sa.func.count()).order_by(None) or
select(sa.func.count()).select_from(filtered_stmt.subquery())). Update usages of
base and count_stmt accordingly and remove the duplicated join/where logic; keep
references to functions _apply_tag_filters and _apply_metadata_filter and
symbols AssetReference, AssetReferenceTag, Tag, and Asset to locate code.
🧹 Nitpick comments (9)
app/assets/services/tagging.py (1)

41-64: remove_tags performs a redundant reference lookup.

remove_tags_from_reference (in tags.py line 177-215 from the snippets) already fetches the reference and raises ValueError if not found. The pre-fetch and ownership check here in the service layer means the reference is loaded twice. This is intentional for ownership enforcement (which the query layer doesn't do), so it's fine from a correctness standpoint, but worth noting the double lookup.

Consider passing reference_row=ref_row to remove_tags_from_reference (similar to how apply_tags passes it to add_tags_to_reference) to avoid the redundant session.get() inside the query function. This would require updating the remove_tags_from_reference signature to accept an optional reference_row parameter, matching the pattern already used by add_tags_to_reference.
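The optional-passthrough shape can be sketched like this, with `session_get` and the dict row standing in for `session.get` and the ORM object (all names hypothetical mirrors of the real query-layer function):

```python
def remove_tags_from_reference(session_get, reference_id, tags, reference_row=None):
    """Accept a pre-fetched row to skip a second lookup; fall back to
    fetching when the caller has not already loaded the reference."""
    ref = reference_row if reference_row is not None else session_get(reference_id)
    if ref is None:
        raise ValueError(f"reference {reference_id} not found")
    ref["tags"] = [t for t in ref["tags"] if t not in set(tags)]
    return ref
```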

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/tagging.py` around lines 41-64: the service function
remove_tags currently does a duplicate lookup; to avoid it, change remove_tags
to pass the already-fetched ref_row into remove_tags_from_reference (i.e. call
remove_tags_from_reference(session, reference_id=reference_id, tags=tags,
reference_row=ref_row)) and update the remove_tags_from_reference function
signature to accept an optional reference_row parameter and use that row if
provided instead of calling session.get()/get_reference_by_id; keep the
ownership check in remove_tags as-is and ensure the updated query function still
raises ValueError when no reference_row is available.
app/assets/helpers.py (1)

26-30: ALLOWED_ROOTS constant is a clean addition.

Centralizing the root literals in one place improves consistency. Consider using this constant in resolve_destination_from_tags (in path_utils.py) for the root validation rather than hardcoding the set there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/helpers.py` around lines 26-30: update
resolve_destination_from_tags in path_utils.py to use the centralized
ALLOWED_ROOTS constant (from app.assets.helpers) instead of the hardcoded set;
add the appropriate import for ALLOWED_ROOTS at the top of path_utils.py and
replace any validation logic that compares tag roots (e.g., the code that
previously used {"models", "input", "output"}) to reference ALLOWED_ROOTS so
root validation stays consistent with the new constant.
tests-unit/seeder_test/test_seeder.py (2)

664-666: Missing blank line between test classes.

PEP 8 expects two blank lines between top-level definitions. There's only one blank line before TestSeederStopRestart.

♻️ Fix
             fresh_seeder.wait(timeout=5.0)
             assert fresh_seeder.get_status().state == State.IDLE
 
+
 class TestSeederStopRestart:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/seeder_test/test_seeder.py` around lines 664-666: the file is
missing the required blank line separation between top-level classes; add one
additional blank line between the end of the previous test class (just after the
assertion checking fresh_seeder.get_status().state == State.IDLE) and the
declaration of the next test class TestSeederStopRestart so there are two blank
lines between top-level definitions per PEP 8.

339-364: Potential race between barrier.set() and with patch(...) scope exit.

After barrier.set() on line 362, the seeder thread unblocks and continues executing. The with patch(...) context on line 348 exits immediately after line 362, restoring the original collect_paths_for_roots. If the seeder thread somehow re-invokes collect_paths_for_roots after the patch is torn down but before the mock_dependencies fixture restores its version, you'd get unexpected behavior.

In practice this is safe here because collect_paths_for_roots is only called once per scan, and mock_dependencies covers the remaining functions. Adding a fresh_seeder.wait(timeout=5.0) before exiting the with block would make this airtight, consistent with how other tests handle it.

♻️ Suggested improvement
             barrier.set()
 
-            assert sum(results) == 1
+            fresh_seeder.wait(timeout=5.0)
+
+        assert sum(results) == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/seeder_test/test_seeder.py` around lines 339-364: there's a
potential race where the patched collect_paths_for_roots may be restored before
the background seeder finishes after barrier.set(); to fix, keep the patch
active until the seeder thread completes by waiting for the seeder to finish
inside the with block—call fresh_seeder.wait(timeout=5.0) (or equivalent
synchronization on AssetSeeder) before exiting the with patch(...) context so
collect_paths_for_roots remains patched while the seeder thread finishes.
app/assets/services/path_utils.py (1)

114-117: _compute_relative has a convoluted normalization pattern.

os.path.relpath(
    os.path.join(os.sep, os.path.relpath(child, parent)), os.sep
)

This computes relpath(child, parent), then joins it to /, then computes relpath of that against / — effectively just normalizing out any leading ../ segments. But if child is not under parent, the intermediate relpath will contain .. components, and the outer relpath will silently collapse them, potentially returning a misleading relative path. This function is only called after _check_is_within confirms containment, so in practice it's safe, but a simpler os.path.relpath(child, parent) would suffice and be more readable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/path_utils.py` around lines 114-117: the
_compute_relative function uses a convoluted normalization (calling
os.path.relpath(child, parent), joining to os.sep, then relpath again) which can
silently collapse leading ".." and is unnecessary because callers run
_check_is_within first; simplify _compute_relative to directly return
os.path.relpath(child, parent) (replace the current implementation in function
_compute_relative to use a single os.path.relpath call and remove the
os.path.join/os.sep double-relpath pattern).
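The equivalence (and the silent `..` collapse the comment warns about) can be checked directly; `posixpath` is used here so the result is OS-independent:

```python
import posixpath

def double_relpath(child: str, parent: str) -> str:
    # The pattern currently used in _compute_relative.
    return posixpath.relpath(
        posixpath.join("/", posixpath.relpath(child, parent)), "/"
    )

parent = "/data/models"
inside = "/data/models/checkpoints/model.safetensors"
outside = "/data/other/file.bin"

# When child is under parent, both forms agree:
assert double_relpath(inside, parent) == posixpath.relpath(inside, parent)

# When it is not, the leading ".." is silently collapsed into a
# misleading "relative" path:
assert posixpath.relpath(outside, parent) == "../other/file.bin"
assert double_relpath(outside, parent) == "other/file.bin"
```

The second pair of assertions is exactly the hazard: a caller that skipped `_check_is_within` would get a plausible-looking relative path for a file outside the root.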
app/assets/api/upload.py (1)

12-30: Hash validation duplicated between normalize_and_validate_hash and UploadAssetSpec._parse_hash / CreateFromHashBody._require_blake3.

Three separate locations implement nearly identical blake3 hash validation logic (this function, plus the two Pydantic validators in schemas_in.py). Consolidating into a single shared helper would reduce the chance of drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/upload.py` around lines 12-30: there is duplicated blake3
hash validation across normalize_and_validate_hash and the Pydantic validators
UploadAssetSpec._parse_hash and CreateFromHashBody._require_blake3 in
schemas_in.py; extract the canonical logic into a single shared helper (e.g.,
normalize_and_validate_blake3_hash) that returns the normalized "blake3:<hex>"
or raises a clear error, then update normalize_and_validate_hash to call that
helper and replace the bodies of UploadAssetSpec._parse_hash and
CreateFromHashBody._require_blake3 to call the same helper (converting its
exception to the validator-appropriate exception type if needed, e.g., raise
ValueError inside Pydantic validators but keep UploadError for API paths), and
add the new helper to the appropriate shared utils module and imports so all
three places reuse the single implementation.
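One possible shape for the shared helper. The accepted input forms (bare hex vs. a `blake3:` prefix, case-insensitive) and the error type are assumptions for illustration; the real validators in `schemas_in.py` define the canonical rules:

```python
import re

_BLAKE3_HEX_RE = re.compile(r"[0-9a-f]{64}")

def normalize_and_validate_blake3_hash(value: str) -> str:
    """Return the canonical 'blake3:<hex>' form, or raise ValueError.

    Accepts either a bare 64-char hex digest or one already prefixed
    with 'blake3:', case-insensitively (assumed input forms).
    """
    candidate = value.strip().lower()
    if candidate.startswith("blake3:"):
        candidate = candidate[len("blake3:"):]
    if not _BLAKE3_HEX_RE.fullmatch(candidate):
        raise ValueError(f"invalid blake3 hash: {value!r}")
    return f"blake3:{candidate}"
```

The Pydantic validators could call this and let the `ValueError` propagate, while the API path would translate it into its own `UploadError`, as the prompt suggests.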
app/assets/services/ingest.py (1)

251-256: Exception chaining lost when re-raising.

The from e clause is missing, which discards the original traceback in DependencyMissingError and RuntimeError raises.

Proposed fix
     try:
         digest = hashing.compute_blake3_hash(temp_path)
     except ImportError as e:
-        raise DependencyMissingError(str(e))
+        raise DependencyMissingError(str(e)) from e
     except Exception as e:
-        raise RuntimeError(f"failed to hash uploaded file: {e}")
+        raise RuntimeError(f"failed to hash uploaded file: {e}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/ingest.py` around lines 251-256: the except blocks
around hashing.compute_blake3_hash(temp_path) drop the original tracebacks;
update both re-raises to preserve exception chaining by using "from e" (e.g.,
raise DependencyMissingError(str(e)) from e and raise RuntimeError(f"failed to
hash uploaded file: {e}") from e) so the original exception context is retained
when raising DependencyMissingError and RuntimeError.
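The behavioral difference is observable on the exception object itself: with `from e`, the original exception is attached as `__cause__`, so tracebacks render the full chain instead of discarding it. A minimal stdlib demonstration:

```python
def hash_or_fail(data: str) -> int:
    try:
        return int(data, 16)
    except ValueError as e:
        # Re-raise with chaining: the original ValueError survives as __cause__.
        raise RuntimeError(f"failed to hash input: {e}") from e

try:
    hash_or_fail("not-hex")
except RuntimeError as e:
    # Without "from e" this would be implicit __context__ only,
    # and tracebacks would not show "The above exception was the
    # direct cause of the following exception".
    assert isinstance(e.__cause__, ValueError)
```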
app/assets/seeder.py (1)

84-100: _prune_first is not initialized in __init__, requiring a getattr workaround in restart.

_prune_first is only set inside start() (line 149), but restart() uses getattr(self, "_prune_first", False) at line 293 to avoid AttributeError. Initialize it in __init__ alongside other instance variables for consistency and to remove the defensive getattr.

Proposed fix
         self._compute_hashes: bool = False
         self._progress_callback: ProgressCallback | None = None
         self._disabled: bool = False
+        self._prune_first: bool = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/seeder.py` around lines 84-100: initialize the missing
_prune_first attribute in the class __init__ to avoid the getattr workaround
used in restart; add self._prune_first: bool = False alongside other instance
variables in __init__ so start() can overwrite it and restart() can safely
reference it without getattr(self, "_prune_first", False).
app/assets/services/asset_management.py (1)

130-177: File deletion after commit silently swallows all exceptions.

contextlib.suppress(Exception) at line 173 will suppress any exception during file deletion, including unexpected errors (e.g., permission errors on directories, symlink issues). While defensive, this could hide operational problems. Consider logging at debug or warning level inside the suppression.

Suggestion
         for p in file_paths:
-            with contextlib.suppress(Exception):
-                if p and os.path.isfile(p):
-                    os.remove(p)
+            try:
+                if p and os.path.isfile(p):
+                    os.remove(p)
+            except Exception:
+                logging.debug("Failed to remove orphan file %s", p, exc_info=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/asset_management.py` around lines 130-177: the file
deletion loop in delete_asset_reference uses contextlib.suppress(Exception)
which hides all errors when removing paths in file_paths; change this to catch
Exception as e (or use contextlib.suppress but still log) and emit a
warning/debug log including the path and exception details so permission/symlink
errors aren't silently swallowed; add or use a module logger (e.g., logger =
logging.getLogger(__name__)) and log inside the exception handler referencing
delete_asset_reference and file_paths (or the loop variable p) so failures are
recorded while still allowing the loop to continue.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 185100c and 1fa4c88.

📒 Files selected for processing (58)
  • alembic_db/versions/0002_merge_to_asset_references.py
  • app/assets/api/routes.py
  • app/assets/api/schemas_in.py
  • app/assets/api/schemas_out.py
  • app/assets/api/upload.py
  • app/assets/database/bulk_ops.py
  • app/assets/database/models.py
  • app/assets/database/queries.py
  • app/assets/database/queries/__init__.py
  • app/assets/database/queries/asset.py
  • app/assets/database/queries/asset_reference.py
  • app/assets/database/queries/common.py
  • app/assets/database/queries/tags.py
  • app/assets/database/tags.py
  • app/assets/hashing.py
  • app/assets/helpers.py
  • app/assets/manager.py
  • app/assets/scanner.py
  • app/assets/seeder.py
  • app/assets/services/__init__.py
  • app/assets/services/asset_management.py
  • app/assets/services/bulk_ingest.py
  • app/assets/services/file_utils.py
  • app/assets/services/hashing.py
  • app/assets/services/ingest.py
  • app/assets/services/metadata_extract.py
  • app/assets/services/path_utils.py
  • app/assets/services/schemas.py
  • app/assets/services/tagging.py
  • app/database/db.py
  • main.py
  • requirements.txt
  • server.py
  • tests-unit/assets_test/conftest.py
  • tests-unit/assets_test/helpers.py
  • tests-unit/assets_test/queries/conftest.py
  • tests-unit/assets_test/queries/test_asset.py
  • tests-unit/assets_test/queries/test_asset_info.py
  • tests-unit/assets_test/queries/test_cache_state.py
  • tests-unit/assets_test/queries/test_metadata.py
  • tests-unit/assets_test/queries/test_tags.py
  • tests-unit/assets_test/services/__init__.py
  • tests-unit/assets_test/services/conftest.py
  • tests-unit/assets_test/services/test_asset_management.py
  • tests-unit/assets_test/services/test_bulk_ingest.py
  • tests-unit/assets_test/services/test_enrich.py
  • tests-unit/assets_test/services/test_ingest.py
  • tests-unit/assets_test/services/test_tagging.py
  • tests-unit/assets_test/test_assets_missing_sync.py
  • tests-unit/assets_test/test_crud.py
  • tests-unit/assets_test/test_downloads.py
  • tests-unit/assets_test/test_file_utils.py
  • tests-unit/assets_test/test_list_filter.py
  • tests-unit/assets_test/test_prune_orphaned_assets.py
  • tests-unit/assets_test/test_tags_api.py
  • tests-unit/assets_test/test_uploads.py
  • tests-unit/requirements.txt
  • tests-unit/seeder_test/test_seeder.py
💤 Files with no reviewable changes (7)
  • tests-unit/assets_test/conftest.py
  • tests-unit/requirements.txt
  • app/assets/database/bulk_ops.py
  • app/assets/manager.py
  • app/assets/hashing.py
  • app/assets/database/tags.py
  • app/assets/database/queries.py
✅ Files skipped from review due to trivial changes (1)
  • tests-unit/assets_test/queries/test_cache_state.py
🚧 Files skipped from review as they are similar to previous changes (16)
  • tests-unit/assets_test/test_assets_missing_sync.py
  • tests-unit/assets_test/queries/test_tags.py
  • app/assets/database/queries/common.py
  • requirements.txt
  • tests-unit/assets_test/services/__init__.py
  • tests-unit/assets_test/services/test_asset_management.py
  • app/assets/database/queries/__init__.py
  • tests-unit/assets_test/services/test_enrich.py
  • tests-unit/assets_test/services/test_ingest.py
  • tests-unit/assets_test/services/test_bulk_ingest.py
  • tests-unit/assets_test/test_downloads.py
  • app/assets/services/hashing.py
  • app/assets/services/__init__.py
  • tests-unit/assets_test/test_uploads.py
  • tests-unit/assets_test/test_prune_orphaned_assets.py
  • alembic_db/versions/0002_merge_to_asset_references.py

Comment on lines +359 to +364
            if not parsed.tmp_path or not os.path.exists(parsed.tmp_path):
                return _build_error_response(
                    404,
                    "ASSET_NOT_FOUND",
                    "Provided hash not found and no file uploaded.",
                )

⚠️ Potential issue | 🟡 Minor

HTTP 404 is misleading when no file was uploaded and hash doesn't match.

When parsed.tmp_path is missing and the hash wasn't found, the 404 response with "ASSET_NOT_FOUND" is technically about the missing uploaded content, not a missing resource. A 400 (BAD_REQUEST) would better convey that the client's request was incomplete.

Proposed fix
             if not parsed.tmp_path or not os.path.exists(parsed.tmp_path):
                 return _build_error_response(
-                    404,
-                    "ASSET_NOT_FOUND",
+                    400,
+                    "BAD_REQUEST",
                     "Provided hash not found and no file uploaded.",
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/api/routes.py` around lines 359-364: the current branch returns
a 404 via _build_error_response when parsed.tmp_path is falsy or missing, which
misrepresents a malformed request; update the conditional that checks
parsed.tmp_path (the block using parsed.tmp_path and
os.path.exists(parsed.tmp_path)) to return a 400 status with an appropriate
error code such as "BAD_REQUEST" and a message like "No file uploaded and
provided hash not found" instead of "ASSET_NOT_FOUND" so callers receive a
client error for the missing uploaded file.

Comment on lines +152 to +157
        if isinstance(v, list):
            ors = [_exists_clause_for_value(k, elem) for elem in v]
            if ors:
                stmt = stmt.where(sa.or_(*ors))
        else:
            stmt = stmt.where(_exists_clause_for_value(k, v))

⚠️ Potential issue | 🟡 Minor

Empty metadata_filter list value silently skips the filter instead of returning no results.

When v is an empty list, ors = [] and if ors: is False, so no WHERE clause is added. A caller passing {"key": []} (semantically "value equals any of nothing") gets back all rows instead of zero rows.

🐛 Proposed fix
         if isinstance(v, list):
             ors = [_exists_clause_for_value(k, elem) for elem in v]
             if ors:
                 stmt = stmt.where(sa.or_(*ors))
+            else:
+                stmt = stmt.where(sa.false())
         else:
             stmt = stmt.where(_exists_clause_for_value(k, v))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 152-157: the
code handling metadata_filter list values improperly skips empty lists; inside
the branch where v is a list (the call-site that builds ors via
_exists_clause_for_value), detect if v is an empty list and in that case add a
predicate that always evaluates to false (e.g., use SQLAlchemy's false() /
literal(False)) to the statement instead of no WHERE clause so the query returns
zero rows; otherwise keep the existing behavior of stmt.where(sa.or_(*ors)) when
ors is non-empty.

Comment on lines +655 to +669
def restore_references_by_paths(session: Session, file_paths: list[str]) -> int:
    """Restore references that were previously marked as missing.

    Returns number of references restored.
    """
    if not file_paths:
        return 0

    result = session.execute(
        sa.update(AssetReference)
        .where(AssetReference.file_path.in_(file_paths))
        .where(AssetReference.is_missing == True)  # noqa: E712
        .values(is_missing=False)
    )
    return result.rowcount

⚠️ Potential issue | 🟠 Major

Missing IN-clause chunking in bulk mutation functions will hit SQLite's 999 bind-parameter limit.

The read-side functions (get_reference_ids_by_ids line 986, get_references_by_paths_and_asset_ids line 965) already use iter_chunks(…, MAX_BIND_PARAMS). The mutation functions below pass the full list to a single IN (…) clause, which will raise OperationalError: too many SQL variables for any realistically-sized model library:

| Function | Affected clause | Line(s) |
| --- | --- | --- |
| restore_references_by_paths | file_path.in_(file_paths) | 664 |
| delete_assets_by_ids | asset_id.in_(asset_ids) + Asset.id.in_(asset_ids) | 701, 703 |
| bulk_update_needs_verify | id.in_(reference_ids) | 781 |
| bulk_update_is_missing | id.in_(reference_ids) | 797 |
| delete_references_by_ids | id.in_(reference_ids) | 812 |
| bulk_update_enrichment_level | id.in_(reference_ids) | 922 |
🐛 Sketch for bulk_update_needs_verify (same pattern applies to the others)
 def bulk_update_needs_verify(
     session: Session, reference_ids: list[str], value: bool
 ) -> int:
     if not reference_ids:
         return 0
-    result = session.execute(
-        sa.update(AssetReference)
-        .where(AssetReference.id.in_(reference_ids))
-        .values(needs_verify=value)
-    )
-    return result.rowcount
+    total = 0
+    for chunk in iter_chunks(reference_ids, MAX_BIND_PARAMS):
+        result = session.execute(
+            sa.update(AssetReference)
+            .where(AssetReference.id.in_(chunk))
+            .values(needs_verify=value)
+        )
+        total += int(result.rowcount or 0)
+    return total

Also applies to: 693-704, 770-784, 787-801, 804-814, 909-925

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 655-669: the
mutation functions (restore_references_by_paths, delete_assets_by_ids,
bulk_update_needs_verify, bulk_update_is_missing, delete_references_by_ids,
bulk_update_enrichment_level) currently pass the entire list into a single
IN(...) which will exceed SQLite's 999 bind-parameter limit; fix each by
iterating the input ids/paths using the existing iter_chunks(...,
MAX_BIND_PARAMS) helper, perform the sa.update/sa.delete per chunk (e.g. replace
AssetReference.file_path.in_(file_paths) with .in_(chunk)), accumulate and
return the total rowcount across chunks, and keep early-return for empty inputs.
Ensure the same chunked pattern is applied to both places where
asset_id/Asset.id and reference id lists are used.
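A chunking helper of the kind the read-side functions already use can be written generically. This mirrors the assumed shape of the repo's `iter_chunks`, not its actual code:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# SQLite's historical default bind-parameter ceiling.
MAX_BIND_PARAMS = 999

def iter_chunks(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items each."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk
```

Each mutation function would then execute one `UPDATE ... WHERE id IN (...)` per chunk and sum the per-chunk rowcounts, as in the sketch above.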

Comment on lines +693 to +704
def delete_assets_by_ids(session: Session, asset_ids: list[str]) -> int:
    """Delete assets and their references by ID.

    Returns number of assets deleted.
    """
    if not asset_ids:
        return 0
    session.execute(
        sa.delete(AssetReference).where(AssetReference.asset_id.in_(asset_ids))
    )
    result = session.execute(sa.delete(Asset).where(Asset.id.in_(asset_ids)))
    return result.rowcount

⚠️ Potential issue | 🟡 Minor

result.rowcount returned without int(… or 0) guard in five functions not covered by the earlier review.

delete_assets_by_ids (line 704), bulk_update_needs_verify (line 784), bulk_update_is_missing (line 801), delete_references_by_ids (line 814), and bulk_update_enrichment_level (line 925) all return result.rowcount directly. Like the already-flagged cases at lines 652 and 669, CursorResult.rowcount can be None or -1 for certain driver/statement combinations while each of these functions is typed -> int.

🐛 Proposed fix (apply to all five return sites)
-    return result.rowcount
+    return int(result.rowcount or 0)

Also applies to: 770-784, 787-801, 804-814, 909-925

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/database/queries/asset_reference.py` around lines 693-704:
several functions (delete_assets_by_ids, bulk_update_needs_verify,
bulk_update_is_missing, delete_references_by_ids, bulk_update_enrichment_level)
return result.rowcount directly, but CursorResult.rowcount can be None or -1;
update each return to cast/fallback so the function always returns an int (e.g.
use int(result.rowcount or 0) or (result.rowcount if result.rowcount and
result.rowcount >= 0 else 0)). Locate the session.execute calls that assign to
result in those functions and replace the bare return result.rowcount with a
guarded int conversion/fallback to 0 to satisfy the -> int return type.

Comment on lines +661 to +716
def _run_enrich_phase(self, roots: tuple[RootType, ...]) -> int:
    """Run phase 2: enrich existing records with metadata and hashes.

    Returns:
        Total number of assets enriched
    """
    total_enriched = 0
    batch_size = 100
    last_progress_time = time.perf_counter()
    progress_interval = 1.0

    # Get the target enrichment level based on compute_hashes
    if not self._compute_hashes:
        target_max_level = ENRICHMENT_STUB
    else:
        target_max_level = ENRICHMENT_METADATA

    self._emit_event(
        "assets.seed.started",
        {"roots": list(roots), "phase": "enrich"},
    )

    while True:
        if self._check_pause_and_cancel():
            logging.info("Enrich scan cancelled after %d assets", total_enriched)
            break

        # Fetch next batch of unenriched assets
        unenriched = get_unenriched_assets_for_roots(
            roots,
            max_level=target_max_level,
            limit=batch_size,
        )

        if not unenriched:
            break

        enriched, failed = enrich_assets_batch(
            unenriched,
            extract_metadata=True,
            compute_hash=self._compute_hashes,
        )
        total_enriched += enriched

        now = time.perf_counter()
        if now - last_progress_time >= progress_interval:
            self._emit_event(
                "assets.seed.progress",
                {
                    "phase": "enrich",
                    "enriched": total_enriched,
                },
            )
            last_progress_time = now

    return total_enriched

⚠️ Potential issue | 🟠 Major

Enrichment loop may re-fetch the same rows indefinitely if enrichment fails.

In _run_enrich_phase, get_unenriched_assets_for_roots fetches rows with enrichment_level <= max_level. If enrich_assets_batch fails for a row (e.g., the file can't be read), the enrichment level is not updated (stays at the old level), and the same row will be fetched again in the next iteration of the while True loop. This creates an infinite loop when enrichment consistently fails for some rows.

Consider adding an offset that advances past failed rows, or tracking failed reference IDs to exclude them.

Suggested approach: break if no progress is made
     while True:
         if self._check_pause_and_cancel():
             logging.info("Enrich scan cancelled after %d assets", total_enriched)
             break

         unenriched = get_unenriched_assets_for_roots(
             roots,
             max_level=target_max_level,
             limit=batch_size,
         )

         if not unenriched:
             break

         enriched, failed = enrich_assets_batch(
             unenriched,
             extract_metadata=True,
             compute_hash=self._compute_hashes,
         )
         total_enriched += enriched

+        # If no progress was made in this batch, break to avoid infinite loop
+        if enriched == 0:
+            if failed > 0:
+                logging.warning(
+                    "Enrich phase stopping: %d items failed with no progress",
+                    failed,
+                )
+            break
+
         now = time.perf_counter()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/seeder.py` around lines 661-716: the enrich loop in
_run_enrich_phase can repeatedly re-fetch the same records when
enrich_assets_batch fails for some rows because get_unenriched_assets_for_roots
selects by enrichment_level <= max_level; fix by ensuring progress even on
failures: after calling enrich_assets_batch (function enrich_assets_batch), if
enriched == 0 and failed > 0 either (a) mark the failed asset IDs as
skipped/failed (update their enrichment_level or a failure flag) so they are not
returned by get_unenriched_assets_for_roots, or (b) maintain a local set of
failed IDs and pass them to the fetch (or filter them out) / advance an offset
to avoid refetching the same batch, and as a safety break the loop if no net
progress is made for a full iteration; locate and update logic in
_run_enrich_phase, get_unenriched_assets_for_roots usage, and
enrich_assets_batch handling.
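The stop-on-no-progress guard can be isolated from the database details. This toy loop (all names hypothetical) shows why a batch with `enriched == 0` must terminate the loop when the fetch re-reads unprocessed rows:

```python
def drain_batches(fetch_batch, process_batch, batch_size=100):
    """Process batches until the source is empty or a batch makes no progress.

    fetch_batch re-reads unprocessed rows, so items that fail without
    being marked would be returned again forever; breaking on zero
    progress turns that infinite loop into a clean stop.
    """
    total = 0
    while True:
        batch = fetch_batch(batch_size)
        if not batch:
            break
        succeeded, failed = process_batch(batch)
        total += succeeded
        if succeeded == 0 and failed > 0:
            break  # nothing advanced; the same rows would be fetched again
    return total

# Simulate a source where every item permanently fails:
stuck = [1, 2, 3]
calls = {"n": 0}

def fetch(n):
    calls["n"] += 1
    return stuck[:n]  # failed items are never removed from the source

def process(batch):
    return 0, len(batch)  # everything fails

assert drain_batches(fetch, process) == 0
assert calls["n"] == 1  # exactly one fetch; the loop does not spin
```

The alternative fixes in the prompt (marking failed rows, or excluding failed IDs from the fetch) make forward progress instead of stopping, but this guard is the minimal safety net either way.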

Comment on lines +333 to +347
with create_session() as session:
    pair = fetch_reference_and_asset(
        session, reference_id=reference_id, owner_id=owner_id
    )
    if not pair:
        raise RuntimeError("inconsistent DB state after ingest")
    ref, asset = pair
    tag_names = get_reference_tags(session, reference_id=ref.id)

return UploadResult(
    ref=extract_reference_data(ref),
    asset=extract_asset_data(asset),
    tags=tag_names,
    created_new=ingest_result.asset_created,
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find create_session definition and understand its behavior
rg -n "def create_session" --type py -A 10

Repository: Comfy-Org/ComfyUI

Length of output: 149


🏁 Script executed:

# Check for expire_on_commit settings
rg -n "expire_on_commit" --type py -B 2 -A 2

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Find extract_reference_data and extract_asset_data implementations
rg -n "def extract_reference_data|def extract_asset_data" --type py -A 15

Repository: Comfy-Org/ComfyUI

Length of output: 1395


🏁 Script executed:

# Look for sessionmaker or Session configuration
rg -n "sessionmaker|Session" --type py -B 2 -A 2 | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 7003


🏁 Script executed:

# Check the full Session configuration in db.py
cat -n app/database/db.py | head -150

Repository: Comfy-Org/ComfyUI

Length of output: 4460


🏁 Script executed:

# Look for sessionmaker configuration
rg -n "sessionmaker|Session\(" --type py app/database/db.py -B 3 -A 3

Repository: Comfy-Org/ComfyUI

Length of output: 490


🏁 Script executed:

# Check the ORM model definitions to understand the relationships
rg -n "class AssetReference|class Asset" --type py -A 30 | head -150

Repository: Comfy-Org/ComfyUI

Length of output: 7851


🏁 Script executed:

# Verify if the code at lines 333-347 is new in this PR
cd app/assets/services && git log --oneline -n 20 ingest.py 2>/dev/null || echo "Git history not available in sandbox"

Repository: Comfy-Org/ComfyUI

Length of output: 93


🏁 Script executed:

# Check if there are ORM model definitions with relationship configurations
rg -n "class AssetReference|class Asset" --type py app/assets/models.py -A 40 | head -200

Repository: Comfy-Org/ComfyUI

Length of output: 121


🏁 Script executed:

# Verify the exact behavior - check if lazy loading is configured on relationships
rg -n "lazy\s*=" --type py app/assets/models.py -B 2 -A 2

Repository: Comfy-Org/ComfyUI

Length of output: 121


🏁 Script executed:

# Read the full upload_from_temp_path function to understand the context
sed -n '240,360p' app/assets/services/ingest.py

Repository: Comfy-Org/ComfyUI

Length of output: 4123


🏁 Script executed:

# Find the ORM model definitions - they might be in a different location
find . -name "*.py" -type f | xargs rg -l "class AssetReference|class Asset" | head -10

Repository: Comfy-Org/ComfyUI

Length of output: 277


🏁 Script executed:

# Check the full ingest.py file structure to understand the context better
head -20 app/assets/services/ingest.py && echo "---" && tail -20 app/assets/services/ingest.py

Repository: Comfy-Org/ComfyUI

Length of output: 1120


🏁 Script executed:

# Look for fetch_reference_and_asset to understand what it returns
rg -n "def fetch_reference_and_asset" --type py -A 20

Repository: Comfy-Org/ComfyUI

Length of output: 1683


🏁 Script executed:

# Check if the problematic code section (lines 333-347) is new in this PR by looking at git diff
git diff HEAD~1 app/assets/services/ingest.py | grep -A 20 -B 5 "fetch_reference_and_asset" | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 247


🏁 Script executed:

# Verify the default sessionmaker behavior and session context manager
rg -n "__enter__|__exit__|expire_on_commit" --type py app/database/db.py

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Check if there are any other similar patterns in the codebase where ORM objects are accessed after session closes
rg -n "with create_session()" --type py -A 10 | grep -E "(extract_|with create_session|return)" | head -40

Repository: Comfy-Org/ComfyUI

Length of output: 2621


🏁 Script executed:

# Check the ORM model definitions to see the exact column definitions
cat -n app/assets/database/models.py | head -200

Repository: Comfy-Org/ComfyUI

Length of output: 8657


🏁 Script executed:

# Verify SQLAlchemy behavior - check if session is properly closed via context manager
python3 << 'EOF'
from sqlalchemy.orm import Session
import inspect

# Check if Session has __enter__ and __exit__ methods
print("Session has __enter__:", hasattr(Session, '__enter__'))
print("Session has __exit__:", hasattr(Session, '__exit__'))

# Check sessionmaker default for expire_on_commit
from sqlalchemy.orm import sessionmaker
sm = sessionmaker()
print("Default expire_on_commit:", sm.kw.get('expire_on_commit', 'True (default)'))
EOF



🏁 Script executed:

# Check if there's any specific configuration that would prevent attribute expiration
rg -n "Session\(" --type py app/ -A 3 | grep -E "expire_on_commit|Session" | head -20



🏁 Script executed:

# Check if there are any other patterns in the codebase that do similar things successfully
rg -n "extract_reference_data|extract_asset_data" --type py -B 5 -A 2 | grep -E "(with create_session|extract_|return)" | head -50



🏁 Script executed:

# Look at asset_management.py to see how other functions handle this pattern
sed -n '44,60p' app/assets/services/asset_management.py



🏁 Script executed:

# Verify the exact behavior by checking what happens when attributes are accessed on expired ORM objects
# Look at the actual lines to confirm the issue
sed -n '333,347p' app/assets/services/ingest.py



🏁 Script executed:

# Check if there are similar patterns in _ingest_file_from_path that also extract data after session close
sed -n '60,210p' app/assets/services/ingest.py | grep -A 30 "def _ingest_file_from_path"



🏁 Script executed:

# Search for the definition of fetch_reference_asset_and_tags to compare with fetch_reference_and_asset
rg -n "def fetch_reference_asset_and_tags" --type py -A 20



🏁 Script executed:

# Check the pattern more thoroughly - see if there are uses where ORM objects are passed to functions that access attributes
# Look at _register_existing_asset to understand what it does with the tag_names passed to it
rg -n "def _register_existing_asset" --type py -A 50 | head -70



🏁 Script executed:

# Check if extract_reference_data and extract_asset_data are ever called inside session context vs outside
rg -n "extract_reference_data|extract_asset_data" --type py -B 3 | grep -E "^(app/|--)" | head -80



🏁 Script executed:

# Verify by checking if session.expire_on_commit is ever explicitly set to False
rg -n "expire_on_commit\s*=" --type py



🏁 Script executed:

# Check the full context of _register_existing_asset to see how it differs from upload_from_temp_path
sed -n '129,195p' app/assets/services/ingest.py



Move extraction of data inside the session context.

After the with create_session() block exits (line 340), the session closes and all ORM attributes are expired. Lines 343-344 then call extract_reference_data(ref) and extract_asset_data(asset) on these detached, expired objects, which will raise DetachedInstanceError when accessing any attribute.

Extract the data while the session is still active:

Proposed fix
     with create_session() as session:
         pair = fetch_reference_and_asset(
             session, reference_id=reference_id, owner_id=owner_id
         )
         if not pair:
             raise RuntimeError("inconsistent DB state after ingest")
         ref, asset = pair
         tag_names = get_reference_tags(session, reference_id=ref.id)
+        ref_data = extract_reference_data(ref)
+        asset_data = extract_asset_data(asset)

     return UploadResult(
-        ref=extract_reference_data(ref),
-        asset=extract_asset_data(asset),
+        ref=ref_data,
+        asset=asset_data,
         tags=tag_names,
         created_new=ingest_result.asset_created,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/ingest.py` around lines 333 - 347, The ref and asset are
accessed after the session is closed which will raise DetachedInstanceError;
move the extraction calls into the active session block. Inside the with
create_session() as session: after fetch_reference_and_asset(...) and
get_reference_tags(...), call extract_reference_data(ref) and
extract_asset_data(asset) to produce ref_data and asset_data, then exit the
block and return UploadResult(ref=ref_data, asset=asset_data, tags=tag_names,
created_new=ingest_result.asset_created). Keep references to create_session(),
fetch_reference_and_asset, get_reference_tags, extract_reference_data,
extract_asset_data, and UploadResult to locate the changes.
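The failure mode above can be reproduced without a database. The sketch below mimics a session that expires its objects when the block exits, like a closed `Session` with `expire_on_commit=True`; `ToySession`, `Record`, and `RefData` are invented stand-ins for illustration, not ComfyUI or SQLAlchemy code:

```python
from dataclasses import dataclass


class DetachedInstanceError(RuntimeError):
    """Stand-in for sqlalchemy.orm.exc.DetachedInstanceError."""


class Record:
    """Toy ORM object whose attributes stop loading once its session closes."""

    def __init__(self, **fields):
        object.__setattr__(self, "_fields", fields)
        object.__setattr__(self, "_detached", False)

    def __getattr__(self, name):
        if object.__getattribute__(self, "_detached"):
            raise DetachedInstanceError(f"cannot load {name!r}: instance is detached")
        return object.__getattribute__(self, "_fields")[name]


class ToySession:
    """Mimics `with create_session() as session:`: every object handed out
    is detached when the block exits."""

    def __enter__(self):
        self._records = []
        return self

    def get(self, **fields):
        rec = Record(**fields)
        self._records.append(rec)
        return rec

    def __exit__(self, *exc):
        for rec in self._records:
            object.__setattr__(rec, "_detached", True)


@dataclass
class RefData:
    id: str
    name: str


# Broken pattern: extract after the block exits -> DetachedInstanceError.
with ToySession() as s:
    ref = s.get(id="r1", name="checkpoint")
try:
    RefData(id=ref.id, name=ref.name)
    raise AssertionError("expected DetachedInstanceError")
except DetachedInstanceError:
    pass

# Fixed pattern: extract plain data while the session is still open.
with ToySession() as s:
    ref = s.get(id="r1", name="checkpoint")
    ref_data = RefData(id=ref.id, name=ref.name)

assert ref_data.name == "checkpoint"
```

This is the same shape as the proposed fix: build the plain dataclasses (`ref_data`, `asset_data`) inside the `with` block, and only return them after it.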

Comment on lines +264 to +271
tw = st_meta.get("trained_words")
if isinstance(tw, str):
    try:
        meta.trained_words = json.loads(tw)
    except json.JSONDecodeError:
        meta.trained_words = [w.strip() for w in tw.split(",") if w.strip()]
elif isinstance(tw, list):
    meta.trained_words = tw

⚠️ Potential issue | 🟡 Minor

json.loads(tw) result not validated as a list.

On line 267, json.loads(tw) could return any JSON type (dict, string, number, etc.), but it's assigned directly to meta.trained_words which is typed as list[str] | None. A non-list value here would cause subtle downstream issues (e.g., iterating over a string yields characters, iterating over a dict yields keys).

🐛 Proposed fix
         if isinstance(tw, str):
             try:
-                meta.trained_words = json.loads(tw)
+                parsed = json.loads(tw)
+                if isinstance(parsed, list):
+                    meta.trained_words = [str(w) for w in parsed if w]
             except json.JSONDecodeError:
                 meta.trained_words = [w.strip() for w in tw.split(",") if w.strip()]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/assets/services/metadata_extract.py` around lines 264 - 271, The JSON
decode result of tw may not be a list, so in the block handling tw (variable tw,
meta.trained_words in metadata_extract.py) change the assignment to validate the
parsed value: call json.loads(tw) in the try, then if isinstance(parsed, list)
assign meta.trained_words = parsed; elif isinstance(parsed, str) assign
meta.trained_words = [parsed]; otherwise fall back to the comma-split logic
(meta.trained_words = [w.strip() for w in tw.split(",") if w.strip()]) or set
meta.trained_words = None—ensure no non-list type is assigned to
meta.trained_words.

Comment on lines +115 to +138
        rows = [
            {"id": str(uuid.uuid4()), "hash": "blake3:bulk1", "size_bytes": 100, "mime_type": "text/plain", "created_at": None},
            {"id": str(uuid.uuid4()), "hash": "blake3:bulk2", "size_bytes": 200, "mime_type": "image/png", "created_at": None},
            {"id": str(uuid.uuid4()), "hash": "blake3:bulk3", "size_bytes": 300, "mime_type": None, "created_at": None},
        ]
        bulk_insert_assets(session, rows)
        session.commit()

        assets = session.query(Asset).all()
        assert len(assets) == 3
        hashes = {a.hash for a in assets}
        assert hashes == {"blake3:bulk1", "blake3:bulk2", "blake3:bulk3"}

    def test_empty_list_is_noop(self, session: Session):
        bulk_insert_assets(session, [])
        session.commit()
        assert session.query(Asset).count() == 0

    def test_handles_large_batch(self, session: Session):
        """Test chunking logic with more rows than MAX_BIND_PARAMS allows."""
        rows = [
            {"id": str(uuid.uuid4()), "hash": f"blake3:large{i}", "size_bytes": i, "mime_type": None, "created_at": None}
            for i in range(200)
        ]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the Asset model
find . -type f -name "*.py" -path "*/assets/database/models*" | head -20



🏁 Script executed:

# Check for Asset model definition - search more broadly
rg -n "class Asset" --type py



🏁 Script executed:

# Look for created_at in asset-related models
rg -n "created_at" --type py -l | grep -E "(model|asset)" | head -10



🏁 Script executed:

# Examine the Asset model definition to check created_at
sed -n '26,70p' app/assets/database/models.py



🏁 Script executed:

# Check the specific test file section mentioned in the review
sed -n '115,145p' tests-unit/assets_test/queries/test_asset.py



🏁 Script executed:

# Find and examine bulk_insert_assets function
rg -n "def bulk_insert_assets" --type py -A 15



🏁 Script executed:

# Check git history to see if these test rows are new
git log --oneline -n 30 -- tests-unit/assets_test/queries/test_asset.py



Omit created_at=None from the bulk insert rows. Asset.created_at is non-nullable with a server-side default (get_utc_now), and an explicit None in the insert overrides that default, so these rows will violate the NOT NULL constraint and raise an IntegrityError. Remove the created_at key from the dictionaries and let the database apply the default.

Suggested fix
-            {"id": str(uuid.uuid4()), "hash": "blake3:bulk1", "size_bytes": 100, "mime_type": "text/plain", "created_at": None},
-            {"id": str(uuid.uuid4()), "hash": "blake3:bulk2", "size_bytes": 200, "mime_type": "image/png", "created_at": None},
-            {"id": str(uuid.uuid4()), "hash": "blake3:bulk3", "size_bytes": 300, "mime_type": None, "created_at": None},
+            {"id": str(uuid.uuid4()), "hash": "blake3:bulk1", "size_bytes": 100, "mime_type": "text/plain"},
+            {"id": str(uuid.uuid4()), "hash": "blake3:bulk2", "size_bytes": 200, "mime_type": "image/png"},
+            {"id": str(uuid.uuid4()), "hash": "blake3:bulk3", "size_bytes": 300, "mime_type": None},
-            {"id": str(uuid.uuid4()), "hash": f"blake3:large{i}", "size_bytes": i, "mime_type": None, "created_at": None}
+            {"id": str(uuid.uuid4()), "hash": f"blake3:large{i}", "size_bytes": i, "mime_type": None}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/assets_test/queries/test_asset.py` around lines 115 - 138, The
tests pass explicit "created_at": None into bulk_insert_assets which bypasses
the DB server default and violates Asset.created_at NOT NULL; update the test
data generation in the methods that build rows (e.g. the rows list in the
bulk-insert test and the rows comprehension in test_handles_large_batch) to omit
the "created_at" key entirely so the database/server_default (get_utc_now) is
used; keep using bulk_insert_assets(session, rows) and session.commit() as
before and do not set created_at anywhere in those dictionaries.
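The default-versus-explicit-NULL behavior is easy to check with plain `sqlite3`, independent of SQLAlchemy. The table below is a minimal stand-in for `Asset` (the schema is illustrative, with `CURRENT_TIMESTAMP` in place of `get_utc_now`): omitting the column lets the database default fire, while an explicit NULL hits the NOT NULL constraint.

```python
import sqlite3

# Minimal stand-in for the assets table: created_at is NOT NULL with a
# database-side default, analogous to server_default=get_utc_now in the ORM.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assets ("
    "  id TEXT PRIMARY KEY,"
    "  hash TEXT NOT NULL,"
    "  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP"
    ")"
)

# Omitting created_at entirely lets the database fill in the default.
conn.execute("INSERT INTO assets (id, hash) VALUES (?, ?)", ("a1", "blake3:ok"))
row = conn.execute("SELECT created_at FROM assets WHERE id = 'a1'").fetchone()
assert row[0] is not None

# Passing an explicit NULL overrides the default and hits the constraint,
# which is what {"created_at": None} does through a bulk insert.
violated = False
try:
    conn.execute(
        "INSERT INTO assets (id, hash, created_at) VALUES (?, ?, ?)",
        ("a2", "blake3:bad", None),
    )
except sqlite3.IntegrityError as e:
    violated = "NOT NULL" in str(e)
assert violated
```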
