feat(assets): asynchronous background asset scanning #12336

luke-mino-altherr wants to merge 67 commits into master
Conversation
Hi! I believe a few smaller PRs will have a better chance of being merged.
📝 Walkthrough

This PR replaces the previous AssetInfo/AssetCacheState schema with a unified AssetReference model and adds an Alembic migration to create the new tables. It restructures database query code into a modular queries package, introduces a services layer for ingest/management/tagging/hashing/metadata, implements a threaded AssetSeeder for background scanning, refactors API routes and multipart upload handling, removes several legacy modules, and adds extensive unit tests for queries, services, seeding, and API behaviors.

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 46
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/assets/api/routes.py (1)
Lines 246-253: ⚠️ Potential issue | 🟠 Major: Synchronous file I/O inside an async generator blocks the event loop.

`open(abs_path, "rb")` and `f.read(chunk_size)` are blocking syscalls executed directly on the event loop thread. For large files this stalls all other request handling. Use `asyncio.to_thread`/`loop.run_in_executor`, or aiohttp's built-in `FileResponse`, which already handles this correctly.

Suggested fix using aiohttp's `FileResponse`:

```diff
-async def stream_file_chunks():
-    chunk_size = 64 * 1024
-    with open(abs_path, "rb") as f:
-        while True:
-            chunk = f.read(chunk_size)
-            if not chunk:
-                break
-            yield chunk
-
-return web.Response(
-    body=stream_file_chunks(),
-    content_type=content_type,
-    headers={
-        "Content-Disposition": cd,
-        "Content-Length": str(file_size),
-    },
-)
+resp = web.FileResponse(
+    abs_path,
+    headers={
+        "Content-Disposition": cd,
+        "Content-Type": content_type,
+    },
+)
+return resp
```

If custom streaming is needed (e.g., to preserve the explicit `Content-Length` header and `Content-Type`), offload reads to a thread executor:

```python
async def stream_file_chunks():
    chunk_size = 64 * 1024
    loop = asyncio.get_running_loop()
    f = await loop.run_in_executor(None, open, abs_path, "rb")
    try:
        while True:
            chunk = await loop.run_in_executor(None, f.read, chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        await loop.run_in_executor(None, f.close)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/assets/api/routes.py` around lines 246 - 253, The async generator stream_file_chunks uses synchronous file I/O (open and f.read) which blocks the event loop; replace those blocking calls by offloading them to a thread executor (e.g., asyncio.to_thread or asyncio.get_running_loop().run_in_executor) or switch to aiohttp's FileResponse; specifically, open the file via to_thread/run_in_executor, perform each f.read via to_thread/run_in_executor, and ensure the file is closed in a finally block (or use FileResponse which handles streaming and headers for you) so streaming no longer blocks the event loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@alembic_db/versions/0002_merge_to_asset_references.py`:
- Around line 173-177: The downgrade() in migration
0002_merge_to_asset_references currently raises a generic NotImplementedError
with no pre-flight warning; update it to provide an explicit, actionable failure
message and signal to operators (e.g., raise RuntimeError or SystemExit with a
clear, prominent message) that this migration is destructive and cannot be
auto-rolled-back, and include a short pointer to the runbook/backup procedure
and an instruction to prevent accidental alembic downgrade attempts (or, if
Alembic hooks are used in the project, install a before_downgrade hook that
logs/prints the same prominent warning); locate and change the downgrade()
function in this migration file to implement this clearer, operator-facing
behavior.
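A minimal sketch of the operator-facing guard described above; the exact message wording and the runbook pointer are illustrative assumptions, not the project's actual text:

```python
# Hypothetical downgrade() guard for the 0002_merge_to_asset_references
# migration. The message text and runbook reference are placeholders.
def downgrade() -> None:
    raise RuntimeError(
        "DESTRUCTIVE: 0002_merge_to_asset_references cannot be auto-rolled-back. "
        "The legacy AssetInfo/AssetCacheState tables were dropped by upgrade(); "
        "restore from a pre-migration backup instead (see the ops runbook), and "
        "do not run `alembic downgrade` past this revision."
    )
```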
- Around line 97-106: The migration is missing an index for updated_at which
causes ORDER BY updated_at DESC to full-scan; in the Alembic upgrade() add
op.create_index("ix_asset_references_updated_at", "asset_references",
["updated_at"]) (and add the corresponding op.drop_index in downgrade()), and
mirror the change in the SQLAlchemy model definition (e.g., in models.py on the
AssetReference/table definition add an Index("ix_asset_references_updated_at",
AssetReference.updated_at) or include updated_at in the model's __table_args__)
so the ORM and migration stay consistent.
In `@app/assets/api/routes.py`:
- Around line 46-58: get_query_dict currently iterates over request.query.keys()
which can yield duplicate keys from aiohttp's MultiDict, wasting work; change
the dict comprehension to iterate over a de-duplicated set of keys (e.g., use
set(request.query.keys())) so each key is processed once while keeping the
existing getall/get logic in get_query_dict.
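The de-duplication could look like the sketch below; `MockQuery` is a stand-in for aiohttp's `MultiDict`, and the getall/get split mirrors the logic described in the comment:

```python
class MockQuery:
    """Tiny MultiDict stand-in: keys() repeats duplicated keys, like aiohttp."""
    def __init__(self, pairs):
        self._pairs = list(pairs)

    def keys(self):
        return [k for k, _ in self._pairs]

    def getall(self, key):
        return [v for k, v in self._pairs if k == key]

    def get(self, key):
        return next((v for k, v in self._pairs if k == key), None)


def get_query_dict(query):
    out = {}
    for key in set(query.keys()):  # de-duplicate so each key is processed once
        values = query.getall(key)
        out[key] = values if len(values) > 1 else query.get(key)
    return out
```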
- Around line 580-603: delete_asset_tags currently only catches ValueError from
remove_tags so PermissionError bubbles to the generic handler and returns 500;
update the exception handling in the delete_asset_tags route to catch
PermissionError (same as add_asset_tags and update_asset_route) and map it to a
404 using _build_error_response with the same payload shape as the ValueError
branch. Locate the try/except around the remove_tags call in delete_asset_tags
and either add PermissionError to the existing except (e.g., except (ValueError,
PermissionError) as e) or add a separate except PermissionError block that
returns _build_error_response(404, "ASSET_NOT_FOUND", str(e), {"id":
reference_id}); keep the logging and other exception handling for the generic
Exception branch unchanged.
- Around line 628-662: The handler currently calls the blocking
asset_seeder.wait() directly and also fails to re-disable the seeder on the
non-wait path; change the wait call to run off the event loop (e.g., await
asyncio.to_thread(asset_seeder.wait) or await loop.run_in_executor(None,
asset_seeder.wait)) and ensure that whenever was_disabled was True you call
asset_seeder.disable() before every return path (both after the awaited wait
branch and the non-wait branch where you currently return {"status":"started"}),
keeping the existing logic around asset_seeder.enable(),
asset_seeder.start(roots=valid_roots) and asset_seeder.get_status() intact.
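The suggested shape can be sketched as follows; `StubSeeder` and `trigger_scan` are illustrative stand-ins for the real `asset_seeder` and route handler, which this sketch does not reproduce exactly:

```python
import asyncio


class StubSeeder:
    """Stand-in for the real asset_seeder; records call order."""
    def __init__(self):
        self.calls = []

    def enable(self): self.calls.append("enable")
    def disable(self): self.calls.append("disable")
    def start(self): self.calls.append("start")
    def wait(self): self.calls.append("wait")


async def trigger_scan(seeder, wait, was_disabled):
    seeder.enable()
    seeder.start()
    try:
        if wait:
            # blocking wait runs off the event loop
            await asyncio.to_thread(seeder.wait)
            return {"status": "completed"}
        return {"status": "started"}
    finally:
        if was_disabled:
            seeder.disable()  # restore prior state on every return path
```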
- Around line 695-712: The handler mark_missing_assets currently calls
asset_seeder.mark_missing_outside_prefixes() before checking
asset_seeder.get_status().state.value, allowing a race; modify
mark_missing_assets to first inspect asset_seeder.get_status().state.value (or a
helper like is_running) and if the state indicates a running scan return the 409
{"status":"scan_running","marked":0} immediately, only calling
asset_seeder.mark_missing_outside_prefixes() when the seeder is idle; also
remove the combined `marked == 0` check — treat the running-scan check
independently so a zero-mark result is not conflated with an active scan.
In `@app/assets/api/schemas_in.py`:
- Around line 42-55: Consolidate the duplicate exception classes by moving the
single definitions of HashMismatchError and DependencyMissingError into one
central module (create a shared exceptions module), delete the duplicate class
definitions from both schemas_in.py and ingest.py, and update both modules to
import HashMismatchError and DependencyMissingError from that shared module so
all code catches the same exception types.
In `@app/assets/api/upload.py`:
- Around line 93-109: The current multipart handler assumes the `hash` field is
seen before `file` and only drains without writing when `provided_hash and
provided_hash_exists is True`, causing unnecessary disk writes if `file` arrives
first; change the logic so the `file` branch writes incoming chunks to a temp
file (use a secure tempfile path tracked as e.g. temp_path) and updates
`file_written` while still supporting the existing drain-only branch behavior,
then after the multipart loop check `provided_hash_exists` and if true delete
the temp file and skip further storage, otherwise move/commit the temp file to
permanent storage; ensure exceptions around `field.read_chunk` still raise
`UploadError` and clean up temp files on error.
- Around line 111-128: The try/except currently calls
delete_temp_file_if_exists(tmp_path) but leaves the parent unique_dir behind,
causing orphaned empty directories; update the cleanup so that after removing
tmp_path you also remove its parent directory (unique_dir) if it's empty: locate
the variables uploads_root, unique_dir, tmp_path and the call to
delete_temp_file_if_exists in the except block and modify cleanup to remove the
file and then attempt to remove unique_dir (e.g., os.rmdir(unique_dir) guarded
by os.path.isdir(unique_dir) and catching OSError if not empty) before raising
the UploadError so the temporary upload directory is not left behind.
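The cleanup step can be sketched as a small helper; the `tmp_path`/`unique_dir` names follow the review comment, while the helper itself is illustrative:

```python
import os


def cleanup_failed_upload(tmp_path: str, unique_dir: str) -> None:
    """Remove the temp file, then its parent directory if now empty."""
    if os.path.isfile(tmp_path):
        os.remove(tmp_path)
    if os.path.isdir(unique_dir):
        try:
            os.rmdir(unique_dir)  # only succeeds when the directory is empty
        except OSError:
            pass                  # non-empty: leave it alone
```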
In `@app/assets/database/models.py`:
- Around line 145-162: The asset_references model defines many indexes in
__table_args__ but is missing an index on updated_at; add
Index("ix_asset_references_updated_at", "updated_at") to __table_args__ so
queries ordering by updated_at can use an index. Also add a corresponding
database migration change by calling
op.create_index("ix_asset_references_updated_at", "asset_references",
["updated_at"]) in the 0002_merge_to_asset_references.py migration (and the
reverse op.drop_index in downgrade), so the schema and migration stay
consistent.
In `@app/assets/database/queries/asset_reference.py`:
- Around line 926-942: The call to calculate_rows_per_statement uses a hardcoded
14 which can drift; in bulk_insert_references_ignore_conflicts compute the
columns-per-row dynamically (e.g., cols_per_row = len(enriched_rows[0]) after
enriching rows) and pass that to calculate_rows_per_statement instead of 14 so
the batching reflects the actual number of bind parameters for the ins execute
(references: function bulk_insert_references_ignore_conflicts, enriched_rows,
calculate_rows_per_statement, iter_chunks, AssetReference).
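The dynamic batching can be sketched like this; the 999 limit matches SQLite's historical default bind-parameter cap and is an assumption here, as is the helper's exact signature:

```python
# SQLite's historical SQLITE_MAX_VARIABLE_NUMBER default (assumption).
MAX_BIND_PARAMS = 999


def calculate_rows_per_statement(cols_per_row: int) -> int:
    return max(1, MAX_BIND_PARAMS // cols_per_row)


def rows_per_insert(enriched_rows: list) -> int:
    cols_per_row = len(enriched_rows[0])  # derived, not a hardcoded 14
    return calculate_rows_per_statement(cols_per_row)
```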
- Around line 815-827: The function delete_orphaned_seed_asset currently deletes
AssetReference rows for any asset_id before checking the asset's hash; update it
to first load the Asset via session.get(Asset, asset_id), verify that asset
exists and that asset.hash is None, and only then delete both the AssetReference
rows (sa.delete(AssetReference).where(AssetReference.asset_id == asset_id)) and
the Asset (session.delete(asset)); if the asset is missing or asset.hash is not
None return False without deleting references. Ensure you reference the existing
symbols: delete_orphaned_seed_asset, Asset, AssetReference, session.get, and
session.delete.
- Around line 626-667: Both functions mark_references_missing_outside_prefixes
and restore_references_by_paths currently return result.rowcount which can be
None; change each to return a definite integer by converting/casting rowcount to
int and defaulting to 0 when None (e.g., use an int-cast with a fallback of 0)
instead of returning result.rowcount directly so the functions always satisfy
the -> int annotation.
- Around line 290-363: The count query duplicates the filtering logic in
list_references_page (see variables base and count_stmt) which risks divergence;
refactor by constructing a single filtered selectable (e.g., filtered_base or
filtered_subq) that applies build_visible_owner_clause, name_contains escaping,
_apply_tag_filters and _apply_metadata_filter once, then derive the paginated
query from filtered_base.order_by(...).limit(...).offset(...) and derive the
total with select(sa.func.count()).select_from(filtered_base.subquery()/alias())
so both queries share the exact same filters; keep sorting/pagination only on
the paginated query and reuse the same filtered_base when loading tags
(id_list).
- Around line 945-972: The query in get_references_by_paths_and_asset_ids uses
two independent IN clauses and causes cross-product matches; replace the IN/IN
with a disjunction of correlated conditions so each pair is matched exactly
(i.e. build an OR of (AssetReference.file_path == path AND
AssetReference.asset_id == asset_id) for each entry in the chunk), adjust
chunking to account for roughly two bind params per pair (e.g. use
MAX_BIND_PARAMS // 2 or similar), execute the select with the combined or_
condition targeting AssetReference.file_path and AssetReference.asset_id, and
continue to collect winners from result.scalars().all(); use iter_chunks and
MAX_BIND_PARAMS symbols to locate where to change logic.
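The cross-product problem is easiest to see in plain Python, independent of SQLAlchemy; the sketch below contrasts two independent membership checks (the IN/IN shape) with exact pair matching (the OR-of-conjunctions shape):

```python
def match_in_in(rows, pairs):
    """IN/IN shape: admits any cross-product of the requested paths and ids."""
    paths = {p for p, _ in pairs}
    ids = {a for _, a in pairs}
    return [r for r in rows if r["file_path"] in paths and r["asset_id"] in ids]


def match_pairs(rows, pairs):
    """Pairwise shape: each (file_path, asset_id) pair matches exactly."""
    wanted = set(pairs)
    return [r for r in rows if (r["file_path"], r["asset_id"]) in wanted]
```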
In `@app/assets/database/queries/asset.py`:
- Around line 123-137: The function reassign_asset_references currently always
calls session.flush() even when no AssetReference was found; modify the control
flow to only call session.flush() when ref is not None (e.g., move the
session.flush() inside the existing if ref: block or return early when ref is
None) so you avoid an unnecessary DB round-trip; reference the ref variable,
session.flush(), and the reassign_asset_references function when applying the
change.
- Around line 93-103: get_existing_asset_ids currently passes the full asset_ids
list into a single IN(...) clause which can exceed SQLite's bind-parameter
limit; change it to iterate the asset_ids in chunks (reuse the existing
iter_chunks helper pattern used by get_reference_ids_by_ids) and accumulate
results across chunked queries against select(Asset.id). For each chunk call
session.execute(select(Asset.id).where(Asset.id.in_(chunk))).fetchall() and add
the returned ids to a set to return, while keeping the early-empty-list check
and function signature intact.
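The chunked-accumulation pattern can be sketched as below; `lookup(chunk)` stands in for the per-chunk `session.execute(select(Asset.id).where(Asset.id.in_(chunk)))`, and the 900 default is an illustrative margin under SQLite's bind limit:

```python
def get_existing_ids_chunked(lookup, ids, chunk_size=900):
    """Accumulate results across chunked IN(...) queries; lookup is a stand-in
    for the per-chunk SELECT against Asset.id."""
    if not ids:
        return set()
    found = set()
    for i in range(0, len(ids), chunk_size):
        found.update(lookup(ids[i : i + chunk_size]))
    return found
```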
In `@app/assets/database/queries/common.py`:
- Around line 17-20: iter_chunks currently assumes seq supports len() and
slicing which breaks for generators — tighten the contract by typing seq as a
sliceable Sequence (e.g. add from typing import Sequence, TypeVar, Iterator and
use def iter_chunks(seq: Sequence[T], n: int) -> Iterator[Sequence[T]] with a
TypeVar T) so callers know the function requires a sequence; update the function
signature and imports accordingly and run type-checks to ensure compatibility
with existing callers.
In `@app/assets/database/queries/tags.py`:
- Around line 291-325: The total_q subquery ignores the owner filter when
include_zero is False; update the inner select used in total_q (the
select(AssetReferenceTag.tag_name).group_by(...) expression) to include the same
owner filter applied elsewhere (i.e., add .where(AssetReferenceTag.owner_user_id
== <owner_id> or the equivalent owner predicate used in the main query) so
Tag.name.in_(...) only considers tags owned by that user), ensuring the owner
predicate matches the one used for the main query (refer to total_q,
include_zero, AssetReferenceTag, and Tag in tags.py).
In `@app/assets/helpers.py`:
- Around line 6-23: The function select_best_live_path accepts an
unparameterized Sequence and accesses duck-typed attributes via getattr; define
a lightweight Protocol (e.g., CacheStateProtocol) that declares the expected
attributes (file_path: Optional[str] and needs_verify: bool or Optional[bool])
and import Protocol and Optional from typing, then change the function signature
to states: Sequence[CacheStateProtocol] so type checkers can verify callers and
the attribute use inside select_best_live_path is explicit (refer to the
select_best_live_path function and introduce CacheStateProtocol to match its
file_path and needs_verify usage).
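A hedged reconstruction of the suggested Protocol; the attribute names mirror the described `file_path`/`needs_verify` usage, but the selection policy in the function body is illustrative, not the project's actual logic:

```python
from typing import Optional, Protocol, Sequence


class CacheStateProtocol(Protocol):
    file_path: Optional[str]
    needs_verify: bool


def select_best_live_path(states: Sequence[CacheStateProtocol]) -> Optional[str]:
    """Prefer a verified path, fall back to any path (illustrative policy)."""
    for state in states:
        if state.file_path and not state.needs_verify:
            return state.file_path
    for state in states:
        if state.file_path:
            return state.file_path
    return None
```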
In `@app/assets/scanner.py`:
- Around line 298-306: The fast-scan loop using os.stat(abs_p,
follow_symlinks=False) is using lstat semantics that mismatch the verification
path and cause perpetual fast_ok/“missing” mismatches; update the stat calls in
the fast-scan loop (the for p in paths block where stat_p is assigned) to use
follow_symlinks=True (and make the identical change for the second occurrence
around lines 372-380) so the symlink-following policy matches
verification/build_*_specs behavior.
In `@app/assets/seeder.py`:
- Around line 250-251: Remove the trailing whitespace at the end of the
docstring line containing "Resume a paused scan." in app/assets/seeder.py (the
docstring lines shown in the diff); edit the docstring so there are no space
characters after the final text or newline, save the file, and re-run ruff to
confirm the trailing-whitespace warnings are resolved.
In `@app/assets/services/bulk_ingest.py`:
- Around line 186-193: The current code filters out reference_rows whose
r["asset_id"] weren't returned from get_existing_asset_ids, which drops
references for assets that already exist (duplicate hash); instead resolve asset
IDs by asset hash for all reference_rows and map each row to the correct
existing asset id when present. Change the logic around get_existing_asset_ids:
query the assets table by asset hash (use or add a helper like
get_asset_ids_by_hashes(session, hashes) or extend get_existing_asset_ids to
accept hashes), build a hash->asset_id map, then for each reference row replace
or set r["asset_id"] to the mapped existing id when available (keeping newly
inserted ids otherwise) and do not simply drop rows based on inserted_asset_ids;
use the symbols reference_rows, r["asset_id"], and the existing DB lookup helper
(get_existing_asset_ids or a new get_asset_ids_by_hashes) to locate and
implement the fix.
In `@app/assets/services/file_utils.py`:
- Around line 4-8: The getattr fallback in get_mtime_ns causes unnecessary
float-to-int conversion and precision loss; instead, change get_mtime_ns to
directly return stat_result.st_mtime_ns (remove the int(stat_result.st_mtime *
1_000_000_000) fallback) so the function always uses the native nanosecond
timestamp and avoids spurious file-change mismatches in verify_file_unchanged.
In `@app/assets/services/hashing.py`:
- Around line 35-54: _hash_file_obj currently only restores the file position
when orig_pos != 0, leaving the stream at EOF for callers that started at 0;
change the finally block to always seek back to orig_pos (i.e., call
file_obj.seek(orig_pos) unconditionally) so the original position is restored
after reading, while keeping the existing pre-read seek(0) logic that runs only
when orig_pos != 0.
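The always-restore behavior can be sketched as below; sha256 is used so the sketch stays stdlib-only (the PR itself uses blake3 per requirements.txt), and the function name is illustrative:

```python
import hashlib


def hash_file_obj(file_obj) -> str:
    """Hash the whole stream, then restore the caller's position, even pos 0."""
    orig_pos = file_obj.tell()
    try:
        if orig_pos != 0:
            file_obj.seek(0)       # existing pre-read rewind, kept as-is
        h = hashlib.sha256()
        for chunk in iter(lambda: file_obj.read(8192), b""):
            h.update(chunk)
        return h.hexdigest()
    finally:
        file_obj.seek(orig_pos)    # unconditional restore, not just when != 0
```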
In `@app/assets/services/ingest.py`:
- Around line 242-326: The function upload_from_temp_path doesn't guard against
tags being None; normalize tags at the top of the function (e.g. in
upload_from_temp_path) by converting None to an empty list (tags = list(tags or
[])) before calling resolve_destination_from_tags, passing to
_register_existing_asset, and to _ingest_file_from_path so
resolve_destination_from_tags(tags) and any list(tags) usages won't raise
TypeError; update references to use this normalized variable throughout the
function (keep function name upload_from_temp_path,
resolve_destination_from_tags, _register_existing_asset, and
_ingest_file_from_path for locating the change).
In `@app/assets/services/path_utils.py`:
- Around line 53-60: The path check in validate_path_within_base uses
os.path.abspath which can be bypassed via symlinks; replace abspath with
os.path.realpath for both candidate and base before calling os.path.commonpath
(i.e., use realpath for cand_abs and base_abs), preserve the same ValueError
behavior/messages (raising "destination escapes base directory" when commonpath
differs and "invalid destination path" on other exceptions), and keep the
function signature validate_path_within_base(candidate: str, base: str) -> None
unchanged.
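A minimal sketch of the realpath-based check with the error messages quoted in the comment; the exception-mapping details are a best-effort reconstruction:

```python
import os


def validate_path_within_base(candidate: str, base: str) -> None:
    try:
        cand_abs = os.path.realpath(candidate)  # resolves symlinks, unlike abspath
        base_abs = os.path.realpath(base)
        if os.path.commonpath([cand_abs, base_abs]) != base_abs:
            raise ValueError("destination escapes base directory")
    except ValueError:
        raise
    except Exception:
        raise ValueError("invalid destination path")
```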
- Around line 25-50: The function resolve_destination_from_tags currently
assumes tags is non-empty and treats any unknown root as "output"; first
validate that tags is a non-empty list and raise ValueError("empty tags") if
not. Next, after computing root = tags[0], check that root is one of the allowed
roots ("models", "input", "output") and raise ValueError(f"unknown root tag
'{root}'") for any other value instead of falling through; keep the existing
models branch but retain its existing checks for a missing category tag and
unknown category, and ensure raw_subdirs handling and path normalization remain
unchanged.
In `@app/assets/services/schemas.py`:
- Around line 76-79: The service-layer NamedTuple TagUsage uses tag_type while
the API-layer schemas_out.TagUsage uses type; make them consistent by renaming
the service field to type (change TagUsage fields to name: str, type: str,
count: int) or, if renaming is undesirable, add a clear mapping comment and
conversion helper where service TagUsage is converted to schemas_out.TagUsage
(referencing TagUsage in this file and schemas_out.TagUsage.type) so callers and
route mappers know to translate tag_type <-> type explicitly.
- Around line 11-14: AssetData.hash is declared as str but Asset.hash is typed
str | None, causing extract_asset_data to construct AssetData with None for
hash; update the AssetData class to declare hash: str | None (and adjust any
related type hints) or, alternatively, have extract_asset_data
validate/normalize asset.hash (e.g., raise or replace None with an empty string)
before constructing AssetData; modify the symbol AssetData (in
app/assets/services/schemas.py) and/or the extract_asset_data function to ensure
types align and downstream code won’t receive a None where a str is expected.
- Around line 31-43: AssetDetailResult and AssetSummaryData share identical
fields and RegisterAssetResult and UploadResult only differ by the boolean flag
name (created vs created_new); refactor by introducing a shared frozen dataclass
base (e.g., AssetResultBase or AssetSummaryBase) with fields ref, asset, tags
and have AssetDetailResult/AssetSummaryData inherit from it, and likewise unify
RegisterAssetResult and UploadResult to inherit the same base and expose a
single boolean (choose one canonical name such as created) while preserving
external API compatibility (add a small adapter/alias or property if consumers
expect the alternate name), or if the separation is intentional, add a brief
comment above AssetDetailResult, AssetSummaryData, RegisterAssetResult and
UploadResult documenting why they are distinct to avoid future confusion.
- Around line 35-36: The dataclasses in app/assets/services/schemas.py declare
mutable list fields (e.g., tags and ListAssetsResult.items) while using
frozen=True, which breaks immutability; change the type annotations from
list[str] / list[AssetSummaryData] to tuple[str, ...] / tuple[AssetSummaryData,
...] (and adjust any default_factory usages to use tuple() or None with
post-init conversion) for all affected fields (tags, ListAssetsResult.items and
the other list-typed fields called out), and update any callers/constructors to
pass tuples (or convert lists to tuples) so the dataclass truly remains
immutable.
- Around line 89-92: ListAssetsResult should own pagination logic: add an
offset: int field to the dataclass and expose has_more as a computed boolean
(e.g., a `@property` or computed field) using (offset + len(items)) < total;
update the service function that constructs ListAssetsResult (the service method
returning ListAssetsResult) to pass the offset and remove the duplicate has_more
computation in the route layer (where it’s currently computed) so callers can
read result.has_more directly.
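The tuple-typed frozen dataclasses and the computed `has_more` suggested above could look like this sketch; field names follow the review comments, the rest is illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetSummaryData:
    name: str
    tags: tuple[str, ...]  # tuple, not list: the frozen dataclass stays immutable


@dataclass(frozen=True)
class ListAssetsResult:
    items: tuple[AssetSummaryData, ...]
    total: int
    offset: int

    @property
    def has_more(self) -> bool:
        # pagination owned by the result, not recomputed in the route layer
        return (self.offset + len(self.items)) < self.total
```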
In `@main.py`:
- Around line 262-266: The code calls asset_seeder.pause() then e.execute(...)
and asset_seeder.resume(), but if e.execute raises an exception resume() is
never called; wrap the execute call in a try/finally so asset_seeder.resume() is
guaranteed to run (i.e., call asset_seeder.pause(), then try: e.execute(item[2],
prompt_id, extra_data, item[4]) finally: asset_seeder.resume()), keeping the
same parameters and preserve any exception propagation after resume.
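The pause/execute/resume guard reduces to a classic try/finally; `StubSeeder` and `execute` below are illustrative stand-ins for `asset_seeder` and `e.execute`:

```python
class StubSeeder:
    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False


def run_with_seeder_paused(seeder, execute):
    seeder.pause()
    try:
        execute()
    finally:
        seeder.resume()  # guaranteed even when execute() raises
```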
- Around line 263-265: Remove trailing whitespace on the blank lines around the
call to e.execute(item[2], prompt_id, extra_data, item[4]) — the lint failure is
caused by whitespace-only characters on the blank lines near that statement;
open the block containing the e.execute call and delete any trailing spaces on
the empty lines (or enable your editor's trim-trailing-whitespace/trim on save)
so the file has truly empty lines rather than whitespace.
In `@requirements.txt`:
- Line 27: Pin the blake3 dependency to a safe minimum version to avoid silently
pulling in breaking or insecure releases; update the requirements.txt entry for
"blake3" to include a version constraint such as "blake3>=1.0.8" (or the chosen
minimum) so installs will require at least 1.0.8 which contains the PyBuffer
race-fix.
In `@server.py`:
- Around line 692-695: Wrap the call to asset_seeder.start inside
get_object_info in a try/except so any exception from the background seeding
does not bubble and fail the request; i.e., around the
asset_seeder.start(roots=("models","input","output")) call (and optionally the
surrounding with folder_paths.cache_helper block) catch Exception, log the error
via the existing logger and continue so the route returns normally even if
seeding fails.
In `@tests-unit/assets_test/queries/conftest.py`:
- Around line 8-14: The SQLite in-memory session fixture (session) currently
doesn't enable foreign key enforcement, so update the fixture to turn on SQLite
PRAGMA foreign_keys=ON on engine connect; import and use sqlalchemy.event.listen
targeting the engine's "connect" event (apply PRAGMA using the DBAPI connection
in the listener) before calling Base.metadata.create_all(engine) so tests
enforce FK constraints for the AssetReference schema.
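The motivation is easy to demonstrate with the stdlib `sqlite3` module (the fixture itself would apply the PRAGMA in a SQLAlchemy "connect" event listener; this dependency-free sketch uses a simplified schema, not the real AssetReference tables):

```python
import sqlite3


def fk_violation_rejected(enforce: bool) -> bool:
    """SQLite ships with foreign_keys OFF per connection, so orphan rows pass
    silently unless the PRAGMA is enabled."""
    conn = sqlite3.connect(":memory:")
    try:
        if enforce:
            conn.execute("PRAGMA foreign_keys=ON")
        conn.execute("CREATE TABLE assets (id INTEGER PRIMARY KEY)")
        conn.execute(
            "CREATE TABLE asset_references ("
            "id INTEGER PRIMARY KEY, "
            "asset_id INTEGER REFERENCES assets(id))"
        )
        try:
            conn.execute("INSERT INTO asset_references (asset_id) VALUES (999)")
            return False  # orphan row accepted
        except sqlite3.IntegrityError:
            return True   # FK enforced
    finally:
        conn.close()
```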
In `@tests-unit/assets_test/queries/test_asset_info.py`:
- Around line 428-501: Add a brief inline comment in the
TestBulkInsertReferencesIgnoreConflicts tests (near the rows definitions)
explaining that file_path is intentionally omitted so it defaults to NULL and,
because the unique/conflict key is file_path and SQLite treats NULLs as
distinct, all inserts will succeed; reference the
bulk_insert_references_ignore_conflicts helper and the AssetReference
model/file_path uniqueness to make the intent explicit.
- Around line 206-207: Remove the redundant local "import time" that is
duplicated at module level; keep the existing top-level import and delete the
inline "import time" immediately before "time.sleep(0.01)" so the test uses the
module-level time and avoids a duplicate import statement.
In `@tests-unit/assets_test/queries/test_asset.py`:
- Around line 24-31: In test_exists_by_hash, the outer guard "if setup_hash is
not None or setup_hash is None" is always true and should be removed; instead
keep a single meaningful condition—replace the two-level check with a single
conditional such as "if setup_hash is not None or query_hash == '':" before
creating and committing the Asset (referencing test_exists_by_hash, setup_hash,
query_hash, and Asset) so the setup only creates an asset when intended.
In `@tests-unit/assets_test/services/conftest.py`:
- Around line 50-54: The temp_dir pytest fixture duplicates built-in tmp_path;
remove the custom temp_dir fixture from conftest.py (the function named
temp_dir) and update any tests that reference temp_dir to use pytest's tmp_path
instead (e.g., change temp_dir -> tmp_path in tests such as test_file_utils.py
if present), ensuring imports/fixtures are not left unused.
In `@tests-unit/assets_test/services/test_enrich.py`:
- Around line 161-195: The test mocks create_session via mock_cs but
enrich_asset returns ENRICHMENT_STUB before any DB session is created, so the
mock is unused; update the test_missing_file_returns_stub_level to either remove
the mock_cs setup or keep it but add a clear assertion or comment to document
intent (e.g., call mock_cs.assert_not_called() after enrich_asset) so readers
know the DB pathway wasn't exercised—refer to mock_cs, enrich_asset, and the
test_missing_file_returns_stub_level function when making the change.
- Around line 61-69: Add a reusable pytest fixture (e.g., patch_scanner_session)
that defines the contextmanager _create_session which yields Session(db_engine)
and patches app.assets.scanner.create_session with side_effect=_create_session;
place this fixture in conftest.py or at the top of the test module and have each
test that currently uses the repeated with patch(...) block accept
patch_scanner_session as a parameter and remove the duplicated patch code,
keeping references to create_session, _create_session, Session, and db_engine
consistent.
In `@tests-unit/assets_test/test_crud.py`:
- Around line 5-8: Replace the top-level import of the local helpers module with
an explicit relative import to make the intent clear and robust: in test_crud.py
change the line importing helpers to use a relative form (e.g., import the
module as from . import helpers or import the functions as from .helpers import
get_asset_filename, trigger_sync_seed_assets) so that the local helpers module
and the functions get_asset_filename and trigger_sync_seed_assets are
unambiguously loaded.
In `@tests-unit/assets_test/test_file_utils.py`:
- Around line 1-55: Add unit tests for get_mtime_ns, get_size_and_mtime_ns, and
verify_file_unchanged to cover the missing change-detection edge cases: assert
get_mtime_ns and get_size_and_mtime_ns return correct nanosecond mtime and size
for real files (use tmp_path to create files), add verify_file_unchanged tests
for mtime_db=None (should return False), size_db=None (should skip size check
and only compare mtime), explicit mtime mismatch and size mismatch cases (return
False), and a precision-sensitive test where mtime differs only at the
nanosecond level to verify the nanosecond equality path; reference the functions
get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged in your new tests
and use tmp_path and monkeypatch/os.stat stubs as needed to control mtime/size
values.
---
Outside diff comments:
In `@app/assets/api/routes.py`:
- Around line 246-253: The async generator stream_file_chunks uses synchronous
file I/O (open and f.read) which blocks the event loop; replace those blocking
calls by offloading them to a thread executor (e.g., asyncio.to_thread or
asyncio.get_running_loop().run_in_executor) or switch to aiohttp's FileResponse;
specifically, open the file via to_thread/run_in_executor, perform each f.read
via to_thread/run_in_executor, and ensure the file is closed in a finally block
(or use FileResponse which handles streaming and headers for you) so streaming
no longer blocks the event loop.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (58)
- alembic_db/versions/0002_merge_to_asset_references.py
- app/assets/api/routes.py
- app/assets/api/schemas_in.py
- app/assets/api/schemas_out.py
- app/assets/api/upload.py
- app/assets/database/bulk_ops.py
- app/assets/database/models.py
- app/assets/database/queries.py
- app/assets/database/queries/__init__.py
- app/assets/database/queries/asset.py
- app/assets/database/queries/asset_reference.py
- app/assets/database/queries/common.py
- app/assets/database/queries/tags.py
- app/assets/database/tags.py
- app/assets/hashing.py
- app/assets/helpers.py
- app/assets/manager.py
- app/assets/scanner.py
- app/assets/seeder.py
- app/assets/services/__init__.py
- app/assets/services/asset_management.py
- app/assets/services/bulk_ingest.py
- app/assets/services/file_utils.py
- app/assets/services/hashing.py
- app/assets/services/ingest.py
- app/assets/services/metadata_extract.py
- app/assets/services/path_utils.py
- app/assets/services/schemas.py
- app/assets/services/tagging.py
- app/database/db.py
- main.py
- requirements.txt
- server.py
- tests-unit/assets_test/conftest.py
- tests-unit/assets_test/helpers.py
- tests-unit/assets_test/queries/conftest.py
- tests-unit/assets_test/queries/test_asset.py
- tests-unit/assets_test/queries/test_asset_info.py
- tests-unit/assets_test/queries/test_cache_state.py
- tests-unit/assets_test/queries/test_metadata.py
- tests-unit/assets_test/queries/test_tags.py
- tests-unit/assets_test/services/__init__.py
- tests-unit/assets_test/services/conftest.py
- tests-unit/assets_test/services/test_asset_management.py
- tests-unit/assets_test/services/test_bulk_ingest.py
- tests-unit/assets_test/services/test_enrich.py
- tests-unit/assets_test/services/test_ingest.py
- tests-unit/assets_test/services/test_tagging.py
- tests-unit/assets_test/test_assets_missing_sync.py
- tests-unit/assets_test/test_crud.py
- tests-unit/assets_test/test_downloads.py
- tests-unit/assets_test/test_file_utils.py
- tests-unit/assets_test/test_list_filter.py
- tests-unit/assets_test/test_prune_orphaned_assets.py
- tests-unit/assets_test/test_tags_api.py
- tests-unit/assets_test/test_uploads.py
- tests-unit/requirements.txt
- tests-unit/seeder_test/test_seeder.py
💤 Files with no reviewable changes (7)
- tests-unit/requirements.txt
- app/assets/hashing.py
- tests-unit/assets_test/conftest.py
- app/assets/database/tags.py
- app/assets/database/bulk_ops.py
- app/assets/manager.py
- app/assets/database/queries.py
    op.create_index(
        "ix_asset_references_enrichment_level", "asset_references", ["enrichment_level"]
    )
    op.create_index("ix_asset_references_created_at", "asset_references", ["created_at"])
    op.create_index(
        "ix_asset_references_last_access_time", "asset_references", ["last_access_time"]
    )
    op.create_index(
        "ix_asset_references_owner_name", "asset_references", ["owner_id", "name"]
    )
Missing index on updated_at — sorting by updated_at will full-scan at scale.
created_at and last_access_time get explicit indexes, but updated_at is omitted. The integration test test_list_assets_sort_by_updated_at_desc confirms the API supports and exercises ORDER BY updated_at DESC. Without an index, every such sort on a large asset_references table requires a full table scan + sort.
🔧 Proposed fix (add to upgrade(), and mirror in models.py)
     op.create_index(
         "ix_asset_references_owner_name", "asset_references", ["owner_id", "name"]
     )
+    op.create_index(
+        "ix_asset_references_updated_at", "asset_references", ["updated_at"]
+    )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@alembic_db/versions/0002_merge_to_asset_references.py` around lines 97 - 106,
The migration is missing an index for updated_at which causes ORDER BY
updated_at DESC to full-scan; in the Alembic upgrade() add
op.create_index("ix_asset_references_updated_at", "asset_references",
["updated_at"]) (and add the corresponding op.drop_index in downgrade()), and
mirror the change in the SQLAlchemy model definition (e.g., in models.py on the
AssetReference/table definition add an Index("ix_asset_references_updated_at",
AssetReference.updated_at) or include updated_at in the model's __table_args__)
so the ORM and migration stay consistent.
app/assets/api/schemas_in.py (outdated)
class HashMismatchError(Exception):
    """Uploaded file hash does not match provided hash."""

    def __init__(self, message: str):
        super().__init__(message)
        self.message = message


class DependencyMissingError(Exception):
    """A required dependency is not installed."""

    def __init__(self, message: str):
        super().__init__(message)
        self.message = message
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify duplicate exception definitions
echo "=== HashMismatchError definitions ==="
rg -n 'class HashMismatchError' --type=py
echo ""
echo "=== DependencyMissingError definitions ==="
rg -n 'class DependencyMissingError' --type=py
echo ""
echo "=== Import sites ==="
rg -n 'import.*HashMismatchError' --type=py
rg -n 'import.*DependencyMissingError' --type=py

Repository: Comfy-Org/ComfyUI
Length of output: 450
Consolidate duplicate HashMismatchError and DependencyMissingError definitions.
These exception classes are defined in both app/assets/services/ingest.py (lines 232, 236) and app/assets/api/schemas_in.py (lines 42, 50). This creates two independent classes with the same name, breaking exception handling across module boundaries—code catching one won't catch the other, causing silently unhandled errors if the wrong module's exception is raised.
Define these exceptions in a single location and import them from there.
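A minimal sketch of that shared module. The module path `app/assets/exceptions.py` and the `AssetError` base class are suggestions for illustration, not something present in the PR:

```python
# Hypothetical app/assets/exceptions.py -- a single home for these
# errors. The path and the AssetError base class are suggestions.


class AssetError(Exception):
    """Base class so callers can catch every asset error in one clause."""

    def __init__(self, message: str):
        super().__init__(message)
        self.message = message


class HashMismatchError(AssetError):
    """Uploaded file hash does not match provided hash."""


class DependencyMissingError(AssetError):
    """A required dependency is not installed."""
```

Both `app/assets/api/schemas_in.py` and `app/assets/services/ingest.py` would then import from this one module, so an `except HashMismatchError` in the route layer catches exactly the exception the service raised.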
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/api/schemas_in.py` around lines 42 - 55, Consolidate the duplicate
exception classes by moving the single definitions of HashMismatchError and
DependencyMissingError into one central module (create a shared exceptions
module), delete the duplicate class definitions from both schemas_in.py and
ingest.py, and update both modules to import HashMismatchError and
DependencyMissingError from that shared module so all code catches the same
exception types.
uploads_root = os.path.join(folder_paths.get_temp_directory(), "uploads")
unique_dir = os.path.join(uploads_root, uuid.uuid4().hex)
os.makedirs(unique_dir, exist_ok=True)
tmp_path = os.path.join(unique_dir, ".upload.part")

try:
    with open(tmp_path, "wb") as f:
        while True:
            chunk = await field.read_chunk(8 * 1024 * 1024)
            if not chunk:
                break
            f.write(chunk)
            file_written += len(chunk)
except Exception:
    delete_temp_file_if_exists(tmp_path)
    raise UploadError(
        500, "UPLOAD_IO_ERROR", "Failed to receive and store uploaded file."
    )
🧹 Nitpick | 🔵 Trivial
Orphaned temp directory left behind on cleanup.
delete_temp_file_if_exists removes the file but not the parent unique_dir created here. Over time, empty directories accumulate under temp/uploads/.
♻️ Proposed fix for delete_temp_file_if_exists
 def delete_temp_file_if_exists(tmp_path: str | None) -> None:
     """Safely remove a temp file if it exists."""
     if tmp_path:
         try:
             if os.path.exists(tmp_path):
                 os.remove(tmp_path)
+            parent = os.path.dirname(tmp_path)
+            if parent and os.path.isdir(parent):
+                try:
+                    os.rmdir(parent)  # only succeeds if empty
+                except OSError:
+                    pass
         except OSError as e:
             logging.debug("Failed to delete temp file %s: %s", tmp_path, e)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/api/upload.py` around lines 111 - 128, The try/except currently
calls delete_temp_file_if_exists(tmp_path) but leaves the parent unique_dir
behind, causing orphaned empty directories; update the cleanup so that after
removing tmp_path you also remove its parent directory (unique_dir) if it's
empty: locate the variables uploads_root, unique_dir, tmp_path and the call to
delete_temp_file_if_exists in the except block and modify cleanup to remove the
file and then attempt to remove unique_dir (e.g., os.rmdir(unique_dir) guarded
by os.path.isdir(unique_dir) and catching OSError if not empty) before raising
the UploadError so the temporary upload directory is not left behind.
@pytest.fixture
def temp_dir():
    """Temporary directory for file operations."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield Path(tmpdir)
🧹 Nitpick | 🔵 Trivial
temp_dir duplicates pytest's built-in tmp_path fixture.
tmp_path already returns a Path to a unique per-test temporary directory. test_file_utils.py uses tmp_path directly, so this fixture appears unused by the existing test suite.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-unit/assets_test/services/conftest.py` around lines 50 - 54, The
temp_dir pytest fixture duplicates built-in tmp_path; remove the custom temp_dir
fixture from conftest.py (the function named temp_dir) and update any tests that
reference temp_dir to use pytest's tmp_path instead (e.g., change temp_dir ->
tmp_path in tests such as test_file_utils.py if present), ensuring
imports/fixtures are not left unused.
with patch("app.assets.scanner.create_session") as mock_cs:
    from contextlib import contextmanager

    @contextmanager
    def _create_session():
        with Session(db_engine) as sess:
            yield sess

    mock_cs.side_effect = _create_session
🛠️ Refactor suggestion | 🟠 Major
Extract the repeated create_session patch into a shared fixture.
The identical context-manager mock for app.assets.scanner.create_session is duplicated in every test method (5 times). A @pytest.fixture (or conftest.py helper) would remove the boilerplate and make the tests easier to maintain.
♻️ Example fixture
# In conftest.py or at top of this file
`@pytest.fixture`
def patch_scanner_session(db_engine):
"""Patch scanner's create_session to use the test engine."""
from contextlib import contextmanager
from unittest.mock import patch
`@contextmanager`
def _create_session():
with Session(db_engine) as sess:
yield sess
with patch("app.assets.scanner.create_session", side_effect=_create_session):
yieldThen each test simply requests the fixture:
def test_extracts_mime_type_and_updates_asset(
self, patch_scanner_session, temp_dir: Path, session: Session
):
# ... test body without the `with patch(...)` block🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-unit/assets_test/services/test_enrich.py` around lines 61 - 69, Add a
reusable pytest fixture (e.g., patch_scanner_session) that defines the
contextmanager _create_session which yields Session(db_engine) and patches
app.assets.scanner.create_session with side_effect=_create_session; place this
fixture in conftest.py or at the top of the test module and have each test that
currently uses the repeated with patch(...) block accept patch_scanner_session
as a parameter and remove the duplicated patch code, keeping references to
create_session, _create_session, Session, and db_engine consistent.
def test_missing_file_returns_stub_level(
    self, db_engine, temp_dir: Path, session: Session
):
    """Verify missing files don't cause errors and return STUB level."""
    file_path = temp_dir / "nonexistent.bin"

    asset, ref = _create_stub_asset(
        session, str(file_path), "asset-4", "ref-4"
    )
    session.commit()

    with patch("app.assets.scanner.create_session") as mock_cs:
        from contextlib import contextmanager

        @contextmanager
        def _create_session():
            with Session(db_engine) as sess:
                yield sess

        mock_cs.side_effect = _create_session

        new_level = enrich_asset(
            file_path=str(file_path),
            reference_id=ref.id,
            asset_id=asset.id,
            extract_metadata=True,
            compute_hash=True,
        )

        assert new_level == ENRICHMENT_STUB

        session.expire_all()
        updated_asset = session.get(Asset, "asset-4")
        assert updated_asset.mime_type is None
        assert updated_asset.hash is None
🧹 Nitpick | 🔵 Trivial
Note: create_session mock is unused for the missing-file case.
enrich_asset returns ENRICHMENT_STUB immediately after the failed os.stat, before any DB call. The mock setup is harmless but unnecessary here. Consider adding a brief comment or assertion (e.g., mock_cs.assert_not_called()) to document the intent.
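For reference, `Mock.assert_not_called()` fails loudly the moment the mock has been invoked, so it documents the intent in one line. A self-contained illustration (the `mock_cs` name just mirrors the test above):

```python
from unittest.mock import MagicMock

mock_cs = MagicMock()
mock_cs.assert_not_called()  # passes: the mock has never been invoked

mock_cs()  # simulate the DB pathway being exercised after all
try:
    mock_cs.assert_not_called()
    raised = False
except AssertionError:
    raised = True
print(raised)  # → True
```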
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-unit/assets_test/services/test_enrich.py` around lines 161 - 195, The
test mocks create_session via mock_cs but enrich_asset returns ENRICHMENT_STUB
before any DB session is created, so the mock is unused; update the
test_missing_file_returns_stub_level to either remove the mock_cs setup or keep
it but add a clear assertion or comment to document intent (e.g., call
mock_cs.assert_not_called() after enrich_asset) so readers know the DB pathway
wasn't exercised—refer to mock_cs, enrich_asset, and the
test_missing_file_returns_stub_level function when making the change.
 import pytest
 import requests
-from conftest import get_asset_filename, trigger_sync_seed_assets
+from helpers import get_asset_filename, trigger_sync_seed_assets
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Locate all helpers.py files in the repository
echo "=== Locating helpers.py files ==="
fd -a "helpers.py"
echo -e "\n=== Searching for function definitions ==="
# Search for the functions being imported
rg "def (get_asset_filename|trigger_sync_seed_assets)" -A 2
echo -e "\n=== Directory structure around the test file ==="
# Check the directory structure
ls -la tests-unit/assets_test/ 2>/dev/null || echo "Directory not found at expected location"
echo -e "\n=== Contents of test_crud.py for import context ==="
# Show the imports and surrounding context
head -15 tests-unit/assets_test/test_crud.py

Repository: Comfy-Org/ComfyUI
Length of output: 2408
Clarify the relative import to improve maintainability.
The from helpers import ... on line 7 correctly resolves to the local tests-unit/assets_test/helpers.py since both imported functions are defined there and Python prioritizes the local module. However, using an explicit relative import (from . import helpers or from .helpers import get_asset_filename, trigger_sync_seed_assets) would make the intent clearer and is more robust to future refactoring.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-unit/assets_test/test_crud.py` around lines 5 - 8, Replace the
top-level import of the local helpers module with an explicit relative import to
make the intent clear and robust: in test_crud.py change the line importing
helpers to use a relative form (e.g., import the module as from . import helpers
or import the functions as from .helpers import get_asset_filename,
trigger_sync_seed_assets) so that the local helpers module and the functions
get_asset_filename and trigger_sync_seed_assets are unambiguously loaded.
from app.assets.services.file_utils import is_visible, list_files_recursively


class TestIsVisible:
    def test_visible_file(self):
        assert is_visible("file.txt") is True

    def test_hidden_file(self):
        assert is_visible(".hidden") is False

    def test_hidden_directory(self):
        assert is_visible(".git") is False

    def test_visible_directory(self):
        assert is_visible("src") is True

    def test_dotdot_is_hidden(self):
        assert is_visible("..") is False

    def test_dot_is_hidden(self):
        assert is_visible(".") is False


class TestListFilesRecursively:
    def test_skips_hidden_files(self, tmp_path):
        (tmp_path / "visible.txt").write_text("a")
        (tmp_path / ".hidden").write_text("b")

        result = list_files_recursively(str(tmp_path))

        assert len(result) == 1
        assert result[0].endswith("visible.txt")

    def test_skips_hidden_directories(self, tmp_path):
        hidden_dir = tmp_path / ".hidden_dir"
        hidden_dir.mkdir()
        (hidden_dir / "file.txt").write_text("a")

        visible_dir = tmp_path / "visible_dir"
        visible_dir.mkdir()
        (visible_dir / "file.txt").write_text("b")

        result = list_files_recursively(str(tmp_path))

        assert len(result) == 1
        assert "visible_dir" in result[0]
        assert ".hidden_dir" not in result[0]

    def test_empty_directory(self, tmp_path):
        result = list_files_recursively(str(tmp_path))
        assert result == []

    def test_nonexistent_directory(self, tmp_path):
        result = list_files_recursively(str(tmp_path / "nonexistent"))
        assert result == []
🧹 Nitpick | 🔵 Trivial
get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged have no test coverage.
These three functions drive the core change-detection logic (whether an asset is re-enriched or skipped). Key verify_file_unchanged edge cases that should be covered: mtime_db=None → False, size_db=None → skip size check, mtime mismatch, size mismatch, and the precision-sensitive path where nanosecond mtime equality is evaluated.
Do you want me to draft the missing test cases, or should I open a follow-up issue to track this?
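A sketch of those cases. The stand-in below only mirrors the contract described above; the signature `verify_file_unchanged(path, size_db, mtime_db)` and its semantics are assumptions, so align it with the real `app.assets.services.file_utils` helper before committing anything:

```python
import os
import tempfile


def verify_file_unchanged(path, size_db, mtime_db):
    # Hypothetical stand-in mirroring the documented contract:
    # unchanged means the stored nanosecond mtime matches exactly,
    # and the stored size matches whenever a size was recorded.
    if mtime_db is None:
        return False
    st = os.stat(path)
    if st.st_mtime_ns != mtime_db:
        return False
    if size_db is not None and st.st_size != size_db:
        return False
    return True


fd, path = tempfile.mkstemp()
os.write(fd, b"abc")
os.close(fd)
st = os.stat(path)

no_mtime = verify_file_unchanged(path, st.st_size, None)        # mtime_db=None
skip_size = verify_file_unchanged(path, None, st.st_mtime_ns)   # size_db=None
match = verify_file_unchanged(path, st.st_size, st.st_mtime_ns)
size_off = verify_file_unchanged(path, st.st_size + 1, st.st_mtime_ns)
ns_off = verify_file_unchanged(path, st.st_size, st.st_mtime_ns + 1)
os.remove(path)
print(no_mtime, skip_size, match, size_off, ns_off)  # → False True True False False
```

Each boolean above maps to one of the edge cases listed: missing stored mtime, skipped size check, full match, size mismatch, and a nanosecond-level mtime mismatch.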
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-unit/assets_test/test_file_utils.py` around lines 1 - 55, Add unit
tests for get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged to
cover the missing change-detection edge cases: assert get_mtime_ns and
get_size_and_mtime_ns return correct nanosecond mtime and size for real files
(use tmp_path to create files), add verify_file_unchanged tests for
mtime_db=None (should return False), size_db=None (should skip size check and
only compare mtime), explicit mtime mismatch and size mismatch cases (return
False), and a precision-sensitive test where mtime differs only at the
nanosecond level to verify the nanosecond equality path; reference the functions
get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged in your new tests
and use tmp_path and monkeypatch/os.stat stubs as needed to control mtime/size
values.
def get_query_dict(request: web.Request) -> dict[str, Any]:
    """Gets a dictionary of query parameters from the request.

    request.query is a MultiMapping[str], needs to be converted to a dict
    to be validated by Pydantic.
    """
    query_dict = {
        key: request.query.getall(key)
        if len(request.query.getall(key)) > 1
        else request.query.get(key)
        for key in request.query.keys()
    }
    return query_dict
🧹 Nitpick | 🔵 Trivial
get_query_dict may emit duplicate keys from the multidict.
request.query.keys() on aiohttp's MultiDict can yield the same key multiple times. Using it directly in a dict comprehension means earlier entries are overwritten by later ones, but the final value is always computed via getall/get, so the result itself is correct. However, iterating redundantly is wasteful. Consider using set(request.query.keys()) or just request.query iteration semantics.
Minor: deduplicate keys
     query_dict = {
         key: request.query.getall(key)
         if len(request.query.getall(key)) > 1
         else request.query.get(key)
-        for key in request.query.keys()
+        for key in set(request.query.keys())
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/api/routes.py` around lines 46 - 58, get_query_dict currently
iterates over request.query.keys() which can yield duplicate keys from aiohttp's
MultiDict, wasting work; change the dict comprehension to iterate over a
de-duplicated set of keys (e.g., use set(request.query.keys())) so each key is
processed once while keeping the existing getall/get logic in get_query_dict.
 try:
-    result = manager.remove_tags_from_asset(
-        asset_info_id=asset_info_id,
+    result = remove_tags(
+        reference_id=reference_id,
         tags=data.tags,
         owner_id=USER_MANAGER.get_request_user_id(request),
     )
+    payload = schemas_out.TagsRemove(
+        removed=result.removed,
+        not_present=result.not_present,
+        total_tags=result.total_tags,
+    )
 except ValueError as ve:
-    return _error_response(404, "ASSET_NOT_FOUND", str(ve), {"id": asset_info_id})
+    return _build_error_response(
+        404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
+    )
 except Exception:
     logging.exception(
-        "remove_tags_from_asset failed for asset_info_id=%s, owner_id=%s",
-        asset_info_id,
+        "remove_tags_from_asset failed for reference_id=%s, owner_id=%s",
+        reference_id,
         USER_MANAGER.get_request_user_id(request),
     )
-    return _error_response(500, "INTERNAL", "Unexpected server error.")
+    return _build_error_response(500, "INTERNAL", "Unexpected server error.")

-return web.json_response(result.model_dump(mode="json"), status=200)
+return web.json_response(payload.model_dump(mode="json"), status=200)
delete_asset_tags doesn't catch PermissionError, unlike sibling handlers.
add_asset_tags (Line 547) and update_asset_route (Line 436) both catch (ValueError, PermissionError), but delete_asset_tags only catches ValueError (Line 591). If remove_tags raises PermissionError for a non-owner, it falls through to the generic Exception handler and returns a 500 instead of a 404.
Proposed fix
-    except ValueError as ve:
+    except (ValueError, PermissionError) as ve:
         return _build_error_response(
             404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
         )

📝 Committable suggestion
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try:
    result = remove_tags(
        reference_id=reference_id,
        tags=data.tags,
        owner_id=USER_MANAGER.get_request_user_id(request),
    )
    payload = schemas_out.TagsRemove(
        removed=result.removed,
        not_present=result.not_present,
        total_tags=result.total_tags,
    )
except (ValueError, PermissionError) as ve:
    return _build_error_response(
        404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
    )
except Exception:
    logging.exception(
        "remove_tags_from_asset failed for reference_id=%s, owner_id=%s",
        reference_id,
        USER_MANAGER.get_request_user_id(request),
    )
    return _build_error_response(500, "INTERNAL", "Unexpected server error.")

return web.json_response(payload.model_dump(mode="json"), status=200)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/api/routes.py` around lines 580 - 603, delete_asset_tags currently
only catches ValueError from remove_tags so PermissionError bubbles to the
generic handler and returns 500; update the exception handling in the
delete_asset_tags route to catch PermissionError (same as add_asset_tags and
update_asset_route) and map it to a 404 using _build_error_response with the
same payload shape as the ValueError branch. Locate the try/except around the
remove_tags call in delete_asset_tags and either add PermissionError to the
existing except (e.g., except (ValueError, PermissionError) as e) or add a
separate except PermissionError block that returns _build_error_response(404,
"ASSET_NOT_FOUND", str(e), {"id": reference_id}); keep the logging and other
exception handling for the generic Exception branch unchanged.
wait_param = request.query.get("wait", "").lower()
should_wait = wait_param in ("true", "1", "yes")

# Temporarily enable seeder for explicit API calls (--disable-assets-autoscan
# only prevents the automatic startup scan, not manual triggers)
was_disabled = asset_seeder.is_disabled()
if was_disabled:
    asset_seeder.enable()

started = asset_seeder.start(roots=valid_roots)
if not started:
    if was_disabled:
        asset_seeder.disable()
    return web.json_response({"status": "already_running"}, status=409)

if should_wait:
    asset_seeder.wait()
    if was_disabled:
        asset_seeder.disable()
    status = asset_seeder.get_status()
    return web.json_response(
        {
            "status": "completed",
            "progress": {
                "scanned": status.progress.scanned if status.progress else 0,
                "total": status.progress.total if status.progress else 0,
                "created": status.progress.created if status.progress else 0,
                "skipped": status.progress.skipped if status.progress else 0,
            },
            "errors": status.errors,
        },
        status=200,
    )

return web.json_response({"status": "started"}, status=202)
Two issues: blocking wait() freezes the event loop, and the seeder is never re-disabled on the async (non-wait) path.

- Event-loop blocking (critical): asset_seeder.wait() (Line 644) is a synchronous blocking call (likely Thread.join()) invoked directly inside an async def handler. This freezes the entire aiohttp event loop until the scan completes. Wrap it in await asyncio.to_thread(asset_seeder.wait) or await loop.run_in_executor(None, asset_seeder.wait).
- Seeder stays enabled after async start (bug): When was_disabled=True and should_wait=False, the handler returns at Line 662 without calling asset_seeder.disable(). The seeder remains permanently enabled after a single manual trigger, contradicting the --disable-assets-autoscan flag.
Proposed fix
+    import asyncio
+
     if should_wait:
-        asset_seeder.wait()
+        await asyncio.to_thread(asset_seeder.wait)
         if was_disabled:
             asset_seeder.disable()
         status = asset_seeder.get_status()
         return web.json_response(
             {
                 "status": "completed",
                 "progress": {
                     "scanned": status.progress.scanned if status.progress else 0,
                     "total": status.progress.total if status.progress else 0,
                     "created": status.progress.created if status.progress else 0,
                     "skipped": status.progress.skipped if status.progress else 0,
                 },
                 "errors": status.errors,
             },
             status=200,
         )
+    # For async path: schedule re-disable after scan completes
+    if was_disabled:
+        def _restore_disabled():
+            asset_seeder.wait()
+            asset_seeder.disable()
+        threading.Thread(target=_restore_disabled, daemon=True).start()
+
     return web.json_response({"status": "started"}, status=202)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/api/routes.py` around lines 628 - 662, The handler currently calls
the blocking asset_seeder.wait() directly and also fails to re-disable the
seeder on the non-wait path; change the wait call to run off the event loop
(e.g., await asyncio.to_thread(asset_seeder.wait) or await
loop.run_in_executor(None, asset_seeder.wait)) and ensure that whenever
was_disabled was True you call asset_seeder.disable() before every return path
(both after the awaited wait branch and the non-wait branch where you currently
return {"status":"started"}), keeping the existing logic around
asset_seeder.enable(), asset_seeder.start(roots=valid_roots) and
asset_seeder.get_status() intact.
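As a standalone illustration of why the to_thread wrapper in the proposed fix matters: a coroutine scheduled on the loop keeps running while a thread-style wait blocks off-loop. The names below are placeholders, not the seeder's real API:

```python
import asyncio
import threading


def blocking_wait(done: threading.Event) -> None:
    # Stand-in for asset_seeder.wait(): blocks its calling thread.
    done.wait()


async def main() -> list[str]:
    events: list[str] = []
    done = threading.Event()
    threading.Timer(0.2, done.set).start()  # scan "finishes" after 200 ms

    async def heartbeat() -> None:
        # Would starve if blocking_wait ran directly on the loop thread.
        await asyncio.sleep(0.02)
        events.append("loop-alive")

    hb = asyncio.create_task(heartbeat())
    await asyncio.to_thread(blocking_wait, done)  # off-load the block
    await hb
    events.append("wait-finished")
    return events


events = asyncio.run(main())
print(events)  # → ['loop-alive', 'wait-finished']
```

Had `blocking_wait` been called without `asyncio.to_thread`, the heartbeat task could not run until the wait returned, which is exactly the freeze described above.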
@ROUTES.post("/api/assets/prune")
async def mark_missing_assets(request: web.Request) -> web.Response:
    """Mark assets as missing when outside all known root prefixes.

    This is a non-destructive soft-delete operation. Assets and metadata
    are preserved, but references are flagged as missing. They can be
    restored if the file reappears in a future scan.

    Returns:
        200 OK with count of marked assets
        409 Conflict if a scan is currently running
    """
    marked = asset_seeder.mark_missing_outside_prefixes()
    if marked == 0 and asset_seeder.get_status().state.value != "IDLE":
        return web.json_response(
            {"status": "scan_running", "marked": 0},
            status=409,
        )
Check-after-act: "scan running" guard fires only after the prune has already executed.
mark_missing_outside_prefixes() runs first (Line 707), then the handler checks whether a scan is running (Line 708). If a scan is in progress, the prune may race with the seeder. The guard should be checked before calling mark_missing_outside_prefixes().
Additionally, marked == 0 with an active scan is ambiguous — it could also mean there was genuinely nothing to mark. The condition conflates two different states.
Proposed fix: check state before pruning
 async def mark_missing_assets(request: web.Request) -> web.Response:
+    status = asset_seeder.get_status()
+    if status.state.value != "IDLE":
+        return web.json_response(
+            {"status": "scan_running", "marked": 0},
+            status=409,
+        )
     marked = asset_seeder.mark_missing_outside_prefixes()
-    if marked == 0 and asset_seeder.get_status().state.value != "IDLE":
-        return web.json_response(
-            {"status": "scan_running", "marked": 0},
-            status=409,
-        )
     return web.json_response({"status": "completed", "marked": marked}, status=200)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/api/routes.py` around lines 695 - 712, The handler
mark_missing_assets currently calls asset_seeder.mark_missing_outside_prefixes()
before checking asset_seeder.get_status().state.value, allowing a race; modify
mark_missing_assets to first inspect asset_seeder.get_status().state.value (or a
helper like is_running) and if the state indicates a running scan return the 409
{"status":"scan_running","marked":0} immediately, only calling
asset_seeder.mark_missing_outside_prefixes() when the seeder is idle; also
remove the combined `marked == 0` check — treat the running-scan check
independently so a zero-mark result is not conflated with an active scan.
def list_references_page(
    session: Session,
    owner_id: str = "",
    limit: int = 100,
    offset: int = 0,
    name_contains: str | None = None,
    include_tags: Sequence[str] | None = None,
    exclude_tags: Sequence[str] | None = None,
    metadata_filter: dict | None = None,
    sort: str | None = None,
    order: str | None = None,
) -> tuple[list[AssetReference], dict[str, list[str]], int]:
    """List references with pagination, filtering, and sorting.

    Returns (references, tag_map, total_count).
    """
    base = (
        select(AssetReference)
        .join(Asset, Asset.id == AssetReference.asset_id)
        .where(build_visible_owner_clause(owner_id))
        .options(noload(AssetReference.tags))
    )

    if name_contains:
        escaped, esc = escape_sql_like_string(name_contains)
        base = base.where(AssetReference.name.ilike(f"%{escaped}%", escape=esc))

    base = _apply_tag_filters(base, include_tags, exclude_tags)
    base = _apply_metadata_filter(base, metadata_filter)

    sort = (sort or "created_at").lower()
    order = (order or "desc").lower()
    sort_map = {
        "name": AssetReference.name,
        "created_at": AssetReference.created_at,
        "updated_at": AssetReference.updated_at,
        "last_access_time": AssetReference.last_access_time,
        "size": Asset.size_bytes,
    }
    sort_col = sort_map.get(sort, AssetReference.created_at)
    sort_exp = sort_col.desc() if order == "desc" else sort_col.asc()

    base = base.order_by(sort_exp).limit(limit).offset(offset)

    count_stmt = (
        select(sa.func.count())
        .select_from(AssetReference)
        .join(Asset, Asset.id == AssetReference.asset_id)
        .where(build_visible_owner_clause(owner_id))
    )
    if name_contains:
        escaped, esc = escape_sql_like_string(name_contains)
        count_stmt = count_stmt.where(
            AssetReference.name.ilike(f"%{escaped}%", escape=esc)
        )
    count_stmt = _apply_tag_filters(count_stmt, include_tags, exclude_tags)
    count_stmt = _apply_metadata_filter(count_stmt, metadata_filter)

    total = int(session.execute(count_stmt).scalar_one() or 0)
    refs = session.execute(base).unique().scalars().all()

    id_list: list[str] = [r.id for r in refs]
    tag_map: dict[str, list[str]] = defaultdict(list)
    if id_list:
        rows = session.execute(
            select(AssetReferenceTag.asset_reference_id, Tag.name)
            .join(Tag, Tag.name == AssetReferenceTag.tag_name)
            .where(AssetReferenceTag.asset_reference_id.in_(id_list))
            .order_by(AssetReferenceTag.added_at)
        )
        for ref_id, tag_name in rows.all():
            tag_map[ref_id].append(tag_name)

    return list(refs), tag_map, total
🧹 Nitpick | 🔵 Trivial
Count query duplicates all filter logic from the data query — divergence risk.
The count_stmt (Lines 334–346) manually re-applies name_contains, _apply_tag_filters, and _apply_metadata_filter independently from the base query. If a new filter is added to base but forgotten in count_stmt, the count will silently diverge from the actual results.
Consider building a shared "filtered base" CTE or subquery, then deriving both the paginated result and the count from it.
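The shared-base idea can be demonstrated end to end with the stdlib `sqlite3` module (throwaway table and data, purely illustrative, not the PR's models): apply the filter exactly once, then derive both the count and the page from that single subquery so they cannot diverge.

```python
# Minimal standalone demo of the "shared filtered base" pattern.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [("ant",), ("bat",), ("cat",), ("car",)],
)

# Single source of truth for all filters.
filtered = "SELECT id FROM items WHERE name LIKE 'ca%'"

# Both the total and the page are derived from the same filter text,
# so adding a new predicate automatically affects both.
total = conn.execute(f"SELECT COUNT(*) FROM ({filtered})").fetchone()[0]
page = conn.execute(
    f"SELECT name FROM items WHERE id IN ({filtered}) ORDER BY name LIMIT 10"
).fetchall()

assert total == 2
assert page == [("car",), ("cat",)]
```

The same shape carries over to SQLAlchemy by building one filtered `select()`, calling `.subquery()` on it, and feeding that subquery to both the `count()` statement and the paginated statement.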
♻️ Sketch: derive count from the same filtered base

```diff
+    # Build shared filter base
+    filtered = (
+        select(AssetReference.id)
+        .join(Asset, Asset.id == AssetReference.asset_id)
+        .where(build_visible_owner_clause(owner_id))
+    )
+    if name_contains:
+        escaped, esc = escape_sql_like_string(name_contains)
+        filtered = filtered.where(AssetReference.name.ilike(f"%{escaped}%", escape=esc))
+    filtered = _apply_tag_filters(filtered, include_tags, exclude_tags)
+    filtered = _apply_metadata_filter(filtered, metadata_filter)
+    filtered_sub = filtered.subquery()
+
+    # Count
+    total = int(
+        session.execute(select(sa.func.count()).select_from(filtered_sub)).scalar_one() or 0
+    )
+
+    # Paginated data query
+    ...
+    base = (
+        select(AssetReference)
+        .join(Asset, Asset.id == AssetReference.asset_id)
+        .where(AssetReference.id.in_(select(filtered_sub.c.id)))
+        .options(noload(AssetReference.tags))
+        .order_by(sort_exp)
+        .limit(limit)
+        .offset(offset)
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/database/queries/asset_reference.py` around lines 290 - 363, The
count query duplicates the filtering logic in list_references_page (see
variables base and count_stmt) which risks divergence; refactor by constructing
a single filtered selectable (e.g., filtered_base or filtered_subq) that applies
build_visible_owner_clause, name_contains escaping, _apply_tag_filters and
_apply_metadata_filter once, then derive the paginated query from
filtered_base.order_by(...).limit(...).offset(...) and derive the total with
select(sa.func.count()).select_from(filtered_base.subquery()/alias()) so both
queries share the exact same filters; keep sorting/pagination only on the
paginated query and reuse the same filtered_base when loading tags (id_list).
```python
def get_references_by_paths_and_asset_ids(
    session: Session,
    path_to_asset: dict[str, str],
) -> set[str]:
    """Query references to find paths where our asset_id won the insert.

    Args:
        path_to_asset: Mapping of file_path -> asset_id we tried to insert

    Returns:
        Set of file_paths where our asset_id is present
    """
    if not path_to_asset:
        return set()

    paths = list(path_to_asset.keys())
    winners: set[str] = set()

    for chunk in iter_chunks(paths, MAX_BIND_PARAMS):
        result = session.execute(
            select(AssetReference.file_path).where(
                AssetReference.file_path.in_(chunk),
                AssetReference.asset_id.in_([path_to_asset[p] for p in chunk]),
            )
        )
        winners.update(result.scalars().all())

    return winners
```
IN / IN cross-product doesn't correctly correlate path↔asset_id pairs.
The query uses two independent IN clauses (Lines 966–968):
file_path IN ('/a', '/b') AND asset_id IN ('1', '2')
This matches any combination, e.g. (path='/a', asset_id='2') even though path_to_asset maps /a → 1. A reference that belongs to a different asset could be incorrectly reported as "winning" the insert.
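The over-matching is easy to reproduce with the stdlib `sqlite3` module (table and column names here are illustrative, not the PR's `AssetReference` model): a row whose pair was never requested still satisfies the two independent `IN` clauses, while the correlated form rejects it.

```python
# Demo: two independent IN clauses form a cross-product of the pairs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refs (file_path TEXT PRIMARY KEY, asset_id TEXT)")
# "/a" actually points at asset "2", not the "1" we tried to insert.
conn.execute("INSERT INTO refs VALUES ('/a', '2')")

# We wanted the pairs ('/a', '1') and ('/b', '2').
# Cross-product form: ('/a', '2') matches, a false positive.
loose = conn.execute(
    "SELECT file_path FROM refs "
    "WHERE file_path IN (?, ?) AND asset_id IN (?, ?)",
    ("/a", "/b", "1", "2"),
).fetchall()

# Correlated form: one AND pair per entry, OR'd together.
strict = conn.execute(
    "SELECT file_path FROM refs "
    "WHERE (file_path = ? AND asset_id = ?) OR (file_path = ? AND asset_id = ?)",
    ("/a", "1", "/b", "2"),
).fetchall()

assert loose == [("/a",)]  # false positive from the cross-product
assert strict == []        # correlated pairs reject it
```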
Proposed fix: use correlated OR conditions per pair

```diff
 def get_references_by_paths_and_asset_ids(
     session: Session,
     path_to_asset: dict[str, str],
 ) -> set[str]:
     if not path_to_asset:
         return set()
     paths = list(path_to_asset.keys())
     winners: set[str] = set()
     for chunk in iter_chunks(paths, MAX_BIND_PARAMS // 2):
-        result = session.execute(
-            select(AssetReference.file_path).where(
-                AssetReference.file_path.in_(chunk),
-                AssetReference.asset_id.in_([path_to_asset[p] for p in chunk]),
-            )
-        )
+        pair_conds = [
+            sa.and_(
+                AssetReference.file_path == p,
+                AssetReference.asset_id == path_to_asset[p],
+            )
+            for p in chunk
+        ]
+        result = session.execute(
+            select(AssetReference.file_path).where(sa.or_(*pair_conds))
+        )
         winners.update(result.scalars().all())
     return winners
```

Note: adjust chunk size for the doubled bind params per pair.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/database/queries/asset_reference.py` around lines 945 - 972, The
query in get_references_by_paths_and_asset_ids uses two independent IN clauses
and causes cross-product matches; replace the IN/IN with a disjunction of
correlated conditions so each pair is matched exactly (i.e. build an OR of
(AssetReference.file_path == path AND AssetReference.asset_id == asset_id) for
each entry in the chunk), adjust chunking to account for roughly two bind params
per pair (e.g. use MAX_BIND_PARAMS // 2 or similar), execute the select with the
combined or_ condition targeting AssetReference.file_path and
AssetReference.asset_id, and continue to collect winners from
result.scalars().all(); use iter_chunks and MAX_BIND_PARAMS symbols to locate
where to change logic.
Split the ~1000 line app/assets/database/queries.py into focused modules:
- queries/asset.py - Asset entity queries (asset_exists_by_hash, get_asset_by_hash)
- queries/asset_info.py - AssetInfo queries (~15 functions)
- queries/cache_state.py - AssetCacheState queries (list_cache_states_by_asset_id, pick_best_live_path, prune_orphaned_assets, fast_db_consistency_pass)
- queries/tags.py - Tag queries (8 functions including ensure_tags_exist, add/remove tag functions, list_tags_with_usage)
- queries/__init__.py - Re-exports all public functions for backward compatibility

Also adds comprehensive unit tests using in-memory SQLite:
- tests-unit/assets_test/queries/conftest.py - Session fixture
- tests-unit/assets_test/queries/test_asset.py - 5 tests
- tests-unit/assets_test/queries/test_asset_info.py - 23 tests
- tests-unit/assets_test/queries/test_cache_state.py - 8 tests
- tests-unit/assets_test/queries/test_metadata.py - 12 tests for _apply_metadata_filter
- tests-unit/assets_test/queries/test_tags.py - 23 tests

All 71 unit tests pass. Existing integration tests unaffected.

Amp-Thread-ID: https://ampcode.com/threads/T-019c24bb-475b-7442-9ff9-8288edea3345 Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c24bb-475b-7442-9ff9-8288edea3345 Co-authored-by: Amp <amp@ampcode.com>
Architecture changes:
- API Routes -> manager.py (thin adapter) -> services/ (business logic) -> queries/ (atomic DB ops)
- Services own session lifecycle via create_session()
- Queries accept Session as parameter, do single-table atomic operations

New app/assets/services/ layer:
- __init__.py - exports all service functions
- ingest.py - ingest_file_from_path(), register_existing_asset()
- asset_management.py - get_asset_detail(), update_asset_metadata(), delete_asset_reference(), set_asset_preview()
- tagging.py - apply_tags(), remove_tags(), list_tags()

Removed from queries/asset_info.py:
- ingest_fs_asset (moved to services/ingest.py as ingest_file_from_path)
- update_asset_info_full (moved to services/asset_management.py as update_asset_metadata)
- create_asset_info_for_existing_asset (moved to services/ingest.py as register_existing_asset)

Updated manager.py:
- Now a thin adapter that transforms API schemas to/from service calls
- Delegates all business logic to services layer
- No longer imports sqlalchemy.orm.Session or models directly

Test updates:
- Fixed test_cache_state.py import of pick_best_live_path (moved to helpers.py)
- Added comprehensive service layer tests (41 new tests)
- All 112 query + service tests pass

Amp-Thread-ID: https://ampcode.com/threads/T-019c24e2-7ae4-707f-ad19-c775ed8b82b5 Co-authored-by: Amp <amp@ampcode.com>
…ly argument pattern - Move imports from inside functions to module top-level in: - app/assets/database/queries/asset.py - app/assets/database/queries/asset_info.py - app/assets/database/queries/cache_state.py - app/assets/manager.py - app/assets/services/asset_management.py - app/assets/services/ingest.py - Remove keyword-only argument markers (*,) from app/assets/ to match codebase conventions Amp-Thread-ID: https://ampcode.com/threads/T-019c24eb-bfa2-727f-8212-8bc976048604 Co-authored-by: Amp <amp@ampcode.com>
- Move app/assets/scanner.py to app/assets/services/scanner.py - Extract pure queries from fast_db_consistency_pass: - get_cache_states_for_prefixes() - bulk_set_needs_verify() - delete_cache_states_by_ids() - delete_orphaned_seed_asset() - Split prune_orphaned_assets into pure queries: - delete_cache_states_outside_prefixes() - get_orphaned_seed_asset_ids() - delete_assets_by_ids() - Add reconcile_cache_states_for_root() service function - Add prune_orphaned_assets() service function - Remove function injection pattern - Update imports in main.py, server.py, routes.py Amp-Thread-ID: https://ampcode.com/threads/T-019c24f1-3385-701b-87e0-8b6bc87e841b Co-authored-by: Amp <amp@ampcode.com>
- Delete bulk_ops.py, moving logic to appropriate layers - Add bulk insert query functions: - queries/asset.bulk_insert_assets - queries/cache_state.bulk_insert_cache_states_ignore_conflicts - queries/cache_state.get_cache_states_by_paths_and_asset_ids - queries/asset_info.bulk_insert_asset_infos_ignore_conflicts - queries/asset_info.get_asset_info_ids_by_ids - queries/tags.bulk_insert_tags_and_meta - Move seed_from_paths_batch orchestration to scanner._seed_from_paths_batch Amp-Thread-ID: https://ampcode.com/threads/T-019c24fd-157d-776a-ad24-4f19cf5d3afe Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c2510-33fa-7199-ae4b-bc31102277a7 Co-authored-by: Amp <amp@ampcode.com>
- Move scanner-only functions to scanner.py - Move query-only functions (is_scalar, project_kv) to asset_info.py - Move get_query_dict to routes.py - Create path_utils.py service for path-related functions - Reduce helpers.py to shared utilities only Amp-Thread-ID: https://ampcode.com/threads/T-019c2510-33fa-7199-ae4b-bc31102277a7 Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c2510-33fa-7199-ae4b-bc31102277a7 Co-authored-by: Amp <amp@ampcode.com>
- Add app/assets/api/upload.py with parse_multipart_upload() for HTTP parsing - Add ParsedUpload dataclass to schemas_in.py - Add domain exceptions (AssetValidationError, AssetNotFoundError, HashMismatchError) - Add manager.process_upload() with domain exceptions (no HTTP status codes) - Routes map domain exceptions to HTTP responses - Slim down upload_asset route to ~20 lines (was ~150) Amp-Thread-ID: https://ampcode.com/threads/T-019c2519-abe1-738a-ad2e-29ece17c0e42 Co-authored-by: Amp <amp@ampcode.com>
…t queries - Fix unused imports in routes.py, asset.py, manager.py, asset_management.py, ingest.py - Fix whitespace issues in upload.py, asset_info.py, ingest.py - Fix typo in manager.py (stray character after result["asset"]) - Fix broken import in test_metadata.py (project_kv moved to asset_info.py) - Add fixture override in queries/conftest.py for unit test isolation Add 48 new tests covering all previously untested query functions: - asset.py: upsert_asset, bulk_insert_assets - cache_state.py: upsert_cache_state, delete_cache_states_outside_prefixes, get_orphaned_seed_asset_ids, delete_assets_by_ids, get_cache_states_for_prefixes, bulk_set_needs_verify, delete_cache_states_by_ids, delete_orphaned_seed_asset, bulk_insert_cache_states_ignore_conflicts, get_cache_states_by_paths_and_asset_ids - asset_info.py: insert_asset_info, get_or_create_asset_info, update_asset_info_timestamps, replace_asset_info_metadata_projection, bulk_insert_asset_infos_ignore_conflicts, get_asset_info_ids_by_ids - tags.py: bulk_insert_tags_and_meta Total: 119 tests pass (up from 71) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused os imports in conftest.py and test_ingest.py - Remove unused Tag import in test_asset_management.py - Remove unused ensure_tags_exist import in test_ingest.py - Fix unused info2 variable in test_asset_management.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename functions across app/assets/ to follow verb-based naming:
- is_scalar → check_is_scalar
- project_kv → expand_metadata_to_rows
- _visible_owner_clause → _build_visible_owner_clause
- _chunk_rows → _iter_row_chunks
- _at_least_one → _validate_at_least_one_field
- _tags_norm → _normalize_tags_field
- _ser_dt → _serialize_datetime
- _ser_updated → _serialize_updated_at
- _error_response → _build_error_response
- _validation_error_response → _build_validation_error_response
- file_sender → stream_file_chunks
- seed_assets_endpoint → seed_assets
- utcnow → get_utc_now
- _safe_sort_field → _validate_sort_field
- _safe_filename → _sanitize_filename
- fast_asset_file_check → check_asset_file_fast
- prefixes_for_root → get_prefixes_for_root
- blake3_hash → compute_blake3_hash
- blake3_hash_async → compute_blake3_hash_async
- _is_within → _check_is_within
- _rel → _compute_relative

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename functions to use clearer verb-based names:
- pick_best_live_path → select_best_live_path
- escape_like_prefix → escape_sql_like_string
- list_tree → list_files_recursively
- check_asset_file_fast → verify_asset_file_unchanged
- _seed_from_paths_batch → _batch_insert_assets_from_paths
- reconcile_cache_states_for_root → sync_cache_states_with_filesystem
- touch_asset_info_by_id → update_asset_info_access_time
- replace_asset_info_metadata_projection → set_asset_info_metadata
- expand_metadata_to_rows → convert_metadata_to_rows
- _rows_per_stmt → _calculate_rows_per_statement
- ensure_within_base → validate_path_within_base
- _cleanup_temp → _delete_temp_file_if_exists
- validate_hash_format → normalize_and_validate_hash
- get_relative_to_root_category_path_of_asset → get_asset_category_and_relative_path

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract helper functions to eliminate nested try-except blocks in scanner.py and remove duplicated type-checking logic in asset_info.py. Simplify nested conditionals in asset_management.py for clearer control flow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…pers Extract focused helper functions to eliminate the try-finally block that wrapped ~50 lines just for logging. The new helpers (_collect_paths_for_roots, _build_asset_specs, _insert_asset_specs) make seed_assets a simple linear flow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Make blake3 an optional import that fails gracefully at import time, with a clear error message when hashing functions are actually called. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract helper functions from conftest.py to a dedicated helpers.py module to fix import resolution issues when pytest processes subdirectories. Rename test_tags.py to test_tags_api.py to avoid module name collision with queries/test_tags.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…layer Replace dict/ORM object returns with explicit dataclasses to fix DetachedInstanceError when accessing ORM attributes after session closes. - Add app/assets/services/schemas.py with AssetData, AssetInfoData, AssetDetailResult, and RegisterAssetResult dataclasses - Update asset_management.py and ingest.py to return dataclasses - Update manager.py to use attribute access on dataclasses - Fix created_new to be False in create_asset_from_hash (content exists) - Add DependencyMissingError for better blake3 missing error handling - Update tests to use attribute access instead of dict subscripting Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…service layer Add update_asset_info_name and update_asset_info_updated_at query functions and update asset_management.py to use them instead of modifying ORM objects directly. This ensures the service layer only uses explicit operations from the queries package. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c3167-e8f1-7409-904b-5fc0edaeef37 Co-authored-by: Amp <amp@ampcode.com>
- Rename shorthand variables to explicit names (sp -> spec, aid -> asset_id, etc.) - Move imports to top of file - Add TypedDict definitions for AssetRow, CacheStateRow, AssetInfoRow, TagRow, MetadataRow - Replace bare dict types with typed alternatives Amp-Thread-ID: https://ampcode.com/threads/T-019c316d-13f7-77f8-b92b-ea7276c3e09c Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c316d-13f7-77f8-b92b-ea7276c3e09c Co-authored-by: Amp <amp@ampcode.com>
- Add bulk_update_is_missing() to efficiently update is_missing flag - Update sync_cache_states_with_filesystem() to mark non-existent files as is_missing=True - Call restore_cache_states_by_paths() in batch_insert_seed_assets() to restore previously-missing states when files reappear during scanning Amp-Thread-ID: https://ampcode.com/threads/T-019c3177-e591-7666-ac6b-7e05c71c8ebf Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c3542-377d-76ea-b10a-e551d317d92f Co-authored-by: Amp <amp@ampcode.com>
- Remove duplicate from routes.py - Import from upload.py instead - Rename to public API (remove leading underscore) Amp-Thread-ID: https://ampcode.com/threads/T-019c3549-c245-7628-950c-dd6826185394 Co-authored-by: Amp <amp@ampcode.com>
- Fix race in mark_missing_outside_prefixes: set state to RUNNING inside lock before operations, restore to IDLE in finally block to prevent concurrent start() calls - Fix timing consistency: capture perf_counter before _update_progress for consistent event timing Amp-Thread-ID: https://ampcode.com/threads/T-019c354b-e7d7-7309-aa0e-79e5e7dff2b7 Co-authored-by: Amp <amp@ampcode.com>
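The race fix above can be sketched minimally (class and method names are illustrative, not the PR's code): claim the RUNNING state while still holding the lock, so a concurrent caller cannot slip in between the check and the transition, and restore IDLE in a `finally` block so a failure cannot wedge the state machine.

```python
import threading

IDLE, RUNNING = "idle", "running"


class Scanner:
    """Sketch of check-and-claim under a lock, with guaranteed restore."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = IDLE

    def mark_missing_outside_prefixes(self, work) -> bool:
        with self._lock:
            if self._state != IDLE:
                return False  # another operation already claimed the state
            self._state = RUNNING  # claimed inside the lock: no TOCTOU window
        try:
            work()
            return True
        finally:
            with self._lock:
                self._state = IDLE  # restored even if work() raised
```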
Amp-Thread-ID: https://ampcode.com/threads/T-019c354d-d627-7233-864d-1e6f7a4b8caa Co-authored-by: Amp <amp@ampcode.com>
Since size_bytes is declared as non-nullable (nullable=False, default=0) in the Asset model, simplify the conditional checks: - Use 'if item.asset else None' when the asset relationship might be None - Access size_bytes directly when asset is guaranteed to exist (create endpoints) Amp-Thread-ID: https://ampcode.com/threads/T-019c354e-cbfb-77d8-acdd-0d066c16006e Co-authored-by: Amp <amp@ampcode.com>
- Add MAX_SAFETENSORS_HEADER_SIZE constant in metadata_extract.py - Fix double 'compute' typo: compute_compute_blake3_hash_async → compute_blake3_hash_async Amp-Thread-ID: https://ampcode.com/threads/T-019c3550-4dbc-7301-a5e8-e6e23aa2d7b1 Co-authored-by: Amp <amp@ampcode.com>
- Make blake3 import lazy in hashing.py (only imported when needed)
- Add compute_hashes parameter to AssetSeeder.start(), build_asset_specs(), and seed_assets()
- Fix missing tag clearing: include is_missing states in sync when update_missing_tags=True
- Clear is_missing flag on cache states when files are restored with matching mtime/size
- Fix validation error serialization in routes.py (use json.loads(ve.json()))

Amp-Thread-ID: https://ampcode.com/threads/T-019c3614-56d4-74a8-a717-19922d6dbbee Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c3626-c6ad-7139-a570-62da4e656a1a Co-authored-by: Amp <amp@ampcode.com>
Co-authored-by: Amp <amp@ampcode.com> Amp-Thread-ID: https://ampcode.com/threads/T-019c3626-c6ad-7139-a570-62da4e656a1a
- Add custom MIME type registrations for model files (.safetensors, .pt, .ckpt, .gguf, .yaml) - Pass mime_type through SeedAssetSpec to bulk_ingest - Re-register types before use since server.py mimetypes.init() resets them - Add tests for bulk ingest mime_type handling Amp-Thread-ID: https://ampcode.com/threads/T-019c3626-c6ad-7139-a570-62da4e656a1a Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c3a75-046e-758d-ac96-08d45281a0c8 Co-authored-by: Amp <amp@ampcode.com>
- Remove debug print statements - Remove trailing whitespace on blank lines - Remove unused pytest import Amp-Thread-ID: https://ampcode.com/threads/T-019c3a8d-3b4f-75b4-8513-1c77914782f7 Co-authored-by: Amp <amp@ampcode.com>
Phase 1 (FAST): Creates stub records with filesystem metadata only
- path, size, mtime - no file content reading
- Populates asset database quickly on startup

Phase 2 (ENRICH): Extracts metadata and computes hashes
- Safetensors header parsing, MIME types
- Optional blake3 hash computation
- Updates existing stub records

Changes:
- Add ScanPhase enum (FAST, ENRICH, FULL)
- Add enrichment_level column to AssetCacheState (0=stub, 1=metadata, 2=hashed)
- Add build_stub_specs() for fast scanning without metadata extraction
- Add get_unenriched_cache_states(), enrich_asset(), enrich_assets_batch()
- Add start_fast(), start_enrich() convenience methods to AssetSeeder
- Update start() to accept phase parameter (defaults to FULL)
- Split _run_scan() into _run_fast_phase() and _run_enrich_phase()
- Add migration 0003_add_enrichment_level.py
- Update tests for new architecture

Amp-Thread-ID: https://ampcode.com/threads/T-019c4eef-1568-778f-aede-38254728f848 Co-authored-by: Amp <amp@ampcode.com>
- Add PAUSED state to state machine
- Add pause() method - blocks scan at next checkpoint
- Add resume() method - unblocks paused scan
- Add stop() method - alias for cancel()
- Add restart() method - cancel + wait + start with same/overridden params
- Add _check_pause_and_cancel() helper for checkpoint locations
- Emit assets.seed.paused and assets.seed.resumed WebSocket events
- Update get_object_info to use async seeder instead of blocking seed_assets
- Scan all roots (models, input, output) on object_info, not just models

Amp-Thread-ID: https://ampcode.com/threads/T-019c4f2b-5801-711c-8d47-bd1525808d77 Co-authored-by: Amp <amp@ampcode.com>
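Checkpoint-based pause/cancel of the kind described above is usually built on `threading.Event` (this is a generic sketch with illustrative names; the PR's seeder also emits WebSocket events, which is not modeled here):

```python
import threading


class PausableWorker:
    """Worker that pauses/cancels only at checkpoints between work units."""

    def __init__(self) -> None:
        self._resume = threading.Event()
        self._resume.set()  # not paused initially
        self._cancelled = threading.Event()

    def pause(self) -> None:
        self._resume.clear()

    def resume(self) -> None:
        self._resume.set()

    def cancel(self) -> None:
        self._cancelled.set()
        self._resume.set()  # unblock a paused worker so it can exit

    def checkpoint(self) -> bool:
        """Call between units of work; blocks while paused, False if cancelled."""
        self._resume.wait()
        return not self._cancelled.is_set()

    def run(self, items):
        done = []
        for item in items:
            if not self.checkpoint():
                break
            done.append(item)
        return done
```

Because the worker only yields at checkpoints, a pause never interrupts a unit of work mid-flight, which keeps per-item invariants (e.g. a database batch) intact.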
- Add log when scanner start is requested and when skipped due to already running - Remove noisy 'no mime_type' info log (expected during fast stub phase) Amp-Thread-ID: https://ampcode.com/threads/T-019c4f56-3fe1-72cb-888a-3ac4ac99b3d7 Co-authored-by: Amp <amp@ampcode.com>
Log pause, resume, cancel, and restart events Amp-Thread-ID: https://ampcode.com/threads/T-019c4f56-3fe1-72cb-888a-3ac4ac99b3d7 Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019c4f63-c049-733e-ab94-1b3f3db070e7 Co-authored-by: Amp <amp@ampcode.com>
…-autoscan flag - Added disable(), enable(), and is_disabled() methods to AssetSeeder - start() now checks is_disabled() and returns early if disabled - Updated main.py to call asset_seeder.disable() when CLI flag is set - Fixes bypass where /object_info would trigger scans regardless of flag Amp-Thread-ID: https://ampcode.com/threads/T-019c4f66-6773-72d2-bdfe-b55f5aa76021 Co-authored-by: Amp <amp@ampcode.com>
…ence This change solves the basename collision bug by using UNIQUE(file_path) on the unified asset_references table.

Key changes:

Database:
- Migration 0005 merges asset_cache_states and asset_infos into asset_references
- AssetReference now contains: cache state fields (file_path, mtime_ns, needs_verify, is_missing, enrichment_level) plus info fields (name, owner_id, preview_id, etc.)
- AssetReferenceMeta replaces AssetInfoMeta
- AssetReferenceTag replaces AssetInfoTag
- UNIQUE constraint on file_path prevents duplicate entries for same file

Code:
- New unified query module: asset_reference.py (replaces asset_info.py, cache_state.py)
- Updated scanner, seeder, and services to use AssetReference
- Updated API routes to use reference_id instead of asset_info_id

Tests:
- All 175 unit tests updated and passing
- Integration tests require server environment (not run here)

Amp-Thread-ID: https://ampcode.com/threads/T-019c4fe8-9dcb-75ce-bea8-ea786343a581 Co-authored-by: Amp <amp@ampcode.com>
Consolidate all branch migrations into 0002_merge_to_asset_references. Drops old tables (asset_info_meta, asset_info_tags, asset_cache_state, assets_info), truncates assets, and creates the new unified schema (asset_references, asset_reference_tags, asset_reference_meta). Amp-Thread-ID: https://ampcode.com/threads/T-019c5005-66e9-719f-94df-ebb4b25bcf14 Co-authored-by: Amp <amp@ampcode.com>
…autoscan disabled Move inline/lazy imports to top-level now that blake3 is a main dependency and circular-import concerns are resolved. Remove decorative section-header comments in asset_reference queries. Allow the seed API endpoint to temporarily re-enable the seeder when --disable-assets-autoscan is active so manual triggers still work. Add retry-on-409 logic in test helpers and remove try/finally around prompt execution in favour of sequential pause/resume. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix _hash_file_obj not restoring file position when orig_pos == 0
- Fix validate_path_within_base swallowing its own ValueError
- Guard against tags=None crash in upload_from_temp_path
- Filter is_missing references from list_references_page results
- Include Decimal in metadata filter isinstance check
- Return 403 instead of 404 for PermissionError in update/add_tags routes
- Re-disable seeder after non-blocking start with --disable-assets-autoscan
- Remove duplicate HashMismatchError/DependencyMissingError from schemas_in
- Move _emit_event outside lock in seeder resume() to prevent deadlock
- Fix total_q ignoring owner_id in list_tags_with_usage
- Capture ORM attributes before commit in resolve_asset_for_download
- Remove dead code (if mime_type is None: pass)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
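The first fix in this list is a classic falsy-zero bug: `if orig_pos: f.seek(orig_pos)` skips the restore exactly when the original position was 0. A corrected sketch (hypothetical name and sha256 standing in for blake3):

```python
import hashlib


def hash_file_obj(f) -> str:
    """Hash an open binary file and restore its original position.

    The buggy variant guarded the restore with `if orig_pos:`, which
    silently skipped the seek when the caller was at position 0.
    """
    orig_pos = f.tell()
    f.seek(0)
    h = hashlib.sha256()
    for chunk in iter(lambda: f.read(65536), b""):
        h.update(chunk)
    f.seek(orig_pos)  # restore unconditionally, even when orig_pos == 0
    return h.hexdigest()
```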
Amp-Thread-ID: https://ampcode.com/threads/T-019c9122-ba77-768a-b827-1a4adea1c97e Co-authored-by: Amp <amp@ampcode.com>
185100c to
1fa4c88
Compare
Actionable comments posted: 8
♻️ Duplicate comments (26)
app/assets/services/file_utils.py (1)
4-8: Prefer `st_mtime_ns` directly to avoid float precision loss.

The fallback `int(stat_result.st_mtime * 1_000_000_000)` can lose precision and cause spurious "changed" detections. If your minimum Python version guarantees `st_mtime_ns`, return it directly.

♻️ Suggested simplification
```diff
 def get_mtime_ns(stat_result: os.stat_result) -> int:
     """Extract mtime in nanoseconds from a stat result."""
-    return getattr(
-        stat_result, "st_mtime_ns", int(stat_result.st_mtime * 1_000_000_000)
-    )
+    return stat_result.st_mtime_ns
```

```bash
#!/bin/bash
# Check declared minimum Python version
fd -g 'pyproject.toml' -g 'setup.cfg' -g 'setup.py' -g 'tox.ini' \
  | xargs rg -n "requires-python|python_requires|python_version"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/file_utils.py` around lines 4 - 8, The get_mtime_ns function should use st_mtime_ns directly to avoid float precision loss; replace the getattr fallback with a direct return of stat_result.st_mtime_ns (i.e., change get_mtime_ns to return stat_result.st_mtime_ns) and remove the int(stat_result.st_mtime * 1_000_000_000) fallback since the project's minimum Python version guarantees st_mtime_ns.

tests-unit/assets_test/test_file_utils.py (1)
1-55: Add tests for the new mtime/size helpers.

Current coverage misses `get_mtime_ns`, `get_size_and_mtime_ns`, and `verify_file_unchanged` edge cases (mtime None, size None, mismatch cases). These are new in this PR and drive change-detection behavior.

Do you want me to draft the missing test cases or open a follow-up issue?
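A possible starting point for the missing cases, written against stand-in implementations so the sketch runs on its own (the real helpers in app/assets/services/file_utils.py may differ in signatures and None-handling, which is an assumption here):

```python
import os


# Hypothetical stand-ins for the helpers under test.
def get_mtime_ns(st: os.stat_result) -> int:
    return st.st_mtime_ns


def get_size_and_mtime_ns(path: str) -> tuple[int, int]:
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def verify_file_unchanged(path: str, expected_size, expected_mtime_ns) -> bool:
    # Assumption: unknown expectations count as "changed".
    if expected_size is None or expected_mtime_ns is None:
        return False
    try:
        size, mtime_ns = get_size_and_mtime_ns(path)
    except FileNotFoundError:
        return False
    return size == expected_size and mtime_ns == expected_mtime_ns


def test_verify_roundtrip(tmp_path):
    p = tmp_path / "f.bin"
    p.write_bytes(b"abc")
    size, mtime_ns = get_size_and_mtime_ns(str(p))
    assert verify_file_unchanged(str(p), size, mtime_ns)
    assert not verify_file_unchanged(str(p), size + 1, mtime_ns)  # size mismatch
    assert not verify_file_unchanged(str(p), size, mtime_ns + 1)  # mtime mismatch
    assert not verify_file_unchanged(str(p), None, mtime_ns)      # size None
    assert not verify_file_unchanged(str(p), size, None)          # mtime None
```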
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/assets_test/test_file_utils.py` around lines 1 - 55, Add unit tests covering the new mtime/size helper functions: write tests for get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged to exercise edge cases (when mtime is None, when size is None, and when returned size/mtime mismatch expected values). Specifically, create files under tmp_path and assert get_mtime_ns returns an int (and handles missing files/None), assert get_size_and_mtime_ns returns a tuple (size, mtime_ns) and handles None components, and assert verify_file_unchanged returns True for matching size/mtime and False for each mismatch scenario (size mismatch, mtime mismatch, and when one value is None). Use the function names get_mtime_ns, get_size_and_mtime_ns, and verify_file_unchanged to locate the helpers.

tests-unit/assets_test/queries/conftest.py (1)
2-13: ⚠️ Potential issue | 🟡 Minor

Enable SQLite foreign keys in the in-memory fixture.
SQLite disables FK enforcement by default; without `PRAGMA foreign_keys=ON`, tests can pass while runtime fails. Add the connect listener before `create_all`.

🛠️ Suggested update
```diff
-from sqlalchemy import create_engine
+from sqlalchemy import create_engine, event
@@
     engine = create_engine("sqlite:///:memory:")
+
+    @event.listens_for(engine, "connect")
+    def _set_sqlite_fk(dbapi_connection, _):
+        cursor = dbapi_connection.cursor()
+        cursor.execute("PRAGMA foreign_keys=ON")
+        cursor.close()
+
     Base.metadata.create_all(engine)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/assets_test/queries/conftest.py` around lines 2 - 13, The in-memory SQLite session fixture disables foreign key enforcement by default; add a connection listener that runs PRAGMA foreign_keys=ON before creating tables. In the session fixture (where engine = create_engine(...) and Base.metadata.create_all(engine) are called), register an event listener on the engine's "connect" event to execute "PRAGMA foreign_keys=ON" for each new DBAPI connection, and register it before calling Base.metadata.create_all(engine) so tests enforce FK constraints as in production.

tests-unit/assets_test/queries/test_asset_info.py (1)
206-207: Drop the redundant local import of `time`.

The module already imports `time` at the top.

♻️ Suggested refactor
```diff
-    import time
     time.sleep(0.01)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/assets_test/queries/test_asset_info.py` around lines 206 - 207, Remove the redundant local import of time in the test (the lines "import time" near the sleep) since time is already imported at module top; keep the existing time.sleep(0.01) call but delete the duplicate import so there are no duplicate imports in tests-unit/assets_test/queries/test_asset_info.py.

main.py (1)
261-265: ⚠️ Potential issue | 🟠 Major

Ensure `asset_seeder.resume()` always runs.

If `e.execute(...)` raises, the seeder stays paused for the rest of the process.

🐛 Suggested fix
```diff
         asset_seeder.pause()
-
-        e.execute(item[2], prompt_id, extra_data, item[4])
-
-        asset_seeder.resume()
+        try:
+            e.execute(item[2], prompt_id, extra_data, item[4])
+        finally:
+            asset_seeder.resume()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.py` around lines 261 - 265, The asset_seeder is paused then resumed but if e.execute(item[2], prompt_id, extra_data, item[4]) throws the seeder remains paused; wrap the call in a try/finally so asset_seeder.resume() always runs (call asset_seeder.pause() before the try, put e.execute(...) in the try block, and asset_seeder.resume() in the finally block) referencing the existing asset_seeder.pause, e.execute, and asset_seeder.resume symbols.

app/assets/services/schemas.py (1)
10-14: ⚠️ Potential issue | 🟠 Major — AssetData.hash should allow None to match Asset.hash.

`Asset.hash` is `str | None`, and `extract_asset_data` passes it through; keeping `AssetData.hash` as `str` can leak a `None` into a non-optional field.

🐛 Suggested fix

```diff
 @dataclass(frozen=True)
 class AssetData:
-    hash: str
+    hash: str | None
     size_bytes: int | None
     mime_type: str | None
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/schemas.py` around lines 10 - 14, AssetData.hash is currently typed as str but Asset.hash (and values passed through extract_asset_data) can be None; update the AssetData dataclass definition (AssetData) to make hash Optional by changing its type to str | None so it matches Asset.hash and prevents None being stored in a non-optional field; verify extract_asset_data and any callers still type-check with the new nullable signature.

tests-unit/assets_test/queries/test_asset.py (1)
24-31: Simplify the always-true guard in setup.

This conditional is redundant and obscures intent.

♻️ Suggested refactor

```diff
-        if setup_hash is not None or setup_hash is None:
-            # Create asset with given hash (including None for null hash test)
-            if setup_hash is not None or query_hash == "":
-                asset = Asset(hash=setup_hash, size_bytes=100)
-                session.add(asset)
-                session.commit()
+        # Create asset with given hash (including None for null hash test)
+        if setup_hash is not None or query_hash == "":
+            asset = Asset(hash=setup_hash, size_bytes=100)
+            session.add(asset)
+            session.commit()
```
Verify each finding against the current code and only fix it if needed. In `@tests-unit/assets_test/queries/test_asset.py` around lines 24 - 31, The outer guard in test_exists_by_hash (if setup_hash is not None or setup_hash is None) is always true and should be removed; keep the inner condition that actually controls creation (if setup_hash is not None or query_hash == "") so the test only creates the Asset when intended—update the test_exists_by_hash method to eliminate the redundant outer if and retain the inner conditional and asset creation logic (Asset(hash=setup_hash, size_bytes=100), session.add, session.commit()).

server.py (1)
698-701: ⚠️ Potential issue | 🟠 Major — Guard asset seeder startup so `/object_info` stays reliable.

A seeding exception here will bubble and fail the request.

🛡️ Suggested fix

```diff
-    asset_seeder.start(roots=("models", "input", "output"))
+    try:
+        asset_seeder.start(roots=("models", "input", "output"))
+    except Exception as e:
+        logging.warning("Asset seeder start failed: %s", e)
```
Verify each finding against the current code and only fix it if needed. In `@server.py` around lines 698 - 701, The call to asset_seeder.start in the get_object_info handler can raise and abort the request; wrap asset_seeder.start(roots=("models","input","output")) in a try/except that catches Exception, logs the error (using the existing logger or processLogger) with context, and allows the request to continue (or return a safe fallback) so a seeding failure does not bubble up; update the get_object_info function (identified by the routes.get("/object_info") decorator and asset_seeder.start usage) to perform this guarded startup before entering the with folder_paths.cache_helper block.

app/assets/database/queries/asset.py (2)
133-137: `session.flush()` called unconditionally even when `ref` is `None`.

When the reference isn't found, the flush is a no-op but unnecessary. Move it inside the `if ref:` block.

♻️ Proposed fix

```diff
     ref = session.get(AssetReference, reference_id)
     if ref:
         ref.asset_id = to_asset_id
-
-    session.flush()
+        session.flush()
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/queries/asset.py` around lines 133 - 137, The call to session.flush() is executed even when no AssetReference is found; update the logic in the code handling AssetReference retrieval (where session.get(AssetReference, reference_id) is assigned to ref) so that session.flush() is moved inside the if ref: block (after setting ref.asset_id = to_asset_id) and is not called when ref is None; locate the block referencing AssetReference, reference_id, ref.asset_id and adjust the control flow accordingly.
93-103: ⚠️ Potential issue | 🟠 Major — Missing chunking: the `IN` clause can exceed SQLite's bind-parameter limit.

`get_existing_asset_ids` passes the entire `asset_ids` list into a single `IN (...)` clause. Large lists will hit SQLite's bind-parameter ceiling. The sibling function `bulk_insert_assets` already uses `iter_chunks` — the same pattern should be applied here.

🐛 Proposed fix

```diff
 def get_existing_asset_ids(
     session: Session,
     asset_ids: list[str],
 ) -> set[str]:
     """Return the subset of asset_ids that exist in the database."""
     if not asset_ids:
         return set()
-    rows = session.execute(
-        select(Asset.id).where(Asset.id.in_(asset_ids))
-    ).fetchall()
-    return {row[0] for row in rows}
+    found: set[str] = set()
+    for chunk in iter_chunks(asset_ids, calculate_rows_per_statement(1)):
+        rows = session.execute(
+            select(Asset.id).where(Asset.id.in_(chunk))
+        ).fetchall()
+        found.update(row[0] for row in rows)
+    return found
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/queries/asset.py` around lines 93 - 103, get_existing_asset_ids currently runs a single SELECT ... WHERE Asset.id IN (asset_ids) which will break on large lists due to SQLite bind-parameter limits; change it to iterate the input asset_ids in chunks using the existing iter_chunks helper (same pattern as bulk_insert_assets), run a select(Asset.id).where(Asset.id.in_(chunk)) for each chunk against the Session, collect IDs from each rows batch into a single set, and return that set so large inputs are handled safely.

app/assets/api/upload.py (2)
163-170: ⚠️ Potential issue | 🟡 Minor — Orphaned temp directory left behind on cleanup.

`delete_temp_file_if_exists` removes the file but not the parent `unique_dir` created at line 112. Over time, empty directories accumulate under `temp/uploads/`.

♻️ Proposed fix

```diff
 def delete_temp_file_if_exists(tmp_path: str | None) -> None:
     """Safely remove a temp file if it exists."""
     if tmp_path:
         try:
             if os.path.exists(tmp_path):
                 os.remove(tmp_path)
+            parent = os.path.dirname(tmp_path)
+            if parent and os.path.isdir(parent):
+                try:
+                    os.rmdir(parent)  # only succeeds if empty
+                except OSError:
+                    pass
         except OSError as e:
             logging.debug("Failed to delete temp file %s: %s", tmp_path, e)
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/api/upload.py` around lines 163 - 170, delete_temp_file_if_exists currently only removes the temp file but leaves its parent unique_dir (created earlier) behind; update delete_temp_file_if_exists to after removing tmp_path compute the parent directory (e.g., parent = os.path.dirname(tmp_path)) and attempt to remove it if empty (os.rmdir or Path(parent).rmdir()) while catching and logging OSError/Exception so non-empty or permission errors are ignored; reference the function delete_temp_file_if_exists and ensure the cleanup targets the parent directory created for uploads (the unique_dir under temp/uploads) to prevent accumulating empty directories.
93-109: ⚠️ Potential issue | 🟡 Minor — Hash-before-file ordering dependency in multipart parsing.

The drain-without-write optimization on line 97 only activates if the `hash` field arrives before `file`. If the client sends `file` first, data is written to disk even when the hash is already known. Multipart field order isn't guaranteed by the HTTP spec.

Consider buffering the decision: accept the file to a temp path, then delete the temp file post-loop if the hash turned out to be known. Alternatively, document the expected field order in the API contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/assets/api/upload.py` around lines 93 - 109, The current branch inside the multipart handling (when fname == "file") only drains without writing if provided_hash_exists is already true, creating a dependency on field order; instead, always stream the incoming file to a temp path (use the existing file_client_name/file_written logic and field.read_chunk) and record the path, then after the overall multipart loop check provided_hash/provided_hash_exists and, if the hash indicates the upload is a duplicate, delete the temp file and adjust any tracking rather than relying on hash arriving before file; keep the same UploadError handling around read/write operations and ensure cleanup on exceptions.

app/assets/helpers.py (1)
6-23: `select_best_live_path` uses an unparameterized `Sequence`.

The function accesses `file_path` and `needs_verify` via `getattr`, which makes the expected interface invisible to type checkers. A lightweight `Protocol` would make the contract explicit.
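A minimal sketch of such a protocol, assuming only the two attributes named in the finding; the selection logic below is illustrative, not the real `select_best_live_path` body:

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


class LiveCandidate(Protocol):
    """The two attributes the selection helper actually relies on."""
    file_path: Optional[str]
    needs_verify: bool


def pick_live_path(states: Sequence[LiveCandidate]) -> Optional[str]:
    """Illustrative policy: prefer a verified path, fall back to any path."""
    fallback: Optional[str] = None
    for s in states:
        if s.file_path is None:
            continue
        if not s.needs_verify:
            return s.file_path
        if fallback is None:
            fallback = s.file_path
    return fallback


@dataclass
class FakeState:
    """Any object with matching attributes satisfies the Protocol."""
    file_path: Optional[str]
    needs_verify: bool
```

With `Sequence[LiveCandidate]` in the signature, a type checker flags call sites that pass objects lacking either attribute, which `getattr` access never would.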
Verify each finding against the current code and only fix it if needed. In `@app/assets/helpers.py` around lines 6 - 23, The function select_best_live_path currently types its parameter as an unparameterized Sequence and relies on getattr for file_path and needs_verify; define a lightweight Protocol (e.g., class HasFilePath(Protocol): file_path: str | None; needs_verify: bool) and change the signature of select_best_live_path to accept Sequence[HasFilePath] so type checkers see the expected attributes, then replace getattr usage with direct attribute access (s.file_path, s.needs_verify) inside select_best_live_path.

app/assets/services/path_utils.py (2)
25-50: ⚠️ Potential issue | 🟠 Major — Missing validation for empty or unknown root tags.

`tags[0]` on line 27 will raise `IndexError` on an empty list, and any unknown root (e.g., `"custom"`) silently falls through to the output directory on line 43.

🐛 Proposed fix

```diff
 def resolve_destination_from_tags(tags: list[str]) -> tuple[str, list[str]]:
     """Validates and maps tags -> (base_dir, subdirs_for_fs)"""
+    if not tags:
+        raise ValueError("at least one tag required")
     root = tags[0]
+    if root not in ("models", "input", "output"):
+        raise ValueError(f"unknown root tag '{root}'")
     if root == "models":
         if len(tags) < 2:
             raise ValueError("at least two tags required for model asset")
@@
-    else:
-        base_dir = os.path.abspath(
-            folder_paths.get_input_directory()
-            if root == "input"
-            else folder_paths.get_output_directory()
-        )
+    elif root == "input":
+        base_dir = os.path.abspath(folder_paths.get_input_directory())
+        raw_subdirs = tags[1:]
+    else:  # root == "output"
+        base_dir = os.path.abspath(folder_paths.get_output_directory())
         raw_subdirs = tags[1:]
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/path_utils.py` around lines 25 - 50, In resolve_destination_from_tags, first validate that the tags list is not empty (guard against tags[0] IndexError) and raise a clear ValueError if it is empty; next replace the implicit fallback that treats any unknown root as "output" with an explicit allowed-root check (only allow "models", "input", or "output") and raise ValueError for unknown roots (e.g., "custom"); keep the existing models-specific logic (folder_paths.folder_names_and_paths lookup and bases emptiness check) and preserve the path component validation for raw_subdirs.
53-61: ⚠️ Potential issue | 🟠 Major — Symlink bypass in path containment check.

`os.path.abspath` doesn't resolve symlinks. A symlinked subdirectory can escape the base path while passing this check. Use `os.path.realpath` instead.

🐛 Proposed fix

```diff
 def validate_path_within_base(candidate: str, base: str) -> None:
-    cand_abs = os.path.abspath(candidate)
-    base_abs = os.path.abspath(base)
+    cand_abs = os.path.realpath(candidate)
+    base_abs = os.path.realpath(base)
     try:
         common = os.path.commonpath([cand_abs, base_abs])
     except Exception:
         raise ValueError("invalid destination path")
     if common != base_abs:
         raise ValueError("destination escapes base directory")
```
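A self-contained demonstration of the bypass (stdlib only; the directory layout is invented for illustration, and symlink creation must be permitted on the platform):

```python
import os
import tempfile


def is_within(candidate: str, base: str, resolver) -> bool:
    """Containment check parameterized by the path resolver."""
    cand = resolver(candidate)
    base_r = resolver(base)
    return os.path.commonpath([cand, base_r]) == base_r


with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "base")
    outside = os.path.join(root, "outside")
    os.makedirs(base)
    os.makedirs(outside)
    os.symlink(outside, os.path.join(base, "link"))  # base/link -> outside

    candidate = os.path.join(base, "link", "file.bin")
    # abspath leaves the symlink in place, so the escape goes undetected:
    naive = is_within(candidate, base, os.path.abspath)
    # realpath resolves it, exposing that the target lives outside base:
    strict = is_within(candidate, base, os.path.realpath)
```

Here `naive` is `True` while `strict` is `False`: only the `realpath`-based check rejects the symlinked escape.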
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/path_utils.py` around lines 53 - 61, The validate_path_within_base function uses os.path.abspath which does not resolve symlinks and allows symlinked paths to bypass the containment check; change both cand_abs and base_abs to use os.path.realpath (resolve symlinks) before computing os.path.commonpath and keep the existing exception handling and ValueError messages so the function properly rejects paths that escape the base via symlinks.

app/assets/api/routes.py (3)
651-652: ⚠️ Potential issue | 🔴 Critical — Blocking `asset_seeder.wait()` call inside async handler freezes the event loop.

`asset_seeder.wait()` calls `Thread.join()` synchronously, blocking the entire aiohttp event loop until the scan completes. This should be offloaded to a thread executor.

Proposed fix

```diff
+    import asyncio
+
     if should_wait:
-        asset_seeder.wait()
+        await asyncio.to_thread(asset_seeder.wait)
     if was_disabled:
         asset_seeder.disable()
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/api/routes.py` around lines 651 - 652, The handler is calling asset_seeder.wait(), which blocks via Thread.join() and freezes the aiohttp event loop; change the call to run the blocking wait in a thread executor (e.g., use asyncio.get_running_loop().run_in_executor(...) or asyncio.to_thread(...)) so the join runs off the event loop; update the async handler where asset_seeder.wait() is invoked to await the executor call instead of calling wait() directly (referencing asset_seeder.wait and the surrounding async request handler).
720-725: ⚠️ Potential issue | 🟡 Minor — Check-after-act: `mark_missing_outside_prefixes()` executes before the state check.

The seeder's own `mark_missing_outside_prefixes` method does check for IDLE internally (returning 0 if not IDLE), so the prune won't actually race with a scan. However, `marked == 0` is ambiguous — it could mean "nothing to mark" or "seeder was busy". Consider checking state before calling the method for a cleaner 409 response.

Proposed fix: check state first

```diff
 async def mark_missing_assets(request: web.Request) -> web.Response:
+    status = asset_seeder.get_status()
+    if status.state.value != "IDLE":
+        return web.json_response(
+            {"status": "scan_running", "marked": 0},
+            status=409,
+        )
     marked = asset_seeder.mark_missing_outside_prefixes()
-    if marked == 0 and asset_seeder.get_status().state.value != "IDLE":
-        return web.json_response(
-            {"status": "scan_running", "marked": 0},
-            status=409,
-        )
     return web.json_response({"status": "completed", "marked": marked}, status=200)
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/api/routes.py` around lines 720 - 725, The code calls asset_seeder.mark_missing_outside_prefixes() before checking asset_seeder.get_status().state.value, making a return value of 0 ambiguous; update the logic to first check asset_seeder.get_status().state.value != "IDLE" and immediately return the 409 json_response when a scan is running, and only then call asset_seeder.mark_missing_outside_prefixes(); after calling mark_missing_outside_prefixes() return the normal response based on the actual marked count so a 0 means “nothing to mark” not “seeder busy.”
588-611: ⚠️ Potential issue | 🟠 Major — `delete_asset_tags` doesn't catch `PermissionError`, unlike `add_asset_tags` and `update_asset_route`.

If `remove_tags` raises `PermissionError` for a non-owner, it falls through to the generic `Exception` handler and returns 500 instead of 403.

Proposed fix

```diff
-    except ValueError as ve:
+    except (ValueError, PermissionError) as ve:
         return _build_error_response(
             404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
         )
```

Note: If you want `PermissionError` to return 403 (like other handlers), you may want separate `except` clauses:

```diff
+    except PermissionError as pe:
+        return _build_error_response(
+            403, "FORBIDDEN", str(pe), {"id": reference_id}
+        )
     except ValueError as ve:
         return _build_error_response(
             404, "ASSET_NOT_FOUND", str(ve), {"id": reference_id}
         )
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/api/routes.py` around lines 588 - 611, delete_asset_tags currently catches ValueError and a generic Exception but not PermissionError, so PermissionError raised by remove_tags is treated as a 500; add a dedicated except PermissionError block in the delete_asset_tags handler to mirror add_asset_tags/update_asset_route behavior: catch PermissionError (from remove_tags) and return _build_error_response(403, "FORBIDDEN", str(pe) or a suitable message, and include {"id": reference_id}) while leaving the existing ValueError and generic Exception handlers intact; locate the try/except around remove_tags in delete_asset_tags to implement this.

app/assets/database/models.py (1)
145-162: ⚠️ Potential issue | 🟡 Minor — Missing index on `updated_at`: API sort queries on this column will require a full table scan.

The `_validate_sort_field` function in `routes.py` (Line 92) permits sorting by `updated_at`, but no index exists for it in `__table_args__`. Both `created_at` and `last_access_time` have indexes, so `updated_at` should too.

Proposed fix

```diff
         Index("ix_asset_references_created_at", "created_at"),
         Index("ix_asset_references_last_access_time", "last_access_time"),
+        Index("ix_asset_references_updated_at", "updated_at"),
         Index("ix_asset_references_owner_name", "owner_id", "name"),
```

Also add a corresponding `op.create_index` in the Alembic migration.
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/models.py` around lines 145 - 162, Add an index on the updated_at column to avoid full table scans by updating the model's __table_args__ (add an Index like ix_asset_references_updated_at on "updated_at") and create the matching Alembic migration that calls op.create_index for the same index name on the asset_references table; ensure the index name matches between the SQLAlchemy model (in models.py) and the Alembic migration so API sort queries using updated_at (per _validate_sort_field in routes.py) can use the index.

app/assets/services/bulk_ingest.py (1)
186-193: ⚠️ Potential issue | 🟠 Major — Duplicate-hash assets silently lose their references.

When `bulk_insert_assets` skips an asset due to hash conflict (ON CONFLICT DO NOTHING), the new `asset_id` is never inserted, so `get_existing_asset_ids` won't return it. The corresponding reference row is then filtered out at lines 191-193, meaning a new file path pointing to an already-known hash never gets a reference row.

Consider resolving the existing asset ID by hash for skipped specs, and remapping `reference_rows` to point to the existing asset rather than dropping them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/bulk_ingest.py` around lines 186 - 193, bulk_insert_assets currently drops reference_rows whose asset_id wasn't returned by get_existing_asset_ids, losing references for assets skipped due to hash conflicts; fix by resolving skipped assets by hash and remapping reference_rows to the existing asset IDs instead of dropping them. Specifically: after calling get_existing_asset_ids(session, [...]) and before filtering reference_rows, collect the hashes from the original import specs (the same hash field used by bulk_insert_assets), query assets by hash to build a map hash -> existing_asset_id, then for any reference_row whose asset_id is missing from inserted_asset_ids replace its asset_id with the existing_asset_id from that map; finally filter reference_rows to only those whose (possibly remapped) asset_id exists. Use the existing functions bulk_insert_assets and get_existing_asset_ids and the reference_rows list to locate where to insert the resolver and remapping logic.

app/assets/scanner.py (1)
298-306: ⚠️ Potential issue | 🟠 Major — `follow_symlinks=False` in spec builders mismatches `follow_symlinks=True` in sync — perpetual `needs_verify` for symlinks.

Both `build_asset_specs` (line 304) and `build_stub_specs` (line 376) use `os.stat(abs_p, follow_symlinks=False)` (lstat), while `sync_references_with_filesystem` (line 147) uses `follow_symlinks=True` (stat). For symlinked files, the recorded mtime/size will differ from the verification check, causing `verify_file_unchanged` to return `False` every scan cycle.

Proposed fix: align follow_symlinks policy

```diff
-            stat_p = os.stat(abs_p, follow_symlinks=False)
+            stat_p = os.stat(abs_p, follow_symlinks=True)
```

Apply to both `build_asset_specs` (line 304) and `build_stub_specs` (line 376).

Also applies to: 370-380
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/assets/scanner.py` around lines 298 - 306, The spec builders build_asset_specs and build_stub_specs currently call os.stat(..., follow_symlinks=False) (lstat), which records symlink metadata and mismatches sync_references_with_filesystem/verify_file_unchanged that use os.stat(..., follow_symlinks=True), causing symlinked files to always appear changed; update the os.stat calls in build_asset_specs and build_stub_specs to use follow_symlinks=True so recorded mtime/size match the verification path and remove perpetual needs_verify for symlinks.

app/assets/database/queries/asset_reference.py (5)
628-669: `result.rowcount` may be `None` at lines 652 and 669.
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/queries/asset_reference.py` around lines 628 - 669, The SQLAlchemy ExecutionResult.rowcount can be None, so ensure both mark_references_missing_outside_prefixes and restore_references_by_paths return an int by normalizing None to 0; after executing the update in mark_references_missing_outside_prefixes (result) and restore_references_by_paths (result) convert result.rowcount to an integer (e.g., use result.rowcount or 0 or int(result.rowcount or 0)) before returning so the functions always return a concrete int.
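The normalization that prompt asks for is a one-liner; a stub stands in for SQLAlchemy's result object here (names are illustrative):

```python
class StubResult:
    """Stand-in for a CursorResult whose rowcount may be None."""
    def __init__(self, rowcount):
        self.rowcount = rowcount


def affected_rows(result) -> int:
    # Treat a missing/None rowcount as "zero rows affected".
    return int(result.rowcount or 0)
```

Applying `affected_rows(result)` at both return sites guarantees the functions always hand back a concrete `int`.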
947-974: `IN`/`IN` cross-product doesn't correctly correlate `file_path` ↔ `asset_id` pairs.
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/queries/asset_reference.py` around lines 947 - 974, The WHERE currently uses two independent IN clauses causing a cross-product; change it to match pairs by using a tuple IN comparison: build a list of (file_path, asset_id) tuples for each chunk (using path_to_asset to map paths) and use sqlalchemy.sql.expression.tuple_(AssetReference.file_path, AssetReference.asset_id).in_(list_of_tuples) in the select. Update the loop in get_references_by_paths_and_asset_ids to create chunk_pairs from iter_chunks(paths, MAX_BIND_PARAMS) and pass those paired tuples to the query so file_path ↔ asset_id are correlated.
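Independent of SQLAlchemy, the data-shaping half of that fix is just pairing and chunking; a sketch with assumed names (`path_to_asset`, `max_pairs`):

```python
from itertools import islice


def build_pair_chunks(paths, path_to_asset, max_pairs):
    """Yield lists of correlated (file_path, asset_id) tuples, chunked so
    each later tuple-IN query stays under the bind-parameter limit."""
    pairs = [(p, path_to_asset[p]) for p in paths if p in path_to_asset]
    it = iter(pairs)
    while chunk := list(islice(it, max_pairs)):
        yield chunk
```

Each yielded chunk would then feed a `tuple_(AssetReference.file_path, AssetReference.asset_id).in_(chunk)` clause, keeping the two columns correlated per row instead of matched independently.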
817-829: `delete_orphaned_seed_asset` doesn't enforce the "hash is None" precondition.
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/queries/asset_reference.py` around lines 817 - 829, The function delete_orphaned_seed_asset currently deletes AssetReference rows unconditionally; change it to first load the Asset (session.get(Asset, asset_id)), verify that asset is not None and asset.hash is None, and only then execute the deletion of AssetReference records (sa.delete(AssetReference).where(AssetReference.asset_id == asset_id)) and session.delete(asset); if the asset is missing or its hash is not None, do not delete references and return False. Use the existing symbols Asset, AssetReference, and delete_orphaned_seed_asset to locate and update the logic so the "hash is None" precondition is enforced and the return value reflects whether deletion occurred.
928-944: Hardcoded column count `14` in `calculate_rows_per_statement(14)` may drift if the model changes.
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/queries/asset_reference.py` around lines 928 - 944, The hardcoded column count (14) in bulk_insert_references_ignore_conflicts when calling calculate_rows_per_statement can drift; replace the literal with a dynamic column count computed from the actual insert payload or model (e.g., use the number of keys in enriched_rows[0] or AssetReference.__table__.columns) and pass that value to calculate_rows_per_statement so the chunk sizing stays correct; update the call at the iter_chunks(...) line inside bulk_insert_references_ignore_conflicts accordingly.
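Deriving the divisor from the payload itself removes the drift risk; a sketch in which `MAX_BIND_PARAMS` is an assumed constant (999 is SQLite's historical default cap):

```python
MAX_BIND_PARAMS = 999  # assumption: SQLite's historical bind-parameter cap


def rows_per_statement(sample_row: dict, max_params: int = MAX_BIND_PARAMS) -> int:
    """How many rows fit in one multi-row INSERT given the column count."""
    columns = len(sample_row)
    if columns == 0:
        raise ValueError("row has no columns")
    return max(1, max_params // columns)
```

Passing `rows_per_statement(enriched_rows[0])` instead of a literal keeps the chunk sizing correct even when columns are added to the model.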
290-365: Count query duplicates all filter logic from the data query — divergence risk.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/assets/database/queries/asset_reference.py` around lines 290 - 365, The count query duplicates the filter/join logic in list_references_page (risking divergence); refactor to build a single filtered Select and reuse it for both data and count. Concretely, inside list_references_page extract the select(AssetReference).join(...).where(...).options(...) construction (and the subsequent name_contains, _apply_tag_filters, _apply_metadata_filter steps) into a single variable (e.g. filtered_stmt) and then derive the paginated query from it by applying order_by/limit/offset, while deriving the count by converting filtered_stmt to a count select (e.g. filtered_stmt.with_only_columns(sa.func.count()).order_by(None) or select(sa.func.count()).select_from(filtered_stmt.subquery())). Update usages of base and count_stmt accordingly and remove the duplicated join/where logic; keep references to functions _apply_tag_filters and _apply_metadata_filter and symbols AssetReference, AssetReferenceTag, Tag, and Asset to locate code.
🧹 Nitpick comments (9)
app/assets/services/tagging.py (1)
41-64: `remove_tags` performs a redundant reference lookup.

`remove_tags_from_reference` (in `tags.py`, lines 177-215 from the snippets) already fetches the reference and raises `ValueError` if not found. The pre-fetch and ownership check here in the service layer means the reference is loaded twice. This is intentional for ownership enforcement (which the query layer doesn't do), so it's fine from a correctness standpoint, but worth noting the double lookup.

Consider passing `reference_row=ref_row` to `remove_tags_from_reference` (similar to how `apply_tags` passes it to `add_tags_to_reference`) to avoid the redundant `session.get()` inside the query function. This would require updating the `remove_tags_from_reference` signature to accept an optional `reference_row` parameter, matching the pattern already used by `add_tags_to_reference`.
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/tagging.py` around lines 41 - 64, The service function remove_tags currently does a duplicate lookup; to avoid it, change remove_tags to pass the already-fetched ref_row into remove_tags_from_reference (i.e. call remove_tags_from_reference(session, reference_id=reference_id, tags=tags, reference_row=ref_row)) and update the remove_tags_from_reference function signature to accept an optional reference_row parameter and use that row if provided instead of calling session.get()/get_reference_by_id; keep the ownership check in remove_tags as-is and ensure the updated query function still raises ValueError when no reference_row is available.

app/assets/helpers.py (1)
26-30: `ALLOWED_ROOTS` constant is a clean addition.

Centralizing the root literals in one place improves consistency. Consider using this constant in `resolve_destination_from_tags` (in `path_utils.py`) for the root validation rather than hardcoding the set there.
Verify each finding against the current code and only fix it if needed. In `@app/assets/helpers.py` around lines 26 - 30, Update resolve_destination_from_tags in path_utils.py to use the centralized ALLOWED_ROOTS constant (from app.assets.helpers) instead of the hardcoded set; add the appropriate import for ALLOWED_ROOTS at the top of path_utils.py and replace any validation logic that compares tag roots (e.g., the code that previously used {"models", "input", "output"}) to reference ALLOWED_ROOTS so root validation stays consistent with the new constant.

tests-unit/seeder_test/test_seeder.py (2)
664-666: Missing blank line between test classes.

PEP 8 expects two blank lines between top-level definitions. There's only one blank line before `TestSeederStopRestart`.

♻️ Fix

```diff
         fresh_seeder.wait(timeout=5.0)
         assert fresh_seeder.get_status().state == State.IDLE
 
+
 class TestSeederStopRestart:
```
Verify each finding against the current code and only fix it if needed. In `@tests-unit/seeder_test/test_seeder.py` around lines 664 - 666, The file is missing the required blank line separation between top-level classes; add one additional blank line between the end of the previous test class (just after the assertion checking fresh_seeder.get_status().state == State.IDLE) and the declaration of the next test class TestSeederStopRestart so there are two blank lines between top-level definitions per PEP 8.
339-364: Potential race between `barrier.set()` and `with patch(...)` scope exit.

After `barrier.set()` on line 362, the seeder thread unblocks and continues executing. The `with patch(...)` context on line 348 exits immediately after line 362, restoring the original `collect_paths_for_roots`. If the seeder thread somehow re-invokes `collect_paths_for_roots` after the patch is torn down but before the `mock_dependencies` fixture restores its version, you'd get unexpected behavior.

In practice this is safe here because `collect_paths_for_roots` is only called once per scan, and `mock_dependencies` covers the remaining functions. Adding a `fresh_seeder.wait(timeout=5.0)` before exiting the `with` block would make this airtight, consistent with how other tests handle it.

♻️ Suggested improvement

```diff
         barrier.set()
-        assert sum(results) == 1
+        fresh_seeder.wait(timeout=5.0)
+
+        assert sum(results) == 1
```
Verify each finding against the current code and only fix it if needed. In `@tests-unit/seeder_test/test_seeder.py` around lines 339 - 364, There's a potential race where the patched collect_paths_for_roots may be restored before the background seeder finishes after barrier.set(); to fix, keep the patch active until the seeder thread completes by waiting for the seeder to finish inside the with block—call fresh_seeder.wait(timeout=5.0) (or equivalent synchronization on AssetSeeder) before exiting the with patch(...) context so collect_paths_for_roots remains patched while the seeder thread unwraps.

app/assets/services/path_utils.py (1)
114-117: `_compute_relative` has a convoluted normalization pattern.

```python
os.path.relpath(
    os.path.join(os.sep, os.path.relpath(child, parent)), os.sep
)
```

This computes `relpath(child, parent)`, then joins it to `/`, then computes `relpath` of that against `/` — effectively just normalizing out any leading `../` segments. But if `child` is not under `parent`, the intermediate `relpath` will contain `..` components, and the outer `relpath` will silently collapse them, potentially returning a misleading relative path. This function is only called after `_check_is_within` confirms containment, so in practice it's safe, but a simpler `os.path.relpath(child, parent)` would suffice and be more readable.
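A quick stdlib check (POSIX paths; the example paths are invented) confirms both the equivalence for contained paths and the silent collapse for escaping ones:

```python
import os


def convoluted(child: str, parent: str) -> str:
    # The round-trip-through-root pattern described above.
    return os.path.relpath(
        os.path.join(os.sep, os.path.relpath(child, parent)), os.sep
    )


base = "/data/assets"
inside = "/data/assets/models/ckpt.safetensors"
outside = "/data/secrets/key"

# Contained path: both forms agree.
same = convoluted(inside, base) == os.path.relpath(inside, base)
# Escaping path: the round-trip silently drops the leading "..".
collapsed = convoluted(outside, base)
honest = os.path.relpath(outside, base)
```

Here `collapsed` comes back as `secrets/key` while the plain `relpath` honestly reports `../secrets/key`, which is exactly why the simpler form is preferable.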
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/path_utils.py` around lines 114 - 117, The _compute_relative function uses a convoluted normalization (calling os.path.relpath(child, parent), joining to os.sep, then relpath again) which can silently collapse leading ".." and is unnecessary because callers run _check_is_within first; simplify _compute_relative to directly return os.path.relpath(child, parent) (replace the current implementation in function _compute_relative to use a single os.path.relpath call and remove the os.path.join/os.sep double-relpath pattern).app/assets/api/upload.py (1)
12-30: Hash validation duplicated between `normalize_and_validate_hash` and `UploadAssetSpec._parse_hash` / `CreateFromHashBody._require_blake3`.

Three separate locations implement nearly identical blake3 hash validation logic (this function, plus the two Pydantic validators in `schemas_in.py`). Consolidating into a single shared helper would reduce the chance of drift.
Verify each finding against the current code and only fix it if needed. In `@app/assets/api/upload.py` around lines 12 - 30, There is duplicated blake3 hash validation across normalize_and_validate_hash and the Pydantic validators UploadAssetSpec._parse_hash and CreateFromHashBody._require_blake3 in schemas_in.py; extract the canonical logic into a single shared helper (e.g., normalize_and_validate_blake3_hash) that returns the normalized "blake3:<hex>" or raises a clear error, then update normalize_and_validate_hash to call that helper and replace the bodies of UploadAssetSpec._parse_hash and CreateFromHashBody._require_blake3 to call the same helper (converting its exception to the validator-appropriate exception type if needed, e.g., raise ValueError inside Pydantic validators but keep UploadError for API paths), and add the new helper to the appropriate shared utils module and imports so all three places reuse the single implementation.app/assets/services/ingest.py (1)
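One shape such a shared helper could take — entirely hypothetical, since the review only asks for consolidation, not this exact API — with the API path converting `ValueError` to its own error type and the Pydantic validators re-raising as-is:

```python
import re

# blake3 digests are 32 bytes, i.e. 64 lowercase hex characters.
_BLAKE3_RE = re.compile(r"^(?:blake3:)?([0-9a-f]{64})$")


def normalize_blake3(value: str) -> str:
    """Hypothetical shared helper: return canonical 'blake3:<hex>' or raise ValueError."""
    m = _BLAKE3_RE.match(value.strip().lower())
    if not m:
        raise ValueError(f"not a valid blake3 hash: {value!r}")
    return f"blake3:{m.group(1)}"


canon = normalize_blake3("BLAKE3:" + "ab" * 32)
assert canon == "blake3:" + "ab" * 32
```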
251-256: Exception chaining lost when re-raising.

The `from e` clause is missing, which discards the original traceback in the `DependencyMissingError` and `RuntimeError` raises.

Proposed fix

```diff
 try:
     digest = hashing.compute_blake3_hash(temp_path)
 except ImportError as e:
-    raise DependencyMissingError(str(e))
+    raise DependencyMissingError(str(e)) from e
 except Exception as e:
-    raise RuntimeError(f"failed to hash uploaded file: {e}")
+    raise RuntimeError(f"failed to hash uploaded file: {e}") from e
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/ingest.py` around lines 251 - 256, The except blocks around hashing.compute_blake3_hash(temp_path) drop the original tracebacks; update both re-raises to preserve exception chaining by using "from e" (e.g., raise DependencyMissingError(str(e)) from e and raise RuntimeError(f"failed to hash uploaded file: {e}") from e) so the original exception context is retained when raising DependencyMissingError and RuntimeError.app/assets/seeder.py (1)
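The effect of `from e` is that the original exception survives as `__cause__`, so the underlying traceback stays visible. A minimal sketch with stand-in names for the review's types:

```python
class DependencyMissingError(RuntimeError):
    pass


def compute_blake3_hash(path: str) -> str:
    # Simulates the failure mode: the blake3 dependency is absent.
    raise ImportError("blake3 not installed")


try:
    try:
        compute_blake3_hash("x")
    except ImportError as e:
        # "from e" preserves the original exception as __cause__
        raise DependencyMissingError(str(e)) from e
except DependencyMissingError as err:
    assert isinstance(err.__cause__, ImportError)
    assert str(err) == "blake3 not installed"
```

Without `from e`, Python still records the original as `__context__`, but explicit chaining makes intent clear and survives `raise ... from None` suppression patterns elsewhere.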
84-100: `_prune_first` is not initialized in `__init__`, requiring a `getattr` workaround in `restart`.

`_prune_first` is only set inside `start()` (Line 149), but `restart()` uses `getattr(self, "_prune_first", False)` at Line 293 to avoid `AttributeError`. Initialize it in `__init__` alongside other instance variables for consistency and to remove the defensive `getattr`.

Proposed fix

```diff
 self._compute_hashes: bool = False
 self._progress_callback: ProgressCallback | None = None
 self._disabled: bool = False
+self._prune_first: bool = False
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/seeder.py` around lines 84 - 100, Initialize the missing _prune_first attribute in the class __init__ to avoid the getattr workaround used in restart; add self._prune_first: bool = False alongside other instance variables in __init__ so start() can overwrite it and restart() can safely reference it without getattr(self, "_prune_first", False).app/assets/services/asset_management.py (1)
130-177: File deletion after commit silently swallows all exceptions.
`contextlib.suppress(Exception)` at Line 173 will suppress any exception during file deletion, including unexpected errors (e.g., permission errors on directories, symlink issues). While defensive, this could hide operational problems. Consider logging at `debug` or `warning` level inside the suppression.

Suggestion

```diff
 for p in file_paths:
-    with contextlib.suppress(Exception):
-        if p and os.path.isfile(p):
-            os.remove(p)
+    try:
+        if p and os.path.isfile(p):
+            os.remove(p)
+    except Exception:
+        logging.debug("Failed to remove orphan file %s", p, exc_info=True)
```
Verify each finding against the current code and only fix it if needed. In `@app/assets/services/asset_management.py` around lines 130 - 177, The file deletion loop in delete_asset_reference uses contextlib.suppress(Exception) which hides all errors when removing paths in file_paths; change this to catch Exception as e (or use contextlib.suppress but still log) and emit a warning/debug log including the path and exception details so permission/symlink errors aren't silently swallowed; add or use a module logger (e.g., logger = logging.getLogger(__name__)) and log inside the exception handler referencing delete_asset_reference and file_paths (or the loop variable p) so failures are recorded while still allowing the loop to continue.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (58)
- alembic_db/versions/0002_merge_to_asset_references.py
- app/assets/api/routes.py
- app/assets/api/schemas_in.py
- app/assets/api/schemas_out.py
- app/assets/api/upload.py
- app/assets/database/bulk_ops.py
- app/assets/database/models.py
- app/assets/database/queries.py
- app/assets/database/queries/__init__.py
- app/assets/database/queries/asset.py
- app/assets/database/queries/asset_reference.py
- app/assets/database/queries/common.py
- app/assets/database/queries/tags.py
- app/assets/database/tags.py
- app/assets/hashing.py
- app/assets/helpers.py
- app/assets/manager.py
- app/assets/scanner.py
- app/assets/seeder.py
- app/assets/services/__init__.py
- app/assets/services/asset_management.py
- app/assets/services/bulk_ingest.py
- app/assets/services/file_utils.py
- app/assets/services/hashing.py
- app/assets/services/ingest.py
- app/assets/services/metadata_extract.py
- app/assets/services/path_utils.py
- app/assets/services/schemas.py
- app/assets/services/tagging.py
- app/database/db.py
- main.py
- requirements.txt
- server.py
- tests-unit/assets_test/conftest.py
- tests-unit/assets_test/helpers.py
- tests-unit/assets_test/queries/conftest.py
- tests-unit/assets_test/queries/test_asset.py
- tests-unit/assets_test/queries/test_asset_info.py
- tests-unit/assets_test/queries/test_cache_state.py
- tests-unit/assets_test/queries/test_metadata.py
- tests-unit/assets_test/queries/test_tags.py
- tests-unit/assets_test/services/__init__.py
- tests-unit/assets_test/services/conftest.py
- tests-unit/assets_test/services/test_asset_management.py
- tests-unit/assets_test/services/test_bulk_ingest.py
- tests-unit/assets_test/services/test_enrich.py
- tests-unit/assets_test/services/test_ingest.py
- tests-unit/assets_test/services/test_tagging.py
- tests-unit/assets_test/test_assets_missing_sync.py
- tests-unit/assets_test/test_crud.py
- tests-unit/assets_test/test_downloads.py
- tests-unit/assets_test/test_file_utils.py
- tests-unit/assets_test/test_list_filter.py
- tests-unit/assets_test/test_prune_orphaned_assets.py
- tests-unit/assets_test/test_tags_api.py
- tests-unit/assets_test/test_uploads.py
- tests-unit/requirements.txt
- tests-unit/seeder_test/test_seeder.py
💤 Files with no reviewable changes (7)
- tests-unit/assets_test/conftest.py
- tests-unit/requirements.txt
- app/assets/database/bulk_ops.py
- app/assets/manager.py
- app/assets/hashing.py
- app/assets/database/tags.py
- app/assets/database/queries.py
✅ Files skipped from review due to trivial changes (1)
- tests-unit/assets_test/queries/test_cache_state.py
🚧 Files skipped from review as they are similar to previous changes (16)
- tests-unit/assets_test/test_assets_missing_sync.py
- tests-unit/assets_test/queries/test_tags.py
- app/assets/database/queries/common.py
- requirements.txt
- tests-unit/assets_test/services/__init__.py
- tests-unit/assets_test/services/test_asset_management.py
- app/assets/database/queries/__init__.py
- tests-unit/assets_test/services/test_enrich.py
- tests-unit/assets_test/services/test_ingest.py
- tests-unit/assets_test/services/test_bulk_ingest.py
- tests-unit/assets_test/test_downloads.py
- app/assets/services/hashing.py
- app/assets/services/__init__.py
- tests-unit/assets_test/test_uploads.py
- tests-unit/assets_test/test_prune_orphaned_assets.py
- alembic_db/versions/0002_merge_to_asset_references.py
```python
if not parsed.tmp_path or not os.path.exists(parsed.tmp_path):
    return _build_error_response(
        404,
        "ASSET_NOT_FOUND",
        "Provided hash not found and no file uploaded.",
    )
```
HTTP 404 is misleading when no file was uploaded and hash doesn't match.
When parsed.tmp_path is missing and the hash wasn't found, the 404 response with "ASSET_NOT_FOUND" is technically about the missing uploaded content, not a missing resource. A 400 (BAD_REQUEST) would better convey that the client's request was incomplete.
Proposed fix
```diff
 if not parsed.tmp_path or not os.path.exists(parsed.tmp_path):
     return _build_error_response(
-        404,
-        "ASSET_NOT_FOUND",
+        400,
+        "BAD_REQUEST",
         "Provided hash not found and no file uploaded.",
     )
```
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if not parsed.tmp_path or not os.path.exists(parsed.tmp_path):
    return _build_error_response(
        400,
        "BAD_REQUEST",
        "Provided hash not found and no file uploaded.",
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/api/routes.py` around lines 359 - 364, The current branch returns
a 404 via _build_error_response when parsed.tmp_path is falsy or missing, which
misrepresents a malformed request; update the conditional that checks
parsed.tmp_path (the block using parsed.tmp_path and
os.path.exists(parsed.tmp_path)) to return a 400 status with an appropriate
error code such as "BAD_REQUEST" and a message like "No file uploaded and
provided hash not found" instead of "ASSET_NOT_FOUND" so callers receive a
client error for the missing uploaded file.
```python
if isinstance(v, list):
    ors = [_exists_clause_for_value(k, elem) for elem in v]
    if ors:
        stmt = stmt.where(sa.or_(*ors))
else:
    stmt = stmt.where(_exists_clause_for_value(k, v))
```
Empty metadata_filter list value silently skips the filter instead of returning no results.
When v is an empty list, ors = [] and if ors: is False, so no WHERE clause is added. A caller passing {"key": []} (semantically "value equals any of nothing") gets back all rows instead of zero rows.
🐛 Proposed fix
```diff
 if isinstance(v, list):
     ors = [_exists_clause_for_value(k, elem) for elem in v]
     if ors:
         stmt = stmt.where(sa.or_(*ors))
+    else:
+        stmt = stmt.where(sa.false())
 else:
     stmt = stmt.where(_exists_clause_for_value(k, v))
```
Verify each finding against the current code and only fix it if needed.
In `@app/assets/database/queries/asset_reference.py` around lines 152 - 157, The
code handling metadata_filter list values improperly skips empty lists; inside
the branch where v is a list (the call-site that builds ors via
_exists_clause_for_value), detect if v is an empty list and in that case add a
predicate that always evaluates to false (e.g., use SQLAlchemy's false() /
literal(False)) to the statement instead of no WHERE clause so the query returns
zero rows; otherwise keep the existing behavior of stmt.where(sa.or_(*ors)) when
ors is non-empty.
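The intended semantics can be sketched without a database — an empty candidate list should match nothing, which is what the `sa.false()` predicate encodes on the SQL side. This in-memory filter is illustrative only, not the project's query code:

```python
def apply_value_filter(rows, key, v):
    """Sketch of the intended semantics: an empty list matches nothing."""
    if isinstance(v, list):
        if not v:
            return []  # equivalent of stmt.where(sa.false())
        return [r for r in rows if r.get(key) in v]
    return [r for r in rows if r.get(key) == v]


rows = [{"kind": "lora"}, {"kind": "ckpt"}]
assert apply_value_filter(rows, "kind", ["lora"]) == [{"kind": "lora"}]
assert apply_value_filter(rows, "kind", []) == []  # not all rows
```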
```python
def restore_references_by_paths(session: Session, file_paths: list[str]) -> int:
    """Restore references that were previously marked as missing.

    Returns number of references restored.
    """
    if not file_paths:
        return 0

    result = session.execute(
        sa.update(AssetReference)
        .where(AssetReference.file_path.in_(file_paths))
        .where(AssetReference.is_missing == True)  # noqa: E712
        .values(is_missing=False)
    )
    return result.rowcount
```
Missing IN-clause chunking in bulk mutation functions will hit SQLite's 999 bind-parameter limit.
The read-side functions (get_reference_ids_by_ids line 986, get_references_by_paths_and_asset_ids line 965) already use iter_chunks(…, MAX_BIND_PARAMS). The mutation functions below pass the full list to a single IN (…) clause, which will raise OperationalError: too many SQL variables for any realistically-sized model library:
| Function | Affected clause | Line |
|---|---|---|
| `restore_references_by_paths` | `file_path.in_(file_paths)` | 664 |
| `delete_assets_by_ids` | `asset_id.in_(asset_ids)` + `Asset.id.in_(asset_ids)` | 701, 703 |
| `bulk_update_needs_verify` | `id.in_(reference_ids)` | 781 |
| `bulk_update_is_missing` | `id.in_(reference_ids)` | 797 |
| `delete_references_by_ids` | `id.in_(reference_ids)` | 812 |
| `bulk_update_enrichment_level` | `id.in_(reference_ids)` | 922 |
🐛 Sketch for bulk_update_needs_verify (same pattern applies to the others)
```diff
 def bulk_update_needs_verify(
     session: Session, reference_ids: list[str], value: bool
 ) -> int:
     if not reference_ids:
         return 0
-    result = session.execute(
-        sa.update(AssetReference)
-        .where(AssetReference.id.in_(reference_ids))
-        .values(needs_verify=value)
-    )
-    return result.rowcount
+    total = 0
+    for chunk in iter_chunks(reference_ids, MAX_BIND_PARAMS):
+        result = session.execute(
+            sa.update(AssetReference)
+            .where(AssetReference.id.in_(chunk))
+            .values(needs_verify=value)
+        )
+        total += int(result.rowcount or 0)
+    return total
```

Also applies to: 693-704, 770-784, 787-801, 804-814, 909-925
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/database/queries/asset_reference.py` around lines 655 - 669, The
mutation functions (restore_references_by_paths, delete_assets_by_ids,
bulk_update_needs_verify, bulk_update_is_missing, delete_references_by_ids,
bulk_update_enrichment_level) currently pass the entire list into a single
IN(...) which will exceed SQLite's 999 bind-parameter limit; fix each by
iterating the input ids/paths using the existing iter_chunks(...,
MAX_BIND_PARAMS) helper, perform the sa.update/sa.delete per chunk (e.g. replace
AssetReference.file_path.in_(file_paths) with .in_(chunk)), accumulate and
return the total rowcount across chunks, and keep early-return for empty inputs.
Ensure the same chunked pattern is applied to both places where
asset_id/Asset.id and reference id lists are used.
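The chunked-IN pattern can be shown self-contained against stdlib `sqlite3`. The `iter_chunks` and `MAX_BIND_PARAMS` names mirror the helpers the review references, but this is a sketch, not the project's actual code:

```python
import itertools
import sqlite3

MAX_BIND_PARAMS = 900  # stay under SQLite's historical 999-variable limit


def iter_chunks(items, size):
    it = iter(items)
    while chunk := list(itertools.islice(it, size)):
        yield chunk


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refs (id INTEGER PRIMARY KEY, needs_verify INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO refs (id) VALUES (?)", [(i,) for i in range(2500)])


def bulk_update_needs_verify(ids, value):
    """Update in chunks so no single statement exceeds the bind-parameter cap."""
    total = 0
    for chunk in iter_chunks(ids, MAX_BIND_PARAMS):
        placeholders = ",".join("?" * len(chunk))
        cur = conn.execute(
            f"UPDATE refs SET needs_verify = ? WHERE id IN ({placeholders})",
            [int(value), *chunk],
        )
        total += cur.rowcount  # accumulate per-chunk rowcounts
    return total


assert bulk_update_needs_verify(list(range(2500)), True) == 2500
```

Modern SQLite builds raise the default variable limit well above 999, but chunking keeps the code safe on any build and keeps per-statement plans small.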
```python
def delete_assets_by_ids(session: Session, asset_ids: list[str]) -> int:
    """Delete assets and their references by ID.

    Returns number of assets deleted.
    """
    if not asset_ids:
        return 0
    session.execute(
        sa.delete(AssetReference).where(AssetReference.asset_id.in_(asset_ids))
    )
    result = session.execute(sa.delete(Asset).where(Asset.id.in_(asset_ids)))
    return result.rowcount
```
result.rowcount returned without int(… or 0) guard in five functions not covered by the earlier review.
delete_assets_by_ids (line 704), bulk_update_needs_verify (line 784), bulk_update_is_missing (line 801), delete_references_by_ids (line 814), and bulk_update_enrichment_level (line 925) all return result.rowcount directly. Like the already-flagged cases at lines 652 and 669, CursorResult.rowcount can be None or -1 for certain driver/statement combinations while each of these functions is typed -> int.
🐛 Proposed fix (apply to all five return sites)
```diff
-    return result.rowcount
+    return int(result.rowcount or 0)
```

Also applies to: 770-784, 787-801, 804-814, 909-925
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/database/queries/asset_reference.py` around lines 693 - 704,
Several functions (delete_assets_by_ids, bulk_update_needs_verify,
bulk_update_is_missing, delete_references_by_ids, bulk_update_enrichment_level)
return result.rowcount directly, but CursorResult.rowcount can be None or -1;
update each return to cast/fallback so the function always returns an int (e.g.
use int(result.rowcount or 0) or (result.rowcount if result.rowcount and
result.rowcount >= 0 else 0)). Locate the session.execute calls that assign to
result in those functions and replace the bare return result.rowcount with a
guarded int conversion/fallback to 0 to satisfy the -> int return type.
```python
def _run_enrich_phase(self, roots: tuple[RootType, ...]) -> int:
    """Run phase 2: enrich existing records with metadata and hashes.

    Returns:
        Total number of assets enriched
    """
    total_enriched = 0
    batch_size = 100
    last_progress_time = time.perf_counter()
    progress_interval = 1.0

    # Get the target enrichment level based on compute_hashes
    if not self._compute_hashes:
        target_max_level = ENRICHMENT_STUB
    else:
        target_max_level = ENRICHMENT_METADATA

    self._emit_event(
        "assets.seed.started",
        {"roots": list(roots), "phase": "enrich"},
    )

    while True:
        if self._check_pause_and_cancel():
            logging.info("Enrich scan cancelled after %d assets", total_enriched)
            break

        # Fetch next batch of unenriched assets
        unenriched = get_unenriched_assets_for_roots(
            roots,
            max_level=target_max_level,
            limit=batch_size,
        )

        if not unenriched:
            break

        enriched, failed = enrich_assets_batch(
            unenriched,
            extract_metadata=True,
            compute_hash=self._compute_hashes,
        )
        total_enriched += enriched

        now = time.perf_counter()
        if now - last_progress_time >= progress_interval:
            self._emit_event(
                "assets.seed.progress",
                {
                    "phase": "enrich",
                    "enriched": total_enriched,
                },
            )
            last_progress_time = now

    return total_enriched
```
Enrichment loop may re-fetch the same rows indefinitely if enrichment fails.
In _run_enrich_phase, get_unenriched_assets_for_roots fetches rows with enrichment_level <= max_level. If enrich_assets_batch fails for a row (e.g., the file can't be read), the enrichment level is not updated (stays at the old level), and the same row will be fetched again in the next iteration of the while True loop. This creates an infinite loop when enrichment consistently fails for some rows.
Consider adding an offset that advances past failed rows, or tracking failed reference IDs to exclude them.
Suggested approach: break if no progress is made
```diff
 while True:
     if self._check_pause_and_cancel():
         logging.info("Enrich scan cancelled after %d assets", total_enriched)
         break
     unenriched = get_unenriched_assets_for_roots(
         roots,
         max_level=target_max_level,
         limit=batch_size,
     )
     if not unenriched:
         break
     enriched, failed = enrich_assets_batch(
         unenriched,
         extract_metadata=True,
         compute_hash=self._compute_hashes,
     )
     total_enriched += enriched
+    # If no progress was made in this batch, break to avoid infinite loop
+    if enriched == 0:
+        if failed > 0:
+            logging.warning(
+                "Enrich phase stopping: %d items failed with no progress",
+                failed,
+            )
+        break
+
     now = time.perf_counter()
```
Verify each finding against the current code and only fix it if needed.
In `@app/assets/seeder.py` around lines 661 - 716, The enrich loop in
_run_enrich_phase can repeatedly re-fetch the same records when
enrich_assets_batch fails for some rows because get_unenriched_assets_for_roots
selects by enrichment_level <= max_level; fix by ensuring progress even on
failures: after calling enrich_assets_batch (function enrich_assets_batch), if
enriched == 0 and failed > 0 either (a) mark the failed asset IDs as
skipped/failed (update their enrichment_level or a failure flag) so they are not
returned by get_unenriched_assets_for_roots, or (b) maintain a local set of
failed IDs and pass them to the fetch (or filter them out) / advance an offset
to avoid refetching the same batch, and as a safety break the loop if no net
progress is made for a full iteration; locate and update logic in
_run_enrich_phase, get_unenriched_assets_for_roots usage, and
enrich_assets_batch handling.
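The safety-break idea generalizes to any fetch/process loop: if a batch yields zero successes while containing failures, re-fetching would return the same rows forever. A minimal sketch with a simulated always-failing item (names are illustrative):

```python
def drain_with_progress_guard(fetch_batch, process_batch):
    """Stop a fetch/process loop when a batch makes no progress,
    so rows that keep failing cannot cause an infinite re-fetch loop."""
    total = 0
    while True:
        batch = fetch_batch()
        if not batch:
            break
        done, failed = process_batch(batch)
        total += done
        if done == 0 and failed > 0:
            break  # no progress this round; bail instead of looping forever
    return total


# Simulated backlog where item 3 always fails to enrich:
backlog = [1, 2, 3]


def fetch_batch():
    return list(backlog)


def process_batch(batch):
    done = 0
    for item in list(batch):
        if item != 3:
            backlog.remove(item)  # success removes the item from the backlog
            done += 1
    return done, len(batch) - done


assert drain_with_progress_guard(fetch_batch, process_batch) == 2
assert backlog == [3]  # the failing item is left behind, and the loop exits
```

The alternative suggested in the prompt — tracking failed IDs and excluding them from the fetch — makes partial progress possible even when failures are interleaved with successes in later batches.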
```python
with create_session() as session:
    pair = fetch_reference_and_asset(
        session, reference_id=reference_id, owner_id=owner_id
    )
    if not pair:
        raise RuntimeError("inconsistent DB state after ingest")
    ref, asset = pair
    tag_names = get_reference_tags(session, reference_id=ref.id)

return UploadResult(
    ref=extract_reference_data(ref),
    asset=extract_asset_data(asset),
    tags=tag_names,
    created_new=ingest_result.asset_created,
)
```
Move extraction of data inside the session context.
After the with create_session() block exits (line 340), the session closes and all ORM attributes are expired. Lines 343-344 then call extract_reference_data(ref) and extract_asset_data(asset) on these detached, expired objects, which will raise DetachedInstanceError when accessing any attribute.
Extract the data while the session is still active:
Proposed fix
```diff
 with create_session() as session:
     pair = fetch_reference_and_asset(
         session, reference_id=reference_id, owner_id=owner_id
     )
     if not pair:
         raise RuntimeError("inconsistent DB state after ingest")
     ref, asset = pair
     tag_names = get_reference_tags(session, reference_id=ref.id)
+    ref_data = extract_reference_data(ref)
+    asset_data = extract_asset_data(asset)

 return UploadResult(
-    ref=extract_reference_data(ref),
-    asset=extract_asset_data(asset),
+    ref=ref_data,
+    asset=asset_data,
     tags=tag_names,
     created_new=ingest_result.asset_created,
 )
```
Verify each finding against the current code and only fix it if needed.
In `@app/assets/services/ingest.py` around lines 333 - 347, The ref and asset are
accessed after the session is closed which will raise DetachedInstanceError;
move the extraction calls into the active session block. Inside the with
create_session() as session: after fetch_reference_and_asset(...) and
get_reference_tags(...), call extract_reference_data(ref) and
extract_asset_data(asset) to produce ref_data and asset_data, then exit the
block and return UploadResult(ref=ref_data, asset=asset_data, tags=tag_names,
created_new=ingest_result.asset_created). Keep references to create_session(),
fetch_reference_and_asset, get_reference_tags, extract_reference_data,
extract_asset_data, and UploadResult to locate the changes.
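The failure mode — attribute access after the owning session closes — can be modeled without SQLAlchemy. This toy `Ref`/`create_session` pair only mimics attribute expiration (SQLAlchemy's default `expire_on_commit=True` behavior) to show why extraction must happen inside the `with` block:

```python
from contextlib import contextmanager
from dataclasses import dataclass


class DetachedInstanceError(RuntimeError):
    pass


@dataclass
class Ref:
    _id: str
    _session_open: list  # shared mutable flag standing in for session state

    @property
    def id(self):
        if not self._session_open[0]:
            raise DetachedInstanceError("attribute access on detached instance")
        return self._id


@contextmanager
def create_session():
    flag = [True]
    try:
        yield flag
    finally:
        flag[0] = False  # closing the session "expires" loaded attributes


# Wrong: accessing attributes after the block, like the original code
with create_session() as flag:
    ref = Ref("ref-1", flag)
try:
    ref.id
    raise AssertionError("expected DetachedInstanceError")
except DetachedInstanceError:
    pass

# Right: extract plain data while the session is still active
with create_session() as flag:
    ref = Ref("ref-1", flag)
    ref_data = {"id": ref.id}
assert ref_data == {"id": "ref-1"}
```

Returning plain dicts or dataclasses (as `extract_reference_data` / `extract_asset_data` do) decouples the response payload from session lifetime entirely.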
```python
tw = st_meta.get("trained_words")
if isinstance(tw, str):
    try:
        meta.trained_words = json.loads(tw)
    except json.JSONDecodeError:
        meta.trained_words = [w.strip() for w in tw.split(",") if w.strip()]
elif isinstance(tw, list):
    meta.trained_words = tw
```
json.loads(tw) result not validated as a list.
On line 267, json.loads(tw) could return any JSON type (dict, string, number, etc.), but it's assigned directly to meta.trained_words which is typed as list[str] | None. A non-list value here would cause subtle downstream issues (e.g., iterating over a string yields characters, iterating over a dict yields keys).
🐛 Proposed fix
```diff
 if isinstance(tw, str):
     try:
-        meta.trained_words = json.loads(tw)
+        parsed = json.loads(tw)
+        if isinstance(parsed, list):
+            meta.trained_words = [str(w) for w in parsed if w]
     except json.JSONDecodeError:
         meta.trained_words = [w.strip() for w in tw.split(",") if w.strip()]
```
meta.trained_words = [w.strip() for w in tw.split(",") if w.strip()]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/assets/services/metadata_extract.py` around lines 264 - 271, The JSON
decode result of tw may not be a list, so in the block handling tw (variable tw,
meta.trained_words in metadata_extract.py) change the assignment to validate the
parsed value: call json.loads(tw) in the try, then if isinstance(parsed, list)
assign meta.trained_words = parsed; elif isinstance(parsed, str) assign
meta.trained_words = [parsed]; otherwise fall back to the comma-split logic
(meta.trained_words = [w.strip() for w in tw.split(",") if w.strip()]) or set
meta.trained_words = None—ensure no non-list type is assigned to
meta.trained_words.
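A standalone sketch of the validated parse (the function name is hypothetical; the guarantee is that only `list[str]` or `None` ever comes back, regardless of what JSON type the string decodes to):

```python
import json


def parse_trained_words(tw):
    """Sketch of the suggested validation: only ever return list[str] or None."""
    if isinstance(tw, list):
        return [str(w) for w in tw if w]
    if isinstance(tw, str):
        try:
            parsed = json.loads(tw)
        except json.JSONDecodeError:
            # Not JSON at all: fall back to comma-separated words
            return [w.strip() for w in tw.split(",") if w.strip()]
        if isinstance(parsed, list):
            return [str(w) for w in parsed if w]
        return None  # valid JSON, but not a list — reject rather than misassign
    return None


assert parse_trained_words('["a", "b"]') == ["a", "b"]
assert parse_trained_words("a, b") == ["a", "b"]
assert parse_trained_words('"just-a-string"') is None  # JSON, but not a list
assert parse_trained_words('{"k": 1}') is None
```

Without the `isinstance(parsed, list)` check, `'"abc"'` would set `trained_words` to the string `"abc"`, and later iteration would yield single characters.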
```python
rows = [
    {"id": str(uuid.uuid4()), "hash": "blake3:bulk1", "size_bytes": 100, "mime_type": "text/plain", "created_at": None},
    {"id": str(uuid.uuid4()), "hash": "blake3:bulk2", "size_bytes": 200, "mime_type": "image/png", "created_at": None},
    {"id": str(uuid.uuid4()), "hash": "blake3:bulk3", "size_bytes": 300, "mime_type": None, "created_at": None},
]
bulk_insert_assets(session, rows)
session.commit()

assets = session.query(Asset).all()
assert len(assets) == 3
hashes = {a.hash for a in assets}
assert hashes == {"blake3:bulk1", "blake3:bulk2", "blake3:bulk3"}


def test_empty_list_is_noop(self, session: Session):
    bulk_insert_assets(session, [])
    session.commit()
    assert session.query(Asset).count() == 0


def test_handles_large_batch(self, session: Session):
    """Test chunking logic with more rows than MAX_BIND_PARAMS allows."""
    rows = [
        {"id": str(uuid.uuid4()), "hash": f"blake3:large{i}", "size_bytes": i, "mime_type": None, "created_at": None}
        for i in range(200)
    ]
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, find and examine the Asset model
find . -type f -name "*.py" -path "*/assets/database/models*" | head -20
```

Repository: Comfy-Org/ComfyUI
Length of output: 92

🏁 Script executed:

```shell
# Check for Asset model definition - search more broadly
rg -n "class Asset" --type py
```

Repository: Comfy-Org/ComfyUI
Length of output: 1092

🏁 Script executed:

```shell
# Look for created_at in asset-related models
rg -n "created_at" --type py -l | grep -E "(model|asset)" | head -10
```

Repository: Comfy-Org/ComfyUI
Length of output: 539

🏁 Script executed:

```shell
# Examine the Asset model definition to check created_at
sed -n '26,70p' app/assets/database/models.py
```

Repository: Comfy-Org/ComfyUI
Length of output: 1723

🏁 Script executed:

```shell
# Check the specific test file section mentioned in the review
sed -n '115,145p' tests-unit/assets_test/queries/test_asset.py
```

Repository: Comfy-Org/ComfyUI
Length of output: 1334

🏁 Script executed:

```shell
# Find and examine bulk_insert_assets function
rg -n "def bulk_insert_assets" --type py -A 15
```

Repository: Comfy-Org/ComfyUI
Length of output: 1162

🏁 Script executed:

```shell
# Check git history to see if these test rows are new
git log --oneline -n 30 -- tests-unit/assets_test/queries/test_asset.py
```

Repository: Comfy-Org/ComfyUI
Length of output: 93
Omit `created_at=None` from bulk insert rows. `Asset.created_at` is non-nullable with a server-side default (`get_utc_now`). SQLAlchemy Core inserts bypass Python defaults, so explicitly passing `None` will attempt to violate the NOT NULL constraint and cause an `IntegrityError`. Remove the `created_at` key from the dictionaries to let the database apply the default.
Suggested fix

```diff
-    {"id": str(uuid.uuid4()), "hash": "blake3:bulk1", "size_bytes": 100, "mime_type": "text/plain", "created_at": None},
-    {"id": str(uuid.uuid4()), "hash": "blake3:bulk2", "size_bytes": 200, "mime_type": "image/png", "created_at": None},
-    {"id": str(uuid.uuid4()), "hash": "blake3:bulk3", "size_bytes": 300, "mime_type": None, "created_at": None},
+    {"id": str(uuid.uuid4()), "hash": "blake3:bulk1", "size_bytes": 100, "mime_type": "text/plain"},
+    {"id": str(uuid.uuid4()), "hash": "blake3:bulk2", "size_bytes": 200, "mime_type": "image/png"},
+    {"id": str(uuid.uuid4()), "hash": "blake3:bulk3", "size_bytes": 300, "mime_type": None},
```

```diff
-    {"id": str(uuid.uuid4()), "hash": f"blake3:large{i}", "size_bytes": i, "mime_type": None, "created_at": None}
+    {"id": str(uuid.uuid4()), "hash": f"blake3:large{i}", "size_bytes": i, "mime_type": None}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        rows = [
            {"id": str(uuid.uuid4()), "hash": "blake3:bulk1", "size_bytes": 100, "mime_type": "text/plain"},
            {"id": str(uuid.uuid4()), "hash": "blake3:bulk2", "size_bytes": 200, "mime_type": "image/png"},
            {"id": str(uuid.uuid4()), "hash": "blake3:bulk3", "size_bytes": 300, "mime_type": None},
        ]
        bulk_insert_assets(session, rows)
        session.commit()

        assets = session.query(Asset).all()
        assert len(assets) == 3
        hashes = {a.hash for a in assets}
        assert hashes == {"blake3:bulk1", "blake3:bulk2", "blake3:bulk3"}

    def test_empty_list_is_noop(self, session: Session):
        bulk_insert_assets(session, [])
        session.commit()
        assert session.query(Asset).count() == 0

    def test_handles_large_batch(self, session: Session):
        """Test chunking logic with more rows than MAX_BIND_PARAMS allows."""
        rows = [
            {"id": str(uuid.uuid4()), "hash": f"blake3:large{i}", "size_bytes": i, "mime_type": None}
            for i in range(200)
        ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-unit/assets_test/queries/test_asset.py` around lines 115 - 138, The
tests pass explicit "created_at": None into bulk_insert_assets which bypasses
the DB server default and violates Asset.created_at NOT NULL; update the test
data generation in the methods that build rows (e.g. the rows list in the
bulk-insert test and the rows comprehension in test_handles_large_batch) to omit
the "created_at" key entirely so the database/server_default (get_utc_now) is
used; keep using bulk_insert_assets(session, rows) and session.commit() as
before and do not set created_at anywhere in those dictionaries.
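The constraint failure described above is easy to demonstrate outside SQLAlchemy: a database-level default applies only when the column is omitted from the INSERT, while an explicit NULL is inserted as-is. A self-contained sqlite3 sketch (the schema here is a simplified stand-in for the Asset table, not the PR's actual DDL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE asset ("
    " id TEXT PRIMARY KEY,"
    " hash TEXT NOT NULL,"
    " created_at TEXT NOT NULL DEFAULT (datetime('now')))"
)

# Omitting created_at lets the database fill in its default.
conn.execute("INSERT INTO asset (id, hash) VALUES (?, ?)", ("a1", "blake3:ok"))
created = conn.execute("SELECT created_at FROM asset WHERE id = 'a1'").fetchone()[0]
print(created is not None)  # -> True: server-side default was applied

# Passing an explicit None bypasses the default and hits the NOT NULL constraint.
try:
    conn.execute(
        "INSERT INTO asset (id, hash, created_at) VALUES (?, ?, ?)",
        ("a2", "blake3:bad", None),
    )
except sqlite3.IntegrityError as exc:
    print(type(exc).__name__)  # -> IntegrityError
```

The same logic is why dropping the `created_at` key from the row dictionaries, rather than setting it to `None`, is the correct fix for the bulk insert.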
Summary
This PR refactors the assets package to support asynchronous background scanning, improving startup performance by not blocking on asset discovery.
Key Changes
Architecture Refactoring
New Features
`is_missing` flag instead of deletion
Cleanup