[NSFS | GLACIER] Add support for parallel recall and migrate #9183


Merged

Conversation

tangledbytes
Member

@tangledbytes tangledbytes commented Aug 5, 2025

Describe the Problem

Currently, NooBaa allows only one glacier operation to run at a time: either a migration or a restore can be in flight, but not both. This was done to avoid several kinds of race conditions. It doesn't create any "correctness" problem; however, it means that we don't fully utilize the underlying tape drives.

Another problem this PR addresses is that PersistentLogger.process would delete the file after the callback returned. This can create a race because the file gets "moved" outside of the lock.

Explain the Changes

This PR addresses the parallel processing of migration and recall by splitting the processing into 2 different stages:

  1. Stage 1 - "stages" the files that need to be processed. This stage still requires a cluster-wide lock, which ensures that all the processes get a consistent view of the system state. It involves reading each log file line by line, opening each entry, stat-ting it, and modifying its xattrs. These steps should take a negligible amount of time compared to the actual tape migration and recall.
  2. Stage 2 - This stage doesn't acquire any cluster-wide lock, but it still acquires a lock to ensure that only one restore and one migrate are running at a time (i.e. at most one migrate and one restore). This ensures that in a multi-node cluster we don't end up in a situation where multiple nodes try to consume the same file. As a future improvement to this stage, I would want to use Promise.all([this._migrate...]) to run multiple migrations concurrently within the same process, which could further improve drive utilization.
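The two-stage flow above can be sketched as follows. This is an illustrative simplification, not NooBaa's actual API: the toy mutex stands in for the cluster-wide and per-type file locks, and perform, stage_fn, and run_fn are hypothetical names.

```javascript
// A toy async mutex standing in for a cluster-wide / per-type fcntl lock.
function make_lock() {
    let chain = Promise.resolve();
    return fn => {
        const run = chain.then(fn);
        chain = run.catch(() => {});
        return run;
    };
}

const cluster_lock = make_lock();  // stage 1: shared by all operation types
const migrate_lock = make_lock();  // stage 2: one migrate at a time

async function perform(entries, stage_fn, run_fn) {
    // Stage 1: cheap bookkeeping (read log, stat, set xattrs) under the
    // cluster-wide lock so every process sees a consistent state.
    const staged = await cluster_lock(async () => entries.map(stage_fn));

    // Stage 2: the slow tape operation under only the per-type lock, so a
    // migrate and a restore can now run in parallel with each other.
    return migrate_lock(async () => staged.map(run_fn));
}
```

The key point is that only the fast stage serializes cluster-wide; the slow tape work holds just its own per-type lock.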

To address the race in the delete scenario, this PR acquires a lock on the file before handing it over to the callback and releases the lock once the callback returns. This has some implications:

  1. The callback needs to make sure that it doesn't try to acquire a lock on the file again.
  2. It needs to make sure that it doesn't deinit the LogFile, as that would close the file and lose the lock.
  3. Note that the callback no longer receives the filename as its parameter, but rather the LogFile object.

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

  1. ./node_modules/.bin/mocha src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • New Features

    • Added a staging phase for migration and restore flows and a config flag to enable DMAPI checks before finalizing restores (disabled by default).
  • Refactor

    • Centralized orchestration for migration/restore/expiry and unified operation execution.
    • Introduced a log-file abstraction and shared open-with-lock utilities to simplify log handling.
  • Bug Fixes

    • Improved error reporting, cleanup, and restore finalization handling.
  • Tests

    • Simplified and centralized backend test helpers and updated migration/restore tests.

coderabbitai bot commented Aug 5, 2025

Walkthrough

Adds a staging layer and centralized orchestration/locking for Glacier migration/restore/expiry, introduces a LogFile abstraction and open_with_lock utility, updates restore finalization to optionally check DMAPI xattrs via a new config flag, and adapts utilities and tests to the new LogFile APIs.

Changes

Cohort / File(s) Change Summary
Configuration Flag Addition
config.js
Adds boolean NSFS_GLACIER_DMAPI_FINALIZE_RESTORE_ENABLE (default false) to control DMAPI xattr checks during restore finalization.
Glacier Orchestration (manager layer)
src/manage_nsfs/manage_nsfs_glacier.js
Removes per-operation lock wrappers and helpers; replaces per-log processing with unified backend.perform() calls for MIGRATION/RESTORE/EXPIRY while keeping decision logic and timestamp updates.
Glacier SDK — staging & perform API
src/sdk/glacier.js
Adds staging constants (XATTR_STAGE_MIGRATE, MIGRATE_STAGE_WAL_NAME, RESTORE_STAGE_WAL_NAME), cluster/per-type/scan locks, staging entrypoints (stage_migrate, stage_restore), and unified perform(fs_context, type) orchestrator; updates xattr handling and docs.
TapeCloud Glacier — staged flows & finalize
src/sdk/glacier_tapecloud.js
Adds stage_migrate/stage_restore, refactors migrate/restore to use staging and failure_recorder, extends _finalize_restore to accept optional failure_recorder and perform DMAPI premigration xattr check when enabled; improves task-id parsing and logging.
Persistent logging abstraction
src/util/persistent_logger.js
Introduces exported LogFile class (init/deinit/collect/collect_and_process), refactors PersistentLogger.init() to use open_with_lock, and updates _process/process to operate on LogFile instances with centralized lifecycle and deletion handling.
Native FS utilities
src/util/native_fs_utils.js
Tightens lock_and_run typing and adds exported open_with_lock to open files with optional EXCLUSIVE/SHARED locks, retries, backoff, and inode/link validation.
File reader lifecycle
src/util/file_reader.js
NewlineReader.close() now clears the internal file handle before closing and adds JSDoc noting the reader remains usable and will reinitialize on subsequent reads.
Bucket logs utility
src/util/bucket_logs_utils.js
Removes LogFile wrapper and fs_context param; _upload_to_targets signature now (s3_connection, log_file, bucket_to_owner_keys_func) and uses log_file.log_path and log_file.collect_and_process.
Notifications util
src/util/notifications_util.js
Removes in-function construction of LogFile; _notify signature now accepts a LogFile instance and failure_append and uses the provided LogFile directly; process_notification_files now closes logs and clears connections in finally.
Manage NSFS Glacier tests
src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
Adds get_patched_backend() test helper to produce deterministic backends; replaces inline backend patches and low-level WAL iteration with collect_and_process usage and backend.perform() calls for restores.

Sequence Diagram(s)

sequenceDiagram
    participant Manager as manage_nsfs_glacier.js
    participant Backend as Glacier / TapeCloudGlacier
    participant Logger as PersistentLogger
    participant LogFile as LogFile

    Manager->>Backend: perform(fs_context, "MIGRATION"/"RESTORE"/"EXPIRY")
    alt MIGRATION or RESTORE
        Backend->>Logger: process(primary logs)  -- acquire GLACIER_CLUSTER_LOCK
        Logger->>LogFile: init (acquire log lock)
        LogFile->>Backend: stage_migrate/stage_restore(log_file, failure_recorder)
        Backend->>Logger: process(staged logs)  -- acquire per-type cluster lock
        Logger->>LogFile: init (staged log lock)
        LogFile->>Backend: migrate/restore(log_file, failure_recorder)
        Backend->>Backend: finalize (e.g., _finalize_restore with optional DMAPI check)
    else EXPIRY
        Backend->>Logger: process(expiry logs)  -- guarded by GLACIER_SCAN_LOCK
        Logger->>LogFile: init (acquire log lock)
        LogFile->>Backend: process expiry entries
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

size/XXL

Suggested reviewers

  • jackyalbo


* }} [cfg]
* @returns {Promise<nb.NativeFile | undefined>}
*/
async function open_with_lock(fs_context, file_path, flags, mode, cfg = {}) {
Member Author


Consider using this in the newline reader. However, the newline reader wants to be able to read even a deleted file, I guess.

@tangledbytes reevaluate.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (5)
src/native/fs/fs_napi.cpp (1)

1897-1954: Well-implemented non-blocking file locking with minor typo.

The implementation correctly handles non-blocking file locking semantics:

  • Uses F_OFD_SETLK for non-blocking operation
  • Properly distinguishes between lock unavailable (EAGAIN) and actual errors
  • Returns boolean result to indicate lock acquisition success
  • Follows established patterns from FileFcntlLock

Fix the typo in the comment:

-            // Failed to aquire lock because another process
+            // Failed to acquire lock because another process
src/util/native_fs_utils.js (1)

800-802: Consider additional file validation checks

The current validation only checks inode match and nlink > 0. Consider adding additional checks like file size or modification time to ensure the file hasn't been replaced.

             if (fh_stat.ino === path_stat.ino && fh_stat.nlink > 0) {
+                // Optionally validate other attributes like size or mtime
+                // if specific consistency requirements exist
                 return file;
             }
src/sdk/glacier_tapecloud.js (1)

217-217: Inconsistent file handle management

At line 217, entry_fh is closed without checking if it's defined, but at line 251 optional chaining is used. Consider using a consistent approach.

-                    await entry_fh?.close(fs_context);
+                    if (entry_fh) await entry_fh.close(fs_context);

Also applies to: 251-251

src/util/persistent_logger.js (1)

250-253: Consider making NewlineReader options configurable

The NewlineReader is created with hardcoded options. Consider making these configurable for flexibility.

-    constructor(fs_context, log_path) {
+    constructor(fs_context, log_path, options = {}) {
         this.fs_context = fs_context;
         this.log_path = log_path;
         this.log_reader = new NewlineReader(
             this.fs_context,
-            this.log_path, { lock: 'EXCLUSIVE', skip_overflow_lines: true, skip_leftover_line: true },
+            this.log_path, { 
+                lock: 'EXCLUSIVE', 
+                skip_overflow_lines: true, 
+                skip_leftover_line: true,
+                ...options 
+            },
         );
     }
src/sdk/glacier.js (1)

104-112: Consider adding validation in base staging methods

The base implementations of stage_migrate and stage_restore simply copy all entries without validation. While subclasses can override these, consider adding basic validation in the base class.

     async stage_migrate(fs_context, log_file, failure_recorder) {
         try {
-            await log_file.collect(Glacier.MIGRATE_STAGE_WAL_NAME, async (entry, batch_recorder) => batch_recorder(entry));
+            await log_file.collect(Glacier.MIGRATE_STAGE_WAL_NAME, async (entry, batch_recorder) => {
+                // Basic validation: check if file exists
+                try {
+                    await nb_native().fs.stat(fs_context, entry);
+                    await batch_recorder(entry);
+                } catch (err) {
+                    if (err.code !== 'ENOENT') {
+                        await failure_recorder(entry);
+                    }
+                }
+            });
             return true;
         } catch (error) {
             dbg.error('Glacier.stage_migrate error:', error);
             throw error;
         }
     }

Also applies to: 145-153

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f559d9b and cb099e6.

📒 Files selected for processing (12)
  • config.js (1 hunks)
  • src/manage_nsfs/manage_nsfs_glacier.js (1 hunks)
  • src/native/fs/fs_napi.cpp (4 hunks)
  • src/sdk/glacier.js (7 hunks)
  • src/sdk/glacier_tapecloud.js (6 hunks)
  • src/sdk/nb.d.ts (1 hunks)
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (2 hunks)
  • src/util/bucket_logs_utils.js (4 hunks)
  • src/util/file_reader.js (1 hunks)
  • src/util/native_fs_utils.js (2 hunks)
  • src/util/notifications_util.js (3 hunks)
  • src/util/persistent_logger.js (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/test/**/*.*

⚙️ CodeRabbit Configuration File

src/test/**/*.*: Ensure that the PR includes tests for the changes.

Files:

  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
🧠 Learnings (1)
📚 Learning: the noobaa config.js file has a built-in environment variable override system via `load_config_env_o...
Learnt from: liranmauda
PR: noobaa/noobaa-core#9174
File: config.js:1044-1046
Timestamp: 2025-07-30T10:20:09.402Z
Learning: The NooBaa config.js file has a built-in environment variable override system via `load_config_env_overrides()` function. Any config property can be overridden using environment variables prefixed with `CONFIG_JS_`. For example, `CONFIG_JS_NC_HEALTH_BUCKETS_COUNT_LIMIT=10000` would override `config.NC_HEALTH_BUCKETS_COUNT_LIMIT`. The system handles type conversion automatically and provides consistent override patterns across all config values.

Applied to files:

  • config.js
🧬 Code Graph Analysis (4)
src/util/file_reader.js (1)
src/manage_nsfs/manage_nsfs_glacier.js (2)
  • fh (96-96)
  • fh (98-98)
src/util/native_fs_utils.js (1)
src/util/debug_module.js (2)
  • nb_native (26-26)
  • file (60-60)
src/manage_nsfs/manage_nsfs_glacier.js (3)
src/util/native_fs_utils.js (3)
  • fs_context (549-554)
  • config (10-10)
  • path (7-7)
src/manage_nsfs/manage_nsfs_cli_utils.js (2)
  • fs_context (125-125)
  • native_fs_utils (7-7)
src/sdk/glacier.js (3)
  • config (9-9)
  • path (4-4)
  • native_fs_utils (11-11)
src/sdk/glacier_tapecloud.js (5)
src/util/native_fs_utils.js (9)
  • fs_context (549-554)
  • dbg (4-4)
  • nb_native (12-12)
  • nb_native (670-670)
  • stat (401-401)
  • stat (449-449)
  • stat (598-598)
  • err (353-353)
  • config (10-10)
src/sdk/glacier.js (3)
  • dbg (8-8)
  • nb_native (5-5)
  • config (9-9)
src/util/persistent_logger.js (2)
  • dbg (9-9)
  • nb_native (5-5)
src/util/file_reader.js (1)
  • nb_native (5-5)
config.js (2)
  • stat (1348-1348)
  • config (7-7)
🔇 Additional comments (18)
config.js (1)

919-922: LGTM! Well-documented configuration flag added.

The new NSFS_GLACIER_DMAPI_FINALIZE_RESTORE_ENABLE configuration flag is properly named, well-documented, and follows existing conventions. The conservative default value of false is appropriate for a feature that prevents accidental blocking reads during restore finalization.

src/sdk/nb.d.ts (1)

1033-1033: LGTM! Consistent non-blocking lock method added.

The new fcntltrylock method is well-designed with a signature consistent with existing lock methods. The Promise<boolean> return type appropriately indicates the success/failure of the non-blocking lock attempt, providing useful functionality for concurrent file operations.

src/util/file_reader.js (1)

203-206: LGTM! Improved file handle lifecycle management.

The refactored close method enhances safety by immediately clearing the instance's file handle reference (this.fh = null) before performing the asynchronous close operation. This prevents potential race conditions and accidental reuse of the file handle during or after the close operation, which is particularly important in concurrent scenarios.
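The clear-before-close pattern described above can be illustrated in a few lines. This is a deliberately simplified stand-in, not the actual NewlineReader: the handle is a plain object and close() on it is simulated, but the ordering (null out the instance field first, then close the captured local copy) is the point being shown.

```javascript
class Reader {
    constructor() { this.fh = null; }
    async open() { if (!this.fh) this.fh = { closed: false }; }
    async close() {
        const fh = this.fh;
        this.fh = null;           // from here on, no caller can reuse the stale handle
        if (fh) fh.closed = true; // stands in for `await fh.close(fs_context)`
    }
    async read() {
        await this.open();        // reader reinitializes transparently after close()
        return this.fh;
    }
}
```

Because the instance field is cleared before the asynchronous close completes, a concurrent read() during close() sees a null handle and opens a fresh one instead of touching the one being torn down.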

src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (2)

179-179: LGTM! Correctly adapted to the new LogFile abstraction.

This change properly uses the log_reader property from the LogFile instance instead of manually creating a NewlineReader. This aligns with the refactored log file handling system where LogFile manages the reader internally with proper locking.


221-221: LGTM! Correctly updated for new restore method signature.

The addition of the third argument async () => null as a failure recorder callback aligns with the updated restore method signature in the Glacier backend. The no-op callback is appropriate for this test scenario where failure handling isn't specifically tested.

src/util/notifications_util.js (3)

6-6: LGTM! Import correctly updated for refactored design.

Removing the LogFile import is appropriate since LogFile instances are now passed directly to the _notify method rather than being constructed locally. The PersistentLogger import is correctly retained.


100-100: LGTM! Call correctly simplified for new interface.

The simplified call to _notify with only file and failure_append parameters aligns with the refactored method signature. The fs_context parameter is no longer needed since the LogFile instance encapsulates the necessary context.


116-120: LGTM! Method signature correctly updated for LogFile abstraction.

The updated _notify method signature properly accepts a LogFile instance directly rather than constructing one internally. The JSDoc comments are appropriately updated to reflect the new parameter types. This change aligns with the broader refactoring to use the LogFile abstraction for better resource management and locking consistency.

src/native/fs/fs_napi.cpp (3)

1458-1458: LGTM!

The method registration correctly follows the established pattern used by other FileWrap methods.


1489-1489: LGTM!

The method declaration is consistent with other FileWrap methods and follows the established pattern.


2114-2118: LGTM!

The method implementation correctly follows the established pattern used by other FileWrap methods, properly utilizing the api<> template to create and queue the worker.

src/util/bucket_logs_utils.js (4)

9-9: LGTM!

The removal of the unused LogFile import is correct and aligns with the refactoring where LogFile instances are now provided by PersistentLogger.process.


39-39: LGTM!

The removal of the fs_context parameter is correct and consistent with the updated _upload_to_targets function signature. The LogFile instance provided by PersistentLogger.process should contain all necessary context.


55-59: LGTM!

The function signature simplification and updated JSDoc documentation correctly reflect the API changes:

  • Removal of the redundant fs_context parameter
  • Direct use of LogFile instance parameter
  • Proper JSDoc type annotation for the log_file parameter

63-64: LGTM!

The updated usage correctly leverages the provided LogFile instance:

  • Using log_file.log_path for logging maintains the same information while working with the new API
  • Direct call to log_file.collect_and_process() eliminates redundant LogFile creation
  • Consistent error logging with log_file.log_path

Also applies to: 101-101

src/util/native_fs_utils.js (2)

724-736: Potential resource management issue with lock release

The current implementation always closes the file descriptor in the finally block, but if the callback uses release_lock() to unlock, the file descriptor might be in an inconsistent state. Consider tracking whether the lock was released to avoid potential issues.

 async function lock_and_run(fs_context, lock_path, cb) {
     const lockfd = await nb_native().fs.open(fs_context, lock_path, 'w');
+    let lock_released = false;
     const release_lock = async () => {
+        if (lock_released) return;
         await lockfd.fcntllock(fs_context, 'UNLOCK');
+        lock_released = true;
     };
 
     try {
         await lockfd.fcntllock(fs_context, 'EXCLUSIVE');
         await cb(release_lock);
     } finally {
-        await lockfd.close(fs_context);
+        if (!lock_released) {
+            await lockfd.fcntllock(fs_context, 'UNLOCK');
+        }
+        await lockfd.close(fs_context);
     }
 }

Likely an incorrect or invalid review comment.


763-791: Incorrect retry logic and unnecessary delays

The retry logic has several issues:

  1. The loop decrements retries at the start, causing one less retry than configured
  2. There's an unnecessary delay after successful non-blocking lock acquisition
-    for (; retries > 0; retries -= 1) {
+    while (retries > 0) {
         try {
             file = await nb_native().fs.open(fs_context, file_path, flags, mode);
 
             if (cfg.lock && cfg.lock_no_block) {
                 if (!await file.fcntltrylock(fs_context, cfg.lock)) {
+                    await file.close(fs_context);
+                    file = undefined;
+                    retries -= 1;
                     await P.delay(backoff * (1 + Math.random()));
                     continue;
                 }
             }

Likely an incorrect or invalid review comment.

src/manage_nsfs/manage_nsfs_glacier.js (1)

31-34: The low_free_space check for restores is correct—no change needed

Restores bring data back onto disk and consume space, so it’s intentional to skip restore operations when free space is low. You can safely ignore the earlier suggestion to invert this condition.

Likely an incorrect or invalid review comment.

@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch from cb099e6 to 0b11eac Compare August 5, 2025 10:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
src/sdk/glacier_tapecloud.js (2)

503-503: Add documentation for the intentional error case

As noted in past comments, this section needs documentation to explain why the error is thrown when failure_recorder is not provided.

Add a comment explaining the intentional error behavior:

+            // Intentionally throw error to fail the entire batch and preserve log file for retry
+            // This prevents files from being silently stuck in an unrestored state
             if (!failure_recorder) {
                 throw new Error('restored file not actually restored');
             }

344-344: Fix the incorrect time comparison condition

The condition stat.ctime <= stat.mtime will never be true as noted in previous reviews. This needs to be removed as mentioned in the past comments.

Apply this diff to fix the condition:

-                    if (stat.xattr[Glacier.XATTR_STAGE_MIGRATE] || stat.ctime <= stat.mtime) {
+                    if (stat.xattr[Glacier.XATTR_STAGE_MIGRATE]) {
src/sdk/glacier.js (2)

156-156: Fix reference to non-existent function

The comment references restore_namespace() which doesn't exist, as noted in previous reviews.

Apply this diff to fix the comment:

-     * restore must take a log file (from `restore_namespace()`) which will
+     * restore must take a log file (from `Glacier.RESTORE_STAGE_WAL_NAME`) which will

238-240: Add missing await in async arrow function

The arrow function is missing an await before the process_glacier_logs call, as noted in previous reviews.

Apply this diff to add the missing await:

             await native_fs_utils.lock_and_run(
-                fs_context, lock_path(lock_file), () => process_glacier_logs(namespace_stage, log_process_cb)
+                fs_context, lock_path(lock_file), async () => await process_glacier_logs(namespace_stage, log_process_cb)
             );
🧹 Nitpick comments (1)
src/sdk/glacier.js (1)

243-260: Consider making operation type handling more explicit

The current if statements work correctly, but the mutual exclusivity of operation types could be made more explicit with if-else if structure.

Consider this more explicit structure:

-        if (type === 'MIGRATION') {
+        if (type === 'MIGRATION') {
             await run_operation(
                 Glacier.MIGRATE_WAL_NAME,
                 Glacier.MIGRATE_STAGE_WAL_NAME,
                 this.stage_migrate.bind(this),
                 this.migrate.bind(this),
                 Glacier.GLACIER_MIGRATE_CLUSTER_LOCK,
             );
-        }
-        if (type === 'RESTORE') {
+        } else if (type === 'RESTORE') {
             await run_operation(
                 Glacier.RESTORE_WAL_NAME,
                 Glacier.RESTORE_STAGE_WAL_NAME,
                 this.stage_restore.bind(this),
                 this.restore.bind(this),
                 Glacier.GLACIER_RESTORE_CLUSTER_LOCK,
             );
         }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb099e6 and 0b11eac.

📒 Files selected for processing (12)
  • config.js (1 hunks)
  • src/manage_nsfs/manage_nsfs_glacier.js (1 hunks)
  • src/native/fs/fs_napi.cpp (4 hunks)
  • src/sdk/glacier.js (7 hunks)
  • src/sdk/glacier_tapecloud.js (6 hunks)
  • src/sdk/nb.d.ts (1 hunks)
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (2 hunks)
  • src/util/bucket_logs_utils.js (4 hunks)
  • src/util/file_reader.js (1 hunks)
  • src/util/native_fs_utils.js (2 hunks)
  • src/util/notifications_util.js (3 hunks)
  • src/util/persistent_logger.js (6 hunks)
✅ Files skipped from review due to trivial changes (1)
  • config.js
🚧 Files skipped from review as they are similar to previous changes (9)
  • src/manage_nsfs/manage_nsfs_glacier.js
  • src/util/file_reader.js
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
  • src/sdk/nb.d.ts
  • src/util/bucket_logs_utils.js
  • src/util/native_fs_utils.js
  • src/native/fs/fs_napi.cpp
  • src/util/notifications_util.js
  • src/util/persistent_logger.js
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:507-509
Timestamp: 2025-08-05T10:05:51.938Z
Learning: In TapeCloudGlacier._finalize_restore method, when failure_recorder is not provided but the DMAPI premig check fails, the code intentionally throws an error to cause the entire batch to fail and preserve the log file for retry, preventing files from being silently stuck in an unrestored state.
📚 Learning: in tapecloudglacier._finalize_restore method, when failure_recorder is not provided but the dmapi pr...
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:507-509
Timestamp: 2025-08-05T10:05:51.938Z
Learning: In TapeCloudGlacier._finalize_restore method, when failure_recorder is not provided but the DMAPI premig check fails, the code intentionally throws an error to cause the entire batch to fail and preserve the log file for retry, preventing files from being silently stuck in an unrestored state.

Applied to files:

  • src/sdk/glacier_tapecloud.js
  • src/sdk/glacier.js
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-package-lock-validation
  • GitHub Check: run-jest-unit-tests
🔇 Additional comments (8)
src/sdk/glacier_tapecloud.js (5)

18-18: LGTM: Clean JSDoc import for type annotations

The JSDoc import for LogFile is appropriate and aligns with the new LogFile abstraction from persistent_logger.


199-260: Well-implemented staging method with proper error handling

The stage_migrate method correctly implements the staging pattern with:

  • Proper file handle management and cleanup
  • Appropriate error handling for ENOENT cases
  • Staging xattr marking before batch recording
  • Comprehensive failure recording

268-306: Solid implementation of migration processing

The migrate method correctly separates migration execution from staging cleanup. The approach of removing staging xattrs from all files (regardless of individual migration success) is appropriate since failed files are recorded for retry.


378-408: Well-structured restore processing with proper callback handling

The restore method correctly implements the two-phase restore approach with appropriate success/failure callbacks and conditional batch processing.


491-513: DMAPI premig check implementation is correct

The DMAPI premigration check is properly implemented. The error throwing behavior when failure_recorder is not provided (lines 507-509) is intentional per the retrieved learning - it ensures the entire batch fails and preserves the log file for retry, preventing files from being silently stuck in an unrestored state.

src/sdk/glacier.js (3)

4-89: New imports and constants properly support staging architecture

The added imports (path, PersistentLogger, native_fs_utils) and new constants for staging WAL namespaces, cluster locks, and staging xattr are well-structured and consistent with the overall refactor.


104-112: Appropriate base implementation for staging

The stage_migrate method provides a solid base implementation that simply transfers entries to the staging namespace, allowing subclasses to implement more sophisticated staging logic.


446-446: Correct addition of staging xattr to should_restore check

Adding Glacier.XATTR_STAGE_MIGRATE to the xattr keys ensures proper staging state checking during restore operations.

@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch from 0b11eac to 5452e21 Compare August 6, 2025 10:26
@tangledbytes tangledbytes requested a review from guymguym August 7, 2025 07:13
@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch from 5452e21 to 35a442a Compare August 7, 2025 12:46
@tangledbytes
Member Author

@alphaprinz - This PR attempts to deal with the race between unlink and init of PersistentLogger.

@alphaprinz @jackyalbo the change in the PersistentLogger.process affects the notification util and bucket notifications. I have tried making appropriate changes, please lemme know if I missed something.

@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch from 35a442a to 5b850e1 Compare August 7, 2025 15:29
@alphaprinz
Contributor

@tangledbytes Nsfs panic was not reproduced with the fix. Great job :)

The changes to notification_utils are minor, I have no problems with them.

Contributor

@alphaprinz alphaprinz left a comment

Not familiar with the Glacier content.
Had some minor notes regarding comments.

@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch 3 times, most recently from 57f9bef to 09fbdf6 Compare August 18, 2025 04:45

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🔭 Outside diff range comments (1)
src/sdk/glacier_tapecloud.js (1)

204-266: Ensure file-handle close calls are always awaited on error and cleanup paths

A couple of close calls aren’t awaited, which can leak descriptors and hold locks longer than needed if the promise rejects.

                     await entry_fh?.close(fs_context);
 ...
-                } finally {
-                    entry_fh?.close(fs_context);
-                }
+                } finally {
+                    await entry_fh?.close(fs_context);
+                }
🧹 Nitpick comments (4)
src/util/native_fs_utils.js (1)

764-769: includes_any should return a boolean in all cases

The helper returns true or undefined. Make it explicit to avoid accidental truthiness/falsiness bugs and improve readability.

-    const includes_any = (string, ...chars) => {
-        for (const ch of chars) {
-            if (string.includes(ch)) return true;
-        }
-    };
+    const includes_any = (string, ...chars) => {
+        for (const ch of chars) {
+            if (string.includes(ch)) return true;
+        }
+        return false;
+    };
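For illustration, the boolean-returning form behaves identically under truthiness but differs under strict comparison, which is exactly where the implicit `undefined` could bite:

```javascript
// Boolean-returning variant of the helper discussed above.
const includes_any = (string, ...chars) => {
    for (const ch of chars) {
        if (string.includes(ch)) return true;
    }
    return false;
};
```

With the original version, `includes_any('r', '+') === false` evaluates to `false` (the call returns `undefined`); with the explicit return it evaluates to `true`, while plain truthiness checks behave the same either way.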
src/manage_nsfs/manage_nsfs_glacier.js (1)

56-58: Fix function name typo: prepare_galcier_fs_context → prepare_glacier_fs_context

Minor typo in the helper name; rename for clarity and consistency.

-        await backend.perform(prepare_galcier_fs_context(fs_context), "EXPIRY");
+        await backend.perform(prepare_glacier_fs_context(fs_context), "EXPIRY");
-    await backend.perform(prepare_galcier_fs_context(fs_context), "RESTORE");
+    await backend.perform(prepare_glacier_fs_context(fs_context), "RESTORE");
-function prepare_galcier_fs_context(fs_context) {
+function prepare_glacier_fs_context(fs_context) {

Also applies to: 117-127, 21-23, 36-38

src/sdk/glacier.js (2)

114-121: Fix typo in documentation (“LofFile” → “LogFile”) and clarify namespace

Doc nit; correct type name improves readability for consumers.

-     * migrate must take a LofFile object (this should from the
+     * migrate must take a LogFile object (this should be from the

449-457: Include only-needed xattrs in should_restore; XATTR_STAGE_MIGRATE inclusion is redundant here

You fetch XATTR_STAGE_MIGRATE in should_restore but don’t use it in the function. It’s checked in stage_restore. Consider removing it here to reduce the xattr payload fetched by the stat call, unless you plan to consult it here as well.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6ffbf13 and 09fbdf6.

📒 Files selected for processing (10)
  • config.js (1 hunks)
  • src/manage_nsfs/manage_nsfs_glacier.js (1 hunks)
  • src/sdk/glacier.js (7 hunks)
  • src/sdk/glacier_tapecloud.js (7 hunks)
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (4 hunks)
  • src/util/bucket_logs_utils.js (4 hunks)
  • src/util/file_reader.js (1 hunks)
  • src/util/native_fs_utils.js (3 hunks)
  • src/util/notifications_util.js (3 hunks)
  • src/util/persistent_logger.js (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/util/file_reader.js
  • src/util/bucket_logs_utils.js
  • src/util/notifications_util.js
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
  • src/util/persistent_logger.js
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-05T10:05:51.959Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:507-509
Timestamp: 2025-08-05T10:05:51.959Z
Learning: In TapeCloudGlacier._finalize_restore method, when failure_recorder is not provided but the DMAPI premig check fails, the code intentionally throws an error to cause the entire batch to fail and preserve the log file for retry, preventing files from being silently stuck in an unrestored state.

Applied to files:

  • src/sdk/glacier_tapecloud.js
  • src/sdk/glacier.js
🧬 Code Graph Analysis (5)
config.js (16)
src/manage_nsfs/manage_nsfs_glacier.js (1)
  • config (6-6)
src/util/notifications_util.js (1)
  • config (5-5)
src/util/native_fs_utils.js (1)
  • config (10-10)
src/server/node_services/nodes_monitor.js (1)
  • config (15-15)
src/sdk/namespace_fs.js (1)
  • config (13-13)
src/endpoint/endpoint.js (1)
  • config (18-18)
src/manage_nsfs/nc_lifecycle.js (1)
  • config (10-10)
src/util/http_utils.js (1)
  • config (19-19)
src/util/fork_utils.js (1)
  • config (11-11)
src/cmd/manage_nsfs.js (1)
  • config (18-18)
src/sdk/bucketspace_fs.js (1)
  • config (9-9)
src/endpoint/s3/s3_rest.js (1)
  • config (15-15)
src/cmd/nsfs.js (1)
  • config (20-20)
src/agent/block_store_services/block_store_base.js (1)
  • config (9-9)
src/util/debug_module.js (1)
  • config (24-24)
src/server/bg_workers.js (1)
  • config (16-16)
src/sdk/glacier_tapecloud.js (4)
src/sdk/glacier.js (2)
  • nb_native (5-5)
  • config (9-9)
src/util/persistent_logger.js (1)
  • nb_native (5-5)
src/util/file_reader.js (1)
  • nb_native (5-5)
config.js (2)
  • stat (1351-1351)
  • config (7-7)
src/manage_nsfs/manage_nsfs_glacier.js (4)
src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (6)
  • backend (90-90)
  • backend (174-174)
  • fs_context (191-191)
  • fs_context (303-303)
  • fs_context (439-439)
  • config (11-11)
src/util/native_fs_utils.js (2)
  • fs_context (549-554)
  • config (10-10)
src/cmd/manage_nsfs.js (3)
  • fs_context (73-73)
  • config (18-18)
  • native_fs_utils (23-23)
src/manage_nsfs/manage_nsfs_cli_utils.js (2)
  • fs_context (125-125)
  • native_fs_utils (7-7)
src/sdk/glacier.js (3)
src/manage_nsfs/manage_nsfs_glacier.js (5)
  • path (4-4)
  • native_fs_utils (9-9)
  • fs_context (13-13)
  • fs_context (28-28)
  • fs_context (42-42)
src/util/persistent_logger.js (3)
  • path (4-4)
  • dbg (9-9)
  • native_fs_utils (6-6)
src/util/native_fs_utils.js (3)
  • path (7-7)
  • dbg (4-4)
  • fs_context (549-554)
src/util/native_fs_utils.js (1)
src/native/fs/fs_napi.cpp (5)
  • flags (800-828)
  • error (700-708)
  • error (700-700)
  • error (722-735)
  • error (722-722)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-package-lock-validation
  • GitHub Check: run-jest-unit-tests
🔇 Additional comments (6)
config.js (1)

919-923: LGTM: adds DMAPI finalize-restore toggle

The new NSFS_GLACIER_DMAPI_FINALIZE_RESTORE_ENABLE flag and inline docs look good and align with the TapeCloud finalize path.

src/util/native_fs_utils.js (1)

762-763: Confirm '+*' open mode support in nb_native().fs.open
We didn’t find any parsing for the custom '+' mode in our JS wrappers or native sources, and '+' isn’t a standard fs.open flag. Please verify that nb_native().fs.open supports '+*'; if not, replace it with a supported mode (for example, 'r+' or 'rw').

• Location to check/use:
– src/util/native_fs_utils.js:762
    flags = flags || (cfg.lock === 'EXCLUSIVE' ? '+*' : 'r');

src/manage_nsfs/manage_nsfs_glacier.js (2)

16-24: LGTM: migration trigger conditions and timestamping

The OR combination of low_free_space/time_exceeded/log_threshold makes sense to drive more aggressive migrations during pressure while keeping periodic runs and a size-triggered fast path. Timestamp handling is correct.


31-39: LGTM: restore guard prevents work under pressure and enforces interval

Early-return on low_free_space or insufficient interval is appropriate to avoid competing with migrations and keep restore cadence bounded.

src/sdk/glacier_tapecloud.js (1)

121-130: LGTM: task ID parsing handles new “a:b:c” format with fallback

The updated regex with a sane fallback is robust to both TapeCloud versions.

src/sdk/glacier.js (1)

200-271: LGTM: two-phase orchestration with cluster locks

The central perform() flow (cluster-wide staging + per-op consumption under separate locks) matches the PR design goals for parallelism and safety. The lock file derivation under NSFS_GLACIER_LOGS_DIR is consistent and avoids cross-namespace contention.
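The two-phase flow can be schematized with in-memory promise-chain locks standing in for NooBaa's file-based cluster locks (a simplification; the real locks span processes and nodes): staging runs under one shared lock for a consistent view, while consumption runs under narrower per-operation locks so a migrate and a restore can proceed in parallel.

```javascript
// In-memory mutex: callers passed to the returned function run one at a time.
function make_lock() {
    let chain = Promise.resolve();
    return fn => {
        const run = chain.then(fn);
        chain = run.catch(() => {});   // keep the chain alive on failure
        return run;
    };
}

const cluster_lock = make_lock();      // serializes staging across operations
const migrate_lock = make_lock();      // serializes migrate consumers only
const restore_lock = make_lock();      // serializes restore consumers only

// Hypothetical orchestrator mirroring the described design, not the real perform().
async function perform(op, stage, consume) {
    await cluster_lock(stage);                          // stage 1: cheap, global
    const op_lock = op === 'MIGRATE' ? migrate_lock : restore_lock;
    await op_lock(consume);                             // stage 2: slow, per-op
}
```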

@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch from 09fbdf6 to d0a93ec Compare August 18, 2025 05:04

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
src/manage_nsfs/manage_nsfs_glacier.js (1)

92-105: File handle leak in migrate_log_exceeds_threshold

fh is opened but never closed, which can leak FDs. Close it in a finally block.

 async function migrate_log_exceeds_threshold(threshold = config.NSFS_GLACIER_MIGRATE_LOG_THRESHOLD) {
     const log = new PersistentLogger(config.NSFS_GLACIER_LOGS_DIR, Glacier.MIGRATE_WAL_NAME, { locking: null });
     let log_size = Number.MAX_SAFE_INTEGER;
+    let fh;
     try {
-        const fh = await log._open();
+        fh = await log._open();
         const { size } = await fh.stat(log.fs_context);
         log_size = size;
     } catch (error) {
         console.error("failed to get size of", Glacier.MIGRATE_WAL_NAME, error);
+    } finally {
+        try { await fh?.close(log.fs_context); } catch (_) {}
     }
 
     return log_size > threshold;
 }
src/sdk/glacier_tapecloud.js (1)

204-258: Two resource-handling bugs in stage_migrate (FD leak and missing await on close)

  • FD leak: when should_migrate is false, the opened entry_fh is not closed before returning.
  • Missing await: entry_fh?.close(fs_context) is not awaited in the finally block, risking delayed cleanup and lock leaks.

Fix both:

             let entry_fh;
             let should_migrate = true;
             try {
                 entry_fh = await nb_native().fs.open(fs_context, entry);
                 const stat = await entry_fh.stat(fs_context, {
                     xattr_get_keys: [
                         Glacier.XATTR_RESTORE_REQUEST,
                         Glacier.XATTR_RESTORE_EXPIRY,
                         Glacier.STORAGE_CLASS_XATTR,
                     ],
                 });
                 should_migrate = await this.should_migrate(fs_context, entry, stat);
             } catch (err) {
                 await entry_fh?.close(fs_context);
 
                 if (err.code === 'ENOENT') {
                     // Skip this file
                     return;
                 }
                 ...
             }
 
             // Skip the file if it shouldn't be migrated
-            if (!should_migrate) return;
+            if (!should_migrate) {
+                await entry_fh?.close(fs_context);
+                return;
+            }
 
             // Mark the file staged
             try {
                 await entry_fh.replacexattr(fs_context, { [Glacier.XATTR_STAGE_MIGRATE]: Date.now().toString() });
                 await batch_recorder(entry);
             } catch (error) {
                 dbg.error('failed to mark the entry migrate staged', error);
                 ...
             } finally {
-                entry_fh?.close(fs_context);
+                await entry_fh?.close(fs_context);
             }
♻️ Duplicate comments (2)
src/util/native_fs_utils.js (1)

762-775: Reiterating prior discussion: default flag '+*' is unconventional

This was discussed previously and you opted to keep '+*'. Acknowledging and not blocking; just tagging as a duplicate for traceability.

src/sdk/glacier_tapecloud.js (1)

483-511: Finalize-restore DMAPI premig enforcement

The premig check and failure-recorder/throw behavior match the previously stated design choice to avoid silent "restored" states. No change requested.

🧹 Nitpick comments (17)
src/util/native_fs_utils.js (4)

735-745: JSDoc inconsistencies and typos in open_with_lock docs

  • "aqcuired" → "acquired"
  • "cfg.locking" should be "cfg.lock" (matches implementation)
  • "O_RDRW" should be "O_RDWR" or reflect your actual flag strings (e.g., 'r+')

Recommend correcting to avoid confusion for future callers.

- * acquiring the lock on the file. Once lock is aqcuired
+ * acquiring the lock on the file. Once lock is acquired
...
- * - If locking is needed then cfg.locking must be specified.
+ * - If locking is needed then cfg.lock must be specified.
...
- * the flag is automatically updated to O_RDRW.
+ * the flag is automatically updated to a write-capable mode (e.g., 'r+').

764-769: Ensure includes_any returns a boolean

Currently returns true or undefined. Make it explicit for readability and to prevent accidental truthiness bugs in the future.

     const includes_any = (string, ...chars) => {
         for (const ch of chars) {
             if (string.includes(ch)) return true;
         }
+        return false;
     };

786-801: Consider retrying on transient race errors during validation (stat phase)

If nb_native().fs.stat(fs_context, file_path) races with a rename/unlink, it can throw ENOENT. Today this aborts immediately. Given you already have a retry loop for identity mismatches, treating certain errors as retriable can make open_with_lock more robust without changing lock semantics.

         } catch (error) {
-            await file?.close(fs_context);
-            throw error;
+            try { await file?.close(fs_context); } catch (_) {}
+            file = undefined;
+            if (['ENOENT', 'ESTALE'].includes(error.code) && retries > 1) {
+                await P.delay(backoff * (1 + Math.random()));
+                continue;
+            }
+            throw error;
         }

804-807: Explicitly return undefined for clarity when retries are exhausted

The function currently relies on implicit undefined return. Being explicit improves readability.

     await file?.close(fs_context);
+    return undefined;
src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (1)

124-140: Restore all mutated config flags in safe_dmapi_surround (missing NSFS_GLACIER_ENABLED)

The helper mutates NSFS_GLACIER_ENABLED but does not restore it, which can leak state across tests. Capture and restore it like the others.

-    const safe_dmapi_surround = async (init, cb) => {
+    const safe_dmapi_surround = async (init, cb) => {
         await dmapi_config_semaphore.surround(async () => {
-            const start_value_1 = config.NSFS_GLACIER_DMAPI_ENABLE;
-            const start_value_2 = config.NSFS_GLACIER_DMAPI_IMPLICIT_SC;
-            const start_value_3 = config.NSFS_GLACIER_DMAPI_IMPLICIT_RESTORE_STATUS;
+            const start_value_enabled = config.NSFS_GLACIER_ENABLED;
+            const start_value_dmapi_enable = config.NSFS_GLACIER_DMAPI_ENABLE;
+            const start_value_implicit_sc = config.NSFS_GLACIER_DMAPI_IMPLICIT_SC;
+            const start_value_implicit_restore = config.NSFS_GLACIER_DMAPI_IMPLICIT_RESTORE_STATUS;
 
             config.NSFS_GLACIER_ENABLED = init;
             config.NSFS_GLACIER_DMAPI_IMPLICIT_SC = init;
             config.NSFS_GLACIER_DMAPI_IMPLICIT_RESTORE_STATUS = init;
             await cb();
 
-            config.NSFS_GLACIER_DMAPI_ENABLE = start_value_1;
-            config.NSFS_GLACIER_DMAPI_IMPLICIT_SC = start_value_2;
-            config.NSFS_GLACIER_DMAPI_IMPLICIT_RESTORE_STATUS = start_value_3;
+            config.NSFS_GLACIER_ENABLED = start_value_enabled;
+            config.NSFS_GLACIER_DMAPI_ENABLE = start_value_dmapi_enable;
+            config.NSFS_GLACIER_DMAPI_IMPLICIT_SC = start_value_implicit_sc;
+            config.NSFS_GLACIER_DMAPI_IMPLICIT_RESTORE_STATUS = start_value_implicit_restore;
         });
     };
src/util/bucket_logs_utils.js (1)

39-45: Incorrect property in error log (log.file)

log is a PersistentLogger instance; using log.file is likely incorrect or undefined. Prefer a path identifier such as log.log_path (or construct from dir/name).

-            dbg.error('processing log file failed', log.file);
+            dbg.error('processing log file failed', log.log_path);

If log.log_path is not available on PersistentLogger, consider:

path.join(config.PERSISTENT_BUCKET_LOG_DIR, entry.name)
src/manage_nsfs/manage_nsfs_glacier.js (1)

117-127: Typo in helper name: prepare_galcier_fs_context

Consider renaming to prepare_glacier_fs_context for clarity and consistency. If you agree, update all call sites in this file and the export.

-function prepare_galcier_fs_context(fs_context) {
+function prepare_glacier_fs_context(fs_context) {
 ...
         return { ...fs_context, backend: 'GPFS', use_dmapi: true };
 ...
     return { ...fs_context };
 }

Remember to update the call sites in this file (and the export, if this helper is part of the public API) to use the new name.
src/sdk/glacier.js (8)

86-90: Lock taxonomy is clear; consider grouping constants

The separate cluster-wide vs per-operation locks are well-defined. Minor: consider grouping these under a single object/enum to avoid string typos and improve discoverability.

-    static GLACIER_CLUSTER_LOCK = 'glacier.cluster.lock';
-    static GLACIER_MIGRATE_CLUSTER_LOCK = 'glacier.cluster.migrate.lock';
-    static GLACIER_RESTORE_CLUSTER_LOCK = 'glacier.cluster.restore.lock';
-    static GLACIER_SCAN_LOCK = 'glacier.scan.lock';
+    static LOCKS = {
+        CLUSTER: 'glacier.cluster.lock',
+        MIGRATE: 'glacier.cluster.migrate.lock',
+        RESTORE: 'glacier.cluster.restore.lock',
+        SCAN: 'glacier.scan.lock',
+    };

91-103: JSDoc nits: typos and parameter clarification

Tighten wording and fix typos for clarity (“separated”, parameter type/description).

-    /**
-     * stage_migrate must take a LogFile object (this should be from the
-     * `GLACIER.MIGRATE_WAL_NAME` namespace) which will have
-     * newline seperated entries of filenames which needs to be
-     * migrated to GLACIER and should stage the files for migration.
-     * 
-     * The function should return false if it needs the log file to be
-     * preserved.
-     * @param {nb.NativeFSContext} fs_context
-     * @param {LogFile} log_file log filename
-     * @param {(entry: string) => Promise<void>} failure_recorder
-     * @returns {Promise<boolean>}
-     */
+    /**
+     * Stage entries from a MIGRATE log into the stage namespace.
+     * - Input: LogFile from Glacier.MIGRATE_WAL_NAME (newline-separated file paths).
+     * - Output: Entries are copied into Glacier.MIGRATE_STAGE_WAL_NAME for later processing.
+     * - Return: true to delete the source log; false to preserve it.
+     * @param {nb.NativeFSContext} fs_context
+     * @param {LogFile} log_file
+     * @param {(entry: string) => Promise<void>} failure_recorder
+     * @returns {Promise<boolean>}
+     */

105-112: Stage-migrate flow is correct; minor stylistic nit

Returning true is consistent with the staging design. The async arrow’s body can just return the promise for brevity.

-            await log_file.collect(Glacier.MIGRATE_STAGE_WAL_NAME, async (entry, batch_recorder) => batch_recorder(entry));
+            await log_file.collect(Glacier.MIGRATE_STAGE_WAL_NAME, (entry, batch_recorder) => batch_recorder(entry));

115-126: Fix typos in JSDoc (LogFile, grammar)

LofFile → LogFile; minor wording/typos.

-    /**
-     * migrate must take a LofFile object (this should from the
-     * `GLACIER.MIGRATE_STAGE_WAL_NAME` namespace) which will have newline
-     * separated entries of filenames which needs to be migrated to GLACIER
-     * and should perform migration of those files if feasible.
+    /**
+     * migrate must take a LogFile object (from the
+     * `Glacier.MIGRATE_STAGE_WAL_NAME` namespace) which has newline-
+     * separated file paths to migrate to GLACIER and should perform
+     * migration of those files if feasible.

134-146: Fix copy/paste errors in JSDoc for stage_restore

The input should be from RESTORE (primary) namespace, and the wording should be about staging restore, not migration.

-    /**
-     * stage_restore must take a log file (from `Glacier.RESTORE_STAGE_WAL_NAME`)
-     * which will have newline seperated entries of filenames which needs to be
-     * migrated to GLACIER and should stage the files for migration.
-     * 
-     * The function should return false if it needs the log file to be
-     * preserved.
-     * @param {nb.NativeFSContext} fs_context
-     * @param {LogFile} log_file log filename
-     * @param {(entry: string) => Promise<void>} failure_recorder
-     * @returns {Promise<boolean>}
-     */
+    /**
+     * Stage entries from a RESTORE log into the stage namespace.
+     * - Input: LogFile from `Glacier.RESTORE_WAL_NAME` (newline-separated file paths).
+     * - Output: Entries are copied into `Glacier.RESTORE_STAGE_WAL_NAME` for later processing.
+     * - Return: true to delete the source log; false to preserve it.
+     * @param {nb.NativeFSContext} fs_context
+     * @param {LogFile} log_file
+     * @param {(entry: string) => Promise<void>} failure_recorder
+     * @returns {Promise<boolean>}
+     */

156-167: JSDoc typo and consistency in restore signature

Minor: “seperated” → “separated”; ensure LogFile type mention is consistent across methods.

-     * have newline seperated entries of filenames which needs to be 
+     * have newline separated entries of filenames which need to be 

219-230: Nit: rename “entry” to “log_file” for clarity

The callback receives a LogFile, not an entry. Renaming improves readability and avoids confusion with line entries inside the log.

-            await logs.process(async (entry, failure_recorder) => cb(fs_context, entry, failure_recorder));
+            await logs.process(async (log_file, failure_recorder) => cb(fs_context, log_file, failure_recorder));

449-456: XATTR_STAGE_MIGRATE is read but never used

Fetching XATTR_STAGE_MIGRATE here has no effect since the logic doesn’t reference it. Either remove it from xattr_get_keys or incorporate it into the decision (e.g., to avoid redundant restore staging when a file is already flagged for migration).

-                xattr_get_keys: [
-                    Glacier.XATTR_RESTORE_REQUEST,
-                    Glacier.STORAGE_CLASS_XATTR,
-                    Glacier.XATTR_STAGE_MIGRATE,
-                ],
+                xattr_get_keys: [
+                    Glacier.XATTR_RESTORE_REQUEST,
+                    Glacier.STORAGE_CLASS_XATTR,
+                ],

Would you like to factor the staging xattr into should_restore/should_migrate decisions? If not, dropping it from the stat call reduces overhead.

src/util/persistent_logger.js (2)

137-140: Consider preserving deletion failure context

Currently rethrowing will surface the error; consider adding the caught error (with code/path) to dbg.warn for quicker triage, or include it in the throw to retain context.

-            } catch (error) {
-                dbg.warn('failed to delete the log file:', log_file.log_path);
-                throw error;
+            } catch (error) {
+                dbg.warn('failed to delete the log file:', log_file.log_path, 'error:', error);
+                throw error;
             }

280-322: collect pipeline is correct; minor nits

  • Resetting the reader before each run is fine and avoids partial reads.
  • Minor: close() is called both before return and in finally; harmless but redundant.

Optional cleanup:

-            if (filtered_log.local_size === 0) return;
-
-            await filtered_log.close();
+            if (filtered_log.local_size === 0) return;
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 09fbdf6 and d0a93ec.

📒 Files selected for processing (10)
  • config.js (1 hunks)
  • src/manage_nsfs/manage_nsfs_glacier.js (1 hunks)
  • src/sdk/glacier.js (7 hunks)
  • src/sdk/glacier_tapecloud.js (7 hunks)
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (4 hunks)
  • src/util/bucket_logs_utils.js (4 hunks)
  • src/util/file_reader.js (1 hunks)
  • src/util/native_fs_utils.js (3 hunks)
  • src/util/notifications_util.js (3 hunks)
  • src/util/persistent_logger.js (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/util/notifications_util.js
  • src/util/file_reader.js
  • config.js
🧰 Additional context used
📓 Path-based instructions (1)
src/test/**/*.*

⚙️ CodeRabbit Configuration File

src/test/**/*.*: Ensure that the PR includes tests for the changes.

Files:

  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
🧠 Learnings (4)
📚 Learning: 2025-08-18T05:00:52.268Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:377-406
Timestamp: 2025-08-18T05:00:52.268Z
Learning: In TapeCloudGlacier.restore method, the method always returns true (rather than propagating _recall() result) because it uses a staging approach where the original log file should always be deleted after processing. Failed entries are captured by failure_recorder for retry through staged files, so the original log preservation is not needed.

Applied to files:

  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
  • src/sdk/glacier.js
  • src/sdk/glacier_tapecloud.js
📚 Learning: 2025-08-05T10:05:51.959Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:507-509
Timestamp: 2025-08-05T10:05:51.959Z
Learning: In TapeCloudGlacier._finalize_restore method, when failure_recorder is not provided but the DMAPI premig check fails, the code intentionally throws an error to cause the entire batch to fail and preserve the log file for retry, preventing files from being silently stuck in an unrestored state.

Applied to files:

  • src/sdk/glacier.js
  • src/sdk/glacier_tapecloud.js
📚 Learning: 2025-08-18T05:01:58.212Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/util/native_fs_utils.js:778-799
Timestamp: 2025-08-18T05:01:58.212Z
Learning: In the NooBaa codebase, fcntllock uses F_OFD_SETLKW which is a blocking lock operation that waits until the lock can be acquired, rather than returning EAGAIN/EWOULDBLOCK like non-blocking variants would.

Applied to files:

  • src/util/native_fs_utils.js
📚 Learning: 2025-08-18T05:00:21.111Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:273-311
Timestamp: 2025-08-18T05:00:21.111Z
Learning: In TapeCloudGlacier.migrate method, ignoring the boolean return value from _migrate() and always proceeding to unstage entries is acceptable behavior - the migrate operation doesn't require checking the return value like other operations might.

Applied to files:

  • src/sdk/glacier_tapecloud.js
🧬 Code Graph Analysis (3)
src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (2)
src/manage_nsfs/manage_nsfs_glacier.js (6)
  • backend (14-14)
  • backend (29-29)
  • backend (43-43)
  • fs_context (13-13)
  • fs_context (28-28)
  • fs_context (42-42)
src/util/native_fs_utils.js (1)
  • fs_context (549-554)
src/sdk/glacier.js (3)
src/manage_nsfs/manage_nsfs_glacier.js (5)
  • config (6-6)
  • native_fs_utils (9-9)
  • fs_context (13-13)
  • fs_context (28-28)
  • fs_context (42-42)
src/util/persistent_logger.js (1)
  • native_fs_utils (6-6)
src/util/native_fs_utils.js (2)
  • config (10-10)
  • fs_context (549-554)
src/manage_nsfs/manage_nsfs_glacier.js (5)
src/util/native_fs_utils.js (3)
  • fs_context (549-554)
  • config (10-10)
  • path (7-7)
src/cmd/manage_nsfs.js (3)
  • fs_context (73-73)
  • config (18-18)
  • native_fs_utils (23-23)
src/manage_nsfs/manage_nsfs_cli_utils.js (2)
  • fs_context (125-125)
  • native_fs_utils (7-7)
src/sdk/glacier.js (3)
  • config (9-9)
  • path (4-4)
  • native_fs_utils (11-11)
config.js (2)
  • config (7-7)
  • path (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: run-package-lock-validation
🔇 Additional comments (26)
src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (4)

89-98: Good test helper: centralized backend patching

Centralizing backend patching into get_patched_backend() reduces duplication and makes tests easier to reason about.


176-207: Test logic aligns with staged WAL processing

Using collect_and_process on a LogFile and deferring delete by returning false matches the new PersistentLogger callback contract. Also verifying should_migrate here is a nice sanity check.


236-237: RESTORE orchestration through backend.perform is the right abstraction

Switching to backend.perform(fs_context, "RESTORE") appropriately reflects the new centralized execution model.


306-307: End-to-end failure path validation for RESTORE looks good

Exercising partial failures via a patched backend and asserting only success entries finalize is valuable coverage for the staging+finalization pipeline.

src/util/bucket_logs_utils.js (2)

9-9: Import cleanup aligns with new LogFile callback contract

Relying on the LogFile instance passed by PersistentLogger.process keeps responsibilities clear and avoids redundant wrappers.


55-60: JSDoc update to accept LogFile is correct

Processing the provided log_file directly via collect_and_process aligns with the PersistentLogger changes.

src/manage_nsfs/manage_nsfs_glacier.js (3)

16-24: Migration trigger logic correctly centralizes via backend.perform

Conditionally invoking backend.perform(..., "MIGRATION") and updating the timestamp afterward matches the stated design. Looks good.


36-39: Restore trigger path LGTM

Time-based gating followed by backend.perform(..., "RESTORE") is clear and consistent with the new orchestration.


56-58: Expiry trigger path LGTM

Using is_desired_time and recording timestamp after perform(..., "EXPIRY") keeps the periodicity semantics intact.
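The interval gating described above can be condensed into a small predicate. This is a hedged sketch: the real helper reads the last-run timestamp from a file under `NSFS_GLACIER_LOGS_DIR`, while here it is reduced to pure arguments for illustration.

```javascript
// Sketch of time-based gating: run if we have never run before, or if the
// configured interval has elapsed since the recorded timestamp.
function time_exceeded(last_run_ms, interval_ms, now_ms = Date.now()) {
    if (last_run_ms === undefined) return true;
    return (now_ms - last_run_ms) >= interval_ms;
}
```

A caller would invoke `perform(...)` only when the predicate is true, and record the current time afterward so the periodicity semantics stay intact.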

src/sdk/glacier_tapecloud.js (3)

122-131: Robust task ID parsing with compatibility fallback

Recognizing the colon-delimited task ID and falling back to the legacy numeric format is a good compatibility improvement.


319-369: Stage RESTORE flow reads and routes correctly

Opening per-entry, checking should_restore, and routing entries to failure_recorder when XATTR_STAGE_MIGRATE is present protects against migrate/restore overlap. Cleanup is awaited. Looks good.


377-405: RESTORE orchestration and finalization path LGTM

Deferring to _recall with partial failure/success recorders and then finalizing via _finalize_restore on success aligns with the staged design. Always returning true here (to delete the original log) is consistent with the intended pipeline.

src/sdk/glacier.js (7)

4-4: Imports and type hinting look correct

path, PersistentLogger, native_fs_utils, and the LogFile JSDoc import are appropriate for the new staging/locking flow.

Also applies to: 10-12, 13-13


49-50: Clarify semantics and lifecycle of XATTR_STAGE_MIGRATE

This xattr is introduced but not used in this file’s logic. Please confirm:

  • Which component is responsible for setting/clearing it (backend migrate/restore flows)?
  • Whether should_migrate should also read/consider it to avoid re-staging already-staged entries (or if that’s intentionally handled purely via WAL names).

75-75: Stage WAL naming is consistent

stage.migrate and stage.restore align with the two-phase orchestration and make the flow readable.

Also applies to: 77-77


145-153: Stage-restore mirrors stage-migrate correctly

Flow matches the intended design (collect into staged WAL and delete original). Good error handling.


205-211: Expiry under dedicated scan lock looks good

Running expiry under GLACIER_SCAN_LOCK isolates it from staging/processing locks. LGTM.


239-251: Two-phase orchestration and locks are correct

  • Phase 1: cluster-wide staging under GLACIER_CLUSTER_LOCK
  • Phase 2: type-specific processing under MIGRATE/RESTORE lock

This cleanly enables one migrate and one restore concurrently cluster-wide.
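The two-phase flow above can be sketched as follows. `with_lock` is a hypothetical in-process stand-in for noobaa's cross-process file locks; the lock names mirror the review text and are assumptions, not the real API.

```javascript
// Hold the named lock for the duration of fn (in-process stand-in).
async function with_lock(held, name, fn) {
    while (held.has(name)) await new Promise(resolve => setTimeout(resolve, 1));
    held.add(name);
    try { return await fn(); } finally { held.delete(name); }
}

async function perform(held, stage_fn, process_fn, op_type) {
    // Phase 1: staging under the single cluster-wide lock
    await with_lock(held, 'GLACIER_CLUSTER_LOCK', stage_fn);
    // Phase 2: the long-running work under a per-operation lock, so one
    // MIGRATION and one RESTORE can proceed concurrently
    await with_lock(held, `${op_type}_LOCK`, process_fn);
}
```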


253-269: MIGRATION/RESTORE wiring is correct

Correct mapping of primary/staged namespaces and functions to the two-phase runner.

src/util/persistent_logger.js (7)

54-64: open_with_lock + ensured directory creation = safer init

Creating the directory and using native_fs_utils.open_with_lock with retries/backoff resolves the race seen previously and improves robustness under contention.


100-100: Updated callback signature is sound

Updating the internal processing callback to accept LogFile enables holding the file lock across callback execution and deletion.


119-146: Race fix: delete under lock, then deinit — correct sequencing

  • LogFile.init() acquires an exclusive lock.
  • Callback runs while held.
  • Deletion happens before deinit, avoiding the previous race.
  • Finally deinit ensures the lock is released even on errors.

This addresses the unlink/init race and is aligned with the PR objective.
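The sequencing above can be summarized in a sketch. `FakeLogFile` is an in-memory stand-in, and `init`/`unlink`/`deinit` are assumed names mirroring the review text, not the actual noobaa classes.

```javascript
class FakeLogFile {
    constructor() { this.locked = false; this.deleted = false; }
    async init() { this.locked = true; }   // acquire exclusive lock
    async unlink() {
        if (!this.locked) throw new Error('unlink outside lock');
        this.deleted = true;
    }
    async deinit() { this.locked = false; } // release lock
}

async function process_and_delete(file, callback) {
    await file.init();                      // 1. take the lock
    try {
        const keep = (await callback(file)) === false;
        if (!keep) await file.unlink();     // 2. delete while still holding it
    } finally {
        await file.deinit();                // 3. release the lock last
    }
}
```

Returning `false` from the callback preserves the file, matching the delete-or-keep contract described in the review.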


250-254: EXCLUSIVE NewlineReader per LogFile is appropriate

Ensures per-file mutual exclusion while iterating entries and interacting with staging/processing.


256-268: Explicit init semantics are clean

Eagerly initializing to acquire the lock when needed is helpful; callers have control to hold/release locks as desired.


269-279: Graceful deinit

Idempotent close/reset is good; it is safe even if the callback has already deinitialized the file.


345-348: Reset before collect_and_process is good

Ensures a fresh read regardless of prior usage; matches the updated lifecycle model.

@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch from d0a93ec to 37e5240 on August 18, 2025 05:51

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
src/sdk/glacier_tapecloud.js (1)

208-258: File descriptor leak when should_migrate() returns false

entry_fh is not closed when should_migrate is false (early return on Line 242), leaking fds under load. Ensure the handle is closed before returning.

Apply this diff within the callback to close the handle on the early return:

-                // Skip the file if it shouldn't be migrated
-                if (!should_migrate) return;
+                // Skip the file if it shouldn't be migrated
+                if (!should_migrate) {
+                    await entry_fh?.close(fs_context);
+                    return;
+                }
src/manage_nsfs/manage_nsfs_glacier.js (1)

93-105: File handle leak in migrate_log_exceeds_threshold

fh from log._open() is never closed, leaking descriptors. Close it in a finally block.

Apply this diff:

 async function migrate_log_exceeds_threshold(threshold = config.NSFS_GLACIER_MIGRATE_LOG_THRESHOLD) {
     const log = new PersistentLogger(config.NSFS_GLACIER_LOGS_DIR, Glacier.MIGRATE_WAL_NAME, { locking: null });
     let log_size = Number.MAX_SAFE_INTEGER;
-    try {
-        const fh = await log._open();
-
-        const { size } = await fh.stat(log.fs_context);
-        log_size = size;
-    } catch (error) {
-        console.error("failed to get size of", Glacier.MIGRATE_WAL_NAME, error);
-    }
+    let fh;
+    try {
+        fh = await log._open();
+        const { size } = await fh.stat(log.fs_context);
+        log_size = size;
+    } catch (error) {
+        console.error("failed to get size of", Glacier.MIGRATE_WAL_NAME, error);
+    } finally {
+        if (fh) {
+            try { await fh.close(log.fs_context); } catch (e) { /* best effort */ }
+        }
+    }
 
     return log_size > threshold;
 }
♻️ Duplicate comments (1)
src/sdk/glacier_tapecloud.js (1)

483-511: Consider gating premig check on fs_context.use_dmapi (already discussed)

A prior review suggested adding fs_context.use_dmapi to the conditional to avoid false positives when DMAPI isn’t used. The discussion concluded it’s not necessary. Just noting the prior context here.

🧹 Nitpick comments (13)
src/sdk/glacier_tapecloud.js (1)

49-52: Error message does not include the exit code due to incorrect Error construction

new Error accepts only the message string; passing errcode as a second argument is ignored, losing the exit code in the message.

Apply this diff:

-            if (errcode) {
-                throw new Error('process exited with non-zero exit code:', errcode);
-            }
+            if (errcode) {
+                throw new Error(`process exited with non-zero exit code: ${errcode}`);
+            }
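For reference, Node's `Error` constructor treats a second argument as an options object (`{ cause }` in modern Node); a bare number is ignored, which is why the exit code silently vanished from the message:

```javascript
// The second argument is not interpolated into the message, so the exit
// code never reaches the error text unless it is part of the string itself.
const errcode = 2;
const broken = new Error('process exited with non-zero exit code:', errcode);
const fixed = new Error(`process exited with non-zero exit code: ${errcode}`);
```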
src/manage_nsfs/manage_nsfs_glacier.js (4)

16-24: Add error handling around MIGRATION perform + timestamp update

If backend.perform throws, you currently skip timestamping (good) but also lose diagnostics. Wrap with try/catch to log the failure explicitly.

Apply this diff:

-    if (
+    if (
         await backend.low_free_space() ||
         await time_exceeded(fs_context, config.NSFS_GLACIER_MIGRATE_INTERVAL, Glacier.MIGRATE_TIMESTAMP_FILE) ||
         await migrate_log_exceeds_threshold()
     ) {
-        await backend.perform(prepare_galcier_fs_context(fs_context), "MIGRATION");
-        const timestamp_file_path = path.join(config.NSFS_GLACIER_LOGS_DIR, Glacier.MIGRATE_TIMESTAMP_FILE);
-        await record_current_time(fs_context, timestamp_file_path);
+        try {
+            await backend.perform(prepare_galcier_fs_context(fs_context), "MIGRATION");
+            const timestamp_file_path = path.join(config.NSFS_GLACIER_LOGS_DIR, Glacier.MIGRATE_TIMESTAMP_FILE);
+            await record_current_time(fs_context, timestamp_file_path);
+        } catch (err) {
+            console.error('process_migrations: backend.perform failed:', err);
+        }
     }

31-39: Add error handling around RESTORE perform + timestamp update

Mirror the migration handling to improve observability on failures.

Apply this diff:

-    await backend.perform(prepare_galcier_fs_context(fs_context), "RESTORE");
-    const timestamp_file_path = path.join(config.NSFS_GLACIER_LOGS_DIR, Glacier.RESTORE_TIMESTAMP_FILE);
-    await record_current_time(fs_context, timestamp_file_path);
+    try {
+        await backend.perform(prepare_galcier_fs_context(fs_context), "RESTORE");
+        const timestamp_file_path = path.join(config.NSFS_GLACIER_LOGS_DIR, Glacier.RESTORE_TIMESTAMP_FILE);
+        await record_current_time(fs_context, timestamp_file_path);
+    } catch (err) {
+        console.error('process_restores: backend.perform failed:', err);
+    }

56-58: Add error handling around EXPIRY perform + timestamp update

Keep behavior consistent with other operations and preserve diagnostics.

Apply this diff:

-        await backend.perform(prepare_galcier_fs_context(fs_context), "EXPIRY");
-        await record_current_time(fs_context, timestamp_file_path);
+        try {
+            await backend.perform(prepare_galcier_fs_context(fs_context), "EXPIRY");
+            await record_current_time(fs_context, timestamp_file_path);
+        } catch (err) {
+            console.error('process_expiry: backend.perform failed:', err);
+        }

117-127: Fix typo: prepare_galcier_fs_context → prepare_glacier_fs_context

Minor naming typo; rename for clarity and consistency.

Apply this diff for the definition:

-function prepare_galcier_fs_context(fs_context) {
+function prepare_glacier_fs_context(fs_context) {

And update call sites similarly:

  • Line 21: backend.perform(prepare_glacier_fs_context(fs_context), "MIGRATION")
  • Line 36: backend.perform(prepare_glacier_fs_context(fs_context), "RESTORE")
  • Line 56: backend.perform(prepare_glacier_fs_context(fs_context), "EXPIRY")
src/sdk/glacier.js (8)

91-112: Docs: fix typos (“seperated” → “separated”), tighten wording

Small documentation cleanup for stage_migrate.

Apply this diff within the comment block:

-     * newline seperated entries of filenames which needs to be
-     * migrated to GLACIER and should stage the files for migration.
+     * newline separated entries of filenames that need to be
+     * migrated to GLACIER and should stage the files for migration.
-     * The function should return false if it needs the log file to be
-     * preserved.
+     * Return false to preserve the log file.

114-129: Docs: fix type name (“LofFile” → “LogFile”) and typos

Minor documentation corrections.

Apply this diff:

-     * migrate must take a LofFile object (this should from the
+     * migrate must take a LogFile object (from the
-     * `GLACIER.MIGRATE_STAGE_WAL_NAME` namespace) which will have newline
-     * separated entries of filenames which needs to be migrated to GLACIER
+     * `GLACIER.MIGRATE_STAGE_WAL_NAME` namespace) which has newline
+     * separated filenames that need to be migrated to GLACIER

133-153: Docs: stage_restore description is copied from migrate; adjust wording

This block should describe staging for restore, not migration, and fix typos.

Apply this diff:

-     * stage_restore must take a log file (from `Glacier.RESTORE_STAGE_WAL_NAME`)
-     * which will have newline seperated entries of filenames which needs to be
-     * migrated to GLACIER and should stage the files for migration.
+     * stage_restore must take a log file (from `Glacier.RESTORE_WAL_NAME`)
+     * which has newline separated filenames that need to be
+     * restored from GLACIER and should stage the files for restore.
-     * The function should return false if it needs the log file to be
-     * preserved.
+     * Return false to preserve the log file.
-     * @param {LogFile} log_file log filename
+     * @param {LogFile} log_file

155-169: Docs: fix typos and clarify restore description

Minor clarity improvements.

Apply this diff:

-     * restore must take a log file (from `Glacier.RESTORE_WAL_NAME`) which will
-     * have newline seperated entries of filenames which needs to be 
-     * restored from GLACIER and should perform restore of those files if
-     * feasible
+     * restore must take a log file (from `Glacier.RESTORE_STAGE_WAL_NAME`) which
+     * has newline separated filenames to restore from GLACIER and should
+     * perform restore of those files if feasible.

129-131: Error message typo: “Unimplementented” → “Unimplemented”

Fix the exception text for clarity.

Apply this diff:

-        throw new Error('Unimplementented');
+        throw new Error('Unimplemented');

171-172: Error message typo: “Unimplementented” → “Unimplemented” (restore)

Same typo in restore.

Apply this diff:

-        throw new Error('Unimplementented');
+        throw new Error('Unimplemented');

181-182: Error message typo: “Unimplementented” → “Unimplemented” (expiry)

Same typo in expiry.

Apply this diff:

-        throw new Error('Unimplementented');
+        throw new Error('Unimplemented');

197-198: Error message typo: “Unimplementented” → “Unimplemented” (low_free_space)

Same typo in low_free_space.

Apply this diff:

-        throw new Error('Unimplementented');
+        throw new Error('Unimplemented');
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d0a93ec and 37e5240.

📒 Files selected for processing (10)
  • config.js (1 hunks)
  • src/manage_nsfs/manage_nsfs_glacier.js (1 hunks)
  • src/sdk/glacier.js (7 hunks)
  • src/sdk/glacier_tapecloud.js (7 hunks)
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (4 hunks)
  • src/util/bucket_logs_utils.js (4 hunks)
  • src/util/file_reader.js (1 hunks)
  • src/util/native_fs_utils.js (3 hunks)
  • src/util/notifications_util.js (3 hunks)
  • src/util/persistent_logger.js (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • config.js
  • src/util/file_reader.js
  • src/util/notifications_util.js
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
  • src/util/persistent_logger.js
  • src/util/bucket_logs_utils.js
  • src/util/native_fs_utils.js
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-18T05:00:52.268Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:377-406
Timestamp: 2025-08-18T05:00:52.268Z
Learning: In TapeCloudGlacier.restore method, the method always returns true (rather than propagating _recall() result) because it uses a staging approach where the original log file should always be deleted after processing. Failed entries are captured by failure_recorder for retry through staged files, so the original log preservation is not needed.

Applied to files:

  • src/sdk/glacier.js
  • src/sdk/glacier_tapecloud.js
📚 Learning: 2025-08-05T10:05:51.959Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:507-509
Timestamp: 2025-08-05T10:05:51.959Z
Learning: In TapeCloudGlacier._finalize_restore method, when failure_recorder is not provided but the DMAPI premig check fails, the code intentionally throws an error to cause the entire batch to fail and preserve the log file for retry, preventing files from being silently stuck in an unrestored state.

Applied to files:

  • src/sdk/glacier.js
  • src/sdk/glacier_tapecloud.js
📚 Learning: 2025-08-18T05:00:21.111Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:273-311
Timestamp: 2025-08-18T05:00:21.111Z
Learning: In TapeCloudGlacier.migrate method, ignoring the boolean return value from _migrate() and always proceeding to unstage entries is acceptable behavior - the migrate operation doesn't require checking the return value like other operations might.

Applied to files:

  • src/sdk/glacier_tapecloud.js
🧬 Code Graph Analysis (1)
src/manage_nsfs/manage_nsfs_glacier.js (5)
src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (7)
  • backend (90-90)
  • backend (174-174)
  • fs_context (191-191)
  • fs_context (303-303)
  • fs_context (439-439)
  • config (11-11)
  • path (6-6)
src/util/native_fs_utils.js (3)
  • fs_context (549-554)
  • config (10-10)
  • path (7-7)
src/cmd/manage_nsfs.js (3)
  • fs_context (73-73)
  • config (18-18)
  • native_fs_utils (23-23)
src/sdk/glacier.js (3)
  • config (9-9)
  • path (4-4)
  • native_fs_utils (11-11)
config.js (2)
  • config (7-7)
  • path (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: run-package-lock-validation
🔇 Additional comments (3)
src/sdk/glacier_tapecloud.js (3)

18-18: JSDoc import-only for LogFile looks good

Using a JSDoc import to pull types without a runtime dependency is appropriate here.


121-130: Robust task ID parsing with backward-compatible fallback

The new regex supports the modern 1005:1111:4444 format and correctly falls back to the legacy numeric format. Good defensive handling.


380-402: Restore always returns true by design (staging semantics) — LGTM

Returning true regardless of _recall()’s boolean aligns with the two-phase staging approach: original logs are deleted post-staging; failed entries are retried via staged logs using failure_recorder.

Signed-off-by: Utkarsh Srivastava <[email protected]>

remove fcntltrylock

Signed-off-by: Utkarsh Srivastava <[email protected]>

add support for new task ID format

Signed-off-by: Utkarsh Srivastava <[email protected]>
@tangledbytes tangledbytes force-pushed the utkarsh/feat/parallel-recall-migrate branch from 37e5240 to 197213b on August 18, 2025 06:35

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (2)
src/sdk/glacier.js (1)

449-456: Include RESTORE_EXPIRY in xattr_get_keys for should_restore

get_restore_status consults XATTR_RESTORE_EXPIRY. Fetch it in the stat call to avoid partial status evaluation, especially when DMAPI implicit flags are disabled.

             stat = await nb_native().fs.stat(fs_context, file, {
                 xattr_get_keys: [
                     Glacier.XATTR_RESTORE_REQUEST,
                     Glacier.STORAGE_CLASS_XATTR,
+                    Glacier.XATTR_RESTORE_EXPIRY,
                     Glacier.XATTR_STAGE_MIGRATE,
                 ],
             });
src/sdk/glacier_tapecloud.js (1)

207-258: File descriptor leak when skipping non-migratable entries + missing await on close.

  • If should_migrate is false, the function returns without closing entry_fh.
  • The final close is not awaited; if it rejects you may get an unhandled rejection, and you also risk racing with subsequent operations.

Apply this diff to ensure proper closure in all paths:

                 // Skip the file if it shouldn't be migrated
-                if (!should_migrate) return;
+                if (!should_migrate) {
+                    await entry_fh?.close(fs_context);
+                    return;
+                }

                 // Mark the file staged
                 try {
                     await entry_fh.replacexattr(fs_context, { [Glacier.XATTR_STAGE_MIGRATE]: Date.now().toString() });
                     await batch_recorder(entry);
                 } catch (error) {
                     dbg.error('failed to mark the entry migrate staged', error);

                     // Can't really do anything if this fails - provider
                     // needs to make sure that appropriate error handling
                     // is being done there
                     await failure_recorder(entry);
                 } finally {
-                    entry_fh?.close(fs_context);
+                    await entry_fh?.close(fs_context);
                 }
♻️ Duplicate comments (5)
src/util/persistent_logger.js (1)

274-278: Redundant reader reset during deinit [same concern as before]

deinit already closes the reader; both collect and collect_and_process also reset before iterating. Resetting here is likely redundant; consider removing to reduce surprises and keep lifecycle symmetric.

src/sdk/glacier_tapecloud.js (4)

18-18: Type-only JSDoc import is correct; keep it.

This import is useful for tooling/typing and has no runtime cost.


273-311: Migrate flow matches the two-stage design; ignoring _migrate()’s boolean is OK.

Proceeding to unstage regardless of _migrate()’s boolean aligns with the design decision captured in this PR’s discussion and learnings.


383-407: Returning true from restore() is intentional with the staged flow.

Capturing success to decide whether to iterate the staged log but always returning true matches the stated design: original logs are deleted post-staging; failures are retried via staged files.


489-517: DMAPI pre-migration gate: behavior and documentation look correct; optional guard for misconfig.

  • Good addition: only request the DMAPI xattr when fs_context.use_dmapi is true.
  • Throwing when failure_recorder is absent is intentional to preserve the log and avoid silent stuck files.
  • The doc block explains the rationale clearly. Nice.

Optional hardening: Add fs_context.use_dmapi to the conditional to avoid false positives if the config flag is accidentally enabled without DMAPI.

-            if (
-                config.NSFS_GLACIER_DMAPI_FINALIZE_RESTORE_ENABLE &&
-                !stat.xattr[Glacier.GPFS_DMAPI_XATTR_TAPE_PREMIG]
-            ) {
+            if (
+                config.NSFS_GLACIER_DMAPI_FINALIZE_RESTORE_ENABLE &&
+                fs_context.use_dmapi &&
+                !stat.xattr[Glacier.GPFS_DMAPI_XATTR_TAPE_PREMIG]
+            ) {
🧹 Nitpick comments (10)
src/util/notifications_util.js (2)

100-100: Remove redundant await and async in the callback passed to PersistentLogger.process

No need to wrap _notify with an extra async/await. Let process await the returned promise directly.

-                await log.process(async (file, failure_append) => await this._notify(file, failure_append));
+                await log.process((file, failure_append) => this._notify(file, failure_append));

120-160: Avoid creating a temporary filtered log when only iterating entries

Using file.collect_and_process without ever calling the batch_recorder parameter forces a temporary PersistentLogger allocation per file (albeit quickly removed). Consider adding and using a lightweight iteration helper (e.g., LogFile.forEach) or exposing NewlineReader iteration on LogFile to avoid the extra logger allocations. If not feasible now, optional.
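One possible shape for such a helper is sketched below. This is hypothetical, not the actual noobaa API; the in-memory `lines` array stands in for the NewlineReader-backed file iteration.

```javascript
// Hypothetical LogFile.forEach: iterate entries directly without allocating
// a temporary PersistentLogger for the filtered output.
class LogFileSketch {
    constructor(lines) { this.lines = lines; }
    async forEach(cb) {
        for (const entry of this.lines) {
            // Returning false from the callback stops iteration early
            if ((await cb(entry)) === false) break;
        }
    }
}
```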

src/util/persistent_logger.js (2)

174-188: Consider closing the failure logger after processing inactive failures

Current code closes failure_log in finally, which is fine. Optionally, call await failure_log.close() after each phase to release handles earlier if contention arises. Not critical.


345-349: Reset before iteration is fine; ensure this stays the single point of reset

Given the explicit reset before reading, the extra reset in deinit can be dropped to reduce duplication. Keeping reset localized to iteration points improves clarity.

src/util/bucket_logs_utils.js (1)

39-39: Integrate failure recording to improve retry granularity (optional)

You’re returning false on any error, which preserves the whole WAL file. If you pass through failure_recorder and use it for per-entry failures, you can delete the file while persisting only failed entries to a failure log, reducing reprocessing. This is optional and depends on your retry strategy.

For example, thread the recorder through:

-            return log.process(async file => _upload_to_targets(s3_connection, file, bucket_to_owner_keys_func));
+            return log.process(async (file, failure_recorder) =>
+                _upload_to_targets(s3_connection, file, bucket_to_owner_keys_func, failure_recorder)
+            );

And extend _upload_to_targets accordingly to append selectively on failures.

src/sdk/glacier.js (2)

91-112: Stage-to-stage copy is correct; underscore unused params (optional)

stage_migrate writes primary entries to the staged WAL namespace. failure_recorder isn’t used here; prefixing it with _failure_recorder will satisfy linters.


223-229: Close the logs after processing to minimize handle lifetime (optional)

process_glacier_logs constructs a PersistentLogger and runs process but never closes the logger. Today that's fine (no append/init), but explicitly closing improves hygiene and future-proofs changes.

-            const logs = new PersistentLogger(
+            const logs = new PersistentLogger(
                 config.NSFS_GLACIER_LOGS_DIR,
                 namespace, { locking: 'EXCLUSIVE' },
             );
-            await logs.process(async (entry, failure_recorder) => cb(fs_context, entry, failure_recorder));
+            try {
+                await logs.process(async (file, failure_recorder) => cb(fs_context, file, failure_recorder));
+            } finally {
+                await logs.close();
+            }
src/sdk/glacier_tapecloud.js (3)

121-130: Unify task-id parsing into a single, future-proof regex.

Support both the new colon-delimited form and the legacy numeric form without a fallback branch. Also make the match case-insensitive.

Apply this diff:

-        // ID in latest version must look like 1005:1111:4444
-        let match = stdout.match(/task ID is (\d+:\d+:\d+)/);
-        if (!match || match.length !== 2) {
-            // Fallback to the older task ID extraction in case we fail to extract new one
-            match = stdout.match(/task ID is (\d+)/);
-            if (!match || match.length !== 2) {
-                throw error;
-            }
-        }
+        // ID can look like 1005:1111:4444 (new) or a plain number (old)
+        const match = stdout.match(/task ID is (\d+(?::\d+){2,}|\d+)/i);
+        if (!match) {
+            throw error;
+        }
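A quick sanity check of the proposed single regex (the surrounding helper name is assumed for illustration):

```javascript
// Matches both the new colon-delimited task IDs and legacy plain numbers.
const TASK_ID_RE = /task ID is (\d+(?::\d+){2,}|\d+)/i;

function parse_task_id(stdout) {
    const match = stdout.match(TASK_ID_RE);
    if (!match) throw new Error('failed to extract task ID');
    return match[1];
}
```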

295-301: Typo in log message.

Minor nit: “enty” -> “entry”.

-                    // Add the enty to the failure log - This could be wasteful as it might
+                    // Add the entry to the failure log - This could be wasteful as it might

495-506: Nit: fix step numbering in the comment.

Step 8 is missing (goes 7 -> 9). Keeping sequential numbering aids future readers.

-            // 9. NooBaa finalizes the restore
+            // 8. NooBaa finalizes the restore
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 37e5240 and 197213b.

📒 Files selected for processing (10)
  • config.js (1 hunks)
  • src/manage_nsfs/manage_nsfs_glacier.js (1 hunks)
  • src/sdk/glacier.js (7 hunks)
  • src/sdk/glacier_tapecloud.js (7 hunks)
  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (4 hunks)
  • src/util/bucket_logs_utils.js (4 hunks)
  • src/util/file_reader.js (1 hunks)
  • src/util/native_fs_utils.js (3 hunks)
  • src/util/notifications_util.js (3 hunks)
  • src/util/persistent_logger.js (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • config.js
  • src/util/native_fs_utils.js
  • src/util/file_reader.js
  • src/manage_nsfs/manage_nsfs_glacier.js
🧰 Additional context used
📓 Path-based instructions (1)
src/test/**/*.*

⚙️ CodeRabbit Configuration File

src/test/**/*.*: Ensure that the PR includes tests for the changes.

Files:

  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
🧠 Learnings (3)
📚 Learning: 2025-08-18T05:00:52.268Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:377-406
Timestamp: 2025-08-18T05:00:52.268Z
Learning: In TapeCloudGlacier.restore method, the method always returns true (rather than propagating _recall() result) because it uses a staging approach where the original log file should always be deleted after processing. Failed entries are captured by failure_recorder for retry through staged files, so the original log preservation is not needed.

Applied to files:

  • src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js
  • src/sdk/glacier.js
  • src/sdk/glacier_tapecloud.js
📚 Learning: 2025-08-05T10:05:51.959Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:507-509
Timestamp: 2025-08-05T10:05:51.959Z
Learning: In TapeCloudGlacier._finalize_restore method, when failure_recorder is not provided but the DMAPI premig check fails, the code intentionally throws an error to cause the entire batch to fail and preserve the log file for retry, preventing files from being silently stuck in an unrestored state.

Applied to files:

  • src/sdk/glacier.js
  • src/sdk/glacier_tapecloud.js
📚 Learning: 2025-08-18T05:00:21.111Z
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:273-311
Timestamp: 2025-08-18T05:00:21.111Z
Learning: In TapeCloudGlacier.migrate method, ignoring the boolean return value from _migrate() and always proceeding to unstage entries is acceptable behavior - the migrate operation doesn't require checking the return value like other operations might.

Applied to files:

  • src/sdk/glacier_tapecloud.js
🧬 Code Graph Analysis (5)
src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (3)
src/manage_nsfs/manage_nsfs_glacier.js (6)
  • backend (14-14)
  • backend (29-29)
  • backend (43-43)
  • fs_context (13-13)
  • fs_context (28-28)
  • fs_context (42-42)
src/sdk/namespace_fs.js (2)
  • s3_utils (15-15)
  • entry (267-267)
src/sdk/glacier.js (1)
  • s3_utils (6-6)
src/util/bucket_logs_utils.js (2)
src/util/persistent_logger.js (1)
  • require (8-8)
src/endpoint/s3/s3_bucket_logging.js (1)
  • log (146-158)
src/sdk/glacier.js (2)
src/manage_nsfs/manage_nsfs_glacier.js (8)
  • path (4-4)
  • nb_native (7-7)
  • nb_native (72-72)
  • config (6-6)
  • native_fs_utils (9-9)
  • fs_context (13-13)
  • fs_context (28-28)
  • fs_context (42-42)
src/util/persistent_logger.js (3)
  • path (4-4)
  • nb_native (5-5)
  • native_fs_utils (6-6)
src/sdk/glacier_tapecloud.js (3)
src/native/fs/fs_napi.cpp (2)
  • stat (2029-2033)
  • stat (2030-2030)
src/util/native_fs_utils.js (7)
  • fs_context (549-554)
  • nb_native (12-12)
  • nb_native (670-670)
  • stat (401-401)
  • stat (449-449)
  • stat (598-598)
  • config (10-10)
src/sdk/glacier.js (2)
  • nb_native (5-5)
  • config (9-9)
src/util/notifications_util.js (3)
src/sdk/namespace_fs.js (2)
  • require (26-26)
  • require (29-29)
src/server/bg_services/lifecycle.js (1)
  • require (13-14)
src/manage_nsfs/manage_nsfs_glacier.js (1)
  • log (93-93)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: ceph-s3-tests / Ceph S3 Tests
🔇 Additional comments (7)
src/util/notifications_util.js (1)

115-120: API migration to LogFile looks good

JSDoc and signature updates correctly align _notify to accept a LogFile and a failure recorder, matching the new PersistentLogger.process contract. Implementation uses the provided file instance without re-locking.

src/util/persistent_logger.js (1)

119-147: Good: switching _process to LogFile with explicit lifecycle and lock ownership

Creating LogFile per inactive WAL, acquiring EX lock in init(), and deinit() in finally ensures safe processing and deletion semantics. The delete-or-keep contract is clear.

src/util/bucket_logs_utils.js (1)

59-105: LGTM: direct use of LogFile.collect_and_process

Using the LogFile instance directly simplifies the flow and aligns with the new PersistentLogger API. Logging with log_file.log_path is consistent.

src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js (1)

89-98: Centralized backend patching is clean

get_patched_backend() consolidates test doubles for migrate/recall/expiry and keeps tests focused. Good refactor.

src/sdk/glacier.js (2)

253-269: Two-phase orchestration with cluster and per-op locks looks good

Staging under GLACIER_CLUSTER_LOCK and executing under per-op locks (migrate/restore) is a solid approach to improve parallelism while preserving consistency. Nice.
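As an illustration of that ordering, here is a self-contained sketch. The lock names match the PR, but `with_lock` is an in-memory stand-in for the real cluster and per-operation file locks:

```javascript
// In-memory stand-in for the real cluster-wide and per-op locks.
const locks = new Map();
async function with_lock(name, fn) {
    while (locks.has(name)) await locks.get(name);
    let release;
    const held = new Promise(res => { release = res; });
    locks.set(name, held);
    try {
        return await fn();
    } finally {
        locks.delete(name);
        release();
    }
}

const events = [];
async function glacier_cycle() {
    // Stage 1: cheap screening (read, stat, xattr), serialized cluster-wide
    // so every node sees a consistent view of the system state.
    await with_lock('GLACIER_CLUSTER_LOCK', async () => {
        events.push('stage_migrate');
        events.push('stage_restore');
    });
    // Stage 2: the expensive tape work. Migrate and restore no longer block
    // each other, but each op still runs on at most one node at a time.
    await Promise.all([
        with_lock('GLACIER_MIGRATE_LOCK', async () => events.push('migrate')),
        with_lock('GLACIER_RESTORE_LOCK', async () => events.push('restore')),
    ]);
}
```

Stage 2 is where the PR description's future improvement would land: running multiple `_migrate` calls under the same per-op lock via `Promise.all`.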


133-153: Fix JSDoc: source namespace for stage_restore

Documentation says stage_restore takes a log file from Glacier.RESTORE_STAGE_WAL_NAME, but stage_restore is responsible for copying from the primary RESTORE WAL to the staged log. Update the doc to say it takes a file from Glacier.RESTORE_WAL_NAME.

-     * stage_restore must take a log file (from `Glacier.RESTORE_STAGE_WAL_NAME`)
+     * stage_restore must take a log file (from `Glacier.RESTORE_WAL_NAME`)
⛔ Skipped due to learnings
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:377-406
Timestamp: 2025-08-18T05:00:52.268Z
Learning: In TapeCloudGlacier.restore method, the method always returns true (rather than propagating _recall() result) because it uses a staging approach where the original log file should always be deleted after processing. Failed entries are captured by failure_recorder for retry through staged files, so the original log preservation is not needed.
Learnt from: tangledbytes
PR: noobaa/noobaa-core#9183
File: src/sdk/glacier_tapecloud.js:507-509
Timestamp: 2025-08-05T10:05:51.959Z
Learning: In TapeCloudGlacier._finalize_restore method, when failure_recorder is not provided but the DMAPI premig check fails, the code intentionally throws an error to cause the entire batch to fail and preserve the log file for retry, preventing files from being silently stuck in an unrestored state.
src/sdk/glacier_tapecloud.js (1)

319-370: Stage/restore screening and requeueing look solid.

  • Requests the needed xattrs up-front.
  • Correctly requeues entries staged for migration to avoid restore/migrate contention.
  • Handles ENOENT gracefully and closes handles properly.

Comment on lines +176 to +207
     mocha.it('upload to GLACIER should work', async function() {
         const data = crypto.randomBytes(100);
         const upload_res = await glacier_ns.upload_object({
             bucket: upload_bkt,
             key: upload_key,
             storage_class: s3_utils.STORAGE_CLASS_GLACIER,
             xattr,
             source_stream: buffer_utils.buffer_to_read_stream(data)
         }, dummy_object_sdk);

         console.log('upload_object response', inspect(upload_res));

         // Check if the log contains the entry
         let found = false;
         await NamespaceFS.migrate_wal._process(async file => {
             const fs_context = glacier_ns.prepare_fs_context(dummy_object_sdk);
-            const reader = new NewlineReader(fs_context, file, { lock: 'EXCLUSIVE' });
-
-            await reader.forEachFilePathEntry(async entry => {
-                if (entry.path.endsWith(`${upload_key}`)) {
-                    found = true;
-
-                    // Not only should the file exist, it should be ready for
-                    // migration as well
-                    assert(backend.should_migrate(fs_context, entry.path));
-                }
-
-                return true;
-            });
+            await file.collect_and_process(async entry => {
+                if (entry.endsWith(`${upload_key}`)) {
+                    found = true;
+
+                    // Not only should the file exist, it should be ready for
+                    // migration as well
+                    assert(backend.should_migrate(fs_context, entry));
+                }
+            });

             // Don't delete the file
             return false;
         });

         assert(found);
     });
⚠️ Potential issue

Await the async should_migrate() before asserting

should_migrate returns a Promise. Asserting the Promise object will always pass. Await it before asserting the boolean result.

-                        assert(backend.should_migrate(fs_context, entry));
+                        assert(await backend.should_migrate(fs_context, entry));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mocha.it('upload to GLACIER should work', async function() {
    const data = crypto.randomBytes(100);
    const upload_res = await glacier_ns.upload_object({
        bucket: upload_bkt,
        key: upload_key,
        storage_class: s3_utils.STORAGE_CLASS_GLACIER,
        xattr,
        source_stream: buffer_utils.buffer_to_read_stream(data)
    }, dummy_object_sdk);

    console.log('upload_object response', inspect(upload_res));

    // Check if the log contains the entry
    let found = false;
    await NamespaceFS.migrate_wal._process(async file => {
        const fs_context = glacier_ns.prepare_fs_context(dummy_object_sdk);
        await file.collect_and_process(async entry => {
            if (entry.endsWith(`${upload_key}`)) {
                found = true;

                // Not only should the file exist, it should be ready for
                // migration as well
                assert(backend.should_migrate(fs_context, entry));
            }
        });

        // Don't delete the file
        return false;
    });

    assert(found);
});
mocha.it('upload to GLACIER should work', async function() {
    const data = crypto.randomBytes(100);
    const upload_res = await glacier_ns.upload_object({
        bucket: upload_bkt,
        key: upload_key,
        storage_class: s3_utils.STORAGE_CLASS_GLACIER,
        xattr,
        source_stream: buffer_utils.buffer_to_read_stream(data)
    }, dummy_object_sdk);

    console.log('upload_object response', inspect(upload_res));

    // Check if the log contains the entry
    let found = false;
    await NamespaceFS.migrate_wal._process(async file => {
        const fs_context = glacier_ns.prepare_fs_context(dummy_object_sdk);
        await file.collect_and_process(async entry => {
            if (entry.endsWith(`${upload_key}`)) {
                found = true;

                // Not only should the file exist, it should be ready for
                // migration as well
                assert(await backend.should_migrate(fs_context, entry));
            }
        });

        // Don't delete the file
        return false;
    });

    assert(found);
});
🤖 Prompt for AI Agents
In src/test/unit_tests/nsfs/test_nsfs_glacier_backend.js around lines 176 to
207, the test calls backend.should_migrate(fs_context, entry) which returns a
Promise<boolean> but the code asserts the Promise object directly; update the
test to await the result (e.g. const migratable = await
backend.should_migrate(fs_context, entry)) and then assert the boolean
(assert(migratable)); ensure the surrounding callback is async so the await
works and no Promise is accidentally asserted.

Comment on lines +54 to 61
await native_fs_utils._create_path(this.dir, this.fs_context);
this.fh = await native_fs_utils.open_with_lock(
this.fs_context,
this.active_path,
'as+',
undefined,
{ lock: this.locking, retries: 10, backoff: 5 }
);
🛠️ Refactor suggestion

Preserve active file change detection by setting fh_stat on init

The poller (_poll_active_file_change) relies on this.fh_stat to detect inode changes, but init no longer sets it. Without it, active file change detection is effectively disabled. Set fh_stat after opening to restore the behavior.

             this.fh = await native_fs_utils.open_with_lock(
                 this.fs_context,
                 this.active_path,
                 'as+',
                 undefined,
                 { lock: this.locking, retries: 10, backoff: 5 }
             );
-
-            return this.fh;
+            // Record current inode to allow _poll_active_file_change to detect rotations
+            try {
+                this.fh_stat = await nb_native().fs.stat(this.fs_context, this.active_path);
+            } catch (e) {
+                // If the file vanishes between open and stat we will re-initialize on next append
+                dbg.warn('failed to stat active log after open:', this.active_path, e);
+            }
+            return this.fh;

Also applies to: 63-63

🤖 Prompt for AI Agents
In src/util/persistent_logger.js around lines 54-61 (and also line 63), init
opens the active file but no longer sets this.fh_stat, which disables the
poller’s inode-change detection; after successfully opening the file handle
assign this.fh_stat to the current file stat (e.g., obtain the inode/mtime via
the file handle’s stat or fs.fstat on the descriptor) so the
_poll_active_file_change routine has a baseline to compare against; apply the
same assignment in the alternate/branch at line 63 so both init paths restore
active file change detection.

@tangledbytes tangledbytes merged commit ed5f39b into noobaa:master Aug 18, 2025
18 of 19 checks passed