fix: stale NFS mounts (#434) by dialmaster · Pull Request #440 · DialmasterOrg/Youtarr

dialmaster · 2026-02-21T16:45:53Z

Add pre-flight health check (write+delete probe with timeout) before each download to catch unresponsive or permission-denied output dirs before yt-dlp contaminates the archive
Replace bare fs.move/ensureDir in post-processor with retry-enabled variants (exponential backoff) to ride out transient NFS errors
Remove failed videos from yt-dlp archive so they are retried on the next scheduled run instead of appearing permanently "already downloaded"
Replace synchronous statSync diagnostic with timeout-guarded async stat to avoid hanging on stale NFS mounts during error logging
Add NFS mount configuration guidance to CONFIG.md

- Add pre-flight health check (write+delete probe with timeout) before each download to catch unresponsive or permission-denied output dirs before yt-dlp contaminates the archive - Replace bare fs.move/ensureDir in post-processor with retry-enabled variants (exponential backoff) to ride out transient NFS errors - Remove failed videos from yt-dlp archive so they are retried on the next scheduled run instead of appearing permanently "already downloaded" - Replace synchronous statSync diagnostic with timeout-guarded async stat to avoid hanging on stale NFS mounts during error logging - Add NFS mount configuration guidance to CONFIG.md

github-actions · 2026-02-21T16:48:52Z

📊 Test Coverage Report

Backend Coverage

Type	Coverage	Status
Lines	79.59%	🟡
Statements	79.44%	🟡
Functions	80.20%	🟢
Branches	72.79%	🟡

Frontend Coverage

Type	Coverage	Status
Lines	87.21%	🟢
Statements	86.51%	🟢
Functions	80.43%	🟢
Branches	81.47%	🟢

Coverage Requirements

Minimum threshold: 70% line coverage
Backend: ✅ Passes
Frontend: ✅ Passes

Coverage report generated for commit 8d6e6a1

claude · 2026-02-21T16:50:41Z

PR Review: fix/stale-NFS-mounts

This is a well-motivated fix for a real data-integrity bug. The core approach — pre-flight health check + retry wrappers + archive cleanup — addresses the root causes systematically. Below are my findings, ordered by severity.

Issues

1. Health check file may be orphaned on NFS timeout (medium)

In checkOutputDirectoryHealth, if writeFile completes but the NFS mount becomes unresponsive before unlink finishes, the timeout fires and the test file is left on the filesystem with no cleanup path:

await Promise.race([
  (async () => {
    await fsPromises.writeFile(testFile, 'healthcheck');  // succeeds
    await fsPromises.unlink(testFile);                    // hangs → timeout fires → orphan
  })(),
  timeoutPromise
]);

If the NFS mount is slow-but-not-dead, the background unlink will eventually complete. If it's permanently hung, the .youtarr_healthcheck_ file stays on the NFS server. Consider tracking whether writeFile completed and adding a best-effort cleanup attempt in the finally block.

Additionally, Date.now() for the filename suffix has a (very low) collision risk if two health checks run in the same millisecond. A short random suffix would be more robust.

2. Unreachable `throw lastError` in retry loops (low)

Both ensureDirWithRetries and moveWithRetries end with:

while (attempt <= retries) {
  try {
    ...
    return;
  } catch (err) {
    lastError = err;
    if (attempt === retries) throw err;  // always throws before loop can exit
    ...
    attempt += 1;
  }
}
throw lastError;  // unreachable

The while loop either returns on success or throws when attempt === retries. The trailing throw lastError is dead code in both implementations. Not a bug, but worth cleaning up.

3. Archive removal on metadata detection failures could trigger spurious re-downloads (low)

failedVideosList classifies a video as failed if fileSize is null/zero after VideoMetadataProcessor.processVideoMetadata(). If the processor fails to stat a file that was actually moved successfully (e.g. due to a transient read error during post-processing), the video would be:

Removed from the yt-dlp archive
Re-downloaded on the next run
Detected as a duplicate if the file already exists

This is an edge case and the graceful isVideoInArchive check in removeVideoFromArchive limits the blast radius, but it could cause unnecessary duplicate download attempts.

4. `configModule.directoryPath` coupling (low)

The health check calls:

await this.checkOutputDirectoryHealth(configModule.directoryPath);

If directoryPath is undefined (e.g. called before config is loaded), path.join(undefined, ...) throws a TypeError — which would still block the download with a confusing error rather than a meaningful one. A defensive check or use of getConfig().outputDir (with the same pattern used elsewhere) would be clearer.

Observations / Minor Suggestions

Cumulative retry delay: With retries: 5, delayMs: 500 (exponential backoff), the worst-case cumulative delay before a move fails is ~15.5 s per operation. With two operations (ensureDirWithRetries + moveWithRetries), a single post-processor run can block for ~31 s before calling process.exit(1). This is likely acceptable for NFS, but worth documenting in the post-processor comments.
removeVideoFromArchive uses synchronous I/O: readCompleteListLines() and fs.writeFileSync() are synchronous. Pre-existing, but called from async context. On large archives this could block the event loop briefly.
Health check scope: The health check probes configModule.directoryPath (final output dir) but doesn't probe the temp directory when useTmpForDownloads: true. This is intentional per the PR description (the temp→output move is what contaminates the archive), but is worth a comment in the code to make the reasoning explicit.
Test for timeout case uses jest.useFakeTimers — the jest.useRealTimers() call at the end should ideally be in an afterEach to prevent timer leakage if the test assertion throws before cleanup.

What's done well

The pre-flight health check with an explicit timeout is exactly the right pattern for detecting hung NFS mounts before they corrupt state. Catching the problem before yt-dlp runs is the correct fix.
Exponential backoff on moveWithRetries / ensureDirWithRetries handles transient NFS hiccups cleanly.
Archive cleanup logic correctly gates on !allowRedownload and checks isVideoInArchive before writing, avoiding double-removal.
The async stat with timeout in the error-logging path (replacing a synchronous statSync) correctly avoids hanging the process during diagnostics.
Documentation in CONFIG.md is thorough and actionable.
Test coverage is solid across the new code paths.

Overall the change is sound and addresses the issue correctly. The orphaned health check file and the unreachable throw are the most worth addressing before merging.

- Health check: add best-effort cleanup for orphaned test files when unlink fails after writeFile succeeds (e.g. NFS timeout between the two operations). Use crypto.randomUUID() instead of Date.now() to avoid filename collisions. - Health check: add guard for undefined outputDir to throw a clear error instead of a cryptic TypeError from path.join(undefined, ...). - Retry loops: replace while+lastError pattern with for-loop in both ensureDirWithRetries and moveWithRetries to eliminate unreachable dead code (throw lastError after exhaustive loop). - Archive cleanup: only remove videos from the yt-dlp archive when they were explicitly marked as failed during download, not when they merely lack file size (stat/waitForFile failure). Prevents spurious re-downloads when NFS lag causes stat to fail on files that actually exist on disk. https://claude.ai/code/session_01XKqGTES8nh9U7xerGXHxjW

…mtvr Fix issues raised in PR #440 review

claude · 2026-02-21T18:02:13Z

Code Review

Overall this is solid work that addresses a real problem. The design decisions are sensible and the test coverage is good. A few issues worth discussing before merging:

Potential Bug: Double-unlink in `checkOutputDirectoryHealth` happy path

In the finally block, fsPromises.unlink(testFile) is called whenever fileWritten === true, including the happy path where the inner async function already ran unlink successfully:

// Inner async:
await fsPromises.writeFile(testFile, 'healthcheck');
fileWritten = true;
await fsPromises.unlink(testFile);   // ← succeeds

// finally block always runs:
if (fileWritten) {
  fsPromises.unlink(testFile).catch(() => {}); // ← tries again on an already-deleted file
}

The ENOENT is swallowed by .catch(() => {}) so this is harmless, but the intent is muddied. The finally cleanup should only be a safety net when something went wrong. One option:

let fileWritten = false;
let fileDeleted = false;
// ...
await fsPromises.writeFile(testFile, 'healthcheck');
fileWritten = true;
await fsPromises.unlink(testFile);
fileDeleted = true;
// ...
// finally:
if (fileWritten && !fileDeleted) {
  fsPromises.unlink(testFile).catch(() => {});
}

Test/Implementation alignment concern: "no metadata" archive removal

The test 'should remove failed videos from archive that have no metadata' mocks processVideoMetadata to return [] and expects removeVideoFromArchive to be called with 'nodata123'. But the new archive-removal code only iterates failedVideosList:

if (!allowRedownload && failedVideosList.length > 0) {
  for (const failedVideo of failedVideosList) { ... }
}

If failedVideosList is derived from processVideoMetadata results, an empty metadata result means an empty list, and the loop body never runs. Either the test documents an assumption about how failedVideosList is constructed that isn't visible in this diff (in which case a comment would help), or this test will fail once run against the real code. Worth verifying before merge.

Stack trace dropped in error logging

The existing error log:

logger.error({ error, videoDirectory }, '[Post-Process] ERROR during move operation');

passed the raw error object. The replacement:

logger.error({ error: error.message, code: error.code, syscall: error.syscall, src: ... }, ...)

logs explicit fields but loses the stack trace. For production debugging this is a regression. Most Pino-based loggers serialise err specially — using { err: error, code: error.code, ... } would retain the stack while also making code and syscall explicit top-level fields.

Minor: `require` calls inside methods

checkOutputDirectoryHealth does:

const fsPromises = require('fs').promises;
const crypto = require('crypto');

These are resolved on every call. They're cached by Node's module system so there's no performance issue, but placing them at the top of the file is the conventional pattern and makes dependencies easier to audit.

Similarly, require('../archiveModule') inside the archive-removal block is fine (Node caches it) but inconsistent with how other modules are imported.

Things done well

The pre-flight health check design is solid: write+delete probe with a configurable timeout, using crypto.randomUUID() to avoid filename collisions, and a separate timeout Promise via Promise.race to detect hung mounts. This cleanly solves the original issue.
Distinguishing between "explicitly failed" (yt-dlp stdout ERROR) and "stat-only failed" videos before removing from archive is a subtle and correct call — prevents spurious re-downloads when NFS stat latency is the real cause.
The moveWithRetries for-loop refactor removes a dead throw lastError that could never be reached. Clean.
The async, timeout-guarded diagnostic stat in the catch block is a good improvement over a synchronous call that would hang on a stale mount.
The CONFIG.md NFS section is thorough and actionable — the Docker native NFS volume example in particular is good advice.
Test coverage for checkOutputDirectoryHealth is comprehensive, including the hung-filesystem timeout case with fake timers.

claude and others added 2 commits February 21, 2026 17:47

Merge pull request #441 from DialmasterOrg/claude/fix-pr-440-issues-u…

8d6e6a1

…mtvr Fix issues raised in PR #440 review

dialmaster merged commit 675c2f1 into dev Feb 21, 2026
6 checks passed

dialmaster mentioned this pull request Feb 21, 2026

Scheduled downloads failing(?) resulting in "Missing" videos #434

Closed

dialmaster mentioned this pull request Mar 15, 2026

Dev -> Main (Release) #461

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: stale NFS mounts (#434)#440

fix: stale NFS mounts (#434)#440
dialmaster merged 3 commits intodevfrom
fix/434-scheduled-downloads-failing-missing

dialmaster commented Feb 21, 2026

Uh oh!

github-actions bot commented Feb 21, 2026 •

edited

Loading

Uh oh!

claude bot commented Feb 21, 2026

Uh oh!

claude bot commented Feb 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

dialmaster commented Feb 21, 2026

Uh oh!

github-actions bot commented Feb 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Test Coverage Report

Backend Coverage

Frontend Coverage

Coverage Requirements

Uh oh!

claude bot commented Feb 21, 2026

PR Review: fix/stale-NFS-mounts

Issues

1. Health check file may be orphaned on NFS timeout (medium)

2. Unreachable throw lastError in retry loops (low)

3. Archive removal on metadata detection failures could trigger spurious re-downloads (low)

4. configModule.directoryPath coupling (low)

Observations / Minor Suggestions

What's done well

Uh oh!

claude bot commented Feb 21, 2026

Code Review

Potential Bug: Double-unlink in checkOutputDirectoryHealth happy path

Test/Implementation alignment concern: "no metadata" archive removal

Stack trace dropped in error logging

Minor: require calls inside methods

Things done well

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Feb 21, 2026 •

edited

Loading

2. Unreachable `throw lastError` in retry loops (low)

4. `configModule.directoryPath` coupling (low)

Potential Bug: Double-unlink in `checkOutputDirectoryHealth` happy path

Minor: `require` calls inside methods