Skip to content

fix: stale NFS mounts (#434)#440

Merged
dialmaster merged 3 commits intodevfrom
fix/434-scheduled-downloads-failing-missing
Feb 21, 2026
Merged

fix: stale NFS mounts (#434)#440
dialmaster merged 3 commits intodevfrom
fix/434-scheduled-downloads-failing-missing

Conversation

@dialmaster
Copy link
Collaborator

  • Add pre-flight health check (write+delete probe with timeout) before each download to catch unresponsive or permission-denied output dirs before yt-dlp contaminates the archive
  • Replace bare fs.move/ensureDir in post-processor with retry-enabled variants (exponential backoff) to ride out transient NFS errors
  • Remove failed videos from yt-dlp archive so they are retried on the next scheduled run instead of appearing permanently "already downloaded"
  • Replace synchronous statSync diagnostic with timeout-guarded async stat to avoid hanging on stale NFS mounts during error logging
  • Add NFS mount configuration guidance to CONFIG.md

- Add pre-flight health check (write+delete probe with timeout) before
  each download to catch unresponsive or permission-denied output dirs
  before yt-dlp contaminates the archive
- Replace bare fs.move/ensureDir in post-processor with retry-enabled
  variants (exponential backoff) to ride out transient NFS errors
- Remove failed videos from yt-dlp archive so they are retried on the
  next scheduled run instead of appearing permanently "already downloaded"
- Replace synchronous statSync diagnostic with timeout-guarded async stat
  to avoid hanging on stale NFS mounts during error logging
- Add NFS mount configuration guidance to CONFIG.md
@github-actions
Copy link
Contributor

github-actions bot commented Feb 21, 2026

📊 Test Coverage Report

Backend Coverage

Type Coverage Status
Lines 79.59% 🟡
Statements 79.44% 🟡
Functions 80.20% 🟢
Branches 72.79% 🟡

Frontend Coverage

Type Coverage Status
Lines 87.21% 🟢
Statements 86.51% 🟢
Functions 80.43% 🟢
Branches 81.47% 🟢

Coverage Requirements

  • Minimum threshold: 70% line coverage
  • Backend: ✅ Passes
  • Frontend: ✅ Passes

Coverage report generated for commit 8d6e6a1

@claude
Copy link

claude bot commented Feb 21, 2026

PR Review: fix/stale-NFS-mounts

This is a well-motivated fix for a real data-integrity bug. The core approach — pre-flight health check + retry wrappers + archive cleanup — addresses the root causes systematically. Below are my findings, ordered by severity.


Issues

1. Health check file may be orphaned on NFS timeout (medium)

In checkOutputDirectoryHealth, if writeFile completes but the NFS mount becomes unresponsive before unlink finishes, the timeout fires and the test file is left on the filesystem with no cleanup path:

await Promise.race([
  (async () => {
    await fsPromises.writeFile(testFile, 'healthcheck');  // succeeds
    await fsPromises.unlink(testFile);                    // hangs → timeout fires → orphan
  })(),
  timeoutPromise
]);

If the NFS mount is slow-but-not-dead, the background unlink will eventually complete. If it's permanently hung, the .youtarr_healthcheck_ file stays on the NFS server. Consider tracking whether writeFile completed and adding a best-effort cleanup attempt in the finally block.

Additionally, Date.now() for the filename suffix has a (very low) collision risk if two health checks run in the same millisecond. A short random suffix would be more robust.

2. Unreachable throw lastError in retry loops (low)

Both ensureDirWithRetries and moveWithRetries end with:

while (attempt <= retries) {
  try {
    ...
    return;
  } catch (err) {
    lastError = err;
    if (attempt === retries) throw err;  // always throws before loop can exit
    ...
    attempt += 1;
  }
}
throw lastError;  // unreachable

The while loop either returns on success or throws when attempt === retries. The trailing throw lastError is dead code in both implementations. Not a bug, but worth cleaning up.

3. Archive removal on metadata detection failures could trigger spurious re-downloads (low)

failedVideosList classifies a video as failed if fileSize is null/zero after VideoMetadataProcessor.processVideoMetadata(). If the processor fails to stat a file that was actually moved successfully (e.g. due to a transient read error during post-processing), the video would be:

  1. Removed from the yt-dlp archive
  2. Re-downloaded on the next run
  3. Detected as a duplicate if the file already exists

This is an edge case and the graceful isVideoInArchive check in removeVideoFromArchive limits the blast radius, but it could cause unnecessary duplicate download attempts.

4. configModule.directoryPath coupling (low)

The health check calls:

await this.checkOutputDirectoryHealth(configModule.directoryPath);

If directoryPath is undefined (e.g. called before config is loaded), path.join(undefined, ...) throws a TypeError — which would still block the download with a confusing error rather than a meaningful one. A defensive check or use of getConfig().outputDir (with the same pattern used elsewhere) would be clearer.


Observations / Minor Suggestions

  • Cumulative retry delay: With retries: 5, delayMs: 500 (exponential backoff), the worst-case cumulative delay before a move fails is ~15.5 s per operation. With two operations (ensureDirWithRetries + moveWithRetries), a single post-processor run can block for ~31 s before calling process.exit(1). This is likely acceptable for NFS, but worth documenting in the post-processor comments.

  • removeVideoFromArchive uses synchronous I/O: readCompleteListLines() and fs.writeFileSync() are synchronous. Pre-existing, but called from async context. On large archives this could block the event loop briefly.

  • Health check scope: The health check probes configModule.directoryPath (final output dir) but doesn't probe the temp directory when useTmpForDownloads: true. This is intentional per the PR description (the temp→output move is what contaminates the archive), but is worth a comment in the code to make the reasoning explicit.

  • Test for timeout case uses jest.useFakeTimers — the jest.useRealTimers() call at the end should ideally be in an afterEach to prevent timer leakage if the test assertion throws before cleanup.


What's done well

  • The pre-flight health check with an explicit timeout is exactly the right pattern for detecting hung NFS mounts before they corrupt state. Catching the problem before yt-dlp runs is the correct fix.
  • Exponential backoff on moveWithRetries / ensureDirWithRetries handles transient NFS hiccups cleanly.
  • Archive cleanup logic correctly gates on !allowRedownload and checks isVideoInArchive before writing, avoiding double-removal.
  • The async stat with timeout in the error-logging path (replacing a synchronous statSync) correctly avoids hanging the process during diagnostics.
  • Documentation in CONFIG.md is thorough and actionable.
  • Test coverage is solid across the new code paths.

Overall the change is sound and addresses the issue correctly. The orphaned health check file and the unreachable throw are the most worth addressing before merging.

claude and others added 2 commits February 21, 2026 17:47
- Health check: add best-effort cleanup for orphaned test files when
  unlink fails after writeFile succeeds (e.g. NFS timeout between the
  two operations). Use crypto.randomUUID() instead of Date.now() to
  avoid filename collisions.
- Health check: add guard for undefined outputDir to throw a clear
  error instead of a cryptic TypeError from path.join(undefined, ...).
- Retry loops: replace while+lastError pattern with for-loop in both
  ensureDirWithRetries and moveWithRetries to eliminate unreachable
  dead code (throw lastError after exhaustive loop).
- Archive cleanup: only remove videos from the yt-dlp archive when
  they were explicitly marked as failed during download, not when they
  merely lack file size (stat/waitForFile failure). Prevents spurious
  re-downloads when NFS lag causes stat to fail on files that actually
  exist on disk.

https://claude.ai/code/session_01XKqGTES8nh9U7xerGXHxjW
@claude
Copy link

claude bot commented Feb 21, 2026

Code Review

Overall this is solid work that addresses a real problem. The design decisions are sensible and the test coverage is good. A few issues worth discussing before merging:


Potential Bug: Double-unlink in checkOutputDirectoryHealth happy path

In the finally block, fsPromises.unlink(testFile) is called whenever fileWritten === true, including the happy path where the inner async function already ran unlink successfully:

// Inner async:
await fsPromises.writeFile(testFile, 'healthcheck');
fileWritten = true;
await fsPromises.unlink(testFile);   // ← succeeds

// finally block always runs:
if (fileWritten) {
  fsPromises.unlink(testFile).catch(() => {}); // ← tries again on an already-deleted file
}

The ENOENT is swallowed by .catch(() => {}) so this is harmless, but the intent is muddied. The finally cleanup should only be a safety net when something went wrong. One option:

let fileWritten = false;
let fileDeleted = false;
// ...
await fsPromises.writeFile(testFile, 'healthcheck');
fileWritten = true;
await fsPromises.unlink(testFile);
fileDeleted = true;
// ...
// finally:
if (fileWritten && !fileDeleted) {
  fsPromises.unlink(testFile).catch(() => {});
}

Test/Implementation alignment concern: "no metadata" archive removal

The test 'should remove failed videos from archive that have no metadata' mocks processVideoMetadata to return [] and expects removeVideoFromArchive to be called with 'nodata123'. But the new archive-removal code only iterates failedVideosList:

if (!allowRedownload && failedVideosList.length > 0) {
  for (const failedVideo of failedVideosList) { ... }
}

If failedVideosList is derived from processVideoMetadata results, an empty metadata result means an empty list, and the loop body never runs. Either the test documents an assumption about how failedVideosList is constructed that isn't visible in this diff (in which case a comment would help), or this test will fail once run against the real code. Worth verifying before merge.


Stack trace dropped in error logging

The existing error log:

logger.error({ error, videoDirectory }, '[Post-Process] ERROR during move operation');

passed the raw error object. The replacement:

logger.error({ error: error.message, code: error.code, syscall: error.syscall, src: ... }, ...)

logs explicit fields but loses the stack trace. For production debugging this is a regression. Most Pino-based loggers serialise err specially — using { err: error, code: error.code, ... } would retain the stack while also making code and syscall explicit top-level fields.


Minor: require calls inside methods

checkOutputDirectoryHealth does:

const fsPromises = require('fs').promises;
const crypto = require('crypto');

These are resolved on every call. They're cached by Node's module system so there's no performance issue, but placing them at the top of the file is the conventional pattern and makes dependencies easier to audit.

Similarly, require('../archiveModule') inside the archive-removal block is fine (Node caches it) but inconsistent with how other modules are imported.


Things done well

  • The pre-flight health check design is solid: write+delete probe with a configurable timeout, using crypto.randomUUID() to avoid filename collisions, and a separate timeout Promise via Promise.race to detect hung mounts. This cleanly solves the original issue.
  • Distinguishing between "explicitly failed" (yt-dlp stdout ERROR) and "stat-only failed" videos before removing from archive is a subtle and correct call — prevents spurious re-downloads when NFS stat latency is the real cause.
  • The moveWithRetries for-loop refactor removes a dead throw lastError that could never be reached. Clean.
  • The async, timeout-guarded diagnostic stat in the catch block is a good improvement over a synchronous call that would hang on a stale mount.
  • The CONFIG.md NFS section is thorough and actionable — the Docker native NFS volume example in particular is good advice.
  • Test coverage for checkOutputDirectoryHealth is comprehensive, including the hung-filesystem timeout case with fake timers.

@dialmaster dialmaster merged commit 675c2f1 into dev Feb 21, 2026
6 checks passed
@dialmaster dialmaster mentioned this pull request Mar 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants