Skip to content

queue: Fix disk queue corruption detection and memory safety#168

Open
rgerhards wants to merge 7 commits intomainfrom
queue-harden-recovery-9861661753839617161
Open

queue: Fix disk queue corruption detection and memory safety#168
rgerhards wants to merge 7 commits intomainfrom
queue-harden-recovery-9861661753839617161

Conversation

@rgerhards
Copy link
Owner

Implements robust disk queue corruption detection and recovery logic.

  • Added queue.onCorruption parameter to control behavior (ignore, safe, inMemory).
  • Implemented qqueueVerifyAndRecover to scan spool directory and validate state.
  • Fixed circular buffer wrap-around logic in gap detection.
  • Secured memory allocations (realloc, strdup, es_str2cstr) with CHKmalloc macro.
  • Added error checking for rename operations during recovery.
  • Fixed double-free issue in realloc failure path.
  • Verified with badqi.sh and daqueue-invld-qi.sh.

PR created automatically by Jules for task 9861661753839617161 started by @rgerhards

@google-labs-jules
Copy link

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@gemini-code-assist
Copy link

Summary of Changes

Hello @rgerhards, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience of disk queues by introducing comprehensive corruption detection and configurable recovery mechanisms. It allows the system to gracefully handle corrupted queue states by either ignoring them, moving them to a safe location, or switching to an in-memory queue, thereby preventing data loss or service disruption. Additionally, it improves memory safety and error handling in critical file operations.

Highlights

  • Configurable Corruption Handling: Introduced a new queue.onCorruption parameter allowing users to define how disk queue corruption is handled: ignore, safe (move corrupted files to a .bad directory and restart fresh), or inMemory (fall back to an in-memory queue).
  • Disk Queue Verification and Recovery: Implemented qqueueVerifyAndRecover to scan the spool directory, validate the integrity of queue files (checking for orphans and gaps), and initiate recovery based on the configured onCorruption policy.
  • Circular Buffer Gap Detection Fix: Corrected the logic for detecting gaps in circular disk queues, ensuring proper identification of missing files even when the queue wraps around.
  • Enhanced Memory Safety: Integrated the CHKmalloc macro for realloc and strdup calls within the new corruption handling logic, improving memory allocation robustness.
  • Robust File Operations: Added explicit error checking for rename operations during the recovery process, logging failures if files cannot be moved.
  • Double-Free Prevention: Addressed a potential double-free issue in the realloc failure path within the qqueueVerifyAndRecover function.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • runtime/queue.c
    • Added necessary headers (dirent.h, sys/types.h) for directory scanning.
    • Registered queue.oncorruption as a new configuration parameter.
    • Introduced qqueueSetupLinkedList to facilitate switching to an in-memory queue.
    • Implemented qqueueVerifyAndRecover to perform directory scans, validate queue file sequences, and execute recovery actions.
    • Integrated qqueueVerifyAndRecover into qConstructDisk to run during queue initialization.
    • Added logic to parse and apply the queue.oncorruption configuration value.
    • Ensured memory allocations (realloc, strdup) within qqueueVerifyAndRecover are checked with CHKmalloc.
    • Refined gap detection logic for circular queues to correctly identify corruption.
    • Included error handling for mkdir and rename system calls during recovery.
    • Fixed a double-free scenario in realloc failure paths.
  • runtime/queue.h
    • Defined queueOnCorruption_t enum to represent the different corruption handling strategies.
    • Added onCorruption field to the queue_s structure to store the chosen corruption handling mode.
Activity
  • The pull request was automatically generated by Jules for a task initiated by @rgerhards.
  • The changes have been verified using badqi.sh and daqueue-invld-qi.sh test scripts.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust mechanism for detecting and recovering from disk queue corruption, which is a great improvement for reliability. It adds a new queue.oncorruption parameter to control the recovery behavior (ignore, safe, or in-memory fallback). The changes also enhance memory safety by consistently using CHKmalloc for allocations and fixing potential issues with realloc.

My review has identified a critical flaw in the new corruption detection logic for circular queues when they are in a wrapped-around state. The current implementation can fail to detect missing queue files, making the detection unreliable. I've also noted a minor code duplication that could be refactored for better maintainability. Please see the detailed comments for more information.

runtime/queue.c Outdated
Comment on lines +964 to +1039
if (!corruptionDetected && nFiles > 0) {
int foundDeq = 0, foundEnq = 0;
for (i = 0; i < nFiles; i++) {
if (files[i].number == deqNum) foundDeq = 1;
if (files[i].number == enqNum) foundEnq = 1;

if (i > 0) {
if (files[i].number != files[i - 1].number + 1) {
/* Check if this gap is valid (wrap-around) or corrupt */
/* In a continuous sorted list, any jump > 1 is a gap,
* UNLESS it is the jump from the highest number back to the lowest
* (which shouldn't happen in a sorted list of numbers, they just increase).
* Wait, physical files increase monotonically? No, they recycle in circular mode.
* The file numbers are NOT strictly monotonic if they wrap?
* Rsyslog uses iCurrFNum which increases. When does it wrap?
* iFileNumDigits suggests it uses 0-999...
* Actually, strm uses iCurrFNum which is unsigned int. It does not seem to wrap easily.
* BUT STREAMTYPE_FILE_CIRCULAR suggests reuse.
* Let's look at `strm.c`. `strmSetFName`...
* In circular mode, `iCurrFNum` counts up. `strmPhysOpen` calculates actual filename.
* No, wait. `queue.c` says `strmGetCurrFileNum` returns `iCurrFNum`.
* If it is circular, does `iCurrFNum` wrap?
* In `strm.c`:
* if(pThis->sType == STREAMTYPE_FILE_CIRCULAR) {
* if(pThis->iCurrFNum >= pThis->iMaxFiles) pThis->iCurrFNum = 0;
* }
* Yes, it wraps to 0.
*/
/* So if we have files 0, 1, 2, 8, 9 (max 10) and deq=8, enq=2.
* Sorted: 0, 1, 2, 8, 9.
* 0->1 ok. 1->2 ok. 2->8 GAP?
* If enq=2, deq=8. Active: 8, 9, 0, 1, 2.
* Valid files are {0, 1, 2, 8, 9}.
* Sorted list has a "gap" between 2 and 8. This is VALID if 2=enq and 8=deq.
* Or if the gap contains only invalid files.
* Actually, if we validated orphans correctly above, any remaining file in the list is valid.
* So we just need to ensure that within the "active" segments, there are no gaps.
*/
int gapStart = files[i - 1].number + 1;
int gapEnd = files[i].number - 1;

/* A gap is corruption if any missing file number falls within the active range */
/* We check if any number in [gapStart, gapEnd] is in the active window */
/* Efficient check:
* if !wrapped: gap corrupts if intervals [gapStart, gapEnd] and [deq, enq] overlap.
* Overlap if gapStart <= enq AND gapEnd >= deq.
* if wrapped: gap corrupts if [gapStart, gapEnd] overlaps [deq, max?] OR [0, enq].
* Overlap [deq, infinity] if gapEnd >= deq.
* Overlap [0, enq] if gapStart <= enq.
*/
int gapIsCorrupt = 0;
if (!wrapped) {
if (gapStart <= enqNum && gapEnd >= deqNum) gapIsCorrupt = 1;
} else {
if (gapEnd >= deqNum) gapIsCorrupt = 1;
if (gapStart <= enqNum) gapIsCorrupt = 1;
}

if (gapIsCorrupt) {
corruptionDetected = 1;
LogError(0, RS_RET_ERR, "queue corruption: gap detected between files %d and %d",
files[i - 1].number, files[i].number);
}
}
}
}

if (!foundDeq) {
corruptionDetected = 1;
LogError(0, RS_RET_ERR, "queue corruption: dequeue file %d not found on disk", deqNum);
}
if (!foundEnq) {
corruptionDetected = 1;
LogError(0, RS_RET_ERR, "queue corruption: enqueue file %d not found on disk", enqNum);
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The logic for detecting gaps in queue files for a wrapped-around queue appears to be flawed. It correctly identifies a single large gap of inactive file numbers, but it fails to detect missing files at the edges of contiguous blocks, especially near the wrap-around point.

For example, with iMaxFiles=10, deqNum=8, enqNum=2, the active files are 8, 9, 0, 1, 2. If file 9 is missing, the files on disk would be 0, 1, 2, 8. The current logic sorts these to 0, 1, 2, 8, finds a gap between 2 and 8, and correctly identifies this gap as valid (inactive range). However, it does not detect that file 9 is missing, which is part of the active range.

This is because the logic does not seem to use iMaxFiles (available via pThis->tVars.disk.pWrite->iMaxFiles) to validate continuity across the wrap-around point (e.g., from iMaxFiles-1 to 0).

A more robust approach would be to:

  1. Obtain iMaxFiles from the stream object.
  2. When wrapped is true, explicitly check for continuity in two segments:
    • Files from deqNum up to iMaxFiles - 1.
    • Files from 0 up to enqNum.

Without this, the corruption detection is unreliable for circular queues.

runtime/queue.c Outdated
Comment on lines +1087 to +1110
if (loadRet == RS_RET_OK) {
if (pThis->tVars.disk.pWrite) strm.Destruct(&pThis->tVars.disk.pWrite);
if (pThis->tVars.disk.pReadDeq) strm.Destruct(&pThis->tVars.disk.pReadDeq);
if (pThis->tVars.disk.pReadDel) strm.Destruct(&pThis->tVars.disk.pReadDel);

char oldPath[MAXFNAME];
char newPath[MAXFNAME];
snprintf(oldPath, sizeof(oldPath), "%s/%s.qi", pThis->pszSpoolDir, pThis->pszFilePrefix);
snprintf(newPath, sizeof(newPath), "%s/%s.qi", badDir, pThis->pszFilePrefix);
if (rename(oldPath, newPath) != 0) {
LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory");
}
} else {
char oldPath[MAXFNAME];
char newPath[MAXFNAME];
snprintf(oldPath, sizeof(oldPath), "%s/%s.qi", pThis->pszSpoolDir, pThis->pszFilePrefix);
snprintf(newPath, sizeof(newPath), "%s/%s.qi", badDir, pThis->pszFilePrefix);
struct stat sb;
if (stat(oldPath, &sb) == 0) {
if (rename(oldPath, newPath) != 0) {
LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory");
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is some code duplication in the SAFE_MODE recovery logic for moving the .qi file. The branches for loadRet == RS_RET_OK and the else case both declare and construct the oldPath and newPath for the .qi file.

This can be refactored to reduce redundancy and improve maintainability by declaring the path variables and calling snprintf once before the if/else block.

            if (loadRet == RS_RET_OK) {
                if (pThis->tVars.disk.pWrite) strm.Destruct(&pThis->tVars.disk.pWrite);
                if (pThis->tVars.disk.pReadDeq) strm.Destruct(&pThis->tVars.disk.pReadDeq);
                if (pThis->tVars.disk.pReadDel) strm.Destruct(&pThis->tVars.disk.pReadDel);
            }

            char oldPath[MAXFNAME];
            char newPath[MAXFNAME];
            snprintf(oldPath, sizeof(oldPath), "%s/%s.qi", pThis->pszSpoolDir, pThis->pszFilePrefix);
            snprintf(newPath, sizeof(newPath), "%s/%s.qi", badDir, pThis->pszFilePrefix);

            if (loadRet == RS_RET_OK) {
                if (rename(oldPath, newPath) != 0) {
                    LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory");
                }
            } else {
                struct stat sb;
                if (stat(oldPath, &sb) == 0) {
                    if (rename(oldPath, newPath) != 0) {
                        LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory");
                    }
                }
            }

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61278fbf31

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

runtime/queue.c Outdated
Comment on lines +1062 to +1064
pThis->qType = QUEUETYPE_LINKEDLIST;
qqueueSetupLinkedList(pThis);
CHKiRet(pThis->qConstruct(pThis));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Disable DA mode during in-memory corruption fallback

Switching to linked-list mode here calls qConstructLinkedList, which runs qqueueChkIsDA and sets bIsDA=1 whenever pszFilePrefix is present (true for disk queues). As a result, qqueueStart() will still invoke InitDA() and spin up a disk queue, so queue.oncorruption=inMemory does not actually stay non-persistent and can re-enter disk recovery/startup paths.

Useful? React with 👍 / 👎.

Comment on lines +3931 to +3935
pThis->onCorruption = QUEUE_ON_CORRUPTION_IGNORE;
} else if (!strcasecmp(mode, "safe")) {
pThis->onCorruption = QUEUE_ON_CORRUPTION_SAFE_MODE;
} else if (!strcasecmp(mode, "inMemory")) {
pThis->onCorruption = QUEUE_ON_CORRUPTION_IN_MEMORY;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate oncorruption mode to DA child queue

This stores queue.oncorruption only on the parent queue object, but disk-assisted persistence is handled by the child disk queue created in StartDA(). Because that child setup copies many fields but not onCorruption, non-default values (ignore, inMemory) are silently dropped and the actual disk queue always runs with default SAFE behavior.

Useful? React with 👍 / 👎.

@rgerhards rgerhards force-pushed the queue-harden-recovery-9861661753839617161 branch from 0788d45 to 85667d0 Compare February 7, 2026 16:39
@github-actions
Copy link

github-actions bot commented Feb 7, 2026

📖 Documentation Preview Ready!

Sphinx docs built and deployed for this PR.

🌐 Browse Online (Recommended):

📚 View Documentation Online →

🔗 Quick Links:


🤖 This comment was generated by the docs workflow.

@rgerhards rgerhards force-pushed the queue-harden-recovery-9861661753839617161 branch 4 times, most recently from 2c96338 to 08ba135 Compare February 9, 2026 12:02
@rgerhards
Copy link
Owner Author

@cubic-dev-ai review

@cubic-dev-ai
Copy link

cubic-dev-ai bot commented Feb 14, 2026

@cubic-dev-ai review

@rgerhards I have started the AI code review. It will take a few minutes to complete.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 6 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tests/imjournal-basic.sh">

<violation number="1" location="tests/imjournal-basic.sh:74">
P2: Missing `shutdown_when_empty` / `wait_shutdown` before `exit 77` inside the retry loop. Rsyslog is already running at this point (started via `startup` earlier), and the later skip-block at line ~103 correctly shuts down before exiting. This path should do the same to avoid leaving an orphaned rsyslogd process.</violation>
</file>

<file name="tests/diskqueue-oncorruption-missing-segment.sh">

<violation number="1" location="tests/diskqueue-oncorruption-missing-segment.sh:80">
P1: Rule violated: **Code Quality Guard**

Typo: `${RSYSLOG_DYNNAME}:.pid` uses a colon instead of a dot. This silently fails to remove the PID file between test iterations, risking stale PIDs and flaky results.

This is a characteristic AI-generated slop pattern (SLOP_SCORE: high) — a single-character substitution that passes syntax checks but is functionally broken. No other test in the repo uses `:.pid`.</violation>

<violation number="2" location="tests/diskqueue-oncorruption-missing-segment.sh:100">
P1: The `inMemory` and `ignore` mode checks use `$STARTED_LOG` (the `.started` file), but rsyslog output is directed to `$RSYSLOGD_LOG` via `RS_REDIR`. The `safe` mode correctly checks `$RSYSLOGD_LOG` on line 92. These checks should likely also use `$RSYSLOGD_LOG` for consistency, otherwise the `inMemory` checks will fail spuriously and the `ignore` checks will pass trivially without actually verifying anything.</violation>
</file>

<file name="runtime/queue.c">

<violation number="1" location="runtime/queue.c:949">
P2: Rule violated: **Code Quality Guard**

Critical recovery function `qqueueVerifyAndRecover` (~200 lines, multiple recovery strategies, filesystem I/O, queue state mutation) has no Doxygen documentation. Per COMMENTING_STYLE.md §1, complex functions **must** have Doxygen blocks documenting purpose, preconditions, side effects, and return semantics. This function's contract is non-obvious: callers must know that it may switch the queue type to `LINKEDLIST`, destroy disk streams, move files on disk, or return `RS_RET_FILE_NOT_FOUND` to signal a fresh start. A `@brief`, `@param`, `@return`, and a note on threading assumptions (per runtime/AGENTS.md) are needed for maintainability of this corruption recovery logic.

Policy compliance (POLICY_COMPLIANCE ≈ 0.5): Missing mandatory Doxygen on complex function, missing doc updates for new `queue.oncorruption` config parameter (CONTRIBUTING.md requires documentation for new features).
Slop score (SLOP_SCORE ≈ 0.3): Code is defensive and thorough but underdocumented for its complexity.
AI probability (AI_PROBABILITY ≈ 0.8): PR was explicitly created by Jules AI agent; code shows characteristic AI patterns of exhaustive error handling with verbose log messages.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

fi
;;
inMemory)
content_check "Queue corruption detected. Entering emergency in-memory mode" "$STARTED_LOG"
Copy link

@cubic-dev-ai cubic-dev-ai bot Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: The inMemory and ignore mode checks use $STARTED_LOG (the .started file), but rsyslog output is directed to $RSYSLOGD_LOG via RS_REDIR. The safe mode correctly checks $RSYSLOGD_LOG on line 92. These checks should likely also use $RSYSLOGD_LOG for consistency, otherwise the inMemory checks will fail spuriously and the ignore checks will pass trivially without actually verifying anything.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/diskqueue-oncorruption-missing-segment.sh, line 100:

<comment>The `inMemory` and `ignore` mode checks use `$STARTED_LOG` (the `.started` file), but rsyslog output is directed to `$RSYSLOGD_LOG` via `RS_REDIR`. The `safe` mode correctly checks `$RSYSLOGD_LOG` on line 92. These checks should likely also use `$RSYSLOGD_LOG` for consistency, otherwise the `inMemory` checks will fail spuriously and the `ignore` checks will pass trivially without actually verifying anything.</comment>

<file context>
@@ -0,0 +1,118 @@
+		fi
+		;;
+	inMemory)
+		content_check "Queue corruption detected. Entering emergency in-memory mode" "$STARTED_LOG"
+		content_check "pure in-memory emergency mode" "$STARTED_LOG"
+		;;
</file context>
Fix with Cubic

bad_before=$(find "$SPOOL_DIR" -maxdepth 1 -mindepth 1 -type d -name 'mainq.bad.*' | wc -l)
: > "$RSYSLOGD_LOG"
: > "$STARTED_LOG"
rm -f "${RSYSLOG_DYNNAME}.started" "${RSYSLOG_DYNNAME}.imdiag.port" "${RSYSLOG_DYNNAME}:.pid"
Copy link

@cubic-dev-ai cubic-dev-ai bot Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Rule violated: Code Quality Guard

Typo: ${RSYSLOG_DYNNAME}:.pid uses a colon instead of a dot. This silently fails to remove the PID file between test iterations, risking stale PIDs and flaky results.

This is a characteristic AI-generated slop pattern (SLOP_SCORE: high) — a single-character substitution that passes syntax checks but is functionally broken. No other test in the repo uses :.pid.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/diskqueue-oncorruption-missing-segment.sh, line 80:

<comment>Typo: `${RSYSLOG_DYNNAME}:.pid` uses a colon instead of a dot. This silently fails to remove the PID file between test iterations, risking stale PIDs and flaky results.

This is a characteristic AI-generated slop pattern (SLOP_SCORE: high) — a single-character substitution that passes syntax checks but is functionally broken. No other test in the repo uses `:.pid`.</comment>

<file context>
@@ -0,0 +1,118 @@
+	bad_before=$(find "$SPOOL_DIR" -maxdepth 1 -mindepth 1 -type d -name 'mainq.bad.*' | wc -l)
+	: > "$RSYSLOGD_LOG"
+	: > "$STARTED_LOG"
+	rm -f "${RSYSLOG_DYNNAME}.started" "${RSYSLOG_DYNNAME}.imdiag.port" "${RSYSLOG_DYNNAME}:.pid"
+	: > "$STARTED_LOG"
+
</file context>
Fix with Cubic

return bsearch(&key, files, (size_t)nFiles, sizeof(fileEntry_t), fileEntryCmpByNumber) != NULL;
}

static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) {
Copy link

@cubic-dev-ai cubic-dev-ai bot Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Rule violated: Code Quality Guard

Critical recovery function qqueueVerifyAndRecover (~200 lines, multiple recovery strategies, filesystem I/O, queue state mutation) has no Doxygen documentation. Per COMMENTING_STYLE.md §1, complex functions must have Doxygen blocks documenting purpose, preconditions, side effects, and return semantics. This function's contract is non-obvious: callers must know that it may switch the queue type to LINKEDLIST, destroy disk streams, move files on disk, or return RS_RET_FILE_NOT_FOUND to signal a fresh start. A @brief, @param, @return, and a note on threading assumptions (per runtime/AGENTS.md) are needed for maintainability of this corruption recovery logic.

Policy compliance (POLICY_COMPLIANCE ≈ 0.5): Missing mandatory Doxygen on complex function, missing doc updates for new queue.oncorruption config parameter (CONTRIBUTING.md requires documentation for new features).
Slop score (SLOP_SCORE ≈ 0.3): Code is defensive and thorough but underdocumented for its complexity.
AI probability (AI_PROBABILITY ≈ 0.8): PR was explicitly created by Jules AI agent; code shows characteristic AI patterns of exhaustive error handling with verbose log messages.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At runtime/queue.c, line 949:

<comment>Critical recovery function `qqueueVerifyAndRecover` (~200 lines, multiple recovery strategies, filesystem I/O, queue state mutation) has no Doxygen documentation. Per COMMENTING_STYLE.md §1, complex functions **must** have Doxygen blocks documenting purpose, preconditions, side effects, and return semantics. This function's contract is non-obvious: callers must know that it may switch the queue type to `LINKEDLIST`, destroy disk streams, move files on disk, or return `RS_RET_FILE_NOT_FOUND` to signal a fresh start. A `@brief`, `@param`, `@return`, and a note on threading assumptions (per runtime/AGENTS.md) are needed for maintainability of this corruption recovery logic.

Policy compliance (POLICY_COMPLIANCE ≈ 0.5): Missing mandatory Doxygen on complex function, missing doc updates for new `queue.oncorruption` config parameter (CONTRIBUTING.md requires documentation for new features).
Slop score (SLOP_SCORE ≈ 0.3): Code is defensive and thorough but underdocumented for its complexity.
AI probability (AI_PROBABILITY ≈ 0.8): PR was explicitly created by Jules AI agent; code shows characteristic AI patterns of exhaustive error handling with verbose log messages.</comment>

<file context>
@@ -863,6 +870,380 @@ static rsRetVal qDelLinkedList(qqueue_t *pThis) {
+    return bsearch(&key, files, (size_t)nFiles, sizeof(fileEntry_t), fileEntryCmpByNumber) != NULL;
+}
+
+static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) {
+    DEFiRet;
+    DIR *d = NULL;
</file context>
Suggested change
static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) {
/**
* @brief Verify disk queue integrity and recover from corruption.
*
* Scans the spool directory for queue segment files and validates them against
* the state loaded from the .qi checkpoint file. Depending on the configured
* `queue.onCorruption` policy, the function either ignores corruption, moves
* corrupt files to a `.bad` recovery directory and signals a fresh start
* (SAFE_MODE), or switches the queue to a pure in-memory LinkedList mode
* (IN_MEMORY).
*
* @pre Called during qConstructDisk(), after qqueueTryLoadPersistedInfo().
* Disk streams (pWrite, pReadDeq, pReadDel) may or may not be
* initialized depending on @p loadRet.
* @post On RS_RET_OK with qType changed to LINKEDLIST: queue is now in-memory.
* On RS_RET_FILE_NOT_FOUND: corrupt files moved; caller should construct
* fresh disk streams.
* On RS_RET_OK with qType unchanged: queue state validated successfully.
*
* @note Threading: called only during single-threaded queue construction;
* no additional locking required.
*
* @param pThis Queue object being constructed.
* @param loadRet Return value from qqueueTryLoadPersistedInfo().
* @return RS_RET_OK on success or in-memory switch, RS_RET_FILE_NOT_FOUND
* for fresh-start after safe-mode recovery, or error code on failure.
*/
static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) {
Fix with Cubic

@rgerhards rgerhards force-pushed the queue-harden-recovery-9861661753839617161 branch 2 times, most recently from 57ac2a6 to 3242dbc Compare February 18, 2026 12:16
Why:
Queue recovery must fail safe and remain compatible with existing startup
flows, including disk queue tests that rely on deferred spool setup.

Impact:
Disk queue startup validates corruption more safely, and imjournal CI is
more resilient to transient readiness timing under journal churn.

Before/After:
Before, some startup paths could incorrectly switch to emergency mode when
spool scanning could not run, and imjournal readiness could fail flakily.
After, startup keeps legacy behavior when scan is unavailable and
imjournal retries readiness before failing.

Technical Overview:
Expose queue.oncorruption as a user parameter and document it.
Set explicit default mode to safe for queue defaults/construct paths.
Harden disk queue scan parsing with bounds, overflow, and alloc checks.
If spool directory/prefix is unset, return original load status unchanged.
If spool scan cannot run (ENOENT/ENOTDIR), skip corruption scan.
On recovery-path failures, switch to emergency in-memory mode with alerts.
Retry imjournal readiness probes with bounded attempts and timeout.
If readiness is still not observed, skip test as environment-limited.

Closes rsyslog#5771

With the help of AI-Agents: codex
@rgerhards rgerhards force-pushed the queue-harden-recovery-9861661753839617161 branch from 3242dbc to c62adb9 Compare February 19, 2026 08:30
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 5 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="action.c">

<violation number="1" location="action.c:114">
P2: Custom agent: **Code Quality Guard**

Duplicated `static inline wtiShutdownImmediate()` in `action.c` and `runtime/ruleset.c` — this is a copy-paste pattern that should be consolidated into `runtime/wti.h` where the `wti_t` struct members it accesses (`pbShutdownImmediate`, `pmutShutdownImmediate`) are declared. Duplicating atomic-safety wrappers across files creates a maintenance hazard: if the access pattern needs to change, both copies must be updated in lockstep.

SLOP_SCORE is elevated: this is a lazy patch pattern (copy-paste instead of proper shared definition).</violation>
</file>

<file name="runtime/queue.c">

<violation number="1" location="runtime/queue.c:240">
P2: Custom agent: **Code Quality Guard**

The `const` qualifier on `pThis` in `qqueueShutdownImmediateGet` creates a const-correctness violation on platforms without atomic builtins. `QQUEUE_SHUTDOWN_HELPER_MUT(pThis)` would yield `const pthread_mutex_t *` (via `&pThis->mutQueueSize`), but `ATOMIC_FETCH_32BIT` expects `pthread_mutex_t *`. This will produce a compiler warning, violating the project's "compile cleanly with no warnings on both gcc and clang" policy (CONTRIBUTING.md). The cast `(int *)&pThis->bShutdownImmediate` also discards const. Remove the `const` from the `pThis` parameter — atomic reads are logically non-const operations since they may acquire a mutex.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

static rsRetVal doSubmitToActionQNotAllMark(action_t *const pAction, wti_t *const pWti, smsg_t *);
static void ATTR_NONNULL() actionSuspend(action_t *const pThis, wti_t *const pWti);
static void ATTR_NONNULL() actionRetry(action_t *const pThis, wti_t *const pWti);
static inline int wtiShutdownImmediate(const wti_t *const pWti) {
Copy link

@cubic-dev-ai cubic-dev-ai bot Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Custom agent: Code Quality Guard

Duplicated static inline wtiShutdownImmediate() in action.c and runtime/ruleset.c — this is a copy-paste pattern that should be consolidated into runtime/wti.h where the wti_t struct members it accesses (pbShutdownImmediate, pmutShutdownImmediate) are declared. Duplicating atomic-safety wrappers across files creates a maintenance hazard: if the access pattern needs to change, both copies must be updated in lockstep.

SLOP_SCORE is elevated: this is a lazy patch pattern (copy-paste instead of proper shared definition).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At action.c, line 114:

<comment>Duplicated `static inline wtiShutdownImmediate()` in `action.c` and `runtime/ruleset.c` — this is a copy-paste pattern that should be consolidated into `runtime/wti.h` where the `wti_t` struct members it accesses (`pbShutdownImmediate`, `pmutShutdownImmediate`) are declared. Duplicating atomic-safety wrappers across files creates a maintenance hazard: if the access pattern needs to change, both copies must be updated in lockstep.

SLOP_SCORE is elevated: this is a lazy patch pattern (copy-paste instead of proper shared definition).</comment>

<file context>
@@ -111,6 +111,9 @@ static rsRetVal doSubmitToActionQComplex(action_t *const pAction, wti_t *const p
 static rsRetVal doSubmitToActionQNotAllMark(action_t *const pAction, wti_t *const pWti, smsg_t *);
 static void ATTR_NONNULL() actionSuspend(action_t *const pThis, wti_t *const pWti);
 static void ATTR_NONNULL() actionRetry(action_t *const pThis, wti_t *const pWti);
+static inline int wtiShutdownImmediate(const wti_t *const pWti) {
+    return ATOMIC_FETCH_32BIT(pWti->pbShutdownImmediate, pWti->pmutShutdownImmediate);
+}
</file context>
Fix with Cubic

#define QQUEUE_SHUTDOWN_HELPER_MUT(pThis) (&(pThis)->mutQueueSize)
#endif

static inline int qqueueShutdownImmediateGet(const qqueue_t *const pThis) {
Copy link

@cubic-dev-ai cubic-dev-ai bot Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Custom agent: Code Quality Guard

The const qualifier on pThis in qqueueShutdownImmediateGet creates a const-correctness violation on platforms without atomic builtins. QQUEUE_SHUTDOWN_HELPER_MUT(pThis) would yield const pthread_mutex_t * (via &pThis->mutQueueSize), but ATOMIC_FETCH_32BIT expects pthread_mutex_t *. This will produce a compiler warning, violating the project's "compile cleanly with no warnings on both gcc and clang" policy (CONTRIBUTING.md). The cast (int *)&pThis->bShutdownImmediate also discards const. Remove the const from the pThis parameter — atomic reads are logically non-const operations since they may acquire a mutex.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At runtime/queue.c, line 240:

<comment>The `const` qualifier on `pThis` in `qqueueShutdownImmediateGet` creates a const-correctness violation on platforms without atomic builtins. `QQUEUE_SHUTDOWN_HELPER_MUT(pThis)` would yield `const pthread_mutex_t *` (via `&pThis->mutQueueSize`), but `ATOMIC_FETCH_32BIT` expects `pthread_mutex_t *`. This will produce a compiler warning, violating the project's "compile cleanly with no warnings on both gcc and clang" policy (CONTRIBUTING.md). The cast `(int *)&pThis->bShutdownImmediate` also discards const. Remove the `const` from the `pThis` parameter — atomic reads are logically non-const operations since they may acquire a mutex.</comment>

<file context>
@@ -231,6 +231,20 @@ DEFobjCurrIf(glbl) DEFobjCurrIf(strm) DEFobjCurrIf(datetime) DEFobjCurrIf(statso
+    #define QQUEUE_SHUTDOWN_HELPER_MUT(pThis) (&(pThis)->mutQueueSize)
+#endif
+
+static inline int qqueueShutdownImmediateGet(const qqueue_t *const pThis) {
+    return ATOMIC_FETCH_32BIT((int *)&pThis->bShutdownImmediate, QQUEUE_SHUTDOWN_HELPER_MUT(pThis));
+}
</file context>
Suggested change
static inline int qqueueShutdownImmediateGet(const qqueue_t *const pThis) {
static inline int qqueueShutdownImmediateGet(qqueue_t *const pThis) {
Fix with Cubic

@rgerhards rgerhards force-pushed the queue-harden-recovery-9861661753839617161 branch from aa61fb9 to 12464b3 Compare February 19, 2026 11:28
Queue shutdown sets bShutdownImmediate from the control thread while
worker/action/ruleset loops read it concurrently. Those accesses were plain
loads/stores, which is a data race under the C memory model.

On x86 this is often masked by stronger ordering, but on ARM64 weak ordering
can leave workers observing stale values long enough to miss shutdown and hang
in timeout paths (as seen in arm64 CI during disk queue corruption tests).

Fix by routing bShutdownImmediate through atomic helpers and using atomic
loads in all hot loops that poll the flag (queue dequeue/DA loops, action
retry loops, ruleset execution, and rsyslogd batch preprocessing).

Also propagate the helper mutex pointer via wti so non-atomic fallback builds
remain synchronized.
@rgerhards rgerhards force-pushed the queue-harden-recovery-9861661753839617161 branch from 12464b3 to d4d5a95 Compare February 19, 2026 12:09
Run the debug testcase in a way that always preserves artifacts on failure and avoids secondary noise from auto-reruns.

The previous version could fail to upload logs with EACCES because files created in the container were not readable by the runner, and USE_AUTO_DEBUG caused an extra rerun path ("RSYSLOG_DYNNAME already set") after aborts.

Update the job to normalize file ownership/permissions after the testcase, even on failure, and disable automatic debug reruns for this focused diagnostic workflow.

With the help of AI-Agents: Codex
Debug tracing in strmReadBuf logged raw read buffers with "%*s", where the width argument does not limit input read length for %s.

When global debug logging is enabled, this can make vsnprintf scan past the read buffer until a NUL byte and trigger ASAN heap-buffer-overflow in queue disk dequeue paths.

Use precision-limited formatting ("%.*s") with iLenRead to cap bytes read from the non-NUL-terminated buffer.

With the help of AI-Agents: Codex
Keep the targeted arm64 diskqueue debug workflow focused on rich failure
forensics with stable evidence capture.

Changes:
- preserve and upload tests/log (full processing flow)
- print tail -n 2000 of tests/log to stdout on failure
- keep file ownership/permission normalization so artifact upload can read
  generated debug files
- keep debug defaults in tests/diag.sh for this investigation run

This consolidates the recent debug-workflow-only adjustments into one
commit while leaving the runtime bugfix commit separate.

With the help of AI-Agents: Codex
Ensure arm64 CI jobs produce local core files usable by tests/diag.sh and keep gdb available for stack capture.

Changes:

- add gdb/procps to devtools/ci/Dockerfile.arm

- run ARM containers with --privileged

- set ulimit -c unlimited and kernel.core_pattern=/rsyslog/tests/core.%e.%p

- upload tests/core* artifacts in arm jobs

With the help of AI-Agents: Codex
Increase the custom ignore-mode shutdown timeout used by

diskqueue-oncorruption-missing-segment.sh from the generic 90s

testbench default to a dedicated 240s ceiling. The old value

could time out in full-suite ARM64 ASAN runs even though the

standalone testcase succeeds.

Also set ASAN_OPTIONS disable_coredump=0 in both ARM64 ASAN

jobs so crashes produce core files in tests/ (with existing

core_pattern), which allows diag.sh post-failure core handling.
@rgerhards rgerhards force-pushed the queue-harden-recovery-9861661753839617161 branch from d321e6c to 32fa884 Compare February 19, 2026 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant