queue: Fix disk queue corruption detection and memory safety#168
queue: Fix disk queue corruption detection and memory safety#168
Conversation
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
Summary of ChangesHello @rgerhards, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the resilience of disk queues by introducing comprehensive corruption detection and configurable recovery mechanisms. It allows the system to gracefully handle corrupted queue states by either ignoring them, moving them to a safe location, or switching to an in-memory queue, thereby preventing data loss or service disruption. Additionally, it improves memory safety and error handling in critical file operations. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a robust mechanism for detecting and recovering from disk queue corruption, which is a great improvement for reliability. It adds a new queue.oncorruption parameter to control the recovery behavior (ignore, safe, or in-memory fallback). The changes also enhance memory safety by consistently using CHKmalloc for allocations and fixing potential issues with realloc.
My review has identified a critical flaw in the new corruption detection logic for circular queues when they are in a wrapped-around state. The current implementation can fail to detect missing queue files, making the detection unreliable. I've also noted a minor code duplication that could be refactored for better maintainability. Please see the detailed comments for more information.
runtime/queue.c
Outdated
| if (!corruptionDetected && nFiles > 0) { | ||
| int foundDeq = 0, foundEnq = 0; | ||
| for (i = 0; i < nFiles; i++) { | ||
| if (files[i].number == deqNum) foundDeq = 1; | ||
| if (files[i].number == enqNum) foundEnq = 1; | ||
|
|
||
| if (i > 0) { | ||
| if (files[i].number != files[i - 1].number + 1) { | ||
| /* Check if this gap is valid (wrap-around) or corrupt */ | ||
| /* In a continuous sorted list, any jump > 1 is a gap, | ||
| * UNLESS it is the jump from the highest number back to the lowest | ||
| * (which shouldn't happen in a sorted list of numbers, they just increase). | ||
| * Wait, physical files increase monotonically? No, they recycle in circular mode. | ||
| * The file numbers are NOT strictly monotonic if they wrap? | ||
| * Rsyslog uses iCurrFNum which increases. When does it wrap? | ||
| * iFileNumDigits suggests it uses 0-999... | ||
| * Actually, strm uses iCurrFNum which is unsigned int. It does not seem to wrap easily. | ||
| * BUT STREAMTYPE_FILE_CIRCULAR suggests reuse. | ||
| * Let's look at `strm.c`. `strmSetFName`... | ||
| * In circular mode, `iCurrFNum` counts up. `strmPhysOpen` calculates actual filename. | ||
| * No, wait. `queue.c` says `strmGetCurrFileNum` returns `iCurrFNum`. | ||
| * If it is circular, does `iCurrFNum` wrap? | ||
| * In `strm.c`: | ||
| * if(pThis->sType == STREAMTYPE_FILE_CIRCULAR) { | ||
| * if(pThis->iCurrFNum >= pThis->iMaxFiles) pThis->iCurrFNum = 0; | ||
| * } | ||
| * Yes, it wraps to 0. | ||
| */ | ||
| /* So if we have files 0, 1, 2, 8, 9 (max 10) and deq=8, enq=2. | ||
| * Sorted: 0, 1, 2, 8, 9. | ||
| * 0->1 ok. 1->2 ok. 2->8 GAP? | ||
| * If enq=2, deq=8. Active: 8, 9, 0, 1, 2. | ||
| * Valid files are {0, 1, 2, 8, 9}. | ||
| * Sorted list has a "gap" between 2 and 8. This is VALID if 2=enq and 8=deq. | ||
| * Or if the gap contains only invalid files. | ||
| * Actually, if we validated orphans correctly above, any remaining file in the list is valid. | ||
| * So we just need to ensure that within the "active" segments, there are no gaps. | ||
| */ | ||
| int gapStart = files[i - 1].number + 1; | ||
| int gapEnd = files[i].number - 1; | ||
|
|
||
| /* A gap is corruption if any missing file number falls within the active range */ | ||
| /* We check if any number in [gapStart, gapEnd] is in the active window */ | ||
| /* Efficient check: | ||
| * if !wrapped: gap corrupts if intervals [gapStart, gapEnd] and [deq, enq] overlap. | ||
| * Overlap if gapStart <= enq AND gapEnd >= deq. | ||
| * if wrapped: gap corrupts if [gapStart, gapEnd] overlaps [deq, max?] OR [0, enq]. | ||
| * Overlap [deq, infinity] if gapEnd >= deq. | ||
| * Overlap [0, enq] if gapStart <= enq. | ||
| */ | ||
| int gapIsCorrupt = 0; | ||
| if (!wrapped) { | ||
| if (gapStart <= enqNum && gapEnd >= deqNum) gapIsCorrupt = 1; | ||
| } else { | ||
| if (gapEnd >= deqNum) gapIsCorrupt = 1; | ||
| if (gapStart <= enqNum) gapIsCorrupt = 1; | ||
| } | ||
|
|
||
| if (gapIsCorrupt) { | ||
| corruptionDetected = 1; | ||
| LogError(0, RS_RET_ERR, "queue corruption: gap detected between files %d and %d", | ||
| files[i - 1].number, files[i].number); | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (!foundDeq) { | ||
| corruptionDetected = 1; | ||
| LogError(0, RS_RET_ERR, "queue corruption: dequeue file %d not found on disk", deqNum); | ||
| } | ||
| if (!foundEnq) { | ||
| corruptionDetected = 1; | ||
| LogError(0, RS_RET_ERR, "queue corruption: enqueue file %d not found on disk", enqNum); | ||
| } | ||
| } |
There was a problem hiding this comment.
The logic for detecting gaps in queue files for a wrapped-around queue appears to be flawed. It correctly identifies a single large gap of inactive file numbers, but it fails to detect missing files at the edges of contiguous blocks, especially near the wrap-around point.
For example, with iMaxFiles=10, deqNum=8, enqNum=2, the active files are 8, 9, 0, 1, 2. If file 9 is missing, the files on disk would be 0, 1, 2, 8. The current logic sorts these to 0, 1, 2, 8, finds a gap between 2 and 8, and correctly identifies this gap as valid (inactive range). However, it does not detect that file 9 is missing, which is part of the active range.
This is because the logic does not seem to use iMaxFiles (available via pThis->tVars.disk.pWrite->iMaxFiles) to validate continuity across the wrap-around point (e.g., from iMaxFiles-1 to 0).
A more robust approach would be to:
- Obtain
iMaxFilesfrom the stream object. - When
wrappedis true, explicitly check for continuity in two segments:- Files from
deqNumup toiMaxFiles - 1. - Files from
0up toenqNum.
- Files from
Without this, the corruption detection is unreliable for circular queues.
runtime/queue.c
Outdated
| if (loadRet == RS_RET_OK) { | ||
| if (pThis->tVars.disk.pWrite) strm.Destruct(&pThis->tVars.disk.pWrite); | ||
| if (pThis->tVars.disk.pReadDeq) strm.Destruct(&pThis->tVars.disk.pReadDeq); | ||
| if (pThis->tVars.disk.pReadDel) strm.Destruct(&pThis->tVars.disk.pReadDel); | ||
|
|
||
| char oldPath[MAXFNAME]; | ||
| char newPath[MAXFNAME]; | ||
| snprintf(oldPath, sizeof(oldPath), "%s/%s.qi", pThis->pszSpoolDir, pThis->pszFilePrefix); | ||
| snprintf(newPath, sizeof(newPath), "%s/%s.qi", badDir, pThis->pszFilePrefix); | ||
| if (rename(oldPath, newPath) != 0) { | ||
| LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory"); | ||
| } | ||
| } else { | ||
| char oldPath[MAXFNAME]; | ||
| char newPath[MAXFNAME]; | ||
| snprintf(oldPath, sizeof(oldPath), "%s/%s.qi", pThis->pszSpoolDir, pThis->pszFilePrefix); | ||
| snprintf(newPath, sizeof(newPath), "%s/%s.qi", badDir, pThis->pszFilePrefix); | ||
| struct stat sb; | ||
| if (stat(oldPath, &sb) == 0) { | ||
| if (rename(oldPath, newPath) != 0) { | ||
| LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory"); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
There is some code duplication in the SAFE_MODE recovery logic for moving the .qi file. The branches for loadRet == RS_RET_OK and the else case both declare and construct the oldPath and newPath for the .qi file.
This can be refactored to reduce redundancy and improve maintainability by declaring the path variables and calling snprintf once before the if/else block.
if (loadRet == RS_RET_OK) {
if (pThis->tVars.disk.pWrite) strm.Destruct(&pThis->tVars.disk.pWrite);
if (pThis->tVars.disk.pReadDeq) strm.Destruct(&pThis->tVars.disk.pReadDeq);
if (pThis->tVars.disk.pReadDel) strm.Destruct(&pThis->tVars.disk.pReadDel);
}
char oldPath[MAXFNAME];
char newPath[MAXFNAME];
snprintf(oldPath, sizeof(oldPath), "%s/%s.qi", pThis->pszSpoolDir, pThis->pszFilePrefix);
snprintf(newPath, sizeof(newPath), "%s/%s.qi", badDir, pThis->pszFilePrefix);
if (loadRet == RS_RET_OK) {
if (rename(oldPath, newPath) != 0) {
LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory");
}
} else {
struct stat sb;
if (stat(oldPath, &sb) == 0) {
if (rename(oldPath, newPath) != 0) {
LogError(errno, RS_RET_ERR, "queue corruption: could not move .qi file to recovery directory");
}
}
}There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61278fbf31
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
runtime/queue.c
Outdated
| pThis->qType = QUEUETYPE_LINKEDLIST; | ||
| qqueueSetupLinkedList(pThis); | ||
| CHKiRet(pThis->qConstruct(pThis)); |
There was a problem hiding this comment.
Disable DA mode during in-memory corruption fallback
Switching to linked-list mode here calls qConstructLinkedList, which runs qqueueChkIsDA and sets bIsDA=1 whenever pszFilePrefix is present (true for disk queues). As a result, qqueueStart() will still invoke InitDA() and spin up a disk queue, so queue.oncorruption=inMemory does not actually stay non-persistent and can re-enter disk recovery/startup paths.
Useful? React with 👍 / 👎.
| pThis->onCorruption = QUEUE_ON_CORRUPTION_IGNORE; | ||
| } else if (!strcasecmp(mode, "safe")) { | ||
| pThis->onCorruption = QUEUE_ON_CORRUPTION_SAFE_MODE; | ||
| } else if (!strcasecmp(mode, "inMemory")) { | ||
| pThis->onCorruption = QUEUE_ON_CORRUPTION_IN_MEMORY; |
There was a problem hiding this comment.
Propagate oncorruption mode to DA child queue
This stores queue.oncorruption only on the parent queue object, but disk-assisted persistence is handled by the child disk queue created in StartDA(). Because that child setup copies many fields but not onCorruption, non-default values (ignore, inMemory) are silently dropped and the actual disk queue always runs with default SAFE behavior.
Useful? React with 👍 / 👎.
0788d45 to
85667d0
Compare
📖 Documentation Preview Ready!✅ Sphinx docs built and deployed for this PR. 🌐 Browse Online (Recommended):🔗 Quick Links:
🤖 This comment was generated by the docs workflow. |
2c96338 to
08ba135
Compare
|
@cubic-dev-ai review |
@rgerhards I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
4 issues found across 6 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="tests/imjournal-basic.sh">
<violation number="1" location="tests/imjournal-basic.sh:74">
P2: Missing `shutdown_when_empty` / `wait_shutdown` before `exit 77` inside the retry loop. Rsyslog is already running at this point (started via `startup` earlier), and the later skip-block at line ~103 correctly shuts down before exiting. This path should do the same to avoid leaving an orphaned rsyslogd process.</violation>
</file>
<file name="tests/diskqueue-oncorruption-missing-segment.sh">
<violation number="1" location="tests/diskqueue-oncorruption-missing-segment.sh:80">
P1: Rule violated: **Code Quality Guard**
Typo: `${RSYSLOG_DYNNAME}:.pid` uses a colon instead of a dot. This silently fails to remove the PID file between test iterations, risking stale PIDs and flaky results.
This is a characteristic AI-generated slop pattern (SLOP_SCORE: high) — a single-character substitution that passes syntax checks but is functionally broken. No other test in the repo uses `:.pid`.</violation>
<violation number="2" location="tests/diskqueue-oncorruption-missing-segment.sh:100">
P1: The `inMemory` and `ignore` mode checks use `$STARTED_LOG` (the `.started` file), but rsyslog output is directed to `$RSYSLOGD_LOG` via `RS_REDIR`. The `safe` mode correctly checks `$RSYSLOGD_LOG` on line 92. These checks should likely also use `$RSYSLOGD_LOG` for consistency, otherwise the `inMemory` checks will fail spuriously and the `ignore` checks will pass trivially without actually verifying anything.</violation>
</file>
<file name="runtime/queue.c">
<violation number="1" location="runtime/queue.c:949">
P2: Rule violated: **Code Quality Guard**
Critical recovery function `qqueueVerifyAndRecover` (~200 lines, multiple recovery strategies, filesystem I/O, queue state mutation) has no Doxygen documentation. Per COMMENTING_STYLE.md §1, complex functions **must** have Doxygen blocks documenting purpose, preconditions, side effects, and return semantics. This function's contract is non-obvious: callers must know that it may switch the queue type to `LINKEDLIST`, destroy disk streams, move files on disk, or return `RS_RET_FILE_NOT_FOUND` to signal a fresh start. A `@brief`, `@param`, `@return`, and a note on threading assumptions (per runtime/AGENTS.md) are needed for maintainability of this corruption recovery logic.
Policy compliance (POLICY_COMPLIANCE ≈ 0.5): Missing mandatory Doxygen on complex function, missing doc updates for new `queue.oncorruption` config parameter (CONTRIBUTING.md requires documentation for new features).
Slop score (SLOP_SCORE ≈ 0.3): Code is defensive and thorough but underdocumented for its complexity.
AI probability (AI_PROBABILITY ≈ 0.8): PR was explicitly created by Jules AI agent; code shows characteristic AI patterns of exhaustive error handling with verbose log messages.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| fi | ||
| ;; | ||
| inMemory) | ||
| content_check "Queue corruption detected. Entering emergency in-memory mode" "$STARTED_LOG" |
There was a problem hiding this comment.
P1: The inMemory and ignore mode checks use $STARTED_LOG (the .started file), but rsyslog output is directed to $RSYSLOGD_LOG via RS_REDIR. The safe mode correctly checks $RSYSLOGD_LOG on line 92. These checks should likely also use $RSYSLOGD_LOG for consistency, otherwise the inMemory checks will fail spuriously and the ignore checks will pass trivially without actually verifying anything.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/diskqueue-oncorruption-missing-segment.sh, line 100:
<comment>The `inMemory` and `ignore` mode checks use `$STARTED_LOG` (the `.started` file), but rsyslog output is directed to `$RSYSLOGD_LOG` via `RS_REDIR`. The `safe` mode correctly checks `$RSYSLOGD_LOG` on line 92. These checks should likely also use `$RSYSLOGD_LOG` for consistency, otherwise the `inMemory` checks will fail spuriously and the `ignore` checks will pass trivially without actually verifying anything.</comment>
<file context>
@@ -0,0 +1,118 @@
+ fi
+ ;;
+ inMemory)
+ content_check "Queue corruption detected. Entering emergency in-memory mode" "$STARTED_LOG"
+ content_check "pure in-memory emergency mode" "$STARTED_LOG"
+ ;;
</file context>
| bad_before=$(find "$SPOOL_DIR" -maxdepth 1 -mindepth 1 -type d -name 'mainq.bad.*' | wc -l) | ||
| : > "$RSYSLOGD_LOG" | ||
| : > "$STARTED_LOG" | ||
| rm -f "${RSYSLOG_DYNNAME}.started" "${RSYSLOG_DYNNAME}.imdiag.port" "${RSYSLOG_DYNNAME}:.pid" |
There was a problem hiding this comment.
P1: Rule violated: Code Quality Guard
Typo: ${RSYSLOG_DYNNAME}:.pid uses a colon instead of a dot. This silently fails to remove the PID file between test iterations, risking stale PIDs and flaky results.
This is a characteristic AI-generated slop pattern (SLOP_SCORE: high) — a single-character substitution that passes syntax checks but is functionally broken. No other test in the repo uses :.pid.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/diskqueue-oncorruption-missing-segment.sh, line 80:
<comment>Typo: `${RSYSLOG_DYNNAME}:.pid` uses a colon instead of a dot. This silently fails to remove the PID file between test iterations, risking stale PIDs and flaky results.
This is a characteristic AI-generated slop pattern (SLOP_SCORE: high) — a single-character substitution that passes syntax checks but is functionally broken. No other test in the repo uses `:.pid`.</comment>
<file context>
@@ -0,0 +1,118 @@
+ bad_before=$(find "$SPOOL_DIR" -maxdepth 1 -mindepth 1 -type d -name 'mainq.bad.*' | wc -l)
+ : > "$RSYSLOGD_LOG"
+ : > "$STARTED_LOG"
+ rm -f "${RSYSLOG_DYNNAME}.started" "${RSYSLOG_DYNNAME}.imdiag.port" "${RSYSLOG_DYNNAME}:.pid"
+ : > "$STARTED_LOG"
+
</file context>
| return bsearch(&key, files, (size_t)nFiles, sizeof(fileEntry_t), fileEntryCmpByNumber) != NULL; | ||
| } | ||
|
|
||
| static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) { |
There was a problem hiding this comment.
P2: Rule violated: Code Quality Guard
Critical recovery function qqueueVerifyAndRecover (~200 lines, multiple recovery strategies, filesystem I/O, queue state mutation) has no Doxygen documentation. Per COMMENTING_STYLE.md §1, complex functions must have Doxygen blocks documenting purpose, preconditions, side effects, and return semantics. This function's contract is non-obvious: callers must know that it may switch the queue type to LINKEDLIST, destroy disk streams, move files on disk, or return RS_RET_FILE_NOT_FOUND to signal a fresh start. A @brief, @param, @return, and a note on threading assumptions (per runtime/AGENTS.md) are needed for maintainability of this corruption recovery logic.
Policy compliance (POLICY_COMPLIANCE ≈ 0.5): Missing mandatory Doxygen on complex function, missing doc updates for new queue.oncorruption config parameter (CONTRIBUTING.md requires documentation for new features).
Slop score (SLOP_SCORE ≈ 0.3): Code is defensive and thorough but underdocumented for its complexity.
AI probability (AI_PROBABILITY ≈ 0.8): PR was explicitly created by Jules AI agent; code shows characteristic AI patterns of exhaustive error handling with verbose log messages.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At runtime/queue.c, line 949:
<comment>Critical recovery function `qqueueVerifyAndRecover` (~200 lines, multiple recovery strategies, filesystem I/O, queue state mutation) has no Doxygen documentation. Per COMMENTING_STYLE.md §1, complex functions **must** have Doxygen blocks documenting purpose, preconditions, side effects, and return semantics. This function's contract is non-obvious: callers must know that it may switch the queue type to `LINKEDLIST`, destroy disk streams, move files on disk, or return `RS_RET_FILE_NOT_FOUND` to signal a fresh start. A `@brief`, `@param`, `@return`, and a note on threading assumptions (per runtime/AGENTS.md) are needed for maintainability of this corruption recovery logic.
Policy compliance (POLICY_COMPLIANCE ≈ 0.5): Missing mandatory Doxygen on complex function, missing doc updates for new `queue.oncorruption` config parameter (CONTRIBUTING.md requires documentation for new features).
Slop score (SLOP_SCORE ≈ 0.3): Code is defensive and thorough but underdocumented for its complexity.
AI probability (AI_PROBABILITY ≈ 0.8): PR was explicitly created by Jules AI agent; code shows characteristic AI patterns of exhaustive error handling with verbose log messages.</comment>
<file context>
@@ -863,6 +870,380 @@ static rsRetVal qDelLinkedList(qqueue_t *pThis) {
+ return bsearch(&key, files, (size_t)nFiles, sizeof(fileEntry_t), fileEntryCmpByNumber) != NULL;
+}
+
+static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) {
+ DEFiRet;
+ DIR *d = NULL;
</file context>
| static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) { | |
| /** | |
| * @brief Verify disk queue integrity and recover from corruption. | |
| * | |
| * Scans the spool directory for queue segment files and validates them against | |
| * the state loaded from the .qi checkpoint file. Depending on the configured | |
| * `queue.onCorruption` policy, the function either ignores corruption, moves | |
| * corrupt files to a `.bad` recovery directory and signals a fresh start | |
| * (SAFE_MODE), or switches the queue to a pure in-memory LinkedList mode | |
| * (IN_MEMORY). | |
| * | |
| * @pre Called during qConstructDisk(), after qqueueTryLoadPersistedInfo(). | |
| * Disk streams (pWrite, pReadDeq, pReadDel) may or may not be | |
| * initialized depending on @p loadRet. | |
| * @post On RS_RET_OK with qType changed to LINKEDLIST: queue is now in-memory. | |
| * On RS_RET_FILE_NOT_FOUND: corrupt files moved; caller should construct | |
| * fresh disk streams. | |
| * On RS_RET_OK with qType unchanged: queue state validated successfully. | |
| * | |
| * @note Threading: called only during single-threaded queue construction; | |
| * no additional locking required. | |
| * | |
| * @param pThis Queue object being constructed. | |
| * @param loadRet Return value from qqueueTryLoadPersistedInfo(). | |
| * @return RS_RET_OK on success or in-memory switch, RS_RET_FILE_NOT_FOUND | |
| * for fresh-start after safe-mode recovery, or error code on failure. | |
| */ | |
| static rsRetVal qqueueVerifyAndRecover(qqueue_t *pThis, rsRetVal loadRet) { |
57ac2a6 to
3242dbc
Compare
Why: Queue recovery must fail safe and remain compatible with existing startup flows, including disk queue tests that rely on deferred spool setup. Impact: Disk queue startup validates corruption more safely, and imjournal CI is more resilient to transient readiness timing under journal churn. Before/After: Before, some startup paths could incorrectly switch to emergency mode when spool scanning could not run, and imjournal readiness could fail flakily. After, startup keeps legacy behavior when scan is unavailable and imjournal retries readiness before failing. Technical Overview: Expose queue.oncorruption as a user parameter and document it. Set explicit default mode to safe for queue defaults/construct paths. Harden disk queue scan parsing with bounds, overflow, and alloc checks. If spool directory/prefix is unset, return original load status unchanged. If spool scan cannot run (ENOENT/ENOTDIR), skip corruption scan. On recovery-path failures, switch to emergency in-memory mode with alerts. Retry imjournal readiness probes with bounded attempts and timeout. If readiness is still not observed, skip test as environment-limited. Closes rsyslog#5771 With the help of AI-Agents: codex
3242dbc to
c62adb9
Compare
There was a problem hiding this comment.
2 issues found across 5 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="action.c">
<violation number="1" location="action.c:114">
P2: Custom agent: **Code Quality Guard**
Duplicated `static inline wtiShutdownImmediate()` in `action.c` and `runtime/ruleset.c` — this is a copy-paste pattern that should be consolidated into `runtime/wti.h` where the `wti_t` struct members it accesses (`pbShutdownImmediate`, `pmutShutdownImmediate`) are declared. Duplicating atomic-safety wrappers across files creates a maintenance hazard: if the access pattern needs to change, both copies must be updated in lockstep.
SLOP_SCORE is elevated: this is a lazy patch pattern (copy-paste instead of proper shared definition).</violation>
</file>
<file name="runtime/queue.c">
<violation number="1" location="runtime/queue.c:240">
P2: Custom agent: **Code Quality Guard**
The `const` qualifier on `pThis` in `qqueueShutdownImmediateGet` creates a const-correctness violation on platforms without atomic builtins. `QQUEUE_SHUTDOWN_HELPER_MUT(pThis)` would yield `const pthread_mutex_t *` (via `&pThis->mutQueueSize`), but `ATOMIC_FETCH_32BIT` expects `pthread_mutex_t *`. This will produce a compiler warning, violating the project's "compile cleanly with no warnings on both gcc and clang" policy (CONTRIBUTING.md). The cast `(int *)&pThis->bShutdownImmediate` also discards const. Remove the `const` from the `pThis` parameter — atomic reads are logically non-const operations since they may acquire a mutex.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| static rsRetVal doSubmitToActionQNotAllMark(action_t *const pAction, wti_t *const pWti, smsg_t *); | ||
| static void ATTR_NONNULL() actionSuspend(action_t *const pThis, wti_t *const pWti); | ||
| static void ATTR_NONNULL() actionRetry(action_t *const pThis, wti_t *const pWti); | ||
| static inline int wtiShutdownImmediate(const wti_t *const pWti) { |
There was a problem hiding this comment.
P2: Custom agent: Code Quality Guard
Duplicated static inline wtiShutdownImmediate() in action.c and runtime/ruleset.c — this is a copy-paste pattern that should be consolidated into runtime/wti.h where the wti_t struct members it accesses (pbShutdownImmediate, pmutShutdownImmediate) are declared. Duplicating atomic-safety wrappers across files creates a maintenance hazard: if the access pattern needs to change, both copies must be updated in lockstep.
SLOP_SCORE is elevated: this is a lazy patch pattern (copy-paste instead of proper shared definition).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At action.c, line 114:
<comment>Duplicated `static inline wtiShutdownImmediate()` in `action.c` and `runtime/ruleset.c` — this is a copy-paste pattern that should be consolidated into `runtime/wti.h` where the `wti_t` struct members it accesses (`pbShutdownImmediate`, `pmutShutdownImmediate`) are declared. Duplicating atomic-safety wrappers across files creates a maintenance hazard: if the access pattern needs to change, both copies must be updated in lockstep.
SLOP_SCORE is elevated: this is a lazy patch pattern (copy-paste instead of proper shared definition).</comment>
<file context>
@@ -111,6 +111,9 @@ static rsRetVal doSubmitToActionQComplex(action_t *const pAction, wti_t *const p
static rsRetVal doSubmitToActionQNotAllMark(action_t *const pAction, wti_t *const pWti, smsg_t *);
static void ATTR_NONNULL() actionSuspend(action_t *const pThis, wti_t *const pWti);
static void ATTR_NONNULL() actionRetry(action_t *const pThis, wti_t *const pWti);
+static inline int wtiShutdownImmediate(const wti_t *const pWti) {
+ return ATOMIC_FETCH_32BIT(pWti->pbShutdownImmediate, pWti->pmutShutdownImmediate);
+}
</file context>
| #define QQUEUE_SHUTDOWN_HELPER_MUT(pThis) (&(pThis)->mutQueueSize) | ||
| #endif | ||
|
|
||
| static inline int qqueueShutdownImmediateGet(const qqueue_t *const pThis) { |
There was a problem hiding this comment.
P2: Custom agent: Code Quality Guard
The const qualifier on pThis in qqueueShutdownImmediateGet creates a const-correctness violation on platforms without atomic builtins. QQUEUE_SHUTDOWN_HELPER_MUT(pThis) would yield const pthread_mutex_t * (via &pThis->mutQueueSize), but ATOMIC_FETCH_32BIT expects pthread_mutex_t *. This will produce a compiler warning, violating the project's "compile cleanly with no warnings on both gcc and clang" policy (CONTRIBUTING.md). The cast (int *)&pThis->bShutdownImmediate also discards const. Remove the const from the pThis parameter — atomic reads are logically non-const operations since they may acquire a mutex.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At runtime/queue.c, line 240:
<comment>The `const` qualifier on `pThis` in `qqueueShutdownImmediateGet` creates a const-correctness violation on platforms without atomic builtins. `QQUEUE_SHUTDOWN_HELPER_MUT(pThis)` would yield `const pthread_mutex_t *` (via `&pThis->mutQueueSize`), but `ATOMIC_FETCH_32BIT` expects `pthread_mutex_t *`. This will produce a compiler warning, violating the project's "compile cleanly with no warnings on both gcc and clang" policy (CONTRIBUTING.md). The cast `(int *)&pThis->bShutdownImmediate` also discards const. Remove the `const` from the `pThis` parameter — atomic reads are logically non-const operations since they may acquire a mutex.</comment>
<file context>
@@ -231,6 +231,20 @@ DEFobjCurrIf(glbl) DEFobjCurrIf(strm) DEFobjCurrIf(datetime) DEFobjCurrIf(statso
+ #define QQUEUE_SHUTDOWN_HELPER_MUT(pThis) (&(pThis)->mutQueueSize)
+#endif
+
+static inline int qqueueShutdownImmediateGet(const qqueue_t *const pThis) {
+ return ATOMIC_FETCH_32BIT((int *)&pThis->bShutdownImmediate, QQUEUE_SHUTDOWN_HELPER_MUT(pThis));
+}
</file context>
| static inline int qqueueShutdownImmediateGet(const qqueue_t *const pThis) { | |
| static inline int qqueueShutdownImmediateGet(qqueue_t *const pThis) { |
aa61fb9 to
12464b3
Compare
Queue shutdown sets bShutdownImmediate from the control thread while worker/action/ruleset loops read it concurrently. Those accesses were plain loads/stores, which is a data race under the C memory model. On x86 this is often masked by stronger ordering, but on ARM64 weak ordering can leave workers observing stale values long enough to miss shutdown and hang in timeout paths (as seen in arm64 CI during disk queue corruption tests). Fix by routing bShutdownImmediate through atomic helpers and using atomic loads in all hot loops that poll the flag (queue dequeue/DA loops, action retry loops, ruleset execution, and rsyslogd batch preprocessing). Also propagate the helper mutex pointer via wti so non-atomic fallback builds remain synchronized.
12464b3 to
d4d5a95
Compare
Run the debug testcase in a way that always preserves artifacts on failure and avoids secondary noise from auto-reruns.
The previous version could fail to upload logs with EACCES because files created in the container were not readable by the runner, and USE_AUTO_DEBUG caused an extra rerun path ("RSYSLOG_DYNNAME already set") after aborts.
Update the job to normalize file ownership/permissions after the testcase, even on failure, and disable automatic debug reruns for this focused diagnostic workflow.
With the help of AI-Agents: Codex
Debug tracing in strmReadBuf logged raw read buffers with "%*s", where the width argument does not limit input read length for %s.
When global debug logging is enabled, this can make vsnprintf scan past the read buffer until a NUL byte and trigger ASAN heap-buffer-overflow in queue disk dequeue paths.
Use precision-limited formatting ("%.*s") with iLenRead to cap bytes read from the non-NUL-terminated buffer.
With the help of AI-Agents: Codex
Keep the targeted arm64 diskqueue debug workflow focused on rich failure forensics with stable evidence capture. Changes: - preserve and upload tests/log (full processing flow) - print tail -n 2000 of tests/log to stdout on failure - keep file ownership/permission normalization so artifact upload can read generated debug files - keep debug defaults in tests/diag.sh for this investigation run This consolidates the recent debug-workflow-only adjustments into one commit while leaving the runtime bugfix commit separate. With the help of AI-Agents: Codex
Ensure arm64 CI jobs produce local core files usable by tests/diag.sh and keep gdb available for stack capture. Changes: - add gdb/procps to devtools/ci/Dockerfile.arm - run ARM containers with --privileged - set ulimit -c unlimited and kernel.core_pattern=/rsyslog/tests/core.%e.%p - upload tests/core* artifacts in arm jobs With the help of AI-Agents: Codex
Increase the custom ignore-mode shutdown timeout used by diskqueue-oncorruption-missing-segment.sh from the generic 90s testbench default to a dedicated 240s ceiling. The old value could time out in full-suite ARM64 ASAN runs even though the standalone testcase succeeds. Also set ASAN_OPTIONS disable_coredump=0 in both ARM64 ASAN jobs so crashes produce core files in tests/ (with existing core_pattern), which allows diag.sh post-failure core handling.
d321e6c to
32fa884
Compare
Implements robust disk queue corruption detection and recovery logic.
queue.onCorruptionparameter to control behavior (ignore, safe, inMemory).qqueueVerifyAndRecoverto scan spool directory and validate state.realloc,strdup,es_str2cstr) withCHKmallocmacro.renameoperations during recovery.reallocfailure path.badqi.shanddaqueue-invld-qi.sh.PR created automatically by Jules for task 9861661753839617161 started by @rgerhards