Tom/fix memory leak (#84)

gabsow · galcohen-redislabs · cursoragent · web-flow · commit c21b080837ec · 2026-02-26T23:06:50.000+02:00
* MOD-13438 Add internal commands to the inner communication protocol Initial additions of the needed types and functions. (cherry picked from commit 95fc9cc) * Adjust the includes, typedefs and function declarations (cherry picked from commit 37ea653) * Make MR_ClusterRegisterMsgReceiver() idempotent (cherry picked from commit 5f0f0a9) * Spelling and other cleanups (cherry picked from commit c66873c) * Send internal-command messages to current node as well (cherry picked from commit da3eeff) * Added support for parsing responses of internal commands and storing in steps (cherry picked from commit b252924) * WIP (doesn't compile now) - handling step-done and done notifications under internal commands (cherry picked from commit 597d0ac) * Some cleanups. Now compiles but the done-step and done-execution parts still have bugs (cherry picked from commit 58a15ab) * WTF? How this trivial bug survived for so many years?? (cherry picked from commit f635c48) * Now we can use the flag properly (cherry picked from commit 7c5792a) * Mostly cleanups of funcs and structs (cherry picked from commit d2cd1df) * minor cleanups (cherry picked from commit e64de66) * Aadded MR_ExecutionCtxSetDone() (cherry picked from commit f2e283d) * typo in the rust apis * Minor log cleanup * Pin setuptools below 81 in CI Avoid newer setuptools behavior in Linux and macOS workflows. * Added a comment for internal command callbacks registrations * Retry AUTH in cases of a race that cause previous AUTH to fail * Refined the condition for when to retry AUTH * Ride on the done.callback to avoid race between timeout and done; Also no need for the MR_ExecutionCtxSetDone() anymore * Don't bail out before sending NOTIFY_DONE when I'm not the initiator * Testing something * Revert "Testing something" This reverts commit 5bbbef5. * Fix cluster cleanup leaks in TLS and async connect paths. This prevents leaking TLS config buffers and failed async contexts in error flows reported by Valgrind. * Free async context from event loop thread to prevent reply leak. redisAsyncContext is not thread-safe. Calling redisAsyncFree from the main thread while the event loop thread processes callbacks can race and leak a parsed redisReply object. Dispatch the free to the event loop via MR_EventLoopAddTask, consistent with the existing SSL error disconnect paths. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix comment: free is deferred within the event loop, not across threads. Co-authored-by: Cursor <cursoragent@cursor.com> * Drain pending messages on node free to prevent execution leaks When MR_ClusterFree tears down nodes during a topology change, in-flight executions with pending messages were left waiting for responses that would never arrive. Under Valgrind the 5s idle timeout often could not fire before process exit, leaking the execution and its parsed results (SeriesListReplyParser allocations). - MR_NodeFree: drain pendingMessages and notify each execution via MR_SetInternalCommandResults(node, NULL, execution) - MR_SetInternalCommandResults: handle NULL reply by marking all steps done and reporting a disconnect error - MR_FreeAsyncContext: free partially parsed reply tree from hiredis reader stack before calling redisAsyncFree (works around hiredis not cleaning up mid-parse state in redisReaderFree) All three changes verified clean by Valgrind on the test_asm_with_data_and_queries_during_migrations test. Co-authored-by: Cursor <cursoragent@cursor.com> * Only drain internal-command pending messages in MR_NodeFree Draining all pending messages and calling MR_SetInternalCommandResults for each caused crashes: regular (status) messages have executions with Reader/Mapper steps, and MR_PerformStepDoneOp asserts on non-InternalCommand step types. Only drain messages with FUNCTION_ID_INTERNAL set. Co-authored-by: Cursor <cursoragent@cursor.com> * Revert leak fixes per Gal's feedback - Remove MR_FreeAsyncContext and deferral: redisReaderFree is called during teardown; no need to defer redisAsyncFree at this stage - Remove MR_NodeFree pending-message drain: find root cause (e.g. MR_ClusterFree not called) rather than custom cleanup - Remove NULL reply handling in MR_SetInternalCommandResults: redundant Co-authored-by: Cursor <cursoragent@cursor.com> * Remove extra blank line in MR_SetInternalCommandResults Co-authored-by: Cursor <cursoragent@cursor.com> * Add MR_ClusterFini to free cluster state on module teardown MR_ClusterFree was never called at process exit — only during cluster topology changes — causing Valgrind-reported leaks of nodes, connections, and related allocations. Co-authored-by: Cursor <cursoragent@cursor.com> * Add MR_Fini API and call it from test module deinit Expose MR_Fini (calls MR_ClusterFini) as the public teardown API, add mr_fini to the Rust bindings, and register a deinit callback in the test module so cluster state is freed on shutdown. Co-authored-by: Cursor <cursoragent@cursor.com> * Skip testUnevenWork under Valgrind The mapper sleeps 30s on non-initiator shards but the execution times out after 2s. The sleeping thread pool thread prevents a clean shutdown within Valgrind's timeout. Co-authored-by: Cursor <cursoragent@cursor.com> * Destroy thread pool in MR_Fini, reduce testUnevenWork sleep to 5s MR_Fini now calls mr_thpool_destroy before cluster cleanup so threads are joined on shutdown. The 30s sleep in UnevenWorkMapper was far longer than needed (max_idle is 2s) and prevented clean exit under Valgrind; 5s is sufficient to test the uneven-work path. Co-authored-by: Cursor <cursoragent@cursor.com> * Revert thpool_destroy and sleep change, skip testUnevenWork under Valgrind Remove mr_thpool_destroy from MR_Fini — it blocks shutdown waiting for the 30s sleeping thread. Restore original 30s sleep. Skip the test under Valgrind since the sleeping thread prevents clean exit (pre-existing issue, unrelated to cluster leak fix). Co-authored-by: Cursor <cursoragent@cursor.com> * Fix cluster cleanup leaks in TLS and async connect paths. - Fix ca_cert double-free: was freeing *client_cert instead of *ca_cert - Fix async context leak: add redisAsyncFree(c) on connect error path Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Gal Cohen <gal@redis.com> Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>
diff --git a/src/cluster.c b/src/cluster.c
@@ -532,7 +532,7 @@ static int checkTLS(char** client_key, char** client_cert, char** ca_cert, char*
             MR_FREE(*client_cert);
         }
         if (*ca_cert) {
-            MR_FREE(*client_cert);
+            MR_FREE(*ca_cert);
         }
     }
 
@@ -684,8 +684,8 @@ static void MR_ConnectToShard(Node* n){
         return;
     }
     if (c->err) {
-        /* Let *c leak for now... */
         RedisModule_Log(mr_staticCtx, "warning", "Error: %s\n", c->errstr);
+        redisAsyncFree(c);
         return;
     }
     c->data = n;

Original file line number	Diff line number	Diff line change
`@@ -532,7 +532,7 @@ static int checkTLS(char client_key, char client_cert, char** ca_cert, char*`
`532`	`532`	`MR_FREE(*client_cert);`
`533`	`533`	`}`
`534`	`534`	`if (*ca_cert) {`
`535`		`- MR_FREE(*client_cert);`
	`535`	`+ MR_FREE(*ca_cert);`
`536`	`536`	`}`
`537`	`537`	`}`
`538`	`538`
`@@ -684,8 +684,8 @@ static void MR_ConnectToShard(Node* n){`
`684`	`684`	`return;`
`685`	`685`	`}`
`686`	`686`	`if (c->err) {`
`687`		`- /* Let c leak for now... /`
`688`	`687`	`RedisModule_Log(mr_staticCtx, "warning", "Error: %s\n", c->errstr);`
	`688`	`+ redisAsyncFree(c);`
`689`	`689`	`return;`
`690`	`690`	`}`
`691`	`691`	`c->data = n;`