Automatic Slot Migrator Failure (System.NullReferenceException in CreateAndRunMigrateTasks and MigrateSession.RecoverFromFailure failed to make slots STABLE) #1560

@muzammilar

Describe the bug

When running Garnet, automatic slot migration fails.

Cluster: A two-node cluster (2 shards, no replicas). The cluster is initialized with all slots on one node, which holds 5-10 thousand keys. We then migrate some hash slots from one primary (shard) to the other while continuously writing data and deleting keys (both manually and through TTLs). Some keys have short TTLs and are deleted and modified constantly.

Garnet Version: 1.0.94

Network: IPv6 with TLS

Steps to reproduce the bug

Command:

/usr/bin/valkey-cli -h primary-01.foo.bar -p 6379 --user myuser -a mypassword --tls --cacert /path/to/ca.crt MIGRATE ipv6-address-of-primary-02 6379 "" 0 -1 REPLACE AUTH2 myuser 'mypassword' SLOTSRANGE 8192 16383

When the above command is executed, the sender reports that it is migrating slots to the target, but it logs an error, and cluster_slots_ok and cluster_slots_assigned both decrease by the number of slots being migrated.

CLUSTER MTASKS returns 0. It should return 1 while the migration is in progress.
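
To quantify the drop in cluster_slots_ok and cluster_slots_assigned, the CLUSTER INFO reply can be captured before and after the MIGRATE call and compared. The following is a minimal sketch that assumes the reply uses the usual `key:value` line format. The function names and the sample replies are illustrative and are not taken from the affected cluster.

```python
def parse_cluster_info(raw: str) -> dict[str, str]:
    """Parse the 'key:value' lines of a CLUSTER INFO reply into a dict."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines, and anything without a separator.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def slot_counter_drop(before: str, after: str) -> dict[str, int]:
    """Return how much each slot counter decreased between two replies."""
    b, a = parse_cluster_info(before), parse_cluster_info(after)
    fields = ("cluster_slots_assigned", "cluster_slots_ok")
    return {field: int(b[field]) - int(a[field]) for field in fields}


# Illustrative replies only; these values are assumptions, not real output.
BEFORE = "cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_slots_ok:16384\r\n"
AFTER = "cluster_state:ok\r\ncluster_slots_assigned:8192\r\ncluster_slots_ok:8192\r\n"

print(slot_counter_drop(BEFORE, AFTER))
```

With SLOTSRANGE 8192 16383, a drop of 8192 in both counters would match the number of slots in the requested range.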

Sender's Logs:

2026-02-13T06:24:04.987917+00:00 primary-02 GarnetServer[216135]: 06::24::04 fail: MigrateSession - 14333193[0] CreateAndRunMigrateTasks: Object 24 240 4096 System.NullReferenceException: Object reference not set to an instance of an object.    at Garnet.cluster.ClusterSession.Expired(IGarnetObject& value) in /_/libs/cluster/Session/MigrateCommand.cs:line 18    at Garnet.cluster.MigrateSession.ObjectStoreScan.SingleReader(Byte[]& key, IGarnetObject& value, RecordMetadata recordMetadata, Int64 numberOfRecords, CursorRecordResult& cursorRecordResult) in /_/libs/cluster/Server/Migration/MigrateScanFunctions.cs:line 78    at Tsavorite.core.AllocatorBase`4.ScanLookup[TInput,TOutput,TScanFunctions,TScanIterator](TsavoriteKV`4 store, ScanCursorState`2 scanCursorState, Int64& cursor, Int64 count, TScanFunctions scanFunctions, TScanIterator iter, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in /_/libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorScan.cs:line 197    at Tsavorite.core.GenericAllocatorImpl`3.ScanCursor[TScanFunctions](TsavoriteKV`4 store, ScanCursorState`2 scanCursorState, Int64& cursor, Int64 count, TScanFunctions scanFunctions, Int64 endAddress, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in /_/libs/storage/Tsavorite/cs/src/core/Allocator/GenericAllocatorImpl.cs:line 1034    at Tsavorite.core.ClientSession`8.ScanCursor[TScanFunctions](Int64& cursor, Int64 count, TScanFunctions scanFunctions, Int64 endAddress, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in /_/libs/storage/Tsavorite/cs/src/core/ClientSession/ClientSession.cs:line 503    at Tsavorite.core.ClientSession`8.IterateLookup[TScanFunctions](TScanFunctions& scanFunctions, Int64& cursor, Int64 untilAddress, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in 
/_/libs/storage/Tsavorite/cs/src/core/ClientSession/ClientSession.cs:line 477    at Garnet.server.StorageSession.IterateObjectStore[TScanFunctions](TScanFunctions& scanFunctions, Int64& cursor, Int64 untilAddress, Int64 maxAddress, Boolean validateCursor, Boolean includeTombstones) in /_/libs/server/Storage/Session/Common/ArrayKeyIterationFunctions.cs:line 172    at Garnet.server.GarnetApi`2.IterateObjectStore[TScanFunctions](TScanFunctions& scanFunctions, Int64& cursor, Int64 untilAddress, Int64 maxAddress, Boolean includeTombstones) in /_/libs/server/API/GarnetApi.cs:line 463    at Garnet.cluster.MigrateSession.MigrateOperation.Scan(StoreType storeType, Int64& currentAddress, Int64 endAddress) in /_/libs/cluster/Server/Migration/MigrateOperation.cs:line 67    at Garnet.cluster.MigrateSession.<>c__DisplayClass61_0.<MigrateSlotsDriverInline>g__ScanStoreTask|1(Int32 taskId, StoreType storeType, Int64 beginAddress, Int64 tailAddress, Int32 pageSize) in /_/libs/cluster/Server/Migration/MigrateSessionSlots.cs:line 93    at Garnet.cluster.MigrateSession.<>c__DisplayClass61_2.<MigrateSlotsDriverInline>b__2() in /_/libs/cluster/Server/Migration/MigrateSessionSlots.cs:line 57    at System.Threading.Tasks.Task`1.InnerInvoke()    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state) --- End of stack trace from previous location ---    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)    at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread) --- End of stack trace from previous location ---    at Garnet.cluster.MigrateSession.<>c__DisplayClass61_0.<<MigrateSlotsDriverInline>g__CreateAndRunMigrateTasks|0>d.MoveNext() in /_/libs/cluster/Server/Migration/MigrateSessionSlots.cs:line 63
2026-02-13T06:24:04.988258+00:00 primary-02 GarnetServer[216135]: 06::24::04 fail: MigrateSession - 14333193[0] MigrateSlotsDriver failed
2026-02-13T06:24:04.989179+00:00 primary-02 GarnetServer[216135]: 06::24::04 fail: MigrateSession - 14333193[0] An error occurred System.AggregateException: One or more errors occurred. (A task was canceled.)  ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.    at System.Threading.Tasks.Task.GetExceptions(Boolean includeTaskCanceledExceptions)    at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)    at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)    at Garnet.cluster.MigrateSession.TrySetSlotRanges(String nodeid, MigrateState state) in /_/libs/cluster/Server/Migration/MigrateSession.cs:line 270    at Garnet.cluster.MigrateSession.TryRecoverFromFailure() in /_/libs/cluster/Server/Migration/MigrateSession.cs:line 331    at Garnet.cluster.MigrateSession.BeginAsyncMigrationTask() in /_/libs/cluster/Server/Migration/MigrationDriver.cs:line 86    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()    at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)    at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)    at System.Threading.Tasks.Task`1.TrySetResult(TResult result)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetExistingTaskResult(Task`1 task, TResult result)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)    at Garnet.cluster.MigrateSession.MigrateSlotsDriverInline() in 
/_/libs/cluster/Server/Migration/MigrateSessionSlots.cs:line 122    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()    at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)    at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)    at System.Threading.Tasks.Task`1.TrySetResult(TResult result)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetExistingTaskResult(Task`1 task, TResult result)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)    at Garnet.cluster.MigrateSession.<>c__DisplayClass61_0.<<MigrateSlotsDriverInline>g__CreateAndRunMigrateTasks|0>d.MoveNext() in /_/libs/cluster/Server/Migration/MigrateSessionSlots.cs:line 72    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()    at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)    at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)    at System.Threading.Tasks.Task.FinishSlow(Boolean userDelegateExecute)    at System.Threading.Tasks.Task.TrySetException(Object 
exceptionObject)    at System.Threading.Tasks.Task.RunOrQueueCompletionAction(ITaskCompletionAction completionAction, Boolean allowInlining)    at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)    at System.Threading.Tasks.Task.FinishSlow(Boolean userDelegateExecute)    at System.Threading.Tasks.Task.TrySetException(Object exceptionObject)    at System.Threading.Tasks.Task.WhenAllPromise.Invoke(Task completedTask)    at System.Threading.Tasks.Task.RunOrQueueCompletionAction(ITaskCompletionAction completionAction, Boolean allowInlining)    at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)    at System.Threading.Tasks.Task.FinishSlow(Boolean userDelegateExecute)    at System.Threading.Tasks.Task.TrySetException(Object exceptionObject)    at System.Threading.Tasks.UnwrapPromise`1.TrySetFromTask(Task task, Boolean lookForOce)    at System.Threading.Tasks.UnwrapPromise`1.ProcessCompletedOuterTask(Task task)    at System.Threading.Tasks.UnwrapPromise`1.InvokeCore(Task completingTask)    at System.Threading.Tasks.UnwrapPromise`1.Invoke(Task completingTask)    at System.Threading.Tasks.Task.RunOrQueueCompletionAction(ITaskCompletionAction completionAction, Boolean allowInlining)    at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)    at System.Threading.Tasks.Task.FinishSlow(Boolean userDelegateExecute)    at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)    at System.Threading.ThreadPoolWorkQueue.Dispatch()    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() --- End of stack trace from previous location ---     --- End of inner exception stack trace ---    at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)    at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)    at Garnet.cluster.MigrateSession.TrySetSlotRanges(String nodeid, MigrateState state) in 
/_/libs/cluster/Server/Migration/MigrateSession.cs:line 270
2026-02-13T06:24:04.989350+00:00 primary-02 GarnetServer[216135]: 06::24::04 fail: MigrateSession - 14333193[0] MigrateSession.RecoverFromFailure failed to make slots STABLE

Claude's Response:

  What this means practically

  - The migration of slot(s) from this node (session <removed>) failed mid-scan
  - The slot(s) may be stuck in a MIGRATING state on this node since recovery also failed
  - You likely need to manually reset the slot state (e.g., CLUSTER SETSLOT <slot> STABLE) on the affected node
  - The root cause is a Garnet bug — the Expired() check in MigrateCommand.cs doesn't handle null objects during object store iteration
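
If a manual reset is attempted, the per-slot commands for the range used in this report could be generated as sketched below. This only builds the command strings. The helper name is hypothetical, and whether CLUSTER SETSLOT <slot> STABLE actually restores slot ownership on Garnet 1.0.94 after this failure is not confirmed by anything in this report.

```python
def setslot_stable_commands(first_slot: int, last_slot: int) -> list[str]:
    """Build one 'CLUSTER SETSLOT <slot> STABLE' command per slot (inclusive range)."""
    if not (0 <= first_slot <= last_slot <= 16383):
        raise ValueError("slot range must satisfy 0 <= first <= last <= 16383")
    return [
        f"CLUSTER SETSLOT {slot} STABLE"
        for slot in range(first_slot, last_slot + 1)
    ]


# Slot range taken from the MIGRATE command in this report.
commands = setslot_stable_commands(8192, 16383)
print(len(commands))
print(commands[0])
print(commands[-1])
```

Each generated string would then be sent to the affected node with the same connection options as the MIGRATE command above.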

  This is a known class of issue with Garnet's migration path for the object store. If you're hitting this consistently, it may be related to keys with TTLs expiring during migration. You could check if there's a newer Garnet version (you're on 1.0.94) that fixes the null check in ClusterSession.Expired() (the frame at the top of the stack trace).
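
The suspected failure mode can be illustrated with a small model. This is not Garnet's code: the types and function names below are hypothetical, and treating a missing value as "skip this record" is an assumption about what a fix might do, not a description of any actual patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredObject:
    # Hypothetical stand-in for a stored object. `expiration` is a Unix
    # timestamp in seconds, or 0.0 when the object has no TTL.
    expiration: float = 0.0


def expired_unguarded(value: Optional[StoredObject], now: float) -> bool:
    # Dereferences `value` unconditionally; raises AttributeError when it is
    # None, which is the Python analogue of a NullReferenceException.
    return value.expiration > 0 and value.expiration < now


def expired_guarded(value: Optional[StoredObject], now: float) -> bool:
    # Assumption: a record whose value is already gone has nothing to migrate,
    # so it is reported as expired and the scan skips it.
    if value is None:
        return True
    return value.expiration > 0 and value.expiration < now


def keys_to_migrate(records, now: float) -> list[str]:
    """Return the keys of records that still hold a live value."""
    return [key for key, value in records if not expired_guarded(value, now)]


records = [
    ("live", StoredObject()),                   # no TTL
    ("gone", None),                             # value removed mid-scan
    ("stale", StoredObject(expiration=1.0)),    # TTL already passed
]
print(keys_to_migrate(records, now=100.0))
```

In this model the guarded check lets the scan continue past the record whose value disappeared, while the unguarded check aborts the whole scan on the first such record.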

Expected behavior

Migration should complete within a few seconds.

Screenshots

No response

Release version

1.0.94

IDE

No response

OS version

Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)
Release:	12
Codename:	bookworm

Additional context

No response
