This repository was archived by the owner on Oct 29, 2025. It is now read-only.

Replication fails with TimeoutException on large dataset using RIOT 4.3.0 #179

@billy522

Description

I’m using RIOT 4.3.0 to replicate data from Redis Enterprise to Google Cloud Memorystore.
My command looks like this:

SOURCE_HOST="redis-XXXXX.com"
SOURCE_PORT=XXXX
SOURCE_USER=default
SOURCE_PASS="XXXXXX"
TARGET_HOST="10.1XX.XX.XX"
TARGET_PORT=XXXX

source_uri="redis://${SOURCE_USER}:${SOURCE_PASS}@${SOURCE_HOST}:${SOURCE_PORT}"
target_uri="${TARGET_HOST}:${TARGET_PORT}"

/root/riot-4.3.0/bin/riot replicate "$source_uri" "$target_uri" --source-cluster --target-cluster

With a small database, the replication works fine.
With a large database (~70 million keys), it always fails around 35 million keys.

Error log:

org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.lambda$write$4(FaultTolerantChunkProcessor.java:401)
        at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:573)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:418)
        at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:276)
        at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:216)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:414)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:227)
        at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:75)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:383)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:307)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:250)
        at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
        at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:369)
        at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:206)
        at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:140)
        at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:235)
        at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:230)
        at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:153)
        at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:408)
        at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:127)
        at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:307)
        at org.springframework.batch.core.launch.support.TaskExecutorJobLauncher$1.run(TaskExecutorJobLauncher.java:155)
        at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.TimeoutException
        at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
        at com.redis.lettucemod.RedisModulesUtils.getAll(RedisModulesUtils.java:355)
        at com.redis.spring.batch.item.redis.common.OperationExecutor.execute(OperationExecutor.java:123)
        at com.redis.spring.batch.item.redis.common.OperationExecutor.process(OperationExecutor.java:102)
        at com.redis.spring.batch.item.redis.common.OperationExecutor.process(OperationExecutor.java:34)
        at com.redis.spring.batch.item.ChunkProcessingItemWriter.write(ChunkProcessingItemWriter.java:57)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:203)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:170)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.lambda$write$2(FaultTolerantChunkProcessor.java:331)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:357)
        ... 21 more
Scanning 100% [===============] 70764175/70764175 (0:23:52 / 0:00:00) 49416.3/s

What I tried:
• Reduced --batch size (down to 1).
• Reduced --threads (down to 1).
• Excluded big keys with --key-exclude.
• Added --mem-limit.
• Reduced --scan-count (down to 10).
• Increased the timeouts: --source-timeout 10m --target-timeout 10m.
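For reference, the mitigations above combined into a single invocation look roughly like this. This is a sketch, not a recommendation: the riot path, host, and credentials are placeholders, and the flag values shown are the extremes I tried (--key-exclude and --mem-limit were also varied, with values specific to my keyspace).

```shell
#!/bin/sh
# Recap of the tried mitigations in one command (placeholder endpoints).
SOURCE_URI="redis://default:password@redis-example.com:12000"
TARGET_URI="10.1.0.1:6379"

RIOT_CMD="/root/riot-4.3.0/bin/riot replicate $SOURCE_URI $TARGET_URI \
--source-cluster --target-cluster \
--batch 1 --threads 1 --scan-count 10 \
--source-timeout 10m --target-timeout 10m"

echo "$RIOT_CMD"
```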

But I still get the same TimeoutException at about the same point.

Big key summary from redis-cli --bigkeys:

Sampled 23,594,835 keys in the keyspace!
Total key length in bytes is 882,939,469 (avg len 37.42)

Biggest list:  "event-server:job-system:waiting-queue" → 1,377 items  
Biggest hash:  "li:livejanus" → 435 fields  
Biggest string:"ExactDirtyWords" → 25,366 bytes  
Biggest set:   "li:tagging:TW:" → 701,312 members  
Biggest zset:  "leaderboard:GLOBAL:ARCADE_SHIRLEYBIRD_JP:platform:ALL" → 91,716 members  

4,216 lists with 8,052 items (0.02% of keys, avg size 1.91)
2,661,935 hashes with 2,895,308 fields (11.28% of keys, avg size 1.09)
16,343,036 strings with 243,616,545 bytes (69.27% of keys, avg size 14.91)
6,443 sets with 2,087,605 members (0.03% of keys, avg size 324.01)
4,579,205 zsets with 80,699,877 members (19.41% of keys, avg size 17.62)

Even after excluding the largest keys with --key-exclude, the error still occurs at around 35 million keys.

Environment:
• RIOT 4.3.0
• Source: Redis Enterprise cluster, version: 6.2.13, used_memory_human:24.24G
• Target: Google Cloud Memorystore cluster

Could this be a timeout limit in RIOT for large datasets?
Is there a recommended way to increase the timeout?
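One avenue I have not tried yet (an assumption on my part, not something I found in the RIOT docs): RIOT is built on Lettuce, and Lettuce's RedisURI syntax accepts a timeout query parameter with a unit suffix, so the command timeout might also be raised directly in the connection URI. Whether RIOT honors it end to end is unverified. Host, port, and credentials below are placeholders:

```shell
#!/bin/sh
# Sketch: command timeout set via Lettuce-style RedisURI query parameter.
SOURCE_USER="default"
SOURCE_PASS="password"          # placeholder
SOURCE_HOST="redis-example.com" # placeholder
SOURCE_PORT=12000               # placeholder

source_uri="redis://${SOURCE_USER}:${SOURCE_PASS}@${SOURCE_HOST}:${SOURCE_PORT}?timeout=600s"
echo "$source_uri"
```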

Thank you.
