Replication fails with TimeoutException on large dataset using RIOT 4.3.0 #179
Description
I’m using RIOT 4.3.0 to replicate data from Redis Enterprise to Google Cloud Memorystore.
My command looks like this:
SOURCE_HOST="redis-XXXXX.com"
SOURCE_PORT=XXXX
SOURCE_USER=default
SOURCE_PASS="XXXXXX"
TARGET_HOST="10.1XX.XX.XX"
TARGET_PORT=XXXX
source_uri="redis://${SOURCE_USER}:${SOURCE_PASS}@${SOURCE_HOST}:${SOURCE_PORT}"
target_uri="${TARGET_HOST}:${TARGET_PORT}"
/root/riot-4.3.0/bin/riot replicate "$source_uri" "$target_uri" --source-cluster --target-cluster
With a small database, the replication works fine.
With a large database (~70 million keys), it always fails around 35 million keys.
Error log:
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.lambda$write$4(FaultTolerantChunkProcessor.java:401)
at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:573)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:418)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:276)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:216)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:414)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:227)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:75)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:383)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:307)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:250)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:369)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:206)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:140)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:235)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:230)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:153)
at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:408)
at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:127)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:307)
at org.springframework.batch.core.launch.support.TaskExecutorJobLauncher$1.run(TaskExecutorJobLauncher.java:155)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at com.redis.lettucemod.RedisModulesUtils.getAll(RedisModulesUtils.java:355)
at com.redis.spring.batch.item.redis.common.OperationExecutor.execute(OperationExecutor.java:123)
at com.redis.spring.batch.item.redis.common.OperationExecutor.process(OperationExecutor.java:102)
at com.redis.spring.batch.item.redis.common.OperationExecutor.process(OperationExecutor.java:34)
at com.redis.spring.batch.item.ChunkProcessingItemWriter.write(ChunkProcessingItemWriter.java:57)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:203)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:170)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.lambda$write$2(FaultTolerantChunkProcessor.java:331)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:357)
... 21 more
Scanning 100% [===============] 70764175/70764175 (0:23:52 / 0:00:00) 49416.3/s
What I tried:
• Reduced --batch size (down to 1).
• Reduced --threads (down to 1).
• Excluded big keys with --key-exclude.
• Added --mem-limit.
• Reduced --scan-count (down to 10).
• Increased --source-timeout and --target-timeout to 10m.
But I still get the same TimeoutException at about the same point.
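One more thing I have been considering (not yet tested): since RIOT is built on Lettuce, the command timeout might also be expressible as a RedisURI query parameter instead of only via the CLI flags. A sketch, with placeholder host/port/password values:

```shell
# Hypothetical: encode the Lettuce command timeout directly in the URI
# via the "timeout" query parameter (placeholder credentials and host).
SOURCE_URI="redis://default:pass@redis-source.example.com:12000?timeout=600s"
echo "$SOURCE_URI"
```

I am not sure whether RIOT honors the query parameter in addition to --source-timeout/--target-timeout, so please correct me if this is wrong.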
Big key summary from redis-cli --bigkeys:
Sampled 23,594,835 keys in the keyspace!
Total key length in bytes is 882,939,469 (avg len 37.42)
Biggest list: "event-server:job-system:waiting-queue" → 1,377 items
Biggest hash: "li:livejanus" → 435 fields
Biggest string: "ExactDirtyWords" → 25,366 bytes
Biggest set: "li:tagging:TW:" → 701,312 members
Biggest zset: "leaderboard:GLOBAL:ARCADE_SHIRLEYBIRD_JP:platform:ALL" → 91,716 members
4216 lists with 8052 items (0.02% of keys, avg size 1.91)
2,661,935 hashes with 2,895,308 fields (11.28% of keys, avg size 1.09)
16,343,036 strings with 243,616,545 bytes (69.27% of keys, avg size 14.91)
6443 sets with 2,087,605 members (0.03% of keys, avg size 324.01)
4,579,205 zsets with 80,699,877 members (19.41% of keys, avg size 17.62)
Even after excluding the largest keys (--key-exclude) the error still happens at ~35M keys.
Environment:
• RIOT 4.3.0
• Source: Redis Enterprise cluster, version: 6.2.13, used_memory_human:24.24G
• Target: Google Cloud Memorystore cluster
Could this be an internal timeout in RIOT that is hit on large datasets?
Is there a recommended way to increase it?
Thank you.