
fix(s3wal): 503 SlowDown during WAL recovery from S3 #3198

Merged
superhx merged 2 commits into AutoMQ:main from manmohak07:manmohak07/fix-503-slowdown-wal-recovery-s3
Feb 6, 2026

Conversation

@manmohak07
Contributor

Fixes #3177

Problem

AutoMQ brokers crash during startup/recovery when AWS S3 returns a 503 SlowDown error due to rate limiting.

ERROR Exiting Kafka due to fatal exception during startup.
java.lang.RuntimeException: software.amazon.awssdk.services.s3.model.S3Exception: 
  Please reduce your request rate. (Service: S3, Status Code: 503)
  Suppressed: Request attempt 1 failure
  Suppressed: Request attempt 2 failure  
  Suppressed: Request attempt 3 failure

Solution

Added RetryMode.STANDARD to the AWS S3 client configuration, which enables automatic retries with exponential backoff for S3 throttling errors.

File: s3stream/src/main/java/com/automq/stream/s3/operator/AwsObjectStorage.java
Line: 470

// Requires imports: java.time.Duration,
//   software.amazon.awssdk.core.client.config.ClientOverrideConfiguration,
//   software.amazon.awssdk.core.retry.RetryMode
protected ClientOverrideConfiguration clientOverrideConfiguration(
    long apiCallTimeoutMs, long apiCallAttemptTimeoutMs) {
    return ClientOverrideConfiguration.builder()
        .apiCallTimeout(Duration.ofMillis(apiCallTimeoutMs))               // whole-call deadline
        .apiCallAttemptTimeout(Duration.ofMillis(apiCallAttemptTimeoutMs)) // per-attempt deadline
        .retryStrategy(RetryMode.STANDARD)  // ← Added: retry throttled requests with backoff
        .build();
}
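
For context, here is a minimal sketch (not the AutoMQ wiring itself) of how such an override configuration is attached to an SDK v2 client. The region and timeout values are placeholders, and it assumes an SDK version where ClientOverrideConfiguration.Builder#retryStrategy(RetryMode) is available, as the diff above implies.

import java.time.Duration;

import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryMode;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;

public class S3ClientSketch {
    public static void main(String[] args) {
        // Attach timeouts plus the STANDARD retry mode; the SDK then retries
        // throttled requests (e.g. 503 SlowDown) before surfacing an error.
        try (S3AsyncClient client = S3AsyncClient.builder()
                .region(Region.US_EAST_1) // example region
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        .apiCallTimeout(Duration.ofMillis(30_000))
                        .apiCallAttemptTimeout(Duration.ofMillis(10_000))
                        .retryStrategy(RetryMode.STANDARD)
                        .build())
                .build()) {
            // client is now ready for putObject/getObject/deleteObjects calls
        }
    }
}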

In addition, application-level retry logic was added for S3 delete operations; a hedged sketch of the idea follows.
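
A minimal sketch, assuming a scheduler-based approach; names such as MAX_RETRIES, BASE_DELAY_MS, and isThrottled are illustrative and not taken from the actual diff:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class DeleteRetrySketch {
    private static final int MAX_RETRIES = 3;      // hypothetical retry cap
    private static final long BASE_DELAY_MS = 100; // hypothetical base delay

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Runs an async delete and, on a throttling failure, reschedules it with
    // an exponentially growing delay (100 ms, 200 ms, 400 ms, ...).
    CompletableFuture<Void> deleteWithRetry(Supplier<CompletableFuture<Void>> delete, int attempt) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        delete.get().whenComplete((v, t) -> {
            if (t == null) {
                result.complete(null);
            } else if (attempt < MAX_RETRIES && isThrottled(t)) {
                long delayMs = BASE_DELAY_MS << attempt;
                scheduler.schedule(() -> {
                    deleteWithRetry(delete, attempt + 1).whenComplete((v2, t2) -> {
                        if (t2 == null) result.complete(null);
                        else result.completeExceptionally(t2);
                    });
                }, delayMs, TimeUnit.MILLISECONDS);
            } else {
                result.completeExceptionally(t);
            }
        });
        return result;
    }

    // Crude message-based check; an SDK-based variant is sketched in the
    // review section further below.
    private static boolean isThrottled(Throwable t) {
        Throwable cause = t.getCause() != null ? t.getCause() : t;
        String msg = String.valueOf(cause.getMessage());
        return msg.contains("SlowDown") || msg.contains("Status Code: 503");
    }
}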

Why STANDARD?

  • Automatic Retry: the AWS SDK transparently retries failed S3 requests
  • Exponential Backoff: delays of roughly 100ms → 200ms → 400ms → 800ms (see the sketch after this list)
  • Detection: retries on 503s, 500s, timeouts, and other throttling errors
  • No Application Changes: applies to all S3 operations automatically
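
To make the backoff column concrete, here is a tiny sketch of the doubling schedule quoted above. The real STANDARD strategy also applies random jitter and caps the delay, so actual waits vary.

public class BackoffSchedule {
    public static void main(String[] args) {
        long baseMs = 100; // assumed base delay for the schedule above
        for (int attempt = 0; attempt < 4; attempt++) {
            long delayMs = baseMs << attempt; // 100, 200, 400, 800 ms
            System.out.printf("retry %d: ~%d ms%n", attempt + 1, delayMs);
        }
    }
}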

Testing

Results

$ kubectl get pods -n automq -w
NAME                                         READY   STATUS      RESTARTS   AGE
minio-7f9765cb74-wgpvz                       1/1     Running     0          2m38s
minio-setup-s5sjh                            0/1     Completed   0          2m38s
my-cluster-controller-0                      1/1     Running     0          2m32s
my-cluster-entity-operator-7c9b785b8-jmnxc   2/2     Running     0          27s
strimzi-cluster-operator-cd45c6c5-jmsqh      1/1     Running     0          4m57s
  • Broker started without crashes
  • S3 operations completed without fatal errors
  • Topics created and managed successfully
  • No CrashLoopBackOff observed

Verification Commands

./gradlew clean build -x test
# BUILD SUCCESSFUL

docker build -t automq-throttle-test:latest .
# Image created successfully

kubectl apply -f automq-demo-local.yml
# All pods running

kubectl exec my-cluster-controller-0 -n automq -- bash -c \
  "/opt/kafka/bin/kafka-topics.sh --create --topic test --partitions 3 --bootstrap-server localhost:9092"
# Created topic test.

Broker Logs

kubectl logs my-cluster-controller-0 -n automq | tail -50


2026-01-30 14:47:56,969 INFO [BrokerLifecycleManager id=0] 
The broker has been unfenced. Transitioning from RECOVERY to RUNNING.

2026-01-30 14:47:57,030 INFO [BrokerServer id=0] 
Transition from STARTING to STARTED

2026-01-30 14:47:57,034 INFO [KafkaRaftServer nodeId=0] 
Kafka Server started

Reference: AWS RetryStrategy documentation

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copilot AI (Contributor) left a comment

Pull request overview

This pull request addresses broker crashes during startup/recovery when AWS S3 returns 503 SlowDown errors due to rate limiting. The fix implements two key changes: adding AWS SDK's built-in retry mechanism (RetryMode.STANDARD) for all S3 operations, and implementing custom retry logic specifically for delete operations.

Changes:

  • Added RetryMode.STANDARD to the AWS S3 client configuration for automatic retry with exponential backoff
  • Implemented custom retry logic for S3 delete operations with throttling detection (a hypothetical detection helper is sketched below)
  • Added a TimeUnit import to support retry scheduling
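
For the throttling detection mentioned above, a hypothetical helper; the PR's actual predicate may differ. It relies on SdkServiceException#statusCode() and #isThrottlingException() from SDK v2, and is a more precise alternative to the message-based check sketched earlier.

import software.amazon.awssdk.core.exception.SdkServiceException;

public class ThrottleCheck {
    // True for throttling-style failures such as S3's 503 SlowDown.
    static boolean isThrottling(Throwable t) {
        Throwable cause = t.getCause() != null ? t.getCause() : t;
        if (cause instanceof SdkServiceException e) {
            return e.isThrottlingException() || e.statusCode() == 503;
        }
        return false;
    }
}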


@Gezi-lzq
Contributor

Gezi-lzq commented Feb 6, 2026

LGTM

@superhx superhx merged commit 730aa9f into AutoMQ:main Feb 6, 2026
6 checks passed
@manmohak07 manmohak07 deleted the manmohak07/fix-503-slowdown-wal-recovery-s3 branch February 6, 2026 08:29

Development

Successfully merging this pull request may close these issues.

[BUG] 503 SlowDown when recovering WAL from S3 crashes broker
