-
Notifications
You must be signed in to change notification settings - Fork 485
vivekchavan14/fix autobalancer metrics reporter outofordersequence #2700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
vivekchavan14
wants to merge
2
commits into
AutoMQ:main
Choose a base branch
from
vivekchavan14:vivekchavan14/fix-autobalancer-metrics-reporter-outofordersequence
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
# Fix for AutoBalancer Metrics Reporter Issues (#2697) | ||
|
||
## Problem Description | ||
|
||
The AutoBalancer metrics reporter was experiencing several critical issues after upgrading from version 1.1.2 to 1.4.1: | ||
|
||
1. **OutOfOrderSequenceException**: "The broker received an out of order sequence number" | ||
2. **Missing consumption data curves**: Metrics data gaps causing incomplete monitoring dashboards | ||
3. **InterruptException during shutdown**: Improper handling of producer flush during shutdown | ||
4. **Failed to send metrics warnings**: High failure rates in metric transmission | ||
|
||
## Root Cause Analysis | ||
|
||
### Primary Issue: Missing Producer Idempotence | ||
The root cause was that the AutoBalancer metrics reporter was configured with: | ||
- `retries = 5` | ||
- `acks = all` | ||
- **BUT missing `enable.idempotence = true`** | ||
|
||
Without idempotence, when the producer retries failed requests, it can send records with sequence numbers that appear out of order to the broker, causing `OutOfOrderSequenceException`. | ||
|
||
### Secondary Issues | ||
1. **Poor error handling**: Generic error logging without differentiating error types | ||
2. **Improper shutdown**: Not gracefully handling producer flush during shutdown | ||
3. **No shutdown checks**: Attempting to send metrics even during shutdown | ||
|
||
## Solution Implemented | ||
|
||
### 1. Enable Producer Idempotence | ||
```java | ||
// Enable idempotence to prevent OutOfOrderSequenceException during retries | ||
setIfAbsent(producerProps, ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); | ||
// Increase delivery timeout to handle network issues and retries better | ||
setIfAbsent(producerProps, ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000"); | ||
// Set reasonable request timeout | ||
setIfAbsent(producerProps, ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000"); | ||
``` | ||
|
||
### 2. Improved Error Handling | ||
```java | ||
// Log different error types with appropriate levels | ||
if (e instanceof org.apache.kafka.common.errors.OutOfOrderSequenceException) { | ||
LOGGER.warn("OutOfOrderSequenceException when sending auto balancer metric (this should be resolved with idempotence enabled): {}", e.getMessage()); | ||
} else if (e instanceof org.apache.kafka.common.errors.NotLeaderOrFollowerException) { | ||
LOGGER.warn("NotLeaderOrFollowerException when sending auto balancer metric (transient error): {}", e.getMessage()); | ||
} else if (e instanceof InterruptException) { | ||
LOGGER.info("InterruptException when sending auto balancer metric (likely due to shutdown): {}", e.getMessage()); | ||
} | ||
``` | ||
|
||
### 3. Graceful Shutdown Process | ||
```java | ||
@Override | ||
public void close() { | ||
LOGGER.info("Closing Auto Balancer metrics reporter, id={}.", brokerId); | ||
shutdown = true; | ||
if (metricsReporterRunner != null) { | ||
metricsReporterRunner.interrupt(); | ||
try { | ||
metricsReporterRunner.join(PRODUCER_CLOSE_TIMEOUT.toMillis()); | ||
} catch (InterruptedException e) { | ||
LOGGER.warn("Interrupted while waiting for metrics reporter thread to finish"); | ||
Thread.currentThread().interrupt(); | ||
} | ||
} | ||
if (producer != null) { | ||
try { | ||
// Try to flush remaining metrics before closing | ||
producer.flush(); | ||
} catch (Exception e) { | ||
LOGGER.warn("Failed to flush producer during shutdown: {}", e.getMessage()); | ||
} finally { | ||
producer.close(PRODUCER_CLOSE_TIMEOUT); | ||
} | ||
} | ||
} | ||
``` | ||
|
||
### 4. Shutdown Checks in Send Method | ||
```java | ||
public void sendAutoBalancerMetric(AutoBalancerMetrics ccm) { | ||
if (shutdown) { | ||
return; // Don't send metrics if shutting down | ||
} | ||
// ... rest of method | ||
} | ||
``` | ||
|
||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
# Fix for Issue #2615: Failed to init cert metrics | ||
|
||
## Problem | ||
AutoMQ was failing to initialize certificate metrics with the error: | ||
``` | ||
java.lang.IllegalArgumentException: Illegal base64 character 20 | ||
``` | ||
|
||
The error occurred in the `S3StreamKafkaMetricsManager.parseCertificates()` method when trying to decode PEM certificate content that contained whitespace characters (spaces, tabs, newlines, carriage returns). | ||
|
||
## Root Cause | ||
The original implementation only removed newlines (`\n`) from the PEM certificate content before Base64 decoding: | ||
```java | ||
pemPart.replace("-----BEGIN CERTIFICATE-----", "").replaceAll("\\n", "") | ||
``` | ||
|
||
However, PEM certificates can contain various whitespace characters including: | ||
- Spaces (ASCII 32) | ||
- Tabs (ASCII 9) | ||
- Carriage returns (ASCII 13) | ||
- Newlines (ASCII 10) | ||
|
||
Character 20 (space) in the error message was causing the Base64 decoder to fail. | ||
|
||
## Solution | ||
### 1. Fixed the PEM parsing logic | ||
Updated the `parseCertificates()` method to: | ||
- Remove ALL whitespace characters using `replaceAll("\\s", "")` instead of just newlines | ||
- Added graceful error handling for both `IllegalArgumentException` (Base64 errors) and `CertificateException` (certificate parsing errors) | ||
- Changed from fixed array to dynamic List to handle cases where some certificates fail to parse | ||
- Added proper logging for failed certificate parsing attempts | ||
|
||
### 2. Key Changes Made | ||
**File**: `server-common/src/main/java/org/apache/kafka/server/metrics/s3stream/S3StreamKafkaMetricsManager.java` | ||
|
||
**Before**: | ||
```java | ||
private static X509Certificate[] parseCertificates(String pemContent) throws CertificateException { | ||
// ... | ||
byte[] certBytes = Base64.getDecoder().decode(pemPart.replace("-----BEGIN CERTIFICATE-----", "").replaceAll("\\n", "")); | ||
// ... | ||
} | ||
``` | ||
|
||
**After**: | ||
```java | ||
private static X509Certificate[] parseCertificates(String pemContent) throws CertificateException { | ||
// ... | ||
String cleanedPemPart = pemPart.replace("-----BEGIN CERTIFICATE-----", "") | ||
.replaceAll("\\s", ""); // Remove all whitespace characters | ||
|
||
try { | ||
byte[] certBytes = Base64.getDecoder().decode(cleanedPemPart); | ||
// ... | ||
} catch (IllegalArgumentException e) { | ||
LOGGER.warn("Failed to decode certificate part due to invalid Base64, skipping: {}", e.getMessage()); | ||
} catch (CertificateException e) { | ||
LOGGER.warn("Failed to parse certificate, skipping: {}", e.getMessage()); | ||
} | ||
// ... | ||
} | ||
``` | ||
|
||
### 3. Added Comprehensive Tests | ||
Created `S3StreamKafkaMetricsManagerTest.java` with tests for: | ||
- Empty certificate content | ||
- Certificates with various whitespace issues (spaces, tabs, carriage returns) | ||
- Invalid Base64 content | ||
- Graceful error handling | ||
|
||
## Impact | ||
- **Fixes the crash**: AutoMQ will no longer crash when initializing certificate metrics with whitespace-containing PEM content | ||
- **Robust error handling**: Invalid certificates are now skipped with appropriate logging instead of causing complete failure | ||
- **Backward compatible**: The fix doesn't break existing functionality | ||
- **Better logging**: Administrators can now see which certificates failed to parse and why | ||
|
||
## Testing | ||
- All existing tests continue to pass | ||
- New tests verify the fix handles whitespace correctly | ||
- Tested with various certificate formats and edge cases | ||
|
||
This fix resolves the "Illegal base64 character 20" error reported in issue #2615 and makes the certificate parsing more robust overall. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default here is true, and there is no conflicting configuration. Idempotence should be effective and no additional modification is required.