-
Notifications
You must be signed in to change notification settings - Fork 28
[INFRA-9517] handle auth failures with immediate reconnection #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
jsadn
wants to merge
4
commits into
main
Choose a base branch
from
auth-failure-handling
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| package com.sproutsocial.nsq; | ||
|
|
||
| /** | ||
| * Error codes for NSQ exceptions | ||
| */ | ||
| public enum NSQErrorCode { | ||
| /** | ||
| * Authentication failure - auth session expired or invalid credentials | ||
| */ | ||
| AUTH_FAILED, | ||
|
|
||
| /** | ||
| * General NSQ error (protocol errors, invalid operations, etc.) | ||
| */ | ||
| GENERAL | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
246 changes: 246 additions & 0 deletions
246
src/test/java/com/sproutsocial/nsq/AuthFailureRecoveryDockerTestIT.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,246 @@ | ||
| package com.sproutsocial.nsq; | ||
|
|
||
| import org.junit.Assert; | ||
| import org.junit.Test; | ||
| import org.slf4j.Logger; | ||
| import org.slf4j.LoggerFactory; | ||
|
|
||
| import java.util.List; | ||
| import java.util.concurrent.atomic.AtomicInteger; | ||
|
|
||
| /** | ||
| * Integration test for immediate reconnection behavior. | ||
| * | ||
| * These tests verify that the immediateCheckConnections() method (which is called | ||
| * when auth failures occur) works correctly to trigger immediate reconnection | ||
| * rather than waiting up to 60 seconds for the periodic checkConnections() call. | ||
| * | ||
| * Note: Actual E_AUTH_FAILED/E_UNAUTHORIZED error handling is tested in | ||
| * ConnectionAuthFailureTest, which verifies those errors throw AuthFailedException | ||
| * and trigger the handleAuthFailure() -> immediateCheckConnections() code path. | ||
| * These integration tests verify the reconnection mechanism itself works correctly. | ||
| */ | ||
| public class AuthFailureRecoveryDockerTestIT extends BaseDockerTestIT { | ||
| private static final Logger logger = LoggerFactory.getLogger(AuthFailureRecoveryDockerTestIT.class); | ||
| private Publisher publisher; | ||
| private Subscriber subscriber; | ||
|
|
||
| @Override | ||
| public void setup() { | ||
| super.setup(); | ||
| publisher = this.backupPublisher(); | ||
| } | ||
|
|
||
| /** | ||
| * Tests that the subscriber can successfully establish connections and receive messages. | ||
| * This baseline test ensures the immediate reconnection mechanism doesn't break | ||
| * normal operation. | ||
| */ | ||
| @Test | ||
| public void testNormalSubscriberOperation() { | ||
| TestMessageHandler handler = new TestMessageHandler(); | ||
| subscriber = new Subscriber(client, 1, 10, cluster.getLookupNode().getHttpHostAndPort().toString()); | ||
| subscriber.setDefaultMaxInFlight(10); | ||
| subscriber.subscribe(topic, "channel", handler); | ||
|
|
||
| // Send messages and verify they're received | ||
| List<String> messages = messages(20, 40); | ||
| send(topic, messages, 0, 0, publisher); | ||
|
|
||
| // Verify all messages received | ||
| List<NSQMessage> receivedMessages = handler.drainMessagesOrTimeOut(20); | ||
| validateReceivedAllMessages(messages, receivedMessages, false); | ||
| } | ||
|
|
||
| /** | ||
| * Tests that subscriber can recover from connection drops. | ||
| * This simulates a scenario similar to auth failure where connection is lost | ||
| * and needs to be re-established. | ||
| */ | ||
| @Test | ||
| public void testSubscriberReconnectionAfterNetworkDisruption() { | ||
| TestMessageHandler handler = new TestMessageHandler(); | ||
| subscriber = new Subscriber(client, 1, 10, cluster.getLookupNode().getHttpHostAndPort().toString()); | ||
| subscriber.setDefaultMaxInFlight(10); | ||
| subscriber.subscribe(topic, "channel", handler); | ||
|
|
||
| // Send first batch of messages | ||
| List<String> batch1 = messages(10, 40); | ||
| send(topic, batch1, 0, 0, publisher); | ||
| List<NSQMessage> received1 = handler.drainMessagesOrTimeOut(10); | ||
| validateReceivedAllMessages(batch1, received1, false); | ||
|
|
||
| // Simulate connection issue by disconnecting and reconnecting network | ||
| logger.info("Disconnecting network for nsqd node to simulate connection drop"); | ||
| cluster.disconnectNetworkFor(cluster.getNsqdNodes().get(0)); | ||
| Util.sleepQuietly(2000); | ||
|
|
||
| cluster.reconnectNetworkFor(cluster.getNsqdNodes().get(0)); | ||
| Util.sleepQuietly(2000); | ||
|
|
||
| // Send second batch and verify recovery | ||
| List<String> batch2 = messages(10, 40); | ||
| send(topic, batch2, 0, 0, publisher); | ||
| List<NSQMessage> received2 = handler.drainMessagesOrTimeOut(10, 20000); | ||
| validateReceivedAllMessages(batch2, received2, false); | ||
|
|
||
| logger.info("Successfully received messages after connection recovery"); | ||
| } | ||
|
|
||
| /** | ||
| * Tests that the immediateCheckConnections method is accessible and functional. | ||
| * This verifies the immediate reconnection path exists. | ||
| */ | ||
| @Test | ||
| public void testImmediateCheckConnectionsMethod() { | ||
| TestMessageHandler handler = new TestMessageHandler(); | ||
| subscriber = new Subscriber(client, 1, 10, cluster.getLookupNode().getHttpHostAndPort().toString()); | ||
| subscriber.setDefaultMaxInFlight(10); | ||
| subscriber.subscribe(topic, "channel", handler); | ||
|
|
||
| // Send initial messages to establish connection | ||
| List<String> messages = messages(5, 40); | ||
| send(topic, messages, 0, 0, publisher); | ||
| handler.drainMessagesOrTimeOut(5); | ||
|
|
||
| // Call immediateCheckConnections (this is what gets called on auth failure) | ||
| subscriber.immediateCheckConnections(topic); | ||
|
|
||
| // Verify subscriber still works after immediate check | ||
| List<String> moreMessages = messages(5, 40); | ||
| send(topic, moreMessages, 0, 0, publisher); | ||
| List<NSQMessage> received = handler.drainMessagesOrTimeOut(5); | ||
| validateReceivedAllMessages(moreMessages, received, false); | ||
|
|
||
| logger.info("immediateCheckConnections method works correctly"); | ||
| } | ||
|
|
||
| /** | ||
| * Tests that multiple rapid connection checks don't cause issues. | ||
| * This simulates what might happen if multiple connections experience auth failures. | ||
| */ | ||
| @Test | ||
| public void testMultipleRapidConnectionChecks() { | ||
| TestMessageHandler handler = new TestMessageHandler(); | ||
| subscriber = new Subscriber(client, 1, 10, cluster.getLookupNode().getHttpHostAndPort().toString()); | ||
| subscriber.setDefaultMaxInFlight(10); | ||
| subscriber.subscribe(topic, "channel", handler); | ||
|
|
||
| // Send initial messages to establish connections | ||
| List<String> messages = messages(10, 40); | ||
| send(topic, messages, 0, 0, publisher); | ||
| handler.drainMessagesOrTimeOut(10); | ||
|
|
||
| // Verify connections are established | ||
| Assert.assertTrue("Should have connections", subscriber.getConnectionCount() > 0); | ||
|
|
||
| // Call immediateCheckConnections multiple times rapidly | ||
| // (simulating multiple auth failures in quick succession) | ||
| for (int i = 0; i < 5; i++) { | ||
| subscriber.immediateCheckConnections(topic); | ||
| Util.sleepQuietly(100); | ||
| } | ||
|
|
||
| // Verify subscriber still works correctly | ||
| List<String> moreMessages = messages(10, 40); | ||
| send(topic, moreMessages, 0, 0, publisher); | ||
| List<NSQMessage> received = handler.drainMessagesOrTimeOut(10); | ||
| validateReceivedAllMessages(moreMessages, received, false); | ||
|
|
||
| logger.info("Multiple rapid connection checks handled correctly"); | ||
| } | ||
|
|
||
| /** | ||
| * Verifies that the subscriber's connection count updates correctly after | ||
| * immediate connection checks (which would happen during auth failure recovery). | ||
| */ | ||
| @Test | ||
| public void testConnectionCountAfterImmediateReconnect() { | ||
| TestMessageHandler handler = new TestMessageHandler(); | ||
| subscriber = new Subscriber(client, 1, 10, cluster.getLookupNode().getHttpHostAndPort().toString()); | ||
| subscriber.setDefaultMaxInFlight(10); | ||
| subscriber.subscribe(topic, "channel", handler); | ||
|
|
||
| // Send messages to create topic and establish initial connections | ||
| List<String> initialMessages = messages(5, 40); | ||
| send(topic, initialMessages, 0, 0, publisher); | ||
| handler.drainMessagesOrTimeOut(5); | ||
|
|
||
| int initialConnectionCount = subscriber.getConnectionCount(); | ||
| Assert.assertTrue("Should have at least one connection", initialConnectionCount > 0); | ||
| logger.info("Initial connection count: {}", initialConnectionCount); | ||
|
|
||
| // Trigger immediate check (simulating auth failure recovery) | ||
| subscriber.immediateCheckConnections(topic); | ||
| Util.sleepQuietly(2000); | ||
|
|
||
| int afterCheckCount = subscriber.getConnectionCount(); | ||
| Assert.assertTrue("Should still have connections after immediate check", afterCheckCount > 0); | ||
| logger.info("Connection count after immediate check: {}", afterCheckCount); | ||
|
|
||
| // Verify messages can still be processed | ||
| List<String> messages = messages(10, 40); | ||
| send(topic, messages, 0, 0, publisher); | ||
| List<NSQMessage> received = handler.drainMessagesOrTimeOut(10); | ||
| Assert.assertEquals("Should receive all messages after reconnect", 10, received.size()); | ||
| } | ||
|
|
||
| /** | ||
| * Tests that calling immediateCheckConnections with bad/non-existent hosts doesn't hang. | ||
| * This simulates what would happen with genuinely bad credentials where | ||
| * connection attempts fail but should complete quickly without infinite loops. | ||
| * | ||
| * Key behavior being tested: | ||
| * - immediateCheckConnections() completes quickly even if connections fail | ||
| * - No infinite loops - test completes in reasonable time | ||
| * - Failed connections are not added to connectionMap | ||
| */ | ||
| @Test | ||
| public void testBadConnectionsDoNotCauseInfiniteLoop() { | ||
| logger.info("Testing that failed connections don't cause infinite retry loops"); | ||
|
|
||
| // Create subscriber pointing to non-existent lookup server | ||
| // This simulates a scenario where auth credentials are bad | ||
| subscriber = new Subscriber(client, 300, 10, "127.0.0.1:54321"); // Unreachable port | ||
| subscriber.setDefaultMaxInFlight(10); | ||
|
|
||
| TestMessageHandler handler = new TestMessageHandler(); | ||
| subscriber.subscribe(topic, "channel", handler); | ||
|
|
||
| // Verify no connections established (lookup fails) | ||
| Util.sleepQuietly(1000); | ||
| int initialConnectionCount = subscriber.getConnectionCount(); | ||
| Assert.assertEquals("Should have no connections with bad lookup", 0, initialConnectionCount); | ||
|
|
||
| // Call immediateCheckConnections multiple times | ||
| // If this caused an infinite loop, the test would hang here | ||
| long startTime = System.currentTimeMillis(); | ||
| for (int i = 0; i < 5; i++) { | ||
| subscriber.immediateCheckConnections(topic); | ||
| Util.sleepQuietly(100); | ||
| } | ||
| long elapsedTime = System.currentTimeMillis() - startTime; | ||
|
|
||
| // Verify still no connections (failed attempts not added) | ||
| int finalConnectionCount = subscriber.getConnectionCount(); | ||
| Assert.assertEquals("Failed connections should not be in connectionMap", 0, finalConnectionCount); | ||
|
|
||
| // Verify test completed quickly (no infinite loop) | ||
| Assert.assertTrue("Should complete quickly without infinite loop", elapsedTime < 10000); | ||
|
|
||
| logger.info("Successfully verified: Failed connections don't cause infinite retry loops"); | ||
| logger.info("- {} calls to immediateCheckConnections completed in {}ms", 5, elapsedTime); | ||
| logger.info("- Connection count remained at 0 (failed connections not added)"); | ||
| } | ||
|
|
||
| @Override | ||
| public void teardown() throws InterruptedException { | ||
| if (publisher != null) { | ||
| publisher.stop(); | ||
| } | ||
| if (subscriber != null) { | ||
| subscriber.stop(); | ||
| } | ||
| super.teardown(); | ||
| } | ||
| } |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make reconnect behavior pluggable. A few different ways a caller might want to handle authentication failures:
For the case of actual bad credentials, won't an "immediate reconnect" bombard the auth server with infinite retries as fast as it can? At a minimum, we should probably give up after a certain number of authentication failures, since this would happen in the case of legitimate bad credentials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i considered this early on but is not an issue because
checkConnections():read()will encounter the exceptionSubscription.connectionMapwhile (isReading)is not running so there is not infinite loop.instead, we will see
logger.error("error connecting to:{}", activeHost, e);every minute - this is current behavior.