Description
Synopsis
Reproduces with the two attached fixture files, SimpleDatabaseTlsProviderInteractiveExample.java and bind.py. This occurs when using the OCI_INTERACTIVE auth mode, but the exception-latching part of this bug could affect other modes if an error is thrown while the auth task is being performed. Once the condition occurs, stale values are cached potentially forever in the driver's static memory, meaning the entire driver may need to be reloaded.
Steps
- Ensure that TCP port 8181 is not currently bound to a process by running lsof -i TCP:8181 on a Mac, or the equivalent on your platform.
- Run bind.py. This will bind port 8181 to a Python process and block.
- Update SimpleDatabaseTlsProviderInteractiveExample at the first TODO with your own valid OCI connect URL string (I was using a TNS_ADMIN path to a file).
- Run SimpleDatabaseTlsProviderInteractiveExample using the current main branch.
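For reference, here is a minimal Java sketch of what bind.py presumably does (the actual fixture is a Python script and is not reproduced here): bind TCP port 8181 and hold it so the driver's local token listener cannot.

```java
import java.net.ServerSocket;

// Hypothetical Java equivalent of bind.py: hold TCP port 8181 open and
// block forever, so any other process attempting to bind it gets a
// BindException until this process is killed.
public class HoldPort {
    public static void main(String[] args) throws Exception {
        try (ServerSocket s = new ServerSocket(8181)) {
            System.out.println("Holding port " + s.getLocalPort() + "; press Ctrl-C to release.");
            Thread.currentThread().join(); // waits on itself, i.e. blocks forever
        }
    }
}
```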
Expected: The program fails right away with a BindException. It then loops indefinitely, throwing not just the same kind of exception but the exact same exception object each time. Even if you kill (Ctrl-C) bind.py and double-check with lsof that the port is no longer bound, the test program will not recover and will continue to loop on the same BindException failure.
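To illustrate the "same exact exception object" behavior, here is a minimal, self-contained sketch (hypothetical names, not the driver's actual code) of how a future cached in a static Map latches a failure:

```java
import java.net.BindException;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;

public class LatchedCacheDemo {

    // Stand-in for the driver's static cache: it lives as long as the
    // Driver class itself stays loaded.
    static final Map<String, CompletableFuture<String>> CACHE = new ConcurrentHashMap<>();

    static CompletableFuture<String> getToken(String key) {
        return CACHE.computeIfAbsent(key, k -> CompletableFuture.supplyAsync(() -> {
            // Simulate the auth task failing, e.g. a BindException on port 8181.
            throw new CompletionException(new BindException("Address already in use"));
        }));
    }

    public static void main(String[] args) {
        Throwable first = null, second = null;
        try { getToken("auth").join(); } catch (CompletionException e) { first = e.getCause(); }
        // The port could be free by now, but the cached future never re-runs
        // the task, so the very same exception instance is delivered again:
        try { getToken("auth").join(); } catch (CompletionException e) { second = e.getCause(); }
        System.out.println("same exception object: " + (first == second)); // prints "true"
    }
}
```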
Fix Verification Steps:
- Switch to the fix branch. Uncomment the second TODO in SimpleDatabaseTlsProviderInteractiveExample.java.
- Verify that TCP port 8181 is not bound.
- Run bind.py. Verify that a Python process now holds 8181.
- Run SimpleDatabaseTlsProviderInteractiveExample. Note that this time it will block for a long time (over 2 minutes, because the fix contains a back-off-and-retry).
- After some time, the retries will stop and an exception will be thrown and printed.
- Note that when the main program loop comes around, the connection is attempted again and fails again after over 2 minutes of retries; i.e., even though auth is still failing because bind.py still holds 8181, the resource provider is able to recover and try again.
- Now kill bind.py and continue to wait. After at most 2 or 3 minutes, the provider will fail once more, but on the next main-loop retry it will succeed, provided bind.py is dead and no other process holds 8181. The connection now works (after you respond to the login prompt in the launched browser) and the loop runs over and over with the SELECT statement properly executed.
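As a rough illustration of why the retry phase runs on the order of two minutes: an exponential back-off starting at one second and doubling each attempt accumulates about two minutes over seven retries. These numbers are hypothetical; the actual schedule in the fix may differ.

```java
// Hypothetical back-off budget: 1s start, doubling per attempt, capped at 60s.
// Seven waits of 1+2+4+8+16+32+64 seconds total 127s, i.e. "over 2 minutes".
public class BackoffBudget {
    public static void main(String[] args) {
        long delayMs = 1_000, totalMs = 0;
        for (int attempt = 1; attempt <= 7; attempt++) {
            totalMs += delayMs;
            delayMs = Math.min(delayMs * 2, 60_000); // double, capped at 60s
        }
        System.out.println("total wait across 7 retries: " + totalMs / 1000 + "s"); // prints "total wait across 7 retries: 127s"
    }
}
```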
Detailed Analysis:
First, port 8181 must be available to bind on the system where the program runs (an OCI requirement). The provider supports no form of retry when it cannot immediately bind the port to deliver the token response back to the caller. In addition, the cache system holds on to exception state in its executor Future objects: when an error such as a BindException occurs, the factory "latches" onto that error state and never recovers. The cache is held in a static Map, so it cannot be cleared without reloading the Driver, which is why the latched error state persists once it occurs. I have created a GitHub PR that provides two changes: one adds an exponential back-off on BindException failures, improving our chances of not failing in the first place; the other modifies the Future object so that it can track when its result exception has been thrown, allowing stale resource futures to be flushed out of the cache on a retry from the client.
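The two changes can be sketched as follows. This is a hedged, simplified illustration with hypothetical names, not the actual PR code: a back-off loop around the bind attempt, and a cache lookup that evicts a future once it has completed exceptionally so the next caller re-runs the auth task.

```java
import java.net.BindException;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class RecoverableCacheSketch {

    static final Map<String, CompletableFuture<String>> CACHE = new ConcurrentHashMap<>();

    // Change 1: retry the bind with exponential back-off instead of failing
    // on the first BindException.
    static String bindWithBackoff(Supplier<String> bindTask, int maxAttempts) throws Exception {
        long delayMs = 250;
        for (int attempt = 1; ; attempt++) {
            try {
                return bindTask.get();
            } catch (RuntimeException e) {
                if (!(e.getCause() instanceof BindException) || attempt == maxAttempts) {
                    throw e; // not a bind failure, or out of attempts: give up
                }
                Thread.sleep(delayMs);                   // wait before retrying the bind
                delayMs = Math.min(delayMs * 2, 30_000); // double the delay, capped
            }
        }
    }

    // Change 2: if the cached future has already completed with an exception,
    // evict it so this retry re-runs the auth task instead of rethrowing the
    // same stale exception object forever.
    static CompletableFuture<String> getToken(String key, Supplier<String> authTask) {
        CompletableFuture<String> f = CACHE.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(authTask));
        if (f.isCompletedExceptionally()) {
            CACHE.remove(key, f); // flush only this exact stale failure
            f = CACHE.computeIfAbsent(key,
                    k -> CompletableFuture.supplyAsync(authTask));
        }
        return f;
    }
}
```

The eviction uses the two-argument `Map.remove(key, value)` so only the exact failed future is removed, avoiding a race with another caller that has already installed a fresh one.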