Skip to content

[grid] Improve readTimeout and cache expiration of HttpClient between Router and Nodes #16154

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: trunk
Choose a base branch
from

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented Aug 10, 2025

User description

🔗 Related Issues

💥 What does this PR do?

  • Add back Cache Builder for HttpClient in Router (use Caffeine Cache Builder, which was recently done via [grid] Migrate from Guava's CacheBuilder to Caffeine #15547)
  • Remove the hardcoded value of 2 minutes for the timeout of stale connections.
  • Implement dynamic expiry based on HttpClient read timeout (utilize caffeine.cache.Expiry).
  • Implement a private method to GET /status NodeStatus and get the node session timeout of the corresponding Node and use it for cache expiration (silent fallback default readTime of ClientConfig if node session timeout N/A)
  • Use read timeout to set expireAfterCreate (the first time HttpClient is added to cache), expireAfterRead (when HttpClient is accessed from cache).
  • HttpClient read timeout and cache expiration are more reliable now.

Here is few logs when enable log-level FINE

14:50:16.912 DEBUG [HandleSession.lambda$loadSessionId$2] - Creating new HTTP client for http://172.18.0.13:5555
14:50:16.927 DEBUG [JdkHttpClient.execute0] - Executing request: (GET) /status
14:50:16.987 DEBUG [HandleSession.lambda$loadSessionId$2] - Created connection for http://172.18.0.13:5555 (read timeout: 300 seconds)
14:50:16.988 DEBUG [HandleSession$1.expireAfterCreate] - Set (read timeout: 300 seconds) as initial expiration for http://172.18.0.13:5555 in cache
14:50:16.989 DEBUG [HandleSession$1.expireAfterRead] - Set (read timeout: 300 seconds) as expiration after read for http://172.18.0.13:5555 in cache

🔧 Implementation Notes

💡 Additional Considerations

🔄 Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (backwards compatible)
  • New feature (non-breaking change which adds functionality and tests!)
  • Breaking change (fix or feature that would cause existing functionality to change)

PR Type

Enhancement


Description

  • Replace manual HTTP client cache with Caffeine cache library

  • Simplify resource management with automatic cleanup

  • Add proper executor service shutdown in GridStatusHandler

  • Remove complex manual cache eviction logic


Diagram Walkthrough

flowchart LR
  A["Manual Cache Management"] --> B["Caffeine Cache"]
  C["Custom CacheEntry Class"] --> D["Direct HttpClient Storage"]
  E["Manual Cleanup Thread"] --> F["Automatic Expiry & Removal"]
  G["Complex Usage Counting"] --> H["Simple Cache Access"]
Loading

File Walkthrough

Relevant files
Enhancement
HandleSession.java
Replace manual cache with Caffeine implementation               

java/src/org/openqa/selenium/grid/router/HandleSession.java

  • Replace ConcurrentHashMap with Caffeine cache for HTTP clients
  • Remove manual cleanup thread and usage counting logic
  • Simplify cache entry management with automatic expiry
  • Add removal listener for proper HTTP client cleanup
+42/-102
GridStatusHandler.java
Add proper resource cleanup                                                           

java/src/org/openqa/selenium/grid/router/GridStatusHandler.java

  • Implement AutoCloseable interface
  • Add graceful executor service shutdown
  • Import ExecutorServices utility for cleanup
+10/-1   
Dependencies
BUILD.bazel
Add Caffeine dependency                                                                   

java/src/org/openqa/selenium/grid/router/BUILD.bazel

  • Add Caffeine cache library dependency
+1/-0     

@VietND96 VietND96 requested review from diemol and joerg1985 August 10, 2025 04:52
@selenium-ci selenium-ci added B-grid Everything grid and server related C-java Java Bindings B-build Includes scripting, bazel and CI integrations labels Aug 10, 2025
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

1234 - Partially compliant

Compliant requirements:

  • None

Non-compliant requirements:

  • Investigate regression where clicking a link with JavaScript in href does not trigger JS in Selenium 2.48.x, while it works in 2.47.1.
  • Reproduce with Firefox 42.0 (32-bit) on 64-bit machine.
  • Provide fix or explanation specific to Router/Grid if applicable.
  • Add tests confirming JS href is executed on click.

Requires further human verification:

  • Browser-level behavior verification that JS in href is executed on click across versions.
  • End-to-end test execution against Firefox 42 environment.

5678 - Partially compliant

Compliant requirements:

  • None

Non-compliant requirements:

  • Diagnose "Error: ConnectFailure (Connection refused)" when instantiating multiple ChromeDriver instances.
  • Ensure stable ChromeDriver initialization without connection failures after the first instance.
  • Provide documentation or fix.
  • Include tests for multiple driver instantiation scenarios.

Requires further human verification:

  • Reproduction with specified Chrome/ChromeDriver versions on Ubuntu 16.04.4.
  • System-level/network diagnostics for connection refusals.
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Resource Lifecycle

Introducing AutoCloseable with a static ExecutorService means close() will affect a shared executor across instances. Validate that shutdown is intended and not called multiple times or prematurely by lifecycle management.

class GridStatusHandler implements HttpHandler, AutoCloseable {

  private static final Logger LOG = Logger.getLogger(GridStatusHandler.class.getName());
  private static final ExecutorService EXECUTOR_SERVICE =
      Executors.newCachedThreadPool(
          r -> {
            Thread thread = new Thread(r, "Grid status executor");
            thread.setDaemon(true);
            return thread;
          });

  private final Tracer tracer;
  private final Distributor distributor;

  GridStatusHandler(Tracer tracer, Distributor distributor) {
    this.tracer = Require.nonNull("Tracer", tracer);
    this.distributor = Require.nonNull("Distributor", distributor);
  }

  @Override
  public HttpResponse execute(HttpRequest req) {

    try (Span span = newSpanAsChildOf(tracer, req, "grid.status")) {
      AttributeMap attributeMap = tracer.createAttributeMap();
      attributeMap.put(AttributeKey.LOGGER_CLASS.getKey(), getClass().getName());
Cache Expiry Semantics

Switching to expireAfterAccess(2m) may evict active-but-idle session clients during long server-side waits or streaming responses. Validate that 2 minutes is safe and won’t drop connections mid-request; consider using eviction only when not in use or tuning duration.

  // Create Cache with 2 minute expiry after last access
  // and a removal listener to close HTTP clients
  this.httpClientCache =
      Caffeine.newBuilder()
          .expireAfterAccess(Duration.ofMinutes(2))
          .removalListener(
              (RemovalListener<URI, HttpClient>)
                  (uri, httpClient, cause) -> {
                    if (httpClient != null) {
                      try {
                        LOG.fine("Closing HTTP client for " + uri + ", removal cause: " + cause);
                        httpClient.close();
                      } catch (Exception ex) {
                        LOG.log(Level.WARNING, "Failed to close HTTP client for " + uri, ex);
                      }
                    }
                  })
          .build();
}
Close Timing

Removal listener closes HttpClient on eviction; ensure HttpClient is not shared across concurrent requests to the same URI that may still be in-flight when eviction triggers. Confirm HttpClient thread-safety and that eviction cannot race with active usage.

      Caffeine.newBuilder()
          .expireAfterAccess(Duration.ofMinutes(2))
          .removalListener(
              (RemovalListener<URI, HttpClient>)
                  (uri, httpClient, cause) -> {
                    if (httpClient != null) {
                      try {
                        LOG.fine("Closing HTTP client for " + uri + ", removal cause: " + cause);
                        httpClient.close();
                      } catch (Exception ex) {
                        LOG.log(Level.WARNING, "Failed to close HTTP client for " + uri, ex);
                      }
                    }
                  })
          .build();
}

Copy link
Contributor

qodo-merge-pro bot commented Aug 10, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Ensure cache thread-safety semantics

Replacing explicit in-use counting with Caffeine’s expireAfterAccess risks
evicting and closing an HttpClient while a long-running request is still
executing, since access occurs before execution and no pinning occurs during
use. Consider using a per-request lifecycle (new client per call), a
reference/pinning mechanism (e.g., cache values as a lightweight wrapper with
leasing), or switching to eviction based on explicit signals rather than time
alone to avoid clients being closed mid-flight.

Examples:

java/src/org/openqa/selenium/grid/router/HandleSession.java [85-100]
this.httpClientCache =
    Caffeine.newBuilder()
        .expireAfterAccess(Duration.ofMinutes(2))
        .removalListener(
            (RemovalListener<URI, HttpClient>)
                (uri, httpClient, cause) -> {
                  if (httpClient != null) {
                    try {
                      LOG.fine("Closing HTTP client for " + uri + ", removal cause: " + cause);
                      httpClient.close();

 ... (clipped 6 lines)
java/src/org/openqa/selenium/grid/router/HandleSession.java [178-185]
HttpClient httpClient =
    httpClientCache.get(
        sessionUri,
        uri -> {
          LOG.fine("Creating new HTTP client for " + uri);
          ClientConfig config = ClientConfig.defaultConfig().baseUri(uri).withRetries();
          return httpClientFactory.createClient(config);
         });

Solution Walkthrough:

Before:

// Get client from cache; access time is updated now
HttpClient client = cache.get(uri, this::createClient);

// Use the client for a potentially long operation.
// The cache might evict and close the client during this time
// if the operation takes longer than the expiry time.
handler.execute(longRunningRequest);

After:

// Re-introduce a wrapper with reference counting
class ClientWrapper {
  final HttpClient client;
  final AtomicInteger usageCount = new AtomicInteger(0);
  // ... constructor
}

// In request handler
ClientWrapper wrapper = cache.get(uri, ...);
wrapper.usageCount.incrementAndGet();
try {
  // use wrapper.client
} finally {
  wrapper.usageCount.decrementAndGet();
}

// Eviction policy must check usageCount before removing.
Suggestion importance[1-10]: 9

__

Why: This suggestion correctly identifies a critical race condition where an active HttpClient could be closed mid-request due to the removal of the in-use tracking mechanism, which is a significant design flaw.

High
Possible issue
Use instance-scoped executor

Avoid a static ExecutorService tied to instance lifecycle; otherwise, calling
close() on one instance affects all. Make the executor an instance field to
ensure proper ownership and shutdown.

java/src/org/openqa/selenium/grid/router/GridStatusHandler.java [55-61]

-private static final ExecutorService EXECUTOR_SERVICE =
+private final ExecutorService executorService =
     Executors.newCachedThreadPool(
         r -> {
           Thread thread = new Thread(r, "Grid status executor");
           thread.setDaemon(true);
           return thread;
         });
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a critical issue where a static ExecutorService is tied to an instance's close() method, which could lead to premature shutdown if multiple GridStatusHandler instances exist. Making it an instance field is the correct fix.

Medium
  • Update

Copy link
Member

@joerg1985 joerg1985 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VietND96 Switching back to a standard cache will bring back the issue descriped in #12978. Therefore this PR shouldn't be merged in my mind.

@VietND96
Copy link
Member Author

@joerg1985 Ok I see. Since I just tried this after moving on to Caffeine cache builder recently.
Cache builder only care access timeout, while the connection uses read timeout. I missed this session handle includes the connection from client also.

@VietND96 VietND96 changed the title [grid] Simplify httpClientCache manages in handle session of Router [grid] Improve readTimeout and cache expiration of HttpClient between Router and Nodes Aug 10, 2025
@VietND96
Copy link
Member Author

@joerg1985, can you review again? I spent hours reading Caffeine support, now the cache expiration is aligned with read timeout and dynamically based on the destination node session timeout (I also handle resilient fallback to get default ClientConfig if without /status endpoint).

@VietND96 VietND96 requested a review from joerg1985 August 10, 2025 15:24
@joerg1985
Copy link
Member

@VietND96 i don't think it is enought to use a different timeout, as the connection timeout and delays in processing (e.g. gc pauses) must be respected too, to ensure the client is not closed before the response is processed. Additionally the timeout handling inside the JDK is not perfect (there is an open issue regarding this in the JDK Jira), so the client might get a worse response as the client is closed than from the real timeout.

The current timeout handling will calculate the inactivity from the last time the client did return a result and not from the last time the client was used. So a client might get closed to fast and the next request to the session has to start a new client first.

@VietND96
Copy link
Member Author

Hmm, so for example, read timeout is 300s, at the second 100th, HttpClient is loaded from cache to execute a request, expiration immediately reset back to 300s, also not a safe enough protocol?

@joerg1985
Copy link
Member

I think usage counting is the correct pattern to ensure the client is not in use.

Perhaps @ben-manes can tell if there is a matching pattern inside Coffein or it might be interresting to have such a mechanism integrated to Caffeine. The basic problem is a long running operation on the cached object and it should not be disposed while this operation is running. This does not sound to exotic to handle it inside a library to me.

* @param lastAccessTime the last time this entry was accessed
* @return true if the entry should be evicted
*/
boolean isExpired(long lastAccessTime) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find any call to this method, is guess this can be removed or called 😀

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am reworking this based on your feedback. Actually, when using Expiry (by create, read, update Cache), we could not overwrite the expired logic, so this isExpired is redundant and useless.
You're correct, client is not in use is still a safe protocol. I am thinking how can implement cache with this pattern.

// and a removal listener to close HTTP clients
this.httpClientCache =
Caffeine.newBuilder()
.expireAfter(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't need to log based on the operation type, you could use

Expiry.accessing((key, value) -> {
  LOG.fine("Set (read timeout: {} seconds) for {} in cache",
      cacheEntry.getConfig().readTimeout().toSeconds(), uri);
  return cacheEntry.getConfig().readTimeout()
});

@ben-manes
Copy link

The basic problem is a long running operation on the cached object and it should not be disposed while this operation is running. This does not sound to exotic to handle it inside a library to me.

You can use pinning if you inform the cache of the long-running operation to exclude the entry from consideration. That requires calling into the cache to modify the entry's metadata. That can be difficult to retrofit and there isn't an entry callback for in-use since that would degrade eviction to O(n) and unbounded usage. You might consider using a two-level cache with the primary as weak reference for reference counting that loads and evicts from an expiring cache.

Comment on lines +195 to +196
* <p>Weight Strategy: - inUse == 0: Weight = 1 (normal, can be evicted) - inUse > 0: Weight =
* Integer.MAX_VALUE (effectively pinned, won't be evicted)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pinning has a weight of zero. It represents how much space the entry takes, so MAX_VALUE is well everything.

@VietND96
Copy link
Member Author

@joerg1985 I added back inUse and lastUse were introduced by you before, and implemented cache around it
I also tried to use pinning entries. Can you also review the usage is correct @ben-manes ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-build Includes scripting, bazel and CI integrations B-grid Everything grid and server related C-java Java Bindings Review effort 3/5
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants