Skip to content

[grid] Reduce redundant logs of find slots and retry queue requests by the Distributor #16155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 12, 2025

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented Aug 10, 2025

User description

🔗 Related Issues

💥 What does this PR do?

  • When all Nodes with status UP are busy (no available slots to distribute new requests), and the SessionQueue has many pending sessions.
  • In a situation, all those busy Nodes will still be returned to iterate to check the capability match with the request queue for creating a session. It leads to lots of logs and unnecessary retries could be seen as below.

07:00:47.623 INFO [LocalDistributor.newSession] - Session request received by the Distributor:
[Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, ...
07:00:47.623 INFO [LocalDistributor.newSession] - Unable to find a free slot for request 17f228d9-364b-4d66-9450-6e64be3fef40.
Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, ...
07:00:47.623 INFO [LocalDistributor.newSession] - Unable to find a free slot for request aab1b19d-a567-4b0b-83ba-85b72ec8807e.
Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, ...
07:00:47.624 WARN [SeleniumSpanExporter$1.lambda$export$1] - Will retry session 17f228d9-364b-4d66-9450-6e64be3fef40

  • To reduce those logs and unnecessary checks when the bandwidth is full, update the list getAvailableNodes() to include a condition that has at least 1 available Slot. Once the return Nodes are narrowed down, the check hasCapability() can be optimized.

🔧 Implementation Notes

💡 Additional Considerations

🔄 Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (backwards compatible)
  • New feature (non-breaking change which adds functionality and tests!)
  • Breaking change (fix or feature that would cause existing functionality to change)

PR Type

Enhancement


Description

  • Optimize node selection to only return nodes with free slots

  • Reduce redundant logging and capability checks for busy nodes

  • Add comprehensive test coverage for node filtering logic

  • Fix JMX bean cleanup in LocalNewSessionQueue


Diagram Walkthrough

flowchart LR
  A["Session Request"] --> B["getAvailableNodes()"]
  B --> C["Filter UP nodes only"]
  C --> D["Check for free slots"]
  D --> E["Return filtered nodes"]
  E --> F["Reduced redundant checks"]
Loading

File Walkthrough

Relevant files
Enhancement
LocalDistributor.java
Enhanced node filtering for available slots                           

java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java

  • Modified getAvailableNodes() to filter nodes with at least one free
    slot
  • Simplified isNotSupported() method by removing redundant UP status
    check
  • Updated session polling logic to remove duplicate availability check
+10/-5   
Bug fix
LocalNewSessionQueue.java
Added JMX bean cleanup functionality                                         

java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java

  • Added JMX bean management with proper cleanup
  • Store MBean reference for unregistration on close
  • Import MBean class for proper resource management
+8/-1     
Tests
LocalDistributorTest.java
Comprehensive test coverage for node filtering                     

java/test/org/openqa/selenium/grid/distributor/local/LocalDistributorTest.java

  • Added test for nodes with free slots filtering
  • Added test for excluding draining nodes
  • Added test for excluding down nodes
  • Added test for reducing redundant slot checks
  • Added test for handling fully occupied nodes
+345/-0 

@VietND96 VietND96 requested a review from diemol August 10, 2025 19:00
@selenium-ci selenium-ci added B-grid Everything grid and server related C-java Java Bindings labels Aug 10, 2025
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

5678 - Partially compliant

Compliant requirements:

  • None

Non-compliant requirements:

  • Investigate and resolve repeated "ConnectFailure" when multiple drivers are instantiated.
  • Determine discrepancy between first and subsequent instances.
  • Provide a fix/mitigation for repeated connection failures.

Requires further human verification:

  • Add reproducible tests or environment verification specific to ChromeDriver/Ubuntu networking to validate the issue.

1234 - Partially compliant

Compliant requirements:

  • None

Non-compliant requirements:

  • Ensure click() triggers JavaScript in link href for Firefox 2.48.x regression.

Requires further human verification:

  • Browser-level behavior validation with Firefox versions mentioned in the ticket.
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Behavioral Change

getAvailableNodes() now returns only UP nodes with at least one free slot, which changes semantics from "eligible nodes" to "currently free nodes". This may skip capability checks for busy-but-matching nodes and affect scheduling fairness or metrics. Validate that callers expect this stricter filtering and that starvation does not worsen under load.

return model.getSnapshot().stream()
    .filter(
        node -> {
          // Only consider UP nodes (not DOWN, DRAINING, etc.)
          if (!UP.equals(node.getAvailability())) {
            return false;
          }
          // Consider node has at least one free slot
          return node.getSlots().stream().anyMatch(slot -> slot.getSession() == null);
        })
    .collect(toImmutableSet());
Redundant Filtering

After getAvailableNodes() already ensures free slots, an additional .filter(NodeStatus::hasCapacity) remains in the polling path. Confirm NodeStatus::hasCapacity uses the same definition of "free slot"; otherwise, reconcile potential mismatch or remove redundancy.

getAvailableNodes().stream()
    .filter(NodeStatus::hasCapacity)
    .flatMap(node -> node.getSlots().stream().map(Slot::getStereotype))
JMX Cleanup Robustness

New MBean is stored and unregistered on close(); ensure multiple register/unregister calls and failure scenarios are handled (e.g., queue created via factory, close called twice, or register throws). Consider try/catch around unregister to avoid masking shutdown errors.

if (jmxBean != null) {
  new JMXHelper().unregister(jmxBean.getObjectName());
}

Copy link
Contributor

qodo-merge-pro bot commented Aug 10, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Risk of starvation/backlog growth

getAvailableNodes now excludes all UP nodes without a free slot, which means
capability support checks and stereotype accounting will ignore temporarily busy
nodes. This can cause the queue to mispredict supply, skip nodes about to free
up, and increase session starvation under bursty load. Consider separating
"supports capability" from "has free slot" to retain visibility of capacity
while still avoiding redundant per-slot checks and logs.

Examples:

java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [513-523]
return model.getSnapshot().stream()
    .filter(
        node -> {
          // Only consider UP nodes (not DOWN, DRAINING, etc.)
          if (!UP.equals(node.getAvailability())) {
            return false;
          }
          // Consider node has at least one free slot
          return node.getSlots().stream().anyMatch(slot -> slot.getSession() == null);
        })

 ... (clipped 1 lines)
java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [713]
return getAvailableNodes().stream().noneMatch(node -> node.hasCapability(caps, slotMatcher));

Solution Walkthrough:

Before:

// in LocalDistributor.java

protected Set<NodeStatus> getAvailableNodes() {
  // Returns only UP nodes that have at least one free slot
  return model.getSnapshot().stream()
      .filter(node -> UP.equals(node.getAvailability()) && node.hasFreeSlot())
      .collect(toImmutableSet());
}

private boolean isNotSupported(Capabilities caps) {
  // This will be TRUE if all supporting nodes are temporarily busy
  return getAvailableNodes().stream().noneMatch(node -> node.hasCapability(caps));
}

After:

// in LocalDistributor.java

protected Set<NodeStatus> getUpNodes() {
  // Returns all UP nodes, regardless of load
  return model.getSnapshot().stream()
      .filter(node -> UP.equals(node.getAvailability()))
      .collect(toImmutableSet());
}

private boolean isNotSupported(Capabilities caps) {
  // Correctly checks for capability across all UP nodes
  return getUpNodes().stream().noneMatch(node -> node.hasCapability(caps));
}

// Session creation logic would then use getUpNodes() and filter for free slots,
// e.g. getUpNodes().stream().filter(node -> node.hasFreeSlot())...
Suggestion importance[1-10]: 9

__

Why: This suggestion correctly identifies a critical design flaw where filtering out busy nodes in getAvailableNodes leads to an incorrect assessment of the grid's total capacity, potentially causing session starvation or premature rejection of valid requests.

High
Possible issue
Guard JMX registration failures

Ensure registration failures don't leave jmxBean uninitialized and cause NPEs on
close. Guard registration with try-catch and set the field to null on failure,
optionally logging the error.

java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java [144-145]

 // Manage JMX and unregister on close()
-this.jmxBean = new JMXHelper().register(this);
+MBean registered = null;
+try {
+  registered = new JMXHelper().register(this);
+} catch (RuntimeException e) {
+  // Registration failed; leave jmxBean as null to avoid NPE on close
+  // Optionally log at debug level
+}
+this.jmxBean = registered;
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: This suggestion correctly identifies a potential RuntimeException from JMX registration that could be unhandled, and proposes a robust try-catch block to prevent it from crashing the constructor.

Medium
General
Use capacity check instead of slot scan
Suggestion Impact:The commit replaced the local getAvailableNodes implementation (which filtered UP nodes and scanned slots for a free one) with a delegated call to nodeRegistry.getAvailableNodes(), indicating the suggested intent (avoiding slot scan in favor of encapsulated capacity logic) was adopted via refactoring.

code diff:

   protected Set<NodeStatus> getAvailableNodes() {
-    Lock readLock = this.lock.readLock();
-    readLock.lock();
-    try {
-      return model.getSnapshot().stream()
-          .filter(
-              node -> {
-                // Only consider UP nodes (not DOWN, DRAINING, etc.)
-                if (!UP.equals(node.getAvailability())) {
-                  return false;
-                }
-                // Consider node has at least one free slot
-                return node.getSlots().stream().anyMatch(slot -> slot.getSession() == null);
-              })
-          .collect(toImmutableSet());
-    } finally {
-      readLock.unlock();
-    }
+    return nodeRegistry.getAvailableNodes();
   }

Avoid iterating over all slots just to detect a free one by leveraging
NodeStatus.hasCapacity() if it already encapsulates this logic. This reduces
per-call overhead and keeps the intent consistent with other uses. Fall back to
a short-circuiting check if hasCapacity() is insufficient.

java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [509-527]

 protected Set<NodeStatus> getAvailableNodes() {
   Lock readLock = this.lock.readLock();
   readLock.lock();
   try {
     return model.getSnapshot().stream()
-        .filter(
-            node -> {
-              // Only consider UP nodes (not DOWN, DRAINING, etc.)
-              if (!UP.equals(node.getAvailability())) {
-                return false;
-              }
-              // Consider node has at least one free slot
-              return node.getSlots().stream().anyMatch(slot -> slot.getSession() == null);
-            })
+        .filter(node -> UP.equals(node.getAvailability()) && node.hasCapacity())
         .collect(toImmutableSet());
   } finally {
     readLock.unlock();
   }
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6

__

Why: The suggestion correctly proposes using node.hasCapacity() to simplify the filter logic, which improves readability and aligns with its usage elsewhere in the PR.

Low
Safely unregister JMX before shutdown

Unregister the JMX MBean before shutting down the executor to prevent potential
deadlocks or race conditions with active JMX exposures. Also, wrap
unregistration in a try-catch to avoid masking shutdown on failures.

java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java [505-512]

 @Override
 public void close() {
+  if (jmxBean != null) {
+    try {
+      new JMXHelper().unregister(jmxBean.getObjectName());
+    } catch (RuntimeException e) {
+      // Swallow or log; avoid preventing shutdown
+    }
+  }
   shutdownGracefully(NAME, service);
-
-  if (jmxBean != null) {
-    new JMXHelper().unregister(jmxBean.getObjectName());
-  }
 }
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly points out that an exception during JMX unregistration could prevent the service shutdown and proposes a valid try-catch block to ensure shutdownGracefully is always called.

Low
  • Update

diemol
diemol previously requested changes Aug 11, 2025
Signed-off-by: Viet Nguyen Duc <[email protected]>

# Conflicts:
#	java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java
@VietND96 VietND96 requested a review from diemol August 11, 2025 16:44
@diemol diemol added this to the Selenium 4.35 milestone Aug 11, 2025
@diemol diemol dismissed their stale review August 11, 2025 21:32

Already reviewed

@diemol diemol enabled auto-merge (squash) August 11, 2025 22:18
@diemol diemol merged commit 4ecdae3 into trunk Aug 12, 2025
33 of 35 checks passed
@diemol diemol deleted the distributor-retry branch August 12, 2025 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-grid Everything grid and server related C-java Java Bindings Review effort 3/5
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants