-
Notifications
You must be signed in to change notification settings - Fork 31
Description
We're using oncrpc4j version 3.3.2 (which, from what I can see from diffing the releases, is functionally identical to 3.3.3 - the latest).
here's some log snippet from our product for context:
2025-06-16 19:18:07.641 +0000,1837242509882 {} DEBUG c.p.d.m.MobilityControl:318 [msm-message-consumer-2-0] msmId=2-1750099925069: objId=2-2562: COW: SET_INSTANCES_COW_SPLIT: calling SICS (async) with pdfsSessionId=2370338415347, ...
2025-06-16 19:18:07.641 +0000,1837242675698 {} WARN c.p.d.m.MobilityControl:318 [msm-message-consumer-2-0] msmId=2-1750099925069: objId=2-2562: COW: SET_INSTANCES_COW_SPLIT: SICS (async) call failed: did not get a response within 5 MINUTES
The important part to note here is that we're calling an async RPC endpoint with a timeout of 5 minutes in the first line at 19:18:07.641. Literally within the same millisecond, we're getting a TimeoutException on the failed handler from the oncrpc4j library indicating "did not get a response within 5 MINUTES". This exception message attached to the TimeoutException comes from rpc/ReplyQueue here:
public void registerKey(int xid, SocketAddress addr, CompletionHandler<RpcReply, RpcTransport> callback, final long timeout, final TimeUnit timeoutUnits) throws EOFException {
ScheduledFuture<?> scheduledTimeout = null;
if (timeout > 0 && timeoutUnits != null) {
scheduledTimeout = executorService.schedule(() -> {
CompletionHandler<RpcReply, RpcTransport> handler = get(xid);
if (handler != null) { //means we're 1st, no response yet
handler.failed(new TimeoutException("did not get a response within " + timeout + " " + timeoutUnits), null);
}
}, timeout, timeoutUnits);
}
_queue.put(xid, new PendingRequest(addr, callback, scheduledTimeout));
}
I don't really know how this can be happening - it would imply a bug in the scheduler I think, which I find hard to imagine, but the proof is clear. The message "did not get a response within" is clearly coming from oncrpc4j ReplyQueue code, and the timeout is clearly 5 minutes, based on the content of the same message. Yet, we're failing immediately. It's like schedule is running the event on the same thread inline (though that is not proven here).
[update] We're building our code against the 3.3.2 version of oncrpc4j library on maven central - we compile with Java 21.
[update 2] I've analyzed our logs and found that there were hundreds of legitimate 5 MINUTE timeouts in this session. Only this anomalous zero-delay "5 MINUTE" timeout occurred during the session. Submissions and time outs that occurred on either side if this event seemed to work perfectly.