Skip to content

RPC call times out immediately #110

@jcalcote

Description

@jcalcote

We're using oncrpc4j version 3.3.2 (which, from what I can see from diffing the releases, is functionally identical to 3.3.3 - the latest).

here's some log snippet from our product for context:

2025-06-16 19:18:07.641 +0000,1837242509882 {} DEBUG c.p.d.m.MobilityControl:318 [msm-message-consumer-2-0] msmId=2-1750099925069: objId=2-2562: COW: SET_INSTANCES_COW_SPLIT: calling SICS (async) with pdfsSessionId=2370338415347, ...

2025-06-16 19:18:07.641 +0000,1837242675698 {} WARN c.p.d.m.MobilityControl:318 [msm-message-consumer-2-0] msmId=2-1750099925069: objId=2-2562: COW: SET_INSTANCES_COW_SPLIT: SICS (async) call failed: did not get a response within 5 MINUTES

The important part to note here is that we're calling an async RPC endpoint with a timeout of 5 minutes in the first line at 19:18:07.641. Literally within the same millisecond, we're getting a TimeoutException on the failed handler from the oncrpc4j library indicating "did not get a response within 5 MINUTES". This exception message attached to the TimeoutException comes from rpc/ReplyQueue here:

    public void registerKey(int xid, SocketAddress addr, CompletionHandler<RpcReply, RpcTransport> callback, final long timeout, final TimeUnit timeoutUnits) throws EOFException {
        ScheduledFuture<?> scheduledTimeout = null;
        if (timeout > 0 && timeoutUnits != null) {
            scheduledTimeout = executorService.schedule(() -> {
                CompletionHandler<RpcReply, RpcTransport> handler = get(xid);
                if (handler != null) { //means we're 1st, no response yet
                    handler.failed(new TimeoutException("did not get a response within " + timeout + " " + timeoutUnits), null);
                }
            }, timeout, timeoutUnits);
        }
        _queue.put(xid, new PendingRequest(addr, callback, scheduledTimeout));
    }

I don't really know how this can be happening - it would imply a bug in the scheduler I think, which I find hard to imagine, but the proof is clear. The message "did not get a response within" is clearly coming from oncrpc4j ReplyQueue code, and the timeout is clearly 5 minutes, based on the content of the same message. Yet, we're failing immediately. It's like schedule is running the event on the same thread inline (though that is not proven here).

[update] We're building our code against the 3.3.2 version of oncrpc4j library on maven central - we compile with Java 21.

[update 2] I've analyzed our logs and found that there were hundreds of legitimate 5 MINUTE timeouts in this session. Only this anomalous zero-delay "5 MINUTE" timeout occurred during the session. Submissions and time outs that occurred on either side if this event seemed to work perfectly.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions