RemoteDisconnect: Changes to HTTPAdapter and Retry logic #1253
nicolasbisurgi started this conversation in General
Replies: 2 comments
-
I support this suggestion
-
Thanks @MariusWirtz! @macsir - I think you developed the HTTPAdapter a while ago; if you have any input on this matter it would be great to hear it :)
-
Problem statement
When running TM1py (wrapped by RushTI) against PAoC instances, multiple clients experience frequent RemoteDisconnect errors. This happens randomly, with no clear pattern, and it's impossible to reproduce at will. One thing to note is that this never happens when we run RushTI locally (either in an on-prem installation or with an executable version of RushTI inside PAoC).
Possible explanation
When trying to find an explanation for this I've focused on the HTTPAdapter, as it is the component that manages requests against the TM1 server, and it's here where I think the fundamental problem is:
The pool is defined against a specific target (which in our case is a single TM1 instance), so I don't see why we would need pool_connections to be more than 1. pool_maxsize, however, is how many connections that pool can hold. This is key when dealing with high concurrency, as RushTI does: if we set RushTI to run up to 60 processes in parallel, then pool_maxsize should be 60, and the endpoint web server will be monitoring 60 connections.
BUT since we have the same value for both pool_connections and pool_maxsize, when we instruct RushTI to execute up to 60 threads we are actually generating 3,600 connections, which is quite a lot for the web server to monitor. High-latency networks struggle with this amount of connections due to TCP handshake overhead, NAT table exhaustion, and cloud edge rate limiting.
This would also explain why running RushTI on low-latency networks (i.e. on-prem or local to PAoC) does not prompt RemoteDisconnect with the same frequency as when it is called from a remote location.
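For reference, this is roughly how the two parameters behave when an adapter is mounted on a requests session; the numbers below are illustrative, not TM1py's actual settings:

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# pool_connections = how many distinct host pools the adapter keeps cached.
# pool_maxsize     = how many connections each of those pools may hold.
# Setting both to the parallelism level (e.g. 60) inflates the theoretical
# connection budget far beyond what a single TM1 target actually needs.
adapter = HTTPAdapter(pool_connections=60, pool_maxsize=60)
session.mount("https://", adapter)
```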
Proposed solution/mitigation
Use this HTTPAdapter instead:
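A minimal sketch of what such an adapter could look like, assuming a thin HTTPAdapter subclass; the class name TM1HTTPAdapter and the way the defaults are exposed are illustrative only:

```python
from requests.adapters import HTTPAdapter


class TM1HTTPAdapter(HTTPAdapter):
    """Adapter with independent, overridable pool defaults (sketch, not actual TM1py code)."""

    def __init__(self, pool_connections: int = 1, pool_maxsize: int = 10, **kwargs):
        # One host pool is enough for a single TM1 target;
        # pool_maxsize caps how many connections that pool keeps for concurrent threads.
        super().__init__(pool_connections=pool_connections,
                         pool_maxsize=pool_maxsize,
                         **kwargs)
```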
This would have independent pool_connections (default=1) and pool_maxsize (default=10) values, reducing the normal/default case from 100 to 10 connections per TM1Service instance. Also, the default socket settings of the HTTPAdapter are quite conservative and I don't think they are optimized for the TM1py use cases, so I propose to add these options to the adapter:
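Extending the adapter sketch above, the options could be handed to urllib3's pool manager like this; TCP_NODELAY is the option explicitly discussed below, while the keepalive flag is my assumption of what "less conservative" could mean:

```python
import socket
from requests.adapters import HTTPAdapter

TM1_SOCKET_OPTIONS = [
    (socket.IPPROTO_TCP, socket.TCP_NODELAY, 1),  # send small requests immediately, no Nagle buffering
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),  # keep idle connections alive between polls (assumption)
]


class TM1HTTPAdapter(HTTPAdapter):
    def init_poolmanager(self, *args, **kwargs):
        # urllib3's PoolManager accepts socket_options and applies them to new connections.
        kwargs["socket_options"] = TM1_SOCKET_OPTIONS
        super().init_poolmanager(*args, **kwargs)
```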
From these items, it seems TCP_NODELAY would bring the biggest performance improvement, as it reduces latency by sending queries immediately without any buffering.
Proposed extra measures
The polling case
The above should mitigate the problem by reducing the number of RemoteDisconnects we get, but it does not guarantee they are gone.
When checking the HTTP request logs at the moment a RemoteDisconnect happens, in the vast majority of cases it occurs on a GET poll request. As many of you know, when RushTI runs it sends multiple async requests to the TM1 server to start different processes. The TM1 server acknowledges each request by responding with an async_id, which RushTI then uses to poll the process status.
Imagine we make an async call to a process that takes 200 seconds to run and we poll its status, say, once per second. This would mean roughly one initial request followed by ~199 poll requests.
If we get a RemoteDisconnect there's a 99.5% chance it happens during a poll request. Since poll requests are idempotent, we should have no problem retrying it rather than failing the entire task.
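As a sketch of that retry idea (the function name, URL shape, and backoff below are assumptions, not existing TM1py code):

```python
import time
from http.client import RemoteDisconnected

import requests


def poll_async_status(session: requests.Session, base_url: str, async_id: str,
                      max_retries: int = 3) -> requests.Response:
    """Poll an async operation and retry the idempotent GET if the connection drops."""
    url = f"{base_url}/api/v1/_async('{async_id}')"  # illustrative URL shape
    for attempt in range(max_retries + 1):
        try:
            return session.get(url)
        except (requests.exceptions.ConnectionError, RemoteDisconnected):
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # simple backoff before retrying the poll
```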
The other ones
Even though GET, HEAD and OPTIONS methods could be considered idempotent, we can't ensure that. At the same time, PUT, POST and DELETE could be idempotent as well depending on context (i.e. POST /api/v1/Processes('%7Dbedrock.server.wait')/tm1.ExecuteWithReturn could be run multiple times and it will always produce the same result). Since this nuance is only known by the developer that invokes these requests, I would suggest we add an 'idempotent' optional parameter to the request function:
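A minimal sketch of what that signature could look like (the method name, the self._session attribute, and the retry details are assumptions, not the actual TM1py code):

```python
import time

import requests


def request(self, method: str, url: str, data: str = "",
            idempotent: bool = False, max_retries: int = 3, **kwargs) -> requests.Response:
    """Send a request; only retry on connection loss when the caller marks it as idempotent."""
    for attempt in range(max_retries + 1):
        try:
            return self._session.request(method, url, data=data, **kwargs)
        except requests.exceptions.ConnectionError:
            if not idempotent or attempt == max_retries:
                raise  # non-idempotent requests fail fast
            time.sleep(2 ** attempt)  # back off, then retry the idempotent request
```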
If the request fails we could implement retry logic if idempotent=True, else we fail the request. We could explicitly extend this parameter to multiple functions within TM1py (thinking execute_with_return and execute_process_with_return for sure) so TM1py developers can add a safe_retry behavior. All of this put together in a decision tree would look like this:
All proposed changes would maintain backward compatibility. Existing code will continue to work with improved connection handling.
This analysis is quite preliminary and I'm not a network expert by any means, so I'd love to get your opinion on this matter.
Cheers!
nico