RemoteDisconnect: Changes to HTTPAdapter and Retry logic #1253
nicolasbisurgi started this conversation in General
Replies: 2 comments
-
I support this suggestion
-
Thanks @MariusWirtz! @macsir - I think you developed the HTTPAdapter a while ago; if you have any input on this matter it would be great to hear it :)
-
Problem statement
When running TM1py (wrapped by RushTI) against PAoC instances, multiple clients experience frequent RemoteDisconnect errors. This happens randomly, with no clear pattern, and it's impossible to reproduce at will. One thing to note is that this never happens when we run RushTI locally (either in an on-prem installation or with an executable version of RushTI inside PAoC).
Possible explanation
When trying to find an explanation for this I've focused on the HTTPAdapter, as it is the component that manages requests against the TM1 server, and it's here where I think the fundamental problem is:
The pool is defined against a specific target (which in our case is a single TM1 instance), so I don't see why we would need pool_connections to be more than 1. pool_maxsize, however, is how many connections that pool can hold. This is key when dealing with high concurrency, as RushTI does: if we set RushTI to run up to 60 processes in parallel, then pool_maxsize should be 60, and the endpoint web server will be monitoring 60 connections.
BUT since we have the same value for both pool_connections and pool_maxsize, when we instruct RushTI to execute up to 60 threads we are actually generating 3,600 connections, which is quite a lot for the web server to monitor. High-latency networks struggle with this amount of connections due to TCP handshake overhead, NAT table exhaustion, and cloud edge rate limiting.
This would also explain why running RushTI on low-latency networks (i.e. on-prem or local to PAoC) does not prompt RemoteDisconnect with the same frequency as when it is called from a remote location.
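For reference, this is roughly how the two parameters behave when an adapter is mounted on a requests session; the numbers below are illustrative, not TM1py's actual settings:

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# pool_connections = how many distinct host pools the adapter keeps cached.
# pool_maxsize     = how many connections each of those pools may hold.
# Setting both to the parallelism level (e.g. 60) inflates the theoretical
# connection budget far beyond what a single TM1 target actually needs.
adapter = HTTPAdapter(pool_connections=60, pool_maxsize=60)
session.mount("https://", adapter)
```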
Proposed solution/mitigation
Use this HTTPAdapter instead:
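A minimal sketch of what such an adapter could look like, assuming a thin HTTPAdapter subclass; the class name TM1HTTPAdapter and the way the defaults are exposed are illustrative only:

```python
from requests.adapters import HTTPAdapter


class TM1HTTPAdapter(HTTPAdapter):
    """Adapter with independent, overridable pool defaults (sketch, not actual TM1py code)."""

    def __init__(self, pool_connections: int = 1, pool_maxsize: int = 10, **kwargs):
        # One host pool is enough for a single TM1 target;
        # pool_maxsize caps how many connections that pool keeps for concurrent threads.
        super().__init__(pool_connections=pool_connections,
                         pool_maxsize=pool_maxsize,
                         **kwargs)
```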
This would have independent pool_connections (default=1) and pool_maxsize (default=10) values, reducing the normal/default case from 100 to 10 connections per TM1Service instance. Also, the default socket settings of the HTTPAdapter are quite conservative and I don't think they are optimized for the TM1py use cases, so I propose to add these options to the adapter:
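Extending the adapter sketch above, the options could be handed to urllib3's pool manager like this; TCP_NODELAY is the option explicitly discussed below, while the keepalive flag is my assumption of what "less conservative" could mean:

```python
import socket
from requests.adapters import HTTPAdapter

TM1_SOCKET_OPTIONS = [
    (socket.IPPROTO_TCP, socket.TCP_NODELAY, 1),  # send small requests immediately, no Nagle buffering
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),  # keep idle connections alive between polls (assumption)
]


class TM1HTTPAdapter(HTTPAdapter):
    def init_poolmanager(self, *args, **kwargs):
        # urllib3's PoolManager accepts socket_options and applies them to new connections.
        kwargs["socket_options"] = TM1_SOCKET_OPTIONS
        super().init_poolmanager(*args, **kwargs)
```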
From these items, it seems TCP_NODELAY would bring the biggest performance improvement, as it reduces latency by sending queries immediately without any buffering.
Proposed extra measures
The polling case
The above should mitigate the problem by reducing the number of RemoteDisconnects we get, but it does not guarantee they are gone.
When checking the HTTP request logs at the moment a RemoteDisconnect happens, in the vast majority of cases it occurs on a GET poll request. As many of you know, when RushTI runs it sends multiple async requests to the TM1 server to start different processes. The TM1 server acknowledges each request by responding with an async_id, which RushTI then uses to poll the process status.
Imagine we make an async call to a process that takes 200 seconds to run and we poll its status, say, once per second. This would mean roughly one initial request followed by ~199 poll requests.
If we get a RemoteDisconnect there's a 99.5% chance it happens during a poll request. Since poll requests are idempotent, we should have no problem retrying it rather than failing the entire task.
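As a sketch of that retry idea (the function name, URL shape, and backoff below are assumptions, not existing TM1py code):

```python
import time
from http.client import RemoteDisconnected

import requests


def poll_async_status(session: requests.Session, base_url: str, async_id: str,
                      max_retries: int = 3) -> requests.Response:
    """Poll an async operation and retry the idempotent GET if the connection drops."""
    url = f"{base_url}/api/v1/_async('{async_id}')"  # illustrative URL shape
    for attempt in range(max_retries + 1):
        try:
            return session.get(url)
        except (requests.exceptions.ConnectionError, RemoteDisconnected):
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # simple backoff before retrying the poll
```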
The other ones
Even though GET, HEAD and OPTIONS methods could be considered idempotent, we can't ensure that. At the same time, PUT, POST and DELETE could be idempotent as well depending on context (i.e. POST /api/v1/Processes('%7Dbedrock.server.wait')/tm1.ExecuteWithReturn could be run multiple times and it will always produce the same result). Since this nuance is only known by the developer that invokes these requests, I would suggest we add an 'idempotent' optional parameter to the request function:
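A minimal sketch of what that signature could look like (the method name, the self._session attribute, and the retry details are assumptions, not the actual TM1py code):

```python
import time

import requests


def request(self, method: str, url: str, data: str = "",
            idempotent: bool = False, max_retries: int = 3, **kwargs) -> requests.Response:
    """Send a request; only retry on connection loss when the caller marks it as idempotent."""
    for attempt in range(max_retries + 1):
        try:
            return self._session.request(method, url, data=data, **kwargs)
        except requests.exceptions.ConnectionError:
            if not idempotent or attempt == max_retries:
                raise  # non-idempotent requests fail fast
            time.sleep(2 ** attempt)  # back off, then retry the idempotent request
```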
If the request fails we could implement retry logic if idempotent=True, else we fail the request. We could explicitly extend this parameter to multiple functions within TM1py (thinking execute_with_return and execute_process_with_return for sure) so TM1py developers can add a safe_retry behavior. All of this put together in a decision tree would look like this:
All proposed changes would maintain backward compatibility. Existing code will continue to work with improved connection handling.
This analysis is quite preliminary and I'm not a network expert by any means, so I'd love to get your opinion on this matter.
Cheers!
nico