
Make safe_interval more dynamic for quick transport tasks #6544

@GeigerJ2

Description


As realized together with @giovannipizzi while debugging things for our new cluster at PSI: when submitting a simple test calculation (execution takes about 10 s) with the default safe_interval=30 in the Computer configuration, one has to wait an additional 90 s until the job is done (30 s each for the upload, submit, and retrieve tasks). This is to be expected, of course, and one could just reduce the safe_interval (albeit at an increased risk of overloading SSH).

However, the upload task in that case is truly the first transport task being executed by the daemon worker, so it could, in principle, open its transport immediately (the same holds if jobs were run previously, but longer ago than the safe_interval). I have locally implemented a first version of this (thanks to @giovannipizzi's input) by adding a last_close_time attribute (for a first PoC, currently stored in the authinfo metadata). In the request_transport method of the TransportQueue, the difference between the current time and last_close_time is checked; if it exceeds safe_interval, the transport is opened immediately via:

```python
open_callback_handle = self._loop.call_later(0, do_open, context=contextvars.Context())  # or use 1 for safety?
```

thereby bypassing the safe_interval (or safe_open_interval, as it is called in transports.py).
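
For illustration, here is a minimal, standalone sketch of that check. The helper name compute_open_delay and the use of time.monotonic() are my own choices for this sketch; in the PoC, last_close_time is simply read from the authinfo metadata:

```python
import time
from typing import Optional


def compute_open_delay(safe_interval: float, last_close_time: Optional[float]) -> float:
    """Return the delay in seconds before the next transport may be opened.

    If no transport has been closed yet, or the last close happened more
    than ``safe_interval`` seconds ago, the transport can open immediately
    (delay 0); otherwise only the remaining part of the interval is waited.
    """
    if last_close_time is None:
        return 0.0
    elapsed = time.monotonic() - last_close_time
    return max(0.0, safe_interval - elapsed)


# Inside request_transport, the fixed interval could then be replaced by:
#   delay = compute_open_delay(safe_open_interval, last_close_time)
#   open_callback_handle = self._loop.call_later(delay, do_open, context=contextvars.Context())
```

With this, the first task after a quiet period opens immediately, while back-to-back tasks still wait out the remainder of the interval.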

In addition, the waiting times for the submit and retrieve tasks could also be reduced. Currently, the safe_interval seems to be imposed on all of them, even if they finish very quickly (I assume because each one opens its own transport connection via SSH). So we were wondering whether this could be made a bit more sophisticated, e.g. by adding special transport requests that can reuse an already open transport, and by keeping a transport whose task has finished open for a short while longer (also quickly discussed with @mbercx); a toy sketch of this idea follows below.

Of course, one would still need to make sure SSH doesn't get overloaded, that the implementation holds up under heavy loads (not just individual test calculations), and one would also have to consider how this interacts with multiple daemon workers. Again with @giovannipizzi, I had a quick look, but the implementation seems to be a bit more involved. So I'm wondering what the others think: is this feasible and worth investing more time into? Pinging @khsrali, who has looked a bit more into transports.
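
To make the second idea concrete, below is a toy sketch of a grace-period close. This is not AiiDA's actual TransportQueue; the class name, the grace parameter, and the open/close callables are all made up for illustration, and the safe-interval logic on opening is omitted for brevity. A delayed close is scheduled when the last user releases the transport, and cancelled again if a new request arrives in time:

```python
import asyncio
import contextlib
from typing import Optional


class GracefulTransportQueue:
    """Toy sketch (not AiiDA's actual TransportQueue): keep a transport
    open for ``grace`` seconds after its last user releases it, so a quick
    follow-up task (e.g. a submit right after an upload) can reuse the
    live connection instead of waiting out the full safe_interval again."""

    def __init__(self, open_transport, close_transport, grace: float = 5.0):
        self._open = open_transport    # async callable that opens the transport
        self._close = close_transport  # async callable that closes it
        self._grace = grace
        self._transport = None
        self._users = 0
        self._close_task: Optional[asyncio.Task] = None

    async def acquire(self):
        # A new request cancels any pending delayed close and reuses the transport.
        if self._close_task is not None:
            self._close_task.cancel()
            self._close_task = None
        if self._transport is None:
            self._transport = await self._open()
        self._users += 1
        return self._transport

    def release(self):
        self._users -= 1
        if self._users == 0:
            # Do not close right away: give follow-up tasks a grace window.
            self._close_task = asyncio.create_task(self._delayed_close())

    async def _delayed_close(self):
        with contextlib.suppress(asyncio.CancelledError):
            await asyncio.sleep(self._grace)
            await self._close(self._transport)
            self._transport = None
```

A second acquire() arriving within grace seconds then reuses the still-open connection; only once the queue has been idle for the full grace window is the SSH connection actually torn down.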
