- Task cancellation. When an API HTTP request is cancelled, the corresponding task should be cancelled too.
- I'd like to see profiled network latency / bandwidth.
- I'd like to see how much bandwidth each link is using.
- Fix prefill blocking in continuous batching: when a new prompt comes in, it blocks decode of the current batch until its prefill is complete.
- We want people to be able to copy models over to a new device without ever connecting EXO to the internet. Right now EXO requires an internet connection once, to cache some files used to check whether a download is complete. Instead, we should simply check whether there is a non-empty model folder locally with no .partial files. That indicates a fully downloaded model that can be loaded.
- Use memory pressure instead of memory used.
- Show the type of each connection (TB5, Ethernet, etc.) in the UI. Refer to old exo: https://github.com/exo-explore/exo/blob/56f783b38dc6b08ce606b07a5386dc40dae00330/exo/helpers.py#L251
- Prioritise certain connection types (or by latency). TB5 > Ethernet > WiFi. Refer to old exo: https://github.com/exo-explore/exo/blob/56f783b38dc6b08ce606b07a5386dc40dae00330/exo/helpers.py#L251
- Dynamically switch to a higher-priority connection when it becomes available. Probably bring back InstanceReplacedAtomically.
- Faster model loads by streaming the model from other devices in the cluster.
- Add support for specifying the type of network connection to use in a test. Depends on 15/16.
- Rethink retry logic
- Log cleanup - per-module log filters and default to DEBUG log levels
- Validate RDMA connections with ibv_devinfo in the info gatherer
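
A rough sketch of the offline completeness check from the model-copy item above. The function name `is_model_fully_downloaded` is hypothetical, not an existing EXO API, and it assumes "non-empty" means the folder contains at least one regular file at any depth:

```python
from pathlib import Path


def is_model_fully_downloaded(model_dir: Path) -> bool:
    """Return True if model_dir looks like a complete local download.

    Hypothetical helper, not part of EXO. Assumes in-progress downloads
    are marked by files whose names end in ".partial".
    """
    if not model_dir.is_dir():
        return False

    has_files = False
    for path in model_dir.rglob("*"):
        if not path.is_file():
            continue
        if path.name.endswith(".partial"):
            # A leftover partial file means a download did not finish.
            return False
        has_files = True

    # An empty folder (or one holding only subdirectories) is not a model.
    return has_files
```

This needs no network access, so a model folder copied over by hand would pass as long as no `.partial` files came along with it. It does not verify file hashes or sizes, so a truncated copy would still be accepted.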