Prerequisites
Feature Description
Hello,
I was very excited about this pull request: #13529
Even if it is CUDA-only, if it can be maintained without major issues, it would be a massive quality-of-life improvement for those who do not have beefy hardware.
Thank you for your time and consideration.
Motivation
This would allow those without great hardware to use longer context lengths and get better performance from MoE models, because the lower memory requirement of the KV cache would leave room to offload more layers onto the GPU.
Possible Implementation
No response