Speaches seems to keep 2 GB+ of VRAM allocated when I use the CUDA version unless it is told to clear the model from VRAM. I went from docker tag 0.8.3-cuda-12.6.3 to 0.9.0-rc.3-cuda-12.6.3, and it seems the new version ignores WHISPER__TTL=0, which should unload all models from VRAM after processing is complete. This worked wonders, especially since the models are small enough that reloading them onto my GPU is near-instant.
Is there a new environment variable that does something similar? Otherwise I will go back to 0.8.3-cuda-12.6.3.
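For reference, this is roughly how I'm passing the variable (a minimal sketch; the container name and port mapping are just examples, the image tag and env var are the ones mentioned above):

```shell
# Run the CUDA image with WHISPER__TTL=0, which on 0.8.3 unloaded
# models from VRAM immediately after each request finished.
docker run --gpus all \
  -e WHISPER__TTL=0 \
  -p 8000:8000 \
  --name speaches \
  ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cuda-12.6.3
```

With the same command on 0.8.3-cuda-12.6.3, VRAM drops back down after each request; on 0.9.0-rc.3 it stays allocated.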