Free CUDA VRAM after search and indexing operations #178
Conversation
@copilot I'm watching. On the plus side, there is no VRAM memory leakage that I can see.
Instead of moving the model to CPU (which left 230 MiB of VRAM allocated), completely delete the model and tensor references. This ensures VRAM returns to zero after search and indexing operations.

Changes:
- Remove the model parameter from _cleanup_cuda_memory()
- Add torch.cuda.synchronize() before cache clearing
- Explicitly delete the model, preprocess, and all GPU tensors
- Delete tensors in the finally block with error handling

Trade-off: the model must be reloaded on each search (~1-2 s overhead), but VRAM is completely freed for other GPU processes.

Co-authored-by: lstein <[email protected]>
I've updated the implementation to completely delete the model and all tensor references instead of just moving them to CPU. This ensures VRAM returns to zero (or a minimal baseline) after search and indexing operations. The previous approach (moving the model to CPU) kept the model in system RAM and left ~230 MiB of VRAM allocated; the new approach deletes the model and all GPU tensors outright.

Trade-off: the model now needs to be reloaded on each search operation (~1-2 s overhead), but VRAM is completely freed for other GPU processes, as requested. Commit: 6c9781b
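For context, here is a minimal, self-contained demo (not project code) of PyTorch's two-level memory accounting that explains why deleting references beats moving the model to CPU: `del` plus `gc.collect()` returns a tensor's memory to PyTorch's caching allocator, while `torch.cuda.empty_cache()` returns the cached blocks to the driver, which is what other GPU processes (and nvidia-smi) actually see.

```python
import gc
import torch

if torch.cuda.is_available():
    x = torch.randn(8192, 8192, device="cuda")            # ~256 MiB tensor
    print(torch.cuda.memory_allocated() // 2**20, "MiB")  # ~256: x is alive

    del x          # drop the last reference; memory returns to PyTorch's cache
    gc.collect()   # collect any reference cycles that might still hold tensors
    print(torch.cuda.memory_reserved() // 2**20, "MiB")   # still ~256: cached

    torch.cuda.empty_cache()  # hand the cached blocks back to the CUDA driver
    print(torch.cuda.memory_reserved() // 2**20, "MiB")   # ~0: nvidia-smi drops
```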
Ensure all GPU tensors are deleted in the finally block, including:
- image_tensor (created from preprocessing)
- tokens (created from tokenization)
- combined_embedding (intermediate computation tensor)

This ensures complete VRAM cleanup.

Co-authored-by: lstein <[email protected]>
@copilot Good progress. However, the VRAM now drops to 188 MiB and stays steady there.
Add more aggressive CUDA cleanup:
- Call empty_cache() before and after garbage collection
- Run gc.collect() twice for more thorough cleanup
- Add documentation about the unavoidable CUDA context baseline (~188 MiB)

The 188 MiB baseline is the CUDA context initialized by PyTorch on first GPU use. It cannot be freed without ending the Python process; this is a known PyTorch/CUDA limitation.

Co-authored-by: lstein <[email protected]>
The 188 MiB baseline is the CUDA context that PyTorch/CUDA initializes on first GPU use. This is a known limitation: the CUDA context cannot be freed without ending the entire Python process. I've improved the cleanup to be more aggressive (multiple cache clears, double garbage collection), but the ~188 MiB CUDA context will remain as long as the process is alive. This is standard behavior for any PyTorch application that uses CUDA. The good news is that beyond this baseline, the model and all tensors are fully released after each operation.

Commit: dc1aa1f
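One way to confirm that only the context remains (a sketch, not project code): after cleanup, PyTorch's allocator-level counters should read zero even while nvidia-smi still reports the per-process context until the process exits.

```python
import torch

# After cleanup, PyTorch's own counters read zero even though nvidia-smi
# still shows ~188 MiB for the process: that residue is the CUDA context,
# which lives at the driver level, below the caching allocator.
print(torch.cuda.memory_allocated())  # 0 - no live tensors remain
print(torch.cuda.memory_reserved())   # 0 - no blocks cached by PyTorch
```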
Streamline cleanup to: synchronize -> empty_cache -> gc.collect -> empty_cache

This is more efficient while still ensuring thorough cleanup. Removed the redundant second gc.collect() call; a single pass is sufficient after tensor deletions.

Co-authored-by: lstein <[email protected]>
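Under these assumptions, the streamlined helper might look like the following sketch (the actual `_cleanup_cuda_memory()` in the PR may differ in detail):

```python
import gc
import torch

def _cleanup_cuda_memory() -> None:
    """Sketch of the streamlined cleanup sequence from the commit above."""
    if not torch.cuda.is_available():
        return
    torch.cuda.synchronize()  # wait for in-flight kernels using the tensors
    torch.cuda.empty_cache()  # release blocks already freed by earlier del's
    gc.collect()              # break reference cycles still holding tensors
    torch.cuda.empty_cache()  # release blocks freed by the collection pass
```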
lstein left a comment:
Works well.
Problem
After image/text search operations, the CLIP model and GPU tensors remained loaded on CUDA, consuming ~0.5 GB of VRAM indefinitely.
Changes
- Added `_cleanup_cuda_memory()` helper: completely frees GPU VRAM by synchronizing CUDA operations and clearing the cache. The model and all tensors are deleted entirely rather than kept in memory.
- Search cleanup: wrapped the `search_images_by_text_and_image` logic in a try-finally block to ensure cleanup on all exit paths, including exceptions. All GPU tensors (model, preprocess, image_tensor, tokens, embeddings_tensor, norm_embeddings, combined_embedding, similarities, etc.) are explicitly deleted.
- Batch processing cleanup: added cleanup calls after `_process_images_batch` and `_process_images_batch_async` complete, with explicit deletion of the model and preprocess objects.

Implementation
A sketch of the resulting try-finally pattern is shown below.
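This is an illustrative sketch only, assuming the OpenAI `clip` package as the model loader (the project may load CLIP differently); the search logic and any names beyond those listed under Changes are hypothetical, and `embeddings_tensor` is assumed to already live on the GPU with a matching dtype.

```python
import gc
import torch
import clip  # assumption: OpenAI CLIP; swap in the project's actual loader

def search_images_by_text_and_image(query: str, embeddings_tensor: torch.Tensor):
    model = preprocess = tokens = combined_embedding = similarities = None
    try:
        model, preprocess = clip.load("ViT-B/32", device="cuda")  # reload per search
        tokens = clip.tokenize([query]).to("cuda")
        with torch.no_grad():
            combined_embedding = model.encode_text(tokens)
            combined_embedding = combined_embedding / combined_embedding.norm(
                dim=-1, keepdim=True
            )
            similarities = (embeddings_tensor @ combined_embedding.T).squeeze(1)
        return similarities.cpu().tolist()  # copy results off the GPU first
    finally:
        # Runs on every exit path, including exceptions: drop all GPU
        # references, then collect and return cached blocks to the driver.
        del model, preprocess, tokens, combined_embedding, similarities
        gc.collect()
        torch.cuda.empty_cache()
```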
Results
This approach completely deletes the model rather than keeping it in CPU/RAM: VRAM returns to the ~188 MiB CUDA-context baseline after every search and indexing operation, at the cost of reloading the model on each search (~1-2 s).
The 188 MiB baseline is the CUDA context that PyTorch initializes on first GPU use. This is a known PyTorch/CUDA limitation and represents standard behavior for all CUDA applications - it persists for the lifetime of the Python process.
The implementation prioritizes freeing GPU memory over search performance, ensuring maximum available VRAM for other GPU-utilizing processes, as required.