fix(polyphemus): enable BetterTransformer optimization and add GPU diagnostics #1276
Draft
harry-rhesis wants to merge 3 commits into main from
Conversation
… CUDA base image

- Fix BetterTransformer implementation to use the correct optimum API (`BetterTransformer.transform`)
- Switch Dockerfile from python:slim to nvidia/cuda:12.1.1-cudnn8-runtime for better GPU stability
- Add Python 3.10 installation in the CUDA base image
- Add a health check endpoint to the Dockerfile
- Add a GPU computation test and enhanced GPU debug logging

This enables the 1.5-2x inference speedup that was previously not being applied due to incorrect API usage.
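The API fix described above can be sketched as follows. This is a hedged illustration, not the PR's exact code: the wrapper name `apply_bettertransformer` is an assumption, and the fallback handling is added here so the sketch degrades gracefully (BetterTransformer was removed from later optimum releases, as discussed below in the conversation).

```python
def apply_bettertransformer(model):
    """Try to optimize `model` for inference with optimum's BetterTransformer.

    Returns the model unchanged if optimum is missing or the model's
    architecture is not supported. Wrapper name is illustrative only.
    """
    try:
        # The correct optimum entry point named in the commit message.
        from optimum.bettertransformer import BetterTransformer
    except ImportError:
        return model  # optimum not installed; run unoptimized
    try:
        return BetterTransformer.transform(model)
    except Exception:
        return model  # architecture not supported; keep the original model
```

The point of the fix is that calling `BetterTransformer.transform(model)` (the classmethod on the `BetterTransformer` class) is what actually rewrites the model's attention layers; without that call, the optimization silently never applies.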
Incorporate changes from main, including:

- Docker image migrations to mirror.gcr.io for the GCP container registry
- Infrastructure updates for k8s and local deployment
- LazyModelLoader `generate_batch` implementation
Ensure optimum package (v2.1.0) and its dependencies are properly locked for BetterTransformer optimization support.
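The Dockerfile changes listed in the commit message (CUDA base image, Python 3.10 install, health check) can be sketched roughly as below. This is an assumption-laden illustration, not the PR's actual diff: the app layout, port 8080, the `/health` path, and the uvicorn entry point are all hypothetical.

```dockerfile
# CUDA runtime base instead of python:slim, per the commit message.
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# python:slim ships Python; the CUDA image does not, so install 3.10.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt

# Health check endpoint: path and port are assumptions.
HEALTHCHECK --interval=30s --timeout=5s \
    CMD curl -fsS http://localhost:8080/health || exit 1

CMD ["python3.10", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```

One design note: a CUDA runtime base image makes the GPU driver libraries available inside the container, which avoids the CPU-fallback failure mode that python:slim can hit when the torch CUDA wheels cannot find their runtime dependencies.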
Contributor
Author
@asadaaron we can close this PR then, right, if BetterTransformer is not available anymore?
Purpose
Optimize Polyphemus inference performance and add GPU utilization diagnostics to verify that the GPU is being used correctly during model generation.
What Changed
Performance Optimizations
GPU Diagnostics
Infrastructure Updates (from merge with main)
- Docker image migrations to mirror.gcr.io for the GCP Container Registry
- `generate_batch()` implementation in LazyModelLoader

Additional Context
This PR addresses concerns about low GPU utilization (1%) observed in Cloud Run. The new diagnostic logging will help determine whether the model is actually running on the GPU during generation.
The BetterTransformer optimization (when available) provides significant speedup without code changes by optimizing transformer models for inference.
Testing
After deployment, verify GPU usage by checking logs for:
- `✅ GPU Computation Test: PASSED` during model loading
- `🔍 GPU Debug - Input tensors on device: cuda:0` before generation
- `🔍 GPU Debug - Pre/Post-generation GPU memory` during inference

Expected behavior:
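The diagnostics behind these log lines can be sketched as a small helper. This is a hedged approximation of what the PR describes, not the actual Polyphemus code: the function name and message wording are assumptions, and it degrades gracefully when torch or CUDA is unavailable.

```python
def gpu_debug_report():
    """Return diagnostic log lines about GPU availability and computation.

    Illustrative sketch only; function and message names are assumptions.
    """
    lines = []
    try:
        import torch
    except ImportError:
        return ["GPU Debug - torch not installed"]
    if not torch.cuda.is_available():
        return ["GPU Debug - CUDA not available, running on CPU"]
    # A small matmul proves the GPU can actually compute, not merely
    # that CUDA is detected (the distinction the PR's test is after).
    x = torch.randn(64, 64, device="cuda")
    y = (x @ x).sum().item()  # forces the kernel to run and syncs
    lines.append("GPU Computation Test: PASSED" if y == y else "GPU Computation Test: FAILED (NaN)")
    lines.append(
        f"GPU Debug - memory allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB"
    )
    return lines
```

Logging memory before and after generation (as the `Pre/Post-generation GPU memory` lines above suggest) is what distinguishes a model that merely loaded onto the GPU from one that is actually allocating activations there during inference.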