
fix(polyphemus): enable BetterTransformer optimization and add GPU diagnostics #1276

Draft
harry-rhesis wants to merge 3 commits into main from fix/better-transformer

Conversation

@harry-rhesis
Contributor

Purpose

Optimize Polyphemus inference performance and add GPU utilization diagnostics to verify that the GPU is being used correctly during model generation.

What Changed

Performance Optimizations

  • Enabled BetterTransformer optimization in the model loader for a 1.5-2x inference speedup (see the sketch after this list)
  • Upgraded Polyphemus Dockerfile to use CUDA base image for better GPU support
  • Added GPU computation test during model loading to verify GPU availability
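
A minimal sketch of how the loader-side change might look, assuming a standard transformers causal LM and the optimum package; the load_model function name, logger setup, and test tensor sizes are illustrative rather than the actual Polyphemus code:

```python
import logging

import torch
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

logger = logging.getLogger(__name__)


def load_model(model_name: str):
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

    # Apply BetterTransformer via the optimum API (BetterTransformer.transform),
    # falling back to the plain model if the architecture is not supported.
    try:
        model = BetterTransformer.transform(model)
        logger.info("BetterTransformer optimization enabled")
    except Exception as exc:
        logger.warning("BetterTransformer not applied: %s", exc)

    # GPU computation test: run a tiny matmul on the device so we know the GPU
    # can actually compute, not just that torch can see it.
    if device.startswith("cuda"):
        x = torch.rand(64, 64, device=device)
        y = x @ x
        torch.cuda.synchronize()
        logger.info("✅ GPU Computation Test: PASSED (result on %s)", y.device)

    return model, device
```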

GPU Diagnostics

  • Added detailed GPU debug logging in HuggingFace model generation (see the sketch after this list):
    • Logs input tensor device placement before generation
    • Tracks GPU memory usage before/after generation
    • Reports generation time and memory deltas
  • Added GPU computation verification test that runs during model load
  • Enhanced logging to help diagnose low GPU utilization issues
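
Roughly, the generation-time diagnostics described above could look like the sketch below; the generate_text wrapper and the exact message formats are assumptions, but snapshotting torch.cuda.memory_allocated before and after model.generate is the mechanism for reporting the memory delta and generation time:

```python
import logging
import time

import torch

logger = logging.getLogger(__name__)


def generate_text(model, tokenizer, prompt: str, device: str = "cuda:0", **gen_kwargs):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    logger.info("🔍 GPU Debug - Input tensors on device: %s", inputs["input_ids"].device)

    # Snapshot GPU memory before generation so the delta can be reported afterwards.
    cuda = torch.cuda.is_available()
    mem_before = torch.cuda.memory_allocated(device) if cuda else 0
    start = time.perf_counter()

    with torch.no_grad():
        output_ids = model.generate(**inputs, **gen_kwargs)

    elapsed = time.perf_counter() - start
    mem_after = torch.cuda.memory_allocated(device) if cuda else 0
    logger.info(
        "🔍 GPU Debug - Pre/Post-generation GPU memory: %.1f MiB -> %.1f MiB "
        "(delta %.1f MiB), generation time %.2fs",
        mem_before / 2**20,
        mem_after / 2**20,
        (mem_after - mem_before) / 2**20,
        elapsed,
    )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```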

Infrastructure Updates (from merge with main)

  • Migrated all Docker images to mirror.gcr.io for GCP Container Registry
  • Updated Kubernetes and Docker Compose configurations
  • Added a generate_batch() implementation in LazyModelLoader (a hypothetical sketch follows below)
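
For context, a hypothetical sketch of what a batched generation entry point on a lazy loader might look like; the class shape, method signature, and padding handling here are assumptions for illustration, not the actual LazyModelLoader code merged from main:

```python
from typing import List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class LazyModelLoader:
    """Hypothetical sketch: load the model on first use, then serve batched generation."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self._model = None
        self._tokenizer = None

    def _ensure_loaded(self) -> None:
        # Defer the expensive model load until the first generation request.
        if self._model is None:
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
            self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            if self._tokenizer.pad_token is None:
                self._tokenizer.pad_token = self._tokenizer.eos_token
            self._model = AutoModelForCausalLM.from_pretrained(self.model_name).to(device)

    def generate_batch(self, prompts: List[str], **gen_kwargs) -> List[str]:
        # Tokenize all prompts together and run a single batched generate call.
        self._ensure_loaded()
        inputs = self._tokenizer(prompts, return_tensors="pt", padding=True).to(self._model.device)
        output_ids = self._model.generate(**inputs, **gen_kwargs)
        return self._tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```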

Additional Context

This PR addresses concerns about the low GPU utilization (1%) observed in Cloud Run. The new diagnostic logging will help determine:

  • Whether the model and inputs are correctly placed on the GPU
  • Whether the GPU is actually used during inference
  • How memory allocation compares to actual GPU compute

The BetterTransformer optimization (when available) provides a significant speedup without changes to calling code by swapping transformer modules for inference-optimized implementations.

Testing

After deployment, verify GPU usage by checking logs for:

  • ✅ GPU Computation Test: PASSED during model loading
  • 🔍 GPU Debug - Input tensors on device: cuda:0 before generation
  • 🔍 GPU Debug - Pre/Post-generation GPU memory during inference
  • Generation time improvements from BetterTransformer

Expected behavior:

  • Model loads on GPU (cuda:0)
  • Inputs are moved to GPU before generation
  • GPU memory increases during generation
  • Faster inference times with BetterTransformer

… CUDA base image

- Fix BetterTransformer implementation to use correct optimum API (BetterTransformer.transform)
- Switch Dockerfile from python:slim to nvidia/cuda:12.1.1-cudnn8-runtime for better GPU stability
- Add Python 3.10 installation in CUDA base image
- Add health check endpoint to Dockerfile
- Add GPU computation test and enhanced GPU debug logging

This enables the 1.5-2x inference speedup that was previously not being applied due to incorrect API usage.

Incorporate changes from main, including:
- Docker image migrations to mirror.gcr.io for the GCP container registry
- Infrastructure updates for k8s and local deployment
- LazyModelLoader generate_batch implementation

Ensure the optimum package (v2.1.0) and its dependencies are properly locked for BetterTransformer optimization support.

@harry-rhesis
Contributor Author

@asadaaron we can close this PR then, right? If BetterTransformer is not available anymore.
