
fix(polyphemus): enable BetterTransformer optimization and add GPU diagnostics #1276

Draft
harry-rhesis wants to merge 3 commits into main from fix/better-transformer

Conversation

@harry-rhesis
Contributor

Purpose

Optimize Polyphemus inference performance and add GPU utilization diagnostics to verify that the GPU is being used correctly during model generation.

What Changed

Performance Optimizations

  • Enabled BetterTransformer optimization in the model loader for a 1.5-2x inference speedup (see the sketch after this list)
  • Upgraded Polyphemus Dockerfile to use CUDA base image for better GPU support
  • Added GPU computation test during model loading to verify GPU availability
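
A minimal sketch of how the loader-side change might look, assuming a standard transformers causal LM and the optimum package; the load_model function name, logger setup, and test tensor sizes are illustrative rather than the actual Polyphemus code:

```python
import logging

import torch
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

logger = logging.getLogger(__name__)


def load_model(model_name: str):
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

    # Apply BetterTransformer via the optimum API (BetterTransformer.transform),
    # falling back to the plain model if the architecture is not supported.
    try:
        model = BetterTransformer.transform(model)
        logger.info("BetterTransformer optimization enabled")
    except Exception as exc:
        logger.warning("BetterTransformer not applied: %s", exc)

    # GPU computation test: run a tiny matmul on the device so we know the GPU
    # can actually compute, not just that torch can see it.
    if device.startswith("cuda"):
        x = torch.rand(64, 64, device=device)
        y = x @ x
        torch.cuda.synchronize()
        logger.info("✅ GPU Computation Test: PASSED (result on %s)", y.device)

    return model, device
```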

GPU Diagnostics

  • Added detailed GPU debug logging in HuggingFace model generation (see the sketch after this list):
    • Logs input tensor device placement before generation
    • Tracks GPU memory usage before/after generation
    • Reports generation time and memory deltas
  • Added GPU computation verification test that runs during model load
  • Enhanced logging to help diagnose low GPU utilization issues
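
Roughly, the generation-time diagnostics described above could look like the sketch below; the generate_text wrapper and the exact message formats are assumptions, but snapshotting torch.cuda.memory_allocated before and after model.generate is the mechanism for reporting the memory delta and generation time:

```python
import logging
import time

import torch

logger = logging.getLogger(__name__)


def generate_text(model, tokenizer, prompt: str, device: str = "cuda:0", **gen_kwargs):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    logger.info("🔍 GPU Debug - Input tensors on device: %s", inputs["input_ids"].device)

    # Snapshot GPU memory before generation so the delta can be reported afterwards.
    cuda = torch.cuda.is_available()
    mem_before = torch.cuda.memory_allocated(device) if cuda else 0
    start = time.perf_counter()

    with torch.no_grad():
        output_ids = model.generate(**inputs, **gen_kwargs)

    elapsed = time.perf_counter() - start
    mem_after = torch.cuda.memory_allocated(device) if cuda else 0
    logger.info(
        "🔍 GPU Debug - Pre/Post-generation GPU memory: %.1f MiB -> %.1f MiB "
        "(delta %.1f MiB), generation time %.2fs",
        mem_before / 2**20,
        mem_after / 2**20,
        (mem_after - mem_before) / 2**20,
        elapsed,
    )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```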

Infrastructure Updates (from merge with main)

  • Migrated all Docker images to mirror.gcr.io for GCP Container Registry
  • Updated Kubernetes and Docker Compose configurations
  • Added a generate_batch() implementation in LazyModelLoader (a hypothetical sketch follows below)
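
For context, a hypothetical sketch of what a batched generation entry point on a lazy loader might look like; the class shape, method signature, and padding handling here are assumptions for illustration, not the actual LazyModelLoader code merged from main:

```python
from typing import List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class LazyModelLoader:
    """Hypothetical sketch: load the model on first use, then serve batched generation."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self._model = None
        self._tokenizer = None

    def _ensure_loaded(self) -> None:
        # Defer the expensive model load until the first generation request.
        if self._model is None:
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
            self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            if self._tokenizer.pad_token is None:
                self._tokenizer.pad_token = self._tokenizer.eos_token
            self._model = AutoModelForCausalLM.from_pretrained(self.model_name).to(device)

    def generate_batch(self, prompts: List[str], **gen_kwargs) -> List[str]:
        # Tokenize all prompts together and run a single batched generate call.
        self._ensure_loaded()
        inputs = self._tokenizer(prompts, return_tensors="pt", padding=True).to(self._model.device)
        output_ids = self._model.generate(**inputs, **gen_kwargs)
        return self._tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```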

Additional Context

This PR addresses concerns about the low GPU utilization (1%) observed in Cloud Run. The new diagnostic logging will help determine:

  • Whether the model and inputs are correctly placed on the GPU
  • Whether the GPU is actually used during inference
  • How memory allocation compares to actual GPU compute

The BetterTransformer optimization (when available) provides a significant speedup without changes to calling code by swapping transformer modules for inference-optimized implementations.

Testing

After deployment, verify GPU usage by checking logs for:

  • ✅ GPU Computation Test: PASSED during model loading
  • 🔍 GPU Debug - Input tensors on device: cuda:0 before generation
  • 🔍 GPU Debug - Pre/Post-generation GPU memory during inference
  • Generation time improvements from BetterTransformer

Expected behavior:

  • Model loads on GPU (cuda:0)
  • Inputs are moved to GPU before generation
  • GPU memory increases during generation
  • Faster inference times with BetterTransformer

… CUDA base image

- Fix BetterTransformer implementation to use correct optimum API (BetterTransformer.transform)
- Switch Dockerfile from python:slim to nvidia/cuda:12.1.1-cudnn8-runtime for better GPU stability
- Add Python 3.10 installation in CUDA base image
- Add health check endpoint to Dockerfile
- Add GPU computation test and enhanced GPU debug logging

This enables the 1.5-2x inference speedup that was previously not being applied due to incorrect API usage.

Incorporate changes from main, including:
- Docker image migrations to mirror.gcr.io for the GCP container registry
- Infrastructure updates for k8s and local deployment
- LazyModelLoader generate_batch implementation

Ensure the optimum package (v2.1.0) and its dependencies are properly locked for BetterTransformer optimization support.

@harry-rhesis
Contributor Author

@asadaaron we can close this PR then, right? If BetterTransformer is not available anymore.
