
Conversation


@zack041 zack041 commented Jan 30, 2026

📌 Description

Add graceful OOM handling during autotuning. When torch.cuda.OutOfMemoryError occurs, the autotuner now clears the CUDA cache and falls back to the default tactic (runners[0], -1) instead of crashing. The try-except block wraps the entire profiling loop, covering methods such as _prepare_input_tensors() that could also cause OOM. An OOM raised inside the inner profiling loop is re-raised so that the outer exception handler catches it.
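
For reviewers, a minimal sketch of this control flow (helper names such as _profile_runners and the argument lists are illustrative placeholders, not the exact code in flashinfer/autotuner.py):

# Sketch only, not the verbatim implementation.
try:
    # _prepare_input_tensors() and the profiling loop both run inside this
    # block, so an OOM raised anywhere during profiling is caught here.
    tensors = self._prepare_input_tensors(p, inputs)          # illustrative signature
    runner_id, best_tactic = self._profile_runners(runners, tensors)  # illustrative helper
except torch.cuda.OutOfMemoryError:
    # Release whatever the failed attempt left allocated, then fall back
    # to the default tactic instead of crashing the caller.
    torch.cuda.empty_cache()
    logger.warning("[Autotuner]: OOM during profiling, falling back to the default tactic.")
    return runners[0], -1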

🔍 Related Issues

Fixes #2357

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

No tests added because OOM during autotuning is difficult to reliably reproduce in a test environment.

Summary by CodeRabbit

  • Bug Fixes
    • Improved profiling error handling so individual tactic failures are caught, logged, recorded, and do not abort tuning.
    • Added robust out-of-memory handling that clears GPU resources and falls back to safe/previous configurations instead of crashing.
    • Ensured tuning continues after non‑OOM errors, preserves cache/metrics consistency, and still selects the best measured configuration when available.


@gemini-code-assist
Contributor

Summary of Changes

Hello @zack041, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the autotuner by introducing comprehensive Out-Of-Memory (OOM) error handling. It prevents the autotuner from crashing when GPU memory is exhausted during profiling, ensuring a more stable and resilient tuning process. Instead of failing, the system now clears the CUDA cache and reverts to a safe default configuration, allowing the application to continue operating without interruption.

Highlights

  • Graceful OOM Handling: Implemented try-except blocks to catch torch.cuda.OutOfMemoryError during the autotuning process, preventing crashes due to GPU memory exhaustion.
  • CUDA Cache Clearing: Upon detecting an OOM error, torch.cuda.empty_cache() is called to free up GPU memory, improving recovery chances.
  • Default Tactic Fallback: The autotuner now gracefully falls back to a default tactic (runners[0], -1) instead of crashing when an OOM occurs, ensuring continued operation.
  • Comprehensive Error Coverage: The OOM handling wraps the entire profiling loop, including methods like _prepare_input_tensors(), ensuring broad coverage for potential memory issues.
  • Inner Loop OOM Propagation: OOM errors originating from the inner _profile_single_kernel profiling loop are specifically re-raised to be caught by the outer, more comprehensive exception handler.




coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

Added robust exception handling in choose_one profiling: per-tactic try/except to continue on errors, special handling for torch.cuda.OutOfMemoryError that clears CUDA cache and falls back to a safe default tactic, and consistent cache/stat updates for failed profiling attempts.
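
A condensed sketch of the per-tactic handling described above; _profile_single_kernel is the helper named in this PR, but its signature, the tactic enumeration, and the failed-profiling counter shown here are approximations rather than the exact flashinfer/autotuner.py code:

# Approximate shape of the per-tactic loop; not the verbatim implementation.
min_time, runner_id, best_tactic = float("inf"), None, -1
for r_id, runner in enumerate(runners):
    for tactic in tactics_for(runner):  # tactic enumeration elided; illustrative
        try:
            time_measured = self._profile_single_kernel(runner, tactic, tensors)
        except torch.cuda.OutOfMemoryError:
            # Re-raise so the outer handler can empty the CUDA cache and
            # return the fallback (runners[0], -1).
            raise
        except Exception as e:
            # Any other failure: warn, record an infinite time so this tactic
            # can never be selected, bump a failure counter (hypothetical
            # attribute name), and keep evaluating the remaining tactics.
            logger.warning(f"[Autotuner]: profiling failed for tactic {tactic}: {e}")
            time_measured = float("inf")
            self.stats.failed_profiling_count[custom_op] = (
                self.stats.failed_profiling_count.get(custom_op, 0) + 1
            )
        if time_measured < min_time:
            min_time, runner_id, best_tactic = time_measured, r_id, tactic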

Changes

Cohort / File(s): Profiling Exception Handling (flashinfer/autotuner.py)
Summary: Wrapped the profiling loop in a per-tactic try/except. Catches torch.cuda.OutOfMemoryError, clears the CUDA cache, and returns a fallback runner/tactic (-1). On other exceptions, logs a warning, records the failure (sets measured time to ∞), updates failed-profiling counters/cache entries, and continues evaluating the remaining tactics. Ensures cache keys and the chosen (runner, tactic) selection logic remain consistent.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Autotuner as Autotuner.choose_one
    participant Runner as Runner.profile
    participant CUDA
    participant Cache

    Caller->>Autotuner: request best (runner,tactic)
    Autotuner->>Cache: lookup cached (runner,tactic)
    alt cache miss / needs profiling
        Autotuner->>Runner: profile(tactic_i)
        Runner->>CUDA: allocate / run kernel
        alt torch.cuda.OutOfMemoryError
            CUDA-->>Runner: OOM error
            Runner-->>Autotuner: raise OOM
            Autotuner->>CUDA: torch.cuda.empty_cache()
            Autotuner-->>Caller: return fallback (runner, tactic=-1)
        else other Exception
            Runner-->>Autotuner: exception
            Autotuner->>Cache: record failed profiling (time=∞)
            Autotuner->>Runner: continue with next tactic
        else success
            Runner-->>Autotuner: time_measured
            Autotuner->>Cache: update best (runner,tactic)
            Autotuner-->>Caller: return chosen (runner,tactic)
        end
    else cached
        Cache-->>Autotuner: cached (runner,tactic)
        Autotuner-->>Caller: return cached (runner,tactic)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 When kernels heap and memory's thin and tactics trip and stray,
I nibble bugs and empty cache, then gently hop away.
If OOM clouds the tuning sky, I pick the safest lane—
a rabbit's hop, a tidy fix, and profiling's calm again. 🥕

🚥 Pre-merge checks: 4 passed, 1 inconclusive

❌ Failed checks (1 inconclusive)
  • Title check: ❓ Inconclusive. The title 'Fix autotuner oom' is vague and relies on an unexplained abbreviation; 'oom' lacks context for readers unfamiliar with the issue. Resolution: expand the title to something more descriptive, such as 'Fix autotuner graceful handling of out-of-memory errors'.

✅ Passed checks (4 passed)
  • Description check: ✅ Passed. The PR description follows the template structure with complete Description, Related Issues, and Checklist sections; pre-commit checks are marked complete and testing notes are provided.
  • Linked Issues check: ✅ Passed. The PR implements the objectives of issue #2357: graceful OOM handling with CUDA cache clearing and fallback to the default tactic (runners[0], -1) instead of crashing.
  • Out of Scope Changes check: ✅ Passed. All changes in autotuner.py focus on OOM exception handling during profiling, directly addressing the requirements of issue #2357; no extraneous modifications detected.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces graceful Out-Of-Memory (OOM) handling during the autotuning process, which makes the system more robust. The implementation correctly wraps the profiling loop in a try-except block for torch.cuda.OutOfMemoryError, clears the CUDA cache, and falls back to a default tactic. However, the cache-update and statistics-increment logic is currently positioned so that it executes even on cache hits, leading to inaccurate statistics and redundant cache writes. It should be adjusted to run only when a new optimal configuration has been successfully profiled.

Comment on lines 528 to 540
if runner_id is not None:
    # At least one valid (runner, tactic) pair is found
    cache_key = AutoTuner._get_cache_key(
        custom_op, runners[runner_id], p.get_opt_shapes(), tuning_config
    )
    # inspect call stack
    self.profiling_cache[cache_key] = (runner_id, tactic, p)
    self.stats.tuned_op_successful_configs[custom_op] = (
        self.stats.tuned_op_successful_configs.get(custom_op, 0) + 1
    )
    logger.debug(
        f"[Autotuner]: profiling chosen runner: {runners[runner_id]} {tactic} for {cache_key}"
    )

Severity: high

This block of code, which updates the profiling cache and statistics, is currently placed outside the if not is_cache_hit: condition. This means it will execute even when a configuration is retrieved from the cache, leading to incorrect tuned_op_successful_configs counts and redundant cache updates. It should only execute when a new best runner/tactic is found after profiling. Please move this block inside the if not is_cache_hit: block, at the same indentation level as min_time = float("inf") (line 475).
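
For clarity, a sketch of the suggested placement; the surrounding structure (the is_cache_hit branch and the profiling loop) is paraphrased from this review thread, and only the inner cache-update block is quoted from the diff above:

# Paraphrased structure; only the inner block mirrors the quoted diff.
if not is_cache_hit:
    min_time = float("inf")
    # ... per-tactic profiling loop runs here and may update runner_id,
    #     tactic, and min_time ...
    if runner_id is not None:
        # Only update the cache and success stats when a new best
        # (runner, tactic) was actually found by profiling.
        cache_key = AutoTuner._get_cache_key(
            custom_op, runners[runner_id], p.get_opt_shapes(), tuning_config
        )
        self.profiling_cache[cache_key] = (runner_id, tactic, p)
        self.stats.tuned_op_successful_configs[custom_op] = (
            self.stats.tuned_op_successful_configs.get(custom_op, 0) + 1
        )
        logger.debug(
            f"[Autotuner]: profiling chosen runner: {runners[runner_id]} {tactic} for {cache_key}"
        )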
