
Conversation

ChrisRackauckas-Claude
Contributor

Reverts the recent timeout handling to use a simpler maxtime-based blocking mechanism

- Remove async task creation and timedwait mechanism
- Simply run solve() and measure elapsed time
- If algorithm exceeds maxtime, block it for larger matrices
- Store max allowed size per algorithm per eltype
- Skip blocked algorithms for larger sizes with informative message
- Keep NaN handling for timed-out/blocked runs
- Blocked list automatically resets when switching eltypes

This approach is more stable than trying to kill running tasks, which
can cause Julia to hang. Instead, we let slow algorithms complete but
prevent them from running on larger matrices.
@ChrisRackauckas-Claude
Contributor Author

Detailed Explanation

This PR addresses the issues with the current timeout implementation by reverting to a simpler approach that doesn't try to interrupt running tasks.

What Changed

Before (current implementation):

  • Creates async tasks with channels for communication
  • Uses timedwait() to detect when tasks exceed maxtime
  • Attempts to abandon timed-out tasks (but can't actually kill them)
  • Tasks continue running in background, potentially causing issues
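A rough sketch of the kind of async/timedwait pattern being removed (the names `solve_with_timeout` and `solvefn` are illustrative, not the actual benchmark code). The key flaw is visible in the last branch: Julia tasks cannot be killed, so after `timedwait` gives up, the task keeps running in the background.

```julia
# Sketch of the removed pattern: spawn the solve in a task, poll for
# completion with timedwait, and abandon the task on timeout.
function solve_with_timeout(solvefn; maxtime = 1.0)
    done = Channel{Float64}(1)
    @async begin
        t = @elapsed solvefn()
        put!(done, t)
    end
    # timedwait polls the predicate until it returns true or maxtime elapses.
    if timedwait(() -> isready(done), maxtime) === :ok
        return take!(done)
    else
        return NaN   # task is abandoned but still running in the background!
    end
end
```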

After (this PR):

  • Simply runs solve() and measures elapsed time
  • If time > maxtime, blocks algorithm for larger matrices
  • Maintains a dictionary: blocked_algorithms[eltype][algorithm_name] = max_allowed_size
  • Skips blocked algorithms with clear error messages
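The bullets above can be sketched as a single synchronous benchmark step. This is a minimal illustration, not the actual benchmark loop: the names `run_and_record!` and `blocked_algorithms` mirror the description, but the real code also handles warm-up, GFLOPs reporting, etc.

```julia
# Minimal sketch of maxtime-based blocking: run to completion, measure,
# and block the algorithm for this size and larger if it was too slow.
const blocked_algorithms = Dict{DataType,Dict{String,Int}}()

function run_and_record!(alg_name::String, solvefn, n::Int, T::DataType;
                         maxtime = 1.0)
    blocked = get!(blocked_algorithms, T, Dict{String,Int}())
    if haskey(blocked, alg_name) && n >= blocked[alg_name]
        @info "⏭️  Skipping $alg_name for $(n)×$(n) $T matrix (exceeded maxtime at size $(blocked[alg_name]))"
        return NaN
    end
    elapsed = @elapsed solvefn(n)   # let the solve run to completion
    if elapsed > maxtime
        blocked[alg_name] = n       # block this size and larger from now on
        return NaN
    end
    return elapsed
end
```

Because the solve always runs to completion, there is no task to interrupt and nothing left running in the background.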

Key Implementation Details

  1. Blocking Logic: When an algorithm exceeds maxtime at size N, it's blocked for all sizes > N
  2. Per-eltype Tracking: Each element type has its own blocked list, naturally resetting when switching types
  3. NaN Handling: Preserved - blocked/skipped runs return NaN
  4. Progress Reporting: Updated to show "Exceeded maxtime" vs "Skipped" for clarity
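The per-eltype reset in point 2 falls out of the data structure itself. A small sketch (variable names are illustrative): keying the outer dictionary by element type means the first lookup for a new eltype creates a fresh, empty blocked list.

```julia
# Per-eltype tracking: each element type gets its own blocked list,
# so switching eltypes naturally starts from a clean slate.
blocked = Dict{DataType,Dict{String,Int}}()
get!(blocked, Float64, Dict{String,Int}())["GenericLUFactorization"] = 500
f32 = get!(blocked, Float32, Dict{String,Int}())   # first Float32 lookup
```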

Why This Approach is Better

  • Stability: No risk of Julia hanging from task interruption attempts
  • Simplicity: Removed ~40 lines of async/channel code
  • Predictability: Algorithms either run to completion or are skipped entirely
  • Efficiency: Prevents repeated testing of slow algorithms on larger sizes

The user experience remains the same: slow algorithms are still prevented from wasting time on larger matrices, but the implementation is much more robust.

@ChrisRackauckas-Claude
Contributor Author

Fixed the blocking logic issue!

The problem was that the comparison was using > instead of >= when checking if an algorithm should be skipped. This meant that when an algorithm exceeded maxtime at size N, it would still run again at size N in future runs.

The fix:

  • Changed line 150 from if n > blocked_at_size to if n >= blocked_at_size
  • Updated variable name from max_allowed_size to blocked_at_size for clarity
  • Fixed comments to accurately describe the behavior
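The off-by-one in isolation (a reduced illustration, not the surrounding benchmark code): with `>`, an algorithm blocked at size N was not skipped when size N came up again; `>=` closes that gap.

```julia
blocked_at_size = 9000
should_skip_old(n) = n > blocked_at_size    # buggy comparison
should_skip_new(n) = n >= blocked_at_size   # fixed comparison
```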

Testing:
Created unit tests that verify the blocking logic works correctly for:

  • Sizes smaller than the blocked size (should NOT skip) ✓
  • Size equal to the blocked size (should skip - this was the bug) ✓
  • Sizes larger than the blocked size (should skip) ✓
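The three cases can be sketched as follows, assuming a predicate `should_skip(n, blocked_at_size)` equivalent to the fixed `n >= blocked_at_size` check (the predicate name is illustrative):

```julia
using Test

should_skip(n, blocked_at_size) = n >= blocked_at_size

@testset "maxtime blocking logic" begin
    @test !should_skip(4999, 5000)  # smaller than blocked size: runs
    @test  should_skip(5000, 5000)  # equal to blocked size: skipped (the bug)
    @test  should_skip(5001, 5000)  # larger than blocked size: skipped
end
```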

The fix ensures that once an algorithm exceeds maxtime at a given matrix size, it will be properly skipped for that size and all larger sizes in subsequent benchmarking runs.

@ChrisRackauckas-Claude
Contributor Author

Added informative skip message as requested!

Now when an algorithm is skipped due to exceeding maxtime on a previous size, it will display:

[ Info: ⏭️  Skipping GenericLUFactorization for 9000×9000 Float64 matrix (exceeded maxtime at size 9000)

This makes it much clearer during benchmarking when and why algorithms are being skipped.

@ChrisRackauckas ChrisRackauckas merged commit 3fd10c0 into SciML:main Aug 12, 2025
114 of 118 checks passed