Commit a0f36af

Add maxtime parameter to LinearSolveAutotune for timeout handling (#716)
* Add maxtime parameter to LinearSolveAutotune

- Added maxtime parameter with 100s default to autotune_setup() and benchmark_algorithms()
- Implements timeout handling during accuracy checks and benchmarking
- Records timed out runs as NaN in results
- Updated docstrings and documentation to explain the new parameter
- Prevents hanging on slow algorithms or large matrices

* Improve timeout handling: properly kill timed-out tasks

- Use Channel-based communication between warmup and timer tasks
- Properly interrupt timed-out tasks with Base.throwto()
- Clean up timer task when warmup completes successfully
- Handle exceptions from warmup task properly
- Prevents resource leaks from hanging tasks

* Update lib/LinearSolveAutotune/src/benchmarking.jl

* Make analysis tools robust to NaN values from timeouts

- Filter out NaN values when computing mean, max, and std statistics
- Exclude NaN values from plots to avoid visualization errors
- Report number of timed-out tests in summary output
- Ensure categorize_results excludes NaN values when selecting best algorithms
- All aggregation functions now properly handle NaN values that indicate timeouts

This ensures the autotuning system works correctly even when some tests timeout, which is expected behavior for large matrix sizes or slow algorithms.

---------

Co-authored-by: ChrisRackauckas <[email protected]>
1 parent 06ce5b1 commit a0f36af

5 files changed: +171 -48 lines changed

docs/src/tutorials/autotune.md

Lines changed: 31 additions & 0 deletions
@@ -132,6 +132,37 @@ results = autotune_setup(
 )
 ```
 
+### Time Limits for Algorithm Tests
+
+Control the maximum time allowed for each algorithm test (including accuracy check):
+
+```julia
+# Default: 100 seconds maximum per algorithm test
+results = autotune_setup()  # maxtime = 100.0
+
+# Quick timeout for fast exploration
+results = autotune_setup(maxtime = 10.0)
+
+# Extended timeout for slow algorithms or large matrices
+results = autotune_setup(
+    maxtime = 300.0,  # 5 minutes per test
+    sizes = [:large, :big]
+)
+
+# Conservative timeout for production benchmarking
+results = autotune_setup(
+    maxtime = 200.0,
+    samples = 10,
+    seconds = 2.0
+)
+```
+
+When an algorithm exceeds the `maxtime` limit:
+- The test is skipped to prevent hanging
+- The result is recorded as `NaN` in the benchmark data
+- A warning is displayed indicating the timeout
+- The benchmark continues with the next algorithm
+
 ### Missing Algorithm Handling
 
 By default, autotune expects all algorithms to be available to ensure complete benchmarking. You can relax this requirement:

lib/LinearSolveAutotune/src/LinearSolveAutotune.jl

Lines changed: 29 additions & 12 deletions
@@ -78,13 +78,13 @@ function Base.show(io::IO, results::AutotuneResults)
     println(io, " • Julia: ", get(results.sysinfo, "julia_version", "Unknown"))
     println(io, " • Threads: ", get(results.sysinfo, "num_threads", "Unknown"), " (BLAS: ", get(results.sysinfo, "blas_num_threads", "Unknown"), ")")
 
-    # Results summary
-    successful_results = filter(row -> row.success, results.results_df)
+    # Results summary - filter out NaN values
+    successful_results = filter(row -> row.success && !isnan(row.gflops), results.results_df)
     if nrow(successful_results) > 0
         println(io, "\n🏆 Top Performing Algorithms:")
         summary = combine(groupby(successful_results, :algorithm),
-            :gflops => mean => :avg_gflops,
-            :gflops => maximum => :max_gflops,
+            :gflops => (x -> mean(filter(!isnan, x))) => :avg_gflops,
+            :gflops => (x -> maximum(filter(!isnan, x))) => :max_gflops,
             nrow => :num_tests)
         sort!(summary, :avg_gflops, rev = true)

@@ -104,6 +104,12 @@ function Base.show(io::IO, results::AutotuneResults)
     println(io, "📏 Matrix Sizes: ", minimum(sizes), "×", minimum(sizes),
         " to ", maximum(sizes), "×", maximum(sizes))
 
+    # Report timeouts if any
+    timeout_results = filter(row -> isnan(row.gflops), results.results_df)
+    if nrow(timeout_results) > 0
+        println(io, "⏱️ Timed Out: ", nrow(timeout_results), " tests exceeded time limit")
+    end
+
     # Call to action - reordered
     println(io, "\n" * "="^60)
     println(io, "🚀 For comprehensive results, consider running:")
@@ -158,7 +164,8 @@ end
         seconds::Float64 = 0.5,
         eltypes = (Float32, Float64, ComplexF32, ComplexF64),
         skip_missing_algs::Bool = false,
-        include_fastlapack::Bool = false)
+        include_fastlapack::Bool = false,
+        maxtime::Float64 = 100.0)
 
 Run a comprehensive benchmark of all available LU factorization methods and optionally:
 
@@ -182,6 +189,8 @@ Run a comprehensive benchmark of all available LU factorization methods and opti
 - `eltypes = (Float32, Float64, ComplexF32, ComplexF64)`: Element types to benchmark
 - `skip_missing_algs::Bool = false`: If false, error when expected algorithms are missing; if true, warn instead
 - `include_fastlapack::Bool = false`: If true, includes FastLUFactorization in benchmarks
+- `maxtime::Float64 = 100.0`: Maximum time in seconds for each algorithm test (including accuracy check).
+  If exceeded, the run is skipped and recorded as NaN
 
 # Returns
 
@@ -216,7 +225,8 @@ function autotune_setup(;
         seconds::Float64 = 0.5,
         eltypes = (Float64,),
         skip_missing_algs::Bool = false,
-        include_fastlapack::Bool = false)
+        include_fastlapack::Bool = false,
+        maxtime::Float64 = 100.0)
     @info "Starting LinearSolve.jl autotune setup..."
     @info "Configuration: sizes=$sizes, set_preferences=$set_preferences"
     @info "Element types to benchmark: $(join(eltypes, ", "))"
@@ -249,18 +259,25 @@ function autotune_setup(;

     # Run benchmarks
     @info "Running benchmarks (this may take several minutes)..."
+    @info "Maximum time per algorithm test: $(maxtime)s"
     results_df = benchmark_algorithms(matrix_sizes, all_algs, all_names, eltypes;
-        samples = samples, seconds = seconds, sizes = sizes)
+        samples = samples, seconds = seconds, sizes = sizes, maxtime = maxtime)

-    # Display results table
-    successful_results = filter(row -> row.success, results_df)
+    # Display results table - filter out NaN values
+    successful_results = filter(row -> row.success && !isnan(row.gflops), results_df)
+    timeout_results = filter(row -> isnan(row.gflops), results_df)
+
+    if nrow(timeout_results) > 0
+        @info "$(nrow(timeout_results)) tests timed out (exceeded $(maxtime)s limit)"
+    end
+
     if nrow(successful_results) > 0
         @info "Benchmark completed successfully!"

-        # Create summary table for display
+        # Create summary table for display - handle NaN values
         summary = combine(groupby(successful_results, :algorithm),
-            :gflops => mean => :avg_gflops,
-            :gflops => maximum => :max_gflops,
+            :gflops => (x -> mean(filter(!isnan, x))) => :avg_gflops,
+            :gflops => (x -> maximum(filter(!isnan, x))) => :max_gflops,
             nrow => :num_tests)
         sort!(summary, :avg_gflops, rev = true)
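
The NaN filtering applied to the summary statistics in this file can be tried in isolation. The following is a minimal sketch that uses only the `Statistics` standard library; the sample values are invented for illustration, and it does not touch DataFrames or any LinearSolveAutotune API.

```julia
using Statistics

# Hypothetical GFLOPs samples for one algorithm; NaN marks a timed-out run.
gflops = [12.5, NaN, 14.0, 13.5]

# Keep only the runs that actually completed.
valid = filter(!isnan, gflops)

naive_mean = mean(gflops)           # a single NaN poisons the whole average
robust_mean = mean(valid)           # average over completed runs only
robust_max = maximum(valid)         # best completed run
n_timed_out = count(isnan, gflops)  # number of runs recorded as NaN
```

One caveat worth keeping in mind: if every sample in a group is NaN, the filtered vector is empty, so `mean` returns `NaN` again and `maximum` throws an error. In the diff above the rows are already filtered with `!isnan(row.gflops)` before grouping, which appears to rule that case out for the displayed summaries.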

lib/LinearSolveAutotune/src/benchmarking.jl

Lines changed: 107 additions & 33 deletions
@@ -73,14 +73,19 @@ end

 """
     benchmark_algorithms(matrix_sizes, algorithms, alg_names, eltypes;
-                        samples=5, seconds=0.5, sizes=[:small, :medium])
+                        samples=5, seconds=0.5, sizes=[:small, :medium],
+                        maxtime=100.0)

 Benchmark the given algorithms across different matrix sizes and element types.
 Returns a DataFrame with results including element type information.
+
+# Arguments
+- `maxtime::Float64 = 100.0`: Maximum time in seconds for each algorithm test (including accuracy check).
+  If the accuracy check exceeds this time, the run is skipped and recorded as NaN.
 """
 function benchmark_algorithms(matrix_sizes, algorithms, alg_names, eltypes;
         samples = 5, seconds = 0.5, sizes = [:tiny, :small, :medium, :large],
-        check_correctness = true, correctness_tol = 1e0)
+        check_correctness = true, correctness_tol = 1e0, maxtime = 100.0)

     # Set benchmark parameters
     old_params = BenchmarkTools.DEFAULT_PARAMETERS
@@ -136,52 +141,120 @@ function benchmark_algorithms(matrix_sizes, algorithms, alg_names, eltypes;
                 ProgressMeter.update!(progress,
                     desc="Benchmarking $name on $(n)×$(n) $eltype matrix: ")

-                gflops = 0.0
+                gflops = NaN  # Use NaN for timed out runs
                 success = true
                 error_msg = ""
                 passed_correctness = true
+                timed_out = false

                 try
                     # Create the linear problem for this test
                     prob = LinearProblem(copy(A), copy(b);
                         u0 = copy(u0),
                         alias = LinearAliasSpecifier(alias_A = true, alias_b = true))

-                    # Warmup run and correctness check
-                    warmup_sol = solve(prob, alg)
+                    # Time the warmup run and correctness check
+                    start_time = time()

-                    # Check correctness if reference solution is available
-                    if check_correctness && reference_solution !== nothing
-                        # Compute relative error
-                        rel_error = norm(warmup_sol.u - reference_solution.u) / norm(reference_solution.u)
-
-                        if rel_error > correctness_tol
-                            passed_correctness = false
-                            @warn "Algorithm $name failed correctness check for size $n, eltype $eltype. " *
-                                  "Relative error: $(round(rel_error, sigdigits=3)) > tolerance: $correctness_tol. " *
-                                  "Algorithm will be excluded from results."
-                            success = false
-                            error_msg = "Failed correctness check (rel_error = $(round(rel_error, sigdigits=3)))"
+                    # Create a channel for communication between tasks
+                    result_channel = Channel(1)
+
+                    # Warmup run and correctness check with timeout
+                    warmup_task = @async begin
+                        try
+                            result = solve(prob, alg)
+                            put!(result_channel, result)
+                        catch e
+                            put!(result_channel, e)
+                        end
+                    end
+
+                    # Timer task to enforce timeout
+                    timer_task = @async begin
+                        sleep(maxtime)
+                        if !istaskdone(warmup_task)
+                            try
+                                Base.throwto(warmup_task, InterruptException())
+                            catch
+                                # Task might be in non-interruptible state
+                            end
+                            put!(result_channel, :timeout)
+                        end
+                    end
+
+                    # Wait for result or timeout
+                    warmup_sol = nothing
+                    result = take!(result_channel)
+
+                    # Clean up timer task if still running
+                    if !istaskdone(timer_task)
+                        try
+                            Base.throwto(timer_task, InterruptException())
+                        catch
+                            # Timer task might have already finished
                         end
                     end

-                    # Only benchmark if correctness check passed
-                    if passed_correctness
-                        # Actual benchmark
-                        bench = @benchmark solve($prob, $alg) setup=(prob = LinearProblem(
-                            copy($A), copy($b);
-                            u0 = copy($u0),
-                            alias = LinearAliasSpecifier(alias_A = true, alias_b = true)))
-
-                        # Calculate GFLOPs
-                        min_time_sec = minimum(bench.times) / 1e9
-                        flops = luflop(n, n)
-                        gflops = flops / min_time_sec / 1e9
+                    if result === :timeout
+                        # Task timed out
+                        timed_out = true
+                        @warn "Algorithm $name timed out (exceeded $(maxtime)s) for size $n, eltype $eltype. Recording as NaN."
+                        success = false
+                        error_msg = "Timed out (exceeded $(maxtime)s)"
+                        gflops = NaN
+                    elseif result isa Exception
+                        # Task threw an error
+                        throw(result)
+                    else
+                        # Successful completion
+                        warmup_sol = result
+                        elapsed_time = time() - start_time
+
+                        # Check correctness if reference solution is available
+                        if check_correctness && reference_solution !== nothing
+                            # Compute relative error
+                            rel_error = norm(warmup_sol.u - reference_solution.u) / norm(reference_solution.u)
+
+                            if rel_error > correctness_tol
+                                passed_correctness = false
+                                @warn "Algorithm $name failed correctness check for size $n, eltype $eltype. " *
+                                      "Relative error: $(round(rel_error, sigdigits=3)) > tolerance: $correctness_tol. " *
+                                      "Algorithm will be excluded from results."
+                                success = false
+                                error_msg = "Failed correctness check (rel_error = $(round(rel_error, sigdigits=3)))"
+                                gflops = 0.0
+                            end
+                        end
+
+                        # Only benchmark if correctness check passed and we have time remaining
+                        if passed_correctness && !timed_out
+                            # Check if we have enough time remaining for benchmarking
+                            # Allow at least 2x the warmup time for benchmarking
+                            remaining_time = maxtime - elapsed_time
+                            if remaining_time < 2 * elapsed_time
+                                @warn "Algorithm $name: insufficient time remaining for benchmarking (warmup took $(round(elapsed_time, digits=2))s). Recording as NaN."
+                                gflops = NaN
+                                success = false
+                                error_msg = "Insufficient time for benchmarking"
+                            else
+                                # Actual benchmark
+                                bench = @benchmark solve($prob, $alg) setup=(prob = LinearProblem(
+                                    copy($A), copy($b);
+                                    u0 = copy($u0),
+                                    alias = LinearAliasSpecifier(alias_A = true, alias_b = true)))
+
+                                # Calculate GFLOPs
+                                min_time_sec = minimum(bench.times) / 1e9
+                                flops = luflop(n, n)
+                                gflops = flops / min_time_sec / 1e9
+                            end
+                        end
                     end

                 catch e
                     success = false
                     error_msg = string(e)
+                    gflops = NaN
                     # Don't warn for each failure, just record it
                 end
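
The Channel-based timeout in the hunk above can be illustrated outside the benchmarking loop. What follows is a simplified sketch, not the code from this commit: `run_with_timeout` is a hypothetical helper name, the channel is given capacity 2 (an assumption, so that neither task can block on `put!`), and the worker is not interrupted with `Base.throwto`.

```julia
# Hypothetical helper, not part of LinearSolveAutotune.
# Runs `f()` in a task and returns its result, the exception it threw,
# or the symbol :timeout if no result arrived within `maxtime` seconds.
function run_with_timeout(f, maxtime::Real)
    # Capacity 2 (assumption): worker and timer can both put! without blocking.
    result_channel = Channel{Any}(2)

    # Worker task: report either the result or the caught exception.
    worker = @async begin
        try
            put!(result_channel, f())
        catch e
            put!(result_channel, e)
        end
    end

    # Timer task: report a timeout if the worker has not finished in time.
    @async begin
        sleep(maxtime)
        istaskdone(worker) || put!(result_channel, :timeout)
    end

    # Whichever task reports first decides the outcome.
    return take!(result_channel)
end

fast = run_with_timeout(() -> 1 + 1, 1.0)
slow = run_with_timeout(() -> (sleep(2.0); :done), 0.1)
failed = run_with_timeout(() -> error("boom"), 1.0)
```

Tasks started with `@async` are scheduled cooperatively, so the timer can only fire when the worker yields (for example during `sleep` or I/O). A compute-bound call that never yields would not be preempted by this pattern, which is worth keeping in mind when reading the timeout logic above.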

@@ -252,8 +325,8 @@ Categorize the benchmark results into size ranges and find the best algorithm fo
 For complex types, avoids RFLUFactorization if possible due to known issues.
 """
 function categorize_results(df::DataFrame)
-    # Filter successful results
-    successful_df = filter(row -> row.success, df)
+    # Filter successful results and exclude NaN values
+    successful_df = filter(row -> row.success && !isnan(row.gflops), df)

     if nrow(successful_df) == 0
         @warn "No successful benchmark results found!"
@@ -293,8 +366,9 @@ function categorize_results(df::DataFrame)
             continue
         end

-        # Calculate average GFLOPs for each algorithm in this range
-        avg_results = combine(groupby(range_df, :algorithm), :gflops => mean => :avg_gflops)
+        # Calculate average GFLOPs for each algorithm in this range, excluding NaN values
+        avg_results = combine(groupby(range_df, :algorithm),
+            :gflops => (x -> mean(filter(!isnan, x))) => :avg_gflops)

         # Sort by performance
         sort!(avg_results, :avg_gflops, rev=true)
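
Separately from the NaN filtering in `categorize_results`, the `benchmark_algorithms` hunk earlier in this file skips the benchmark when `remaining_time < 2 * elapsed_time`. A small worked sketch of that rule follows; `has_benchmark_budget` is a hypothetical name introduced here for illustration only.

```julia
# Hypothetical helper mirroring the negation of `remaining_time < 2 * elapsed_time`.
# Returns true when at least twice the warmup time is still available.
function has_benchmark_budget(maxtime::Real, elapsed_time::Real)
    remaining_time = maxtime - elapsed_time
    return remaining_time >= 2 * elapsed_time
end

# With the default maxtime of 100 s:
runs_benchmark = has_benchmark_budget(100.0, 30.0)   # 70 s remaining vs. 60 s required
skips_benchmark = has_benchmark_budget(100.0, 40.0)  # 60 s remaining vs. 80 s required
```

Rearranging the inequality shows that the benchmark only runs when the warmup took at most one third of `maxtime`, so with the default of 100 s a warmup longer than roughly 33 s leads to the run being recorded as NaN.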

lib/LinearSolveAutotune/src/plotting.jl

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ function create_benchmark_plots(df::DataFrame; title_base = "LinearSolve.jl LU F

         # Plot each algorithm for this element type
         for alg in algorithms
-            alg_df = filter(row -> row.algorithm == alg, eltype_df)
+            alg_df = filter(row -> row.algorithm == alg && !isnan(row.gflops), eltype_df)
             if nrow(alg_df) > 0
                 # Sort by size for proper line plotting
                 sort!(alg_df, :size)

lib/LinearSolveAutotune/src/telemetry.jl

Lines changed: 3 additions & 2 deletions
@@ -365,9 +365,10 @@ function format_detailed_results_markdown(df::DataFrame)
         end

         # Create a summary table with average performance per algorithm for this element type
+        # Filter out NaN values when computing statistics
         summary = combine(groupby(eltype_df, :algorithm),
-            :gflops => mean => :avg_gflops,
-            :gflops => std => :std_gflops,
+            :gflops => (x -> mean(filter(!isnan, x))) => :avg_gflops,
+            :gflops => (x -> std(filter(!isnan, x))) => :std_gflops,
             nrow => :num_tests)
         sort!(summary, :avg_gflops, rev = true)
