Improve TTFX #3064

Open
maleadt wants to merge 7 commits into master from tb/precompile
maleadt (Member) commented Mar 27, 2026

Before:

Benchmark: julia -e "using CUDA; @cuda identity(nothing)"
  Time (mean ± σ):      7.458 s ±  0.039 s    [User: 10.002 s, System: 0.444 s]
  Range (min … max):    7.408 s …  7.552 s    10 runs

After:

Benchmark: julia --project -e "using CUDACore; @cuda identity(nothing)"
  Time (mean ± σ):      1.799 s ±  0.010 s    [User: 2.424 s, System: 0.294 s]
  Range (min … max):    1.781 s …  1.816 s    10 runs
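
The description does not name the tool that produced these numbers. As a rough sketch of how a comparable cold-start measurement could be taken from Julia itself, the following times several fresh processes and reports the same summary statistics. The helper name `time_command` and its `runs` keyword are hypothetical, not part of CUDA.jl.

```julia
using Statistics

# Hypothetical helper: run `cmd` in a fresh process `runs` times and
# summarize the wall-clock times in seconds.
function time_command(cmd::Cmd; runs::Int=10)
    times = Float64[]
    for _ in 1:runs
        # A new process per run means package loading and first-call
        # compilation are part of every measurement.
        t = @elapsed run(pipeline(cmd; stdout=devnull, stderr=devnull))
        push!(times, t)
    end
    return (mean=mean(times), std=std(times),
            min=minimum(times), max=maximum(times))
end

# Example call (needs a CUDA-capable machine, so it is left commented out):
# julia = Base.julia_cmd()
# time_command(`$julia --project -e "using CUDACore; @cuda identity(nothing)"`)
```

Spawning a new process for each run is what makes this a time-to-first-execution measurement; timing repeated calls inside one session would mostly measure already-compiled code.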

Or, even more spectacular, the profiler:

Before:

Benchmark: julia -e "using CUDA; show(devnull, CUDA.@profile @cuda identity(nothing))"
  Time (mean ± σ):     17.436 s ±  0.107 s    [User: 19.796 s, System: 0.521 s]
  Range (min … max):   17.248 s … 17.559 s    10 runs

After:

Benchmark: julia --project -e "using CUDACore, CUDATools; show(devnull, @profile @cuda identity(nothing))"
  Time (mean ± σ):      3.194 s ±  0.017 s    [User: 3.778 s, System: 0.331 s]
  Range (min … max):    3.178 s …  3.231 s    10 runs
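
To put the two comparisons side by side, the speedup can be worked out from the reported means. This is plain arithmetic on the numbers above; the variable names are only illustrative.

```julia
# Mean wall-clock times in seconds, copied from the benchmark output above.
launch   = (before = 7.458,  after = 1.799)
profiler = (before = 17.436, after = 3.194)

# Speedup is the old time divided by the new time.
speedup(t) = t.before / t.after

println("kernel launch: ", round(speedup(launch); digits=2), "x")
println("profiler:      ", round(speedup(profiler); digits=2), "x")
```

That works out to about 4.1x for the plain kernel launch and about 5.5x for the profiler case. Note that the before and after commands load different packages (CUDA versus CUDACore and CUDATools), so the comparison is between the two setups as reported.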

Depends on JuliaGPU/GPUCompiler.jl#776

kshyatt added the "performance" label (How fast can we go?) on Mar 27, 2026
github-actions bot (Contributor) left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: 77d00a4 | Previous: ecb27a7 | Ratio |
|---|---|---|---|
| latency/precompile | 4583534095.5 ns | 4095102458 ns | 1.12 |
| latency/ttfp | 4409300069 ns | 14632855735 ns | 0.30 |
| latency/import | 3825213452.5 ns | 3384844378 ns | 1.13 |
| integration/volumerhs | 9435275.5 ns | 9443415 ns | 1.00 |
| integration/byval/slices=1 | 145955 ns | 145838 ns | 1.00 |
| integration/byval/slices=3 | 423262 ns | 422984 ns | 1.00 |
| integration/byval/reference | 144037 ns | 144026 ns | 1.00 |
| integration/byval/slices=2 | 284655 ns | 284568.5 ns | 1.00 |
| integration/cudadevrt | 102687 ns | 102657 ns | 1.00 |
| kernel/indexing | 13636 ns | 13364 ns | 1.02 |
| kernel/indexing_checked | 14263 ns | 14057 ns | 1.01 |
| kernel/occupancy | 730.0555555555557 ns | 669.0063694267516 ns | 1.09 |
| kernel/launch | 2228.5555555555557 ns | 2129.7 ns | 1.05 |
| kernel/rand | 15954 ns | 14468 ns | 1.10 |
| array/reverse/1d | 18844 ns | 18380 ns | 1.03 |
| array/reverse/2dL_inplace | 66117 ns | 65953 ns | 1.00 |
| array/reverse/1dL | 69408 ns | 68908.5 ns | 1.01 |
| array/reverse/2d | 21233 ns | 20404 ns | 1.04 |
| array/reverse/1d_inplace | 10320.333333333334 ns | 8469.666666666666 ns | 1.22 |
| array/reverse/2d_inplace | 11664 ns | 10199 ns | 1.14 |
| array/reverse/2dL | 73235 ns | 72407 ns | 1.01 |
| array/reverse/1dL_inplace | 66058 ns | 65821 ns | 1.00 |
| array/copy | 19083 ns | 18385 ns | 1.04 |
| array/iteration/findall/int | 150383.5 ns | 148396.5 ns | 1.01 |
| array/iteration/findall/bool | 133286 ns | 131596 ns | 1.01 |
| array/iteration/findfirst/int | 84538 ns | 82845 ns | 1.02 |
| array/iteration/findfirst/bool | 82499 ns | 81148.5 ns | 1.02 |
| array/iteration/scalar | 68951 ns | 68544 ns | 1.01 |
| array/iteration/logical | 203658 ns | 197158 ns | 1.03 |
| array/iteration/findmin/1d | 88631 ns | 86597.5 ns | 1.02 |
| array/iteration/findmin/2d | 117501 ns | 116930 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 44048 ns | 42549 ns | 1.04 |
| array/reductions/reduce/Int64/dims=1 | 43000.5 ns | 43265 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 60230 ns | 59433 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 88058 ns | 87460 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2L | 85594 ns | 84755 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 35508 ns | 34245 ns | 1.04 |
| array/reductions/reduce/Float32/dims=1 | 46980 ns | 39504 ns | 1.19 |
| array/reductions/reduce/Float32/dims=2 | 57520 ns | 56666.5 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1L | 52341 ns | 51754 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 70395 ns | 69136 ns | 1.02 |
| array/reductions/mapreduce/Int64/1d | 43667 ns | 42523 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=1 | 42924.5 ns | 41960.5 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=2 | 60047 ns | 59584 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 88007 ns | 87504 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 85593 ns | 84897 ns | 1.01 |
| array/reductions/mapreduce/Float32/1d | 35487 ns | 34058 ns | 1.04 |
| array/reductions/mapreduce/Float32/dims=1 | 40739 ns | 39925.5 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=2 | 57473 ns | 56515 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1L | 52213 ns | 51712 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 70041 ns | 68949 ns | 1.02 |
| array/broadcast | 20699 ns | 20526 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 11605 ns | 11259 ns | 1.03 |
| array/copyto!/cpu_to_gpu | 217253 ns | 215190.5 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 286557 ns | 280520 ns | 1.02 |
| array/accumulate/Int64/1d | 119099 ns | 118622 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 80970 ns | 79915 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 157467 ns | 156043 ns | 1.01 |
| array/accumulate/Int64/dims=1L | 1695571 ns | 1705195.5 ns | 0.99 |
| array/accumulate/Int64/dims=2L | 962881 ns | 961432 ns | 1.00 |
| array/accumulate/Float32/1d | 102052 ns | 101378 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 77635 ns | 76864 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 144878 ns | 143449 ns | 1.01 |
| array/accumulate/Float32/dims=1L | 1586265 ns | 1592307.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 658698 ns | 660195 ns | 1.00 |
| array/construct | 1330.8 ns | 1314.9 ns | 1.01 |
| array/random/randn/Float32 | 38785 ns | 37667 ns | 1.03 |
| array/random/randn!/Float32 | 31766.5 ns | 31476 ns | 1.01 |
| array/random/rand!/Int64 | 34407 ns | 34492 ns | 1.00 |
| array/random/rand!/Float32 | 8622.666666666666 ns | 8454.333333333334 ns | 1.02 |
| array/random/rand/Int64 | 30001 ns | 37106 ns | 0.81 |
| array/random/rand/Float32 | 13451 ns | 12822 ns | 1.05 |
| array/permutedims/4d | 51855 ns | 52569 ns | 0.99 |
| array/permutedims/2d | 52665 ns | 52223 ns | 1.01 |
| array/permutedims/3d | 53154 ns | 52428.5 ns | 1.01 |
| array/sorting/1d | 2737384 ns | 2743369 ns | 1.00 |
| array/sorting/by | 3305916 ns | 3313654 ns | 1.00 |
| array/sorting/2d | 1069615 ns | 1074587 ns | 1.00 |
| cuda/synchronization/stream/auto | 1041.9 ns | 1044.4615384615386 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 7511 ns | 8092.2 ns | 0.93 |
| cuda/synchronization/stream/blocking | 835.8018867924528 ns | 845.875 ns | 0.99 |
| cuda/synchronization/context/auto | 1187.8 ns | 1177.7 ns | 1.01 |
| cuda/synchronization/context/nonblocking | 8148.4 ns | 7929.299999999999 ns | 1.03 |
| cuda/synchronization/context/blocking | 948.2 ns | 932.4857142857143 ns | 1.02 |

This comment was automatically generated by workflow using github-action-benchmark.
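
The Ratio column in the table above is consistent with current divided by previous, so values below 1 are improvements. As a small sketch of how the three latency rows can be read, the following recomputes the ratio and labels each row. The 10% cutoff and the `classify` helper are assumptions made for illustration, not settings taken from the benchmark workflow.

```julia
# (name, current_ns, previous_ns) for the three latency rows of the table.
rows = [
    ("latency/precompile", 4583534095.5, 4095102458.0),
    ("latency/ttfp",       4409300069.0, 14632855735.0),
    ("latency/import",     3825213452.5, 3384844378.0),
]

# Assumed cutoff: changes within +/-10% are treated as noise.
threshold = 0.10

# Hypothetical helper that turns a current/previous ratio into a label.
classify(ratio; tol=threshold) =
    ratio > 1 + tol ? "slower" : ratio < 1 - tol ? "faster" : "unchanged"

for (name, cur, prev) in rows
    ratio = cur / prev
    println(rpad(name, 20),
            round(cur / 1e9; digits=2), " s vs ",
            round(prev / 1e9; digits=2), " s, ratio ",
            round(ratio; digits=2), " (", classify(ratio), ")")
end
```

Read this way, time to first plot (ttfp) falls from roughly 14.6 s to 4.4 s, while precompilation and import each take about 12 to 13% longer.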
