Ever wondered what's making your CUDA binary big?
Cubloaty is a size profiler for CUDA binaries. It analyzes .so files and .cubin files to show you the size of each kernel, broken down by architecture (sm_70, sm_80, sm_90, etc.).
Think of it as bloaty, but for CUDA kernels.
$ cubloaty sampling.so
╭──────────────────────────────────╮
│ CUDA Kernel Size Analysis Report │
╰──────────────────────────────────╯
Architecture Summary
╭──────────────┬─────────┬────────────┬────────────╮
│ Architecture │ Kernels │ Total Size │ Percentage │
├──────────────┼─────────┼────────────┼────────────┤
│ SM_89        │     361 │      5.5MB │     100.0% │
├──────────────┼─────────┼────────────┼────────────┤
│ TOTAL        │     361 │      5.5MB │     100.0% │
╰──────────────┴─────────┴────────────┴────────────╯
Section Breakdown
╭───────────────┬────────────┬────────────╮
│ Section Type  │ Total Size │ % of Total │
├───────────────┼────────────┼────────────┤
│ Code Sections │      4.3MB │      78.9% │
│ Metadata      │    567.5KB │      10.1% │
│ Data Sections │    510.4KB │       9.1% │
│ Debug Info    │     39.8KB │       0.7% │
├───────────────┼────────────┼────────────┤
│ TOTAL         │      5.5MB │     100.0% │
╰───────────────┴────────────┴────────────╯
Top CUDA Kernels (All Architectures) - 361 Total
╭──────┬────────────────────────────────────────────────────────────────────────┬───────────┬───────────╮
│ Rank │ Kernel Name                                                            │ Code Size │ % of Code │
├──────┼────────────────────────────────────────────────────────────────────────┼───────────┼───────────┤
│ 1    │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │    55.8KB │      1.2% │
│ 2    │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │    55.5KB │      1.2% │
│ 3    │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │    52.9KB │      1.2% │
│ 4    │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │    52.6KB │      1.2% │
│ 5    │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<512u, (cu... │    51.5KB │      1.1% │
│ 6    │ void flashinfer::sampling::TopKSamplingFromProbKernel<512u, (cub::C... │    51.2KB │      1.1% │
│ 7    │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<512u, (cu... │    46.4KB │      1.0% │
│ 8    │ void flashinfer::sampling::TopKSamplingFromProbKernel<512u, (cub::C... │    46.2KB │      1.0% │
│ 9    │ void flashinfer::sampling::TopPSamplingFromProbKernel<1024u, (cub::... │    46.0KB │      1.0% │
│ 10   │ void flashinfer::sampling::ChainSpeculativeSampling<1024u, (cub::CU... │    45.5KB │      1.0% │
│ 11   │ void flashinfer::sampling::ChainSpeculativeSampling<512u, (cub::CUB... │    43.0KB │      1.0% │
│ 12   │ void flashinfer::sampling::TopPSamplingFromProbKernel<1024u, (cub::... │    43.0KB │      1.0% │
│ 13   │ void flashinfer::sampling::TopPSamplingFromProbKernel<512u, (cub::C... │    42.9KB │      1.0% │
│ 14   │ void flashinfer::sampling::ChainSpeculativeSampling<1024u, (cub::CU... │    42.4KB │      0.9% │
│ 15   │ void flashinfer::sampling::MinPSamplingFromProbKernel<1024u, (cub::... │    39.4KB │      0.9% │
│ 16   │ void flashinfer::sampling::ChainSpeculativeSampling<512u, (cub::CUB... │    38.8KB │      0.9% │
│ 17   │ void flashinfer::sampling::TopPRenormProbKernel<1024u, (cub::CUB_30... │    38.4KB │      0.9% │
│ 18   │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │    38.1KB │      0.8% │
│ 19   │ void flashinfer::sampling::TopPSamplingFromProbKernel<512u, (cub::C... │    38.0KB │      0.8% │
│ 20   │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │    37.9KB │      0.8% │
│ 21   │ void flashinfer::sampling::MinPSamplingFromProbKernel<512u, (cub::C... │    36.9KB │      0.8% │
│ 22   │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │    36.4KB │      0.8% │
│ 23   │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │    36.2KB │      0.8% │
│ 24   │ void flashinfer::sampling::MinPSamplingFromProbKernel<1024u, (cub::... │    36.1KB │      0.8% │
│ 25   │ void flashinfer::sampling::TopPRenormProbKernel<512u, (cub::CUB_300... │    34.5KB │      0.8% │
│ 26   │ void flashinfer::sampling::TopKMaskLogitsKernel<1024u, (cub::CUB_30... │    34.2KB │      0.8% │
│ 27   │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<512u, (cu... │    33.9KB │      0.8% │
│ 28   │ void flashinfer::sampling::TopKSamplingFromProbKernel<512u, (cub::C... │    33.8KB │      0.7% │
│ 29   │ void flashinfer::sampling::MinPSamplingFromProbKernel<512u, (cub::C... │    31.9KB │      0.7% │
│ 30   │ void flashinfer::sampling::ChainSpeculativeSampling<1024u, (cub::CU... │    31.8KB │      0.7% │
│ ...  │ (331 more kernels)                                                     │           │           │
├──────┼────────────────────────────────────────────────────────────────────────┼───────────┼───────────┤
│      │ TOTAL KERNEL CODE                                                      │     4.4MB │  80.1% of │
│      │                                                                        │           │      file │
╰──────┴────────────────────────────────────────────────────────────────────────┴───────────┴───────────╯
Analysis complete!

- Multi-architecture analysis: see kernel sizes across sm_70, sm_80, sm_90, etc.
- Regex filtering: filter kernels by name pattern
- Multiple formats: .so libraries and standalone .cubin files
- Rich output: beautiful tables or JSON for scripting
- Fast: analyzes binaries in seconds
Cubloaty requires the following tools to be installed and available in your PATH:
- CUDA Toolkit - for cuobjdump (part of the CUDA installation)
- binutils - for objdump, objcopy, and readelf
- gcc/g++ - for c++filt (symbol demangling)
On Ubuntu/Debian:
sudo apt-get install binutils gcc
CUDA Toolkit can be downloaded from NVIDIA's website.
Install the package from PyPI:
pip install cubloaty
Or git clone the repo and install from source:
git clone https://github.com/flashinfer-ai/cubloaty.git
cd cubloaty
pip install -e . -v  # editable mode

# Analyze a shared library
cubloaty libmykernel.so
# Analyze a standalone cubin file
cubloaty kernel.sm_90.cubin
# Show the top 50 kernels
cubloaty libmykernel.so --top 50
# Restrict the analysis to one architecture
cubloaty libmykernel.so --arch sm_90
# Find all GEMM kernels
cubloaty libmykernel.so --filter "gemm"
# Find attention-related kernels
cubloaty libmykernel.so --filter "attention|flash"
# Write the report as JSON
cubloaty libmykernel.so --format json > analysis.json
# Show full kernel names without truncation
cubloaty libmykernel.so --full-names
# Show top 20 GEMM kernels for sm_90 in JSON format
cubloaty lib.so --arch sm_90 --filter "gemm" --top 20 --format json
# Show per-architecture breakdown
cubloaty libmykernel.so --verbose
# Show just the top 10
cubloaty libmykernel.so --top 10
# Get JSON output and process with jq
cubloaty lib.so --format json | jq '.kernels[] | select(.size > 100000)'

file                   Path to .so or .cubin file to analyze
--top N, -n N          Show top N kernels (default: 30)
--arch ARCH, -a ARCH   Filter by architecture (e.g., sm_90, sm_80)
--filter REGEX, -r     Filter kernel names by regex (case-insensitive)
--format {table,json}  Output format (default: table)
--full-names           Show full kernel names without truncation
--no-color             Disable colored output
--verbose, -v          Show detailed processing information
--version              Show version number
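
If jq is not available, the same size filter can be applied in Python. The sketch below is a rough equivalent of the jq command, not part of Cubloaty itself. It assumes only what the jq example implies: a top-level "kernels" list whose entries carry a numeric "size". The "name" key and the sample values are hypothetical, chosen for illustration.

```python
import json

# Hypothetical sample mirroring the shape implied by the jq example:
# a top-level "kernels" list whose entries have a numeric "size".
# The "name" key and all values here are assumptions for illustration.
SAMPLE_REPORT = """
{
  "kernels": [
    {"name": "gemm_kernel_a", "size": 150000},
    {"name": "softmax_kernel", "size": 42000},
    {"name": "gemm_kernel_b", "size": 100001}
  ]
}
"""


def large_kernels(report, min_size=100000):
    """Return kernels whose size exceeds min_size, largest first."""
    selected = [k for k in report["kernels"] if k["size"] > min_size]
    return sorted(selected, key=lambda k: k["size"], reverse=True)


report = json.loads(SAMPLE_REPORT)
for kernel in large_kernels(report):
    print(kernel["size"], kernel.get("name", "<unnamed>"))
```

To run this against a real report, replace `SAMPLE_REPORT` with the contents of a file written by `cubloaty libmykernel.so --format json > analysis.json`, and check the field names against the actual output first.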
Cubloaty extracts CUDA fatbinary sections from shared libraries using objdump and objcopy, then uses cuobjdump to extract individual cubins for each architecture. It analyzes each cubin with readelf to extract kernel symbols and their sizes, and uses c++filt to demangle C++ symbol names.
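
The last two steps of that pipeline (reading symbol sizes with readelf, then demangling with c++filt) can be sketched roughly as follows. This is an illustrative approximation, not Cubloaty's actual implementation: the function names are hypothetical, the sample text is a hand-written imitation of `readelf -sW` output, and real cubin symbol tables may carry extra annotations that need further handling.

```python
import subprocess


def parse_function_symbols(readelf_output):
    """Return (mangled_name, size_in_bytes) for each FUNC symbol, largest first."""
    symbols = []
    for line in readelf_output.splitlines():
        fields = line.split()
        # Data rows look like: Num: Value Size Type Bind Vis Ndx Name
        if len(fields) < 8 or not fields[0].rstrip(":").isdigit():
            continue  # header, blank line, or a row without a symbol name
        size_text, sym_type, name = fields[2], fields[3], fields[-1]
        if sym_type != "FUNC":
            continue
        # readelf prints large sizes in hex with a 0x prefix; base 0 accepts both.
        symbols.append((name, int(size_text, 0)))
    return sorted(symbols, key=lambda item: item[1], reverse=True)


def demangle(names):
    """Demangle C++ symbol names by piping them through c++filt (must be on PATH)."""
    result = subprocess.run(
        ["c++filt"], input="\n".join(names),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Hand-written sample imitating `readelf -sW some.cubin` (hypothetical symbols).
SAMPLE = """\
Symbol table '.symtab' contains 4 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT   10 .text._Z3addPfS_
     2: 0000000000000000   384 FUNC    GLOBAL DEFAULT   10 _Z3addPfS_
     3: 0000000000000000 0x1f400 FUNC    GLOBAL DEFAULT   11 _Z4gemmPfS_S_
"""

for name, size in parse_function_symbols(SAMPLE):
    print(f"{size:>8}  {name}")
```

Parsing is kept separate from the subprocess call so the size logic can be checked on captured text without the CUDA Toolkit or binutils installed; `demangle` is defined but not called here for the same reason.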
Issues and pull requests are welcome!