# Algorithm Implementations
This directory contains various algorithm implementations for distributed computing and matrix operations.
- 00_load: Load operations across multiple GPUs
- 01_store: Store operations across multiple GPUs
- 02_all_load: Load operations where all GPUs load simultaneously
- 03_all_store: Store operations where all GPUs store simultaneously
- 04_atomic_add: Atomic add operations across multiple GPUs
- 05_atomic_xchg: Atomic exchange operations across multiple GPUs
- 06_message_passing: Point-to-point message passing (load/store and put/get operations)
- 07_gemm_all_scatter: Matrix multiplication with all-scatter communication
- 08_gemm_atomics_all_reduce: Matrix multiplication with all-reduce using atomics
- 09_gemm_one_shot_all_reduce: Matrix multiplication with one-shot all-reduce
- 10_gemm_all_scatter_wg_specialization: Matrix multiplication with all-scatter using workgroup specialization
- 11_gemm_all_scatter_producer_consumer: Matrix multiplication with all-scatter using producer-consumer concurrent kernels
- 12_gemm_all_scatter_bulk_synchronous: Matrix multiplication with all-scatter using the bulk synchronous parallel approach
- 13_flash_decode: Fused Flash Decode Attention for accelerating LLM inference
- 14_all_gather_gemm: Fused All-Gather + GEMM with Pull and Push models
- 15_gemm_all_reduce_ring_based: Matrix multiplication with ring-based all-reduce
- 16_all_reduce_ring_based: Ring-based all-reduce operation
- 20_gemm_all_scatter_independent: Independent GEMM and all-scatter operations with support for CSV input configurations
- 21_gemm_one_shot_all_reduce_independent: Independent GEMM and all-reduce operations with support for CSV input configurations and selective execution
- ccl: iris-ccl collective communication operations (all-to-all, etc.)
- benchmark: Benchmarking utilities and performance testing tools
- common: Common utilities and shared code for examples
```bash
# Example command to run distributed load operations
python examples/00_load/load_bench.py --num_ranks 8          # Load across GPUs
python examples/02_all_load/all_load_bench.py --num_ranks 8  # Simultaneous load on all GPUs

# Example command to run distributed store operations
python examples/01_store/store_bench.py --num_ranks 8          # Store across GPUs
python examples/03_all_store/all_store_bench.py --num_ranks 8  # Simultaneous store on all GPUs

# Example command to run atomic operations
python examples/04_atomic_add/atomic_add_bench.py --num_ranks 8    # Atomic add across GPUs
python examples/05_atomic_xchg/atomic_xchg_bench.py --num_ranks 8  # Atomic exchange across GPUs

# Example command to run message passing
python examples/06_message_passing/message_passing_put.py --num_ranks 8
python examples/06_message_passing/message_passing_load_store.py --num_ranks 8
```
```bash
# Example command to run benchmark with all-scatter algorithm
python examples/07_gemm_all_scatter/benchmark.py --benchmark --validate --num_ranks 8
```
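The all-scatter pattern can be sketched as a plain NumPy simulation (conceptual only, not the actual Triton kernels; `gemm_all_scatter` and the rank loop are illustrative): each rank computes one row-block of C and writes it into every rank's full-size output buffer, so no reduction is needed.

```python
import numpy as np

def gemm_all_scatter(A, B, num_ranks):
    """Conceptual simulation: each rank computes one row-block of C = A @ B
    and "scatters" (writes) that block into every rank's full-size output."""
    m, n = A.shape[0], B.shape[1]
    assert m % num_ranks == 0
    rows = m // num_ranks
    outputs = [np.zeros((m, n)) for _ in range(num_ranks)]  # one copy of C per rank
    for rank in range(num_ranks):
        block = A[rank * rows:(rank + 1) * rows] @ B        # local partial GEMM
        for dst in range(num_ranks):                        # scatter to every rank
            outputs[dst][rank * rows:(rank + 1) * rows] = block
    return outputs

rng = np.random.default_rng(0)
A, B = rng.random((8, 4)), rng.random((4, 6))
outs = gemm_all_scatter(A, B, num_ranks=4)
assert all(np.allclose(o, A @ B) for o in outs)  # every rank ends with the full C
```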
```bash
# Example command to run benchmark with all-reduce algorithm
python examples/08_gemm_atomics_all_reduce/benchmark.py --benchmark --validate --num_ranks 8

# Example command to run benchmark with one-shot all-reduce algorithm
python examples/09_gemm_one_shot_all_reduce/benchmark.py --benchmark --validate --num_ranks 8
```
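The one-shot variant can also be sketched conceptually in NumPy (an illustrative simulation, not the benchmark's kernels): the K dimension is split across ranks, each rank produces a full-size partial of C, and every rank then reads all partials in a single pass and reduces them locally, rather than accumulating via atomics or pipelining chunks around a ring.

```python
import numpy as np

def gemm_one_shot_all_reduce(A, B, num_ranks):
    """Conceptual simulation: split K across ranks; each rank computes a
    full-size partial of C, then reads every partial once and sums locally."""
    k = A.shape[1]
    assert k % num_ranks == 0
    ks = k // num_ranks
    partials = [A[:, r * ks:(r + 1) * ks] @ B[r * ks:(r + 1) * ks]
                for r in range(num_ranks)]
    # "One shot": each rank gathers all partials directly and reduces in one pass.
    return [np.sum(partials, axis=0) for _ in range(num_ranks)]

rng = np.random.default_rng(0)
A, B = rng.random((8, 16)), rng.random((16, 4))
outs = gemm_one_shot_all_reduce(A, B, num_ranks=4)
assert all(np.allclose(o, A @ B) for o in outs)  # the partials sum to the full C
```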
```bash
# Example command to run benchmark with all-scatter and workgroup specialization
python examples/10_gemm_all_scatter_wg_specialization/benchmark.py --benchmark --validate --num_ranks 8

# Example command to run benchmark with all-scatter producer-consumer pattern
python examples/11_gemm_all_scatter_producer_consumer/benchmark.py --benchmark --validate --num_ranks 8

# Example command to run benchmark with all-scatter bulk synchronous approach
python examples/12_gemm_all_scatter_bulk_synchronous/benchmark.py --benchmark --validate --num_ranks 8

# Flash Decode Attention - simple example run
python examples/13_flash_decode/example_run.py --num_ranks 8

# All-Gather + GEMM - Pull model
python examples/14_all_gather_gemm/example_run_pull.py --num_ranks 8

# All-Gather + GEMM - Push model
python examples/14_all_gather_gemm/example_run_push.py --num_ranks 8
```
```bash
# Example command to run benchmark with ring-based all-reduce for GEMM
python examples/15_gemm_all_reduce_ring_based/benchmark.py --benchmark --validate --num_ranks 8

# Example command to run benchmark with ring-based all-reduce
python examples/16_all_reduce_ring_based/benchmark.py --benchmark --validate --num_ranks 8
```
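A ring-based all-reduce is classically a reduce-scatter phase followed by an all-gather phase, each taking p-1 steps around a ring of p ranks. This NumPy sketch simulates the data movement sequentially (the real implementation runs the steps concurrently on the GPUs; `ring_all_reduce` and its chunk indexing are illustrative):

```python
import numpy as np

def ring_all_reduce(local_vectors):
    """Simulate ring all-reduce: reduce-scatter, then all-gather, each in
    p - 1 steps around a ring of p ranks."""
    p = len(local_vectors)
    # Each rank splits its vector into p chunks.
    buf = [[c.astype(float) for c in np.array_split(v, p)] for v in local_vectors]

    # Reduce-scatter: at step s, rank r sends chunk (r - s) % p to rank r + 1,
    # which accumulates it. Afterwards rank r owns the full sum of chunk (r + 1) % p.
    for step in range(p - 1):
        sends = [buf[r][(r - step) % p].copy() for r in range(p)]  # snapshot first
        for r in range(p):
            buf[(r + 1) % p][(r - step) % p] += sends[r]

    # All-gather: at step s, rank r forwards its completed chunk (r + 1 - s) % p.
    for step in range(p - 1):
        sends = [buf[r][(r + 1 - step) % p].copy() for r in range(p)]
        for r in range(p):
            buf[(r + 1) % p][(r + 1 - step) % p] = sends[r]

    return [np.concatenate(chunks) for chunks in buf]

ranks = [np.arange(8, dtype=float) * (r + 1) for r in range(4)]
result = ring_all_reduce(ranks)
assert all(np.allclose(v, sum(ranks)) for v in result)  # every rank has the sum
```

Splitting into chunks keeps each rank's per-step traffic at 1/p of the vector, which is why the ring variant scales well for large messages.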
```bash
# Independent GEMM and all-scatter - single configuration
python examples/20_gemm_all_scatter_independent/benchmark.py --benchmark --validate --num_ranks 8

# Independent GEMM and all-scatter - sweep with CSV configurations
python examples/20_gemm_all_scatter_independent/benchmark.py --benchmark --validate --num_ranks 8 --csv dataset/gemm_config.csv

# Independent GEMM and all-reduce - run both operations
python examples/21_gemm_one_shot_all_reduce_independent/benchmark.py --benchmark --validate --num_ranks 8

# Independent GEMM and all-reduce - run only GEMM
python examples/21_gemm_one_shot_all_reduce_independent/benchmark.py --only_gemm --validate --num_ranks 8

# Independent GEMM and all-reduce - run only all-reduce
python examples/21_gemm_one_shot_all_reduce_independent/benchmark.py --only_comm --validate --num_ranks 8

# Independent GEMM and all-reduce - sweep with CSV configurations
python examples/21_gemm_one_shot_all_reduce_independent/benchmark.py --benchmark --num_ranks 8 --csv examples/21_gemm_one_shot_all_reduce_independent/example_config.csv
```
```bash
# All-to-all collective communication
python examples/ccl/benchmark.py --benchmark --validate -m 1024 -n 512 -r 8 --datatype fp32
```
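All-to-all is essentially a distributed transpose of data ownership: each rank's buffer is split into p blocks, and block d from every rank is delivered to rank d. A minimal NumPy simulation of the exchange (`all_to_all` here is illustrative, not the iris-ccl API):

```python
import numpy as np

def all_to_all(inputs):
    """Conceptual simulation: rank r's buffer is split into p blocks; block d
    from every rank goes to rank d, concatenated in source-rank order."""
    p = len(inputs)
    blocks = [np.array_split(v, p) for v in inputs]
    return [np.concatenate([blocks[src][dst] for src in range(p)])
            for dst in range(p)]

# 4 ranks, each holding 8 elements tagged with their source rank.
inputs = [np.full(8, float(r)) for r in range(4)]
outputs = all_to_all(inputs)
# Each output interleaves two elements from every source rank, in rank order.
assert np.allclose(outputs[0], [0, 0, 1, 1, 2, 2, 3, 3])
```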
Note: Only examples 20 and 21 support loading multiple configurations from a CSV file via the `--csv` argument.
Example 20 CSV format:

```csv
m,n,k,datatype,blk_m,blk_n,blk_k,gemm_sms,comm_sms
8192,4608,36864,fp16,256,64,64,256,48
8192,4096,12288,fp32,256,128,64,256,48
4096,4096,8192,bf16,128,128,64,240,56
```

Example 21 CSV format:

```csv
m,n,k,datatype,blk_m,blk_n,blk_k,gemm_sms,comm_sms
8192,4608,36864,fp16,256,64,64,256,48
4096,4096,12288,fp32,128,128,64,240,56
```
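Configuration files in this layout can be read with Python's standard `csv` module; the sketch below (`load_configs` is a hypothetical helper, not the benchmarks' actual parser) shows one row per benchmark configuration, with every column except `datatype` parsed as an integer:

```python
import csv
import io

# Sample rows in the CSV layout shown above.
CONFIG_TEXT = """m,n,k,datatype,blk_m,blk_n,blk_k,gemm_sms,comm_sms
8192,4608,36864,fp16,256,64,64,256,48
8192,4096,12288,fp32,256,128,64,256,48
"""

def load_configs(text):
    """Parse each CSV row into a dict, converting numeric fields to int."""
    return [{key: (val if key == "datatype" else int(val))
             for key, val in row.items()}
            for row in csv.DictReader(io.StringIO(text))]

configs = load_configs(CONFIG_TEXT)
assert configs[0]["m"] == 8192 and configs[0]["datatype"] == "fp16"
assert configs[1]["blk_m"] == 256 and configs[1]["comm_sms"] == 48
```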