5 files changed: +277 -191 lines changed
@@ -7,6 +7,8 @@ build_release/
 
 # Temporary binaries
 /tmp/
-less_slow_from_ptx.cubin
 less_slow_from_cu.cubin
 less_slow_from_cu.ptx
+less_slow_sm70_from_ptx.cubin
+less_slow_sm80_from_ptx.cubin
+less_slow_sm90a_from_ptx.cubin
@@ -23,6 +23,14 @@
  * $ nvcc -arch=sm_90a -Xptxas -v -lineinfo -cubin -o less_slow_from_cu.cubin less_slow.cu
  * $ cuobjdump -sass less_slow_from_cu.cubin | grep -i mma
  *
+ * Assuming how aggressively NVCC unrolls loops and the number of kernels in
+ * this file, you may want to deduplicate them:
+ *
+ * $ cuobjdump -sass less_slow_from_cu.cubin | grep -i mma | \
+ * $ sed -r 's/\/\*[^*]+\*\///g' | \
+ * $ sed -r 's/^[[:space:]]+//; s/[[:space:]]+$//' | \
+ * $ sort -u
+ *
  * Keep in mind the following TC generations:
  *
  * - Volta SM70: 1st generation of TCs, server V100 cards.
@@ -18,8 +18,16 @@
  * You can validate this file by asking the Nvidia PTX Assembler to compile it
  * to `.cubin` for some target architecture:
  *
- * $ ptxas -o less_slow_from_ptx.cubin -arch=sm_70 less_slow_sm70.ptx
- * $ cuobjdump -sass less_slow_from_ptx.cubin | grep -i mma
+ * $ ptxas -o less_slow_sm70_from_ptx.cubin -arch=sm_70 less_slow_sm70.ptx
+ * $ cuobjdump -sass less_slow_sm70_from_ptx.cubin | grep -i mma
+ *
+ * Assuming how aggressively NVCC unrolls loops and the number of kernels in
+ * this file, you may want to deduplicate them:
+ *
+ * $ cuobjdump -sass less_slow_sm70_from_ptx.cubin | grep -i mma | \
+ * $ sed -r 's/\/\*[^*]+\*\///g' | \
+ * $ sed -r 's/^[[:space:]]+//; s/[[:space:]]+$//' | \
+ * $ sort -u
  *
  * @section Register File
  *
@@ -13,13 +13,13 @@
  * You can validate this file by asking the Nvidia PTX Assembler to compile it
  * to `.cubin` for some target architecture:
  *
- * $ ptxas -o less_slow_from_ptx.cubin -arch=sm_80 less_slow_sm80.ptx
- * $ cuobjdump -sass less_slow_from_ptx.cubin | grep -i mma
+ * $ ptxas -o less_slow_sm80_from_ptx.cubin -arch=sm_80 less_slow_sm80.ptx
+ * $ cuobjdump -sass less_slow_sm80_from_ptx.cubin | grep -i mma
  *
  * Assuming how aggressively NVCC unrolls loops and the number of kernels in
  * this file, you may want to deduplicate them:
  *
- * $ cuobjdump -sass less_slow_from_ptx.cubin | grep -i mma | \
+ * $ cuobjdump -sass less_slow_sm80_from_ptx.cubin | grep -i mma | \
  * $ sed -r 's/\/\*[^*]+\*\///g' | \
  * $ sed -r 's/^[[:space:]]+//; s/[[:space:]]+$//' | \
  * $ sort -u
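
The deduplication pipeline quoted in the comments above can be tried without a GPU toolchain. The following is a minimal sketch: the function names `dedup_mma` and `sample_sass` are hypothetical, and the SASS lines are a made-up sample standing in for real `cuobjdump -sass` output. It uses `sed -E`, which GNU sed accepts as an equivalent of `-r`, and drops the `$` from the continuation lines, since it reads as a prompt marker and not as part of the command.

```shell
#!/bin/sh
# Sketch of the pipeline from the doc comments: keep only MMA instructions,
# strip the /* ... */ address and encoding comments, trim surrounding
# whitespace, then keep one copy of each distinct instruction.
dedup_mma() {
    grep -i mma |
        sed -E 's/\/\*[^*]+\*\///g' |
        sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//' |
        LC_ALL=C sort -u
}

# Assumption: hand-written sample lines, not real tool output.
sample_sass() {
    cat <<'EOF'
        /*0090*/    HMMA.884.F32.F32.STEP0 R8, R2, R4, R8 ;    /* 0x0000000402087236 */
        /*00a0*/    MOV R1, c[0x0][0x28] ;                     /* 0x00000a0000017a02 */
        /*00b0*/    HMMA.884.F32.F32.STEP1 R10, R2, R4, R10 ;  /* 0x00000004020a7236 */
        /*00c0*/    HMMA.884.F32.F32.STEP0 R8, R2, R4, R8 ;    /* 0x0000000402087236 */
EOF
}

sample_sass | dedup_mma
# prints:
# HMMA.884.F32.F32.STEP0 R8, R2, R4, R8 ;
# HMMA.884.F32.F32.STEP1 R10, R2, R4, R10 ;
```

Against a real binary, the same function would be fed from `cuobjdump -sass <file>.cubin | dedup_mma`. The two `STEP0` lines collapse into one only because their address and encoding comments are removed before `sort -u` compares them.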