-
In 7_GPU_and_Specialized_Hardware I noticed that the thread num is set differently for different models. My question: to maximize parallelism, shouldn't we use all the threads in a thread block (e.g., on a streaming multiprocessor)? Why does thread num differ?
-
On a GPU, a warp is 32 threads; see https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#differences-between-host-and-device. Threads beyond 32 do not run in lockstep with the first 32; they form additional warps whose execution may be serialized by the hardware.
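To make the warp grouping concrete, here is a minimal CUDA sketch (the kernel name `show_warps` is made up for illustration): each thread computes which warp it falls into, so a 64-thread block visibly splits into two warps of 32.

```cuda
#include <cstdio>

// Each thread computes its warp index; the first lane of every warp
// prints, so a 64-thread block reports two warps (0 and 1).
__global__ void show_warps() {
    int warp_id = threadIdx.x / warpSize;   // warpSize is 32 on current NVIDIA GPUs
    int lane    = threadIdx.x % warpSize;   // position within the warp
    if (lane == 0) {
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp_id, threadIdx.x);
    }
}

int main() {
    show_warps<<<1, 64>>>();   // one block of 64 threads -> 2 warps
    cudaDeviceSynchronize();   // flush device-side printf
    return 0;
}
```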
-
A thread block can contain multiple warps, and the GPU's warp schedulers decide which of those warps run at any given time (see https://www.cs.ucr.edu/~nael/217-f19/lectures/WarpScheduling.pptx).
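As a rough sketch of why the "best" thread count differs per kernel (assumptions: a CUDA toolchain, and `scale` is a hypothetical kernel, not one from 7_GPU_and_Specialized_Hardware): how many warps an SM can keep resident depends on each kernel's register and shared-memory usage, so the occupancy-maximizing block size is kernel-specific. The runtime can suggest one via `cudaOccupancyMaxPotentialBlockSize`.

```cuda
#include <cstdio>

// Trivial kernel: each thread scales one element.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    // Ask the runtime for a block size that maximizes occupancy for
    // this kernel on this GPU. The answer depends on the kernel's
    // register and shared-memory footprint, which is one reason
    // different kernels (models) end up with different thread nums.
    int min_grid = 0, block = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scale, 0, 0);
    printf("suggested block size: %d threads (%d warps)\n", block, block / 32);

    int grid = (n + block - 1) / block;   // enough blocks to cover n elements
    scale<<<grid, block>>>(x, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```

So a bigger block is not automatically better: past the point where the SM's warps hide memory latency, larger blocks can lower occupancy by exhausting registers or shared memory.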