Fix the MultiStages for intranode combine and the offset_int4 for int… #527
internode_ll.cu:
Wouldn't the `lane_id * kNumSendUnrolls` part cause a discrepancy?
`elect_one_sync()` always elects lane 0, so it's equivalent to: `const auto& offset_int4 = (iter_idx + 1) * 32 * kNumSendUnrolls` (#525)
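A minimal standalone sketch of that equivalence (not the actual internode_ll.cu code; the kernel name, the loop, and the `kNumSendUnrolls` value are illustrative assumptions): inside a branch that only the elected lane 0 executes, a `lane_id * kNumSendUnrolls` term contributes nothing, so the offset collapses to the per-iteration constant.

```cuda
// Hypothetical demo, not the DeepEP kernel. It only shows why a lane_id term is inert
// inside a branch that the elected lane (lane 0) alone executes.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kNumSendUnrolls = 4;  // illustrative value

__global__ void offset_demo(int num_iters) {
    const int lane_id = threadIdx.x % 32;
    for (int iter_idx = 0; iter_idx < num_iters; ++iter_idx) {
        // Stand-in for elect_one_sync(): with a full active mask it elects lane 0.
        if (lane_id == 0) {
            // A "+ lane_id * kNumSendUnrolls" term would be zero here, so the offset
            // is just the per-iteration constant.
            const int offset_int4 = (iter_idx + 1) * 32 * kNumSendUnrolls;
            printf("iter %d -> offset_int4 %d\n", iter_idx, offset_int4);
        }
        __syncwarp();
    }
}

int main() {
    offset_demo<<<1, 32>>>(3);
    cudaDeviceSynchronize();
    return 0;
}
```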
#359
Each invocation of `tma_store_1d` issues a TMA bulk_group, and multiple TMA bulk_groups issued by the same thread execute serially (see the NVIDIA PTX documentation, 9.7.9.27.2.2 "Data Movement and Conversion Instructions: cp.async.bulk.wait_group"). Because `elect_one_sync()` always selects lane 0, the `tma_store_1d` inside `if (elect_one_sync())` is always issued by that same lane and therefore serialized, preventing a true MultiStage pipeline. Assigning the first kNumStages lanes to distinct stages enables genuine MultiStage concurrency, as sketched below.
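A schematic sketch of that restructuring (an assumption about the shape of the change, not the actual kernel; the `tma_store_1d`, `tma_store_wait`, and `tma_store_fence` signatures stand in for the repository's device helpers):

```cuda
// Schematic only: the first kNumStages lanes each own one pipeline stage, so TMA stores
// for different stages are issued by different lanes and their bulk_groups no longer
// serialize on a single lane. Helper signatures are assumptions based on this thread.
template <int kNumStages>
__device__ void multistage_store_sketch(void* smem_stage[kNumStages], char* gmem_base,
                                        int num_iters, int bytes_per_stage, int lane_id) {
    for (int iter_idx = 0; iter_idx < num_iters; ++iter_idx) {
        const int stage_idx = iter_idx % kNumStages;

        // The lane owning this stage drains its own previous store before the stage
        // buffer is refilled; other lanes have nothing pending for this stage.
        if (lane_id == stage_idx)
            tma_store_wait<0>();
        __syncwarp();

        // ... the warp fills smem_stage[stage_idx] for this iteration ...

        tma_store_fence();  // make the shared-memory writes visible to the async proxy
        __syncwarp();

        // Issue the store from the owning lane only; stores for the other stages are
        // in flight from their own lanes, giving genuine MultiStage overlap.
        if (lane_id == stage_idx)
            tma_store_1d(smem_stage[stage_idx],
                         gmem_base + iter_idx * bytes_per_stage, bytes_per_stage);
    }
}
```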
Furthermore, the inline assembly in `tma_store_wait` (`asm volatile("cp.async.bulk.wait_group.read %0;" :: "n"(N) : "memory")`) waits until all but the most recent N TMA bulk_groups committed by the executing thread have finished reading their data out of shared memory. If the first kNumStages lanes each own one stage, then a lane that issues `tma_store_1d` only needs to wait for the single bulk_group it issued in the previous iteration for its own stage of shared memory.

Finally, the `__syncwarp()` immediately after `tma_store_wait()` is unnecessary, because `tma_store_fence()` followed by `__syncwarp()` is already sufficient for visibility.
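For reference, a plausible shape of the `tma_store_wait` helper under discussion (an assumption, not necessarily the repository's exact definition); with one stage per lane, each owning lane can use N = 0 and only drains the single store it issued itself:

```cuda
// Assumed wrapper around the quoted inline assembly: the calling thread blocks until
// at most N of its committed TMA-store bulk_groups still have outstanding reads from
// shared memory. N must be a compile-time constant because of the "n" constraint.
template <int N = 0>
__device__ __forceinline__ void tma_store_wait() {
    asm volatile("cp.async.bulk.wait_group.read %0;" :: "n"(N) : "memory");
}
```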