Matrix Transpose Tutorial Cleanup

I found a couple of things while looking at the matrix transpose tutorial.

First, the launch and kernel solutions could use block_unchecked policies. This would also let the kernel implementation skip its second thread synchronization call.

Second, the launch solution does not appear to use shared memory as intended: the same thread that reads a value into the tile also writes that value out. The point of shared memory here is that the thread writing a value is different from the thread that read it, so that memory accesses to both the input and output matrices are coalesced. Fixing this will require a teamSync call in the launch solution, which it currently lacks.
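To make the second point concrete, here is a rough sketch of the access pattern I have in mind. It is plain serial C++ rather than RAJA code, and it is not taken from the tutorial: the function name transpose_tiled, the TILE_DIM value, and the loop variables are all made up for illustration. The loops over (ty, tx) stand in for the threads of one block, the local tile array stands in for shared memory, and the gap between the two phases is where a barrier such as teamSync would be needed on a GPU.

```cpp
#include <vector>

// Hypothetical tile size; the tutorial's actual value may differ.
constexpr int TILE_DIM = 4;

// Serial emulation of a tiled transpose. A is n_rows x n_cols (row-major),
// At is n_cols x n_rows (row-major). (bx, by) emulate block indices and
// (tx, ty) emulate thread indices within a block.
void transpose_tiled(const std::vector<int>& A, std::vector<int>& At,
                     int n_rows, int n_cols)
{
  const int blocks_y = (n_rows + TILE_DIM - 1) / TILE_DIM;
  const int blocks_x = (n_cols + TILE_DIM - 1) / TILE_DIM;

  for (int by = 0; by < blocks_y; ++by) {
    for (int bx = 0; bx < blocks_x; ++bx) {
      int tile[TILE_DIM][TILE_DIM] = {};  // stands in for shared memory

      // Phase 1: "thread" (tx, ty) loads A(row, col). Consecutive tx
      // values touch consecutive columns of A, so reads are contiguous.
      for (int ty = 0; ty < TILE_DIM; ++ty) {
        for (int tx = 0; tx < TILE_DIM; ++tx) {
          const int row = by * TILE_DIM + ty;
          const int col = bx * TILE_DIM + tx;
          if (row < n_rows && col < n_cols) {
            tile[ty][tx] = A[row * n_cols + col];
          }
        }
      }

      // On a GPU a barrier is required here, because phase 2 reads tile
      // entries that were written by a different thread in phase 1.

      // Phase 2: "thread" (tx, ty) writes At(row_t, col_t) using the value
      // loaded by thread (ty, tx). Consecutive tx values touch consecutive
      // columns of At, so writes are contiguous as well.
      for (int ty = 0; ty < TILE_DIM; ++ty) {
        for (int tx = 0; tx < TILE_DIM; ++tx) {
          const int row_t = bx * TILE_DIM + ty;
          const int col_t = by * TILE_DIM + tx;
          if (row_t < n_cols && col_t < n_rows) {
            At[row_t * n_rows + col_t] = tile[tx][ty];
          }
        }
      }
    }
  }
}
```

The key detail is the tile[tx][ty] read in phase 2. If it were tile[ty][tx] with the output indices swapped instead, each thread would only ever touch the value it loaded itself, no barrier would be needed, and the writes to At would be strided. That is what the current launch solution appears to be doing.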

Matrix Transpose Tutorial Cleanup #1916

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Matrix Transpose Tutorial Cleanup #1916

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions