Conversation

gonzalobg
Contributor

  • Refactor all kernels into a generic "parallel for" algorithm (sketched below)
    • Supports grid-stride and block-stride loops, configurable with a model flag
    • Handles devices of different sizes via the occupancy APIs
  • Refactor the memory allocation APIs
  • Print more GPU details, in particular the theoretical peak bandwidth in GB/s of the current device, queried via the NVML library (which ships with the CUDA Toolkit and is therefore always available); see the second sketch below
  • Fix 2 bugs:
    • Print the "order" used to run the benchmarks (e.g. classic vs isolated)
    • Fix a division-by-zero bug in the solution checking
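
The PR's actual implementation isn't shown in this thread; below is a minimal sketch of what such a generic parallel-for could look like in CUDA. The `StrideMode` flag, the `parallel_for` wrapper, and the fixed block size of 256 are illustrative assumptions, not the PR's real API; `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is a real CUDA runtime call.

```cuda
#include <cuda_runtime.h>

// Illustrative stride policy; the PR's actual flag and names may differ.
enum class StrideMode { Grid, Block };

template <StrideMode Mode, typename F>
__global__ void parallel_for_kernel(size_t n, F f) {
  if constexpr (Mode == StrideMode::Grid) {
    // Grid-stride: each thread strides over the whole index space,
    // so any grid size covers any problem size.
    for (size_t i = size_t(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
         i += size_t(gridDim.x) * blockDim.x) {
      f(i);
    }
  } else {
    // Block-stride: each block owns a contiguous chunk of the index
    // space and its threads stride over that chunk.
    size_t chunk = (n + gridDim.x - 1) / gridDim.x;
    size_t begin = size_t(blockIdx.x) * chunk;
    size_t end   = min(begin + chunk, n);
    for (size_t i = begin + threadIdx.x; i < end; i += blockDim.x) {
      f(i);
    }
  }
}

template <StrideMode Mode, typename F>
void parallel_for(size_t n, F f) {
  int block = 256;  // illustrative fixed block size
  // Size the grid from the occupancy API so the same launch saturates
  // devices with different SM counts.
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, parallel_for_kernel<Mode, F>, block, 0);
  int dev = 0;
  cudaGetDevice(&dev);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, dev);
  int grid = blocks_per_sm * prop.multiProcessorCount;
  parallel_for_kernel<Mode><<<grid, block>>>(n, f);
}
```

Under these assumptions, a triad-style kernel would be invoked as `parallel_for<StrideMode::Grid>(n, [=] __device__ (size_t i) { c[i] = a[i] + scalar * b[i]; });`, built with nvcc's `--extended-lambda` flag. Deriving the grid size from the occupancy query is what lets one launch configuration handle devices of different sizes.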
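
Likewise, a hedged sketch of querying the theoretical peak bandwidth via NVML. `nvmlDeviceGetMaxClockInfo` and `nvmlDeviceGetMemoryBusWidth` are real NVML calls; the data-rate factor of 2 is an assumption, since the true multiplier depends on the memory technology.

```cuda
#include <cstdio>
#include <nvml.h>

// Build with: nvcc peak_bw.cpp -lnvml
// (NVML headers and libraries ship with the CUDA Toolkit.)
int main() {
  if (nvmlInit() != NVML_SUCCESS) return 1;
  nvmlDevice_t dev;
  if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

  unsigned int mem_clock_mhz = 0, bus_width_bits = 0;
  nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_MEM, &mem_clock_mhz);
  nvmlDeviceGetMemoryBusWidth(dev, &bus_width_bits);

  // Peak BW = memory clock x bus width x data-rate factor. The factor
  // of 2 assumes double-data-rate memory; the real multiplier depends
  // on the memory technology (GDDR6X, HBM, ...), so this is approximate.
  double gbps = mem_clock_mhz * 1e6 * (bus_width_bits / 8.0) * 2.0 / 1e9;
  printf("Theoretical peak bandwidth: %.1f GB/s\n", gbps);

  nvmlShutdown();
  return 0;
}
```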

@gonzalobg
Contributor Author

This was passing. It seems like this and other PRs are failing spuriously due to some cache issue. @tom91136 @tomdeakin

@gonzalobg
Contributor Author

Closing in favor of #202.

@gonzalobg closed this Jun 5, 2024