WIP. A place to test and discuss ideas for implementing decision trees for tuning (starting with batch GEMM and GEMV).

The current decision tree structure requires 4 arrays, like the ones used by the trees in scikit-learn. I modified one of the arrays slightly, as described in the comment documentation for `evaluate_gemm_tree`, but the others can be output directly from the scikit-learn tree.

To begin discussion, I added an example of a new `*gemm_batched_core` setup for Z and C. (I realized I'll probably have to keep changing things as we develop this, possibly including the number of configurations we want to instantiate for each precision+transpose+transpose combo, and I want to limit how many times I have to change everything for all 4 precisions.)

One problem with this arises if we want to instantiate different sets of kernels for different architectures: we might end up with a lot of instantiations unless, e.g., we can add compile-time guards (maybe with `GPU_TARGET`, which we don't currently do anything with when building for SYCL). Even just for PVC, we may want to cut down from what I have here, since there are 40 configurations for each precision, plus the various conjugate options for the complex types...
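
For discussion, here is a minimal sketch of how a tree stored as four flat arrays could be walked to pick a kernel configuration. It is not the `evaluate_gemm_tree` from this PR. The array names follow scikit-learn's `tree_` attributes (`children_left`, `children_right`, `feature`, `threshold`), as does the "go left if `x[feature] <= threshold`" rule and the `-1` child marker for leaves. Everything else is an assumption for illustration: the `GemmTreeSketch` struct, the `evaluate_tree_sketch` name, the choice of `(m, n, k)` as input features, and in particular the idea of storing the configuration index in `feature[]` at leaf nodes, which may or may not match the array modification described in the PR's comments.

```cpp
#include <cassert>

// Hypothetical container for the four flat arrays describing one tree.
// Assumed encoding (illustrative only):
//   - children_left[node] == -1 marks a leaf (as in scikit-learn);
//   - at an internal node, feature[node] indexes into the input vector x;
//   - at a leaf, feature[node] is reused to hold the kernel-configuration index.
struct GemmTreeSketch {
    const int*    children_left;
    const int*    children_right;
    const int*    feature;
    const double* threshold;
};

// Walk from the root (node 0) to a leaf and return the configuration index
// stored there. x holds the problem features, e.g. {m, n, k}.
inline int evaluate_tree_sketch(const GemmTreeSketch& tree, const double* x)
{
    int node = 0;
    while (tree.children_left[node] != -1) {
        const int f = tree.feature[node];
        assert(f >= 0);  // internal nodes must reference a valid feature
        // scikit-learn convention: samples with x[f] <= threshold go left.
        node = (x[f] <= tree.threshold[node]) ? tree.children_left[node]
                                              : tree.children_right[node];
    }
    return tree.feature[node];  // leaf: assumed to hold the config index
}
```

With an encoding like this, the selected index could feed a `switch` over the instantiated configurations, so that changing the tree only means regenerating the arrays rather than touching the dispatch code for each precision.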