.. Copyright Contributors to the oneDAL project
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.

#####
Ideas
#####

As an open-source project, we welcome community contributions to Intel(R) Extension for Scikit-learn.
This document suggests contribution directions which we consider good introductory projects with meaningful
impact. You can directly contribute to next-generation supercomputing, or simply learn in depth about key
aspects of performant machine learning code for a range of architectures. This list is expected to evolve;
the currently available projects are described in the latest version of the documentation.

Every project is labeled with one of three tiers based on the time commitment: 'small' (90 hours), 'medium'
(175 hours) or 'large' (350 hours). Related topics can be combined into larger packages, though the
commitments are not completely additive due to similarity in scope (e.g. three 'smalls' may make a 'medium'
given the shared learning curve). Others may increase in difficulty as the scope increases (some 'smalls'
may become 'large' with in-depth C++ coding). Each idea has a linked GitHub issue, a description, a
difficulty, and possibly an extended goal. They are grouped by relative similarity to allow for easy
combinations.

Implement Covariance Estimators for Supercomputers
--------------------------------------------------

The Intel(R) Extension for Scikit-learn contains an MPI-enabled covariance algorithm, showing high performance
on everything from single-board computers to multi-node clusters. It directly matches the capabilities of
Scikit-learn's EmpiricalCovariance estimator. A number of closely related algorithms modify the outputs of
EmpiricalCovariance and can therefore be built on our implementation, including the Oracle Approximating
Shrinkage (OAS) and Shrunk Covariance (ShrunkCovariance) algorithms. Adding these algorithms to our codebase
will assist the community in their analyses. The total combined work of the two sub-projects is an easy
difficulty with a medium time requirement. With the extended goals, it becomes a hard difficulty with a large
time requirement.

Oracle Approximating Shrinkage Estimator (small)
************************************************

The output of EmpiricalCovariance is regularized toward a scaled identity matrix by a shrinkage value
estimated from the data. The goal would be to implement this estimator with post-processing changes to the
fitted empirical covariance. This project is very similar to the ShrunkCovariance project, and the two would
combine into a medium project. When implemented in Python, reusing our EmpiricalCovariance estimator, this
would be an easy project with a small time commitment. Implementing the supercomputing distributed version in
Python would only work for distributed-aware frameworks. The extended goals would make this a hard-difficulty,
medium-commitment project: they require implementing the regularization in C++ in oneDAL for both CPU and
GPU, and then making it available in scikit-learn-intelex as a new estimator. This would ideally follow the
design strategy used for our Ridge Regression estimator.
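
As a sketch of the post-processing involved, the OAS shrinkage can be applied on top of an
already-fitted empirical covariance with plain NumPy. The formula below follows the published OAS
estimator (Chen et al., 2010); the function name and signature are illustrative, not the project's
final API:

```python
import numpy as np


def oas_shrunk_covariance(emp_cov, n_samples):
    """Illustrative post-processing: shrink a fitted empirical covariance
    toward a scaled identity with the OAS-estimated shrinkage intensity."""
    p = emp_cov.shape[0]
    mu = np.trace(emp_cov) / p  # scale of the shrinkage target mu * I
    alpha = np.mean(emp_cov ** 2)
    num = alpha + mu ** 2
    den = (n_samples + 1.0) * (alpha - (mu ** 2) / p)
    shrinkage = 1.0 if den == 0 else min(num / den, 1.0)
    shrunk = (1.0 - shrinkage) * emp_cov
    shrunk[np.diag_indices(p)] += shrinkage * mu  # add shrinkage * mu * I
    return shrunk, shrinkage
```

Note that this transform preserves the trace of the input covariance, which makes a convenient
sanity check when writing tests for the new estimator.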

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2305>`__.


ShrunkCovariance Estimator (small)
**********************************

The output of EmpiricalCovariance is regularized toward a scaled identity matrix by a user-set shrinkage
value. The goal would be to implement this estimator with post-processing changes to the fitted empirical
covariance. This project is very similar to the OAS project, and the two would combine into a medium project.
When implemented in Python, reusing our EmpiricalCovariance estimator, this would be an easy project with a
small time commitment. Implementing the supercomputing distributed version in Python would only work for
distributed-aware frameworks. The extended goals would make this a hard-difficulty, medium-commitment
project: they require implementing the regularization in C++ in oneDAL for both CPU and GPU, and then making
it available in scikit-learn-intelex as a new estimator. This would ideally follow the design strategy used
for our Ridge Regression estimator.
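
Here the post-processing is even simpler, since the shrinkage coefficient is a fixed hyperparameter
rather than being estimated from the data. A NumPy sketch (the function name is illustrative):

```python
import numpy as np


def shrunk_cov(emp_cov, shrinkage=0.1):
    """Illustrative post-processing: convex combination of the fitted
    empirical covariance and the scaled identity target mu * I,
    where mu is the mean of the diagonal (the average variance)."""
    p = emp_cov.shape[0]
    mu = np.trace(emp_cov) / p
    shrunk = (1.0 - shrinkage) * emp_cov
    shrunk[np.diag_indices(p)] += shrinkage * mu
    return shrunk
```

At ``shrinkage=0`` the input is returned unchanged, and at ``shrinkage=1`` the result collapses to
``mu * I``; both extremes are useful unit tests.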

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2306>`__.


Implement the Preprocessing Estimators for Supercomputers
---------------------------------------------------------

The Intel(R) Extension for Scikit-learn contains two unique estimators used to get vital metrics from large
datasets, known as BasicStatistics and IncrementalBasicStatistics. They generate relevant values like 'min',
'max', 'mean' and 'variance' with a special focus on multithreaded performance. They are also MPI-enabled,
working on everything from single-board computers to multi-node clusters, and can prove very useful for
important big-data pre-processing steps which may otherwise be unwieldy. Several pre-processing algorithms
in Scikit-learn use these basic metrics, and BasicStatistics could be used instead. The overall goal would
be to use the online version, IncrementalBasicStatistics, to create advanced pre-processing
scikit-learn-intelex estimators which can be used on supercomputing clusters. The difficulty of this project
is easy, with a combined time commitment of a large project. It does not have any extended goals.


StandardScaler Estimator (small)
********************************

The StandardScaler estimator scales the data to zero mean and unit variance. Use the
IncrementalBasicStatistics estimator to generate the mean and variance used to scale the data. Investigate
where the new implementation may have low performance, and include guards in the code to fall back to
Scikit-learn as necessary. The final deliverable would be to add this estimator to the 'spmd' interfaces,
which are effective on MPI-enabled supercomputers; this will use the underlying MPI-enabled mean and variance
calculators in IncrementalBasicStatistics. This is an easy-difficulty project, and would be a medium time
commitment when combined with other pre-processing projects.
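
The intended split of work can be illustrated with a NumPy stand-in for the incremental statistics
pass: accumulate per-feature running sums over chunks (as an online ``partial_fit`` would), then
scale with the accumulated mean and variance. The function names here are illustrative, not the
actual IncrementalBasicStatistics API:

```python
import numpy as np


def accumulate_mean_var(chunks):
    """Stand-in for the incremental statistics pass: one sweep over the
    data in chunks, keeping only running sums (the kind of partial
    result that can also be combined across MPI ranks)."""
    n, s, ss = 0, None, None
    for X in chunks:
        n += X.shape[0]
        s = X.sum(axis=0) if s is None else s + X.sum(axis=0)
        ss = (X ** 2).sum(axis=0) if ss is None else ss + (X ** 2).sum(axis=0)
    mean = s / n
    var = ss / n - mean ** 2  # biased (population) variance
    return mean, var


def standard_scale(X, mean, var):
    """StandardScaler transform: zero mean, unit variance per feature."""
    return (X - mean) / np.sqrt(var)
```

The same two-phase shape (accumulate statistics, then transform) carries over to the other scaler
projects below.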

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2307>`__.


MaxAbsScaler Estimator (small)
******************************

The MaxAbsScaler estimator scales the data by its maximum absolute value. Use the
IncrementalBasicStatistics estimator to generate the min and max used to scale the data. Investigate where
the new implementation may have low performance, and include guards in the code to fall back to Scikit-learn
as necessary. The final deliverable would be to add this estimator to the 'spmd' interfaces, which are
effective on MPI-enabled supercomputers; this will use the underlying MPI-enabled minimum and maximum
calculators in IncrementalBasicStatistics. This is similar to the MinMaxScaler project, and the two can be
combined into a single small project. This is an easy-difficulty project.
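
One detail worth noting: the per-feature maximum absolute value does not need a dedicated
calculator, since it is recoverable from the column min and max that the statistics pass already
produces. A NumPy sketch (names illustrative):

```python
import numpy as np


def max_abs_scale(X, col_min, col_max):
    """MaxAbsScaler transform driven by precomputed column min/max."""
    # max|x| per column follows directly from the column min and max.
    max_abs = np.maximum(np.abs(col_min), np.abs(col_max))
    max_abs = np.where(max_abs == 0, 1.0, max_abs)  # leave all-zero columns as-is
    return X / max_abs
```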

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2308>`__.

MinMaxScaler Estimator (small)
******************************

The MinMaxScaler estimator scales the data to a range set by the minimum and maximum. Use the
IncrementalBasicStatistics estimator to generate the min and max used to scale the data. Investigate where
the new implementation may have low performance, and include guards in the code to fall back to Scikit-learn
as necessary. The final deliverable would be to add this estimator to the 'spmd' interfaces, which are
effective on MPI-enabled supercomputers; this will use the underlying MPI-enabled minimum and maximum
calculators in IncrementalBasicStatistics. This is similar to the MaxAbsScaler project, and the two can be
combined into a single small project. This is an easy-difficulty project.
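
A NumPy sketch of the transform once the per-feature min and max are available (names are
illustrative; the ``feature_range`` parameter mirrors Scikit-learn's MinMaxScaler):

```python
import numpy as np


def min_max_scale(X, col_min, col_max, feature_range=(0.0, 1.0)):
    """MinMaxScaler transform driven by precomputed column min/max."""
    lo, hi = feature_range
    # Guard constant columns to avoid division by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return lo + (X - col_min) * (hi - lo) / span
```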

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2309>`__.

Normalizer Estimator (small)
****************************

The Normalizer estimator scales each sample independently by the sample's norm (l1, l2). Use the
IncrementalBasicStatistics estimator to generate the sum of squared data and use it to implement only the l2
version of the Normalizer. Investigate where the new implementation may have low performance, and include
guards in the code to fall back to Scikit-learn as necessary. The final deliverable would be to add this
estimator to the 'spmd' interfaces, which are effective on MPI-enabled supercomputers; this will use the
underlying MPI-enabled calculators in IncrementalBasicStatistics. This is an easy-difficulty project, and
would be a medium time commitment when combined with other pre-processing projects.
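
For the l2 case the transform itself is compact; a NumPy reference (illustrative) that can serve as
a correctness baseline for the new estimator:

```python
import numpy as np


def l2_normalize(X):
    """Scale each sample (row) to unit l2 norm; zero rows are left unchanged."""
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    norms = np.where(norms == 0, 1.0, norms)
    return X / norms
```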

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2310>`__.


Expose Accelerated Kernel Distance Functions
--------------------------------------------

The Intel(R) Extension for Scikit-learn contains several kernel functions which have not been made available
in our public API but are available in our onedal package. Making these available to users is an easy,
Python-only project good for learning about Scikit-learn, testing, and the underlying math of kernels. The
goal would be to make them available in a fashion similar to Scikit-learn's. Their general nature gives them
high utility for both Scikit-learn and scikit-learn-intelex, as they can be used as plugins for a number of
other estimators (see the kernel trick).


sigmoid_kernel Function (small)
*******************************

The sigmoid kernel converts data into a new space via tanh. This is an easy-difficulty project, but it
requires significant benchmarking to find when the scikit-learn-intelex implementation provides better
performance. The project will focus on the public API and on incorporating the benchmarking results for a
seamless, high-performance user experience. Combined with the other kernel projects, it becomes a medium
time commitment.
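
The underlying math is compact. A NumPy reference matching the standard sigmoid-kernel definition,
useful as a correctness baseline when benchmarking the accelerated version (the signature mirrors
Scikit-learn's ``sigmoid_kernel``):

```python
import numpy as np


def sigmoid_kernel(X, Y, gamma=1.0, coef0=1.0):
    """K(x, y) = tanh(gamma * <x, y> + coef0), for all pairs of rows."""
    return np.tanh(gamma * (X @ Y.T) + coef0)
```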

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2311>`__.


polynomial_kernel Function (small)
**********************************

The polynomial kernel converts data into a new space via a polynomial. This is an easy-difficulty project,
but it requires significant benchmarking to find when the scikit-learn-intelex implementation provides
better performance. The project will focus on the public API and on incorporating the benchmarking results
for a seamless, high-performance user experience. Combined with the other kernel projects, it becomes a
medium time commitment.
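
As with the sigmoid kernel, a NumPy reference of the standard definition makes a handy baseline (the
signature mirrors Scikit-learn's ``polynomial_kernel``):

```python
import numpy as np


def polynomial_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree, for all pairs of rows."""
    return (gamma * (X @ Y.T) + coef0) ** degree
```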

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2312>`__.


rbf_kernel Function (small)
***************************

The rbf kernel converts data into a new space via a radial basis function. This is an easy-difficulty
project, but it requires significant benchmarking to find when the scikit-learn-intelex implementation
provides better performance. The project will focus on the public API and on incorporating the benchmarking
results for a seamless, high-performance user experience. Combined with the other kernel projects, it
becomes a medium time commitment.
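
A NumPy reference of the standard RBF definition, written with the usual vectorized squared-distance
expansion rather than explicit loops (the signature mirrors Scikit-learn's ``rbf_kernel``):

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2), for all pairs of rows."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * (X @ Y.T)
    )
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```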

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2313>`__.