Commit eff8583

[docs] add Ideas to documentation (#2297)
* add ideas doc * add links * change copyright * revert change * fix mistakes * fix failures * add issue links and fix a spelling mistake: * make anonymous * remove space
1 parent 96cd7af commit eff8583

3 files changed, +180 -0 lines changed

CONTRIBUTING.md

Lines changed: 4 additions & 0 deletions
@@ -53,6 +53,10 @@ A GitHub* Action verifies if your changes comply with the output of the auto-for

 Optionally, you can install pre-commit hooks that do the formatting for you. For this, run from the top level of the repository:

+## Ideas
+
+If you want to contribute but do not know where to start, we maintain a [public list](https://uxlfoundation.github.io/scikit-learn-intelex/latest/ideas.html) of projects in our documentation, each labeled with its difficulty and effort. These ideas have linked issues on GitHub where you can message us for next steps.
+
 ```bash
 pip install pre-commit
 pre-commit install

doc/sources/ideas.rst

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
.. Copyright Contributors to the oneDAL project
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.

#####
Ideas
#####

As an open-source project, we welcome community contributions to Intel(R) Extension for Scikit-learn.
This document suggests contribution directions which we consider good introductory projects with meaningful
impact. You can directly contribute to next-generation supercomputing, or just learn in depth about key
aspects of performant machine learning code for a range of architectures. This list is expected to evolve,
with currently available projects described in the latest version of the documentation.

Every project is labeled in one of three tiers based on the time commitment: 'small' (90 hours), 'medium'
(175 hours) or 'large' (350 hours). Related topics can be combined into larger packages, though the time
commitments are not completely additive due to similarity in scope (e.g. 3 'smalls' may make a 'medium'
given a shared learning curve). Others may increase in difficulty as the scope increases (some 'smalls' may
become 'large' with in-depth C++ coding). Each idea has a linked GitHub issue, a description, a difficulty,
and possibly an extended goal. Ideas are grouped by similarity to allow for easy combinations.

Implement Covariance Estimators for Supercomputers
--------------------------------------------------

The Intel(R) Extension for Scikit-learn contains an MPI-enabled covariance algorithm, showing high performance
from SBCs to multi-node clusters. It directly matches the capabilities of Scikit-learn's EmpiricalCovariance
estimator. A number of closely related algorithms modify the outputs of EmpiricalCovariance and can be built
on our implementation. These include the Oracle Approximating Shrinkage (OAS) and Shrunk Covariance
(ShrunkCovariance) algorithms. Adding these algorithms to our codebase will assist the community
in their analyses. The total combined work of the two sub-projects is an easy difficulty with a medium time
requirement. With the extended goals, it becomes a hard difficulty with a large time requirement.

Oracle Approximating Shrinkage Estimator (small)
************************************************

The output of EmpiricalCovariance is regularized by a shrinkage coefficient that is estimated from the data itself.
The goal would be to implement this estimator with post-processing changes to the fitted empirical covariance.
This project is very similar to the ShrunkCovariance project and would combine into a medium project.
When implemented in Python by re-using our EmpiricalCovariance estimator, this would be an easy project with a
small time commitment. Implementing the supercomputing distributed version using Python would only work for
distributed-aware frameworks. Extended goals would make this a hard difficulty, medium commitment project. This
would require implementing the regularization in C++ in oneDAL for both CPU and GPU. It must then be made
available in scikit-learn-intelex as a new estimator. This would hopefully follow the design strategy
used for our Ridge Regression estimator.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2305>`__.
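The data-driven coefficient at the heart of this post-processing step can be sketched in a few lines. This is a minimal, dependency-free illustration and not the project's implementation: ``oas_shrinkage_coefficient`` is a hypothetical helper, the covariance is a plain list of lists rather than the output of EmpiricalCovariance, and the formula is assumed to match the one used by Scikit-learn's OAS estimator, which should be verified against that reference before being relied on.

```python
def oas_shrinkage_coefficient(emp_cov, n_samples):
    """Estimate an OAS-style shrinkage coefficient in [0, 1].

    Hypothetical helper; the formula is an assumption to be checked
    against Scikit-learn's OAS estimator.
    """
    p = len(emp_cov)  # number of features
    # mu: mean of the diagonal, i.e. the average per-feature variance
    mu = sum(emp_cov[i][i] for i in range(p)) / p
    # alpha: mean of the squared entries of the covariance matrix
    alpha = sum(v * v for row in emp_cov for v in row) / (p * p)
    num = alpha + mu * mu
    den = (n_samples + 1.0) * (alpha - mu * mu / p)
    # A zero denominator means the covariance is already mu * identity
    return 1.0 if den == 0 else min(num / den, 1.0)


coefficient = oas_shrinkage_coefficient([[2.0, 1.0], [1.0, 2.0]], n_samples=25)
print(coefficient)  # 0.5
```

The resulting coefficient would then be used to blend the empirical covariance with a scaled identity matrix, which is the same convex combination that ShrunkCovariance applies with a fixed coefficient.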


ShrunkCovariance Estimator (small)
**********************************

The output of EmpiricalCovariance is regularized by a user-specified shrinkage coefficient, which pulls it toward a
scaled identity matrix. The goal would be to implement this estimator with post-processing changes to the fitted
empirical covariance. This is very similar to the OAS project and would combine into a medium project.
When implemented in Python by re-using our EmpiricalCovariance estimator, this would be an easy project with a
small time commitment. Implementing the supercomputing distributed version using Python would only work for
distributed-aware frameworks. Extended goals would make this a hard difficulty, medium commitment project. This
would require implementing the regularization in C++ in oneDAL for both CPU and GPU. It must then be made
available in scikit-learn-intelex as a new estimator. This would hopefully follow the design strategy
used for our Ridge Regression estimator.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2306>`__.
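As a rough illustration of what "post-processing changes to the fitted empirical covariance" could look like, the sketch below applies the shrinkage formula documented for Scikit-learn's ShrunkCovariance, ``(1 - shrinkage) * cov + shrinkage * mu * I`` with ``mu = trace(cov) / n_features``. The function name ``shrink_covariance`` and the list-of-lists input are assumptions made for the example; a real estimator would start from the ``covariance_`` attribute of a fitted EmpiricalCovariance.

```python
def shrink_covariance(emp_cov, shrinkage=0.1):
    """Blend an empirical covariance with a scaled identity matrix.

    Hypothetical helper operating on a square list of lists.
    """
    p = len(emp_cov)  # number of features
    # mu: average variance, used to scale the identity target
    mu = sum(emp_cov[i][i] for i in range(p)) / p
    return [
        [
            (1.0 - shrinkage) * emp_cov[i][j] + (shrinkage * mu if i == j else 0.0)
            for j in range(p)
        ]
        for i in range(p)
    ]


shrunk = shrink_covariance([[4.0, 2.0], [2.0, 2.0]], shrinkage=0.5)
print(shrunk)  # [[3.5, 1.0], [1.0, 2.5]]
```

Off-diagonal entries are only damped, while diagonal entries move toward the average variance, so the trace of the matrix is preserved.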


Implement the Preprocessing Estimators for Supercomputers
---------------------------------------------------------

The Intel(R) Extension for Scikit-learn contains two unique estimators used to get vital metrics from large datasets,
known as BasicStatistics and IncrementalBasicStatistics. They generate relevant values like 'min', 'max', 'mean'
and 'variance' with special focus on multithreaded performance. They are also MPI-enabled, working on systems from SBCs
to multi-node clusters, and can prove very useful for important big data pre-processing steps which may otherwise be unwieldy.
Several pre-processing algorithms in Scikit-learn use these basic metrics where BasicStatistics could be used instead.
The overall goal would be to use the online version, IncrementalBasicStatistics, to create advanced pre-processing
scikit-learn-intelex estimators which can be used on supercomputing clusters. The difficulty of this project is easy,
with a combined time commitment of a large project. It does not have any extended goals.


StandardScaler Estimator (small)
********************************

The StandardScaler estimator scales the data to zero mean and unit variance. Use the IncrementalBasicStatistics estimator
to generate the mean and variance to scale the data. Investigate where the new implementation may perform poorly and
include guards in the code to fall back to Scikit-learn as necessary. The final deliverable would be to add this estimator
to the 'spmd' interfaces, which are effective on MPI-enabled supercomputers. This will use the underlying MPI-enabled mean
and variance calculators in IncrementalBasicStatistics. This is an easy difficulty project, and would be a medium time
commitment when combined with other pre-processing projects.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2307>`__.
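The idea of scaling from incrementally accumulated statistics can be sketched without any dependencies. In the example below, ``RunningMoments`` is a hypothetical stand-in for the role IncrementalBasicStatistics would play; it does not use or imitate that estimator's actual API. The population variance (division by ``n``) is used because that is what Scikit-learn's StandardScaler documents; whether the statistics returned by IncrementalBasicStatistics follow the same convention is something the project would need to check.

```python
import math


class RunningMoments:
    """Hypothetical batch-wise accumulator of per-column mean and variance."""

    def __init__(self, n_features):
        self.n = 0
        self.sums = [0.0] * n_features
        self.sq_sums = [0.0] * n_features

    def partial_fit(self, batch):
        # Only running sums are stored, so batches can be discarded after use
        for row in batch:
            self.n += 1
            for j, value in enumerate(row):
                self.sums[j] += value
                self.sq_sums[j] += value * value
        return self

    def mean(self):
        return [s / self.n for s in self.sums]

    def variance(self):
        # Population variance E[x^2] - E[x]^2; simple, but it can lose
        # precision when the mean is large relative to the spread
        return [sq / self.n - m * m for sq, m in zip(self.sq_sums, self.mean())]


def standard_scale(rows, mean, variance):
    # Columns with zero variance are left unscaled (scale of 1)
    scales = [math.sqrt(v) if v > 0 else 1.0 for v in variance]
    return [[(x - m) / s for x, m, s in zip(row, mean, scales)] for row in rows]


moments = RunningMoments(n_features=2)
moments.partial_fit([[1.0, 2.0], [3.0, 4.0]]).partial_fit([[5.0, 6.0]])
scaled = standard_scale([[3.0, 4.0]], moments.mean(), moments.variance())
print(scaled)  # [[0.0, 0.0]] because the row equals the column means
```

Because only sums are carried between batches, the same accumulation could in principle be combined across processes, which is the property the 'spmd' interfaces rely on.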


MaxAbsScaler Estimator (small)
******************************

The MaxAbsScaler estimator scales the data by its maximum absolute value. Use the IncrementalBasicStatistics estimator
to generate the min and max to scale the data. Investigate where the new implementation may perform poorly and
include guards in the code to fall back to Scikit-learn as necessary. The final deliverable would be to add this estimator
to the 'spmd' interfaces, which are effective on MPI-enabled supercomputers. This will use the underlying MPI-enabled minimum
and maximum calculators in IncrementalBasicStatistics. This is similar to the MinMaxScaler and can be combined into a small project.
This is an easy difficulty project.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2308>`__.

MinMaxScaler Estimator (small)
******************************

The MinMaxScaler estimator scales the data to a range set by the minimum and maximum. Use the IncrementalBasicStatistics
estimator to generate the min and max to scale the data. Investigate where the new implementation may perform poorly and
include guards in the code to fall back to Scikit-learn as necessary. The final deliverable would be to add this estimator
to the 'spmd' interfaces, which are effective on MPI-enabled supercomputers. This will use the underlying MPI-enabled minimum
and maximum calculators in IncrementalBasicStatistics. This is similar to the MaxAbsScaler and can be combined into a small project.
This is an easy difficulty project.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2309>`__.
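Since this scaler and MaxAbsScaler both need only the per-column minimum and maximum, one small sketch can show why they combine well. ``RunningMinMax`` is a hypothetical accumulator standing in for the min and max results of IncrementalBasicStatistics, not that estimator's API, and both scaling functions are illustrative names that follow the formulas documented for the corresponding Scikit-learn estimators.

```python
class RunningMinMax:
    """Hypothetical batch-wise accumulator of per-column min and max."""

    def __init__(self, n_features):
        self.mins = [float("inf")] * n_features
        self.maxs = [float("-inf")] * n_features

    def partial_fit(self, batch):
        for row in batch:
            for j, value in enumerate(row):
                self.mins[j] = min(self.mins[j], value)
                self.maxs[j] = max(self.maxs[j], value)
        return self


def max_abs_scale(rows, mins, maxs):
    # Largest absolute value per column; a zero scale is replaced by 1
    scales = [max(abs(lo), abs(hi)) or 1.0 for lo, hi in zip(mins, maxs)]
    return [[x / s for x, s in zip(row, scales)] for row in rows]


def min_max_scale(rows, mins, maxs, feature_range=(0.0, 1.0)):
    low, high = feature_range
    # Constant columns have zero span, which is replaced by 1
    spans = [(hi - lo) or 1.0 for lo, hi in zip(mins, maxs)]
    return [
        [(x - lo) / span * (high - low) + low for x, lo, span in zip(row, mins, spans)]
        for row in rows
    ]


stats = RunningMinMax(n_features=2)
stats.partial_fit([[-4.0, 1.0], [2.0, 3.0]]).partial_fit([[1.0, 5.0]])
print(max_abs_scale([[2.0, 5.0]], stats.mins, stats.maxs))   # [[0.5, 1.0]]
print(min_max_scale([[-1.0, 3.0]], stats.mins, stats.maxs))  # [[0.5, 0.5]]
```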

Normalizer Estimator (small)
****************************

The Normalizer estimator scales the samples independently by the sample's norm (l1, l2). Use the IncrementalBasicStatistics
estimator to generate the sum of squares of the data and use it for generating only the l2 version of the normalizer. Investigate
where the new implementation may perform poorly and include guards in the code to fall back to Scikit-learn as necessary. The final
deliverable would be to add this estimator to the 'spmd' interfaces, which are effective on MPI-enabled supercomputers. This
will use the underlying MPI-enabled sum-of-squares calculation in IncrementalBasicStatistics. This is an easy difficulty project,
and would be a medium time commitment when combined with other pre-processing projects.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2310>`__.
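For reference, the l2 behavior the new estimator would need to reproduce is small enough to write out. The sketch below is a plain-Python illustration of scaling each sample by its own l2 norm, as described for Scikit-learn's Normalizer; ``l2_normalize`` is a hypothetical name, and how the sum-of-squares result from IncrementalBasicStatistics would be mapped onto this per-sample computation is left open here.

```python
import math


def l2_normalize(rows):
    """Scale every sample (row) to unit l2 norm."""
    normalized = []
    for row in rows:
        # The sum of squares is taken across the features of one sample
        norm = math.sqrt(sum(x * x for x in row))
        # All-zero samples are left unchanged instead of dividing by zero
        norm = norm or 1.0
        normalized.append([x / norm for x in row])
    return normalized


print(l2_normalize([[3.0, 4.0], [0.0, 0.0]]))  # [[0.6, 0.8], [0.0, 0.0]]
```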


Expose Accelerated Kernel Distance Functions
--------------------------------------------

The Intel(R) Extension for Scikit-learn contains several kernel functions which have not been made available in our public API but
are available in our onedal package. Making these available to users is an easy, Python-only project good for learning about
Scikit-learn, testing and the underlying math of kernels. The goal would be to make them available in a similar fashion as in Scikit-learn.
Their general nature gives them high utility for both scikit-learn and scikit-learn-intelex, as they can be used as plugins for a
number of other estimators (see the Kernel trick).


sigmoid_kernel Function (small)
*******************************

The sigmoid kernel converts data via tanh into a new space. This is an easy difficulty project, but requires significant benchmarking
to find when the scikit-learn-intelex implementation provides better performance. This project will focus on the public API and on
including the benchmarking results for a seamless, high-performance user experience. Combines with the other kernel projects to a
medium time commitment.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2311>`__.
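A plain-Python reference can be handy when testing an accelerated version against expected values. The sketch below computes ``tanh(gamma * <x, y> + coef0)`` for every pair of rows, with defaults chosen to mirror those documented for ``sklearn.metrics.pairwise.sigmoid_kernel`` (``gamma = 1 / n_features``, ``coef0 = 1``). The name ``sigmoid_kernel_reference`` is hypothetical and this is not the onedal implementation.

```python
import math


def sigmoid_kernel_reference(X, Y, gamma=None, coef0=1.0):
    """Pairwise sigmoid kernel between the rows of X and the rows of Y."""
    if gamma is None:
        gamma = 1.0 / len(X[0])  # assumed default: 1 / n_features
    return [
        [math.tanh(gamma * sum(a * b for a, b in zip(x, y)) + coef0) for y in Y]
        for x in X
    ]


K = sigmoid_kernel_reference([[1.0, 0.0]], [[0.0, 1.0]], coef0=0.0)
print(K)  # [[0.0]] since the rows are orthogonal and coef0 is zero
```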


polynomial_kernel Function (small)
**********************************

The polynomial kernel converts data via a polynomial into a new space. This is an easy difficulty project, but requires significant
benchmarking to find when the scikit-learn-intelex implementation provides better performance. This project will focus on the public
API and on including the benchmarking results for a seamless, high-performance user experience. Combines with the other kernel projects
to a medium time commitment.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2312>`__.
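As a small correctness reference for benchmarking, the polynomial kernel ``(gamma * <x, y> + coef0) ** degree`` can be written directly. The defaults below are chosen to mirror those documented for ``sklearn.metrics.pairwise.polynomial_kernel`` (``degree = 3``, ``gamma = 1 / n_features``, ``coef0 = 1``); ``polynomial_kernel_reference`` is a hypothetical name and not the onedal implementation.

```python
def polynomial_kernel_reference(X, Y, degree=3, gamma=None, coef0=1.0):
    """Pairwise polynomial kernel between the rows of X and the rows of Y."""
    if gamma is None:
        gamma = 1.0 / len(X[0])  # assumed default: 1 / n_features
    return [
        [(gamma * sum(a * b for a, b in zip(x, y)) + coef0) ** degree for y in Y]
        for x in X
    ]


# <x, y> = 11, gamma = 0.5, so the value is (0.5 * 11 + 1) ** 3
print(polynomial_kernel_reference([[1.0, 2.0]], [[3.0, 4.0]]))  # [[274.625]]
```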


rbf_kernel Function (small)
***************************

The rbf kernel converts data via a radial basis function into a new space. This is an easy difficulty project, but requires significant
benchmarking to find when the scikit-learn-intelex implementation provides better performance. This project will focus on the public
API and on including the benchmarking results for a seamless, high-performance user experience. Combines with the other kernel projects
to a medium time commitment.

Questions, status, and additional information can be tracked on `GitHub <https://github.com/uxlfoundation/scikit-learn-intelex/issues/2313>`__.
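The radial basis function kernel, ``exp(-gamma * ||x - y||^2)``, is equally short as a plain-Python reference. The default ``gamma = 1 / n_features`` is chosen to mirror the one documented for ``sklearn.metrics.pairwise.rbf_kernel``; ``rbf_kernel_reference`` is a hypothetical name and not the onedal implementation.

```python
import math


def rbf_kernel_reference(X, Y, gamma=None):
    """Pairwise RBF kernel between the rows of X and the rows of Y."""
    if gamma is None:
        gamma = 1.0 / len(X[0])  # assumed default: 1 / n_features
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in Y]
        for x in X
    ]


K = rbf_kernel_reference([[0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]])
print(K[0][0])  # 1.0 for identical points
print(K[0][1])  # exp(-1): squared distance 2 times gamma 0.5
```

Values always fall in the interval (0, 1], which gives a quick sanity check when comparing an accelerated implementation against this reference.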

doc/sources/index.rst

Lines changed: 1 addition & 0 deletions
@@ -133,4 +133,5 @@ Enable Intel(R) GPU optimizations

 Support <support.rst>
 contribute.rst
+ideas.rst
 license.rst
