Commit 656e6a9

glemaitre, reshamas, betatim, and lucyleeow authored
Add announcement CZI EOSS 6 (#190)
Co-authored-by: Reshama Shaikh <[email protected]>
Co-authored-by: Tim Head <[email protected]>
Co-authored-by: Lucy Liu <[email protected]>
1 parent 251e034 commit 656e6a9

3 files changed

+110
-0
lines changed

@@ -0,0 +1,110 @@
1+
---
2+
title: "Chan Zuckerberg Initiative considers scikit-learn an Essential Open Source Software"
3+
date: August 6, 2024
4+
categories:
5+
- Funding
6+
tags:
7+
- Open Source
8+
- Funding
9+
- Internship
10+
- Diversity
11+
featured-image: sklearn_czi.png
12+
13+
postauthors:
14+
- name: Guillaume Lemaitre
15+
website: https://github.com/glemaitre
16+
image: guillaume-lemaitre.jpg
17+
- name: Lucy Liu
18+
website: https://github.com/lucyleeow
19+
image: lucyliu.jpeg
20+
---
21+
<div>
22+
<img src="/assets/images/posts_images/{{ page.featured-image }}" alt="">
23+
{% include postauthor.html %}
24+
</div>
25+
26+
We are delighted to announce that `scikit-learn` has been awarded a grant from
27+
the [Chan Zuckerberg Initiative (CZI)](https://chanzuckerberg.com/)'s [Essential Open
28+
Source Software for Science
29+
(EOSS)](https://chanzuckerberg.com/rfa/essential-open-source-software-for-science/)
30+
program. This grant is funded by [Wellcome Trust](https://wellcome.org/).
31+
As in previous rounds, this cycle supports open-source software projects that are
32+
essential to biomedical research. This is the third time that CZI EOSS has supported
33+
`scikit-learn`.
34+
35+
In this new grant, we will focus on improving the [evaluation and inspection of
36+
predictive
37+
models](https://chanzuckerberg.com/eoss/proposals/predictive-models-evaluation-inspection-in-scikit-learn/).
38+
39+
## Predictive models evaluation & inspection
40+
41+
When building a machine learning pipeline for a specific research problem, two key
42+
aspects are closely connected: (i) design of the pipeline and (ii) assessment, analysis, and
43+
inspection of it. Researchers strive to identify the optimal pipeline, maximizing specific
44+
evaluation metrics, while also seeking to explain the validity and rationale behind
45+
the pipeline's predictions. This is the cornerstone of answering research
46+
questions. With this proposal we aim to improve and extend the available `scikit-learn`
47+
tools.
48+
49+
`scikit-learn` provides building blocks for model evaluation and statistical analysis of
50+
results. Originally, this information was presented in a raw format and required
51+
expertise from scientists to create intuitive reports for outreach to peers and
52+
outsiders. Recently, the `scikit-learn` community developed displays to easily generate
53+
visual figures for communicating such results. However, these displays are still in
54+
their early development stages and do not leverage all available statistical analysis
55+
tools (e.g., cross-validation) from `scikit-learn`. Thus, we aim to expand these
56+
displays to use the right statistical tools and thereby promote the adoption of best
57+
practices when reporting results. We also intend to create new displays
58+
to support common analysis tasks that are not yet covered in `scikit-learn`.
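
As a rough illustration of the kind of statistical information that displays could summarize, the sketch below uses the existing `cross_validate` function to collect per-fold scores and report their mean and standard deviation. The synthetic dataset, the choice of `LogisticRegression`, and the two metrics are assumptions made for this example only; they are not part of the grant proposal.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Evaluate two metrics on each of the 5 cross-validation folds.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    scoring=["roc_auc", "average_precision"],
)

# Summarize each metric by its mean and standard deviation across folds.
summary = {
    name: (scores.mean(), scores.std())
    for name, scores in cv_results.items()
    if name.startswith("test_")
}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Reporting the spread across folds, rather than a single score, is one way to convey the uncertainty of an evaluation.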
59+
60+
In the domain of model inspection, we aim to address several areas: (i) model inspection
61+
during training, (ii) enhancing user experience through interactive inspection, and
62+
(iii) model explainability. First, during the training of a pipeline, researchers are
63+
interested in monitoring the internal characteristics of the model, a long-standing
64+
issue that `scikit-learn` does not yet address. We want to build upon some initial work
65+
by implementing a "callback" framework that allows users to track these internal
66+
parameters. Next, researchers commonly use interactive tools such as Jupyter Notebook to
67+
develop pipelines. `scikit-learn` started some efforts to visually and interactively
68+
display pipelines in these environments. However, there is room for improvement in terms
69+
of user interaction and accessibility. Finally, as `scikit-learn` is widely used as a
70+
reference package, it is crucial to improve the section of the library dedicated to
71+
model explainability. We aim to improve the documentation and user experience with the
72+
existing explainability tools, making sure that users pick the appropriate tool for their
73+
use cases. In addition, we propose to work on a scikit-learn enhancement proposal (SLEP)
74+
to define a common API for model explainability within scikit-learn. Ultimately, the
75+
goal is to come to a consensus to provide scikit-learn end-users with a consistent
76+
experience when using model explainability tools.
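
For readers unfamiliar with the existing explainability tools, here is a minimal sketch using `permutation_importance` from `sklearn.inspection`, one of the inspection utilities already available. The synthetic data, the random forest model, and the parameter values are assumptions chosen for illustration; the sketch does not reflect the common API that the SLEP would define.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False, the first 2 columns are the informative ones.
X, y = make_classification(
    n_samples=400,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Measure how much the held-out score drops when each feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Features ordered from most to least important on the test set.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(
        f"feature {idx}: {result.importances_mean[idx]:.3f} "
        f"+/- {result.importances_std[idx]:.3f}"
    )
```

Computing the importances on held-out data, as done here, measures what the model relies on to generalize rather than what it memorized during training.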
77+
78+
On top of all these items, we intend to continue working on the general maintenance of
79+
the project, addressing bug reports and performance regressions. As a community-driven
80+
project, we also want to dedicate time to reviewing external contributions.
81+
82+
## Involved people
83+
84+
To execute this project, we plan the following hires:
85+
86+
- [Lucy Liu](https://github.com/lucyleeow) (Quansight Labs) will work about half-time on
87+
  the project, on topics related to displays and feature importance.
87+
- We will hire full-time interns to work on the other part of the project. The
89+
initial plan is to hire two interns for a period of 6 months each and repeat this
90+
process for the next 2 years. We want to provide opportunities to underrepresented
91+
groups in the field of machine learning and data science, similarly to previous
92+
initiatives (cf. [NumFOCUS Small Development
93+
Grant](https://blog.scikit-learn.org/diversity/mentoring/)).
94+
95+
## Past CZI EOSS grants
96+
97+
In the past, `scikit-learn` has been awarded two grants from the CZI EOSS program:
98+
99+
- [CZI EOSS Cycle 1](https://chanzuckerberg.com/eoss/proposals/scikit-learn-maintenance-and-enhancement-for-gradient-boosting/)
100+
  helped create the
101+
[`HistGradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) and
102+
[`HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html) estimators.
103+
These estimators are the equivalent of gradient boosting models implemented in
104+
`LightGBM` and `XGBoost`.
105+
- [CZI EOSS Cycle 4](https://chanzuckerberg.com/eoss/proposals/maintenance-extension-of-scikit-learn-machine-learning-in-python/)
106+
extended `scikit-learn` to work better with missing values and categorical data in
107+
several estimators.
108+
109+
Both grants allowed us to maintain and enhance `scikit-learn` to better serve the
110+
community.