A methodology for identifying and ranking open source software projects in terms of criticality.
Using the OpenSSF Criticality Scores (hereafter, OpenSSF CS) as a foundation, develop a scoring system that ranks open source software projects in terms of "criticality" or "importance". The system ought to
- Identify a set of "top projects" and provide relative criticality ranks for this set
- Address some limitations identified by the OpenSSF CS
- Reduce the prevalence of "false positives"
- Use a set of signals that can be consistently measured across projects
- Allow fine-tuning of the scoring function and signal weights
- Be easily deployed
- Be easily iterated upon to improve performance
Definition: An open source project is "critical" if it is relied on directly by many users or indirectly through other widely used projects that depend on it.
Our approach consists of three main steps: project discovery and filtering, signal gathering, and scoring and ranking.
The universe of open source projects is too broad, so we need to pare this down a bit before even beginning to rank. Let's define the unit of analysis for an open source project as the project's canonical source code repository. Our approach to project discovery and filtering is as follows:
- Use git-based projects with public source repositories on GitHub
- Use the GitHub GraphQL API to return a list of the top 10,000 most-starred public repositories older than 6 months (a query sketch follows this list). We additionally include the OpenSSF Securing Critical Projects list of "most critical projects" in this set.
- Filter this set to exclude forks, mirrors, templates, and archived repositories.
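A minimal sketch of this discovery step is below, assuming Python with the `requests` library and a personal access token in a `GITHUB_TOKEN` environment variable; the star floor, page count, and helper name `fetch_candidate_repos` are illustrative choices rather than part of the methodology itself.

```python
# Sketch: discover candidate repositories via the GitHub GraphQL API.
# Requires a personal access token in the GITHUB_TOKEN environment variable.
import os
from datetime import date, timedelta

import requests

GRAPHQL_URL = "https://api.github.com/graphql"

QUERY = """
query($searchQuery: String!, $after: String) {
  search(query: $searchQuery, type: REPOSITORY, first: 100, after: $after) {
    pageInfo { hasNextPage endCursor }
    nodes {
      ... on Repository {
        nameWithOwner
        stargazerCount
        createdAt
        isFork
        isMirror
        isArchived
        isTemplate
      }
    }
  }
}
"""

def fetch_candidate_repos(search_query: str, max_pages: int = 10) -> list[dict]:
    """Page through search results, dropping forks, mirrors, templates, and archives."""
    headers = {"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"}
    repos, cursor = [], None
    for _ in range(max_pages):
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": QUERY,
                  "variables": {"searchQuery": search_query, "after": cursor}},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        search = resp.json()["data"]["search"]
        repos += [
            node for node in search["nodes"]
            if not (node["isFork"] or node["isMirror"]
                    or node["isArchived"] or node["isTemplate"])
        ]
        if not search["pageInfo"]["hasNextPage"]:
            break
        cursor = search["pageInfo"]["endCursor"]
    return repos

# Highly starred repositories created more than ~6 months ago (the star floor is arbitrary).
cutoff = (date.today() - timedelta(days=183)).isoformat()
candidates = fetch_candidate_repos(f"stars:>5000 created:<{cutoff} sort:stars-desc")
```

Note that GitHub's search API returns at most 1,000 results per query, so assembling a top-10,000 list in practice means slicing the search into star-count ranges and merging the results.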
For each project in this set, we next gather a number of quantitative signals derived from project characteristics. Signals should measure something about the project that reasonably correlates with a notion of criticality.
Examples:
- Project age
- Development and release frequency
- Number of distinct individual contributors
- Number of distinct contribution organizations
- Level of project discourse (contributor communication, responsiveness to issues)
- Merge/pull requests, forks, artifact downloads
- Downstream dependents
- Scope of use (who and how is the project being used?)
Some of these signals are easier to measure than others. For example, project age can be easily measured by looking at the date of the first commit in the project's version control system (VCS). On the other hand, scope of use is much harder to measure, as it requires knowledge of who is using the project and how they are using it.
If signals are measured inconsistently across projects, including the signal in a criticality score can bias rankings. For example, "number of stars" would be a problematic signal because not all projects are hosted on GitHub, and even among those that are, star counts reflect visibility and community engagement more than actual usage. A concern with the OpenSSF CS data is that some signals, such as dependency information and issue activity, are not consistently measured across projects.
We focus on signals that can be consistently measured across all projects in our set. The signals we have chosen so far are based purely on the version control log of the open source project. Our hope is to expand beyond this constraint in future iterations.
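As a rough sketch of this kind of VCS-log signal extraction, the snippet below derives contributor, organization, age, recency, and commit-frequency figures from `git log` for a locally cloned repository. The helper names, the 30.44-day month approximation, and the email-domain heuristic for organizations are assumptions for illustration; the source-lines-of-code signal would come from a separate line-counting step.

```python
# Sketch: derive the VCS-log signals for one locally cloned repository.
# Assumes `git` is on PATH and `repo_path` points at a full (non-shallow) clone.
import subprocess
from datetime import datetime, timezone

def _months(delta) -> float:
    return delta.days / 30.44  # average month length; a rough approximation

def git_log_entries(repo_path: str) -> list[tuple[datetime, str]]:
    """Return (author date, author email) pairs for every commit in the history."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%aI|%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    entries = []
    for line in out.splitlines():
        stamp, email = line.split("|", 1)
        entries.append((datetime.fromisoformat(stamp), email.lower()))
    return entries

def vcs_signals(repo_path: str) -> dict:
    entries = git_log_entries(repo_path)
    now = datetime.now(timezone.utc)
    dates = [d for d, _ in entries]
    emails = [e for _, e in entries]
    return {
        "distinct_contributors": len(set(emails)),
        # Crude heuristic: treat each distinct author email domain as one organization.
        "distinct_organizations": len({e.split("@")[-1] for e in emails}),
        "project_age_months": _months(now - min(dates)),
        "months_since_last_commit": _months(now - max(dates)),
        "commits_last_year": sum(1 for d in dates if (now - d).days <= 365),
    }
```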
Project criticality rankings can be determined by a composite score derived from the project signals. The ideal score should be a near-perfect correlate of criticality: the higher the score, the more "critical" the project is. For each project $p$, we compute a composite score

$$S(p) = \sum_{i} \alpha_i \, f\big(s_i(p)\big)$$

Note that the function $f$ maps raw signal values onto a common scale; here we take $f$ to be the normalized rank of the project on each signal, where

Term | Description |
---|---|
$s_i(p)$ | Value of signal $i$ for project $p$ |
$f$ | Transformation applied to raw signal values (here, the normalized rank of the project on signal $i$) |
$\alpha_i$ | Relative weights for signal $i$ |
Under these assumptions, the score is a weighted composite of signal ranks.
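As a toy worked example (the two signals, three projects, and weights here are made up, not taken from the table below), suppose projects $A$, $B$, and $C$ are ranked on two signals with weights $\alpha_1 = 0.6$ and $\alpha_2 = 0.4$, with ranks normalized to $\{0, 0.5, 1\}$. If $A$ ranks first on signal 1 and second on signal 2, $B$ the reverse, and $C$ last on both, then

$$S(A) = 0.6(1.0) + 0.4(0.5) = 0.80, \qquad S(B) = 0.6(0.5) + 0.4(1.0) = 0.70, \qquad S(C) = 0,$$

so the heavily weighted signal dominates the ordering.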
The signals and weights we currently use are:

Weight | Signal | Description |
---|---|---|
40% | Distinct contributors | Number of distinct committers |
20% | Distinct organizations | Number of distinct organizations contributing to the project, as determined by email domain |
10% | Project size | Count of source lines of code in the project |
10% | Last updated | Months since the last commit to the project |
10% | Project age | Months since the first commit to the project |
10% | Commit frequency | Count of commits within the last year |
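A minimal sketch of the scoring step follows, assuming the signal dictionaries from the earlier extraction sketch plus a `project_size_sloc` field; the $[0, 1]$ rank normalization, arbitrary tie-breaking, and the decision to invert the staleness signal (fewer months since the last commit ranks higher) are illustrative assumptions rather than requirements of the table above.

```python
# Sketch: weighted composite of normalized signal ranks, using the weights above.
# Assumes `signals` maps project name -> signal dict (as sketched earlier) and that
# each dict also carries a `project_size_sloc` value from a separate line counter.
WEIGHTS = {
    "distinct_contributors": 0.40,
    "distinct_organizations": 0.20,
    "project_size_sloc": 0.10,
    "months_since_last_commit": 0.10,  # inverted below: staleness should lower the score
    "project_age_months": 0.10,
    "commits_last_year": 0.10,
}

def normalized_ranks(values: dict[str, float], higher_is_better: bool = True) -> dict[str, float]:
    """Map each project's raw value to a rank in [0, 1]; 1 = most critical. Ties break arbitrarily."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    n = max(len(ordered) - 1, 1)
    return {name: i / n for i, name in enumerate(ordered)}

def criticality_scores(signals: dict[str, dict[str, float]]) -> dict[str, float]:
    scores = {name: 0.0 for name in signals}
    for signal, weight in WEIGHTS.items():
        ranks = normalized_ranks(
            {name: s[signal] for name, s in signals.items()},
            higher_is_better=(signal != "months_since_last_commit"),
        )
        for name, rank in ranks.items():
            scores[name] += weight * rank
    return scores  # sort descending to obtain the relative criticality ranking
```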
Some remarks on the choice of signals and parameters:
- By placing heavy weight on distinct contributors, we are prioritizing projects that have a large number of individual contributors, which is a strong indicator of community engagement and project health.
- We also weight distinct organizations, which helps to identify projects that have backing from multiple entities, indicating a broader support base.
- Source lines of code is a measure of project size, which can be an indicator of the complexity and legacy of the codebase. Smaller projects that can be easily replaced should not be considered critical.
- We intentionally avoid signals like issue activity as we find that they can be quite noisy, inconsistently measured across projects, and not necessarily indicative of criticality.
- VCS- and platform-specific signals (git and GitHub) - Coming up with a set of signals that can be consistently measured across different version control systems and hosting platforms is a challenge. For example, platform-specific signals like "number of forks" or "number of stars" cannot be collected for projects that are not hosted on GitHub. To accommodate this, we use a set of more general signals that can be measured across platforms; the signals we've chosen so far are based purely on the version control log of the open source project.
- Dependency information is not included - A related observability issue is that dependency information (which projects rely on others) can be hard to gather consistently. While SBOMs or packaging manifests do a decent job of recording runtime dependencies and other external artifacts a project relies upon, more nuanced and potentially more important dependency relationships are imperfectly observed. For example, container technologies like Docker and Kubernetes rely heavily on the Linux kernel, but neither directly "declares" the kernel as a dependency in its packaging manifest.
- Improved filtering (repository classification, NLP, etc.)
- Use a wider set of signals: move beyond VCS logs in some scalable way
- Extension: implement a user feedback loop
- Extension: incorporate dependency information and recursive scoring
- Extension: define alternative parameter sets for different use cases (e.g. security, widespread use, social impact, criticality for a specific user or organization, etc.)
- https://openssf.org/projects/criticality-score/
- https://github.com/ossf/criticality_score
- https://github.com/ossf/wg-securing-critical-projects
- https://openssf.org/blog/2023/07/28/understanding-and-applying-the-openssf-criticality-score-in-open-source-projects/
- https://openssf.org/blog/2022/12/08/apples-and-apples-comparing-approaches-to-measuring-criticality-and-risk-at-the-openssf/
- https://todogroup.org/resources/guides/measuring-your-open-source-programs-success/#what-to-track
- https://chaoss.community/kbtopic/all-metrics-models/