6 changes: 3 additions & 3 deletions .github/workflows/pypi-publish.yml
@@ -12,17 +12,17 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout repo
uses: actions/checkout@v4
uses: actions/checkout@v6
with:
fetch-depth: 0

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: "3.10"

- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@v7

# set version (e.g. 1.2.3) from the latest Git tag on the master branch
- name: Set package release version
6 changes: 3 additions & 3 deletions .github/workflows/test-pypi-publish.yml
@@ -22,17 +22,17 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout repo
uses: actions/checkout@v4
uses: actions/checkout@v6
with:
fetch-depth: 0

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: "3.10"

- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@v7

# set version (e.g. 1.2.3) from the latest Git tag on master branch
- name: Set package release version
13 changes: 6 additions & 7 deletions .github/workflows/tests.yml
@@ -14,29 +14,28 @@ jobs:
strategy:
fail-fast: false
matrix:
# specific version in 3.13 due to bug https://github.com/python/cpython/issues/138031
python-version: ["3.10", "3.11", "3.12", "3.13.6"]
python-version: ["3.10", "3.11", "3.12", "3.13"]
os: [macos-latest, ubuntu-latest, windows-latest]
runs-on: ${{ matrix.os }}
steps:
- name: Checkout repo
uses: actions/checkout@v4
uses: actions/checkout@v6

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}

- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@v7

- uses: actions/cache@v4
- uses: actions/cache@v5
name: Cache venv
with:
path: ./.venv
key: ${{ matrix.os }}-venv-${{ matrix.python-version }}-${{ hashFiles('**/uv.lock') }}

- uses: actions/cache@v4
- uses: actions/cache@v5
name: Cache datasets
with:
path: ~/scikit_learn_data
15 changes: 1 addition & 14 deletions .pre-commit-config.yaml
@@ -9,22 +9,9 @@ repos:
language: system
pass_filenames: false

- repo: https://github.com/pypa/pip-audit
rev: v2.9.0
hooks:
- id: pip-audit
args: [
--vulnerability-service, "pypi",
--cache-dir, ".pip_audit_cache",
# false alert for setuptools, we have a much newer version
--ignore-vuln, "GHSA-5rjg-fvgr-3xxf",
# false alert for pip
--ignore-vuln, "GHSA-4xh5-x5gv-qwph"
]

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.13.1
hooks:
- id: ruff-check # linter
args: [ --fix ]
args: [ --fix, --exit-zero ]
- id: ruff-format # formatter
19 changes: 12 additions & 7 deletions README.md
@@ -53,12 +53,14 @@ Main features:

| | `python3.10` | `python3.11` | `python3.12` | `python3.13` |
|:-----------:|:------------:|:------------:|:------------:|:------------:|
| **Linux** | ✅ | ✅ | ✅ | |
| **Windows** | ✅ | ✅ | ✅ | |
| **macOS** | ✅ | ✅ | ✅ | |
| **Linux** | ✅ | ✅ | ✅ | |
| **Windows** | ✅ | ✅ | ✅ | |
| **macOS** | ✅ | ✅ | ✅ | |

Python 3.9 was supported up to scikit-fingerprints 1.13.0.

Python 3.13 is officially supported, but underlying libraries may not be fully compatible yet.

## Installation

You can install the library using pip:
@@ -159,7 +161,7 @@ Examples and tutorials:

## Project overview

`scikit-fingerprint` brings molecular fingerprints and related functionalities into
`scikit-fingerprints` brings molecular fingerprints and related functionalities into
the scikit-learn ecosystem. With familiar class-based design and `.transform()` method,
fingerprints can be computed from SMILES strings or RDKit `Mol` objects. Resulting NumPy
arrays or SciPy sparse arrays can be directly used in ML pipelines.
@@ -216,13 +218,16 @@ Publications using scikit-fingerprints:
1. [J. Adamczyk, W. Czech "Molecular Topological Profile (MOLTOP) - Simple and Strong Baseline for Molecular Graph Classification" ECAI 2024](https://ebooks.iospress.nl/doi/10.3233/FAIA240663)
2. [J. Adamczyk, P. Ludynia "Scikit-fingerprints: easy and efficient computation of molecular fingerprints in Python" SoftwareX](https://www.sciencedirect.com/science/article/pii/S2352711024003145)
3. [J. Adamczyk, P. Ludynia, W. Czech "Molecular Fingerprints Are Strong Models for Peptide Function Prediction" ArXiv preprint](https://arxiv.org/abs/2501.17901)
4. [M. Fitzner et al. "BayBE: a Bayesian Back End for experimental planning in the low-to-no-data regime" RSC Digital Discovery](https://pubs.rsc.org/en/content/articlehtml/2025/dd/d5dd00050e)
5. [J. Xiong "Bridging 3D Molecular Structures and Artificial Intelligence by a Conformation Description Language"](https://www.biorxiv.org/content/10.1101/2025.05.07.652440v1.abstract)
4. [J. Adamczyk "Towards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology" CIKM 2025](https://dl.acm.org/doi/abs/10.1145/3746252.3761660)
5. [J. Adamczyk, J. Poziemski, F. Job, M. Król, M. Makowski "MolPILE - large-scale, diverse dataset for molecular representation learning" ArXiv preprint](https://arxiv.org/abs/2509.18353)
6. [M. Fitzner et al. "BayBE: a Bayesian Back End for experimental planning in the low-to-no-data regime" RSC Digital Discovery](https://pubs.rsc.org/en/content/articlehtml/2025/dd/d5dd00050e)
7. [J. Xiong et al. "Bridging 3D Molecular Structures and Artificial Intelligence by a Conformation Description Language"](https://www.biorxiv.org/content/10.1101/2025.05.07.652440v1.abstract)
8. [S. Mavlonazarova et al. "Untargeted Metabolomics Reveals Organ-Specific and Extraction-Dependent Metabolite Profiles in Endemic Tajik Species Ferula violacea Korovin" bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.08.24.671964v1)

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) and [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) for details on our code of
conduct, and the process for submitting pull requests to us.
conduct and the process for submitting pull requests.

## License

15 changes: 0 additions & 15 deletions mypy.ini

This file was deleted.

18 changes: 12 additions & 6 deletions pyproject.toml
@@ -32,7 +32,7 @@ dependencies = [
"numba<1",
"numpy>=1.20.0,<3",
"pandas<3",
"rdkit<=2025.3.6",
"rdkit<=2025.9.3",
"scikit-learn>=1.0.0,<2",
"scipy>=1.0.0,<2",
"tqdm>=4.0.0,<5"
@@ -49,21 +49,18 @@ dev = [
"coverage",
"jupyter",
"mypy",
"pip-audit",
"pre-commit",
"pytest",
"pytest-cov",
"pytest-rerunfailures",
"ruff",
"setuptools>=80",
"xenon"
"scipy-stubs",
]

test = [
"mypy",
"ruff",
"xenon",
"pip-audit",
"pre-commit",
"pytest",
"pytest-rerunfailures"
@@ -73,7 +70,7 @@ docs = [
"ipython",
"nbsphinx",
"pydata-sphinx-theme",
"scikit-learn!=1.7.1", # due to scikit-learn docs issue: https://github.com/microsoft/lightgbm/issues/6978
"scikit-learn!=1.7.1", # due to scikit-learn docs issue: https://github.com/microsoft/lightgbm/issues/6978
"sphinx",
"sphinx-copybutton"
]
@@ -95,6 +92,15 @@ filterwarnings = [
"ignore:Function auroc_score.*:FutureWarning"
]

[tool.mypy]
python_version = "3.10"
check_untyped_defs = true # check all functions, this fixes some tests
allow_redefinition = true # we redefine variables a lot for efficiency
# most libraries used are not properly typed in Python, particularly RDKit
ignore_missing_imports = true
disable_error_code = ["import-untyped"]
no_site_packages = true

[tool.uv.build-backend]
module-name = "skfp"
module-root = ""
14 changes: 7 additions & 7 deletions skfp/applicability_domain/bounding_box.py
@@ -15,18 +15,18 @@ class BoundingBoxADChecker(BaseADChecker):
This creates a "bounding box" using their extreme values, and new molecules
should lie in this distribution, i.e. have properties in the same ranges [1]_.

Typically, physicochemical properties (continous features) are used as inputs.
Typically, physicochemical properties (continuous features) are used as inputs.
Consider scaling, normalizing, or transforming them before computing AD to lessen
effects of outliers, e.g. with ``PowerTransformer`` or ``RobustScaler``. This is
particularly important if ``"three_sigma"`` is used as percentile bound, as it
particularly important if ``"three_sigma"`` is used as the percentile bound, as it
assumes normal distribution.

By default, the full range of training descriptors are allowed as AD. For stricter
check, use ``percentile_lower`` and ``percentile_upper`` arguments to disallow
extremely low or large values, respectively. For looser check, use ``num_allowed_violations``
to allow a number of desrciptors to lie outside the given ranges.

This method scales very well with both number of samples and features.
This method scales very well with both the number of samples and features.

Parameters
----------
@@ -42,7 +42,7 @@ class BoundingBoxADChecker(BaseADChecker):
uses 3 standard deviations from the mean, a common rule-of-thumb for outliers
assuming the normal distribution.

num_allowed_violations : bool, default=0
num_allowed_violations : int, default=0
Number of allowed violations of feature ranges. By default, all descriptors
must lie inside the bounding box.

@@ -85,16 +85,16 @@ class BoundingBoxADChecker(BaseADChecker):

_parameter_constraints: dict = {
**BaseADChecker._parameter_constraints,
"percentile_lower": [Interval(Real, 0, 100, closed="both")],
"percentile_upper": [Interval(Real, 0, 100, closed="both")],
"percentile_lower": [Interval(Real, 0, 100, closed="both"), "three_sigma"],
"percentile_upper": [Interval(Real, 0, 100, closed="both"), "three_sigma"],
"num_allowed_violations": [Interval(Integral, 0, None, closed="left")],
}

def __init__(
self,
percentile_lower: float | str = 0,
percentile_upper: float | str = 100,
num_allowed_violations: int | None = 0,
num_allowed_violations: int = 0,
n_jobs: int | None = None,
verbose: int | dict = 0,
):
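
The bounding-box check described in the docstring above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the function names `fit_bounding_box` and `in_bounding_box` are hypothetical, and the use of the population standard deviation for `"three_sigma"` is an assumption. Only the percentile bounds, the "mean plus or minus 3 standard deviations" rule, and the meaning of `num_allowed_violations` come from the docstring.

```python
import numpy as np


def fit_bounding_box(X_train, percentile_lower=0, percentile_upper=100):
    """Compute per-feature lower and upper bounds from training descriptors.

    Hypothetical helper: each bound is either a percentile in [0, 100]
    or the string "three_sigma" (mean -/+ 3 standard deviations).
    """
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # assumption: population std (ddof=0)

    if percentile_lower == "three_sigma":
        lower = mean - 3 * std
    else:
        lower = np.percentile(X_train, percentile_lower, axis=0)

    if percentile_upper == "three_sigma":
        upper = mean + 3 * std
    else:
        upper = np.percentile(X_train, percentile_upper, axis=0)

    return lower, upper


def in_bounding_box(X, lower, upper, num_allowed_violations=0):
    """Return a boolean mask: True where a sample is inside the AD."""
    X = np.asarray(X, dtype=float)
    # count, per sample, how many features fall outside [lower, upper]
    violations = ((X < lower) | (X > upper)).sum(axis=1)
    return violations <= num_allowed_violations
```

With the default percentiles (0 and 100) the bounds are simply the per-feature minimum and maximum of the training data, so raising `num_allowed_violations` loosens the check while narrowing the percentiles tightens it.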
4 changes: 2 additions & 2 deletions skfp/applicability_domain/convex_hull.py
@@ -35,11 +35,11 @@ class ConvexHullADChecker(BaseADChecker):
& 1^T \lambda = 1,\\
& \lambda_i \geq 0 \text{ for all } i=1,...,n

Typically, physicochemical properties (continous features) are used as inputs.
Typically, physicochemical properties (continuous features) are used as inputs.
Consider scaling, normalizing, or transforming them before computing AD to lessen
effects of outliers, e.g. with ``PowerTransformer`` or ``RobustScaler``.

This method scales very badly with both number of samples and features. It has
This method scales very badly with both the number of samples and features. It has
quadratic scaling :math:`O(n^2)` in number of samples, and can be realistically run
on at most 1000-3000 molecules. Its geometry also breaks down above ~10 features,
marking everything as outside AD.
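
The membership test behind the constraints shown in this docstring can be sketched as a linear feasibility problem. This is an illustration under stated assumptions, not the library's code: the helper name `in_convex_hull` is hypothetical, and the equality `X^T λ = x` is the standard convex-combination condition, assumed here because the full formulation lies outside the visible hunk. The sketch uses `scipy.optimize.linprog` with a zero objective, so only feasibility matters.

```python
import numpy as np
from scipy.optimize import linprog


def in_convex_hull(X_train, x):
    """Check whether point x is a convex combination of the rows of X_train.

    Hypothetical helper: solves for lambda with
        X_train^T @ lambda = x,  sum(lambda) = 1,  lambda_i >= 0.
    """
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    n_samples = X_train.shape[0]

    # stack the convex-combination equality and the sum-to-one constraint
    A_eq = np.vstack([X_train.T, np.ones((1, n_samples))])
    b_eq = np.append(x, 1.0)

    # zero objective: any feasible lambda proves membership
    result = linprog(
        c=np.zeros(n_samples),
        A_eq=A_eq,
        b_eq=b_eq,
        bounds=(0, None),
    )
    return bool(result.success)
```

One such problem is solved per query point, with one variable per training sample, which gives some intuition for why this kind of check becomes expensive as the training set grows.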
4 changes: 2 additions & 2 deletions skfp/applicability_domain/distance_to_centroid.py
@@ -63,7 +63,7 @@ class DistanceToCentroidADChecker(BaseADChecker):
data centroid, i.e. the average (middle) point [1]_. New molecules should lie
inside the hypersphere of a given radius (distance) from that centroid.

Typically, physicochemical properties (continous features) are used as inputs.
Typically, physicochemical properties (continuous features) are used as inputs.
Consider scaling, normalizing, or transforming them before computing AD to lessen
effects of outliers, e.g. with ``PowerTransformer`` or ``RobustScaler``.

@@ -129,7 +129,7 @@ class DistanceToCentroidADChecker(BaseADChecker):
_parameter_constraints: dict = {
**BaseADChecker._parameter_constraints,
"threshold": [Interval(Real, 0, None, closed="neither"), StrOptions({"auto"})],
"distance": [
"metric": [
callable,
StrOptions(SCIPY_METRIC_NAMES | SKFP_METRIC_NAMES | SKFP_BULK_METRIC_NAMES),
],
2 changes: 1 addition & 1 deletion skfp/applicability_domain/hotelling_t2_test.py
@@ -16,7 +16,7 @@ class HotellingT2TestADChecker(BaseADChecker):
Mahalanobis distance of a new sample from the mean of the training data, scaled
by the covariance structure of the training data.

Typically, physicochemical properties (continous features) are used as inputs.
Typically, physicochemical properties (continuous features) are used as inputs.
Consider scaling, normalizing, or transforming them before computing AD to lessen
effects of outliers, e.g. with ``PowerTransformer`` or ``RobustScaler``. In case
of Hotelling's T^2 test, using PCA beforehand to obtain orthogonal features is