# Commit 8355612 — Add greedy signature selection

Parent: 4579e2c. 4 files changed: +223, −35 lines.

## README.md (27 additions, 22 deletions)
````diff
@@ -1,12 +1,13 @@
 # Manify 🪐
-> A Python Library for Learning Non-Euclidean Representations
 
 [![Python Version](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
-[![License](https://img.shields.io/github/license/pchlenski/manify)](https://github.com/pchlenski/manify/blob/main/LICENSE)
 [![PyPI version](https://badge.fury.io/py/manify.svg)](https://badge.fury.io/py/manify)
 [![Tests](https://github.com/pchlenski/manify/actions/workflows/test.yml/badge.svg)](https://github.com/pchlenski/manify/actions/workflows/test.yml)
 [![codecov](https://codecov.io/gh/pchlenski/manify/branch/main/graph/badge.svg)](https://codecov.io/gh/pchlenski/manify)
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
+[![Documentation](https://img.shields.io/badge/docs-manify.readthedocs.io-blue)](https://manify.readthedocs.io)
+[![arXiv](https://img.shields.io/badge/arXiv-2503.09576-b31b1b.svg)](https://arxiv.org/abs/2503.09576)
+[![License](https://img.shields.io/github/license/pchlenski/manify)](https://github.com/pchlenski/manify/blob/main/LICENSE)
 
 Manify is a Python library for non-Euclidean representation learning.
 It is built on top of `geoopt` and follows `scikit-learn` API conventions.
@@ -18,12 +19,6 @@ The library supports a variety of workflows involving (products of) Riemannian m
 perceptrons, and neural networks.
 - Clustering manifold-valued data using Riemannian fuzzy K-Means
 
-📖 **Documentation**: [manify.readthedocs.io](https://manify.readthedocs.io)
-
-📝 **Manuscript**: [Manify: A Python Library for Learning Non-Euclidean Representations](https://arxiv.org/abs/2503.09576)
-
-🐛 **Issue Tracker**: [Github](https://github.com/pchlenski/manify/issues)
-
 ## Installation
 
 There are two ways to install `manify`:
@@ -41,29 +36,25 @@ There are two ways to install `manify`:
 ## Quick Example
 
 ```python
-import torch
-from manify.manifolds import ProductManifold
-from manify.embedders import CoordinateLearning
-from manify.predictors.decision_tree import ProductSpaceDT
+import manify
 from manify.utils.dataloaders import load_hf
 from sklearn.model_selection import train_test_split
 
-# Load graph data
+# Load Polblogs graph from HuggingFace
 features, dists, adj, labels = load_hf("polblogs")
 
-# Create product manifold
-pm = ProductManifold(signature=[(1, 4)])  # S^4_1
+# Create an S^4 x H^4 product manifold
+pm = manify.ProductManifold(signature=[(1.0, 4), (-1.0, 4)])
 
 # Learn embeddings (Gu et al (2018) method)
-embedder = CoordinateLearning(pm=pm)
-embedder.fit(X=None, D=dists)
-X_embedded = embedder.transform()
+embedder = manify.CoordinateLearning(pm=pm)
+X_embedded = embedder.fit_transform(X=None, D=dists, burn_in_iterations=200, training_iterations=800)
 
 # Train and evaluate classifier (Chlenski et al (2025) method)
 X_train, X_test, y_train, y_test = train_test_split(X_embedded, labels)
-tree = ProductSpaceDT(pm=pm, max_depth=3, task="classification")
-tree.fit(X_train, y_train)
-print(tree.score(X_test, y_test))
+model = manify.ProductSpaceDT(pm=pm, max_depth=3, task="classification")
+model.fit(X_train, y_train)
+print(model.score(X_test, y_test))
 ```
 
 ## Modules
@@ -113,7 +104,7 @@ Decision Trees and Random Forests paper.
 Please read our [contributing guide](https://github.com/pchlenski/manify/blob/main/CONTRIBUTING.md) for details on how
 to contribute to the project.
 
-## Citation
+## References
 If you use our work, please cite the `Manify` paper:
 ```bibtex
 @misc{chlenski2025manifypythonlibrarylearning,
@@ -126,3 +117,17 @@ If you use our work, please cite the `Manify` paper:
 url={https://arxiv.org/abs/2503.09576},
 }
 ```
+
+Additionally, if you use one of the methods implemented in `manify`, please cite the original papers:
+- `CoordinateLearning`: Gu et al. "Learning Mixed-Curvature Representations in Product Spaces." ICLR 2019. [https://openreview.net/forum?id=HJxeWnCcF7](https://openreview.net/forum?id=HJxeWnCcF7)
+- `ProductSpaceVAE`: Skopek et al. "Mixed-Curvature Variational Autoencoders." ICLR 2020. [https://openreview.net/forum?id=S1g6xeSKDS](https://openreview.net/forum?id=S1g6xeSKDS)
+- `SiameseNetwork`: Based on Siamese networks: Chopra et al. "Learning a Similarity Metric Discriminatively, with Application to Face Verification." CVPR 2005. [https://ieeexplore.ieee.org/document/1467314](https://ieeexplore.ieee.org/document/1467314)
+- `ProductSpaceDT` and `ProductSpaceRF`: Chlenski et al. "Mixed Curvature Decision Trees and Random Forests." ICML 2025. [https://arxiv.org/abs/2410.13879](https://arxiv.org/abs/2410.13879)
+- `KappaGCN`: Bachmann et al. "Constant Curvature Graph Convolutional Networks." ICML 2020. [https://proceedings.mlr.press/v119/bachmann20a.html](https://proceedings.mlr.press/v119/bachmann20a.html)
+- `ProductSpacePerceptron` and `ProductSpaceSVM`: Tabaghi et al. "Linear Classifiers in Product Space Forms." arXiv 2021. [https://arxiv.org/abs/2102.10204](https://arxiv.org/abs/2102.10204)
+- `RiemannianFuzzyKMeans` and `RiemannianAdan`: Yuan et al. "Riemannian Fuzzy K-Means." OpenReview 2025. [https://openreview.net/forum?id=9VmOgMN4Ie](https://openreview.net/forum?id=9VmOgMN4Ie)
+- Delta-hyperbolicity computation: Based on Gromov's δ-hyperbolicity metric for tree-likeness of metric spaces. Gromov, M. "Hyperbolic Groups." Essays in Group Theory, 1987. [https://link.springer.com/chapter/10.1007/978-1-4613-9586-7_3](https://link.springer.com/chapter/10.1007/978-1-4613-9586-7_3)
+- Sectional curvature estimation: Gu et al. "Learning Mixed-Curvature Representations in Product Spaces." ICLR 2019. [https://openreview.net/forum?id=HJxeWnCcF7](https://openreview.net/forum?id=HJxeWnCcF7)
+- Greedy signature selection: Tabaghi et al. "Linear Classifiers in Product Space Forms." arXiv 2021. [https://arxiv.org/abs/2102.10204](https://arxiv.org/abs/2102.10204)
````
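In the quick example above, a signature is a list of `(curvature, dimension)` pairs, so `[(1.0, 4), (-1.0, 4)]` denotes the product S^4 × H^4. A small helper makes the convention explicit; `describe_signature` is a hypothetical illustration, not part of the `manify` API:

```python
def describe_signature(signature: list[tuple[float, int]]) -> str:
    """Render a product-manifold signature as a human-readable string.

    Positive curvature -> spherical (S), negative -> hyperbolic (H),
    zero -> Euclidean (E), following the (curvature, dimension) pairs
    used in the README example. Illustrative only.
    """
    parts = []
    for curvature, dim in signature:
        if curvature > 0:
            parts.append(f"S^{dim}")
        elif curvature < 0:
            parts.append(f"H^{dim}")
        else:
            parts.append(f"E^{dim}")
    return " x ".join(parts)


print(describe_signature([(1.0, 4), (-1.0, 4)]))  # S^4 x H^4
```

The candidate components used later in greedy selection, `((-1.0, 2), (0.0, 2), (1.0, 2))`, render as `H^2`, `E^2`, and `S^2` under the same convention.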
## manify/curvature_estimation/_pipelines.py (new file, 112 additions)

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from typing import Any, Literal

    from jaxtyping import Float

import torch
from sklearn.model_selection import train_test_split

from ..embedders._losses import distortion_loss
from ..embedders.coordinate_learning import CoordinateLearning
from ..manifolds import ProductManifold
from ..predictors._base import BasePredictor
from ..predictors.decision_tree import ProductSpaceDT


def distortion_pipeline(
    pm: ProductManifold,
    dists: Float[torch.Tensor, "n_nodes n_nodes"],
    embedder_init_kwargs: dict[str, Any] | None = None,
    embedder_fit_kwargs: dict[str, Any] | None = None,
) -> float:
    """Distortion-based pipeline for greedy signature selection.

    Embeds `dists` into `pm` via coordinate learning and scores the result
    by its distortion loss.

    Args:
        pm: Product manifold to use for the pipeline.
        dists: Pairwise distances to approximate.
        embedder_init_kwargs: Additional keyword arguments for initializing the embedder model.
        embedder_fit_kwargs: Additional keyword arguments for fitting the embedder model.

    Returns:
        The distortion loss of the learned embeddings (lower is better).
    """
    if embedder_init_kwargs is None:
        embedder_init_kwargs = {}
    if embedder_fit_kwargs is None:
        embedder_fit_kwargs = {}

    dists = dists.to(pm.device)
    dists_rescaled = dists / dists.max()

    # Initialize and fit the embedder model
    model = CoordinateLearning(pm=pm, device=pm.device, **embedder_init_kwargs)
    model.fit(X=None, D=dists_rescaled, **embedder_fit_kwargs)

    # Loss is the distortion loss of the new embeddings
    embeddings = model.embeddings_
    new_dists = pm.pdist(X=embeddings)
    return float(distortion_loss(new_dists, dists_rescaled).item())


def classifier_pipeline(
    pm: ProductManifold,
    dists: Float[torch.Tensor, "n_nodes n_nodes"],
    labels: Float[torch.Tensor, "n_nodes"],
    classifier: type[BasePredictor] = ProductSpaceDT,
    task: Literal["classification", "regression"] = "classification",
    embedder_init_kwargs: dict[str, Any] | None = None,
    embedder_fit_kwargs: dict[str, Any] | None = None,
    model_init_kwargs: dict[str, Any] | None = None,
    model_fit_kwargs: dict[str, Any] | None = None,
) -> float:
    """Classifier-based pipeline for greedy signature selection.

    Args:
        pm: Product manifold to use for the pipeline.
        dists: Pairwise distances to approximate.
        labels: Labels for the nodes, used for training the classifier.
        classifier: Classifier to use for evaluating the signature.
        task: Task type, either "classification" or "regression".
        embedder_init_kwargs: Additional keyword arguments for initializing the coordinate learning model.
        embedder_fit_kwargs: Additional keyword arguments for fitting the coordinate learning model.
        model_init_kwargs: Additional keyword arguments for initializing the classifier.
        model_fit_kwargs: Additional keyword arguments for fitting the classifier.

    Returns:
        The negated test-set accuracy (classification) or the test-set score
        (regression) of the classifier after embedding the distances, so that
        lower is always better.
    """
    if embedder_init_kwargs is None:
        embedder_init_kwargs = {}
    if embedder_fit_kwargs is None:
        embedder_fit_kwargs = {}
    if model_init_kwargs is None:
        model_init_kwargs = {}
    if model_fit_kwargs is None:
        model_fit_kwargs = {}

    dists = dists.to(pm.device)
    dists_rescaled = dists / dists.max()

    # Embedding step
    embedder = CoordinateLearning(pm=pm, device=pm.device, **embedder_init_kwargs)
    embedder.fit(X=None, D=dists_rescaled, **embedder_fit_kwargs)
    X = embedder.embeddings_

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, labels)

    # Train classifier
    model_init_kwargs["task"] = task
    model = classifier(pm=pm, **model_init_kwargs)
    model.fit(X=X_train, y=y_train, **model_fit_kwargs)
    loss = model.score(X=X_test, y=y_test)

    # For classification we want to maximize accuracy; for regression we minimize
    # MSE, so negate the score only in the classification case.
    return -loss if task == "classification" else loss
```
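The exact form of `manify`'s `distortion_loss` is not shown in this diff. A standard average-distortion objective for metric embeddings is the mean of `|d_est / d_true − 1|` over distinct pairs, which is zero exactly when distances are preserved. A dependency-free sketch under that assumption (`avg_distortion` is an illustrative stand-in, not the library's implementation):

```python
def avg_distortion(est_dists: list[list[float]], true_dists: list[list[float]]) -> float:
    """Average pairwise distortion |d_est / d_true - 1| over distinct pairs.

    Pairs with zero true distance (e.g. the diagonal) are skipped. This is a
    common embedding-quality objective; manify's `distortion_loss` may differ
    in detail.
    """
    n = len(true_dists)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if true_dists[i][j] > 0:
                total += abs(est_dists[i][j] / true_dists[i][j] - 1.0)
                pairs += 1
    return total / pairs if pairs else 0.0


# A perfect embedding has zero distortion:
d = [[0.0, 1.0], [1.0, 0.0]]
print(avg_distortion(d, d))  # 0.0
```

Both pipelines return a scalar where lower is better, which is what lets `greedy_signature_selection` treat them interchangeably.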

## manify/curvature_estimation/greedy_method.py (38 additions, 13 deletions)

```diff
@@ -8,36 +8,61 @@
 
 from typing import TYPE_CHECKING
 
-import torch
-
 if TYPE_CHECKING:
-    from jaxtyping import Float
+    from collections.abc import Callable, Iterable
+    from typing import Any
 
 from ..manifolds import ProductManifold
+from ._pipelines import distortion_pipeline
 
 
 def greedy_signature_selection(
-    pm: ProductManifold,
-    dists: Float[torch.Tensor, "n_points n_points"],
-    candidate_components: tuple[tuple[float, int], ...] = ((-1.0, 2), (0.0, 2), (1.0, 2)),
+    candidate_components: Iterable[tuple[float, int]] = ((-1.0, 2), (0.0, 2), (1.0, 2)),
     max_components: int = 3,
-) -> ProductManifold:
+    pipeline: Callable[..., float] = distortion_pipeline,
+    **kwargs: Any,
+) -> tuple[ProductManifold, list[float]]:
     r"""Greedily estimates an optimal product manifold signature.
 
     This implements the greedy signature selection algorithm that incrementally builds a product manifold
     by selecting components that best preserve distances. At each step, it chooses the manifold component
     that maximizes distortion reduction.
 
     Args:
-        pm: Initial product manifold to use as starting point.
-        dists: Pairwise distance matrix to approximate.
         candidate_components: Candidate (curvature, dimension) pairs to consider.
         max_components: Maximum number of components to include.
+        pipeline: Function that takes a ProductManifold, plus additional arguments, and returns a loss value.
+        **kwargs: Additional keyword arguments to pass to the pipeline function.
 
     Returns:
         optimal_pm: Optimized product manifold with the selected signature.
-
-    Note:
-        This function is not yet implemented.
+        loss_history: Pipeline loss after each accepted component.
     """
-    raise NotImplementedError
+    # Initialize variables
+    signature: list[tuple[float, int]] = []
+    loss_history: list[float] = []
+    current_loss = float("inf")
+    candidate_components_list = list(candidate_components)  # For type-safe iteration
+
+    # Greedy loop
+    for _ in range(max_components):
+        best_loss, best_idx = current_loss, -1
+
+        # Try each candidate
+        for idx, comp in enumerate(candidate_components_list):
+            pm = ProductManifold(signature=signature + [comp])
+            loss = pipeline(pm, **kwargs)
+            if loss < best_loss:
+                best_loss, best_idx = loss, idx
+
+        # If no improvement, stop
+        if best_idx < 0:
+            break
+
+        # Otherwise accept that component
+        signature.append(candidate_components_list[best_idx])
+        current_loss = best_loss
+        loss_history.append(current_loss)
+
+    # Return final manifold
+    optimal_pm = ProductManifold(signature=signature)
+    return optimal_pm, loss_history
```
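Because `pipeline` is just a callable from a manifold to a loss, the greedy control flow can be exercised in isolation with a mock pipeline. The sketch below mirrors the loop in `greedy_signature_selection` but operates on raw signatures so it needs no `manify` objects; `greedy_select` and `mock_pipeline` are illustrative names, not library APIs:

```python
from typing import Callable

Signature = list[tuple[float, int]]


def greedy_select(
    candidates: list[tuple[float, int]],
    max_components: int,
    pipeline: Callable[[Signature], float],
) -> tuple[Signature, list[float]]:
    """Greedy component selection mirroring greedy_signature_selection."""
    signature: Signature = []
    history: list[float] = []
    current = float("inf")
    for _ in range(max_components):
        best, best_idx = current, -1
        # Score each candidate appended to the signature accepted so far
        for idx, comp in enumerate(candidates):
            loss = pipeline(signature + [comp])
            if loss < best:
                best, best_idx = loss, idx
        if best_idx < 0:  # no candidate improves the loss: stop early
            break
        signature.append(candidates[best_idx])
        current = best
        history.append(current)
    return signature, history


# Mock pipeline: pretend hyperbolic components help, with diminishing returns
# and a small per-component penalty.
def mock_pipeline(sig: Signature) -> float:
    hyperbolic = sum(1 for curvature, _ in sig if curvature < 0)
    return 1.0 / (1 + hyperbolic) + 0.01 * len(sig)


sig, hist = greedy_select([(-1.0, 2), (0.0, 2), (1.0, 2)], 3, mock_pipeline)
print(sig)  # [(-1.0, 2), (-1.0, 2), (-1.0, 2)]
```

With this mock, each round picks the hyperbolic candidate and the loss history decreases monotonically, matching the early-stopping contract of the real function (stop as soon as no candidate beats the current loss).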

## tests/test_curvature_estimation.py (46 additions, 0 deletions)

```diff
@@ -1,7 +1,10 @@
 import torch
 
+from manify.curvature_estimation._pipelines import classifier_pipeline, distortion_pipeline
 from manify.curvature_estimation.delta_hyperbolicity import sampled_delta_hyperbolicity, vectorized_delta_hyperbolicity
+from manify.curvature_estimation.greedy_method import greedy_signature_selection
 from manify.manifolds import ProductManifold
+from manify.utils.dataloaders import load_hf
 
 
 def iterative_delta_hyperbolicity(D, reference_idx=0, relative=True):
@@ -67,3 +70,46 @@ def test_delta_hyperbolicity():
     assert torch.allclose(sampled_deltas, vectorized_deltas[indices[:, 0], indices[:, 1], indices[:, 2]], atol=1e-5), (
         "Sampled deltas should be close to vectorized deltas."
     )
+
+
+def test_greedy_method():
+    # Get a very small subset of the polblogs dataset
+    _, D, _, y = load_hf("polblogs")
+    D = D[:128, :128] / D.max()
+    y = y[:128]
+
+    max_components = 3
+    embedder_init_kwargs = {"random_state": 42}
+    embedder_fit_kwargs = {"burn_in_iterations": 10, "training_iterations": 90, "lr": 1e-2}
+
+    # Try distortion pipeline
+    optimal_pm, loss_history = greedy_signature_selection(
+        pipeline=distortion_pipeline,
+        dists=D,
+        embedder_init_kwargs=embedder_init_kwargs,
+        embedder_fit_kwargs=embedder_fit_kwargs,
+    )
+    assert len(optimal_pm.signature) == len(loss_history)
+    assert len(optimal_pm.signature) <= max_components
+    assert len(optimal_pm.signature) > 0, "Optimal signature should not be empty"
+    assert len(loss_history) > 0, "Loss history should not be empty"
+    if len(loss_history) > 1:
+        assert loss_history[-1] < loss_history[0], "Loss should decrease over iterations"
+
+    # Try classifier pipeline
+    optimal_pm, loss_history = greedy_signature_selection(
+        pipeline=classifier_pipeline,
+        labels=y,
+        dists=D,
+        embedder_init_kwargs=embedder_init_kwargs,
+        embedder_fit_kwargs=embedder_fit_kwargs,
+    )
+    assert len(optimal_pm.signature) == len(loss_history)
+    assert len(optimal_pm.signature) <= max_components
+    assert len(optimal_pm.signature) > 0, "Optimal signature should not be empty"
+    assert len(loss_history) > 0, "Loss history should not be empty"
+    if len(loss_history) > 1:
+        assert loss_history[-1] < loss_history[0], "Loss should decrease over iterations"
```
