
Add option for numba acceleration. Improve tests #73

Merged
bjoaofelipe merged 13 commits into KrishnaswamyLab:master from MattScicluna:use_numba
Aug 31, 2025
Conversation

@MattScicluna
Contributor

No description provided.

@MattScicluna MattScicluna mentioned this pull request Aug 19, 2025
@bjoaofelipe bjoaofelipe marked this pull request as ready for review August 27, 2025 00:05
@bjoaofelipe
Contributor

Ran into two errors:

======================================================================
ERROR: test_random_landmarking.test_random_landmarking_with_non_euclidean_distances
Test random landmarking with non-euclidean distance metrics

Traceback (most recent call last):
File "/home/runner/work/graphtools/graphtools/test/test_random_landmarking.py", line 383, in test_random_landmarking_with_non_euclidean_distances
warnings.warn(
UserWarning: Random landmarking may not be respecting the distance parameter. All distance metrics produced identical cluster assignments.

======================================================================
FAIL: test_landmark.test_set_params

Traceback (most recent call last):
File "/home/runner/work/graphtools/graphtools/test/test_landmark.py", line 205, in test_set_params
assert G.get_params() == {
^^^^^^^^^^^^^^^^^^^
AssertionError


@bjoaofelipe
Contributor

The distance parameter is not being used; we need to start passing it downstream. Every algorithm currently falls back to the default Euclidean distance.
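
For illustration, a minimal sketch of the symptom, assuming the public graphtools.Graph API with its n_landmark, distance, and random_state parameters (the exact call and the clusters attribute are used here as an assumption-based sketch, not a confirmed reproduction):

    import numpy as np
    import graphtools

    X = np.random.default_rng(42).normal(size=(200, 20))

    # Two landmark graphs that differ only in the distance metric.
    G_euc = graphtools.Graph(X, n_landmark=10, distance="euclidean", random_state=42)
    G_cos = graphtools.Graph(X, n_landmark=10, distance="cosine", random_state=42)

    # If distance were passed downstream, the landmark cluster
    # assignments would generally differ between metrics.
    print(np.array_equal(G_euc.clusters, G_cos.clusters))  # True per the CI warning above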

@bjoaofelipe
Contributor

Found the error in:

    def build_landmark_op(self):
        """Build the landmark operator

        Calculates spectral clusters on the kernel, and calculates transition
        probabilities between cluster centers by using transition probabilities
        between samples assigned to each cluster.

        random_landmarking:
        This method randomly selects n_landmark points and assigns each sample
        to its nearest landmark using Euclidean distance.
        """
        if self.random_landmarking:
            with _logger.log_task("landmark operator"):
                is_sparse = sparse.issparse(self.kernel)
                n_samples = self.data.shape[0]
                rng = np.random.default_rng(self.random_state)
                landmark_indices = rng.choice(n_samples, self.n_landmark, replace=False)
                # uses the scaled data (data_nu) when available; to review
                data = self.data if not hasattr(self, "data_nu") else self.data_nu
                # sklearn's euclidean_distances is faster than cdist for large datasets
                if n_samples > 5000:
                    distances = euclidean_distances(data, data[landmark_indices])
                else:
                    distances = cdist(data, data[landmark_indices], metric="euclidean")
                self._clusters = np.argmin(distances, axis=1)

We are forcing Euclidean distances here, which is not the expected behavior.
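
One way the assignment step could respect the configured metric, sketched as a standalone helper (the assign_to_landmarks name is hypothetical; in the class this would read the metric from the graph's distance parameter rather than hard-coding "euclidean"):

    import numpy as np
    from scipy.spatial.distance import cdist

    def assign_to_landmarks(data, landmark_indices, metric="euclidean"):
        """Assign each sample to its nearest landmark under the given metric."""
        # cdist accepts any scipy metric name ("cosine", "cityblock", ...),
        # so threading the configured metric through removes the hard-coding.
        distances = cdist(data, data[landmark_indices], metric=metric)
        return np.argmin(distances, axis=1)

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 5))
    landmarks = rng.choice(100, 10, replace=False)
    clusters = assign_to_landmarks(X, landmarks, metric="cosine")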

@bjoaofelipe bjoaofelipe reopened this Aug 27, 2025
@MattScicluna MattScicluna changed the title Add option for numba acceleration Add option for numba acceleration. Improve tests Aug 31, 2025
@MattScicluna
Contributor Author

MattScicluna commented Aug 31, 2025

Found the error on:

    def build_landmark_op(self):
        """Build the landmark operator

            Calculates spectral clusters on the kernel, and calculates transition
            probabilities between cluster centers by using transition probabilities
            between samples assigned to each cluster.
            
            random_landmarking:
            This method randomly selects n_landmark points and assigns each sample to its nearest landmark
            using Euclidean distance .

        """
        if self.random_landmarking :
            with _logger.log_task("landmark operator"):
                is_sparse = sparse.issparse(self.kernel)
                n_samples = self.data.shape[0]
                rng = np.random.default_rng(self.random_state)
                landmark_indices = rng.choice(n_samples, self.n_landmark, replace=False)
                data = self.data if not hasattr(self, 'data_nu') else self.data_nu # because of the scaling to review
                if n_samples > 5000:   # sklearn.euclidean_distances is faster than cdist for big dataset 
                    distances = euclidean_distances(data, data[landmark_indices])
                else:
                    distances = cdist(data, data[landmark_indices], metric="euclidean")
                self._clusters = np.argmin(distances, axis=1)

We are forcing euclidean distances here, which is not the expected behavior

I added a test to test_landmark in 04256c3 to confirm that this unexpected behaviour happens for both random landmarking and the default.
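
For reference, a sketch of the kind of check such a test can perform, assuming the same graphtools.Graph API as above (the warning text mirrors the CI log; the actual test in 04256c3 may differ in detail):

    import warnings
    import numpy as np
    import graphtools

    X = np.random.default_rng(42).normal(size=(200, 20))

    assignments = {}
    for metric in ("euclidean", "cosine", "cityblock"):
        G = graphtools.Graph(X, n_landmark=10, distance=metric, random_state=42)
        assignments[metric] = G.clusters

    # The actual test exercises both random landmarking and the default path.
    values = list(assignments.values())
    if all(np.array_equal(values[0], v) for v in values[1:]):
        warnings.warn(
            "Random landmarking may not be respecting the distance parameter. "
            "All distance metrics produced identical cluster assignments."
        )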

@bjoaofelipe
Contributor

I will create a new issue mentioning that distances are being ignored by the build_graph class.

@bjoaofelipe bjoaofelipe merged commit c59357b into KrishnaswamyLab:master Aug 31, 2025
1 of 5 checks passed