Hi, I encountered numerical overflow issues with the current implementation of the all_points_core_distance function, specifically when raising 1 / d_ij to a large power d. For large d values or very small distances, this can easily lead to extremely large numbers and subsequent overflow. In #197 a comment already mentioned this problem.
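The overflow described above can be reproduced on a handful of small distances. The following is only a minimal sketch: the helper names `naive_inverse_power_mean` and `log_space_inverse_power_mean` and the sample values are hypothetical and are not part of the library.

```python
import numpy as np


def naive_inverse_power_mean(dists, d):
    """Mean of (1 / dist) ** d computed directly; overflows to inf for large d."""
    with np.errstate(over="ignore"):  # silence the overflow warning for the demo
        return np.mean((1.0 / dists) ** d)


def log_space_inverse_power_mean(dists, d):
    """Natural log of the same mean, without ever forming (1 / dist) ** d."""
    terms = -d * np.log(dists)        # log of each (1 / dist) ** d term
    shift = terms.max()               # subtract the largest term before exponentiating
    return shift + np.log(np.mean(np.exp(terms - shift)))


# Hypothetical inputs: three small distances and a large exponent.
dists = np.array([1e-4, 2e-4, 5e-4])
d = 100.0

naive = naive_inverse_power_mean(dists, d)         # overflows in float64
log_mean = log_space_inverse_power_mean(dists, d)  # stays finite
core_distance = np.exp(-log_mean / d)              # exponentiate only at the end
```

Here the direct mean overflows because `(1 / 1e-4) ** 100` is far beyond the float64 range, while the log-space value stays finite and can still be turned back into a core distance.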
To avoid this, I moved the computation into log space, which keeps intermediate values at a manageable scale, and exponentiate only the final result. Here’s the original code snippet:
def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.
    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.
    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.
    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster
    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
        distance_matrix != 0]) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1
    if result.sum() == 0:
        result = np.zeros(len(distance_matrix))
    else:
        result **= (-1.0 / d)
    return result

Below is a modified version that uses logarithms to avoid numerical overflow. I tested this on several examples and it produces results identical to the original method:
import numpy as np
from scipy.special import logsumexp
def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.
    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.
    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.
    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster
    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    N = distance_matrix.shape[0]
    dists = distance_matrix.copy()
    dists[dists == 0] = np.nan
    s_ij = -d * np.log(dists)
    np.fill_diagonal(s_ij, -np.inf)
    log_S_i = logsumexp(s_ij, axis=1)
    log_m_i = log_S_i - np.log(N - 1)
    log_apcd_i = - (1.0 / d) * log_m_i
    apcd_i = np.exp(log_apcd_i)
    apcd_i[np.isinf(apcd_i)] = 0
    apcd_i[np.isnan(apcd_i)] = 0
    return apcd_i
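A quick way to check the claim of identical results is to run both formulations on the same matrix. The sketch below is an assumption-laden stand-in, not the code from this issue: `apcd_direct` and `apcd_logspace` are hypothetical names, `np.logaddexp.reduce` is swapped in for `scipy.special.logsumexp` so that only NumPy is needed, and zero distances are masked to `-inf` so that they are left out of the sum the way the original snippet leaves them out. As far as I can tell, that last choice differs from the NaN handling above, where a row containing an off-diagonal zero ends up as 0.

```python
import numpy as np


def apcd_direct(distance_matrix, d=2.0):
    """Direct formula, applied to a copy so the caller's matrix is untouched."""
    dm = np.array(distance_matrix, dtype=float)
    nonzero = dm != 0
    with np.errstate(over="ignore", divide="ignore"):
        dm[nonzero] = (1.0 / dm[nonzero]) ** d
        mean = dm.sum(axis=1) / (dm.shape[0] - 1)
        # Rows whose mean is 0 are mapped to 0; an overflowed (inf) mean also gives 0.
        return np.where(mean > 0, mean ** (-1.0 / d), 0.0)


def apcd_logspace(distance_matrix, d=2.0):
    """Same quantity computed in log space (NumPy-only sketch)."""
    dm = np.asarray(distance_matrix, dtype=float)
    n = dm.shape[0]
    with np.errstate(divide="ignore"):
        # Assumption: zero distances (including the diagonal) contribute nothing.
        s = np.where(dm > 0, -d * np.log(dm), -np.inf)
    log_mean = np.logaddexp.reduce(s, axis=1) - np.log(n - 1)
    # Rows with no nonzero distance have log_mean == -inf and are mapped to 0.
    return np.where(np.isfinite(log_mean), np.exp(-log_mean / d), 0.0)


# Hypothetical test data: six random points in the unit cube.
rng = np.random.default_rng(0)
points = rng.random((6, 3))
dm = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

moderate_direct = apcd_direct(dm, d=3.0)
moderate_log = apcd_logspace(dm, d=3.0)

# Shrinking the distances and raising d pushes the direct version into overflow.
tiny = dm * 1e-3
overflow_direct = apcd_direct(tiny, d=200.0)
overflow_log = apcd_logspace(tiny, d=200.0)
```

On the moderate input the two versions should agree to floating-point tolerance; on the scaled-down input the direct version collapses to zeros while the log-space version stays positive and finite.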