This repository was archived by the owner on Jan 8, 2026. It is now read-only.

Commit a520288

Correct typing
1 parent c94b369 commit a520288

File tree

2 files changed: +20 −19 lines changed


clusx/clustering/models.py

Lines changed: 13 additions & 14 deletions
@@ -374,9 +374,9 @@ def _calculate_cluster_probabilities(
         scores.append(prior_new + new_cluster_likelihood)
 
         # Convert log scores to probabilities
-        scores = np.array(scores)
-        scores -= logsumexp(scores)  # type: ignore
-        probabilities = np.exp(scores)  # type: np.ndarray
+        scores_array = np.array(scores)
+        scores_array -= logsumexp(scores_array)  # type: ignore
+        probabilities = np.exp(scores_array)
 
         # Add placeholder for new cluster ID
         extended_cluster_ids = cluster_ids + [None]  # None represents new cluster
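Renaming the array to `scores_array` stops the `scores` list name from being rebound to an `ndarray`, which is what tripped the type checker. As a standalone sketch (the per-cluster log scores below are hypothetical values), the normalization this hunk performs looks like:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical unnormalized per-cluster log scores (prior + likelihood).
log_scores = [-1200.0, -1201.0, -1203.0]

# Numerically stable conversion from log scores to probabilities:
# subtracting logsumexp shifts the largest exponent to 0, so np.exp
# cannot underflow the way a naive np.exp(log_scores) would here.
scores_array = np.array(log_scores)
scores_array -= logsumexp(scores_array)  # now log-probabilities
probabilities = np.exp(scores_array)

assert np.isclose(probabilities.sum(), 1.0)
```

A direct `np.exp([-1200, ...])` would return all zeros in float64, making the subsequent sampling step degenerate; the log-sum-exp shift preserves the relative weights exactly.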
@@ -424,15 +424,14 @@ def _create_or_update_cluster(
 
         # Update existing cluster
         assert existing_cluster_id is not None
-        cid = existing_cluster_id
-        params = self.cluster_params[cid]
+        params = self.cluster_params[existing_cluster_id]
         params["count"] += 1
-        params["mean"] = self._normalize(
-            params["mean"] * (params["count"] - 1) + embedding
-        )
-        self.clusters.append(cid)
 
-        return cid
+        result = (params["mean"] * (params["count"] - 1) + embedding).astype(np.float32)
+        params["mean"] = self._normalize(result)
+        self.clusters.append(existing_cluster_id)
+
+        return existing_cluster_id
 
     def assign_cluster(self, embedding: NDArray[np.float32]) -> tuple[int, np.ndarray]:
         """
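The rewritten update can be exercised in isolation. The sketch below stands in a hypothetical `normalize` helper for the class's `_normalize`; note that the count is incremented first, so `count - 1` recovers the previous member count when rescaling the running directional mean, and the explicit `.astype(np.float32)` is what keeps the mean's dtype stable across updates (the typing concern this commit targets):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # L2-normalize, mirroring a hypothetical _normalize helper; the cast
    # keeps float32 even when np.linalg.norm promotes to float64.
    return (v / np.linalg.norm(v)).astype(np.float32)

# A toy cluster with 2 members whose directional mean points along x.
params = {"count": 2, "mean": normalize(np.array([1.0, 0.0], dtype=np.float32))}
embedding = normalize(np.array([0.0, 1.0], dtype=np.float32))

# Incremental update: rescale the unit mean by the previous count,
# add the new embedding, cast, then re-normalize to the unit sphere.
params["count"] += 1
result = (params["mean"] * (params["count"] - 1) + embedding).astype(np.float32)
params["mean"] = normalize(result)
```

Because the stored mean is kept unit-length (a directional mean, as fits a vMF likelihood), the rescale-add-renormalize sequence is the whole update; without the casts, NumPy's promotion rules can silently widen the mean to float64.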
@@ -777,7 +776,7 @@ def log_pyp_prior(self, cluster_id: Optional[int] = None) -> float:
 
         # Prior for an existing cluster: (n_k - sigma) / (n + alpha)
         assert "count" in self.cluster_params[cluster_id]
-        count = self.cluster_params[cluster_id]["count"]
+        count = int(self.cluster_params[cluster_id]["count"])
         numerator = count - self.sigma
 
         # If numerator is negative or zero, use a small positive value
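For context, the existing-cluster prior this hunk touches can be sketched as a free function. The parameter names and the `1e-10` floor below are illustrative (the diff only says a small positive value is used when the numerator is non-positive); casting the stored count to a plain `int` keeps the arithmetic type-checker friendly when counts are held as NumPy scalars:

```python
import math

def log_pyp_prior_existing(n_k: int, n: int, sigma: float, alpha: float) -> float:
    """Log Pitman-Yor prior for joining an existing cluster with n_k members,
    given n total points, discount sigma, and concentration alpha:
    log((n_k - sigma) / (n + alpha))."""
    # Guard against a non-positive numerator (possible when n_k <= sigma),
    # using an illustrative small positive floor.
    numerator = max(n_k - sigma, 1e-10)
    return math.log(numerator / (n + alpha))

value = log_pyp_prior_existing(n_k=5, n=100, sigma=0.5, alpha=1.0)
```

With `n_k = 5`, `sigma = 0.5`, `n = 100`, `alpha = 1.0` this evaluates `log(4.5 / 101)`, matching the `(n_k - sigma) / (n + alpha)` formula quoted in the code comment above.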
@@ -837,9 +836,9 @@ def _calculate_cluster_probabilities(
         scores.append(prior_new + new_cluster_likelihood)
 
         # Convert log scores to probabilities
-        scores = np.array(scores)
-        scores -= logsumexp(scores)  # type: ignore
-        probabilities = np.exp(scores)
+        scores_array = np.array(scores)
+        scores_array -= logsumexp(scores_array)  # type: ignore
+        probabilities = np.exp(scores_array)
 
         # Add placeholder for new cluster ID
         extended_cluster_ids = cluster_ids + [None]  # None represents new cluster

docs/source/methodological_framework.rst

Lines changed: 7 additions & 5 deletions
@@ -12,11 +12,11 @@ This section documents the design and implementation of the nonparametric Bayesi
 Dirichlet Process Clustering
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Clusterium implements text clustering using the Dirichlet Process (DP), a fundamental nonparametric Bayesian model that allows for a flexible, potentially infinite number of clusters. Unlike traditional clustering algorithms that require pre-specifying the number of clusters (e.g., K-means), the Dirichlet Process automatically determines the appropriate number of clusters based on the data. The theoretical foundations for this approach were established by Ferguson [1]_.
+Clusterium implements text clustering using the Dirichlet Process (DP), a fundamental nonparametric Bayesian model that allows for a flexible, potentially infinite number of clusters. Unlike traditional clustering algorithms that require pre-specifying the number of clusters (e.g., K-means), the DP automatically determines the appropriate number of clusters based on the data. The theoretical foundations for this approach were established by Ferguson [1]_.
 
 **Mathematical Foundation:**
 
-In Clusterium's implementation, the Dirichlet Process is realized through the Chinese Restaurant Process (CRP) formulation. The prior probability of a document joining an existing cluster or creating a new one follows:
+In Clusterium's implementation, the DP is realized through the Chinese Restaurant Process (CRP) formulation. The prior probability of a document joining an existing cluster or creating a new one follows:
 
 .. math::
 
@@ -61,7 +61,7 @@ These properties make vMF particularly suitable for clustering in high-dimension
 
 **Algorithm Overview:**
 
-The Dirichlet Process clustering algorithm in Clusterium follows these key steps:
+The DP clustering algorithm in Clusterium follows these key steps:
 
 1. **Embedding Generation**: Transform documents into normalized vector representations using a pretrained language model.
 
@@ -91,7 +91,7 @@ Clusterium's implementation includes several important design decisions that aff
 
 **Stochastic Properties and Document Order Sensitivity:**
 
-A critical aspect of the Dirichlet Process implementation is its sequential, stochastic nature. Since documents are processed one at a time following the Chinese Restaurant Process, several important properties emerge:
+A critical aspect of the DP implementation is its sequential, stochastic nature. Since documents are processed one at a time following the Chinese Restaurant Process, several important properties emerge:
 
 1. **Order Dependency**: The final clustering outcome is sensitive to the order in which documents are processed. This sensitivity arises because:
 
@@ -116,7 +116,7 @@ To mitigate order dependency in production applications, randomly shuffling docu
 
 **Parameter Tuning:**
 
-The Dirichlet Process clustering model is governed by two key parameters that significantly influence clustering behavior from an academic perspective:
+The DP clustering model is governed by two key parameters that significantly influence clustering behavior from an academic perspective:
 
 1. **Alpha (α)**: The concentration parameter that controls cluster proliferation.
 
@@ -135,6 +135,8 @@ The interaction between these parameters creates distinct clustering profiles. F
 Pitman-Yor Process Clustering
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+Clustering using the Pitman-Yor Process (PYP) is generally better suited for text data as it can model the power-law distributions common in natural language.
+
 .. note::
 
    This section is currently under development and will be added in a future update.
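The power-law claim added to the docs can be illustrated with a toy Chinese Restaurant Process simulation, independent of Clusterium's implementation: with a positive discount σ the number of occupied tables grows polynomially in n, while the DP (σ = 0) grows only logarithmically. All parameter values here are illustrative:

```python
import random

def crp_cluster_count(n: int, alpha: float, sigma: float, seed: int = 0) -> int:
    """Seat n customers by the (Pitman-Yor) CRP and return the table count.
    sigma = 0 recovers the plain Dirichlet Process."""
    rng = random.Random(seed)
    counts: list[int] = []
    for i in range(n):
        # Existing table k has weight (n_k - sigma); a new table has
        # weight (alpha + sigma * K). Weights sum to i + alpha.
        weights = [c - sigma for c in counts]
        weights.append(alpha + sigma * len(counts))
        r = rng.random() * (i + alpha)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[k] += 1
    return len(counts)

dp_k = crp_cluster_count(2000, alpha=1.0, sigma=0.0)   # ~ alpha * log(n) tables
pyp_k = crp_cluster_count(2000, alpha=1.0, sigma=0.5)  # ~ n**sigma tables
```

At n = 2000 the DP run yields on the order of α·ln(n) ≈ 8 clusters, while the σ = 0.5 PYP run yields tens of clusters, which is the long-tailed behavior that fits word and topic frequencies in natural language.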
