This repository was archived by the owner on Jan 8, 2026. It is now read-only.
docs/source/methodological_framework.rst
Lines changed: 7 additions & 5 deletions
@@ -12,11 +12,11 @@ This section documents the design and implementation of the nonparametric Bayesi
Dirichlet Process Clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Clusterium implements text clustering using the Dirichlet Process (DP), a fundamental nonparametric Bayesian model that allows for a flexible, potentially infinite number of clusters. Unlike traditional clustering algorithms that require pre-specifying the number of clusters (e.g., K-means), the Dirichlet Process automatically determines the appropriate number of clusters based on the data. The theoretical foundations for this approach were established by Ferguson [1]_.
+Clusterium implements text clustering using the Dirichlet Process (DP), a fundamental nonparametric Bayesian model that allows for a flexible, potentially infinite number of clusters. Unlike traditional clustering algorithms that require pre-specifying the number of clusters (e.g., K-means), the DP automatically determines the appropriate number of clusters based on the data. The theoretical foundations for this approach were established by Ferguson [1]_.
**Mathematical Foundation:**
-In Clusterium's implementation, the Dirichlet Process is realized through the Chinese Restaurant Process (CRP) formulation. The prior probability of a document joining an existing cluster or creating a new one follows:
+In Clusterium's implementation, the DP is realized through the Chinese Restaurant Process (CRP) formulation. The prior probability of a document joining an existing cluster or creating a new one follows:
.. math::
@@ -61,7 +61,7 @@ These properties make vMF particularly suitable for clustering in high-dimension
**Algorithm Overview:**
-The Dirichlet Process clustering algorithm in Clusterium follows these key steps:
+The DP clustering algorithm in Clusterium follows these key steps:
1. **Embedding Generation**: Transform documents into normalized vector representations using a pretrained language model.
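The normalization part of the embedding step can be illustrated with a minimal sketch. It assumes the embeddings are already available as plain lists of floats; the helper name ``l2_normalize`` and the sample vectors are hypothetical and are not taken from Clusterium. Scaling each vector to unit length places it on the unit hypersphere, which is the domain on which the vMF distribution is defined.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vector))
    # eps guards against division by zero for an all-zero vector.
    return [x / max(norm, eps) for x in vector]


# Stand-in values for the output of a pretrained language model.
raw_embeddings = [[3.0, 4.0], [1.0, 0.0], [0.5, -0.5]]
unit_embeddings = [l2_normalize(v) for v in raw_embeddings]
```

After this step every vector has norm 1, so the dot product of two embeddings equals their cosine similarity.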
@@ -91,7 +91,7 @@ Clusterium's implementation includes several important design decisions that aff
**Stochastic Properties and Document Order Sensitivity:**
-A critical aspect of the Dirichlet Process implementation is its sequential, stochastic nature. Since documents are processed one at a time following the Chinese Restaurant Process, several important properties emerge:
+A critical aspect of the DP implementation is its sequential, stochastic nature. Since documents are processed one at a time following the Chinese Restaurant Process, several important properties emerge:
1. **Order Dependency**: The final clustering outcome is sensitive to the order in which documents are processed. This sensitivity arises because:
@@ -116,7 +116,7 @@ To mitigate order dependency in production applications, randomly shuffling docu
**Parameter Tuning:**
-The Dirichlet Process clustering model is governed by two key parameters that significantly influence clustering behavior from an academic perspective:
+The DP clustering model is governed by two key parameters that significantly influence clustering behavior from an academic perspective:
1. **Alpha (α)**: The concentration parameter that controls cluster proliferation.
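To make the effect of α concrete, the sketch below evaluates the standard textbook CRP prior: with ``n`` documents already assigned, an existing cluster of size ``n_k`` is chosen with probability ``n_k / (n + α)`` and a new cluster with probability ``α / (n + α)``. The function name ``crp_prior`` and the example sizes are hypothetical. This covers the prior only; it is not Clusterium's code, which is not shown in this diff.

```python
def crp_prior(cluster_sizes, alpha):
    """Textbook CRP prior for the next document (hypothetical helper).

    Returns the probabilities of joining each existing cluster and the
    probability of opening a new one.
    """
    n = sum(cluster_sizes)  # documents assigned so far
    denom = n + alpha
    existing = [size / denom for size in cluster_sizes]
    new = alpha / denom
    return existing, new


sizes = [5, 3, 2]  # illustrative cluster sizes
for alpha in (0.1, 1.0, 10.0):
    existing, new = crp_prior(sizes, alpha)
    print(f"alpha={alpha}: P(new cluster)={new:.3f}")
```

Larger values of α shift probability mass toward opening a new cluster, while larger existing clusters attract proportionally more of the remaining mass.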
@@ -135,6 +135,8 @@ The interaction between these parameters creates distinct clustering profiles. F
Pitman-Yor Process Clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Clustering using the Pitman-Yor Process (PYP) is generally better suited for text data as it can model the power-law distributions common in natural language.
+
.. note::
   This section is currently under development and will be added in a future update.
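As a hedged illustration of how the PYP differs from the DP, the sketch below uses the standard two-parameter predictive rule: an existing cluster of size ``n_k`` is chosen with probability ``(n_k - d) / (n + θ)`` and a new cluster with probability ``(θ + d·K) / (n + θ)``, where ``K`` is the current number of clusters. The names ``pyp_prior``, ``simulate_cluster_count``, ``theta``, and ``discount`` are hypothetical, and the code assumes ``theta > 0`` and ``0 <= discount < 1``. It is not Clusterium's implementation, which this section describes as still under development.

```python
import random


def pyp_prior(cluster_sizes, theta, discount):
    """Textbook Pitman-Yor predictive probabilities (hypothetical helper)."""
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    denom = n + theta
    existing = [(size - discount) / denom for size in cluster_sizes]
    new = (theta + discount * k) / denom
    return existing, new


def simulate_cluster_count(num_docs, theta, discount, seed=0):
    """Sample cluster assignments from the prior alone and count clusters."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_docs):
        existing, new = pyp_prior(sizes, theta, discount)
        # The last index stands for "open a new cluster".
        choice = rng.choices(range(len(sizes) + 1), weights=existing + [new])[0]
        if choice == len(sizes):
            sizes.append(1)
        else:
            sizes[choice] += 1
    return len(sizes)


dp_clusters = simulate_cluster_count(2000, theta=1.0, discount=0.0)
pyp_clusters = simulate_cluster_count(2000, theta=1.0, discount=0.5)
print(dp_clusters, pyp_clusters)
```

With ``discount=0`` the rule reduces to the CRP prior. A positive discount raises the new-cluster probability as clusters accumulate, so the expected number of clusters grows as a power of the number of documents rather than logarithmically.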