This repository was archived by the owner on Jan 8, 2026. It is now read-only.

Commit b85a21a

Update docs
1 parent fc0fcd5 commit b85a21a

3 files changed: +52 −42 lines

CONTRIBUTING.rst

Lines changed: 1 addition & 0 deletions
@@ -103,6 +103,7 @@ This project uses:
 * `Black <https://black.readthedocs.io/>`_ for code formatting
 * `isort <https://pycqa.github.io/isort/>`_ for import sorting
 * `flake8 <https://flake8.pycqa.org/>`_ for linting
+* `pylint <https://pylint.org/>`_ for static code analysis
 
 These tools are automatically run when you use pre-commit hooks.
README.rst

Lines changed: 6 additions & 6 deletions
@@ -53,8 +53,8 @@ For interactive visualization during evaluation, add the ``--show-plot`` option:
 
 The default parameters are optimized based on extensive testing:
 
-* Dirichlet Process: α=0.5, variance=0.3
-* Pitman-Yor Process: α=0.3, σ=0.3, variance=0.3
+* Dirichlet Process: α=0.5, kappa=0.3
+* Pitman-Yor Process: α=0.3, σ=0.3, kappa=0.3
 
 For advanced usage and parameter tuning, see the `Usage Guide <https://clusterium.readthedocs.io/en/latest/usage.html>`_.
 
@@ -70,11 +70,11 @@ Python API Example
     texts = load_data("your_data.txt")
 
     # Perform clustering with default parameters
-    dp = DirichletProcess(alpha=0.5)  # Dirichlet Process
-    clusters_dp, _ = dp.fit(texts)
+    dp = DirichletProcess(alpha=0.5, kappa=0.3)  # Dirichlet Process
+    clusters_dp = dp.fit_predict(texts)
 
-    pyp = PitmanYorProcess(alpha=0.3, sigma=0.3)  # Pitman-Yor Process
-    clusters_pyp, _ = pyp.fit(texts)
+    pyp = PitmanYorProcess(alpha=0.3, sigma=0.3, kappa=0.3)  # Pitman-Yor Process
+    clusters_pyp = pyp.fit_predict(texts)
 
     # Print number of clusters found
     print(f"DP found {len(set(clusters_dp))} clusters")

docs/source/usage.rst

Lines changed: 45 additions & 36 deletions
@@ -55,15 +55,18 @@ Command Line Options for ``cluster``
    * - ``--dp-alpha``
      - Concentration parameter for Dirichlet Process
      - 0.5
+   * - ``--dp-kappa``
+     - Precision parameter for Dirichlet Process likelihood model
+     - 0.3
    * - ``--pyp-alpha``
      - Concentration parameter for Pitman-Yor Process
      - 0.3
+   * - ``--pyp-kappa``
+     - Precision parameter for Pitman-Yor Process likelihood model
+     - 0.3
    * - ``--pyp-sigma``
      - Discount parameter for Pitman-Yor Process (0.0 ≤ σ < 1.0)
      - 0.3
-   * - ``--variance``
-     - Sensitivity parameter for the clustering model
-     - 0.3
    * - ``--random-seed``
      - Random seed for reproducible clustering
      - None
@@ -101,6 +104,9 @@ Command Line Options for ``evaluate``
    * - ``--output-dir``
      - Directory to save output files
      - ``output``
+   * - ``--random-seed``
+     - Random seed for reproducible evaluation
+     - None
 
 Examples
 ========
@@ -136,17 +142,18 @@ Fine-tune the clustering by adjusting the model-specific parameters:
     clusx cluster \
         --input your_data.txt \
         --dp-alpha 0.5 \
+        --dp-kappa 0.3 \
         --pyp-alpha 0.3 \
+        --pyp-kappa 0.3 \
         --pyp-sigma 0.3 \
-        --variance 0.3 \
         --random-seed 42
 
 The choice of parameters significantly affects clustering results. For example:
 
 * Lower alpha values (0.1-0.5) create fewer, larger clusters
 * Higher alpha values (1.0-5.0) create more, smaller clusters
 * For Pitman-Yor Process, sigma values between 0.1-0.7 typically work well
+* Higher kappa values make the model more sensitive to small differences between texts
 * Using the same value for both DP and PYP alpha parameters will result in dramatically different clustering behaviors
 
 For detailed guidance on parameter selection for each model, see the `Understanding Clustering Parameters`_ section below.
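The alpha bullets in the hunk above can be illustrated with a toy simulation. The sketch below is not clusx code: it simulates a plain Chinese Restaurant Process (the prior underlying Dirichlet Process clustering) to show how the concentration parameter drives the number of clusters; the function name and item count are illustrative.

```python
import random

def crp_num_clusters(n_items: int, alpha: float, seed: int = 42) -> int:
    """Seat n_items in a Chinese Restaurant Process and count the clusters."""
    rng = random.Random(seed)
    sizes: list[int] = []  # current cluster sizes
    for n in range(n_items):
        # A new cluster opens with probability alpha / (n + alpha)
        if rng.random() * (n + alpha) < alpha:
            sizes.append(1)
        else:
            # Otherwise join an existing cluster with probability
            # proportional to its current size
            r = rng.random() * n
            acc = 0
            for j, s in enumerate(sizes):
                acc += s
                if r < acc:
                    sizes[j] += 1
                    break
    return len(sizes)

print(crp_num_clusters(1000, alpha=0.1), crp_num_clusters(1000, alpha=5.0))
```

Over 1000 items, the low-alpha run yields only a handful of clusters while alpha=5.0 yields far more, matching the bullets above.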
@@ -195,7 +202,7 @@ The JSON output follows this structure:
         "model_name": "DP",
         "alpha": 1.0,
         "sigma": 0.0,
-        "variance": 0.1
+        "kappa": 0.3
       }
     }
@@ -212,11 +219,11 @@ The CSV output format provides a simple tabular view of cluster assignments:
 
 .. code-block:: text
 
-    Text,Cluster_DP,Alpha,Sigma,Variance
-    "What is the capital of France?",0,1.0,0.0,0.1
-    "What city is the capital of France?",0,1.0,0.0,0.1
-    "How tall is the Eiffel Tower?",1,1.0,0.0,0.1
-    "What is the height of the Eiffel Tower?",1,1.0,0.0,0.1
+    Text,Cluster_DP,Alpha,Sigma,Kappa
+    "What is the capital of France?",0,1.0,0.0,0.3
+    "What city is the capital of France?",0,1.0,0.0,0.3
+    "How tall is the Eiffel Tower?",1,1.0,0.0,0.3
+    "What is the height of the Eiffel Tower?",1,1.0,0.0,0.3
 
 Evaluating Clustering Results
 -----------------------------
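The CSV layout above is easy to consume with the standard library. A minimal sketch: the rows are copied from the documented example, and the grouping logic is illustrative, not part of clusx.

```python
import csv
import io

# The exact rows from the CSV example above
csv_text = """Text,Cluster_DP,Alpha,Sigma,Kappa
"What is the capital of France?",0,1.0,0.0,0.3
"What city is the capital of France?",0,1.0,0.0,0.3
"How tall is the Eiffel Tower?",1,1.0,0.0,0.3
"What is the height of the Eiffel Tower?",1,1.0,0.0,0.3
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Group texts by their DP cluster id
clusters = {}
for row in rows:
    clusters.setdefault(row["Cluster_DP"], []).append(row["Text"])

print({cid: len(texts) for cid, texts in clusters.items()})  # {'0': 2, '1': 2}
```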
@@ -310,7 +317,7 @@ for further analysis or integration with other tools. Example evaluation report
       "parameters": {
         "alpha": 1.0,
         "sigma": 0.0,
-        "variance": 0.1,
+        "kappa": 0.3,
         "random_state": 42
       },
       "cluster_stats": {
@@ -340,7 +347,7 @@ for further analysis or integration with other tools. Example evaluation report
       "parameters": {
         "alpha": 1.0,
         "sigma": 0.5,
-        "variance": 0.1,
+        "kappa": 0.3,
         "random_state": 42
       },
       "cluster_stats": {
@@ -384,17 +391,17 @@ To interpret evaluation results and improve clustering performance, it's important
      * Controls how likely the algorithm is to create new clusters
      * **Recommended range**: 0.1 to 5.0
      * **Effect**: Higher values create more clusters, lower values create fewer, larger clusters
-     * **Typical good starting value**: α=0.5 with variance=0.3
+     * **Typical good starting value**: α=0.5 with kappa=0.3
      * **Default**: 0.5
      * **Constraint**: Must be positive (α > 0)
 
-   * **variance**:
+   * **dp-kappa (precision parameter)**:
 
      * Controls the sensitivity of the clustering process
-     * **Effect**: Lower values make the model more sensitive to small differences between texts
+     * **Effect**: Higher values make the model more sensitive to small differences between texts
      * **Typical good value**: 0.3
      * **Default**: 0.3
-     * Part of the base measure for the clustering model
+     * Part of the likelihood model for the clustering process
 
 2. **Pitman-Yor Process Parameters**:
@@ -403,7 +410,7 @@
      * Similar role as in Dirichlet Process, but with different optimal ranges
      * **Recommended range**: 0.1 to 2.0
      * **Effect**: Higher values create more clusters, lower values create fewer, larger clusters
-     * **Typical good starting value**: α=0.3 with variance=0.5
+     * **Typical good starting value**: α=0.3 with kappa=0.3
      * **Default**: 0.3
      * **Constraint**: Must satisfy α > -σ (typically not an issue since σ is positive)
      * **Important**: Using the same alpha value as DP leads to dramatically different clustering behaviors
@@ -420,13 +427,13 @@
      * As sigma approaches 1.0, the distribution exhibits heavier tails (more power-law-like)
      * Higher sigma values tend to produce more small clusters and fewer large clusters
 
-   * **variance**:
+   * **pyp-kappa (precision parameter)**:
 
      * Controls the sensitivity of the clustering process
-     * **Effect**: Lower values make the model more sensitive to small differences between texts
-     * **Typical good value**: 0.5 (slightly higher than for Dirichlet Process)
-     * **Default**: 0.3 (same as for Dirichlet Process)
-     * Part of the base measure for the clustering model
+     * **Effect**: Higher values make the model more sensitive to small differences between texts
+     * **Typical good value**: 0.3 (same as for Dirichlet Process)
+     * **Default**: 0.3
+     * Part of the likelihood model for the clustering process
 
 3. **Power Law Parameters** (detected in the evaluation results, not passed as a parameter):
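The sigma bullets in the hunk above can be illustrated with a toy two-parameter Chinese Restaurant Process, the seating scheme behind Pitman-Yor clustering. This is a sketch, not clusx code; names and counts are illustrative.

```python
import random

def pitman_yor_sizes(n_items: int, alpha: float, sigma: float, seed: int = 42):
    """Two-parameter CRP: seat n_items and return cluster sizes, largest first."""
    rng = random.Random(seed)
    sizes: list[int] = []
    for n in range(n_items):
        k = len(sizes)
        # A new cluster opens with probability (alpha + k * sigma) / (n + alpha)
        if rng.random() * (n + alpha) < alpha + k * sigma:
            sizes.append(1)
        else:
            # Join cluster j with probability proportional to (size_j - sigma)
            r = rng.random() * (n - k * sigma)
            acc = 0.0
            for j, s in enumerate(sizes):
                acc += s - sigma
                if r < acc:
                    sizes[j] += 1
                    break
            else:  # guard against floating-point edge cases
                sizes[-1] += 1
    return sorted(sizes, reverse=True)

light_tail = pitman_yor_sizes(2000, alpha=0.3, sigma=0.1)
heavy_tail = pitman_yor_sizes(2000, alpha=0.3, sigma=0.7)
print(len(light_tail), len(heavy_tail))
```

Higher sigma produces far more clusters, most of them small: the power-law-like behavior described above.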
@@ -450,8 +457,8 @@ Based on evaluation results, you can adjust parameters to improve clustering quality
 
 1. Start with the recommended values:
 
-   * For Dirichlet Process: alpha=0.5, variance=0.3
-   * For Pitman-Yor Process: alpha=0.3, sigma=0.3, variance=0.5
+   * For Dirichlet Process: alpha=0.5, kappa=0.3
+   * For Pitman-Yor Process: alpha=0.3, sigma=0.3, kappa=0.3
 
 2. If you want more clusters, increase alpha
 3. If you want fewer clusters, decrease alpha
@@ -462,6 +469,10 @@ The evaluation dashboard helps you compare different parameter settings and choose the best
 configuration for your dataset. Higher silhouette scores indicate better-defined clusters, while
 power-law characteristics often suggest natural language patterns in your data.
 
+.. note::
+
+   Because clustering is stochastic, run multiple trials with the same parameters to obtain reliable, reproducible results. This helps identify stable clusters that appear consistently across runs and reduces the impact of random initialization. The ``--random-seed`` option makes a single run reproducible, but comparing results across several seeds gives more robust insight into the true underlying cluster structure.
+
 Python API
 ==========
 
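One simple way to compare runs across seeds, as the note in the hunk above suggests, is pairwise co-assignment agreement (a Rand-index-style score). This helper is a sketch and not part of clusx; the label lists stand in for cluster assignments from two runs.

```python
from itertools import combinations

def coassignment_agreement(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree:
    both place the pair together, or both place it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Two hypothetical runs with different seeds over the same six texts
run_seed_1 = [0, 0, 1, 1, 2, 2]
run_seed_2 = [1, 1, 0, 0, 0, 2]
print(round(coassignment_agreement(run_seed_1, run_seed_2), 3))  # 0.8
```

The score is invariant to cluster relabeling, so it compares the partition structure itself rather than the arbitrary cluster ids.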
@@ -482,9 +493,8 @@ Basic Usage
     # texts = load_data("your_data.csv", column="text_column")
 
     # Perform Dirichlet Process clustering with recommended parameters
-    base_measure = {"variance": 0.3}  # Controls sensitivity to text differences
-    dp = DirichletProcess(alpha=0.5, base_measure=base_measure, random_state=42)
-    clusters, _ = dp.fit(texts)
+    dp = DirichletProcess(alpha=0.5, kappa=0.3, random_state=42)
+    clusters = dp.fit_predict(texts)
 
     # Save results
     save_clusters_to_json("clusters.json", texts, clusters, "DP")
@@ -497,9 +507,8 @@ The Pitman-Yor Process often produces better clustering results for text data:
 .. code-block:: python
 
     # Perform Pitman-Yor Process clustering with recommended parameters
-    base_measure = {"variance": 0.5}  # Typically higher for PYP
-    pyp = PitmanYorProcess(alpha=0.3, sigma=0.3, base_measure=base_measure, random_state=42)
-    clusters_pyp, _ = pyp.fit(texts)
+    pyp = PitmanYorProcess(alpha=0.3, kappa=0.3, sigma=0.3, random_state=42)
+    clusters_pyp = pyp.fit_predict(texts)
 
     # Save results
     save_clusters_to_json("pyp_clusters.json", texts, clusters_pyp, "PYP")
@@ -569,22 +578,22 @@ You can customize various aspects of the clustering process:
     # For fewer, larger clusters (good for broad categorization)
     dp_fewer_clusters = DirichletProcess(
         alpha=0.1,  # Low alpha = fewer clusters
-        base_measure={"variance": 0.5},  # Higher variance = less sensitive to differences
+        kappa=0.1,  # Lower kappa = less sensitive to differences
         random_state=42
     )
 
     # For more, smaller clusters (good for fine-grained categorization)
     dp_more_clusters = DirichletProcess(
         alpha=5.0,  # High alpha = more clusters
-        base_measure={"variance": 0.1},  # Lower variance = more sensitive to differences
+        kappa=0.5,  # Higher kappa = more sensitive to differences
         random_state=42
     )
 
     # For power-law distributed cluster sizes (often matches natural language patterns)
     pyp_power_law = PitmanYorProcess(
         alpha=0.3,
         sigma=0.7,  # Higher sigma = stronger power-law behavior
-        base_measure={"variance": 0.5},
+        kappa=0.3,
         random_state=42
     )
@@ -593,8 +602,8 @@
     custom_model = SentenceTransformer("all-mpnet-base-v2")  # Different model
 
     # To use a custom model with DirichletProcess:
-    dp_custom = DirichletProcess(alpha=0.5)
-    dp_custom.embedding_model = custom_model
+    dp_custom = DirichletProcess(alpha=0.5, kappa=0.3)
+    dp_custom.model = custom_model
 
     # Custom similarity function (advanced)
     def custom_similarity(text, cluster_param):
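The hunk above ends at the ``custom_similarity`` hook. For embedding-based clustering, the usual choice is cosine similarity between vectors; the stdlib sketch below is illustrative and not necessarily the clusx default.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
```

A hook plugged in at this point would typically embed ``text`` and compare it against the cluster's representative vector with a function like this one.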