This repository was archived by the owner on Jan 8, 2026. It is now read-only.
docs/source/usage.rst
Lines changed: 89 additions & 43 deletions
@@ -135,26 +135,25 @@ When using a CSV file, you must specify the column name to use for clustering:
 Adjusting Clustering Parameters
 -------------------------------

-Fine-tune the clustering by adjusting the model-specific parameters:
+Fine-tune the clustering by adjusting the model-specific parameters. Here's an example using parameters that worked well for a dataset of ~7,000 sentences:

 .. code-block:: bash

     clusx cluster \
       --input your_data.txt \
-      --dp-alpha 0.5 \
-      --dp-kappa 0.3 \
-      --pyp-alpha 0.3 \
-      --pyp-kappa 0.3 \
-      --pyp-sigma 0.3 \
+      --dp-alpha 15.0 \
+      --dp-kappa 25.0 \
+      --pyp-alpha 12.0 \
+      --pyp-kappa 25.0 \
+      --pyp-sigma 0.5 \
       --random-seed 42

-The choice of parameters significantly affects clustering results. For example:
+The choice of parameters significantly affects clustering results. General guidelines include:
* **Typical good starting value**: α=0.3 with kappa=0.3
 * **Default**: 0.3
 * **Constraint**: Must satisfy α > -σ (typically not an issue since σ is positive)
 * **Important**: Using the same alpha value as DP leads to dramatically different clustering behaviors

 * **pyp-sigma (discount parameter)**:

 * Unique to Pitman-Yor Process
-* **Recommended range**: 0.1 to 0.7
 * **Valid range**: 0.0 to 0.99 (must be less than 1.0)
 * **Effect**: Controls the power-law behavior of cluster sizes
-* **Typical good starting value**: σ=0.3
 * **Default**: 0.3
 * When sigma=0, Pitman-Yor behaves exactly like Dirichlet Process
 * As sigma approaches 1.0, the distribution exhibits heavier tails (more power-law-like)
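The sigma bullets in this hunk can be illustrated with the standard Pitman-Yor seating rule. The sketch below is not taken from clusx: the function names are hypothetical, and it models only the prior (no text likelihood, so kappa plays no role). With n items already placed in K clusters, a new cluster opens with probability (α + σK) / (n + α), and an existing cluster of size n_k is joined with weight (n_k − σ). Setting σ = 0 gives the Dirichlet Process rule α / (n + α).

```python
import random


def new_cluster_prob(alpha: float, sigma: float, n: int, k: int) -> float:
    """Prior probability that item n+1 opens a new cluster (PYP seating rule)."""
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must be in [0, 1)")
    if alpha <= -sigma:
        raise ValueError("alpha must satisfy alpha > -sigma")
    if n == 0:
        return 1.0  # the very first item always starts a cluster
    return (alpha + sigma * k) / (n + alpha)


def simulate_cluster_sizes(alpha: float, sigma: float, n_items: int, seed: int = 42):
    """Draw cluster sizes from the PYP prior alone (hypothetical helper)."""
    rng = random.Random(seed)
    sizes = []
    for n in range(n_items):
        if rng.random() < new_cluster_prob(alpha, sigma, n, len(sizes)):
            sizes.append(1)
        else:
            # Existing cluster j is chosen with weight (size_j - sigma).
            weights = [s - sigma for s in sizes]
            j = rng.choices(range(len(sizes)), weights=weights)[0]
            sizes[j] += 1
    return sizes


if __name__ == "__main__":
    for sigma in (0.0, 0.3, 0.6):
        sizes = simulate_cluster_sizes(alpha=1.0, sigma=sigma, n_items=2000)
        print(f"sigma={sigma}: {len(sizes)} clusters, largest={max(sizes)}")
```

Comparing the printed cluster counts across sigma values gives a feel for how the discount affects the number of small clusters under the prior; the counts produced by clusx on real text will differ because its likelihood term also matters.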
@@ -431,7 +424,6 @@ To interpret evaluation results and improve clustering performance, it's importa

 * Controls the sensitivity of the clustering process
 * **Effect**: Higher values make the model more sensitive to small differences between texts
-* **Typical good value**: 0.3 (same as for Dirichlet Process)
 * **Default**: 0.3
 * Part of the likelihood model for the clustering process

@@ -450,20 +442,76 @@ To interpret evaluation results and improve clustering performance, it's importa
 * Smaller values indicate more confidence in the power law alpha estimate
 * Helps determine the reliability of the power law fit

+Example Parameter Combinations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following parameter combinations were found to work well for a dataset of approximately 7,000 sentences. These are provided as examples to illustrate how different parameter settings affect clustering outcomes, but optimal values will vary based on your specific dataset size, domain, and clustering goals:
+* PYP produced ~25–35 clusters, better capturing niche topics (e.g., splitting "technology" into "AI," "blockchain," etc.)
+
+**Key Observations from These Examples**:
+
+* **Alpha Scaling**: Alpha values should be proportional to the dataset size. For the 7,000 sentence dataset, values between 10-30 worked well.
+* **Kappa Range**: Values between 15-35 balanced cluster tightness and avoided overfitting. Higher kappa created more distinct but fewer clusters.
+* **Discount (Sigma)**: Values between 0.4-0.6 for PYP ensured it outperformed DP in capturing power-law distributions without fragmenting clusters excessively.
+
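As context for the "Alpha Scaling" observation, here is a minimal sketch of how alpha and dataset size interact under a plain Dirichlet Process prior. It is an assumption-laden illustration, not clusx code: the function name is hypothetical, and it ignores the likelihood term (kappa), so the cluster counts clusx actually produces may differ substantially. Under the prior alone, the expected number of clusters after n items is the sum over i = 0..n-1 of α / (α + i).

```python
def expected_dp_clusters(alpha: float, n_items: int) -> float:
    """Expected cluster count under a DP prior with concentration alpha.

    Item i+1 opens a new cluster with probability alpha / (alpha + i),
    so the expectation is the sum of those probabilities.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive for a Dirichlet Process")
    return sum(alpha / (alpha + i) for i in range(n_items))


if __name__ == "__main__":
    # Values taken from the old and new example commands in this diff.
    for alpha in (0.5, 15.0, 30.0):
        k = expected_dp_clusters(alpha, n_items=7000)
        print(f"alpha={alpha}: about {k:.1f} clusters expected a priori")
```

Running this for a few dataset sizes shows how quickly the prior's expected cluster count responds to alpha compared with n, which is one way to sanity-check a starting alpha before evaluating real clusterings.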
 Optimizing Clustering Parameters
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Based on evaluation results, you can adjust parameters to improve clustering quality:
+When tuning parameters for your own dataset, consider these guidelines:

-1. Start with the recommended values:
+1. Start with the default values and gradually adjust based on your dataset characteristics:
-4. To get a more power-law-like distribution, increase sigma (for PYP only)
-5. Evaluate the results using the evaluation metrics, especially silhouette score
+2. Scale alpha values proportionally to your dataset size
+3. Adjust kappa to balance cluster tightness (higher values) vs. number of clusters (lower values)
+4. For PYP, experiment with different sigma values to find the right balance between capturing power-law distributions and avoiding excessive fragmentation
+5. Evaluate the results using silhouette scores, cluster size distributions, and topic coherence

 The evaluation dashboard helps you compare different parameter settings and choose the optimal
 configuration for your dataset. Higher silhouette scores indicate better-defined clusters, while