This repository was archived by the owner on Jan 8, 2026. It is now read-only.

Commit 5a0b5d1

Amend guidance on parameter tuning
1 parent b85a21a commit 5a0b5d1

2 files changed: +97 -49 lines changed


README.rst

Lines changed: 8 additions & 6 deletions
@@ -51,12 +51,14 @@ For interactive visualization during evaluation, add the ``--show-plot`` option:
5151
5252
.. note::
5353

54-
The default parameters are optimized based on extensive testing:
54+
The package comes with sensible defaults, but optimal parameters depend on your dataset:
5555

56-
* Dirichlet Process: α=0.5, kappa=0.3
57-
* Pitman-Yor Process: α=0.3, σ=0.3, kappa=0.3
56+
* Default values: DP (α=0.5, κ=0.3), PYP (α=0.3, κ=0.3, σ=0.3)
57+
* For a dataset of ~7,000 sentences, these values worked well:
58+
* Dirichlet Process: α=15.0, κ=25.0 (formed ~10-20 clusters)
59+
* Pitman-Yor Process: α=12.0, κ=25.0, σ=0.5 (formed ~20-30 clusters)
5860

59-
For advanced usage and parameter tuning, see the `Usage Guide <https://clusterium.readthedocs.io/en/latest/usage.html>`_.
61+
For guidance on parameter tuning for your specific dataset, see the `Usage Guide <https://clusterium.readthedocs.io/en/latest/usage.html>`_.
6062

6163
Python API Example
6264
------------------
@@ -70,10 +72,10 @@ Python API Example
7072
texts = load_data("your_data.txt")
7173
7274
# Perform clustering with default parameters
73-
dp = DirichletProcess(alpha=0.5, kappa=0.3) # Dirichlet Process
75+
dp = DirichletProcess(alpha=0.5, kappa=0.3) # Default parameters
7476
clusters_dp = dp.fit_predict(texts)
7577
76-
pyp = PitmanYorProcess(alpha=0.3, sigma=0.3, kappa=0.3) # Pitman-Yor Process
78+
pyp = PitmanYorProcess(alpha=0.3, kappa=0.3, sigma=0.3) # Default parameters
7779
clusters_pyp = pyp.fit_predict(texts)
7880
7981
# Print number of clusters found

docs/source/usage.rst

Lines changed: 89 additions & 43 deletions
@@ -135,26 +135,25 @@ When using a CSV file, you must specify the column name to use for clustering:
135135
Adjusting Clustering Parameters
136136
-------------------------------
137137

138-
Fine-tune the clustering by adjusting the model-specific parameters:
138+
Fine-tune the clustering by adjusting the model-specific parameters. Here's an example using parameters that worked well for a dataset of ~7,000 sentences:
139139

140140
.. code-block:: bash
141141
142142
clusx cluster \
143143
--input your_data.txt \
144-
--dp-alpha 0.5 \
145-
--dp-kappa 0.3 \
146-
--pyp-alpha 0.3 \
147-
--pyp-kappa 0.3 \
148-
--pyp-sigma 0.3 \
144+
--dp-alpha 15.0 \
145+
--dp-kappa 25.0 \
146+
--pyp-alpha 12.0 \
147+
--pyp-kappa 25.0 \
148+
--pyp-sigma 0.5 \
149149
--random-seed 42
150150
151-
The choice of parameters significantly affects clustering results. For example:
151+
The choice of parameters significantly affects clustering results. General guidelines include:
152152

153-
* Lower alpha values (0.1-0.5) create fewer, larger clusters
154-
* Higher alpha values (1.0-5.0) create more, smaller clusters
155-
* For Pitman-Yor Process, sigma values between 0.1-0.7 typically work well
156-
* Lower kappa values (0.1-0.3) make the model more sensitive to small differences between texts
157-
* Using the same value for both DP and PYP alpha parameters will result in dramatically different clustering behaviors
153+
* Alpha values should be scaled based on your dataset size (higher for larger datasets)
154+
* Higher kappa values make the model more sensitive to small differences between texts
155+
* For Pitman-Yor Process, sigma controls the power-law behavior of cluster sizes
156+
* Using the same alpha value for both DP and PYP leads to dramatically different clustering behaviors
158157

159158
For detailed guidance on parameter selection for each model, see the `Understanding Clustering Parameters`_ section below.
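
To see why alpha needs to grow with dataset size, it can help to look at the prior on its own. Under a standard Dirichlet Process (Chinese restaurant process) prior, the expected number of clusters after ``n`` items is the sum of ``alpha / (alpha + i)`` for ``i`` from 0 to ``n - 1``. The sketch below computes that quantity. The function name ``expected_dp_clusters`` is illustrative and not part of the package, and the calculation ignores the likelihood term (kappa), so the cluster counts you actually observe will differ.

```python
def expected_dp_clusters(alpha: float, n_items: int) -> float:
    """Expected cluster count under a Chinese restaurant process prior.

    Uses the exact sum E[K] = sum_{i=0}^{n-1} alpha / (alpha + i).
    This describes the prior only; it does not model the likelihood (kappa).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return sum(alpha / (alpha + i) for i in range(n_items))


if __name__ == "__main__":
    # Compare a small default-sized alpha with larger values on two dataset sizes.
    for alpha in (0.5, 5.0, 15.0):
        for n_items in (1_000, 7_000):
            k = expected_dp_clusters(alpha, n_items)
            print(f"alpha={alpha:<5} n={n_items:<6} prior E[K] = {k:.1f}")
```

For a fixed alpha, this prior expectation grows only logarithmically with the number of items, which is one reason a small alpha that suits a tiny dataset tends to under-split a large one.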
160159

@@ -389,17 +388,15 @@ To interpret evaluation results and improve clustering performance, it's importa
389388
* **dp-alpha (concentration parameter)**:
390389

391390
* Controls how likely the algorithm is to create new clusters
392-
* **Recommended range**: 0.1 to 5.0
391+
* **Typical range**: No fixed range; scale the value with dataset size and characteristics (for example, values between 10 and 30 worked well for a dataset of ~7,000 sentences)
393392
* **Effect**: Higher values create more clusters, lower values create fewer, larger clusters
394-
* **Typical good starting value**: α=0.5 with kappa=0.3
395393
* **Default**: 0.5
396394
* **Constraint**: Must be positive (α > 0)
397395

398396
* **dp-kappa (precision parameter)**:
399397

400398
* Controls the sensitivity of the clustering process
401-
* **Effect**: Higher values make the model more sensitive to small differences between texts
402-
* **Typical good value**: 0.3
399+
* **Effect**: Higher values make the model more sensitive to small differences between texts, creating more distinct but fewer clusters
403400
* **Default**: 0.3
404401
* Part of the likelihood model for the clustering process
405402

@@ -408,20 +405,16 @@ To interpret evaluation results and improve clustering performance, it's importa
408405
* **pyp-alpha (concentration parameter)**:
409406

410407
* Plays a similar role to alpha in the Dirichlet Process, but with different optimal ranges
411-
* **Recommended range**: 0.1 to 2.0
412408
* **Effect**: Higher values create more clusters, lower values create fewer, larger clusters
413-
* **Typical good starting value**: α=0.3 with kappa=0.3
414409
* **Default**: 0.3
415410
* **Constraint**: Must satisfy α > -σ (typically not an issue since σ is positive)
416411
* **Important**: Using the same alpha value as DP leads to dramatically different clustering behaviors
417412

418413
* **pyp-sigma (discount parameter)**:
419414

420415
* Unique to Pitman-Yor Process
421-
* **Recommended range**: 0.1 to 0.7
422416
* **Valid range**: 0.0 to 0.99 (must be less than 1.0)
423417
* **Effect**: Controls the power-law behavior of cluster sizes
424-
* **Typical good starting value**: σ=0.3
425418
* **Default**: 0.3
426419
* When sigma=0, Pitman-Yor behaves exactly like Dirichlet Process
427420
* As sigma approaches 1.0, the distribution exhibits heavier tails (more power-law-like)
@@ -431,7 +424,6 @@ To interpret evaluation results and improve clustering performance, it's importa
431424

432425
* Controls the sensitivity of the clustering process
433426
* **Effect**: Higher values make the model more sensitive to small differences between texts
434-
* **Typical good value**: 0.3 (same as for Dirichlet Process)
435427
* **Default**: 0.3
436428
* Part of the likelihood model for the clustering process
437429

@@ -450,20 +442,76 @@ To interpret evaluation results and improve clustering performance, it's importa
450442
* Smaller values indicate more confidence in the power law alpha estimate
451443
* Helps determine the reliability of the power law fit
452444

445+
Example Parameter Combinations
446+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
447+
448+
The following parameter combinations worked well for a dataset of approximately 7,000 sentences. They illustrate how different settings affect clustering outcomes; optimal values will vary with your dataset size, domain, and clustering goals:
449+
450+
**Example 1: Balanced Clustering**
451+
452+
* Dirichlet Process: ``--dp-alpha 15.0 --dp-kappa 25.0``
453+
* Pitman-Yor Process: ``--pyp-alpha 12.0 --pyp-kappa 25.0 --pyp-sigma 0.5``
454+
* Observed Behavior:
455+
456+
* DP formed ~10–20 clusters
457+
* PYP formed ~20–30 clusters because the discount parameter encourages more small clusters
458+
459+
**Example 2: More Granular Clustering**
460+
461+
* Dirichlet Process: ``--dp-alpha 25.0 --dp-kappa 20.0``
462+
* Pitman-Yor Process: ``--pyp-alpha 18.0 --pyp-kappa 20.0 --pyp-sigma 0.6``
463+
* Observed Behavior:
464+
465+
* DP created ~15–25 clusters
466+
* PYP produced ~30–40 clusters, with a long tail of smaller topic-specific groups
467+
468+
**Example 3: Tight, Cohesive Clusters**
469+
470+
* Dirichlet Process: ``--dp-alpha 10.0 --dp-kappa 30.0``
471+
* Pitman-Yor Process: ``--pyp-alpha 8.0 --pyp-kappa 30.0 --pyp-sigma 0.4``
472+
* Observed Behavior:
473+
474+
* DP yielded ~8–15 tight clusters (high kappa)
475+
* PYP created ~15–25 clusters, splitting some of DP's larger clusters into subtopics
476+
477+
**Example 4: Broad Coverage**
478+
479+
* Dirichlet Process: ``--dp-alpha 30.0 --dp-kappa 18.0``
480+
* Pitman-Yor Process: ``--pyp-alpha 25.0 --pyp-kappa 18.0 --pyp-sigma 0.55``
481+
* Observed Behavior:
482+
483+
* DP generated ~20–30 broad clusters
484+
* PYP formed ~35–50 clusters, reflecting a power-law distribution (many small clusters and a few large ones)
485+
486+
**Example 5: High Precision**
487+
488+
* Dirichlet Process: ``--dp-alpha 20.0 --dp-kappa 35.0``
489+
* Pitman-Yor Process: ``--pyp-alpha 15.0 --pyp-kappa 35.0 --pyp-sigma 0.45``
490+
* Observed Behavior:
491+
492+
* DP resulted in ~12–18 highly cohesive clusters
493+
* PYP produced ~25–35 clusters, better capturing niche topics (e.g., splitting "technology" into "AI," "blockchain," etc.)
494+
495+
**Key Observations from These Examples**:
496+
497+
* **Alpha Scaling**: Alpha values should be scaled to the dataset size. For the 7,000-sentence dataset, values between 10 and 30 worked well.
498+
* **Kappa Range**: Values between 15 and 35 balanced cluster tightness against overfitting. Higher kappa produced fewer but more distinct clusters.
499+
* **Discount (Sigma)**: For PYP, values between 0.4 and 0.6 captured power-law cluster-size distributions better than DP without fragmenting clusters excessively.
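
The effect of the discount parameter can be explored in isolation by sampling from a Pitman-Yor prior (the two-parameter Chinese restaurant process). The sketch below is a generic simulation, not the package's implementation: ``simulate_pitman_yor`` is a hypothetical helper, and it leaves out the likelihood term (kappa) entirely, so its cluster counts are not comparable to the observed counts listed above. It only illustrates how sigma shifts mass toward many small clusters.

```python
import random


def simulate_pitman_yor(n_items: int, alpha: float, sigma: float, seed: int = 42) -> list[int]:
    """Sample cluster sizes from a Pitman-Yor (two-parameter CRP) prior.

    With n items already placed in K clusters, the next item joins cluster k
    with probability (n_k - sigma) / (n + alpha) and opens a new cluster with
    probability (alpha + sigma * K) / (n + alpha).
    With sigma = 0 this reduces to the Dirichlet Process prior.
    """
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must be in [0, 1)")
    if alpha <= -sigma:
        raise ValueError("alpha must be greater than -sigma")
    rng = random.Random(seed)
    sizes: list[int] = []
    for _ in range(n_items):
        if not sizes:
            # The first item always opens the first cluster.
            sizes.append(1)
            continue
        # Unnormalised weights: one per existing cluster, plus one for a new cluster.
        weights = [size - sigma for size in sizes]
        weights.append(alpha + sigma * len(sizes))
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)
        else:
            sizes[choice] += 1
    return sizes


if __name__ == "__main__":
    for sigma in (0.0, 0.4, 0.6):
        sizes = simulate_pitman_yor(n_items=2000, alpha=12.0, sigma=sigma)
        singletons = sum(1 for size in sizes if size == 1)
        print(f"sigma={sigma}: {len(sizes)} clusters, {singletons} singletons, largest={max(sizes)}")
```

Running the demo with a few sigma values shows the trade-off described above: larger discounts produce a longer tail of small clusters, which is useful for niche topics but can fragment the result if pushed too far.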
500+
453501
Optimizing Clustering Parameters
454502
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
455503

456-
Based on evaluation results, you can adjust parameters to improve clustering quality:
504+
When tuning parameters for your own dataset, consider these guidelines:
457505

458-
1. Start with the recommended values:
506+
1. Start with the default values and gradually adjust based on your dataset characteristics:
459507

460-
* For Dirichlet Process: alpha=0.5, kappa=0.3
461-
* For Pitman-Yor Process: alpha=0.3, sigma=0.3
508+
* For smaller datasets (< 1,000 items), try lower alpha values
509+
* For larger datasets (> 10,000 items), try higher alpha values
462510

463-
2. If you want more clusters, increase alpha
464-
3. If you want fewer clusters, decrease alpha
465-
4. To get a more power-law-like distribution, increase sigma (for PYP only)
466-
5. Evaluate the results using the evaluation metrics, especially silhouette score
511+
2. Scale alpha values proportionally to your dataset size
512+
3. Adjust kappa to balance cluster tightness (higher values) vs. number of clusters (lower values)
513+
4. For PYP, experiment with different sigma values to find the right balance between capturing power-law distributions and avoiding excessive fragmentation
514+
5. Evaluate the results using silhouette scores, cluster size distributions, and topic coherence
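
The tuning loop in the steps above can be sketched as a small parameter sweep. This is a minimal sketch under stated assumptions: ``sweep_parameters`` is a hypothetical helper, ``make_model`` is any factory that returns an object with a ``fit_predict(texts)`` method yielding one hashable label per text, and ``score`` is whatever quality metric you choose (higher is better). None of these names come from the package itself.

```python
from itertools import product


def sweep_parameters(texts, make_model, score, alphas, kappas):
    """Fit one model per (alpha, kappa) pair and rank the runs by score.

    make_model: callable(alpha=..., kappa=...) -> object with fit_predict(texts)
    score:      callable(texts, labels) -> float, where higher is better
    """
    runs = []
    for alpha, kappa in product(alphas, kappas):
        labels = list(make_model(alpha=alpha, kappa=kappa).fit_predict(texts))
        runs.append(
            {
                "alpha": alpha,
                "kappa": kappa,
                "n_clusters": len(set(labels)),
                "score": score(texts, labels),
            }
        )
    # Best-scoring configuration first.
    return sorted(runs, key=lambda run: run["score"], reverse=True)
```

With this package, ``make_model`` could wrap the constructor shown elsewhere in this guide, for example ``lambda alpha, kappa: DirichletProcess(alpha=alpha, kappa=kappa, random_state=42)``. Keeping the random seed fixed across runs makes differences in score attributable to the parameters rather than to sampling noise.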
467515

468516
The evaluation dashboard helps you compare different parameter settings and choose the optimal
469517
configuration for your dataset. Higher silhouette scores indicate better-defined clusters, while
@@ -492,7 +540,7 @@ Basic Usage
492540
# Or load data from a CSV file
493541
# texts = load_data("your_data.csv", column="text_column")
494542
495-
# Perform Dirichlet Process clustering with recommended parameters
543+
# Perform Dirichlet Process clustering with default parameters
496544
dp = DirichletProcess(alpha=0.5, kappa=0.3, random_state=42)
497545
clusters = dp.fit_predict(texts)
498546
@@ -506,17 +554,15 @@ The Pitman-Yor Process often produces better clustering results for text data:
506554

507555
.. code-block:: python
508556
509-
# Perform Pitman-Yor Process clustering with recommended parameters
557+
# Perform Pitman-Yor Process clustering with default parameters
510558
pyp = PitmanYorProcess(alpha=0.3, kappa=0.3, sigma=0.3, random_state=42)
511559
clusters_pyp = pyp.fit_predict(texts)
512560
513561
# Save results
514562
save_clusters_to_json("pyp_clusters.json", texts, clusters_pyp, "PYP")
515563
516-
For optimal results, consider using the recommended parameter values discussed in
517-
the `Understanding Clustering Parameters`_ section. The Pitman-Yor Process is
518-
particularly effective for text data that naturally follows power-law distributions.
519-
564+
For optimal results, you'll likely need to tune parameters based on your specific dataset characteristics.
565+
See the `Example Parameter Combinations`_ and `Optimizing Clustering Parameters`_ sections for guidance.
520566

521567
.. note::
522568

@@ -569,27 +615,27 @@ You can evaluate the quality of your clusters using the evaluation module:
569615
Customizing the Clustering Process
570616
----------------------------------
571617

572-
You can customize various aspects of the clustering process:
618+
You can customize various aspects of the clustering process based on your specific needs:
573619

574620
.. code-block:: python
575621
576622
# Custom parameters for different clustering behaviors
577623
578-
# For fewer, larger clusters (good for broad categorization)
624+
# For fewer, larger clusters
579625
dp_fewer_clusters = DirichletProcess(
580-
alpha=0.1, # Low alpha = fewer clusters
581-
kappa=0.5, # Higher kappa = less sensitive to differences
626+
alpha=0.1, # Lower alpha = fewer clusters
627+
kappa=0.5, # Adjust kappa based on desired sensitivity
582628
random_state=42
583629
)
584630
585-
# For more, smaller clusters (good for fine-grained categorization)
631+
# For more, smaller clusters
586632
dp_more_clusters = DirichletProcess(
587-
alpha=5.0, # High alpha = more clusters
588-
kappa=0.1, # Lower kappa = more sensitive to differences
633+
alpha=5.0, # Higher alpha = more clusters
634+
kappa=0.1, # Adjust kappa based on desired sensitivity
589635
random_state=42
590636
)
591637
592-
# For power-law distributed cluster sizes (often matches natural language patterns)
638+
# For power-law distributed cluster sizes
593639
pyp_power_law = PitmanYorProcess(
594640
alpha=0.3,
595641
sigma=0.7, # Higher sigma = stronger power-law behavior
