This repository was archived by the owner on Jan 8, 2026. It is now read-only.
-* **Typical good starting value**: α=0.3 with variance=0.5
+* **Typical good starting value**: α=0.3 with kappa=0.3
 * **Default**: 0.3
 * **Constraint**: Must satisfy α > -σ (typically not an issue since σ is positive)
 * **Important**: Using the same alpha value as DP leads to dramatically different clustering behaviors
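
The constraint and the "same alpha, different behavior" remark in this hunk can be read off the standard Pitman-Yor Chinese restaurant process predictive rule. The block below is a sketch of that textbook rule, not a statement about this project's code; it assumes the implementation follows the usual two-parameter form, with :math:`n` texts already assigned, :math:`K` existing clusters, and :math:`n_k` texts in cluster :math:`k`.

```latex
\begin{aligned}
P(\text{new cluster} \mid n, K) &= \frac{\alpha + \sigma K}{n + \alpha}, \\
P(\text{join cluster } k \mid n, K) &= \frac{n_k - \sigma}{n + \alpha}, \qquad k = 1, \dots, K .
\end{aligned}
```

The two cases sum to one, since :math:`\alpha + \sigma K + \sum_k (n_k - \sigma) = n + \alpha`. The requirement α > -σ keeps the new-cluster probability positive. Setting σ = 0 recovers the Dirichlet Process rule, so for the same α the Pitman-Yor new-cluster probability is larger by :math:`\sigma K / (n + \alpha)`, a gap that grows as clusters accumulate.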
@@ -420,13 +427,13 @@ To interpret evaluation results and improve clustering performance, it's importa
 * As sigma approaches 1.0, the distribution exhibits heavier tails (more power-law-like)
 * Higher sigma values tend to produce more small clusters and fewer large clusters

-* **variance**:
+* **pyp-kappa (precision parameter)**:

 * Controls the sensitivity of the clustering process
-* **Effect**: Lower values make the model more sensitive to small differences between texts
-* **Typical good value**: 0.5 (slightly higher than for Dirichlet Process)
-* **Default**: 0.3 (same as for Dirichlet Process)
-* Part of the base measure for the clustering model
+* **Effect**: Higher values make the model more sensitive to small differences between texts
+* **Typical good value**: 0.3 (same as for Dirichlet Process)
+* **Default**: 0.3
+* Part of the likelihood model for the clustering process

 3. **Power Law Parameters** (detected in the evaluation results, not passed as a parameter):
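
The claim that higher sigma yields more small clusters can be checked with a small simulation of the Pitman-Yor Chinese restaurant process prior. This is a minimal sketch under stated assumptions: it samples cluster sizes from the prior only (no text likelihood, so kappa plays no role), and the function names ``sample_pyp_partition`` and ``summarize`` are hypothetical helpers, not part of this project.

```python
import random


def sample_pyp_partition(n, alpha, sigma, seed):
    """Sample cluster sizes for n items from a Pitman-Yor CRP prior.

    Hypothetical helper for illustration; sigma=0 gives the Dirichlet Process.
    """
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must lie in [0, 1)")
    if alpha <= -sigma:
        raise ValueError("alpha must satisfy alpha > -sigma")
    rng = random.Random(seed)
    sizes = []
    for i in range(n):  # i items have already been assigned
        if not sizes:
            sizes.append(1)  # the first item always opens a cluster
            continue
        new_weight = alpha + sigma * len(sizes)
        u = rng.random() * (i + alpha)  # total weight is i + alpha
        if u < new_weight:
            sizes.append(1)
            continue
        u -= new_weight
        for idx, size in enumerate(sizes):
            u -= size - sigma  # weight of joining an existing cluster
            if u < 0:
                sizes[idx] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


def summarize(n, alpha, sigma, seeds):
    """Return (mean cluster count, mean singleton count) across seeds."""
    seeds = list(seeds)
    runs = [sample_pyp_partition(n, alpha, sigma, s) for s in seeds]
    mean_clusters = sum(len(r) for r in runs) / len(runs)
    mean_singletons = sum(r.count(1) for r in runs) / len(runs)
    return mean_clusters, mean_singletons


if __name__ == "__main__":
    for sigma in (0.0, 0.3, 0.7):
        clusters, singletons = summarize(1000, 0.3, sigma, range(5))
        print(f"sigma={sigma}: clusters={clusters:.1f}, singletons={singletons:.1f}")
```

Running the script with a fixed alpha and increasing sigma should show both the cluster count and the number of singleton clusters rising, which is the heavier-tail behavior described above.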
@@ -450,8 +457,8 @@ Based on evaluation results, you can adjust parameters to improve clustering qua
 1. Start with the recommended values:

-* For Dirichlet Process: alpha=0.5, variance=0.3
-* For Pitman-Yor Process: alpha=0.3, sigma=0.3, variance=0.5
+* For Dirichlet Process: alpha=0.5, kappa=0.3
+* For Pitman-Yor Process: alpha=0.3, sigma=0.3

 2. If you want more clusters, increase alpha
 3. If you want fewer clusters, decrease alpha
@@ -462,6 +469,10 @@ The evaluation dashboard helps you compare different parameter settings and choo
 configuration for your dataset. Higher silhouette scores indicate better-defined clusters, while
 power-law characteristics often suggest natural language patterns in your data.

+.. note::
+
+   Given that clustering is stochastic, you should run multiple trials with the same parameters to get reliable and reproducible results. This helps identify stable clusters that consistently appear across runs and reduces the impact of random initialization. Using the ``--random-seed`` parameter ensures reproducibility for a specific run, but comparing results across multiple seeds provides more robust insights into the true underlying cluster structure.
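
One way to compare results across seeds, as the note suggests, is to score how often two runs agree on whether a pair of texts belongs together. The sketch below uses the plain Rand index for this; it is an illustration rather than the project's evaluation code. The functions ``rand_index`` and ``mean_pairwise_stability`` are hypothetical helpers, and the label lists are made-up placeholders standing in for the cluster assignments you would collect from runs with different ``--random-seed`` values.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree (same or different cluster)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_pairwise_stability(runs):
    """Average Rand index over all pairs of runs; values near 1.0 mean stable clusters."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    scores = [rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Placeholder labels only: one list of cluster ids per seed, same text order.
    runs_by_seed = {
        0: [0, 0, 1, 1, 2],
        1: [1, 1, 0, 0, 2],  # same grouping as seed 0 with renamed cluster ids
        2: [0, 0, 1, 1, 1],  # last text merged into another cluster
    }
    print(mean_pairwise_stability(list(runs_by_seed.values())))
```

The Rand index ignores the cluster ids themselves, so runs that find the same grouping under different numbering still count as full agreement. A chance-corrected variant such as the adjusted Rand index may be preferable when cluster counts differ a lot between runs.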