Commit 900d8b9

implemented simplex constraint
1 parent 6e7b75b commit 900d8b9

File tree

6 files changed: +586, -11 lines


docs/theory.md

Lines changed: 28 additions & 1 deletion
@@ -53,4 +53,31 @@ The standard Gibbs sampler in `pybmc` assumes a Gaussian likelihood and conjugat

### Gibbs Sampler with Simplex Constraints

`pybmc` also provides a Gibbs sampler that enforces simplex constraints on the model weights (i.e., \(\sum w_k = 1\) and \(w_k \ge 0\)). This is achieved by performing a random walk in the space of the transformed parameters and using a Metropolis-Hastings step to accept or reject proposals that fall outside the valid simplex region.

#### When to Use Each Mode

| Mode | Description | Use When |
|------|-------------|----------|
| **Unconstrained** (default) | Weights can take any real value | Maximum flexibility; some models may get negative weights to cancel out biases |
| **Simplex** | Weights satisfy \(w_k \ge 0\) and \(\sum w_k = 1\) | You need interpretable weights that form a proper mixture; predictions should stay within the range of individual models |

#### Simplex Constraint Implementation

The simplex constraint is enforced through a Metropolis-within-Gibbs algorithm. In the SVD-reduced coefficient space, the relationship between the regression coefficients \(\boldsymbol{\beta}\) and the model weights \(\boldsymbol{\omega}\) is:

\[
\omega_k = \sum_{j=1}^m \beta_j \hat{V}_{jk} + \frac{1}{K}
\]

where \(\hat{V}\) contains the (normalized) right singular vectors and \(K\) is the number of models. The term \(\frac{1}{K}\) represents the equal-weight baseline.

At each iteration, the algorithm:

1. **Proposes** a new coefficient vector \(\boldsymbol{\beta}^*\) from a multivariate normal centered on the current value.
2. **Projects** the proposal to weight space via \(\boldsymbol{\omega}^* = \boldsymbol{\beta}^* \hat{V} + \frac{1}{K}\).
3. **Rejects** the proposal if any \(\omega_k^* < 0\) (the sum-to-one constraint is automatically satisfied by the SVD structure and the \(\frac{1}{K}\) offset).
4. **Accepts** valid proposals with probability \(\min\!\bigl(1,\; \exp\!\bigl[\bigl(\ell(\boldsymbol{\beta}^*) - \ell(\boldsymbol{\beta})\bigr) / \sigma^2\bigr]\bigr)\), where \(\ell\) is the log-likelihood.
5. **Samples** the error variance \(\sigma^2\) from its inverse-gamma full conditional.

The `burn` parameter controls the number of burn-in iterations discarded before collecting samples, and the `stepsize` parameter scales the proposal covariance matrix to tune the acceptance rate.
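The projection and accept/reject logic described above can be sketched in plain NumPy. This is an illustrative standalone implementation, not the `pybmc` source: the function name `simplex_mh_step`, the Gaussian log-likelihood \(\ell(\boldsymbol{\beta}) = -\tfrac{1}{2}\lVert y - \hat{U}\boldsymbol{\beta}\rVert^2\), and the array shapes are assumptions made for the sketch.

```python
import numpy as np

def simplex_mh_step(beta, y, U_hat, Vt_hat, sigma2, stepsize, rng):
    """One Metropolis step with a hard non-negativity constraint on the weights.

    beta   : (m,) current SVD-space coefficients
    y      : (n,) centered experimental values
    U_hat  : (n, m) reduced left singular vectors (design matrix)
    Vt_hat : (m, K) map from coefficients to weight deviations; its rows
             sum to zero, so the weights sum to one by construction.
    """
    K = Vt_hat.shape[1]

    # 1. Propose from a normal centered on the current value.
    beta_star = beta + stepsize * rng.standard_normal(beta.shape)

    # 2. Project the proposal to weight space: omega = beta @ Vt_hat + 1/K.
    omega_star = beta_star @ Vt_hat + 1.0 / K

    # 3. Reject outright if any weight is negative.
    if np.any(omega_star < 0):
        return beta

    # 4. Accept with probability min(1, exp[(l(b*) - l(b)) / sigma^2]),
    #    where l(b) = -0.5 * ||y - U_hat @ b||^2 (assumed Gaussian form).
    def log_lik(b):
        r = y - U_hat @ b
        return -0.5 * r @ r

    log_ratio = (log_lik(beta_star) - log_lik(beta)) / sigma2
    if np.log(rng.uniform()) < log_ratio:
        return beta_star
    return beta

# Demo: with a zero-row-sum Vt_hat, every visited state is a valid simplex point.
rng = np.random.default_rng(0)
Vt_hat = np.array([[0.5, -0.25, -0.25],
                   [0.0, 0.50, -0.50]])
U_hat = rng.standard_normal((5, 2))
y = rng.standard_normal(5)
beta = np.zeros(2)  # start at the equal-weight baseline (1/3, 1/3, 1/3)
for _ in range(200):
    beta = simplex_mh_step(beta, y, U_hat, Vt_hat, sigma2=0.5, stepsize=0.05, rng=rng)
omega = beta @ Vt_hat + 1.0 / 3
print(omega.min() >= 0, np.isclose(omega.sum(), 1.0))  # True True
```

Step 5 (the inverse-gamma draw for \(\sigma^2\)) is omitted here to keep the sketch focused on how the constraint itself is enforced.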

docs/usage.md

Lines changed: 77 additions & 0 deletions
@@ -112,6 +112,83 @@ With the data prepared and the model orthogonalized, we can train the model comb

```python
bmc.train(training_options={"iterations": 50000, "sampler": "gibbs_sampling"})
```

### Simplex Constraint Mode

By default, `pybmc` uses an unconstrained Gibbs sampler where model weights can take any real value. If you want to enforce that the weights lie on the **probability simplex** — meaning each weight is between 0 and 1 and the weights sum to 1 — you can enable the simplex constraint mode.

!!! tip "When to Use Simplex Constraints"
    Use simplex constraints when you want the model combination to behave as a
    **proper weighted average** of the constituent models. This is appropriate when:

    - You want each model to contribute non-negatively to the prediction.
    - The combined prediction should remain within the range spanned by the individual models.
    - Physical interpretability of the weights matters for your application.

    The unconstrained mode is more flexible and may yield better predictive performance
    when some models systematically over- or under-predict, since negative weights can
    partially cancel out biased models.

There are two ways to enable simplex constraints:

**Option 1: Set at initialization (recommended when you always want simplex)**

```python
bmc = BayesianModelCombination(
    models_list=["FRDM12", "HFB24", "D1M", "UNEDF1", "BCPM"],
    data_dict=data_dict,
    truth_column_name="AME2020",
    constraint="simplex",  # <-- weights constrained to [0, 1], sum to 1
)

bmc.orthogonalize("BE", train_df, components_kept=3)
bmc.train(training_options={
    "iterations": 50000,
    "burn": 10000,      # burn-in iterations for the Metropolis step
    "stepsize": 0.001,  # proposal step size
})
```

**Option 2: Override per training call**

```python
# Initialize with the default unconstrained mode
bmc = BayesianModelCombination(
    models_list=["FRDM12", "HFB24", "D1M", "UNEDF1", "BCPM"],
    data_dict=data_dict,
    truth_column_name="AME2020",
)

bmc.orthogonalize("BE", train_df, components_kept=3)

# Override to simplex for this specific training run
bmc.train(training_options={
    "iterations": 50000,
    "sampler": "simplex",
    "burn": 10000,
    "stepsize": 0.001,
})
```

### Inspecting Model Weights

After training, you can inspect the inferred model weights using `get_weights()`:

```python
# Get a summary (mean, std, median per model)
summary = bmc.get_weights()
for model, mean_w, std_w in zip(summary["models"], summary["mean"], summary["std"]):
    print(f"  {model}: {mean_w:.4f} ± {std_w:.4f}")

# Get the full weight matrix (n_samples × n_models) for custom analysis
weight_matrix = bmc.get_weights(summary=False)
```

In simplex mode, every row of the weight matrix is guaranteed to satisfy \(w_k \ge 0\) and \(\sum_k w_k = 1\).

## 4. Make Predictions

After training, we can use the `predict` method to generate predictions with uncertainty quantification. The method returns the full posterior draws, as well as DataFrames for the lower, median, and upper credible intervals.
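The simplex guarantee on the weight matrix is easy to check numerically. The sketch below uses a synthetic matrix as a stand-in for the output of `bmc.get_weights(summary=False)`; the helper `check_simplex` is not part of `pybmc`.

```python
import numpy as np

def check_simplex(weight_matrix, tol=1e-8):
    """True if every row is non-negative and sums to one (within tol)."""
    nonneg = np.all(weight_matrix >= -tol)
    sums_to_one = np.allclose(weight_matrix.sum(axis=1), 1.0, atol=tol)
    return bool(nonneg and sums_to_one)

# Synthetic (n_samples x n_models) stand-in for bmc.get_weights(summary=False):
rng = np.random.default_rng(42)
raw = rng.random((1000, 5))
weight_matrix = raw / raw.sum(axis=1, keepdims=True)  # normalize each row

print(check_simplex(weight_matrix))            # True
print(check_simplex(np.array([[1.5, -0.5]])))  # False: negative entry
```

A check like this is useful as a quick regression test after training in simplex mode; in unconstrained mode it will typically return `False`.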

pybmc/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,7 @@
 
 from .data import Dataset
 from .bmc import BayesianModelCombination
-from .inference_utils import gibbs_sampler, USVt_hat_extraction
+from .inference_utils import gibbs_sampler, gibbs_sampler_simplex, USVt_hat_extraction
 from .sampling_utils import coverage
 
 
@@ -19,6 +19,7 @@
     "Dataset",
     "BayesianModelCombination",
     "gibbs_sampler",
+    "gibbs_sampler_simplex",
     "USVt_hat_extraction",
     "coverage",
 ]

pybmc/bmc.py

Lines changed: 87 additions & 8 deletions
@@ -3,7 +3,7 @@
 import matplotlib.pyplot as plt
 from sklearn.model_selection import train_test_split
 import os
-from .inference_utils import gibbs_sampler, USVt_hat_extraction
+from .inference_utils import gibbs_sampler, gibbs_sampler_simplex, USVt_hat_extraction
 from .sampling_utils import coverage, rndm_m_random_calculator
 
 
@@ -16,26 +16,41 @@ class BayesianModelCombination:
     + Predictions for certain isotopes.
     """
 
-    def __init__(self, models_list, data_dict, truth_column_name, weights=None):
+    VALID_CONSTRAINTS = ("unconstrained", "simplex")
+
+    def __init__(self, models_list, data_dict, truth_column_name, weights=None, constraint="unconstrained"):
         """
         Initialize the BayesianModelCombination class.
 
         :param models_list: List of model names
         :param data_dict: Dictionary from `load_data()` where each key is a model name and each value is a DataFrame of properties
         :param truth_column_name: Name of the column containing the truth values.
         :param weights: Optional initial weights for the models.
+        :param constraint: Weight constraint mode. Options:
+            - ``"unconstrained"`` (default): No constraints on model weights.
+            - ``"simplex"``: Forces weights to lie on the probability simplex
+              (each weight between 0 and 1, weights sum to 1). Uses a
+              Metropolis-within-Gibbs sampler to enforce the constraint.
         """
 
         if not isinstance(models_list, list) or not all(isinstance(model, str) for model in models_list):
             raise ValueError("The 'models' should be a list of model names (strings) for Bayesian Combination.")
         if not isinstance(data_dict, dict) or not all(isinstance(df, pd.DataFrame) for df in data_dict.values()):
             raise ValueError("The 'data_dict' should be a dictionary of pandas DataFrames, one per property.")
+        if constraint not in self.VALID_CONSTRAINTS:
+            raise ValueError(
+                f"Invalid constraint '{constraint}'. "
+                f"Must be one of {self.VALID_CONSTRAINTS}."
+            )
 
         self.data_dict = data_dict
         self.models_list = models_list
         self.models = [m for m in models_list if m != 'truth']
         self.weights = weights if weights is not None else None
         self.truth_column_name = truth_column_name
+        self.constraint = constraint
+        self.samples = None
+        self.Vt_hat = None
 
 
     def orthogonalize(self, property, train_df, components_kept):

@@ -85,27 +100,61 @@ def train(self, training_options=None):
         """
         Train the model combination using training data and optional training parameters.
 
-        :param training_data: Placeholder (not used).
         :param training_options: Dictionary of training options. Keys:
             - 'iterations': (int) Number of Gibbs iterations (default 50000)
+            - 'sampler': (str) Override the constraint mode for this training run.
+              ``"unconstrained"`` or ``"simplex"``. If not provided, uses the
+              instance-level ``self.constraint`` set at initialization.
             - 'b_mean_prior': (np.ndarray) Prior mean vector (default zeros)
+              *(unconstrained sampler only)*
             - 'b_mean_cov': (np.ndarray) Prior covariance matrix (default diag(S_hat²))
+              *(unconstrained sampler only)*
             - 'nu0_chosen': (float) Degrees of freedom for variance prior (default 1.0)
             - 'sigma20_chosen': (float) Prior variance (default 0.02)
+            - 'burn': (int) Burn-in iterations (default 10000)
+              *(simplex sampler only)*
+            - 'stepsize': (float) Proposal step size (default 0.001)
+              *(simplex sampler only)*
         """
         if training_options is None:
             training_options = {}
 
+        # Determine which sampler to use: training_options overrides instance default
+        sampler_mode = training_options.get('sampler', self.constraint)
+        if sampler_mode not in self.VALID_CONSTRAINTS:
+            raise ValueError(
+                f"Invalid sampler '{sampler_mode}'. "
+                f"Must be one of {self.VALID_CONSTRAINTS}."
+            )
+
         iterations = training_options.get('iterations', 50000)
         num_components = self.U_hat.shape[1]
         S_hat = self.S_hat
-
-        b_mean_prior = training_options.get('b_mean_prior', np.zeros(num_components))
-        b_mean_cov = training_options.get('b_mean_cov', np.diag(S_hat**2))
         nu0_chosen = training_options.get('nu0_chosen', 1.0)
         sigma20_chosen = training_options.get('sigma20_chosen', 0.02)
 
-        self.samples = gibbs_sampler(self.centered_experiment_train, self.U_hat, iterations, [b_mean_prior, b_mean_cov, nu0_chosen, sigma20_chosen])
+        if sampler_mode == "simplex":
+            burn = training_options.get('burn', 10000)
+            stepsize = training_options.get('stepsize', 0.001)
+            self.samples = gibbs_sampler_simplex(
+                self.centered_experiment_train,
+                self.U_hat,
+                self.Vt_hat,
+                self.S_hat,
+                iterations,
+                [nu0_chosen, sigma20_chosen],
+                burn=burn,
+                stepsize=stepsize,
+            )
+        else:
+            b_mean_prior = training_options.get('b_mean_prior', np.zeros(num_components))
+            b_mean_cov = training_options.get('b_mean_cov', np.diag(S_hat**2))
+            self.samples = gibbs_sampler(
+                self.centered_experiment_train,
+                self.U_hat,
+                iterations,
+                [b_mean_prior, b_mean_cov, nu0_chosen, sigma20_chosen],
+            )

@@ -185,7 +234,37 @@ def evaluate(self, domain_filter=None):
 
         return coverage(np.arange(0, 101, 5), rndm_m, df, truth_column=self.truth_column_name)
 
+    def get_weights(self, summary=True):
+        """
+        Compute model weights from posterior samples.
+
+        Converts the sampled coefficient vectors (beta) into model weights
+        using the transformation ``omega = beta @ Vt_hat + 1/M``, where M is
+        the number of models. In simplex-constrained mode, all weights are
+        guaranteed to be non-negative and sum to 1.
+
+        :param summary: If True (default), return a dictionary with
+            ``'mean'``, ``'std'``, ``'median'`` arrays keyed by statistic.
+            If False, return the full ``(n_samples, n_models)`` weight matrix.
+        :return: Weight summary dict or full weight matrix.
+        :raises ValueError: If ``train()`` has not been called.
+        """
+        if self.samples is None or self.Vt_hat is None:
+            raise ValueError("Must call `orthogonalize()` and `train()` before getting weights.")
+
+        betas = self.samples[:, :-1]
+        n_models = self.Vt_hat.shape[1]
+        default_weights = np.full(n_models, 1.0 / n_models)
+        weight_matrix = betas @ self.Vt_hat + default_weights
+
+        if summary:
+            return {
+                "mean": np.mean(weight_matrix, axis=0),
+                "std": np.std(weight_matrix, axis=0),
+                "median": np.median(weight_matrix, axis=0),
+                "models": self.models,
+            }
+        return weight_matrix
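The `omega = beta @ Vt_hat + 1/M` transformation at the heart of `get_weights` can be exercised standalone on synthetic posterior samples; the shapes and random values below are illustrative, not taken from a real run.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples, m, n_models = 500, 2, 4

# Synthetic posterior: m beta columns plus a final variance column,
# mirroring the layout get_weights assumes (betas = samples[:, :-1]).
samples = rng.standard_normal((n_samples, m + 1))
Vt_hat = rng.standard_normal((m, n_models))

betas = samples[:, :-1]                         # drop the variance column
weight_matrix = betas @ Vt_hat + 1.0 / n_models # one weight row per draw

summary = {
    "mean": weight_matrix.mean(axis=0),
    "std": weight_matrix.std(axis=0),
    "median": np.median(weight_matrix, axis=0),
}
print(weight_matrix.shape, summary["mean"].shape)  # (500, 4) (4,)
```

With a random `Vt_hat` these weights are unconstrained; the simplex guarantees only hold when the samples come from the constrained sampler.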
pybmc/sampling_utils.py

Lines changed: 3 additions & 1 deletion
@@ -54,7 +54,9 @@ def rndm_m_random_calculator(filtered_model_predictions, samples, Vt_hat):
     np.random.seed(142858)
     rng = np.random.default_rng()
 
-    theta_rand_selected = rng.choice(samples, 10000, replace=False)
+    n_draws = min(10000, len(samples))
+    replace = len(samples) < 10000
+    theta_rand_selected = rng.choice(samples, n_draws, replace=replace)
 
     # Extract betas and noise std deviations
     betas = theta_rand_selected[:, :-1]  # shape: (10000, num_models - 1)
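The guarded draw above avoids the error `Generator.choice` raises when more unique rows are requested than exist. The pattern in isolation (a minimal sketch, not the `pybmc` function; `subsample_rows` is a hypothetical name):

```python
import numpy as np

def subsample_rows(samples, target=10000, seed=None):
    """Draw up to `target` rows; sample with replacement only when the
    chain is shorter than the target (mirrors the guarded draw above)."""
    rng = np.random.default_rng(seed)
    n_draws = min(target, len(samples))
    replace = len(samples) < target
    # Generator.choice selects whole rows of a 2-D array along axis 0.
    return rng.choice(samples, n_draws, replace=replace)

short_chain = np.arange(12).reshape(6, 2)  # only 6 posterior draws available
picked = subsample_rows(short_chain, target=10000, seed=0)
print(picked.shape)  # (6, 2): no ValueError, unlike replace=False with size 10000
```

Without the guard, `rng.choice(samples, 10000, replace=False)` fails on any chain shorter than 10000 draws, which is exactly the case after aggressive burn-in.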
