Commit 5d443df

Merge fix/kwargs-bug: fix CorrelationSplitter kwargs handling, harden tests

2 parents 71eb422 + 3965dc1

File tree: 7 files changed, +260 −95 lines

.github/workflows/python-publish.yml

Lines changed: 3 additions & 2 deletions

```diff
@@ -9,8 +9,9 @@
 name: Upload Python Package

 on:
-  release:
-    types: [published]
+  push:
+    tags:
+      - 'v*'

 permissions:
   contents: read
```

.gitignore

Lines changed: 3 additions & 1 deletion

```diff
@@ -166,4 +166,6 @@ cython_debug/
 .idea/

 # VSCode
-.vscode/
+.vscode/uv.lock
+test/tmp/
+.claude/
```

CLAUDE.md

Lines changed: 72 additions & 0 deletions (new file)

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

VertiBench is a Python library for benchmarking vertical federated learning (VFL). It generates synthetic VFL datasets with tunable feature importance imbalance and inter-party correlation, then evaluates the quality of vertical data partitions along those two dimensions.
## Build & Development Commands

```bash
# Install from source (editable)
pip install -e .

# Install with test dependencies (adds xgboost)
pip install -e ".[test]"

# Build distribution
python -m build

# Run all tests
python -m unittest discover test/

# Run individual test files
python -m unittest test.test_splitter
python -m unittest test.test_evaluator
python -m unittest test.test_evaluate_alpha

# Run a single test case
python -m unittest test.test_splitter.TestImportanceSplitter.test_split_tabular
```

No linter or formatter is configured for this project.
## Architecture

The library lives in `src/vertibench/` and has two core modules:

### Splitter.py — Vertical Data Partitioning

Abstract base class `Splitter` defines the interface: `split_indices()` returns per-party feature index lists, and `split()` applies them to datasets.

Three implementations:

- **ImportanceSplitter** — Uses a Dirichlet distribution to assign features to parties with controllable importance imbalance. The `weights` parameter controls the expected importance per party (higher weight = more features).
- **CorrelationSplitter** — Uses the BRKGA genetic algorithm (via pymoo) to find partitions that match a target inter/intra-party correlation ratio. The parameter `beta` in [0, 1] controls the balance. Requires `fit()` on data before splitting.
- **SimpleSplitter** — Uniform contiguous split of features across parties.
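The Dirichlet-based allocation idea behind `ImportanceSplitter` can be sketched in plain numpy. This is an illustration of the mechanism described above, not VertiBench's implementation; the function name `dirichlet_feature_split` is made up for the sketch.

```python
import numpy as np

def dirichlet_feature_split(n_features, weights, seed=None):
    """Assign feature indices to parties with Dirichlet-distributed shares.

    Sketch of the idea behind ImportanceSplitter (not the library's code):
    each party's expected share of features is proportional to its weight,
    and smaller weights produce more imbalanced allocations.
    """
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet(weights)  # per-party probability of owning a feature
    owners = rng.choice(len(weights), size=n_features, p=probs)
    return [np.where(owners == p)[0] for p in range(len(weights))]

parties = dirichlet_feature_split(10, weights=[1.0, 1.0, 1.0], seed=0)
# every feature index is assigned to exactly one party
assert sorted(np.concatenate(parties).tolist()) == list(range(10))
```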
### Evaluator.py — Split Quality Assessment

- **ImportanceEvaluator** — Computes per-party feature importance using the SHAP Permutation explainer. `evaluate_alpha()` recovers the Dirichlet concentration parameter from the importance scores.
- **CorrelationEvaluator** — Computes correlation matrices and scores inner- vs. inter-party correlation. `evaluate_beta()` recovers the correlation concentration metric. Supports GPU acceleration via PyTorch (`gpu_id` parameter). Uses multiple SVD strategies depending on feature count (exact for fewer than 100 features, randomized for larger).
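The inner- vs. inter-party distinction can be made concrete with a simplified statistic: mean absolute Pearson correlation within parties versus across them. This is a stand-in for `CorrelationEvaluator` (whose real metric is SVD-based, per the description above); `party_corr_scores` is a hypothetical helper.

```python
import numpy as np

def party_corr_scores(Xs):
    """Mean absolute Pearson correlation inside vs. across parties.

    Simplified stand-in for CorrelationEvaluator's statistic; the real
    implementation uses SVD-based scores, not raw means.
    """
    X = np.concatenate(Xs, axis=1)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    d = X.shape[1]
    # which party owns each column (parties are contiguous blocks here)
    bounds = np.cumsum([x.shape[1] for x in Xs])
    party = np.searchsorted(bounds, np.arange(d), side='right')
    same = party[:, None] == party[None, :]
    off_diag = ~np.eye(d, dtype=bool)
    inner = corr[same & off_diag].mean()   # within-party pairs
    inter = corr[~same].mean()             # cross-party pairs
    return inner, inter

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
inner, inter = party_corr_scores([X[:, :3], X[:, 3:]])
assert 0.0 <= inner <= 1.0 and 0.0 <= inter <= 1.0
```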
### Key Data Flow

1. Generate data (e.g., `sklearn.datasets.make_classification`)
2. `Splitter.split(X)` → list of per-party feature matrices `Xs`
3. `Evaluator.evaluate(Xs, ...)` → quality scores
4. `evaluate_alpha()` / `evaluate_beta()` → concentration metrics
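The steps above can be sketched end to end with plain numpy, using a contiguous split in place of the library's splitters and a crude variance-share score in place of the evaluators. All names here are illustrative, not VertiBench's API.

```python
import numpy as np

# 1. Generate data (stand-in for sklearn.datasets.make_classification)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))

# 2. Split features across parties (SimpleSplitter-style contiguous split)
num_parties = 4
Xs = np.array_split(X, num_parties, axis=1)

# 3. A crude per-party "importance" score: each party's share of total variance
var_share = np.array([x.var(axis=0).sum() for x in Xs])
var_share = var_share / var_share.sum()

# 4. evaluate_alpha()/evaluate_beta() would recover concentration parameters
#    from scores like these; here we only sanity-check the shares.
assert np.isclose(var_share.sum(), 1.0)
assert all(x.shape[0] == X.shape[0] for x in Xs)
```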
### Design Patterns

- `Splitter` uses ABC + template method: concrete classes implement `split_indices()`, the base class handles the `split()` logic.
- `CorrelationSplitter` composes a `CorrelationEvaluator` internally for optimization.
- Correlation computation has multiple backends: Spearman (pandas), Pearson (numpy/torch), with CPU/GPU variants.
## Testing

Tests use `unittest` with `subTest()` for parameterized variants. Test data is generated synthetically via `generate_data()` and `split_data()` helpers in each test file. The evaluator tests train actual XGBoost models, so the `[test]` extras are required.
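The `subTest()` parameterization pattern mentioned above looks like this. The test class and the `contiguous_split` helper are stand-ins for illustration; the suite's actual helpers (`generate_data()`, `split_data()`) differ.

```python
import unittest

class TestContiguousSplit(unittest.TestCase):
    """Illustrates the subTest() pattern used by the suite (hypothetical test)."""

    @staticmethod
    def contiguous_split(n_features, num_parties):
        # stand-in for the SimpleSplitter-style contiguous partition
        cuts = [n_features * p // num_parties for p in range(num_parties + 1)]
        return [list(range(cuts[p], cuts[p + 1])) for p in range(num_parties)]

    def test_all_features_covered(self):
        for num_parties in (1, 2, 3, 5):
            # subTest keeps iterating after a failure and labels each variant
            with self.subTest(num_parties=num_parties):
                parts = self.contiguous_split(10, num_parties)
                flat = [i for part in parts for i in part]
                self.assertEqual(flat, list(range(10)))

if __name__ == "__main__":
    unittest.main(argv=["prog"], exit=False)
```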
## Dependencies

Key: numpy, scipy, scikit-learn, torch, shap, pymoo, matplotlib. Python >= 3.9.

HyperParameters.md

Lines changed: 0 additions & 14 deletions
This file was deleted.

src/vertibench/Splitter.py

Lines changed: 25 additions & 13 deletions

```diff
@@ -26,14 +26,18 @@ def split_indices(self, *args, **kwargs):
         """
         pass

-    def split(self, *Xs, indices=None, allow_empty_party=False, fill=None):
+    def split(self, *Xs, indices=None, allow_empty_party=False, fill=None, **kwargs):
         assert len(Xs) > 0, "At least one dataset should be given"
+        n_features = Xs[0].shape[1]
+        if n_features < self.num_parties:
+            raise ValueError(
+                f"Number of features ({n_features}) must be >= number of parties ({self.num_parties})")
         ans = []

         # calculate the indices for each party for all datasets
         if indices is None:
             allX = np.concatenate(Xs, axis=0)
-            party_to_feature = self.split_indices(allX, allow_empty_party=allow_empty_party)
+            party_to_feature = self.split_indices(allX, allow_empty_party=allow_empty_party, **kwargs)
         else:
             party_to_feature = indices

@@ -57,12 +61,12 @@ def split(self, *Xs, indices=None, allow_empty_party=False, fill=None):


 class ImportanceSplitter(Splitter):
-    def __init__(self, num_parties, weights=1, seed=None):
+    def __init__(self, num_parties, weights=1., seed=None):
         """
         Split a 2D dataset by feature importance under dirichlet distribution (assuming the features are independent).
         :param num_parties: [int] number of parties
-        :param weights: [int | list with size num_parties]
-            If weights is an int, the weight of each party is the same.
+        :param weights: [float | list with size num_parties]
+            If weights is a float, the weight of each party is the same. Equivalent to an array of [weights]*num_parties.
             If weights is an array, the weight of each party is the corresponding element in the array.
             The weights indicate the expected sum of feature importance of each party.
             Meanwhile, larger weights mean less bias on the feature importance.
@@ -72,8 +76,8 @@ def __init__(self, num_parties, weights=1, seed=None):
         self.weights = weights
         self.seed = seed
         np.random.seed(seed)
-        if isinstance(self.weights, Real):
-            self.weights = [self.weights for _ in range(self.num_parties)]
+        if isinstance(self.weights, Real):  # both int & float values pass this 'if'
+            self.weights = [self.weights for _ in range(self.num_parties)]  # a uniform weights array is constructed

         self.check_params()

@@ -103,7 +107,7 @@ def dirichlet(alpha):
         xs.append(1 - sum(xs))
         return np.array(xs)

-    def split_indices(self, X, allow_empty_party=False):
+    def split_indices(self, X, allow_empty_party=False, **kwargs):
         """
         Split the indices of X by feature importance.
         :param allow_empty_party: [bool] whether to allow parties with zero features
@@ -168,7 +172,7 @@ def __init__(self, num_parties: int, evaluator: CorrelationEvaluator = None, see
         super().__init__(num_parties)
         self.evaluator = evaluator
         if evaluator is None:
-            self.evaluator = CorrelationEvaluator(gpu_id=gpu_id)
+            self.evaluator = CorrelationEvaluator(gpu_id=gpu_id, n_jobs=n_jobs)
         self.seed = seed
         self.gpu_id = gpu_id
         if self.gpu_id is not None:
@@ -320,7 +324,7 @@ def split_indices(self, X, n_elites=20, n_offsprings=70, n_mutants=10, n_gen=100
         self.best_icor = res_beta.opt.get('icor')[0]
         self.best_error = res_beta.F[0]
         # print(f"Best permutation order: {permute_order}")
-        # print(f"Beta {self.beta}, Best match icor: {best_match_icor}")
+        # print(f"Beta {beta}, Best match icor: {self.best_icor}")

         # summarize the feature ids on each party
         party_cut_points = np.cumsum(self.evaluator.n_features_on_party)
@@ -333,9 +337,17 @@ def split_indices(self, X, n_elites=20, n_offsprings=70, n_mutants=10, n_gen=100
         assert (np.sort(np.concatenate(self.best_feature_per_party)) == np.arange(X.shape[1])).all()
         return self.best_feature_per_party

-    def fit_split(self, X, **kwargs):
-        self.fit(X, **kwargs)
-        return self.split(X, **kwargs)
+    def fit_split(self, X, beta=0.5, **fit_kwargs):
+        """
+        Fit the splitter and split the data.
+        :param X: [np.ndarray] 2D dataset
+        :param beta: [float] the tightness of inner-party correlation (passed to split_indices, not fit)
+        :param fit_kwargs: additional keyword arguments passed to fit() (BRKGA parameters for fit_min_max)
+        """
+        if not (0 <= beta <= 1):
+            raise ValueError(f"beta should be in [0, 1], got {beta}")
+        self.fit(X, **fit_kwargs)
+        return self.split(X, beta=beta)

     def visualize(self, *args, **kwargs):
         return self.evaluator.visualize(*args, **kwargs)
```
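The core of the kwargs fix above is a template-method pattern in which the base class's `split()` forwards extra keyword arguments through to the subclass's `split_indices()`. A minimal standalone sketch of that pattern follows; the class names mirror the diff but the bodies are simplified illustrations, not the library's code.

```python
import numpy as np
from abc import ABC, abstractmethod

class Splitter(ABC):
    def __init__(self, num_parties):
        self.num_parties = num_parties

    @abstractmethod
    def split_indices(self, X, **kwargs):
        ...

    def split(self, X, **kwargs):
        # Forward **kwargs so subclass-specific options (e.g. beta for a
        # correlation-based splitter) reach split_indices(); dropping them
        # here is the kind of bug this commit fixes.
        if X.shape[1] < self.num_parties:
            raise ValueError("fewer features than parties")
        idx = self.split_indices(X, **kwargs)
        return [X[:, i] for i in idx]

class SimpleSplitter(Splitter):
    def split_indices(self, X, shuffle=False, seed=None, **kwargs):
        cols = np.arange(X.shape[1])
        if shuffle:  # subclass-specific option, delivered via **kwargs
            cols = np.random.default_rng(seed).permutation(cols)
        return np.array_split(cols, self.num_parties)

X = np.arange(20.).reshape(4, 5)
Xs = SimpleSplitter(2).split(X, shuffle=True, seed=0)
assert sum(x.shape[1] for x in Xs) == X.shape[1]
```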
