This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
VertiBench is a Python library for benchmarking vertical federated learning (VFL). It generates synthetic VFL datasets with tunable feature importance imbalance and inter-party correlation, then evaluates the quality of vertical data partitions along those two dimensions.
No linter or formatter is configured for this project.
## Architecture
The library lives in `src/vertibench/` and has two core modules:
### Splitter.py — Vertical Data Partitioning
Abstract base class `Splitter` defines the interface: `split_indices()` returns per-party feature index lists, and `split()` applies them to datasets.
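The interface can be sketched as follows. The `Splitter`, `split_indices()`, and `split()` names come from the description above; the exact signatures and the toy `RoundRobinSplitter` subclass are illustrative assumptions, not the library's code.

```python
from abc import ABC, abstractmethod

import numpy as np


class Splitter(ABC):
    """Template-method base: subclasses decide which features go to
    which party; the base class applies that decision to the data."""

    @abstractmethod
    def split_indices(self, X, n_parties):
        """Return a list of feature-index arrays, one per party."""

    def split(self, X, n_parties):
        # Shared logic: slice the feature matrix by the chosen indices.
        return [X[:, idx] for idx in self.split_indices(X, n_parties)]


class RoundRobinSplitter(Splitter):
    """Toy concrete splitter: deal features out one party at a time."""

    def split_indices(self, X, n_parties):
        return [np.arange(p, X.shape[1], n_parties) for p in range(n_parties)]
```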
Three implementations:
- **ImportanceSplitter** — Uses a Dirichlet distribution to assign features to parties with controllable importance imbalance. The `weights` parameter controls each party's expected share of importance (a higher weight means more features).
- **CorrelationSplitter** — Uses the BRKGA genetic algorithm (via pymoo) to find partitions that match a target inter/intra-party correlation ratio. The parameter `beta` ∈ [0, 1] controls the balance. Requires calling `fit()` on the data before splitting.
- **SimpleSplitter** — Splits features uniformly and contiguously across parties.
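The Dirichlet-allocation idea behind `ImportanceSplitter` can be sketched without the library. `dirichlet_split_indices` is a hypothetical helper for illustration, not VertiBench's API:

```python
import numpy as np


def dirichlet_split_indices(n_features, weights, seed=0):
    """Sketch of Dirichlet-based allocation: sample per-party ratios
    from Dirichlet(weights), then assign each feature to a party drawn
    with those probabilities. A higher weight gives a larger expected
    share of features, and hence higher expected importance."""
    rng = np.random.default_rng(seed)
    ratios = rng.dirichlet(weights)  # expected value: weights / sum(weights)
    owner = rng.choice(len(weights), size=n_features, p=ratios)
    return [np.flatnonzero(owner == p) for p in range(len(weights))]
```

Note how equal weights give a balanced split only *in expectation*; any single draw can still be skewed, which is exactly the imbalance the evaluator is meant to measure.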
### Evaluator.py — Split Quality Assessment
- **ImportanceEvaluator** — Computes per-party feature importance using SHAP's Permutation explainer. `evaluate_alpha()` recovers the Dirichlet concentration parameter from the importance scores.
- **CorrelationEvaluator** — Computes correlation matrices and scores intra- vs. inter-party correlation. `evaluate_beta()` recovers the correlation concentration metric. Supports GPU acceleration via PyTorch (`gpu_id` parameter) and chooses an SVD strategy by feature count (exact below 100 features, randomized above).
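The intra- vs. inter-party scoring idea can be illustrated with plain numpy. `party_correlation_ratio` is a hypothetical helper showing the inter/intra ratio concept, not the evaluator's actual metric:

```python
import numpy as np


def party_correlation_ratio(Xs):
    """Sketch: mean absolute inter-party correlation divided by mean
    absolute intra-party correlation (off-diagonal entries only),
    from the Pearson correlation matrix of the concatenated features."""
    X = np.hstack(Xs)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    d = X.shape[1]
    # Label each column with its owning party.
    owner = np.concatenate([np.full(x.shape[1], p) for p, x in enumerate(Xs)])
    same = owner[:, None] == owner[None, :]
    off_diag = ~np.eye(d, dtype=bool)
    intra = corr[same & off_diag].mean()
    inter = corr[~same].mean()
    return inter / intra
```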
### Key Data Flow
1. Generate data (e.g., `sklearn.datasets.make_classification`)
2. `Splitter.split(X)` → list of per-party feature matrices `Xs`

### Design Notes

- `Splitter` uses ABC + template method: concrete classes implement `split_indices()`, the base class handles the `split()` logic.
- `CorrelationSplitter` composes a `CorrelationEvaluator` internally for optimization.
- Correlation computation has multiple backends: Spearman (pandas), Pearson (numpy/torch), with CPU/GPU variants.
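The backend note above leans on a standard fact: Spearman correlation is Pearson correlation computed on ranks, so a rank transform plus a single Pearson kernel covers both. A numpy-only sketch (the pandas and torch backends would mirror this shape; tie correction is omitted):

```python
import numpy as np


def pearson_corr(X):
    # Plain Pearson correlation matrix over columns.
    return np.corrcoef(X, rowvar=False)


def spearman_corr(X):
    # Spearman = Pearson applied to column-wise ranks.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return pearson_corr(ranks)
```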
## Testing
Tests use `unittest` with `subTest()` for parameterized variants. Test data is generated synthetically via `generate_data()` and `split_data()` helpers in each test file. The evaluator tests train actual XGBoost models, so the `[test]` extras are required.
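A toy example of the `unittest` + `subTest()` pattern (the test class and its contents are illustrative, not taken from the suite):

```python
import unittest


class TestSplitSizes(unittest.TestCase):
    """Illustrates subTest(): one test method, several labeled variants,
    and a failure in one variant does not stop the others."""

    def test_partition_covers_all_features(self):
        for n_parties in (2, 3, 4):
            with self.subTest(n_parties=n_parties):
                # Sizes of a uniform contiguous split of 12 features.
                sizes = [12 // n_parties + (1 if p < 12 % n_parties else 0)
                         for p in range(n_parties)]
                self.assertEqual(sum(sizes), 12)
```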