Convert JobAPI to Recipe for Kaplan-Meier example (#3894)
Fixes # .
### Description
Convert the KM example from JobAPI to Recipe, and add production
instructions with a provisioned HE context.
### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
---------
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
* How to perform Kaplan-Meier survival analysis in a federated setting, without and with secure features, via time-binning and Homomorphic Encryption (HE).
5
-
* How to use the Flare ModelController API to contract a workflow to facilitate HE under simulator mode.
5
+
* How to use the Recipe API with Flare ModelController for job configuration and execution in both simulation and production environments.
6
6
7
7
## Basics of Kaplan-Meier Analysis
8
8
Kaplan-Meier survival analysis is a non-parametric statistic used to estimate the survival function from lifetime data. It is used to analyze the time it takes for an event of interest to occur. For example, during a clinical trial, the Kaplan-Meier estimator can be used to estimate the proportion of patients who survive a certain amount of time after treatment.
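As a rough illustration of the estimator itself (not the code used in this example), the sketch below computes the product-limit estimate S(t) from a list of durations and event flags in plain Python. The function name `kaplan_meier` and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    durations: observed time for each subject
    events: 1 if the event occurred at that time, 0 if the subject was censored
    """
    # Number of observed events at each time
    deaths = Counter(t for t, e in zip(durations, events) if e)
    # Every subject (event or censored) leaves the risk set at its own time
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data (hypothetical): five subjects, two of them censored
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored subjects never reduce the survival estimate directly; they only shrink the risk set used at later event times.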
@@ -62,7 +62,7 @@ To run the baseline script, simply execute:
62
62
```commandline
63
63
python utils/baseline_kaplan_meier.py
64
64
```
65
-
By default, this will generate a KM curve image `km_curve_baseline.png` under `/tmp` directory. The resulting KM curve is shown below:
65
+
By default, this will generate a KM curve image `km_curve_baseline.png` under the `/tmp/nvflare/baseline` directory. The resulting KM curve is shown below:
Here, we show the survival curves for both daily (without binning) and weekly binning. The two curves align well with each other, while the weekly-binned curve has lower resolution.
68
68
@@ -72,41 +72,232 @@ We make use of FLARE ModelController API to implement the federated Kaplan-Meier
72
72
73
73
The Flare ModelController API (`ModelController`) provides the functionality of flexible FLModel payloads for each round of federated analysis. This gives us the flexibility of transmitting various information needed by our scheme at different stages of federated learning.
74
74
75
-
Our [existing HE examples](../cifar10/cifar10-real-world) uses data filter mechanism for HE, provisioning the HE context information (specs and keys) for both client and server of the federated job under [CKKS](../../../nvflare/app_opt/he/model_encryptor.py) scheme. In this example, we would like to illustrate ModelController's capability in supporting customized needs beyond the existing HE functionalities (designed mainly for encrypting deep learning models).
76
-
- different HE schemes (BFV) rather than CKKS
77
-
-different content at different rounds of federated learning, and only specific payload needs to be encrypted
75
+
Our [existing HE examples](../cifar10/cifar10-real-world) use a data filter mechanism for HE, provisioning the HE context information (specs and keys) for both the clients and the server of the federated job under the [CKKS](../../../nvflare/app_opt/he/model_encryptor.py) scheme. In this example, we would like to illustrate ModelController's capability in supporting customized needs beyond the existing HE functionalities (designed mainly for encrypting deep learning models):
76
+
- Different content at different rounds of federated learning, where only specific payloads need to be encrypted
77
+
- Flexibility in choosing what to encrypt (histograms) versus what to send in plain text (metadata)
78
78
79
79
With the ModelController API, such a "proof of concept" experiment becomes easy. In this example, the federated analysis pipeline includes 2 rounds without HE, or 3 rounds with HE.
80
80
81
81
For the federated analysis without HE, the detailed steps are as follows:
82
82
1. Server sends a simple start message without any payload.
83
83
2. Clients submit the local event histograms to server. Server aggregates the histograms with varying lengths by adding event counts of the same slot together, and sends the aggregated histograms back to clients.
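The server-side aggregation in step 2 can be sketched as follows. This is a minimal illustration that assumes each client histogram is a plain list of counts indexed by time slot; the helper name `aggregate_histograms` is hypothetical, and the actual example may represent histograms differently.

```python
from itertools import zip_longest


def aggregate_histograms(client_hists):
    """Add counts slot by slot; shorter histograms are treated as zero-padded."""
    return [sum(slot) for slot in zip_longest(*client_hists, fillvalue=0)]


# Three clients with histograms of different lengths (made-up counts)
agg = aggregate_histograms([[2, 0, 1], [1, 3], [0, 1, 0, 4]])
```

Without HE the server sees plain counts, so it can align histograms of different lengths on its own and no extra round is needed to agree on a common length.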
84
84
85
-
For the federated analysis with HE, we need to ensure proper HE aggregation using BFV, and the detailed steps are as follows:
85
+
For the federated analysis with HE, we need to ensure proper HE aggregation using CKKS, and the detailed steps are as follows:
86
86
1. Server sends a simple start message without any payload.
87
87
2. Clients collect their local maximum bin number (for event time) and send it to the server, which aggregates the information by selecting the maximum among all clients. The global maximum is then distributed back to the clients. This step is necessary because the histograms generated by all clients must be standardized: they need to have exactly the same length so that they can be encrypted as vectors of the same size and added element-wise.
88
88
3. Clients condense their local raw event lists into two histograms with the global length received, encrypt the histogram value vectors, and send them to the server. The server aggregates the received histograms by adding the encrypted vectors together, and sends the aggregated histograms back to the clients.
89
89
90
90
After these rounds, the federated work is completed. Then at each client, the aggregated histograms will be decrypted and converted back to an event list, and Kaplan-Meier analysis can be performed on the global information.
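The data flow of the HE variant can be sketched in plain Python. In this sketch, encryption is deliberately left out: the element-wise list addition in `add_vectors` is only a stand-in for the homomorphic addition of encrypted vectors, and all function names, the `(time_bin, observed)` event format, and the toy data are assumptions for illustration rather than the example's actual code.

```python
def local_max_bin(event_list):
    """Round 2, client side: largest time bin seen locally."""
    return max(t for t, _ in event_list)


def to_histograms(event_list, global_max_bin):
    """Round 3, client side: two fixed-length histograms (observed, censored)."""
    observed = [0] * (global_max_bin + 1)
    censored = [0] * (global_max_bin + 1)
    for t, is_observed in event_list:
        (observed if is_observed else censored)[t] += 1
    return observed, censored


def add_vectors(vectors):
    """Stand-in for homomorphic addition: vectors must have equal length."""
    return [sum(column) for column in zip(*vectors)]


def to_event_list(observed, censored):
    """After decryption: expand the aggregated histograms back to events."""
    events = []
    for t, (o, c) in enumerate(zip(observed, censored)):
        events += [(t, True)] * o + [(t, False)] * c
    return events


# Two hypothetical clients; each event is (time_bin, observed)
clients = [
    [(1, True), (3, False)],
    [(2, True), (2, True), (5, True)],
]

global_max = max(local_max_bin(c) for c in clients)       # server, round 2
hists = [to_histograms(c, global_max) for c in clients]   # clients, round 3
agg_observed = add_vectors([h[0] for h in hists])         # server, round 3
agg_censored = add_vectors([h[1] for h in hists])
global_events = to_event_list(agg_observed, agg_censored)  # at each client
```

Because every client pads its histograms to the same global length, the server can add the vectors without ever needing to inspect their contents.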
91
91
92
+
### HE Context and Data Management
93
+
94
+
- **Simulation Mode**:
95
+
- Uses **CKKS scheme** (approximate arithmetic, compatible with production)
96
+
- HE context files are manually created via `prepare_he_context.py`:
- Server context: `/tmp/nvflare/he_context/he_context_server.txt`
99
+
- Data prepared at `/tmp/nvflare/dataset/km_data`
100
+
- Paths can be customized via `--he_context_path` (for client context) and `--data_root`
101
+
- **Production Mode**:
102
+
- Uses **CKKS scheme**
103
+
- HE context is automatically provisioned into startup kits via `nvflare provision`
104
+
- Context files are resolved by NVFlare's SecurityContentService:
105
+
- Clients automatically use: `client_context.tenseal` (from their startup kit)
106
+
- Server automatically uses: `server_context.tenseal` (from its startup kit)
107
+
- The `--he_context_path` parameter is ignored in production mode
108
+
- **Reuses the same data** from simulation mode at `/tmp/nvflare/dataset/km_data` by default
109
+
110
+
**Note:** CKKS scheme provides strong encryption with approximate arithmetic, which works well for this Kaplan-Meier analysis. The histogram counts are encrypted as floating-point numbers and rounded back to integers after decryption. Both simulation and production modes use the same CKKS scheme for consistency and compatibility. Production mode can reuse the data prepared during simulation mode, eliminating redundant data preparation.
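The rounding step mentioned in the note can be illustrated with a small sketch. The decrypted values below are made up to mimic the small numerical error that approximate arithmetic introduces; the helper name `to_counts` and the clamp to zero are assumptions, not necessarily what the example does.

```python
def to_counts(decrypted_values):
    """Round approximate decrypted values back to non-negative integer counts."""
    # round() returns an int for a float argument; max() guards against a
    # value such as -1.3e-7 that should be read as a zero count (assumption).
    return [max(0, round(v)) for v in decrypted_values]


# Hypothetical decrypted histogram values with small approximation error
counts = to_counts([2.9999997, 4.0000002, -1.3e-7, 1.0])
```

Rounding is safe here as long as the accumulated approximation error stays well below 0.5 per slot.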
111
+
92
112
## Run the job
93
-
First, we prepared data for a 5-client federated job. We split and generate the data files for each client with binning interval of 7 days.
113
+
114
+
This example supports both **Simulation Mode** (for local testing) and **Production Mode** (for real-world deployment).
115
+
116
+
| Feature | Simulation Mode | Production Mode |
117
+
|---------|----------------|-----------------|
118
+
|**Use Case**| Testing & Development | Real-world Deployment / Production Testing |
119
+
|**HE Context**| Manual preparation via script | Auto-provisioned via startup kits |
120
+
|**Security**| Single machine, no encryption between processes | Secure startup kits with certificates |
Then we prepare HE context for clients and server, note that this step is done by secure provisioning for real-life applications, but in this study experimenting with BFV scheme, we use this step to distribute the HE context.
137
+
**Step 2: Prepare HE Context (Simulation Only)**
138
+
139
+
For simulation mode, manually prepare the HE context with CKKS scheme:
Next, we run the federated training using NVFlare Simulator via [JobAPI](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html), both without and with HE:
147
+
This generates the HE context with CKKS scheme (poly_modulus_degree=8192, global_scale=2^40) compatible with production mode.
148
+
149
+
**Step 3: Run the Job**
150
+
151
+
Run the job without and with HE:
152
+
```commandline
153
+
python job.py
154
+
python job.py --encryption
155
+
```
156
+
157
+
The script will execute the job in simulation mode and display the job status. Results (KM curves and analysis details) will be saved to each simulated client's workspace directory under `/tmp/nvflare/workspaces/`.
158
+
159
+
### Production Mode
160
+
161
+
For production deployments, the HE context is automatically provisioned through secure startup kits.
162
+
163
+
**Quick Start for Local Testing:**
164
+
If you want to quickly test production mode on a single machine:
165
+
1. Run provisioning: `nvflare provision -p project.yml -w /tmp/nvflare/prod_workspaces`
For detailed steps and distributed deployment, continue below:
173
+
174
+
**Step 1: Install NVFlare with HE Support**
175
+
176
+
```commandline
177
+
pip install nvflare[HE]
178
+
```
179
+
180
+
**Step 2: Provision Startup Kits with HE Context**
181
+
182
+
The `project.yml` file in this directory is pre-configured with `HEBuilder` using the CKKS scheme. Run provisioning to output to `/tmp/nvflare/prod_workspaces`:
- Look for `km_curve_fl_he.png` and `km_global.json` in each client's job directory
287
+
288
+
**Note:** In production mode with HE, the HE context paths are automatically configured to use the provisioned context files from each participant's startup kit:
289
+
- Clients use: `client_context.tenseal`
290
+
- Server uses: `server_context.tenseal`
291
+
292
+
The `--he_context_path` parameter is only used for simulation mode and is ignored in production mode. No manual HE context distribution is needed in production.
293
+
294
+
**Step 6: Shutdown All Parties**
295
+
296
+
After the job completes, shut down all parties gracefully via the admin console: