Commit 35b3a2b

Add phase 1 study plumbing (#4386)
## Summary
- add phase 1 study session plumbing across Flare API, ProdEnv, server job handling, and the admin CLI launch path
- scope `list_jobs` by session study, persist study on submit, preserve the source study on clone, and normalize legacy jobs without study metadata to `default`
- split the design docs into phase 1 and phase 2 documents and add focused unit/integration coverage, including admin terminal and legacy-client compatibility paths
1 parent eaa2d9f commit 35b3a2b

26 files changed

+1674
-19
lines changed

docs/design/multistudy_phase1.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
1+
# Multi-Study Support — Phase 1: Study Plumbing
2+
3+
## Introduction
4+
5+
FLARE currently operates as a single-tenant system: every authorized admin can see and act on every job, and there is no data segregation between different collaborations running on the same infrastructure.
6+
7+
Phase 1 introduces a **study** concept as lightweight metadata plumbing. Every job carries a `study` name (defaulting to `"default"`). The study flows from user-facing APIs into job metadata and runtime launchers so that K8s deployments can mount study-specific workspace volumes immediately — without any access-control, provisioning, or job-store changes.
8+
9+
See [multistudy_phase2.md](multistudy_phase2.md) for the full multi-tenancy design (access control, study registry, job-store partitioning, etc.).
10+
11+
> **Note:** Multi-study is an operational enhancement for shared-trust environments.
12+
> It does not provide the same isolation guarantees as separate deployments.
13+
> See [When to Use Multi-Study vs. Separate Deployments](multistudy_phase2.md#when-to-use-multi-study-vs-separate-deployments)
14+
> in [multistudy_phase2.md](multistudy_phase2.md).
15+
16+
### Design Principles
17+
18+
1. **Backward compatible** — a `default` study preserves current single-tenant behavior; legacy jobs missing a `study` field are treated as `default` on read
19+
2. **Phased rollout** — Phase 1 delivers plumbing only; access-control enforcement is deferred to Phase 2
20+
3. **Minimal footprint** — no authorization, no provisioning, no job-store layout changes
21+
22+
---
23+
24+
## Scope
25+
26+
1. `study: str = "default"` parameter on `ProdEnv`, `Session`, `new_session`, `new_secure_session`, `new_insecure_session`.
27+
2. `Session` carries the active study context; `list_jobs` inherits it and returns only jobs in that study.
28+
3. Study is passed through to job metadata at submission time, with syntax validation before persistence.
29+
4. Clone preserves the source job's study (not the session's study).
30+
5. `K8sJobLauncher` reads `study` from job metadata and selects the corresponding study workspace volume (TODO in code).
31+
6. `DockerJobLauncher` unchanged; TODO marker for future study-aware settings resolution.
32+
7. Admin console (`fl_admin.sh`) accepts `--study` at launch time; the study is established when the admin session logs in and then inherited by `submit_job` and `list_jobs`.
33+
8. No changes to authorization, job store paths, `project.yml` schema, scheduler, or provisioning.
34+
35+
---
36+
37+
## User Experience
38+
39+
### Data Scientist (Recipe API)
40+
41+
The recipe is unchanged. The study is specified via `ProdEnv`:
42+
43+
```python
env = ProdEnv(
    startup_kit_location=args.startup_kit_location,
    study="cancer-research",
)
run = recipe.execute(env)
```
50+
51+
If `study` is omitted, it defaults to `"default"`.
52+
53+
### Admin (FLARE API)
54+
55+
The `Session` gains a study context:
56+
57+
```python
sess = new_secure_session(
    username="admin@org_a.com",
    startup_kit_location="./startup",
    study="cancer-research",
)
jobs = sess.list_jobs()  # only jobs in cancer-research
sess.submit_job("./my_job")  # tagged to cancer-research
```
66+
67+
The study is session-scoped, not a per-command filter. For HCI/admin sessions, the study is sent during login, stored in the authenticated server session, and preserved in the session token so later commands inherit the same active study.
68+
69+
### Admin Console
70+
71+
```
$ ./startup/fl_admin.sh --study cancer-research

> list_jobs
... only shows jobs in cancer-research ...
```
77+
78+
If `--study` is omitted, the admin terminal uses `default`.
79+
80+
---
81+
82+
## Data Model
83+
84+
### Job Metadata
85+
86+
`study` is a first-class field on every job (`JobMetaKey.STUDY`). It is set at submission time from the session's active study and is immutable after creation.
87+
88+
The study value is syntactically validated at the API layer (client-side) and again on the server before persistence. The regex pattern is `^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$`: 1 to 63 characters drawn from lowercase letters, digits, and hyphens, beginning and ending with a letter or digit.
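A minimal sketch of the syntax check described above, assuming it is done with Python's `re` module. Only the pattern comes from this document; the helper name `is_valid_study_name` is hypothetical and may not match the name used in the codebase.

```python
import re

# Pattern quoted from the design text above.
STUDY_NAME_PATTERN = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")


def is_valid_study_name(name) -> bool:
    """Hypothetical helper: True if `name` is a syntactically valid study name."""
    # fullmatch rejects a trailing newline, which `$` with match() would accept.
    return isinstance(name, str) and STUDY_NAME_PATTERN.fullmatch(name) is not None
```

With this sketch, `cancer-research` and `a` pass, while `-abc`, `abc-`, `Cancer`, the empty string, and any name longer than 63 characters are rejected.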
89+
90+
### Session Transport
91+
92+
For `flare_api.Session` and the admin terminal, `study` is not passed on each command. It is established at session creation/login time, stored on the server-side authenticated session, and carried in the signed session token so recreated sessions keep the same active study.
93+
94+
### Legacy Jobs
95+
96+
Jobs created before Phase 1 have no `study` field. `get_job_meta_study()` returns `"default"` for these jobs, so they appear in the `default` study transparently.
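The normalization above can be illustrated with a self-contained sketch. `get_job_meta_study` mirrors the helper this commit adds to `nvflare/apis/job_def.py`, with the enum replaced by a plain string so the snippet runs on its own. `filter_jobs_by_study` and the sample job records are hypothetical and only show how a study-scoped listing could treat legacy jobs.

```python
DEFAULT_STUDY = "default"
STUDY_KEY = "study"  # value of JobMetaKey.STUDY in this commit


def get_job_meta_study(meta) -> str:
    """Return the job's study, or "default" when it is missing or not a non-empty string."""
    if not isinstance(meta, dict):
        return DEFAULT_STUDY
    study = meta.get(STUDY_KEY)
    if isinstance(study, str) and study:
        return study
    return DEFAULT_STUDY


def filter_jobs_by_study(job_metas, active_study):
    """Hypothetical helper: keep only job metadata that belongs to the active study."""
    return [m for m in job_metas if get_job_meta_study(m) == active_study]


# Illustrative records only; real job metadata carries many more fields.
jobs = [
    {"job_id": "j1", "study": "cancer-research"},
    {"job_id": "j2"},               # legacy job without a study field
    {"job_id": "j3", "study": ""},  # an empty value is normalized as well
]
default_jobs = filter_jobs_by_study(jobs, DEFAULT_STUDY)
```

Here `default_jobs` contains `j2` and `j3`, so a session in the `default` study keeps seeing jobs submitted before Phase 1.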
97+
98+
### Default Study Constant
99+
100+
`DEFAULT_STUDY = "default"` in `nvflare.apis.job_def` — single source of truth for the default value.
101+
102+
---
103+
104+
## What This Enables
105+
106+
- Data scientists can tag jobs with a study now; physical data isolation on K8s follows once the K8s launcher TODO is implemented.
107+
- Admin/API sessions operate inside one active study context, so Phase 2 authz can validate study access when sessions enter a study.
108+
- Legacy single-tenant deployments are unaffected — everything defaults to `"default"`.
109+
110+
## What This Does NOT Do
111+
112+
- No access control — any user can submit to any valid study name
113+
- No job store partitioning (`jobs/<uuid>/` path unchanged)
114+
- No `project.yml` parsing or `StudyRegistry`
115+
- No Docker launcher behavior change yet
116+
- No `set_study` / `list_studies` admin commands
117+
- Subprocess launcher unchanged (single-tenant/trusted only)

docs/design/multistudy_phase2.md

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
1+
# Multi-Study Support — Phase 2: Per-Study Access Control
2+
3+
## Prerequisites
4+
5+
Phase 1 ([multistudy_phase1.md](multistudy_phase1.md)) must be complete: `study` metadata flows from user-facing APIs through to job metadata and launcher plumbing.
6+
7+
## Introduction
8+
9+
Phase 2 adds per-study access control on top of Phase 1's plumbing. The existing role model (`project_admin`, `org_admin`, `lead`, `member`) is unchanged — the only difference is that roles become per-study instead of global. No new roles are introduced.
10+
11+
---
12+
13+
## Core Idea
14+
15+
Today, a user's role is global (baked into the X.509 cert). Phase 2 makes the same role **per-study**: a user can be `lead` in one study and `member` in another.
16+
17+
The existing authorization rules (`authorization.json`) stay exactly as they are. The only new layer is a **study filter**: before evaluating existing RBAC, the server checks whether the resource belongs to the user's active study. If not, the resource is invisible.
18+
19+
---
20+
21+
## How Authorization Works Today
22+
23+
### `authorization.json`
24+
25+
Each deployment ships an `authorization.json` that maps roles to permissions. The default policy:
26+
27+
```json
{
  "format_version": "1.0",
  "permissions": {
    "project_admin": "any",
    "org_admin": {
      "submit_job": "none",
      "clone_job": "none",
      "manage_job": "o:submitter",
      "download_job": "o:submitter",
      "view": "any",
      "operate": "o:site",
      "shell_commands": "o:site",
      "byoc": "none"
    },
    "lead": {
      "submit_job": "any",
      "clone_job": "n:submitter",
      "manage_job": "n:submitter",
      "download_job": "n:submitter",
      "view": "any",
      "operate": "o:site",
      "shell_commands": "o:site",
      "byoc": "any"
    },
    "member": {
      "view": "any"
    }
  }
}
```
58+
59+
Permission conditions: `"any"` = unrestricted, `"none"` = denied, `"n:submitter"` = only if user is the submitter, `"o:submitter"` = only if user is in the same org as the submitter, `"o:site"` = only if user is in the same org as the target site.
60+
61+
### Authorization Flow
62+
63+
1. A `Person(name, org, role)` is constructed from the user's X.509 cert
64+
2. An `AuthzContext(right, user, submitter)` wraps the request
65+
3. The `Authorizer` evaluates the policy: look up `permissions[person.role][right]` and check the condition against the context
66+
67+
**Phase 2 changes only step 1**: the `role` used to construct `Person` can come from the per-study mapping instead of the cert. Steps 2 and 3 are untouched. `authorization.json` is unchanged.
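The three steps above can be sketched as a small policy evaluator. This is an illustration of the condition semantics listed earlier, not the actual `Authorizer` implementation. The function `is_permitted`, its parameters, the trimmed `PERMISSIONS` table, and the user `other@org_b.com` are all hypothetical. It also assumes that a right missing from a role's entry is treated as denied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    name: str
    org: str
    role: str


def is_permitted(
    permissions: dict,
    user: Person,
    right: str,
    submitter: Optional[Person] = None,
    site_org: Optional[str] = None,
) -> bool:
    """Evaluate one right for one user against a role -> permissions mapping."""
    role_policy = permissions.get(user.role)
    if role_policy is None:
        return False
    # A role may map directly to a condition, e.g. "project_admin": "any".
    if isinstance(role_policy, str):
        condition = role_policy
    else:
        condition = role_policy.get(right, "none")  # assumption: missing right = denied
    if condition == "any":
        return True
    if condition == "n:submitter":
        return submitter is not None and submitter.name == user.name
    if condition == "o:submitter":
        return submitter is not None and submitter.org == user.org
    if condition == "o:site":
        return site_org is not None and site_org == user.org
    return False  # "none" or an unrecognized condition


# Trimmed copy of the default policy, for illustration only.
PERMISSIONS = {
    "project_admin": "any",
    "lead": {"submit_job": "any", "manage_job": "n:submitter", "view": "any", "operate": "o:site"},
    "member": {"view": "any"},
}

trainer = Person("trainer@org_a.com", "org_a", "lead")
other = Person("other@org_b.com", "org_b", "lead")
```

In this sketch Phase 2 would only change the `role` passed into `Person`; the evaluation itself stays the same.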
68+
69+
---
70+
71+
## Role Resolution
72+
73+
No new roles. The existing four roles are reused:
74+
75+
| Role | Capabilities (per `authorization.json`, unchanged) |
76+
|------|-----|
77+
| `project_admin` | All operations (`"any"`) |
78+
| `org_admin` | Manage/download own-org jobs, view all, operate own-org sites |
79+
| `lead` | Submit/manage/download own jobs, view all, operate own-org sites |
80+
| `member` | View only |
81+
82+
**Resolution order:**
83+
1. If `project.yml` has a `studies:` section AND the user has a mapping for the active study → use that role
84+
2. Else if active study is `default` → fall back to cert-embedded role (legacy compatibility)
85+
3. Otherwise → deny
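The resolution order can be written as a short function. This is a sketch under stated assumptions: `resolve_role` is a hypothetical name, the `studies` argument is assumed to be the parsed `studies:` section as a plain dict, and returning `None` stands in for "deny".

```python
from typing import Optional

DEFAULT_STUDY = "default"


def resolve_role(studies: Optional[dict], active_study: str, user: str, cert_role: str) -> Optional[str]:
    """Return the effective role for `user` in `active_study`, or None to deny."""
    # 1. An explicit per-study mapping wins.
    if studies:
        admins = (studies.get(active_study) or {}).get("admins") or {}
        if user in admins:
            return admins[user]
    # 2. The default study falls back to the cert-embedded role.
    if active_study == DEFAULT_STUDY:
        return cert_role
    # 3. Everything else is denied.
    return None


# Illustrative mapping, shaped like a parsed `studies:` section.
STUDIES = {
    "cancer-research": {"sites": ["hospital-a", "hospital-b"], "admins": {"trainer@org_a.com": "lead"}},
    "multiple-sclerosis": {"sites": ["hospital-a"], "admins": {"trainer@org_a.com": "member"}},
}
```

For example, `resolve_role(STUDIES, "multiple-sclerosis", "trainer@org_a.com", "lead")` yields `member`, while the same user in the `default` study keeps the cert role `lead`.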
86+
87+
### What the participant `role` means
88+
89+
The `role` field on admin participants in `project.yml` serves two purposes:
90+
- It is baked into the X.509 cert at provisioning time (identity + authentication)
91+
- It is the **effective role for the `default` study** and for deployments without a `studies:` section
92+
93+
When a `studies:` section is present, the per-study role overrides the cert role for that study. The cert role still applies to the `default` study as a fallback.
94+
95+
Example: a user with `role: lead` in their participant entry and `member` in the `cancer-research` study mapping is `lead` when operating in the `default` study but `member` when operating in `cancer-research`.
96+
97+
---
98+
99+
## Provisioning: `project.yml`
100+
101+
Minimal addition: a `studies:` section that maps study names to enrolled sites and per-user role overrides. Everything else stays as-is.
102+
103+
```yaml
# Existing sections unchanged
participants:
  - name: server1.example.com
    type: server
    org: nvidia
  - name: hospital-a
    type: client
    org: org_a
  - name: hospital-b
    type: client
    org: org_b
  - name: admin@nvidia.com
    type: admin
    org: nvidia
    role: project_admin  # cert role; effective role for "default" study
  - name: trainer@org_a.com
    type: admin
    org: org_a
    role: lead  # cert role; effective role for "default" study

# New section: per-study role overrides
studies:
  cancer-research:
    sites: [hospital-a, hospital-b]
    admins:
      trainer@org_a.com: lead  # same as cert role here, but explicit

  multiple-sclerosis:
    sites: [hospital-a]
    admins:
      trainer@org_a.com: member  # overrides cert role for this study
```
136+
137+
- If `studies:` is absent, the system behaves exactly as today (single-tenant, cert roles only).
138+
- Sites listed under a study must reference existing client-type participants.
139+
- Admins listed under a study must reference existing admin-type participants.
140+
- A user not listed under a study has no access to that study (except `default`, which falls back to cert role).
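The reference rules in the list above could be checked at provisioning time along these lines. This is a sketch only: `validate_studies` is a hypothetical name, and the input is assumed to be `project.yml` already parsed into a plain dict with the layout shown in the YAML example.

```python
def validate_studies(project: dict) -> list:
    """Return a list of error strings; an empty list means the section is consistent."""
    participants = project.get("participants") or []
    clients = {p["name"] for p in participants if p.get("type") == "client"}
    admins = {p["name"] for p in participants if p.get("type") == "admin"}
    errors = []
    # An absent `studies:` section is valid and means single-tenant behavior.
    for study, spec in (project.get("studies") or {}).items():
        spec = spec or {}
        for site in spec.get("sites") or []:
            if site not in clients:
                errors.append(f"{study}: site '{site}' is not a client participant")
        for admin in spec.get("admins") or {}:
            if admin not in admins:
                errors.append(f"{study}: admin '{admin}' is not an admin participant")
    return errors


# Illustrative input mirroring part of the YAML example.
PROJECT = {
    "participants": [
        {"name": "hospital-a", "type": "client", "org": "org_a"},
        {"name": "trainer@org_a.com", "type": "admin", "org": "org_a", "role": "lead"},
    ],
    "studies": {
        "multiple-sclerosis": {"sites": ["hospital-a"], "admins": {"trainer@org_a.com": "member"}},
    },
}
```

Collecting all errors instead of failing on the first one lets a provisioning tool report every broken reference in a single run.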
141+
142+
---
143+
144+
## Authorization Enforcement
145+
146+
Two layers, evaluated in order for every command:
147+
148+
1. **Study filter** (new): Does the target resource (job, client) belong to the user's active study? If no → invisible.
149+
2. **RBAC policy** (existing, unchanged): Construct `Person` with the resolved per-study role, evaluate `authorization.json` as today.
150+
151+
The session's active study (set at session start via `--study` or the `study` API parameter) determines which study filter applies. The study is carried by the authenticated server session and preserved in the session token, so subsequent commands do not need to resend it.
152+
153+
---
154+
155+
## Job Scheduler
156+
157+
When a `studies:` section is present in `project.yml`:
158+
159+
1. **Site filtering**: Only schedule jobs to sites enrolled in the job's study
160+
2. **Validation**: `deploy_map` sites must be a subset of the study's enrolled sites
161+
162+
No quota or priority changes.
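Both scheduler rules reduce to set operations. The sketch below is illustrative only: `check_deploy_map` and `schedulable_sites` are hypothetical names, `deploy_map` is assumed to map app names to lists of site names, and the server entry and the `@ALL` wildcard are assumed to be exempt from the subset check.

```python
ALL_SITES = "@ALL"  # constants as they appear in nvflare/apis/job_def.py
SERVER_SITE_NAME = "server"


def check_deploy_map(deploy_map: dict, enrolled_sites: set) -> set:
    """Return the client sites in deploy_map that are not enrolled in the job's study."""
    requested = {site for sites in deploy_map.values() for site in sites}
    # Assumption: the server and the @ALL wildcard are not client sites, so skip them.
    requested -= {ALL_SITES, SERVER_SITE_NAME}
    return requested - enrolled_sites


def schedulable_sites(online_sites, enrolled_sites: set) -> list:
    """Site filtering: keep only online sites enrolled in the job's study."""
    return [s for s in online_sites if s in enrolled_sites]
```

A non-empty result from `check_deploy_map` would mean the job fails validation; under this assumption `@ALL` expands to the study's enrolled sites, not to every online site.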
163+
164+
---
165+
166+
## Runtime Isolation
167+
168+
### Kubernetes
169+
170+
The K8s launcher reads `study` from job metadata and resolves study-specific workspace volumes:
171+
- Workspace volume resolved by `(study, client)` tuple
172+
- Each job pod mounts only its study's data volume
173+
174+
### Docker
175+
176+
The Docker launcher reads `study` from job metadata and mounts the corresponding host directory (e.g., `/data/<study>/`) as the workspace volume.
177+
178+
---
179+
180+
## When to Use Multi-Study vs. Separate Deployments
181+
182+
Multi-study is an operational convenience feature for shared-trust environments. If you need stronger isolation, use separate deployments on separate hardware or VMs.
183+
184+
### Use multi-study when
185+
186+
- One or more client sites participate in multiple studies and re-provisioning them for each study is operationally costly
187+
- All studies operate within the same operational trust boundary
188+
- You accept software-enforced isolation: study separation relies on correct authorization logic, session management, and launcher volume configuration
189+
190+
### Use separate deployments when
191+
192+
- You need stronger isolation than a shared multi-study deployment can provide
193+
- A problem in one study must not be able to affect another study
194+
- Studies have non-overlapping participants (no shared client sites), so separate deployments add little operational cost
195+
196+
### Security Assumptions That Change
197+
198+
With a single multi-study deployment, the following are shared across all studies:
199+
200+
| Resource | Separate deployments | Multi-study |
201+
|----------|----------------------|-------------|
202+
| Server runtime | Separate per deployment | Shared |
203+
| Client runtime at each site | Separate per deployment | Shared |
204+
| PKI / cert root | Independent | Shared |
205+
| `project_admin` blast radius | One deployment | All studies in that deployment |
206+
| Isolation mechanism | Separate hardware/VM deployment boundary | Shared software authz and launcher configuration |
207+
208+
A bug in authorization, session handling, or launcher volume configuration can affect multiple studies in the same deployment.
209+
210+
### Summary
211+
212+
Use multi-study to reduce operational overhead when client sites participate in multiple studies under a shared trust boundary. Use separate deployments when you need stronger isolation.
213+
214+
---
215+
216+
## Migration / Backward Compatibility
217+
218+
1. **No `studies:` section** → system behaves as today, single-tenant, cert roles only.
219+
2. **`studies:` section present** → per-study role enforcement enabled; the `default` study falls back to cert roles for compatibility.
220+
3. **Legacy jobs** (no `study` field) → treated as `default` study (Phase 1 behavior, unchanged).
221+
4. **No data migration** — job store layout is unchanged (`jobs/<uuid>/`).
222+
223+
---
224+
225+
## Design Decisions
226+
227+
| # | Question | Decision |
228+
|---|----------|----------|
229+
| D1 | Can clients participate in multiple studies? | **Yes.** Listed under multiple studies in `project.yml`. Data isolation via launcher (K8s PVs / Docker mounts). |
230+
| D2 | New roles needed? | **No.** Existing `project_admin` / `org_admin` / `lead` / `member` reused per-study. |
231+
| D3 | Study lifecycle management? | **Deferred.** Studies defined at provisioning time in `project.yml`. |
232+
| D4 | Per-study quotas? | **Deferred.** Rely on K8s-level resource controls. |
233+
| D5 | How do launchers know which volume to mount? | Job metadata carries the study; the launcher resolves the volume from that. |
234+
| D6 | What does the participant `role` mean when `studies:` exists? | It is baked into the cert and serves as the effective role for the `default` study. Per-study mappings override it for non-default studies. |

docs/user_guide/admin_guide/deployment/operation.rst

Lines changed: 3 additions & 0 deletions
@@ -13,6 +13,9 @@ Admin command prompt
 After running ``fl_admin.sh``, log in by following the prompt and entering the name of the participant that the admin
 package was provisioned for (or for poc mode, "admin" as the name and password).
 
+To scope the terminal to a study, launch it with ``fl_admin.sh --study cancer-research``. If ``--study`` is omitted,
+the admin terminal uses the ``default`` study for that terminal session.
+
 Typing "help" or "?" will display a list of the commands and a brief description for each. Typing "? " before a command
 like "? check_status" or "?ls" will provide additional details for the usage of a command. Provided below is a list of
 commands shown as examples of how they may be run with a description.

nvflare/apis/job_def.py

Lines changed: 11 additions & 0 deletions
@@ -21,6 +21,7 @@
 # this is treated as all online sites in job deploy_map
 ALL_SITES = "@ALL"
 SERVER_SITE_NAME = "server"
+DEFAULT_STUDY = "default"
 
 
 class RunStatus(str, Enum):
@@ -76,6 +77,7 @@ class JobMetaKey(str, Enum):
     CUSTOM_PROPS = "custom_props"
     EDGE_METHOD = "edge_method"
     JOB_CLIENTS = "job_clients"  # clients that participated the job
+    STUDY = "study"
 
     def __repr__(self):
         return self.value
@@ -214,6 +216,15 @@ def is_valid_job_id(jid: str) -> bool:
     return val.hex == jid.replace("-", "")
 
 
+def get_job_meta_study(meta: dict) -> str:
+    if not isinstance(meta, dict):
+        return DEFAULT_STUDY
+    study = meta.get(JobMetaKey.STUDY.value)
+    if isinstance(study, str) and study:
+        return study
+    return DEFAULT_STUDY
+
+
 def get_custom_prop(meta: dict, prop_key: str, default=None):
     props = meta.get(JobMetaKey.CUSTOM_PROPS)
     if not props:
