Conversation
Design doc, implementation in provision and tests
Greptile Summary

This PR introduces `ClientHelmChartBuilder`. The builder fits into the existing lighter/provisioner pipeline: it generates chart files into the client's WIP directory alongside the normal startup-kit artifacts, and it updates each client's `COMM_CONFIG_ARGS`.

Key findings:

- `nvflare/lighter/impl/client_helm_chart.py`: the `comm_config_args` update in `_build_client_chart()` needs to be ordering-safe.

Confidence Score: 4/5. Not safe to merge without addressing the builder-ordering issue that silently produces a broken `comm_config.json`. One P1 finding remains: the shared `comm_config_args` dict can be silently overwritten by `StaticFileBuilder.build()` if `ClientHelmChartBuilder` is registered first, producing a `comm_config.json` that prevents job pods from connecting to the parent client. The fix is straightforward (move the update to `finalize()`, add a guard, or document the constraint), but currently nothing in the code or docs warns the user.
Sequence Diagram

```mermaid
sequenceDiagram
    participant P as Provisioner
    participant SFB as StaticFileBuilder
    participant CHC as ClientHelmChartBuilder
    Note over P: Phase 1 — initialize() (all builders, in order)
    P->>SFB: initialize(project, ctx)
    SFB-->>P: sets COMM_CONFIG_ARGS={} for every participant
    Note over P: Phase 2 — build() (all builders, in order)
    P->>SFB: build(project, ctx)
    SFB-->>P: _build_comm_config_for_internal_listener()<br/>sets HOST=internal_listener, PORT=internal_port
    P->>CHC: build(project, ctx) [must run AFTER SFB]
    CHC-->>P: comm_config_args.update(HOST=client.name, PORT=k8s_port)<br/>overwrites StaticFileBuilder values ✓
    Note over P: Phase 3 — finalize() (all builders, REVERSE order)
    P->>CHC: finalize(project, ctx)
    CHC-->>P: no-op
    P->>SFB: finalize(project, ctx)
    SFB-->>P: reads comm_config_args → writes comm_config.json<br/>with HOST=client.name, PORT=k8s_port ✓
    Note over P,CHC: ⚠ If CHC is listed BEFORE SFB in project.yml,<br/>SFB.build() overwrites K8s values → wrong comm_config.json
```
Reviews (2): Last reviewed commit: "Merge branch 'main' into add_helm_chart_..."
```python
def _write_values_yaml(
    self,
    chart_dir: str,
    client: Participant,
    server: Participant,
    fed_learn_port: int,
):
    repo, tag = _split_image(self.docker_image)

    # The args list intentionally omits uid= — it is appended by the pod
    # template via {{ .Values.name }} to eliminate the duplication between
    # the top-level 'name' field and the --set argument list.
    #
    # hostAliases is intentionally omitted. The client connects outbound
    # to the server; DNS is responsible for server name resolution.
    args = [
        "-u",
        "-m",
        "nvflare.private.fed.app.client.client_train",
        "-m",
        self.workspace_mount_path,
        "-s",
        "fed_client.json",
        "--set",
        "secure_train=true",
        "config_folder=config",
        f"org={client.org}",
    ]

    values = {
        "name": client.name,
        "image": {
            "repository": repo,
            "tag": tag,
            "pullPolicy": "Always",
        },
        "persistence": {
            "etc": {
                "claimName": self.etc_pvc,
                "friendlyName": self.etc_pvc,
                "mountPath": self.etc_mount_path,
            },
            "workspace": {
                "claimName": self.workspace_pvc,
                "friendlyName": self.workspace_pvc,
                "mountPath": self.workspace_mount_path,
            },
        },
        "port": self.client_port,
        "command": ["/usr/local/bin/python3"],
        "args": args,
        "restartPolicy": "Never",
    }

    with open(os.path.join(chart_dir, ProvFileName.VALUES_YAML), "wt") as f:
        yaml.dump(values, f, default_flow_style=False)
```
Unused parameters: `server` and `fed_learn_port`
_write_values_yaml accepts server: Participant and fed_learn_port: int in its signature, but neither is ever referenced inside the function body. The generated values.yaml contains no server address or fed-learn port — the client relies entirely on fed_client.json from the startup kit for server connection details. These two parameters appear to be left over from an earlier design and should either be removed (if server details are intentionally omitted from values.yaml) or actually used (if they are meant to be written into the chart).
Because _build_client_chart passes them through from build(), removing them also simplifies the call site and the _build_client_chart signature.
```python
def build(self, project: Project, ctx: ProvisionContext):
    server = project.get_server()
    if not server:
        raise ValueError("project has no server; cannot build client Helm charts")

    fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002)

    for client in project.get_clients():
        self._build_client_chart(client, server, fed_learn_port, ctx)
```
Unreachable `ValueError` guard for missing server
The if not server: raise ValueError(...) check on line 184 is dead code. ProvisionContext.__init__ calls server.get_prop(...) unconditionally after project.get_server(), so if the project has no server, construction of ProvisionContext itself raises an AttributeError before build() is ever called. The test test_no_server_raises explicitly documents this by asserting against _make_ctx(root, p) (i.e. the context constructor), not against builder.build(). The guard should be removed or, if early/clearer error messaging is desired, replaced with a pre-condition check at the beginning of build() that is also covered by a test that actually exercises builder.build().
```python
"persistence": {
    "etc": {
        "claimName": self.etc_pvc,
        "friendlyName": self.etc_pvc,
        "mountPath": self.etc_mount_path,
    },
    "workspace": {
        "claimName": self.workspace_pvc,
        "friendlyName": self.workspace_pvc,
        "mountPath": self.workspace_mount_path,
    },
```
`friendlyName` is always identical to `claimName`
Both claimName and friendlyName are always set to the same string (the PVC name passed as constructor arguments). The Kubernetes volume name (which friendlyName maps to) and the PVC claimName are independent fields — a user may legitimately want to keep their PVC named my-site-1-etc-pvc while using a shorter, DNS-label-safe name for the volume reference. Having them duplicated with no distinction can be confusing and also risks Kubernetes rejection if the PVC name happens to contain characters that are not valid in a volume name (e.g. upper-case letters or underscores), since Kubernetes volume names must match [a-z0-9][a-z0-9\-]*.
Consider either accepting separate friendlyName constructor arguments, or deriving the volume name from the PVC name with a sanitization step.
```python
def test_no_server_raises(self):
    # ProvisionContext.__init__ itself raises when the project has no server.
    import pytest

    p = Project(name="no_server", description="test")
    p.add_participant(Participant(type="client", name="site-1", org="nvidia"))
    with pytest.raises(Exception):
        with tempfile.TemporaryDirectory() as root:
            _make_ctx(root, p)
```
Test does not exercise `ClientHelmChartBuilder.build()` for missing server

The test comment says "ProvisionContext.__init__ itself raises when the project has no server", which is accurate — but this means the test never actually reaches `ClientHelmChartBuilder.build()`. The `ValueError` guard inside `build()` is therefore not tested. Either:

- Remove the guard from `build()` since `ProvisionContext` already protects it, and update this test comment accordingly, or
- Change the test to construct the context with a valid project first and then call `builder.build(project_without_server, ctx)` to verify the guard in `build()` works.
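A sketch of the second option, shown with stand-in classes so the shape of the test is clear (a real test would use the actual `Project`, `ClientHelmChartBuilder`, and `_make_ctx` from the test module; the names below are illustrative only):

```python
class ProjectWithoutServer:
    """Stand-in for a Project that has clients but no server participant."""
    def get_server(self):
        return None

class BuilderWithGuard:
    """Stand-in for ClientHelmChartBuilder carrying the guard under discussion."""
    def build(self, project, ctx):
        server = project.get_server()
        if not server:
            raise ValueError("project has no server; cannot build client Helm charts")

def test_build_raises_without_server():
    # In the real test, ctx would be built from a *valid* project; here it is
    # never touched because the guard fires first.
    ctx = object()
    try:
        BuilderWithGuard().build(ProjectWithoutServer(), ctx)
    except ValueError as e:
        assert "no server" in str(e)
    else:
        raise AssertionError("expected ValueError for missing server")

test_build_raises_without_server()
```

This variant actually drives `build()` past context construction, so the guard itself is what gets covered.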
Pull request overview
Adds a provisioning-time Helm chart generator for client participants, plus unit tests and a design document describing the intended Kubernetes/Helm deployment flow.
Changes:
- Introduces `ClientHelmChartBuilder` to generate a per-client Helm chart under each client's provisioned workspace.
- Adds unit tests covering image parsing, generated chart structure, and `COMM_CONFIG_ARGS` alignment behavior.
- Adds a detailed design document for the client Helm chart workflow and rationale.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `nvflare/lighter/impl/client_helm_chart.py` | New builder that writes per-client `Chart.yaml`, `values.yaml`, and Helm templates; updates `COMM_CONFIG_ARGS` when present. |
| `nvflare/lighter/constants.py` | Adds `ProvFileName.HELM_CHART_CLIENT_DIR` constant for the generated client chart output folder name. |
| `tests/unit_test/lighter/client_helm_chart_builder_test.py` | New unit tests validating generated chart files/contents and internal comm config arg behavior. |
| `docs/design/helm_chart_client_design.md` | New design doc describing the Helm chart structure, provisioning integration, and deployment flow. |
```python
"""Generate one Helm chart per client during provisioning.

Each client in the project receives its own chart directory under
``<wip>/nvflare_hc_clients/<client-name>/``. The charts use the same
```
The docstring says each client chart is generated under <wip>/nvflare_hc_clients/<client-name>/, but the implementation writes to <wip>/<client-name>/nvflare_hc_clients/ (via ctx.get_ws_dir(client)). Please update the docstring to match the actual output path so users don’t look in the wrong location.
Suggested change:

```diff
-``<wip>/nvflare_hc_clients/<client-name>/``. The charts use the same
+``<wip>/<client-name>/nvflare_hc_clients/``. The charts use the same
```
```python
fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002)

for client in project.get_clients():
    self._build_client_chart(client, server, fed_learn_port, ctx)
```
fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002) is currently unused by the builder logic (it’s only passed through and then ignored). Please remove this dead read (and the related unused parameters) unless it’s required for an imminent follow-up feature.
Suggested change:

```diff
-fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002)
-
-for client in project.get_clients():
-    self._build_client_chart(client, server, fed_learn_port, ctx)
+for client in project.get_clients():
+    self._build_client_chart(client, server, 8002, ctx)
```
```markdown
## 1. Overview

NVIDIA FLARE (NVFlare) is a federated-learning platform in which a central **FL Server** coordinates training rounds and one or more **FL Clients** (sites) train on local data. In a production Kubernetes deployment each client runs as a Pod. The `helm/nvflare-client` Helm chart (located at `helm/nvflare-client/`) is the packaging artifact that operators use to deploy a single NVFlare client site to a Kubernetes cluster.
```
The doc refers to a hand-authored chart at helm/nvflare-client/, but there is no helm/ directory in the repository as of this PR. Please update the path reference (or add the referenced chart) so the documentation matches what users can actually find.
Suggested change:

```diff
-NVIDIA FLARE (NVFlare) is a federated-learning platform in which a central **FL Server** coordinates training rounds and one or more **FL Clients** (sites) train on local data. In a production Kubernetes deployment each client runs as a Pod. The `helm/nvflare-client` Helm chart (located at `helm/nvflare-client/`) is the packaging artifact that operators use to deploy a single NVFlare client site to a Kubernetes cluster.
+NVIDIA FLARE (NVFlare) is a federated-learning platform in which a central **FL Server** coordinates training rounds and one or more **FL Clients** (sites) train on local data. In a production Kubernetes deployment each client runs as a Pod. The `nvflare-client` Helm chart is the packaging artifact that operators use to deploy a single NVFlare client site to a Kubernetes cluster.
```
````markdown
### 5.3 Deploy Script (`helm/deploy_client.sh`)

`helm/deploy_client.sh` automates the copy-then-deploy workflow in five steps:

1. Validates that the `nvflws` PVC exists in the target namespace.
2. Launches a temporary `busybox` helper pod that mounts the `nvflws` PVC at `/ws/`.
3. Copies `prod_NN/<client-name>/` into `/ws/<client-name>/` via `kubectl cp`.
4. Deletes the helper pod.
5. Runs `helm install` (or `helm upgrade` if the release already exists) from the chart at `prod_NN/<client-name>/nvflare_hc_clients/`.

**Usage:**

```bash
./helm/deploy_client.sh <client-name> <prod-dir> [namespace] [release-name]
```
````
This section documents a deploy script at helm/deploy_client.sh, but that file doesn’t exist in the repo/PR. Please either include the script in the pull request or rewrite these instructions to avoid pointing users to a non-existent path.
Suggested change:

````markdown
### 5.3 Deploy Workflow

The client deploy workflow automates the copy-then-deploy process in five steps:

1. Validate that the `nvflws` PVC exists in the target namespace.
2. Launch a temporary `busybox` (or similar) helper pod that mounts the `nvflws` PVC at `/ws/`.
3. Copy `prod_NN/<client-name>/` into `/ws/<client-name>/` via `kubectl cp`.
4. Delete the helper pod.
5. Run `helm install` (or `helm upgrade` if the release already exists) from the chart at `prod_NN/<client-name>/nvflare_hc_clients/`.

**Example (manual commands):**

```bash
# 1) ensure PVC exists
kubectl get pvc nvflws -n <namespace>

# 2) start helper pod
kubectl run nvflws-helper \
  --image=busybox \
  --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"volumes":[{"name":"nvflws","persistentVolumeClaim":{"claimName":"nvflws"}}],"containers":[{"name":"helper","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"nvflws","mountPath":"/ws"}]}]}}' \
  -n <namespace>

# 3) copy startup kit into PVC
kubectl cp <prod-dir>/<client-name>/ <namespace>/nvflws-helper:/ws/<client-name>/

# 4) delete helper pod
kubectl delete pod nvflws-helper -n <namespace>

# 5) install or upgrade Helm release
helm upgrade --install <release-name> <prod-dir>/<client-name>/nvflare_hc_clients \
  --namespace <namespace> \
  --create-namespace
```
````
```python
    chart_dir: str,
    client: Participant,
    server: Participant,
    fed_learn_port: int,
):
```
_write_values_yaml() takes server and fed_learn_port but doesn’t reference them. Consider removing these parameters (and corresponding caller arguments) to make it clear what inputs actually affect the generated values.yaml.
```diff
@@ -0,0 +1,709 @@
+# NVFlare Client Helm Chart — Design Document
```
I think this file is too large to be checked in. I would split it in two:

- a `helm_chart_client_design.md` file with the pure design and the tradeoffs made (which is checked in), and
- a `helm_chart_client_implementation.md` that is not checked in and that your agent uses to keep track of what it is doing.
```python
{{- end }}
"""

_CLIENT_POD_YAML = """\
```
Would it not be better to have explicit files in the folder rather than inline strings?

Yes, prefer to have files rather than inline key/value strings.
```python
comm_config_args = client.get_prop(PropKey.COMM_CONFIG_ARGS)
if comm_config_args is not None:
    comm_config_args.update(
        {
            CommConfigArg.HOST: client.name,
            CommConfigArg.PORT: self.client_port,
            CommConfigArg.SCHEME: "tcp",
            CommConfigArg.CONN_SEC: ConnSecurity.CLEAR,
        }
    )
```
`comm_config_args` update silently overwritten if builder order is wrong
ClientHelmChartBuilder.build() overwrites HOST and PORT in the shared comm_config_args dict with the Kubernetes service name and port. However, StaticFileBuilder._build_comm_config_for_internal_listener() writes the same keys (HOST, PORT, SCHEME, CONN_SEC) during its own build() phase.
The provisioner calls all build() methods in registration order (provisioner.py line 128), so whichever builder runs last wins. If ClientHelmChartBuilder is listed before StaticFileBuilder in the project's builders list, StaticFileBuilder.build() will silently overwrite the Kubernetes values (client.name, client_port) with the standard internal-listener host/port, and StaticFileBuilder.finalize() will emit a comm_config.json that points to the wrong endpoint — job pods launched by K8sJobLauncher will be unable to reach the parent client process with no error at provision time.
The ordering constraint (ClientHelmChartBuilder must be registered after StaticFileBuilder) is not enforced by the API, not validated at construction/initialization time, and not mentioned in any docstring or the design doc.
Recommended fixes (pick one):
- Move the `comm_config_args.update()` call from `build()` into `finalize()`. Since `finalize()` is called in reverse order (`provisioner.py` line 131), `ClientHelmChartBuilder.finalize()` would run before `StaticFileBuilder.finalize()` when `ClientHelmChartBuilder` is registered after `StaticFileBuilder` — meaning `StaticFileBuilder.finalize()` reads the final dict after `ClientHelmChartBuilder.finalize()` has already set `HOST`/`PORT`. (Note that with the opposite registration order, the reverse-order finalize pass runs `StaticFileBuilder.finalize()` first, so this alone does not remove the ordering constraint.)
- Guard against wrong ordering in `build()` with an assertion:

  ```python
  assert comm_config_args.get(CommConfigArg.HOST) is not None, (
      "ClientHelmChartBuilder must be registered after StaticFileBuilder "
      "so that StaticFileBuilder.build() has already set COMM_CONFIG_ARGS"
  )
  ```

- Document the ordering constraint prominently in the class docstring and in the design doc.
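The order sensitivity described in this finding can be reproduced with a toy model of the provisioner's forward-build / reverse-finalize loop (class names and host/port values below are illustrative stand-ins, not the real builders):

```python
class StaticFiles:
    """Stand-in for StaticFileBuilder: sets internal host/port in build(),
    then snapshots the dict in finalize() (standing in for writing
    comm_config.json)."""
    def build(self, args):
        args.update(host="internal_listener", port=1)
    def finalize(self, args):
        self.written = dict(args)

class HelmChart:
    """Stand-in for ClientHelmChartBuilder: sets K8s host/port in build()."""
    def build(self, args):
        args.update(host="site-1", port=30000)
    def finalize(self, args):
        pass  # no-op, as in the current PR

def provision(builders):
    args = {}
    for b in builders:            # build() in registration order
        b.build(args)
    for b in reversed(builders):  # finalize() in REVERSE order
        b.finalize(args)

# Correct registration order: StaticFiles first -> K8s values win.
sfb, chc = StaticFiles(), HelmChart()
provision([sfb, chc])
assert sfb.written == {"host": "site-1", "port": 30000}

# Wrong order: StaticFiles.build() runs last and silently overwrites,
# so the snapshot holds the internal-listener values -- no error raised.
sfb2, chc2 = StaticFiles(), HelmChart()
provision([chc2, sfb2])
assert sfb2.written == {"host": "internal_listener", "port": 1}
```

The second run is the silent-failure mode the finding warns about: provisioning succeeds, but the written config points at the wrong endpoint.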