
Add helm chart (client only) #4340

Open

IsaacYangSLA wants to merge 2 commits into NVIDIA:main from IsaacYangSLA:add_helm_chart_design_docs
Conversation

@IsaacYangSLA
Collaborator

The design doc, implementation in provision and tests.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Design doc, implementation in provision and tests
Copilot AI review requested due to automatic review settings March 20, 2026 20:28
@greptile-apps
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR introduces ClientHelmChartBuilder, a new NVFlare provisioning builder that generates a per-client Helm chart (Pod + ClusterIP Service + values.yaml) for every client in a project when nvflare provision is run. It also adds a design document, corresponding constants, and unit tests.

The builder fits into the existing lighter/provisioner pipeline: it generates chart files into the client's WIP directory alongside the normal startup-kit artifacts, and it updates each client's COMM_CONFIG_ARGS dict so that StaticFileBuilder.finalize() emits a comm_config.json whose internal.resources.host/port matches the Kubernetes Service name and containerPort, enabling job pods launched by K8sJobLauncher to communicate back to the parent client process.
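
As a purely hypothetical illustration of this summary (the exact key layout of `comm_config.json` is not shown on this page, so everything here — including the example host and port values — is an assumption, not the real NVFlare file format), the emitted file might take a shape like:

```json
{
  "internal": {
    "resources": {
      "host": "site-1",
      "port": 8003
    }
  }
}
```

Here `host` would be the Kubernetes Service name for the client and `port` its containerPort, so that job pods launched by `K8sJobLauncher` can dial back to the parent client process through the Service.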

Key findings:

  • P1 — Builder ordering dependency (silent correctness issue): Both ClientHelmChartBuilder.build() and StaticFileBuilder._build_comm_config_for_internal_listener() update the same keys (HOST, PORT, SCHEME, CONN_SEC) in the shared comm_config_args dict. Since provisioner.py calls all build() methods in registration order, whichever builder is listed last wins. If ClientHelmChartBuilder is placed before StaticFileBuilder in the project's builder list, StaticFileBuilder.build() silently overwrites the Kubernetes service name/port with the standard internal-listener values, producing a broken comm_config.json that prevents job pods from reaching the parent client. This constraint is neither enforced nor documented.
  • Several previous review concerns (unused server/fed_learn_port params, dead-code ValueError guard, friendlyName duplication) remain open from prior threads.
  • The design document is thorough but does not mention the required builder registration order.

Confidence Score: 4/5

Not safe to merge without addressing the builder-ordering issue that silently produces a broken comm_config.json.

One P1 finding remains: the shared comm_config_args dict can be silently overwritten by StaticFileBuilder.build() if ClientHelmChartBuilder is registered first, producing a comm_config.json that prevents job pods from connecting to the parent client. The fix is straightforward (move the update to finalize(), add a guard, or document the constraint), but currently nothing in the code or docs warns the user.

nvflare/lighter/impl/client_helm_chart.py — the comm_config_args update in _build_client_chart() needs to be ordering-safe.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nvflare/lighter/impl/client_helm_chart.py` | New builder that generates per-client Helm charts; contains a silent ordering-dependency bug where `StaticFileBuilder.build()` can overwrite the K8s-specific `comm_config_args` values if registered after `ClientHelmChartBuilder`. |
| `nvflare/lighter/constants.py` | Adds `HELM_CHART_CLIENT_DIR`, `CHART_YAML`, `VALUES_YAML`, `HELM_CHART_TEMPLATES_DIR` to `ProvFileName` and introduces the `CommConfigArg` class; the changes are straightforward constant additions. |
| `tests/unit_test/lighter/client_helm_chart_builder_test.py` | Good unit-test coverage for chart generation, image splitting, and `comm_config_args` seeding; does not test the builder-ordering scenario where `StaticFileBuilder.build()` could overwrite the K8s comm-config values. |
| `docs/design/helm_chart_client_design.md` | Comprehensive design document covering chart structure, provisioning pipeline, runtime flow, and security model; does not mention the required builder registration order for correct `comm_config.json` generation. |
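
The summary mentions that the unit tests cover "image splitting" of the Docker image reference. As a hedged illustration only — `split_image` below is a hypothetical stand-in, not the PR's actual `_split_image` helper — such a splitter typically has to distinguish a tag separator from a registry port:

```python
# Hypothetical sketch of an image-reference splitter; the name, signature,
# and behavior are assumptions, not the actual NVFlare implementation.
def split_image(image: str) -> tuple[str, str]:
    """Split 'repo[:tag]' into (repo, tag), defaulting the tag to 'latest'.

    The tag separator is the last ':' that appears AFTER the last '/', so a
    registry port such as 'registry.local:5000/nvflare' is not mistaken
    for a tag.
    """
    slash = image.rfind("/")
    colon = image.rfind(":")
    if colon > slash:
        return image[:colon], image[colon + 1:]
    return image, "latest"
```

For example, `split_image("nvcr.io/nvidia/nvflare:2.5.0")` splits off the `2.5.0` tag, while `split_image("registry.local:5000/nvflare")` keeps the port in the repository and falls back to `latest`.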

Sequence Diagram

```mermaid
sequenceDiagram
    participant P as Provisioner
    participant SFB as StaticFileBuilder
    participant CHC as ClientHelmChartBuilder

    Note over P: Phase 1 — initialize() (all builders, in order)
    P->>SFB: initialize(project, ctx)
    SFB-->>P: sets COMM_CONFIG_ARGS={} for every participant

    Note over P: Phase 2 — build() (all builders, in order)
    P->>SFB: build(project, ctx)
    SFB-->>P: _build_comm_config_for_internal_listener()<br/>sets HOST=internal_listener, PORT=internal_port
    P->>CHC: build(project, ctx) [must run AFTER SFB]
    CHC-->>P: comm_config_args.update(HOST=client.name, PORT=k8s_port)<br/>overwrites StaticFileBuilder values ✓

    Note over P: Phase 3 — finalize() (all builders, REVERSE order)
    P->>CHC: finalize(project, ctx)
    CHC-->>P: no-op
    P->>SFB: finalize(project, ctx)
    SFB-->>P: reads comm_config_args → writes comm_config.json<br/>with HOST=client.name, PORT=k8s_port ✓

    Note over P,CHC: ⚠ If CHC is listed BEFORE SFB in project.yml,<br/>SFB.build() overwrites K8s values → wrong comm_config.json
```

Reviews (2): Last reviewed commit: "Merge branch 'main' into add_helm_chart_..."

Comment on lines +240 to +295
```python
def _write_values_yaml(
    self,
    chart_dir: str,
    client: Participant,
    server: Participant,
    fed_learn_port: int,
):
    repo, tag = _split_image(self.docker_image)

    # The args list intentionally omits uid= — it is appended by the pod
    # template via {{ .Values.name }} to eliminate the duplication between
    # the top-level 'name' field and the --set argument list.
    #
    # hostAliases is intentionally omitted. The client connects outbound
    # to the server; DNS is responsible for server name resolution.
    args = [
        "-u",
        "-m",
        "nvflare.private.fed.app.client.client_train",
        "-m",
        self.workspace_mount_path,
        "-s",
        "fed_client.json",
        "--set",
        "secure_train=true",
        "config_folder=config",
        f"org={client.org}",
    ]

    values = {
        "name": client.name,
        "image": {
            "repository": repo,
            "tag": tag,
            "pullPolicy": "Always",
        },
        "persistence": {
            "etc": {
                "claimName": self.etc_pvc,
                "friendlyName": self.etc_pvc,
                "mountPath": self.etc_mount_path,
            },
            "workspace": {
                "claimName": self.workspace_pvc,
                "friendlyName": self.workspace_pvc,
                "mountPath": self.workspace_mount_path,
            },
        },
        "port": self.client_port,
        "command": ["/usr/local/bin/python3"],
        "args": args,
        "restartPolicy": "Never",
    }

    with open(os.path.join(chart_dir, ProvFileName.VALUES_YAML), "wt") as f:
        yaml.dump(values, f, default_flow_style=False)
```
Contributor


P2 Unused parameters server and fed_learn_port

_write_values_yaml accepts server: Participant and fed_learn_port: int in its signature, but neither is ever referenced inside the function body. The generated values.yaml contains no server address or fed-learn port — the client relies entirely on fed_client.json from the startup kit for server connection details. These two parameters appear to be left over from an earlier design and should either be removed (if server details are intentionally omitted from values.yaml) or actually used (if they are meant to be written into the chart).

Because _build_client_chart passes them through from build(), removing them also simplifies the call site and the _build_client_chart signature.

Comment on lines +181 to +189
```python
def build(self, project: Project, ctx: ProvisionContext):
    server = project.get_server()
    if not server:
        raise ValueError("project has no server; cannot build client Helm charts")

    fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002)

    for client in project.get_clients():
        self._build_client_chart(client, server, fed_learn_port, ctx)
```
Contributor


P2 Unreachable ValueError guard for missing server

The if not server: raise ValueError(...) check on line 184 is dead code. ProvisionContext.__init__ calls server.get_prop(...) unconditionally after project.get_server(), so if the project has no server, construction of ProvisionContext itself raises an AttributeError before build() is ever called. The test test_no_server_raises explicitly documents this by asserting against _make_ctx(root, p) (i.e. the context constructor), not against builder.build(). The guard should be removed or, if early/clearer error messaging is desired, replaced with a pre-condition check at the beginning of build() that is also covered by a test that actually exercises builder.build().

Comment on lines +276 to +286
```python
"persistence": {
    "etc": {
        "claimName": self.etc_pvc,
        "friendlyName": self.etc_pvc,
        "mountPath": self.etc_mount_path,
    },
    "workspace": {
        "claimName": self.workspace_pvc,
        "friendlyName": self.workspace_pvc,
        "mountPath": self.workspace_mount_path,
    },
```
Contributor


P2 friendlyName is always identical to claimName

Both claimName and friendlyName are always set to the same string (the PVC name passed as constructor arguments). The Kubernetes volume name (which friendlyName maps to) and the PVC claimName are independent fields — a user may legitimately want to keep their PVC named my-site-1-etc-pvc while using a shorter, DNS-label-safe name for the volume reference. Having them duplicated with no distinction can be confusing and also risks Kubernetes rejection if the PVC name happens to contain characters that are not valid in a volume name (e.g. upper-case letters or underscores), since Kubernetes volume names must match [a-z0-9][a-z0-9\-]*.

Consider either accepting separate friendlyName constructor arguments, or deriving the volume name from the PVC name with a sanitization step.
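
The second option the comment suggests — deriving the volume name from the PVC name with a sanitization step — could look roughly like this (an illustrative sketch, not the PR's code; the function name is hypothetical):

```python
import re


def volume_name_from_claim(claim_name: str) -> str:
    """Derive a DNS-label-safe Kubernetes volume name from a PVC claim name.

    Kubernetes volume names must be valid DNS labels: lower-case
    alphanumerics and '-', starting/ending with an alphanumeric,
    at most 63 characters.
    """
    name = claim_name.lower()
    name = re.sub(r"[^a-z0-9-]", "-", name)  # replace invalid characters
    name = re.sub(r"-+", "-", name)          # collapse runs of '-'
    name = name.strip("-")                   # no leading/trailing '-'
    return name[:63] or "vol"                # fall back if nothing survives
```

With this, a PVC named `My_Site-1_etc_PVC` would yield the volume name `my-site-1-etc-pvc`, keeping the `claimName` intact while making the `friendlyName` safe for Kubernetes.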

Comment on lines +184 to +192
```python
def test_no_server_raises(self):
    # ProvisionContext.__init__ itself raises when the project has no server.
    import pytest

    p = Project(name="no_server", description="test")
    p.add_participant(Participant(type="client", name="site-1", org="nvidia"))
    with pytest.raises(Exception):
        with tempfile.TemporaryDirectory() as root:
            _make_ctx(root, p)
```
Contributor


P2 Test does not exercise ClientHelmChartBuilder.build() for missing server

The test comment says "`ProvisionContext.__init__` itself raises when the project has no server", which is accurate — but this means the test never actually reaches ClientHelmChartBuilder.build(). The ValueError guard inside build() is therefore not tested. Either:

  1. Remove the guard from build() since ProvisionContext already protects it, and update this test comment accordingly, or
  2. Change the test to construct the context with a valid project first and then call builder.build(project_without_server, ctx) to verify the guard in build() works.

Contributor

Copilot AI left a comment


Pull request overview

Adds a provisioning-time Helm chart generator for client participants, plus unit tests and a design document describing the intended Kubernetes/Helm deployment flow.

Changes:

  • Introduces ClientHelmChartBuilder to generate a per-client Helm chart under each client’s provisioned workspace.
  • Adds unit tests covering image parsing, generated chart structure, and COMM_CONFIG_ARGS alignment behavior.
  • Adds a detailed design document for the client Helm chart workflow and rationale.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `nvflare/lighter/impl/client_helm_chart.py` | New builder that writes per-client `Chart.yaml`, `values.yaml`, and Helm templates; updates `COMM_CONFIG_ARGS` when present. |
| `nvflare/lighter/constants.py` | Adds the `ProvFileName.HELM_CHART_CLIENT_DIR` constant for the generated client chart output folder name. |
| `tests/unit_test/lighter/client_helm_chart_builder_test.py` | New unit tests validating generated chart files/contents and internal comm-config arg behavior. |
| `docs/design/helm_chart_client_design.md` | New design doc describing the Helm chart structure, provisioning integration, and deployment flow. |


```python
"""Generate one Helm chart per client during provisioning.

Each client in the project receives its own chart directory under
``<wip>/nvflare_hc_clients/<client-name>/``. The charts use the same
```

Copilot AI Mar 20, 2026


The docstring says each client chart is generated under <wip>/nvflare_hc_clients/<client-name>/, but the implementation writes to <wip>/<client-name>/nvflare_hc_clients/ (via ctx.get_ws_dir(client)). Please update the docstring to match the actual output path so users don’t look in the wrong location.

Suggested change

```diff
-``<wip>/nvflare_hc_clients/<client-name>/``. The charts use the same
+``<wip>/<client-name>/nvflare_hc_clients/``. The charts use the same
```

Comment on lines +186 to +189
```python
fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002)

for client in project.get_clients():
    self._build_client_chart(client, server, fed_learn_port, ctx)
```

Copilot AI Mar 20, 2026


fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002) is currently unused by the builder logic (it’s only passed through and then ignored). Please remove this dead read (and the related unused parameters) unless it’s required for an imminent follow-up feature.

Suggested change

```diff
-fed_learn_port = ctx.get(CtxKey.FED_LEARN_PORT, 8002)
-
-for client in project.get_clients():
-    self._build_client_chart(client, server, fed_learn_port, ctx)
+for client in project.get_clients():
+    self._build_client_chart(client, server, 8002, ctx)
```


## 1. Overview

NVIDIA FLARE (NVFlare) is a federated-learning platform in which a central **FL Server** coordinates training rounds and one or more **FL Clients** (sites) train on local data. In a production Kubernetes deployment each client runs as a Pod. The `helm/nvflare-client` Helm chart (located at `helm/nvflare-client/`) is the packaging artifact that operators use to deploy a single NVFlare client site to a Kubernetes cluster.

Copilot AI Mar 20, 2026


The doc refers to a hand-authored chart at helm/nvflare-client/, but there is no helm/ directory in the repository as of this PR. Please update the path reference (or add the referenced chart) so the documentation matches what users can actually find.

Suggested change

```diff
-NVIDIA FLARE (NVFlare) is a federated-learning platform in which a central **FL Server** coordinates training rounds and one or more **FL Clients** (sites) train on local data. In a production Kubernetes deployment each client runs as a Pod. The `helm/nvflare-client` Helm chart (located at `helm/nvflare-client/`) is the packaging artifact that operators use to deploy a single NVFlare client site to a Kubernetes cluster.
+NVIDIA FLARE (NVFlare) is a federated-learning platform in which a central **FL Server** coordinates training rounds and one or more **FL Clients** (sites) train on local data. In a production Kubernetes deployment each client runs as a Pod. The `nvflare-client` Helm chart is the packaging artifact that operators use to deploy a single NVFlare client site to a Kubernetes cluster.
```

Comment on lines +305 to +318
### 5.3 Deploy Script (`helm/deploy_client.sh`)

`helm/deploy_client.sh` automates the copy-then-deploy workflow in five steps:

1. Validates that the `nvflws` PVC exists in the target namespace.
2. Launches a temporary `busybox` helper pod that mounts the `nvflws` PVC at `/ws/`.
3. Copies `prod_NN/<client-name>/` into `/ws/<client-name>/` via `kubectl cp`.
4. Deletes the helper pod.
5. Runs `helm install` (or `helm upgrade` if the release already exists) from the chart at `prod_NN/<client-name>/nvflare_hc_clients/`.

**Usage:**

```bash
./helm/deploy_client.sh <client-name> <prod-dir> [namespace] [release-name]
```

Copilot AI Mar 20, 2026


This section documents a deploy script at helm/deploy_client.sh, but that file doesn’t exist in the repo/PR. Please either include the script in the pull request or rewrite these instructions to avoid pointing users to a non-existent path.

Suggested change (rewrite section 5.3 so it does not reference the non-existent script; the removed text is the section quoted above):

~~~markdown
### 5.3 Deploy Workflow

The client deploy workflow automates the copy-then-deploy process in five steps:

1. Validate that the `nvflws` PVC exists in the target namespace.
2. Launch a temporary `busybox` (or similar) helper pod that mounts the `nvflws` PVC at `/ws/`.
3. Copy `prod_NN/<client-name>/` into `/ws/<client-name>/` via `kubectl cp`.
4. Delete the helper pod.
5. Run `helm install` (or `helm upgrade` if the release already exists) from the chart at `prod_NN/<client-name>/nvflare_hc_clients/`.

**Example (manual commands):**

```bash
# 1) ensure PVC exists
kubectl get pvc nvflws -n <namespace>

# 2) start helper pod
kubectl run nvflws-helper \
  --image=busybox \
  --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"volumes":[{"name":"nvflws","persistentVolumeClaim":{"claimName":"nvflws"}}],"containers":[{"name":"helper","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"nvflws","mountPath":"/ws"}]}]}}' \
  -n <namespace>

# 3) copy startup kit into PVC
kubectl cp <prod-dir>/<client-name>/ <namespace>/nvflws-helper:/ws/<client-name>/

# 4) delete helper pod
kubectl delete pod nvflws-helper -n <namespace>

# 5) install or upgrade Helm release
helm upgrade --install <release-name> <prod-dir>/<client-name>/nvflare_hc_clients \
  --namespace <namespace> \
  --create-namespace
```
~~~

Comment on lines +242 to +246
```python
    chart_dir: str,
    client: Participant,
    server: Participant,
    fed_learn_port: int,
):
```

Copilot AI Mar 20, 2026


_write_values_yaml() takes server and fed_learn_port but doesn’t reference them. Consider removing these parameters (and corresponding caller arguments) to make it clear what inputs actually affect the generated values.yaml.

@@ -0,0 +1,709 @@
# NVFlare Client Helm Chart — Design Document
Collaborator


I think this file is too large to be checked in. I would split it in 2.

  • a helm_chart_client_design.md file with the pure design and tradeoffs made. (which is checked in)
  • a helm_chart_client_implementation.md that is not checked in and that is used by your agent to keep track of what it is doing.

```python
{{- end }}
"""

_CLIENT_POD_YAML = """\
```
Collaborator


Would it not be better to have explicit files in the folder rather than inline strings?

Collaborator


Yes, prefer to have files rather than inline key-value strings.

Collaborator

@chesterxgchen chesterxgchen left a comment


Added a few comments.

Comment on lines +210 to +218
```python
comm_config_args = client.get_prop(PropKey.COMM_CONFIG_ARGS)
if comm_config_args is not None:
    comm_config_args.update(
        {
            CommConfigArg.HOST: client.name,
            CommConfigArg.PORT: self.client_port,
            CommConfigArg.SCHEME: "tcp",
            CommConfigArg.CONN_SEC: ConnSecurity.CLEAR,
        }
```
Contributor


P1 comm_config_args update silently overwritten if builder order is wrong

ClientHelmChartBuilder.build() overwrites HOST and PORT in the shared comm_config_args dict with the Kubernetes service name and port. However, StaticFileBuilder._build_comm_config_for_internal_listener() writes the same keys (HOST, PORT, SCHEME, CONN_SEC) during its own build() phase.

The provisioner calls all build() methods in registration order (provisioner.py line 128), so whichever builder runs last wins. If ClientHelmChartBuilder is listed before StaticFileBuilder in the project's builders list, StaticFileBuilder.build() will silently overwrite the Kubernetes values (client.name, client_port) with the standard internal-listener host/port, and StaticFileBuilder.finalize() will emit a comm_config.json that points to the wrong endpoint — job pods launched by K8sJobLauncher will be unable to reach the parent client process with no error at provision time.

The ordering constraint (ClientHelmChartBuilder must be registered after StaticFileBuilder) is not enforced by the API, not validated at construction/initialization time, and not mentioned in any docstring or the design doc.

Recommended fixes (pick one):

  1. Move the comm_config_args.update() call from build() into finalize(). Since finalize() is called in reverse order (provisioner.py line 131), ClientHelmChartBuilder.finalize() would run before StaticFileBuilder.finalize() — meaning StaticFileBuilder.finalize() reads the final dict after ClientHelmChartBuilder.finalize() has already set HOST/PORT, regardless of registration order.
  2. Guard against wrong ordering in build() with an assertion:

     ```python
     assert comm_config_args.get(CommConfigArg.HOST) is not None, (
         "ClientHelmChartBuilder must be registered after StaticFileBuilder "
         "so that StaticFileBuilder.build() has already set COMM_CONFIG_ARGS"
     )
     ```

  3. Document the ordering constraint prominently in the class docstring and in the design doc.
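
The last-writer-wins hazard and the guard from option 2 are easy to reproduce with stand-alone stubs. The classes below are simplified stand-ins, not the real NVFlare builders; they only mimic the shared per-client dict and the provisioner's build-phase ordering:

```python
# Simplified stand-ins for the two builders (NOT the real NVFlare classes):
# they share one comm_config_args dict per client, just as the real builders
# share COMM_CONFIG_ARGS.
class StaticFileBuilderStub:
    def build(self, comm_config_args):
        # standard internal-listener values
        comm_config_args.update({"host": "internal_listener", "port": 1234})


class ClientHelmChartBuilderStub:
    def build(self, comm_config_args):
        # Fix option 2: fail fast at provision time if StaticFileBuilder has
        # not populated the dict yet (i.e. this builder was registered first).
        assert comm_config_args.get("host") is not None, (
            "ClientHelmChartBuilderStub must be registered after "
            "StaticFileBuilderStub"
        )
        # Kubernetes Service name / containerPort win because this runs last.
        comm_config_args.update({"host": "site-1", "port": 8003})


def run_build_phase(builders):
    """Mimic the provisioner's build phase: registration order, last writer wins."""
    args = {}
    for b in builders:
        b.build(args)
    return args
```

With the correct order, `run_build_phase([StaticFileBuilderStub(), ClientHelmChartBuilderStub()])` leaves the Kubernetes values in the dict; with the order reversed, the assertion raises at provision time instead of silently producing a broken comm_config.json.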


4 participants