Commit 2a0966e

add validate document
Signed-off-by: Meng Yan <[email protected]>
1 parent d9c2d35 commit 2a0966e

3 files changed

+39844
-65
lines changed

content/patterns/multicloud-federated-learning/_index.adoc

Lines changed: 10 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -4,6 +4,7 @@ date: 2025-05-23
44
summary: This pattern helps you develop and deploy federated learning applications on an open hybrid cloud via Open Cluster Management.
55
rh_products:
66
- Red Hat Advanced Cluster Management
7+
- Red Hat OpenShift Container Platform
78
industries:
89
- General
910
aliases: /multicloud-federated-learning/
@@ -27,7 +28,7 @@ As machine learning (ML) evolves, protecting data privacy becomes increasingly i
2728

2829
Federated Learning (FL) addresses this by allowing multiple clusters or organizations to collaboratively train models without sharing sensitive data. Computation happens where the data lives, ensuring privacy, regulatory compliance, and efficiency.
2930

30-
By integrating FL with Open Cluster Management (OCM), this pattern provides an automated and scalable solution for deploying FL workloads across hybrid and multicluster environments.
31+
By integrating FL with Advanced Cluster Management (ACM), this pattern provides an automated and scalable solution for deploying FL workloads across hybrid and multicluster environments.
3132

3233
==== Technologies
3334
* Open Cluster Management (OCM)
@@ -42,13 +43,13 @@ By integrating FL with Open Cluster Management (OCM), this pattern provides an a
4243
* Grafana
4344
* OpenTelemetry
4445

45-
=== Why Use OCM for Federated Learning?
46+
=== Why Use Advanced Cluster Management for Federated Learning?
4647

47-
**Open Cluster Management (OCM)** simplifies and automates the deployment and orchestration of Federated Learning (FL) workloads across clusters:
48+
**Advanced Cluster Management (ACM)** simplifies and automates the deployment and orchestration of Federated Learning (FL) workloads across clusters:
4849

49-
- **Automatic Deployment & Simplified Operations**: OCM provides a unified and automated approach to running FL workflows across different runtimes (e.g., Flower, OpenFL). Its controller manages the entire FL lifecycle—including setup, coordination, status tracking, and teardown—across multiple clusters in a multicloud environment. This eliminates repetitive manual configurations, significantly reduces operational overhead, and ensures consistent, scalable FL deployments.
50+
- **Automatic Deployment & Simplified Operations**: ACM provides a unified and automated approach to running FL workflows across different runtimes (e.g., Flower, OpenFL). Its controller manages the entire FL lifecycle—including setup, coordination, status tracking, and teardown—across multiple clusters in a multicloud environment. This eliminates repetitive manual configurations, significantly reduces operational overhead, and ensures consistent, scalable FL deployments.
5051

51-
- **Dynamic Client Selection**: OCM's scheduling capabilities allow FL clients to be selected not only based on where the data resides, but also dynamically based on cluster labels, resource availability, and governance criteria. This enables a more adaptive and intelligent approach to client participation.
52+
- **Dynamic Client Selection**: ACM's scheduling capabilities allow FL clients to be selected not only based on where the data resides, but also dynamically based on cluster labels, resource availability, and governance criteria. This enables a more adaptive and intelligent approach to client participation.
5253

5354
Together, these capabilities support a **flexible FL client model**, where clusters can join or exit the training process dynamically, without requiring static or manual configuration.
5455

@@ -68,11 +69,11 @@ This approach empowers organizations to build smarter, privacy-first AI solution
6869

6970
=== Architecture
7071

71-
image::/images/multicloud-federated-learning/multicluster-federated-learning-workflow.png[multicloud-federated-learning-workflow,title="Multicloud Federated Learning Workflow",width=100%]
72+
image::/images/multicloud-federated-learning/multicluster-federated-learning-workflow.png[multicloud-federated-learning-workflow]
7273

73-
In this architecture, a central **Hub Cluster** acts as the aggregator, running the Federated Learning (FL) controller and scheduling workloads using Open Cluster Management (OCM) APIs like `Placement` and `ManifestWork`.
74+
- In this architecture, a central **Hub Cluster** acts as the aggregator, running the Federated Learning (FL) controller and scheduling workloads using ACM APIs like `Placement` and `ManifestWork`.
7475

75-
Multiple **Managed Clusters**, potentially across different clouds, serve as FL clients—each holding private data. These clusters pull the global model from the hub, train it locally, and push model updates back.
76+
- Multiple **Managed Clusters**, potentially across different clouds, serve as FL clients—each holding private data. These clusters pull the global model from the hub, train it locally, and push model updates back.
7677

77-
The controller manages this lifecycle using custom resources and supports runtimes like Flower and OpenFL. This setup enables scalable, multi-cloud model training with **data privacy preserved by design**, requiring no changes to existing FL training code.
78+
- The controller manages this lifecycle using custom resources and supports runtimes like Flower and OpenFL. This setup enables scalable, multi-cloud model training with **data privacy preserved by design**, requiring no changes to existing FL training code.
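
The pull, train, push cycle described in the architecture notes above can be illustrated with a small, framework-free sketch. It is not the controller's or Flower's implementation: the function names (`local_train`, `aggregate`, `run_rounds`), the toy linear model, and the sample-weighted averaging rule are assumptions chosen only to show that raw data stays on each client while only model weights travel to the aggregator.

```python
from typing import Dict, List, Tuple

Weights = List[float]              # [w, b] for the toy model y = w * x + b
Sample = Tuple[float, float]       # (x, y) pair held privately by one client


def local_train(global_weights: Weights, private_data: List[Sample],
                lr: float = 0.1) -> Tuple[Weights, int]:
    """Client side: start from the global weights and run one SGD pass
    over local data. Only the updated weights and a sample count leave."""
    w, b = global_weights
    for x, y in private_data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b], len(private_data)


def aggregate(updates: List[Tuple[Weights, int]]) -> Weights:
    """Hub side: average client weights, weighted by sample count
    (a FedAvg-style rule, assumed here for illustration)."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(wts[i] * n for wts, n in updates) / total for i in range(dim)]


def run_rounds(clients: Dict[str, List[Sample]], rounds: int) -> Weights:
    """Repeat: clients pull the global model, train locally, push updates."""
    global_weights: Weights = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_train(global_weights, data) for data in clients.values()]
        global_weights = aggregate(updates)
    return global_weights


if __name__ == "__main__":
    # Hypothetical private datasets, both drawn from y = 2x.
    clients = {
        "cluster1": [(1.0, 2.0), (2.0, 4.0)],
        "cluster2": [(0.5, 1.0), (1.5, 3.0)],
    }
    print(run_rounds(clients, rounds=500))
```

The aggregator never reads `private_data`. It sees only the `(weights, sample_count)` pairs returned by each client, which is the privacy property the pattern relies on.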
7879

content/patterns/multicloud-federated-learning/getting-started.adoc

Lines changed: 62 additions & 56 deletions
Original file line numberDiff line numberDiff line change
@@ -13,107 +13,105 @@ include::modules/comm-attributes.adoc[]
1313

1414
=== Prerequisites
1515

16-
* Go: Version 1.19 or higher
17-
* Ensure [kubectl](https://kubernetes.io/docs/reference/kubectl/) and [kustomize](https://kubectl.docs.kubernetes.io/installation/kustomize/) are installed.
18-
* Ensure [kind](https://kind.sigs.k8s.io/)(greater than v0.9.0+, or the latest version is preferred) is installed.
19-
* Make: Ensure `make` is installed for build automation
20-
* Optional: Podman or Docker for container image building
16+
==== Ensure the following tools are installed:
2117

22-
=== Set Up the Environment
18+
- link:https://kubernetes.io/docs/reference/kubectl/[`kubectl`]
19+
- link:https://kubectl.docs.kubernetes.io/installation/kustomize/[`kustomize`]
20+
- link:https://kind.sigs.k8s.io/[`kind`] (recommended version > v0.9.0)
21+
- link:https://www.gnu.org/software/make/[`make`] (for build automation)
2322

24-
. Install the `clusteradm` CLI tool:
25-
+
26-
[source,bash]
27-
----
28-
$ curl -L https://raw.githubusercontent.com/open-cluster-management-io/clusteradm/main/install.sh | bash
29-
----
23+
==== Optional (for container image building):
24+
25+
- link:https://podman.io/[Podman] or link:https://www.docker.com/[Docker]
26+
- link:https://go.dev/doc/install[Go] (version 1.19 or higher)
27+
28+
===== Advanced Cluster Management Environment
29+
30+
Prepare at least three clusters: one hub cluster and two managed clusters.
31+
32+
Verify the managed clusters are registered on the hub by running:
3033

31-
. Create hub and managed clusters using `kind`:
32-
+
3334
[source,bash]
3435
----
35-
$ curl -L https://raw.githubusercontent.com/open-cluster-management-io/OCM/main/solutions/setup-dev-environment/local-up.sh | bash
36+
$ kubectl get mcl
3637
----
37-
+
38-
Verify the environment
39-
+
38+
39+
Example output:
4040
[source,bash]
4141
----
42-
$ kubectl get mcl
4342
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
44-
cluster1 true https://cluster1-control-plane:6443 True True 3d22h
45-
cluster2 true https://cluster2-control-plane:6443 True True 3d22h
43+
cluster1 true https://api.***.com:6443 True True 5m
44+
cluster2 true https://api.***.com:6443 True True 5m
4645
----
4746

48-
. Deploy Federated Learning Controller
47+
=== Deploy Federated Learning Controller
4948

50-
.. Clone and navigate to the repository:
49+
. Clone and navigate to the repository:
5150
+
5251
[source,bash]
5352
----
5453
$ git clone git@github.com:open-cluster-management-io/addon-contrib.git
5554
$ cd federated-learning-controller
5655
----
5756

58-
.. Build and push the controller image (or use pre-built `quay.io/myan/federated-learning-controller:latest`):
57+
. Build and push the controller image (or use pre-built `quay.io/myan/federated-learning-controller:latest`):
5958
+
6059
[source,bash]
6160
----
6261
$ make docker-build docker-push IMG=<IMG>
6362
----
6463

65-
.. Deploy the controller:
64+
. Deploy the controller to the hub cluster:
6665
+
6766
[source,bash]
6867
----
69-
# switch the context to hub cluster
7068
$ kubectl config use-context kind-hub
71-
7269
$ make deploy IMG=<controller-image> NAMESPACE=<controller-namespace(default is open-cluster-management)>
7370
----
7471

75-
.. Verify deployment:
72+
. Verify the deployment:
7673
+
7774
The federated learning controller is running in the open-cluster-management namespace by default.
7875
+
7976
[source,bash]
8077
----
8178
$ kubectl get pods -n open-cluster-management
79+
----
80+
+
81+
Example output
82+
+
83+
[source,bash]
84+
----
8285
NAME READY STATUS RESTARTS AGE
8386
cluster-manager-d9db64db5-c7kfj 1/1 Running 0 3d22h
8487
cluster-manager-d9db64db5-t7grh 1/1 Running 0 3d22h
8588
cluster-manager-d9db64db5-wndd8 1/1 Running 0 3d22h
8689
federated-learning-controller-d7df846c9-nb4wc 1/1 Running 0 3d22h
8790
----
8891

89-
=== Deploy the Application
92+
=== Deploy the Federated Learning Instance
9093

91-
. build the Federated Learning Application Image
94+
. Build the Application Image
9295
+
93-
*Note*: You can directly use the pre-built image `quay.io/myan/federated-learning-app:latest`.
94-
95-
.. Navigate to the Flower framework example:
96+
*Note*: You can skip this step by using the pre-built image `quay.io/myan/flower-app-torch:latest`.
9697
+
9798
[source,bash]
9899
----
99-
$ cd federated-learning-controller/examples/flower
100-
----
100+
$ cd examples/flower
101101
102-
.. *(Optional)* Modify the model code located in `flower/app-torch`, then build and push the image:
103-
+
104-
[source,bash]
105-
----
106102
$ export REGISTRY=<your-registry>
107-
$ export IMAGE_TAG=<your-image-tag>
103+
$ export IMAGE_TAG=<your-tag>
108104
$ make build-app-image
109105
$ make push-app-image
110106
----
111-
+
112-
The image will be named `<REGISTRY>/flower-app-torch:<IMAGE_TAG>`.
107+
+
108+
Image format: `<REGISTRY>/flower-app-torch:<IMAGE_TAG>`
113109

114-
. Deploy the Application to the Hub Cluster
110+
. Deploy a Federated Learning Instance
115111
+
116-
The current server and client use the same image. You can also use the pre-built image `quay.io/myan/flower-app-torch:latest`. After creating the resource, the server will deploy to the hub cluster, and the clients will deploy to managed clusters.
112+
In this example, both the server and clients use the same image—either the one built above or the pre-built `quay.io/myan/flower-app-torch:latest`. Once the resource is created, the server is deployed to the hub cluster, and the clients are prepared for deployment to the managed clusters.
113+
+
114+
Create a `FederatedLearning` resource in the controller namespace on the hub cluster:
117115
+
118116
[source,yaml]
119117
----
@@ -149,11 +147,13 @@ spec:
149147
operator: Exists
150148
----
151149

152-
. Schedule the Application on Managed Clusters
150+
. Schedule the Federated Learning Clients into Managed Clusters
153151
+
154152
The above configuration schedules only clusters with a `ClusterClaim` having the key `federated-learning-sample.client-data`. You can combine this with other scheduling policies (refer to the Placement API for details).
153+
+
154+
Add the `ClusterClaim` to the clusters that own the data for the client:
155155

156-
.. Managed Cluster 1 claims data:
156+
.. **Cluster1: **
157157
+
158158
[source,bash]
159159
----
@@ -169,7 +169,7 @@ spec:
169169
EOF
170170
----
171171

172-
.. Managed Cluster 2 claims data:
172+
.. **Cluster2: **
173173
+
174174
[source,bash]
175175
----
@@ -184,31 +184,33 @@ spec:
184184
EOF
185185
----
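
The scheduling rule in this step selects only clusters that carry a `ClusterClaim` with the key `federated-learning-sample.client-data`, which behaves like an `Exists` match on claim keys. The sketch below mimics that filter in plain Python. It does not call the Placement API, and the `ManagedCluster` class, the `select_clients` function, and the claim values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

CLAIM_KEY = "federated-learning-sample.client-data"


@dataclass
class ManagedCluster:
    """Hypothetical stand-in for a managed cluster and its cluster claims."""
    name: str
    claims: Dict[str, str] = field(default_factory=dict)


def select_clients(clusters: List[ManagedCluster], claim_key: str) -> List[str]:
    """Keep clusters whose claims contain the key, whatever its value
    (the semantics of an `Exists` operator)."""
    return [c.name for c in clusters if claim_key in c.claims]


if __name__ == "__main__":
    clusters = [
        ManagedCluster("cluster1", {CLAIM_KEY: "/data/private/cluster1"}),  # hypothetical value
        ManagedCluster("cluster2", {CLAIM_KEY: "/data/private/cluster2"}),  # hypothetical value
        ManagedCluster("cluster3"),  # no claim, so it is not selected
    ]
    print(select_clients(clusters, CLAIM_KEY))  # → ['cluster1', 'cluster2']
```

Because the match is on the key alone, a cluster joins the client set when the claim is added and leaves it when the claim is removed, with no change to the `FederatedLearning` resource itself.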
186186

187-
. Check the Application Status
187+
. Check the Federated Learning Instance Status
188188

189-
.. After creating the instance, the server initially shows a status of `Waiting`:
189+
.. After creating the instance, the server initially shows a status of `Waiting`
190190
+
191-
*Hub cluster server example:*
191+
*Example - server in hub cluster:*
192192
+
193193
[source,bash]
194194
----
195195
$ kubectl get pods
196196
NAME READY STATUS RESTARTS AGE
197-
federated-learning-sample-server-7jnfs 0/1 Completed 0 5d3h
197+
federated-learning-sample-server-7jnfs 0/1 Completed 0 9m
198198
----
199199

200-
.. Once the required clients are ready, status changes to `Running`:
200+
.. Once the required clients are ready, status changes to `Running`
201201
+
202-
*Managed cluster client example:*
202+
*Example - client in managed cluster:*
203203
+
204204
[source,bash]
205205
----
206206
$ kubectl get pods -n open-cluster-management
207207
NAME READY STATUS RESTARTS AGE
208-
federated-learning-sample-client-75sc8 0/1 Completed 0 5d3h
208+
federated-learning-sample-client-75sc8 0/1 Completed 0 8m
209209
----
210210

211-
.. After the training and aggregation rounds complete, the status becomes `Completed`:
211+
.. After the training and aggregation rounds complete, the status becomes `Completed`
212+
+
213+
*Example - Federated Learning instance:*
212214
+
213215
[source,bash]
214216
----
@@ -224,4 +226,8 @@ status:
224226

225227
.. Download and Verify the Trained Model
226228
+
227-
After training is complete and the status is Completed, the MNIST model is saved in the `model-pvc` PersistentVolumeClaim. You can download and evaluate the trained model by following this link:https://github.com/open-cluster-management-io/addon-contrib/blob/main/federated-learning-controller/examples/notebooks/1.hub-evaluation.ipynb[verification notebook].
229+
The trained MNIST model is saved in the `model-pvc` volume.
230+
+
231+
- link:https://github.com/open-cluster-management-io/addon-contrib/blob/main/federated-learning-controller/examples/notebooks/deploy[Deploy a Jupyter notebook server]
232+
- link:https://github.com/open-cluster-management-io/addon-contrib/blob/main/federated-learning-controller/examples/notebooks/1.hub-evaluation.ipynb[Validate the model]
233+
