// content/patterns/multicloud-federated-learning/_index.adoc
date: 2025-05-23
summary: This pattern helps you develop and deploy federated learning applications on an open hybrid cloud via Open Cluster Management.
rh_products:
- Red Hat Advanced Cluster Management
- Red Hat OpenShift Container Platform
industries:
- General
aliases: /multicloud-federated-learning/
As machine learning (ML) evolves, protecting data privacy becomes increasingly important.

Federated Learning (FL) addresses this by allowing multiple clusters or organizations to collaboratively train models without sharing sensitive data. Computation happens where the data lives, ensuring privacy, regulatory compliance, and efficiency.

By integrating FL with Advanced Cluster Management (ACM), this pattern provides an automated and scalable solution for deploying FL workloads across hybrid and multicluster environments.
==== Technologies

* Open Cluster Management (OCM)
* Grafana
* OpenTelemetry
=== Why Use Advanced Cluster Management for Federated Learning?

**Advanced Cluster Management (ACM)** simplifies and automates the deployment and orchestration of Federated Learning (FL) workloads across clusters:

- **Automatic Deployment & Simplified Operations**: ACM provides a unified and automated approach to running FL workflows across different runtimes (e.g., Flower, OpenFL). Its controller manages the entire FL lifecycle—including setup, coordination, status tracking, and teardown—across multiple clusters in a multicloud environment. This eliminates repetitive manual configurations, significantly reduces operational overhead, and ensures consistent, scalable FL deployments.

- **Dynamic Client Selection**: ACM's scheduling capabilities allow FL clients to be selected not only based on where the data resides, but also dynamically based on cluster labels, resource availability, and governance criteria. This enables a more adaptive and intelligent approach to client participation.
Together, these capabilities support a **flexible FL client model**, where clusters can join or exit the training process dynamically, without requiring static or manual configuration.
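As a rough sketch of what dynamic client selection can look like, an OCM `Placement` can filter candidate clusters by label. The resource name, namespace, cluster count, and label below are illustrative assumptions, not values taken from this pattern:

[source,yaml]
----
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: fl-clients                  # hypothetical name
  namespace: open-cluster-management
spec:
  numberOfClusters: 2               # cap on participating FL clients
  predicates:
  - requiredClusterSelector:
      labelSelector:
        matchLabels:
          environment: production   # illustrative label
----

Because the placement decision is re-evaluated as clusters gain or lose the label, clients can join or exit training without manual reconfiguration.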

This approach empowers organizations to build smarter, privacy-first AI solutions.
In this architecture, a central **Hub Cluster** acts as the aggregator, running the Federated Learning (FL) controller and scheduling workloads using ACM APIs like `Placement` and `ManifestWork`.

Multiple **Managed Clusters**, potentially across different clouds, serve as FL clients—each holding private data. These clusters pull the global model from the hub, train it locally, and push model updates back.

The controller manages this lifecycle using custom resources and supports runtimes like Flower and OpenFL. This setup enables scalable, multi-cloud model training with **data privacy preserved by design**, requiring no changes to existing FL training code.
* Ensure [kubectl](https://kubernetes.io/docs/reference/kubectl/) and [kustomize](https://kubectl.docs.kubernetes.io/installation/kustomize/) are installed.
* Ensure [kind](https://kind.sigs.k8s.io/) (v0.9.0 or later; the latest version is preferred) is installed.
* Ensure `make` is installed for build automation.
* Optional: Podman or Docker for container image building.
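As a convenience (not part of the pattern itself), the following snippet reports which of the required CLIs are already on `PATH`:

```shell
# Report which of the required CLIs are available; prints one line per tool.
for tool in kubectl kustomize kind make; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```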
In this example, both the server and clients use the same image—either the one built above or the pre-built `quay.io/myan/flower-app-torch:latest`. Once the resource is created, the server is deployed to the hub cluster, and the clients are prepared for deployment to the managed clusters.

Create a `FederatedLearning` resource in the controller namespace on the hub cluster:

[source,yaml]
----
spec:
  # ...
      operator: Exists
----
. Schedule the Federated Learning Clients into Managed Clusters

The above configuration schedules only clusters with a `ClusterClaim` having the key `federated-learning-sample.client-data`. You can combine this with other scheduling policies (refer to the Placement API for details).
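As an illustrative, non-authoritative sketch of combining policies, the claim requirement can sit alongside other predicates in the same `Placement`—for example, also requiring a cluster label (the label name and value below are assumptions):

[source,yaml]
----
predicates:
- requiredClusterSelector:
    claimSelector:
      matchExpressions:
      - key: federated-learning-sample.client-data
        operator: Exists
    labelSelector:
      matchLabels:
        region: us-east        # illustrative additional constraint
----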
Add the `ClusterClaim` to the clusters that own the data for the client:
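Each per-cluster claim follows the `ClusterClaim` shape sketched here; the `value` is an illustrative assumption, since the placement above only checks that the claim exists:

[source,yaml]
----
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: ClusterClaim
metadata:
  name: federated-learning-sample.client-data
spec:
  value: "true"    # illustrative; only the claim's existence is checked
----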

.. **Cluster 1:**

[source,bash]
----
EOF
----

.. **Cluster 2:**

[source,bash]
----
EOF
----

. Check the Federated Learning Instance Status

.. After creating the instance, the server initially shows a status of `Waiting`.

.. After the training and aggregation rounds complete, the status becomes `Completed`.

*Example - Federated Learning instance:*

[source,bash]
----
status:
----

.. Download and Verify the Trained Model

After training completes and the status is `Completed`, the trained MNIST model is saved in the `model-pvc` PersistentVolumeClaim. You can download and evaluate it by following these links:

- link:https://github.com/open-cluster-management-io/addon-contrib/blob/main/federated-learning-controller/examples/notebooks/deploy[Deploy a Jupyter notebook server]
- link:https://github.com/open-cluster-management-io/addon-contrib/blob/main/federated-learning-controller/examples/notebooks/1.hub-evaluation.ipynb[Validate the model]