This repository was archived by the owner on Feb 26, 2026. It is now read-only.
Propeller's FML system is built on a workload-agnostic design where the core orchestration components carry no FL-specific logic:
3. **Hybrid Communication Pattern**: Components communicate via a mix of HTTP and MQTT. HTTP is used for synchronous operations (task requests, update submission) while MQTT handles orchestration (round start, completion notifications).

4. **WASM-Based Training**: Training workloads execute as WebAssembly modules, providing portability, security isolation, and consistent execution across different device architectures.

The following diagram illustrates the architecture and message flow of Propeller's federated learning system:
### Manager

The Manager is Propeller's core orchestration component. In the context of federated learning, it creates and distributes training tasks and proxies FL requests to the Coordinator.
**Responsibilities**:

- **Experiment Configuration**: The Manager exposes `POST /fl/experiments` to configure FL experiments. It forwards the configuration to the Coordinator via HTTP and then publishes a round start message to MQTT to trigger task distribution.

- **Task Creation and Distribution**: When a federated learning round starts, the Manager receives the round start message and creates training tasks for each participating proplet (edge device). Each task is configured with the necessary environment variables (round ID, model URI, hyperparameters) and pinned to a specific proplet.

- **HTTP Endpoint Proxies**: The Manager provides HTTP endpoints that proxy requests to the Coordinator. These are primarily used by external clients (monitoring systems, test scripts, UI dashboards) rather than proplets:

  - `GET /fl/task?round_id={id}&proplet_id={id}` - Forward task requests to the Coordinator
  - `POST /fl/update` - Forward update submissions to the Coordinator (JSON)
  - `POST /fl/update_cbor` - Forward update submissions to the Coordinator (CBOR)
  - `GET /fl/rounds/{round_id}/complete` - Check round completion status

**Note**: Proplets communicate directly with the Coordinator's endpoints (not through Manager proxies) when possible, as this reduces latency and eliminates a network hop. Manager proxy endpoints exist to provide a unified API surface for external systems that need to interact with the FL workflow without needing direct Coordinator access.

- **Proplet Management**: The Manager maintains awareness of available proplets and their health status, ensuring tasks are only created for active, reachable devices.

**Key Design**: Propeller's Manager delegates all FL-specific logic to the external Coordinator. It provides HTTP endpoints as a convenience but does not understand federated learning semantics, model structures, or aggregation logic. This separation allows the Manager to orchestrate other types of distributed workloads beyond federated learning.
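To make the flow concrete, an experiment configuration submitted to `POST /fl/experiments` might look like the sketch below. The field names are hypothetical illustrations of the items the text mentions (round ID, model URI, hyperparameters, participants, k-of-n threshold, timeout) and should be checked against the actual API:

```json
{
  "round_id": "round-1",
  "model_uri": "http://model-registry:9090/models/example/latest",
  "hyperparameters": {
    "learning_rate": 0.01,
    "epochs": 5
  },
  "proplets": ["proplet-a", "proplet-b", "proplet-c"],
  "min_updates": 2,
  "round_timeout_seconds": 300
}
```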
### FML Coordinator
The Coordinator is the FL-specific service that manages the federated learning lifecycle, from round initialization through aggregation to model publication.

**Responsibilities**:

- **Experiment Configuration**: The Coordinator receives experiment configuration via HTTP `POST /experiments` from the Manager, initializes round state, and returns the configuration status.

- **Task Provisioning**: The Coordinator provides FL task details to proplets via HTTP `GET /task`, including the model reference, hyperparameters, and configuration.

- **Update Collection**: The Coordinator receives training updates from proplets via HTTP `POST /update` (JSON) or `POST /update_cbor` (CBOR). Proplets can submit updates directly to the Coordinator without going through the Manager.

- **Aggregation Orchestration**: When the minimum number of updates (k-of-n) is received, the Coordinator calls the Aggregator service to perform weighted federated averaging. Each update is weighted by the number of training samples used.

- **Model Versioning**: The Coordinator maintains a version counter for global models, incrementing it each time a new aggregated model is created. This enables tracking of model evolution over multiple training rounds.

- **Model Storage**: The Coordinator stores aggregated models in the Model Registry via HTTP POST.

- **Round Completion**: The Coordinator handles round completion, either when sufficient updates are received or when a timeout is reached. It publishes completion notifications via MQTT to the `fl/rounds/next` topic.

- **Timeout Handling**: The Coordinator monitors each round for timeouts. If a round doesn't receive sufficient updates within the specified timeout period, it aggregates whatever updates have been received (if any) and completes the round.

**State Management**: The Coordinator maintains round state in memory, with thread-safe access patterns to handle concurrent updates from multiple proplets. Each round has its own mutex to protect its update list while allowing parallel processing of different rounds.
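The per-round locking described above can be sketched as follows. This is a minimal illustration of the pattern (class and field names are invented here, not Propeller's actual implementation):

```python
import threading

class RoundState:
    """Per-round state guarded by its own lock, so concurrent updates to
    one round never block processing of a different round."""

    def __init__(self, round_id, min_updates):
        self.round_id = round_id
        self.min_updates = min_updates  # the "k" in k-of-n
        self.updates = []
        self.lock = threading.Lock()
        self.completed = False

    def add_update(self, proplet_id, weights, num_samples):
        """Record a proplet's update; return True only for the update
        that pushes the round over the aggregation threshold."""
        with self.lock:
            if self.completed:
                return False  # late updates after completion are ignored
            self.updates.append({"proplet_id": proplet_id,
                                 "weights": weights,
                                 "num_samples": num_samples})
            if len(self.updates) >= self.min_updates:
                self.completed = True  # trigger aggregation exactly once
                return True
            return False
```

Holding one lock per round (rather than one global lock) is what allows different rounds to make progress in parallel.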
### Aggregator Service
The Aggregator is a dedicated service that performs the mathematical operations of model aggregation.

- **Aggregated Model Generation**: Returns the aggregated model weights to the Coordinator for storage.
**Service Discovery**: The Coordinator discovers the Aggregator service via environment variable configuration. The `AGGREGATOR_URL` environment variable specifies the base URL of the Aggregator service (e.g., `http://aggregator:8080`). This simple configuration approach allows for:

- Single Aggregator deployment (most common): One Aggregator URL is configured
- Multiple aggregators (future): Multiple Aggregator URLs can be configured for load balancing or specialized aggregation algorithms

**Algorithm Registration**: The current implementation uses a single Federated Averaging (FedAvg) algorithm. To support pluggable aggregation algorithms, the Coordinator can specify the aggregation algorithm via the experiment configuration:

- `aggregation_params`: Algorithm-specific parameters (e.g., proximal term for FedProx, momentum for FedAvgM)

When the Coordinator receives this configuration, it passes the algorithm identifier and parameters to the Aggregator via the `/aggregate` request. The Aggregator implements a registry pattern where different algorithms are registered and selected based on the identifier. Future extensions can support:

- Dynamic algorithm registration via a configuration endpoint
- Algorithm hot-swapping without restarting the Coordinator or Aggregator
- A/B testing of different aggregation algorithms
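The registry pattern described above can be sketched in a few lines. This is an illustration of the dispatch idea only; the algorithm names, update schema, and function signatures are assumptions, not the Aggregator's actual API:

```python
# Registry mapping algorithm identifiers to aggregation functions.
AGGREGATORS = {}

def register(name):
    """Decorator that registers an aggregation function under a name."""
    def wrap(fn):
        AGGREGATORS[name] = fn
        return fn
    return wrap

@register("fedavg")
def fedavg(updates, params=None):
    """Sample-weighted average of flat weight vectors (FedAvg)."""
    total = sum(u["num_samples"] for u in updates)
    dim = len(updates[0]["weights"])
    return [sum(u["weights"][i] * u["num_samples"] for u in updates) / total
            for i in range(dim)]

def aggregate(algorithm, updates, params=None):
    """Select and run the registered algorithm for an /aggregate request."""
    try:
        fn = AGGREGATORS[algorithm]
    except KeyError:
        raise ValueError(f"unknown aggregation algorithm: {algorithm}")
    return fn(updates, params)
```

Adding FedProx or FedAvgM would then mean registering another function under a new identifier, without touching the dispatch code.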
**Design Separation**: Separating the aggregation logic into its own service allows for pluggable aggregation algorithms (FedAvg, FedProx, etc.) without modifying the Coordinator.
The Proxy service fetches WASM binaries from container registries and serves them to proplets.
- **Topic Subscription**: Listens to the `registry/proplet` topic for binary requests from proplets.

**Benefits**: The Proxy service enables proplets to fetch WASM binaries without requiring direct outbound HTTP access to external registries, which is important for security-constrained environments.
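Chunking a binary for MQTT transport (mentioned under Network Efficiency below) amounts to splitting it into indexed pieces and reassembling them on the proplet. A minimal sketch, with an illustrative chunk size and message shape that are assumptions rather than Propeller's wire format:

```python
import math

CHUNK_SIZE = 64 * 1024  # illustrative; a real deployment would make this configurable

def chunk_binary(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a WASM binary into indexed chunks suitable for publishing
    as individual MQTT messages."""
    total = math.ceil(len(data) / chunk_size) or 1
    return [{"index": i, "total": total,
             "payload": data[i * chunk_size:(i + 1) * chunk_size]}
            for i in range(total)]

def reassemble(chunks):
    """Rebuild the binary from chunks, tolerating out-of-order arrival."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    if len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    return b"".join(c["payload"] for c in ordered)
```

Carrying `index` and `total` in every chunk lets the receiver detect loss and reorder without any additional control channel.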
### SuperMQ MQTT Infrastructure
### Incremental Improvement
Each round incrementally improves the model by incorporating knowledge from participating proplets. The federated averaging algorithm ensures that:

- Updates from proplets with more training samples have greater influence on the aggregated model
- The model converges toward a solution that works well across all participating devices' data distributions
- No single proplet's data dominates the final model
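A minimal sketch of the sample-weighted averaging behind these properties, with plain Python lists standing in for real weight tensors:

```python
def federated_average(updates):
    """Weighted average of model weights, where each update counts in
    proportion to the number of samples it was trained on (FedAvg)."""
    total_samples = sum(u["num_samples"] for u in updates)
    dim = len(updates[0]["weights"])
    return [
        sum(u["weights"][i] * u["num_samples"] for u in updates) / total_samples
        for i in range(dim)
    ]

# A proplet trained on 300 samples pulls the average toward its weights
# three times as strongly as a proplet trained on 100 samples.
```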
### Version History
Propeller's FML architecture is designed to scale across several dimensions:
### Horizontal Scaling
- **Multiple Proplets**: Propeller naturally scales to support hundreds or thousands of proplets participating in a single round. The Manager can create tasks for all participants in parallel.

- **Multiple Coordinators**: While Propeller's current implementation uses a single Coordinator, the architecture supports multiple Coordinators with consistent hashing or round assignment to distribute load.

- **Distributed Model Registry**: The Model Registry can be replicated or sharded to handle high request volumes from many proplets fetching models simultaneously.
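One way the multi-Coordinator assignment could work is consistent hashing of round IDs onto Coordinators, so that adding or removing a Coordinator only remaps a fraction of rounds. This is a sketch of the idea under that assumption, not Propeller's implementation:

```python
import bisect
import hashlib

class CoordinatorRing:
    """Consistent-hash ring mapping round IDs to Coordinators."""

    def __init__(self, coordinators, vnodes=100):
        # Place several virtual nodes per Coordinator for even spread.
        self.ring = sorted(
            (self._hash(f"{c}:{i}"), c)
            for c in coordinators for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def coordinator_for(self, round_id: str) -> str:
        """Walk clockwise from the round's hash to the next virtual node."""
        idx = bisect.bisect(self.keys, self._hash(round_id)) % len(self.keys)
        return self.ring[idx][1]
```

Because assignment is a pure function of the round ID and the membership set, any component can compute which Coordinator owns a round without a lookup service.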
### Network Efficiency
- **Chunked WASM Transport**: The Proxy service automatically chunks large WASM binaries for efficient MQTT transport, allowing binaries to be distributed even over bandwidth-constrained networks.

- **HTTP for Large Data**: Model files and aggregated results are transferred via HTTP, which provides efficient streaming and chunked transfer encoding for large payloads.

- **Asynchronous Communication**: Propeller leverages MQTT's asynchronous nature to allow proplets to submit updates without blocking, and the Coordinator can process updates as they arrive.
### Fault Tolerance
- **Timeout Handling**: Propeller ensures rounds complete even if some proplets fail to submit updates, allowing progress despite device failures or network issues.

- **Update Thresholds**: The k-of-n parameter allows rounds to complete with a subset of participants, providing resilience to device failures.

- **Fallback Mechanisms**: Propeller's proplets can fall back from HTTP to MQTT if network conditions degrade, ensuring updates are delivered even in challenging network environments.
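The k-of-n completion rule combined with the round timeout can be expressed as a small decision function. Names and return values here are illustrative:

```python
def round_outcome(num_updates, k, elapsed_seconds, timeout_seconds):
    """Sketch of the completion rule: a round closes as soon as k
    updates arrive, or when the timeout expires; on timeout, any
    updates that did arrive are still aggregated."""
    if num_updates >= k:
        return "aggregate"
    if elapsed_seconds >= timeout_seconds:
        return "aggregate" if num_updates > 0 else "complete-empty"
    return "wait"
```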
### Error Handling
Propeller's FML system implements comprehensive error handling for various failure scenarios:

**HTTP Endpoint Failures**

- **Coordinator Unavailable**: When the Coordinator is unreachable, the Manager returns a 503 Service Unavailable error to external clients. The Manager does not retry the request; the client is responsible for implementing retry logic with exponential backoff.

- **Malformed Update Payloads**: When the Coordinator receives a malformed update (invalid JSON, missing required fields, or weight/bias dimensions that don't match the expected model shape), it returns a 400 Bad Request error with a detailed error message. The proplet should validate its update payload before submission and handle 400 responses by logging the error and potentially retraining.

- **Task Request Errors**: When a proplet requests a task with an invalid round ID or proplet ID, the Coordinator returns a 404 Not Found error. Proplets should validate that they are using the correct round and proplet identifiers before making requests.

- **Model Registry Unavailable**: When the Model Registry is unavailable, proplets receive a 503 error when attempting to fetch the model. Proplets should implement retry logic with exponential backoff (e.g., 1s, 2s, 4s, 8s) before giving up. The maximum number of retries should be configurable.
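The exponential-backoff retry loop recommended above can be sketched as follows; the helper name and defaults are illustrative, and the `sleep` parameter exists only so the schedule can be tested without waiting:

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a fetch with exponential backoff (1s, 2s, 4s, 8s by default).
    `fetch` is any callable that raises on failure, e.g. on a 503 from
    the Model Registry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # give up after the configured number of retries
            sleep(base_delay * (2 ** attempt))
```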
**Aggregator Service Failures**
- **Aggregator Unavailable**: When the Coordinator attempts to call the Aggregator but the service is unreachable, the Coordinator implements a retry policy with exponential backoff. After 3 failed retry attempts, the Coordinator stores the unaggregated updates for later recovery and completes the round with a warning status. Administrators can manually trigger aggregation once the Aggregator is back online.

- **Aggregation Algorithm Errors**: When the Aggregator encounters an error during aggregation (e.g., mismatched weight dimensions, numerical overflow), it returns a 500 Internal Server Error with details to the Coordinator. The Coordinator logs the error, marks the round as failed, and notifies operators via the monitoring system.

- **Timeout on Aggregation**: The Coordinator sets a timeout (configurable, default 30 seconds) for Aggregator requests. If the Aggregator does not respond within the timeout, the Coordinator treats this as a failure and proceeds with the retry policy.
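The retry-then-store policy for an unavailable Aggregator can be sketched as below. Backoff between attempts is elided for brevity, and the function names and status strings are assumptions for illustration:

```python
def aggregate_with_retries(call_aggregator, updates, store_for_recovery,
                           max_retries=3):
    """Retry the Aggregator up to max_retries times; on persistent
    failure, persist the raw updates for later recovery and finish the
    round with a warning status."""
    for _ in range(max_retries):
        try:
            return ("completed", call_aggregator(updates))
        except Exception:
            continue  # a real implementation would back off between attempts
    store_for_recovery(updates)  # lets an operator re-trigger aggregation later
    return ("completed-with-warning", None)
```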
**Proplet Failures**
- **Proplet Timeout**: When a proplet does not submit an update within the round timeout, the Coordinator excludes it from aggregation but continues the round with the remaining participants. The Coordinator logs which proplets timed out for monitoring purposes.

- **Proplet Crash During Training**: If a proplet crashes during WASM execution, it will not submit an update. The round proceeds without that proplet's contribution. The Manager's proplet health monitoring detects the crash and marks the proplet as unhealthy, preventing it from being assigned tasks in subsequent rounds.

- **WASM Execution Errors**: If the WASM module encounters an error during training (e.g., division by zero, out of memory), the proplet captures the error and reports it to the Coordinator via a special error update message. The Coordinator logs the error and excludes the proplet from the round.
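An error update message reporting a WASM failure might look like the fragment below; the field names are hypothetical and should be checked against the actual update schema:

```json
{
  "round_id": "round-1",
  "proplet_id": "proplet-a",
  "status": "error",
  "error": "wasm trap: integer divide by zero",
  "num_samples": 0
}
```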
**Network-Related Failures**
- **HTTP to MQTT Fallback**: Proplets prefer HTTP for update submission but fall back to MQTT if HTTP fails. The fallback is triggered after 3 consecutive HTTP failures with exponential backoff. Once MQTT is used, the proplet continues to use MQTT for the remainder of the round.

- **MQTT Connection Loss**: If a proplet loses its MQTT connection during a round, it attempts to reconnect. If reconnection fails after 5 attempts, the proplet marks itself as unhealthy and stops participating in the round.

- **Partial Update Transmission**: If an update transmission is interrupted (e.g., network cut during HTTP POST), the Coordinator treats this as a failed submission and waits for a retry from the proplet. Proplets implement client-side retries for failed submissions.