This repository was archived by the owner on Feb 26, 2026. It is now read-only.

Commit b601ad5

docs: remove AI-generated slop from fml.md
Remove verbose explanations, speculative content, and unnecessary explanatory text that appears to be AI-generated.
1 parent 46baba5 commit b601ad5

File tree

1 file changed (+37, -52 lines)

docs/fml.md

Lines changed: 37 additions & 52 deletions
@@ -46,7 +46,7 @@ Propeller's FML system is built on a workload-agnostic design where the core orc
 
 3. **Hybrid Communication Pattern**: Components communicate via a mix of HTTP and MQTT. HTTP is used for synchronous operations (task requests, update submission) while MQTT handles orchestration (round start, completion notifications).
 
-4. **WASM-Based Training**: Training workloads execute as WebAssembly modules, providing portability, security isolation, and consistent execution across different device architectures.
+4. **WASM-Based Training**: Training workloads execute as WebAssembly modules.
 
 The following diagram illustrates the architecture and message flow of Propeller's federated learning system:
 
@@ -123,45 +123,45 @@ The Manager is Propeller's core orchestration component. In the context of feder
 
 **Responsibilities**:
 
-- **Experiment Configuration**: The Manager exposes `POST /fl/experiments` to configure FL experiments. It forwards the configuration to the Coordinator via HTTP and then publishes a round start message to MQTT to trigger task distribution.
+- **Experiment Configuration**: The Manager exposes `POST /fl/experiments` to configure FL experiments.
 
-- **Task Creation and Distribution**: When a federated learning round starts, the Manager receives the round start message and creates training tasks for each participating proplet (edge device). Each task is configured with the necessary environment variables (round ID, model URI, hyperparameters) and pinned to a specific proplet.
+- **Task Creation and Distribution**: When a federated learning round starts, the Manager receives the round start message and creates training tasks for each participating proplet (edge device).
 
-- **HTTP Endpoint Proxies**: The Manager provides HTTP endpoints that proxy requests to the Coordinator. These are primarily used by external clients (monitoring systems, test scripts, UI dashboards) rather than proplets:
+- **HTTP Endpoint Proxies**: The Manager provides HTTP endpoints that proxy requests to the Coordinator:
   - `GET /fl/task?round_id={id}&proplet_id={id}` - Forward task requests to Coordinator
   - `POST /fl/update` - Forward update submissions to Coordinator (JSON)
   - `POST /fl/update_cbor` - Forward update submissions to Coordinator (CBOR)
   - `GET /fl/rounds/{round_id}/complete` - Check round completion status
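The proxy pattern in this hunk can be sketched as a small routing table. The first three Coordinator-side paths mirror endpoints documented elsewhere in this file (`/task`, `/update`, `/update_cbor`); the `forward` function and its 503 handling are an illustrative sketch, not Propeller's actual implementation:

```python
# Hypothetical sketch of the Manager's proxy routes; names are illustrative.
PROXY_ROUTES = {
    ("GET", "/fl/task"): ("GET", "/task"),
    ("POST", "/fl/update"): ("POST", "/update"),
    ("POST", "/fl/update_cbor"): ("POST", "/update_cbor"),
}

def forward(send, method, path, **kwargs):
    """Forward a Manager request to the Coordinator; return 503 to the
    external client when the Coordinator is unreachable (per the error
    handling section, the Manager does not retry)."""
    coord_method, coord_path = PROXY_ROUTES[(method, path)]
    try:
        return send(coord_method, coord_path, **kwargs)
    except ConnectionError:
        return 503, {"error": "Coordinator unavailable"}

# Usage with a stub transport standing in for the real HTTP client:
ok = lambda m, p, **kw: (200, {"path": p})
print(forward(ok, "GET", "/fl/task", round_id="r1"))  # (200, {'path': '/task'})
```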

-**Note**: Proplets communicate directly with the Coordinator's endpoints (not through Manager proxies) when possible, as this reduces latency and eliminates a network hop. Manager proxy endpoints exist to provide a unified API surface for external systems that need to interact with the FL workflow without needing direct Coordinator access.
+**Note**: Proplets communicate directly with the Coordinator's endpoints (not through Manager proxies).
 
-- **Proplet Management**: The Manager maintains awareness of available proplets and their health status, ensuring tasks are only created for active, reachable devices.
+- **Proplet Management**: The Manager maintains awareness of available proplets and their health status.
 
-**Key Design**: Propeller's Manager delegates all FL-specific logic to the external Coordinator. It provides HTTP endpoints as a convenience but does not understand federated learning semantics, model structures, or aggregation logic. This separation allows the Manager to orchestrate other types of distributed workloads beyond federated learning.
+**Key Design**: Propeller's Manager delegates all FL-specific logic to the external Coordinator.
 
 ### FML Coordinator
 
 The Coordinator is the FL-specific service that manages the federated learning lifecycle, from round initialization through aggregation to model publication.
 
 **Responsibilities**:
 
-- **Experiment Configuration**: The Coordinator receives experiment configuration via HTTP `POST /experiments` from the Manager, initializes round state, and returns the configuration status.
+- **Experiment Configuration**: The Coordinator receives experiment configuration via HTTP `POST /experiments` from the Manager.
 
-- **Task Provisioning**: The Coordinator provides FL task details to proplets via HTTP `GET /task`, including the model reference, hyperparameters, and configuration.
+- **Task Provisioning**: The Coordinator provides FL task details to proplets via HTTP `GET /task`.
 
-- **Update Collection**: The Coordinator receives training updates from proplets via HTTP `POST /update` (JSON) or `POST /update_cbor` (CBOR). Proplets can submit updates directly to the Coordinator without going through the Manager.
+- **Update Collection**: The Coordinator receives training updates from proplets via HTTP `POST /update` (JSON) or `POST /update_cbor` (CBOR).

-- **Aggregation Orchestration**: When the minimum number of updates (k-of-n) is received, the Coordinator calls the Aggregator service to perform weighted federated averaging. Each update is weighted by the number of training samples used.
+- **Aggregation Orchestration**: When the minimum number of updates (k-of-n) is received, the Coordinator calls the Aggregator service.
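The sample-count weighting described in this hunk can be sketched in a few lines. This is a minimal illustration of weighted federated averaging (FedAvg), with hypothetical field names, not the Aggregator's actual code:

```python
def fedavg(updates):
    """Weighted federated averaging: each update's weights are scaled by
    its sample count, so proplets with more training samples have
    proportionally more influence on the aggregate."""
    total = sum(u["num_samples"] for u in updates)
    agg = [0.0] * len(updates[0]["weights"])
    for u in updates:
        w = u["num_samples"] / total
        for i, v in enumerate(u["weights"]):
            agg[i] += w * v
    return agg

# Two proplets: the one with 300 samples contributes 3/4 of the average.
updates = [
    {"num_samples": 100, "weights": [0.0, 4.0]},
    {"num_samples": 300, "weights": [4.0, 0.0]},
]
print(fedavg(updates))  # [3.0, 1.0]
```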

-- **Model Versioning**: The Coordinator maintains a version counter for global models, incrementing it each time a new aggregated model is created. This enables tracking of model evolution over multiple training rounds.
+- **Model Versioning**: The Coordinator maintains a version counter for global models, incrementing it each time a new aggregated model is created.
 
 - **Model Storage**: The Coordinator stores aggregated models in the Model Registry via HTTP POST.
 
-- **Round Completion**: The Coordinator handles round completion, either when sufficient updates are received or when a timeout is reached. It publishes completion notifications via MQTT to `fl/rounds/next` topic.
+- **Round Completion**: The Coordinator handles round completion, either when sufficient updates are received or when a timeout is reached.
 
 - **Timeout Handling**: The Coordinator monitors each round for timeouts. If a round doesn't receive sufficient updates within the specified timeout period, it aggregates whatever updates have been received (if any) and completes the round.
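The completion rule above (enough updates, or timeout with whatever has arrived) can be sketched as simple round bookkeeping. Class and field names are hypothetical, and the injectable clock exists only to make the sketch testable:

```python
import time

class RoundState:
    """Illustrative k-of-n round bookkeeping: the round may complete once
    `k` updates arrive, or once `timeout_s` elapses (aggregating whatever
    was received)."""
    def __init__(self, k, timeout_s, clock=time.monotonic):
        self.k = k
        self.timeout_s = timeout_s
        self.clock = clock
        self.started = clock()
        self.updates = []

    def add_update(self, update):
        self.updates.append(update)

    def should_complete(self):
        return (len(self.updates) >= self.k
                or self.clock() - self.started >= self.timeout_s)

# With k=2, one update is not enough until the timeout expires.
t = [0.0]
r = RoundState(k=2, timeout_s=30, clock=lambda: t[0])
r.add_update({"proplet": "p1"})
print(r.should_complete())  # False: only 1 of 2 updates, no timeout yet
t[0] = 31.0
print(r.should_complete())  # True: timeout reached
```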

-**State Management**: The Coordinator maintains round state in memory, with thread-safe access patterns to handle concurrent updates from multiple proplets. Each round has its own mutex to protect its update list while allowing parallel processing of different rounds.
+**State Management**: The Coordinator maintains round state in memory.
 
 ### Aggregator Service
 
@@ -177,18 +177,7 @@ The Aggregator is a dedicated service that performs the mathematical operations
 
 - **Aggregated Model Generation**: Returns the aggregated model weights to the Coordinator for storage.
 
-**Service Discovery**: The Coordinator discovers the Aggregator service via environment variable configuration. The `AGGREGATOR_URL` environment variable specifies the base URL of the Aggregator service (e.g., `http://aggregator:8080`). This simple configuration approach allows for:
-- Single Aggregator deployment (most common): One Aggregator URL is configured
-- Multiple aggregators (future): Multiple Aggregator URLs can be configured for load balancing or specialized aggregation algorithms
-
-**Algorithm Registration**: The current implementation uses a single Federated Averaging (FedAvg) algorithm. To support pluggable aggregation algorithms, the Coordinator can specify the aggregation algorithm via the experiment configuration:
-- `aggregation_algorithm`: Algorithm identifier (e.g., "fedavg", "fedprox", "fedavgm")
-- `aggregation_params`: Algorithm-specific parameters (e.g., proximal term for FedProx, momentum for FedAvgM)
-
-When the Coordinator receives this configuration, it passes the algorithm identifier and parameters to the Aggregator via the `/aggregate` request. The Aggregator implements a registry pattern where different algorithms are registered and selected based on the identifier. Future extensions can support:
-- Dynamic algorithm registration via a configuration endpoint
-- Algorithm hot-swapping without restarting the Coordinator or Aggregator
-- A/B testing of different aggregation algorithms
+**Service Discovery**: The Coordinator discovers the Aggregator service via environment variable configuration. The `AGGREGATOR_URL` environment variable specifies the base URL of the Aggregator service (e.g., `http://aggregator:8080`).
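The environment-based discovery kept in this hunk amounts to a one-line lookup; the helper name and the default fallback below are assumptions for illustration, not documented behavior:

```python
import os

def aggregator_url(env=os.environ):
    """Resolve the Aggregator base URL from AGGREGATOR_URL, per the
    service-discovery description; the hard-coded fallback is an
    assumption, and the trailing slash is stripped for safe joining."""
    return env.get("AGGREGATOR_URL", "http://aggregator:8080").rstrip("/")

print(aggregator_url({"AGGREGATOR_URL": "http://aggregator:8080/"}))
# http://aggregator:8080
```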

 **Design Separation**: Separating the aggregation logic into its own service allows for pluggable aggregation algorithms (FedAvg, FedProx, etc.) without modifying the Coordinator.

@@ -291,7 +280,7 @@ The Proxy service fetches WASM binaries from container registries and serves the
 
 - **Topic Subscription**: Listens to `registry/proplet` topic for binary requests from proplets.
 
-**Benefits**: The Proxy service enables proplets to fetch WASM binaries without requiring direct outbound HTTP access to external registries, which is important for security-constrained environments.
+**Benefits**: The Proxy service enables proplets to fetch WASM binaries without requiring direct outbound HTTP access to external registries.
 
 ### SuperMQ MQTT Infrastructure

@@ -646,11 +635,7 @@ Each training round follows this pattern:
 
 ### Incremental Improvement
 
-Each round incrementally improves the model by incorporating knowledge from participating proplets. The federated averaging algorithm ensures that:
-
-- Updates from proplets with more training samples have greater influence on the aggregated model
-- The model converges toward a solution that works well across all participating devices' data distributions
-- No single proplet's data dominates the final model
+Each round incrementally improves the model by incorporating knowledge from participating proplets.
 
 ### Version History

@@ -668,65 +653,65 @@ Propeller's FML architecture is designed to scale across several dimensions:
 
 ### Horizontal Scaling
 
-- **Multiple Proplets**: Propeller naturally scales to support hundreds or thousands of proplets participating in a single round. The Manager can create tasks for all participants in parallel.
+- **Multiple Proplets**: Propeller naturally scales to support hundreds or thousands of proplets participating in a single round.
 
 - **Multiple Coordinators**: While Propeller's current implementation uses a single Coordinator, the architecture supports multiple Coordinators with consistent hashing or round assignment to distribute load.
 
 - **Distributed Model Registry**: The Model Registry can be replicated or sharded to handle high request volumes from many proplets fetching models simultaneously.
 
 ### Network Efficiency
 
-- **Chunked WASM Transport**: The Proxy service automatically chunks large WASM binaries for efficient MQTT transport, allowing binaries to be distributed even over bandwidth-constrained networks.
+- **Chunked WASM Transport**: The Proxy service chunks large WASM binaries for efficient MQTT transport.
 
-- **HTTP for Large Data**: Model files and aggregated results are transferred via HTTP, which provides efficient streaming and chunked transfer encoding for large payloads.
+- **HTTP for Large Data**: Model files and aggregated results are transferred via HTTP.
 
-- **Asynchronous Communication**: Propeller leverages MQTT's asynchronous nature to allow proplets to submit updates without blocking, and the Coordinator can process updates as they arrive.
+- **Asynchronous Communication**: Proplets can submit updates asynchronously, and the Coordinator processes updates as they arrive.

 ### Fault Tolerance
 
-- **Timeout Handling**: Propeller ensures rounds complete even if some proplets fail to submit updates, ensuring progress despite device failures or network issues.
+- **Timeout Handling**: Rounds complete even if some proplets fail to submit updates.
 
-- **Update Thresholds**: The k-of-n parameter allows rounds to complete with a subset of participants, providing resilience to device failures.
+- **Update Thresholds**: The k-of-n parameter allows rounds to complete with a subset of participants.
 
-- **Fallback Mechanisms**: Propeller's proplets can fall back from HTTP to MQTT if network conditions degrade, ensuring updates are delivered even in challenging network environments.
+- **Fallback Mechanisms**: Proplets can fall back from HTTP to MQTT if network conditions degrade.
 
 ### Error Handling
 
 Propeller's FML system implements comprehensive error handling for various failure scenarios:
 
 **HTTP Endpoint Failures**
 
-- **Coordinator Unavailable**: When the Coordinator is unreachable, the Manager returns a 503 Service Unavailable error to external clients. The Manager does not retry the request; the client is responsible for implementing retry logic with exponential backoff.
+- **Coordinator Unavailable**: When the Coordinator is unreachable, the Manager returns a 503 Service Unavailable error to external clients.
 
-- **Malformed Update Payloads**: When the Coordinator receives a malformed update (invalid JSON, missing required fields, or weight/bias dimensions that don't match the expected model shape), it returns a 400 Bad Request error with a detailed error message. The proplet should validate its update payload before submission and handle 400 responses by logging the error and potentially retraining.
+- **Malformed Update Payloads**: When the Coordinator receives a malformed update (invalid JSON, missing required fields, or weight/bias dimensions that don't match the expected model shape), it returns a 400 Bad Request error.
 
-- **Task Request Errors**: When a proplet requests a task with an invalid round ID or proplet ID, the Coordinator returns a 404 Not Found error. Proplets should validate that they are using the correct round and proplet identifiers before making requests.
+- **Task Request Errors**: When a proplet requests a task with an invalid round ID or proplet ID, the Coordinator returns a 404 Not Found error.
 
-- **Model Registry Unavailable**: When the Model Registry is unavailable, proplets receive a 503 error when attempting to fetch the model. Proplets should implement retry logic with exponential backoff (e.g., 1s, 2s, 4s, 8s) before giving up. The maximum number of retries should be configurable.
+- **Model Registry Unavailable**: When the Model Registry is unavailable, proplets receive a 503 error when attempting to fetch the model.
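The exponential-backoff schedule mentioned in the removed text (1s, 2s, 4s, 8s) can be sketched as a small retry helper. The function name and injectable `sleep` parameter are illustrative; the real proplet retry code is not shown in these docs:

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a fetch with exponential backoff (1s, 2s, 4s, 8s at the
    defaults) before giving up, as suggested for proplets when the
    Model Registry returns 503."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries; surface the error to the caller
            sleep(base_delay * (2 ** attempt))

# Simulated registry that fails twice, then succeeds; sleeps are recorded
# instead of actually waiting.
calls, delays = [0], []
def flaky():
    calls[0] += 1
    if calls[0] < 3:
        raise ConnectionError("503 from Model Registry")
    return b"model-bytes"

print(fetch_with_backoff(flaky, sleep=delays.append))  # b'model-bytes'
print(delays)  # [1.0, 2.0]
```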

 **Aggregator Service Failures**
 
-- **Aggregator Unavailable**: When the Coordinator attempts to call the Aggregator but the service is unreachable, the Coordinator implements a retry policy with exponential backoff. After 3 failed retry attempts, the Coordinator stores the unaggregated updates for later recovery and completes the round with a warning status. Administrators can manually trigger aggregation once the Aggregator is back online.
+- **Aggregator Unavailable**: When the Coordinator attempts to call the Aggregator but the service is unreachable, the Coordinator implements a retry policy with exponential backoff. After 3 failed retry attempts, the Coordinator stores the unaggregated updates for later recovery and completes the round with a warning status.
 
-- **Aggregation Algorithm Errors**: When the Aggregator encounters an error during aggregation (e.g., mismatched weight dimensions, numerical overflow), it returns a 500 Internal Server Error with details to the Coordinator. The Coordinator logs the error, marks the round as failed, and notifies operators via the monitoring system.
+- **Aggregation Algorithm Errors**: When the Aggregator encounters an error during aggregation (e.g., mismatched weight dimensions, numerical overflow), it returns a 500 Internal Server Error. The Coordinator logs the error and marks the round as failed.
 
-- **Timeout on Aggregation**: The Coordinator sets a timeout (configurable, default 30 seconds) for Aggregator requests. If the Aggregator does not respond within the timeout, the Coordinator treats this as a failure and proceeds with the retry policy.
+- **Timeout on Aggregation**: The Coordinator sets a timeout (configurable, default 30 seconds) for Aggregator requests.

 **Proplet Failures**
 
-- **Proplet Timeout**: When a proplet does not submit an update within the round timeout, the Coordinator excludes it from aggregation but continues the round with the remaining participants. The Coordinator logs which proplets timed out for monitoring purposes.
+- **Proplet Timeout**: When a proplet does not submit an update within the round timeout, the Coordinator excludes it from aggregation but continues the round with the remaining participants.
 
-- **Proplet Crash During Training**: If a proplet crashes during WASM execution, it will not submit an update. The round proceeds without that proplet's contribution. The Manager's proplet health monitoring detects the crash and marks the proplet as unhealthy, preventing it from being assigned tasks in subsequent rounds.
+- **Proplet Crash During Training**: If a proplet crashes during WASM execution, it will not submit an update. The round proceeds without that proplet's contribution.
 
-- **WASM Execution Errors**: If the WASM module encounters an error during training (e.g., division by zero, out of memory), the proplet captures the error and reports it to the Coordinator via a special error update message. The Coordinator logs the error and excludes the proplet from the round.
+- **WASM Execution Errors**: If the WASM module encounters an error during training (e.g., division by zero, out of memory), the proplet reports it to the Coordinator via a special error update message. The Coordinator logs the error and excludes the proplet from the round.

 **Network-Related Failures**
 
-- **HTTP to MQTT Fallback**: Proplets prefer HTTP for update submission but fall back to MQTT if HTTP fails. The fallback is triggered after 3 consecutive HTTP failures with exponential backoff. Once MQTT is used, the proplet continues to use MQTT for the remainder of the round.
+- **HTTP to MQTT Fallback**: Proplets prefer HTTP for update submission but fall back to MQTT if HTTP fails. The fallback is triggered after 3 consecutive HTTP failures.
 
-- **MQTT Connection Loss**: If a proplet loses its MQTT connection during a round, it attempts to reconnect. If reconnection fails after 5 attempts, the proplet marks itself as unhealthy and stops participating in the round.
+- **MQTT Connection Loss**: If a proplet loses its MQTT connection during a round, it attempts to reconnect. If reconnection fails after 5 attempts, the proplet marks itself as unhealthy.
 
-- **Partial Update Transmission**: If an update transmission is interrupted (e.g., network cut during HTTP POST), the Coordinator treats this as a failed submission and waits for a retry from the proplet. Proplets implement client-side retries for failed submissions.
+- **Partial Update Transmission**: If an update transmission is interrupted (e.g., network cut during HTTP POST), the Coordinator treats this as a failed submission and waits for a retry from the proplet.
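The fallback rule in this hunk (prefer HTTP, switch to MQTT after 3 consecutive failures) can be sketched as a small transport selector. Class and callback names are hypothetical stand-ins for the real transports:

```python
class UpdateSender:
    """Illustrative transport selection for update submission: prefer
    HTTP; after 3 consecutive HTTP failures, switch to MQTT."""
    def __init__(self, http_send, mqtt_send, max_http_failures=3):
        self.http_send = http_send
        self.mqtt_send = mqtt_send
        self.max_http_failures = max_http_failures
        self.failures = 0
        self.use_mqtt = False

    def send(self, update):
        if not self.use_mqtt:
            try:
                result = self.http_send(update)
                self.failures = 0  # a success resets the failure streak
                return result
            except ConnectionError:
                self.failures += 1
                if self.failures < self.max_http_failures:
                    raise  # caller may retry over HTTP
                self.use_mqtt = True  # 3rd consecutive failure: fall back
        return self.mqtt_send(update)

# HTTP that always fails; MQTT that succeeds.
def http_down(update):
    raise ConnectionError

sender = UpdateSender(http_down, lambda update: "mqtt-ok")
for _ in range(2):
    try:
        sender.send({"weights": [0.1]})
    except ConnectionError:
        pass  # first two failures are surfaced to the caller
print(sender.send({"weights": [0.1]}))  # mqtt-ok (fallback on 3rd failure)
```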

 ## Security and Privacy
