Introduce RequestsScheduler to track (and deduplicate) all requests and validators' scores (#4856)
Backport of #4752
## Motivation
The Linera client needs to interact with multiple validator nodes efficiently. Previously, the client would make individual requests to validators without:

1. Performance tracking: no mechanism to prefer faster, more reliable validators
2. Request deduplication: concurrent requests for the same data would all hit the network, wasting bandwidth and validator resources
3. Response caching: repeated requests for the same data would always go to validators
4. Load balancing: no rate limiting per validator, risking overload
5. Resilience: no fallback mechanism when a validator is slow or unresponsive
This led to:
- Unnecessary network traffic and validator load
- Poor user experience with redundant waiting
- No optimization based on validator performance
- Risk of overwhelming validators with too many concurrent requests
- No recovery mechanism when validators are slow
## Proposal
This PR introduces `RequestsScheduler`, a request orchestration layer that provides intelligent peer selection, request deduplication, caching, and performance-based routing.
### Key Features
1. Performance Tracking with Exponential Moving Averages (EMA)
- Tracks latency, success rate, and current load for each validator
- Uses configurable weights to compute a composite performance score
- Intelligently selects the best available validator for each request
- Weighted random selection from top performers to avoid hotspots
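The EMA update and composite score described above can be sketched as follows. This is an illustrative sketch, not the PR's actual `node_info.rs`/`scoring.rs` code: the field names, the normalization against the maximum expected latency, and the exact score formula are assumptions.

```rust
/// Weights for the composite score (assumed shape; mirrors `scoring.rs`).
#[derive(Clone, Copy)]
struct ScoringWeights {
    latency: f64,
    success: f64,
    load: f64,
}

/// Per-validator state (illustrative; mirrors the role of `node_info.rs`).
struct NodeInfo {
    ema_latency_ms: f64, // smoothed observed latency
    ema_success: f64,    // smoothed success rate in [0, 1]
    in_flight: usize,    // current number of outstanding requests
}

impl NodeInfo {
    /// Fold a new observation into the EMAs:
    /// new = alpha * observation + (1 - alpha) * old.
    fn record(&mut self, latency_ms: f64, success: bool, alpha: f64) {
        self.ema_latency_ms = alpha * latency_ms + (1.0 - alpha) * self.ema_latency_ms;
        let s = if success { 1.0 } else { 0.0 };
        self.ema_success = alpha * s + (1.0 - alpha) * self.ema_success;
    }

    /// Composite score in [0, 1]; higher is better. Latency and load are
    /// normalized so that hitting the configured maximum scores zero.
    fn score(&self, w: ScoringWeights, max_latency_ms: f64, max_load: usize) -> f64 {
        let latency_term = 1.0 - (self.ema_latency_ms / max_latency_ms).min(1.0);
        let load_term = 1.0 - (self.in_flight as f64 / max_load as f64).min(1.0);
        w.latency * latency_term + w.success * self.ema_success + w.load * load_term
    }
}

fn main() {
    let mut node = NodeInfo { ema_latency_ms: 100.0, ema_success: 1.0, in_flight: 10 };
    node.record(500.0, true, 0.1); // one slow but successful response
    // With alpha = 0.1 the EMA moves 10% toward the new observation:
    // 0.1 * 500 + 0.9 * 100 = 140.
    assert!((node.ema_latency_ms - 140.0).abs() < 1e-9);
    let w = ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 };
    println!("score = {:.3}", node.score(w, 5000.0, 100));
}
```

With these weights a single slow response only nudges the score down, which is the point of the smoothing: one outlier does not dethrone an otherwise healthy validator.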
2. Request Deduplication
- Exact matching: multiple concurrent requests for identical data are deduplicated
- Subsumption-based matching: smaller requests are satisfied by larger in-flight requests that contain the needed data (e.g., a request for blocks 10-12 can be satisfied by an in-flight request for blocks 10-20)
- Broadcast mechanism ensures all waiting requesters receive the result when the request completes
- Timeout handling: stale in-flight requests (>200ms) are not deduplicated against, allowing fresh attempts
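The subsumption rule can be illustrated with a hypothetical `RequestKey`. The real request types live in `request.rs`; the variants and fields here are assumptions chosen to match the blocks 10-12 vs. 10-20 example above.

```rust
/// Hypothetical request key; the real type lives in `request.rs`.
#[derive(Debug, Clone, PartialEq, Eq)]
enum RequestKey {
    /// Certificates for block heights in [start, start + limit).
    Certificates { chain: u64, start: u64, limit: u64 },
    Blob { id: u64 },
}

impl RequestKey {
    /// True if a result for `self` also answers `other`, i.e. `other`'s
    /// range is fully contained in `self`'s range on the same chain.
    fn subsumes(&self, other: &RequestKey) -> bool {
        match (self, other) {
            (
                RequestKey::Certificates { chain: c1, start: s1, limit: l1 },
                RequestKey::Certificates { chain: c2, start: s2, limit: l2 },
            ) => c1 == c2 && s1 <= s2 && s2 + l2 <= s1 + l1,
            // Everything else requires an exact match.
            _ => self == other,
        }
    }
}

fn main() {
    // Blocks 10..=20 in flight; blocks 10..=12 requested.
    let big = RequestKey::Certificates { chain: 7, start: 10, limit: 11 };
    let small = RequestKey::Certificates { chain: 7, start: 10, limit: 3 };
    assert!(big.subsumes(&small));  // the small request can join the big one
    assert!(!small.subsumes(&big)); // but not the other way around
    println!("subsumption works as expected");
}
```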
3. Response Caching
- Successfully completed requests are cached with a configurable TTL (default: 2 seconds)
- LRU eviction when the cache reaches its maximum size (default: 1000 entries)
- Works with both exact and subsumption matching
- Only successful results are cached
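A minimal sketch of a TTL-and-size-bounded response cache. This is illustrative only: the PR's cache uses true LRU eviction, while this sketch evicts in insertion order for brevity, and the key/value types are placeholders.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Illustrative TTL cache; the real cache evicts least-recently-used
/// entries, this sketch evicts oldest-inserted for simplicity.
struct ResponseCache<V> {
    entries: HashMap<u64, (Instant, V)>,
    order: VecDeque<u64>, // insertion order, used for eviction
    ttl: Duration,
    max_size: usize,
}

impl<V: Clone> ResponseCache<V> {
    fn new(ttl: Duration, max_size: usize) -> Self {
        Self { entries: HashMap::new(), order: VecDeque::new(), ttl, max_size }
    }

    /// Callers only insert successful results.
    fn insert(&mut self, key: u64, value: V) {
        if self.entries.len() >= self.max_size {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(key, (Instant::now(), value));
        self.order.push_back(key);
    }

    /// Hit only if the entry exists and is younger than the TTL.
    fn get(&self, key: u64) -> Option<V> {
        let (inserted_at, value) = self.entries.get(&key)?;
        (inserted_at.elapsed() < self.ttl).then(|| value.clone())
    }
}

fn main() {
    let mut cache = ResponseCache::new(Duration::from_secs(2), 2);
    cache.insert(1, "a".to_string());
    cache.insert(2, "b".to_string());
    cache.insert(3, "c".to_string()); // capacity 2: evicts key 1
    assert_eq!(cache.get(1), None);
    assert_eq!(cache.get(3), Some("c".to_string()));
    println!("cache behaves as expected");
}
```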
4. Slot-Based Rate Limiting
- Each validator has a maximum concurrent request limit (default: 100)
- Async await mechanism: requests wait for available slots without polling
- Prevents overloading individual validators
- Automatic slot release on request completion
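A blocking `Mutex`/`Condvar` analogue of the slot mechanism, for illustration only: the real scheduler awaits an async notification instead of blocking a thread, and releases slots automatically via a guard on request completion.

```rust
use std::sync::{Condvar, Mutex};

/// Per-validator slot counter (illustrative names). Waiters block on a
/// condition variable rather than polling.
struct Slots {
    used: Mutex<usize>, // slots currently in use
    freed: Condvar,
    max: usize,
}

impl Slots {
    fn new(max: usize) -> Self {
        Self { used: Mutex::new(0), freed: Condvar::new(), max }
    }

    /// Wait until a slot is free, then take it.
    fn acquire(&self) {
        let mut used = self.used.lock().unwrap();
        // Re-check after every wakeup: Condvar wakeups can be spurious.
        while *used >= self.max {
            used = self.freed.wait(used).unwrap();
        }
        *used += 1;
    }

    /// Release the slot and wake one waiter.
    fn release(&self) {
        *self.used.lock().unwrap() -= 1;
        self.freed.notify_one();
    }
}

fn main() {
    let slots = Slots::new(2);
    slots.acquire();
    slots.acquire(); // both slots taken; a third acquire would block
    slots.release();
    slots.acquire(); // succeeds immediately after the release
    assert_eq!(*slots.used.lock().unwrap(), 2);
    println!("slot accounting is consistent");
}
```

In the async version the same shape is typically expressed with a semaphore whose permits are returned by an RAII guard, which is what makes "automatic slot release on request completion" possible even when the request errors out.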
5. Alternative Peer Handling
- When multiple callers request the same data, they register as "alternative peers"
- If the original request times out (>200ms), any alternative peer can complete the request
- The result is broadcast to all waiting requesters
- Provides resilience against slow validators
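The 200 ms staleness rule that gates both deduplication and alternative-peer takeover can be sketched as follows (the type and method names are hypothetical; in the PR this threshold is configurable via `--max-request-ttl-ms`):

```rust
use std::time::{Duration, Instant};

/// Default threshold; configurable in the real scheduler.
const MAX_REQUEST_TTL: Duration = Duration::from_millis(200);

/// One in-flight request, as tracked by the in-flight tracker (sketch).
struct InFlight {
    started: Instant,
}

impl InFlight {
    /// New callers may join (deduplicate against) this request only while
    /// it is fresh; once stale, they should start a fresh attempt against
    /// an alternative peer instead of piling onto a possibly-stuck request.
    fn can_join(&self) -> bool {
        self.started.elapsed() < MAX_REQUEST_TTL
    }
}

fn main() {
    let fresh = InFlight { started: Instant::now() };
    assert!(fresh.can_join());
    let stale = InFlight {
        started: Instant::now() - Duration::from_millis(300),
    };
    assert!(!stale.can_join());
    println!("staleness rule behaves as expected");
}
```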
6. Modular Architecture

Created a new `requests_scheduler` module with clear separation of concerns:
```
requests_scheduler/
├── mod.rs - Module exports and constants
├── scheduler.rs - RequestsScheduler orchestration logic
├── in_flight_tracker.rs - In-flight request tracking and deduplication
├── node_info.rs - Per-validator performance tracking
├── request.rs - Request types and result extraction
└── scoring.rs - Configurable scoring weights
```
### API
High-level APIs:
```rust
// Execute with the best available validator.
scheduler.with_best(request_key, |peer| async {
    peer.download_certificates(chain_id, start, limit).await
}).await;

// Execute with a specific validator.
scheduler.with_peer(request_key, peer, |peer| async {
    peer.download_blob(blob_id).await
}).await;
```

Configuration:

```rust
// Parameter names shown as comments for clarity.
let scheduler = RequestsScheduler::with_config(
    validator_nodes,
    /* max_requests_per_node: */ 100,
    /* weights: */ ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 },
    /* alpha: */ 0.1, // EMA smoothing factor
    /* max_expected_latency_ms: */ 5000.0,
    /* cache_ttl: */ Duration::from_secs(2),
    /* max_cache_size: */ 1000,
);
```
### Benefits
- Reduced network load: deduplication and caching eliminate redundant requests
- Better performance: intelligent peer selection routes requests to the fastest validators
- Improved reliability: the alternative peer mechanism provides resilience
- Protection for validators: rate limiting prevents overload
- Efficient resource usage: EMA-based scoring optimizes validator selection
- Clean architecture: modular design keeps the code maintainable and testable
### Metrics
In production usage, this should significantly reduce:
- Network traffic between clients and validators
- Validator CPU/memory usage from redundant requests
- Client request latency through caching and smart routing
- Failed requests through performance tracking and rate limiting
The following metrics have been added to Prometheus (when compiled with `--features metrics`):
- `requests_scheduler_response_time_ms` - response time for requests to validators, in milliseconds
- `requests_scheduler_request_total` - total number of requests made to each validator
- `requests_scheduler_request_success` - number of successful requests to each validator (so `(requests_scheduler_request_total - requests_scheduler_request_success) / requests_scheduler_request_total` is an error rate)
- `requests_scheduler_request_deduplication_total` - number of requests that were deduplicated by joining an in-flight request
- `requests_scheduler_request_cache_hit_total` - number of requests that were served from cache
## Test Plan
Existing CI ensures we maintain backwards compatibility. Tests have been added to the new modules.
## Release Plan
- Nothing to do / These changes follow the usual release cycle.
## Links
- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
The new options are documented in `CLI.md` (+15 lines):

```diff
@@ -190,6 +190,21 @@ Client implementation and command-line tool for the Linera blockchain
 *`--max-joined-tasks <MAX_JOINED_TASKS>` — Maximum number of tasks that can be joined concurrently in the client

   Default value: `100`
+*`--max-accepted-latency-ms <MAX_ACCEPTED_LATENCY_MS>` — Maximum expected latency in milliseconds for score normalization
+
+  Default value: `5000`
+*`--cache-ttl-ms <CACHE_TTL_MS>` — Time-to-live for cached responses in milliseconds
+
+  Default value: `2000`
+*`--cache-max-size <CACHE_MAX_SIZE>` — Maximum number of entries in the cache
+
+  Default value: `1000`
+*`--max-request-ttl-ms <MAX_REQUEST_TTL_MS>` — Maximum latency for an in-flight request before we stop deduplicating it (in milliseconds)
+
+  Default value: `200`
+*`--alpha <ALPHA>` — Smoothing factor for Exponential Moving Averages (0 < alpha < 1). Higher values give more weight to recent observations. Typical values are between 0.01 and 0.5. A value of 0.1 means that 10% of the new observation is considered and 90% of the previous average is retained
+
+  Default value: `0.1`
 *`--storage <STORAGE_CONFIG>` — Storage configuration for the blockchain history
 *`--storage-max-concurrent-queries <STORAGE_MAX_CONCURRENT_QUERIES>` — The maximal number of simultaneous queries to the database
 *`--storage-max-stream-queries <STORAGE_MAX_STREAM_QUERIES>` — The maximal number of simultaneous stream queries to the database
```