-
Notifications
You must be signed in to change notification settings - Fork 138
HIP-1081 Prioritize block nodes by latency #11607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
HIP-1081 Prioritize block nodes by latency #11607
Conversation
Signed-off-by: Xin Li <[email protected]>
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferences |
Signed-off-by: Xin Li <[email protected]>
importer/src/main/java/org/hiero/mirror/importer/downloader/block/BlockNode.java
Outdated
Show resolved
Hide resolved
importer/src/main/java/org/hiero/mirror/importer/downloader/block/BlockNode.java
Outdated
Show resolved
Hide resolved
importer/src/main/java/org/hiero/mirror/importer/downloader/block/BlockNodeSubscriber.java
Outdated
Show resolved
Hide resolved
importer/src/main/java/org/hiero/mirror/importer/downloader/block/BlockNodeSubscriber.java
Outdated
Show resolved
Hide resolved
…ency Signed-off-by: Xin Li <[email protected]>
…ency Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
…ency Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
…tize-block-nodes-by-latency Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
…ency Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
…ency Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
…ency Signed-off-by: Xin Li <[email protected]>
importer/src/main/java/org/hiero/mirror/importer/downloader/block/BlockNodeSubscriber.java
Show resolved
Hide resolved
importer/src/main/java/org/hiero/mirror/importer/downloader/block/Latency.java
Show resolved
Hide resolved
…ency Signed-off-by: Xin Li <[email protected]>
Signed-off-by: Xin Li <[email protected]>
7bbef9c to
ddca603
Compare
Signed-off-by: Xin Li <[email protected]>
| | `hiero.mirror.importer.block.scheduler.maxPostProcessingLatency` | 1s | The maximum allowed post-processing delay to calculate and record block node streaming latency. | | ||
| | `hiero.mirror.importer.block.scheduler.minRescheduleInterval` | 10s | The mininum block node reschedule interval. | | ||
| | `hiero.mirror.importer.block.scheduler.rescheduleLatencyThreshold` | 50ms | The threshold to meet for lower latency block nodes to trigger a reschedule. | | ||
| | `hiero.mirror.importer.block.scheduler.type` | PRIORITY | The scheduler type. Can be `LATENCY`, `PRIORITY`, or `PRIORITY_THEN_LATENCY`. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we default to PRIORITY_THEN_LATENCY? Otherwise all this latency work will never be used in practice.
| | `hiero.mirror.importer.block.nodes[].port` | 40840 | The port of the block node server. | | ||
| | `hiero.mirror.importer.block.nodes[].priority` | 0 | The priority of the block node server. A lower value indicates higher priority, and 0 is the highest priority. | | ||
| | `hiero.mirror.importer.block.persistBytes` | false | Whether to persist the block stream file bytes to the database. | | ||
| | `hiero.mirror.importer.block.scheduler.latencyService.backlog` | 1 | The backlog size of pending latency measuring tasks. Note a max number of backlog plus 1 tasks can be scheduled when the service is idle. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer simpler latency for property names instead of implementation names.
| | `hiero.mirror.importer.block.scheduler.latencyService.backlog` | 1 | The backlog size of pending latency measuring tasks. Note a max number of backlog plus 1 tasks can be scheduled when the service is idle. | | |
| | `hiero.mirror.importer.block.scheduler.latency.backlog` | 1 | The backlog size of pending latency measuring tasks. Note a max number of backlog plus 1 tasks can be scheduled when the service is idle. | |
| final class Latency { | ||
|
|
||
| private static final int HISTORY_SIZE = 5; | ||
|
|
||
| @Getter | ||
| private long average = Long.MIN_VALUE; | ||
|
|
||
| private int count = 0; | ||
| private final long[] history = new long[HISTORY_SIZE]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class seems like overkill and may not smooth the averages as we like. Why can't we drop the class and use a simple exponential moving average in the calling class that's stored as a double or AtomicDouble?
|
|
||
| @Data | ||
| @Validated | ||
| public final class SchedulerProperties { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer this as a separate @ConfigurationProperties instead of nested to reduce coupling. Also move to scheduler package.
| case END_OF_BLOCK -> { | ||
| running = !assembler.onEndOfBlock(response.getEndOfBlock()); | ||
| if (!running) { | ||
| log.info("Cancel the subscription to try rescheduling"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log sounds like a suggestion to the operator to take some action. Do you mean to say Cancelling the subscription... to indicate the code is doing the action?
| log.info("Cancel the subscription to try rescheduling"); | |
| log.info("Cancel the subscription to try rescheduling"); |
| protected final AtomicReference<@Nullable BlockNode> current = new AtomicReference<>(); | ||
| protected final AtomicLong lastScheduledTime = new AtomicLong(0); | ||
|
|
||
| protected long lastPostProcessingLatency; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be private. Should also be atomic or volatile to be safe.
| @Scheduled(fixedDelayString = "#{@blockProperties.getScheduler().getLatencyService().getFrequency().toMillis()}") | ||
| public void schedule() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible this takes longer than frequency and multiple invocations occur simultaneously? If so, might need synchronized.
| latencyService.cancelAll(); | ||
| current.set(super.getNode(blockNumber)); | ||
| candidates.clear(); | ||
| candidates.addAll(getCandidates()); | ||
| latencyService.setNodes(candidates); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
latencyService.cancelAll() called twice: once directly and other via setNodes.
| cancelAll(); | ||
|
|
||
| long bornGeneration = generation.incrementAndGet(); | ||
| nodes.forEach(blockNode -> tasks.add(new Task(bornGeneration, blockNode))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to optimize to skip currently connected node since we can measure that directly elsewhere?
| log.info("Measuring {}'s latency by streaming block {}", node, nextBlockNumber); | ||
| final var timeout = | ||
| blockProperties.getScheduler().getLatencyService().getTimeout(); | ||
| node.streamBlocks(nextBlockNumber, nextBlockNumber, this::measureLatency, timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next block might not be available. Should we stream a couple previous blocks that we know are present? This would omit time waiting for next block and spread the latency across a few blocks.
Description:
This PR implements latency based block node scheduling
LATENCY,PRIORITY, andPRIORITY_THEN_LATENCYLatencyServiceto measure block node streaming latency in background for latency aware schedulersRelated issue(s):
Fixes #11271
Fixes #11546
Notes for reviewer:
Checklist