@AlfredoG87 AlfredoG87 commented Dec 31, 2025

Summary

Refactors the BackfillPlugin from a monolithic ~650-line class into a modular architecture of focused, testable components.

Key change: Introduces a dual-scheduler design that separates historical and live-tail backfill processing—ensuring recent gaps are never blocked by long-running historical catch-up.


Architecture Overview

BackfillPlugin now acts as an orchestrator coordinating three stages:

1. Detect: GapDetector

  • Scans storage for missing block ranges
  • Classifies gaps as HISTORICAL (older) or LIVE_TAIL (recent/near-head)
  • Simplifies greedy backfill by choosing the upper bound from the greedy flag: true -> peerMax, false -> latestStoredBlock
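
As a rough illustration of the detect stage, the sketch below walks a sorted list of stored ranges and emits typed gaps. It is not the PR's code: the class, record, and method names are hypothetical, and the classification rule (a gap between stored ranges is HISTORICAL, a gap past the last stored block is LIVE_TAIL) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of gap detection; names and rules are assumptions, not the PR's implementation. */
final class GapDetectionSketch {
    enum GapType { HISTORICAL, LIVE_TAIL }

    /** Inclusive block range [start, end]. */
    record Range(long start, long end) {}

    record TypedGap(Range range, GapType type) {}

    /** Upper bound of the scan: peerMax when greedy, otherwise the latest stored block. */
    static long upperBound(boolean greedy, long peerMax, long latestStoredBlock) {
        return greedy ? peerMax : latestStoredBlock;
    }

    /**
     * @param stored     sorted, non-overlapping ranges of blocks already in storage
     * @param earliest   first block number of interest
     * @param upperBound last block number to consider
     */
    static List<TypedGap> detect(List<Range> stored, long earliest, long upperBound) {
        List<TypedGap> gaps = new ArrayList<>();
        long next = earliest; // next block number expected in storage
        for (Range r : stored) {
            if (next > upperBound) {
                break;
            }
            if (r.start() > next) {
                // Missing blocks in front of a stored range: assumed HISTORICAL
                long gapEnd = Math.min(r.start() - 1, upperBound);
                gaps.add(new TypedGap(new Range(next, gapEnd), GapType.HISTORICAL));
            }
            next = Math.max(next, r.end() + 1);
        }
        if (next <= upperBound) {
            // Missing blocks after the last stored block: assumed LIVE_TAIL
            gaps.add(new TypedGap(new Range(next, upperBound), GapType.LIVE_TAIL));
        }
        return gaps;
    }
}
```

With stored ranges 0-9 and 20-29 and a greedy upper bound of 40, this sketch reports 10-19 as HISTORICAL and 30-40 as LIVE_TAIL; with the non-greedy bound of 29 only the 10-19 gap remains.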

2. Schedule: BackfillTaskScheduler

  • Routes gaps into dedicated bounded queues by type
  • Historical and live-tail backfills run independently
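
A minimal sketch of the routing idea, assuming one bounded queue per gap type and a non-blocking submit that drops a gap when its queue is full. The class and method names are hypothetical; the real BackfillTaskScheduler also manages workers, which this example leaves out.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of dual bounded queues; not the PR's BackfillTaskScheduler. */
final class DualQueueSketch {
    enum GapType { HISTORICAL, LIVE_TAIL }

    record Gap(long start, long end, GapType type) {}

    private final BlockingQueue<Gap> historical;
    private final BlockingQueue<Gap> liveTail;

    DualQueueSketch(int historicalCapacity, int liveTailCapacity) {
        this.historical = new ArrayBlockingQueue<>(historicalCapacity);
        this.liveTail = new ArrayBlockingQueue<>(liveTailCapacity);
    }

    /** Routes a gap by type. Returns false when the target queue is full and the gap is dropped. */
    boolean submit(Gap gap) {
        BlockingQueue<Gap> target = gap.type() == GapType.LIVE_TAIL ? liveTail : historical;
        return target.offer(gap); // offer() never blocks; it returns false at capacity
    }

    /** Next historical gap, or null if none is pending. */
    Gap pollHistorical() {
        return historical.poll();
    }

    /** Next live-tail gap, or null if none is pending. */
    Gap pollLiveTail() {
        return liveTail.poll();
    }
}
```

Because each type has its own queue, a full historical queue in this sketch does not prevent a live-tail gap from being accepted.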

3. Execute: BackfillRunner

  • Selects optimal peer via PriorityHealthBasedStrategy
  • Fetches blocks in configurable batches
  • Awaits persistence confirmation before fetching more (natural backpressure)
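
The backpressure step can be pictured as follows: after dispatching a batch, the runner waits until every block in it has been reported as persisted (or a timeout elapses) before fetching the next batch. This is a sketch under stated assumptions, not the PR's BackfillPersistenceAwaiter; all names are hypothetical and it assumes a single runner thread per awaiter.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of persistence-based backpressure; assumes one runner thread per instance. */
final class PersistenceAwaiterSketch {
    private final Set<Long> pending = ConcurrentHashMap.newKeySet();
    private volatile CountDownLatch latch = new CountDownLatch(0);

    /** Registers the inclusive block range [first, last] before the batch is dispatched. */
    void trackBatch(long first, long last) {
        latch = new CountDownLatch((int) (last - first + 1));
        for (long block = first; block <= last; block++) {
            pending.add(block);
        }
    }

    /** Called from a persistence notification; duplicates and unknown blocks are ignored. */
    void onPersisted(long blockNumber) {
        if (pending.remove(blockNumber)) {
            latch.countDown();
        }
    }

    /** Returns true if the whole batch was persisted within the timeout. */
    boolean awaitBatch(long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

A runner loop built on this would call trackBatch, dispatch the fetched blocks, and fetch the next batch only after awaitBatch returns true.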

New Components

| Class | Responsibility |
|---|---|
| BackfillRunner | Core execution logic |
| BackfillTaskScheduler | Bounded queue task scheduling |
| BackfillPersistenceAwaiter | Persistence-based backpressure |
| GapDetector | Gap detection and classification |
| PriorityHealthBasedStrategy | Health-aware node selection |
| NodeSelectionStrategy | Node selection abstraction |

Configuration

| Property | Default | Description |
|---|---|---|
| backfill.historicalQueueCapacity | 20 | Max pending gaps for historical scheduler |
| backfill.liveTailQueueCapacity | 10 | Max pending gaps for live-tail scheduler |
| backfill.healthPenaltyPerFailure | 1000.0 | Health score penalty per node failure |
| backfill.maxBackoffMs | 300000 | Maximum backoff duration (5 min) |
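
To show how the two health-related properties could interact, here is a hedged sketch of priority-then-health node selection with a capped exponential backoff. The scoring formula, the "lower priority value wins" rule, the doubling schedule, and the initial delay parameter are all assumptions for illustration; the PR description does not spell out how PriorityHealthBasedStrategy computes them.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of health-aware selection; formulas are assumptions, not the PR's strategy. */
final class NodeSelectionSketch {
    record Node(String name, int priority, int failures, double avgLatencyMs, long backoffUntilMs) {}

    /** Lower is healthier: each failure adds a fixed penalty (cf. backfill.healthPenaltyPerFailure). */
    static double healthScore(Node node, double penaltyPerFailure) {
        return node.avgLatencyMs() + node.failures() * penaltyPerFailure;
    }

    /** Doubling backoff capped at maxBackoffMs (cf. backfill.maxBackoffMs); initialMs is hypothetical. */
    static long backoffMs(int failures, long initialMs, long maxBackoffMs) {
        if (failures <= 0) {
            return 0L;
        }
        int shift = Math.min(failures - 1, 30); // bound the shift to avoid overflow
        return Math.min(maxBackoffMs, initialMs * (1L << shift));
    }

    /** Picks the available node with the lowest priority value, breaking ties by health score. */
    static Optional<Node> select(List<Node> nodes, long nowMs, double penaltyPerFailure) {
        Comparator<Node> order = Comparator.comparingInt(Node::priority)
                .thenComparingDouble((Node n) -> healthScore(n, penaltyPerFailure));
        return nodes.stream()
                .filter(n -> n.backoffUntilMs() <= nowMs) // skip nodes still backing off
                .min(order);
    }
}
```

With the default penalty of 1000.0, a node with two failures and 50 ms latency scores 2050 in this sketch and loses to a same-priority node with no failures and 200 ms latency.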

Other Changes

  • Exposed WebClient HTTP/2 tuning parameters with high-throughput defaults
  • Extended BlockNodeSource proto with optional NodeId and Name fields (non-breaking)
  • Updated design and configuration documentation
  • Reduced excessively verbose logging in BlockMessagingFacilityImpl

Review guide (recommended order)

If you want the fastest mental model, I recommend reviewing in this order:

  1. GapDetector – gap identification + HISTORICAL vs LIVE_TAIL classification
  2. BackfillTaskScheduler – dual bounded queues + worker lifecycle
  3. BackfillRunner – node selection → fetch → dispatch → await persistence loop
  4. BackfillPersistenceAwaiter – how backpressure is enforced via notifications
  5. Config + docs – new properties, defaults, and documentation updates

PR Stats

| Category | Lines | % of PR |
|---|---|---|
| Tests | ~2,459 | 70% |
| Main Source | ~1,282 | 36% |
| Docs | ~261 | 7% |

  • Tests (70%): 102 test methods covering critical backfill infrastructure.
  • Main Source (36%): Significant architectural refactor - BackfillPlugin split into 6 focused components (BackfillRunner, GapDetector, TaskScheduler, etc.)
  • Docs (7%): Updated design documentation

Related Issues

Fixes #1977
Fixes #1550
Fixes #1502
Fixes #1778

@AlfredoG87 AlfredoG87 self-assigned this Jan 1, 2026
@AlfredoG87 AlfredoG87 modified the milestones: 0.27.0, 0.26.0 Jan 1, 2026
@AlfredoG87 AlfredoG87 added the Block Node Issues/PR related to the Block Node. label Jan 1, 2026
@AlfredoG87 AlfredoG87 changed the title refactor(backfill): Backfill Plugin Major Refactoring refactor(backfill): Backfill Plugin Major Refactor Jan 2, 2026
@AlfredoG87 AlfredoG87 changed the title refactor(backfill): Backfill Plugin Major Refactor refactor(backfill): Backfill Plugin Major Refactor And Improvements Jan 2, 2026
@AlfredoG87 AlfredoG87 marked this pull request as ready for review January 2, 2026 22:44
@AlfredoG87 AlfredoG87 requested review from a team as code owners January 2, 2026 22:44
@AlfredoG87 AlfredoG87 added the Improvement Code changes driven by non business requirements label Jan 3, 2026

@ata-nas ata-nas left a comment


First pass on this one; it will need at least one more. I have not looked at the tests yet.

I am leaving general cleanup comments and questions.

I feel, however, that we are really missing out by not using the ranged sets; instead we use a list of ranges and manipulate them manually. I do see usages of merging ranges, which the ranged sets would do automatically and more efficiently. I just do not see why we need a list, which complicates some of what we do, instead of using the ranged sets. Since we are doing such a big rework, now is the time to make these decisions. Something worth thinking about.


AlfredoG87 commented Jan 6, 2026

@ata-nas Thank you for reviewing my PR. I really appreciate and value your input and the time and effort you put into it 🙏

I've addressed all your comments, mostly by adopting your suggestions.

As for the suggestion in your general notes:

I feel, however, that we are really missing out on not using the ranged sets and instead we are using a list of ranges and manually manipulating them. I do see usages of merging ranges, this is something the ranged sets will do automatically in a more performant fashion. I guess I just do not see why we need to use a list and complicating some of the things we do, instead of just using the ranged sets. Since we are doing such a big rework, now is the time to make decisions. Something worth thinking about.

I've thought this over and decided that the benefits do not outweigh the effort and added complexity. Our use case is simpler than what the BlockRangeSet implementations target, so I prefer to keep it simple with a plain List.
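
For readers weighing the trade-off discussed above, manually merging a plain list of ranges amounts to roughly the following. This is an illustrative sketch only: the class and record names are hypothetical, and it is neither the PR's merging code nor the BlockRangeSet API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of merging inclusive ranges kept in a plain List. */
final class RangeMergeSketch {
    record Range(long start, long end) {}

    /** Returns sorted ranges with overlapping or adjacent entries coalesced. */
    static List<Range> merge(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(Range::start));
        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                if (r.start() <= last.end() + 1) {
                    // Overlapping or directly adjacent: extend the previous range
                    merged.set(merged.size() - 1,
                            new Range(last.start(), Math.max(last.end(), r.end())));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }
}
```

The sort makes each merge O(n log n) over the whole list, which is the kind of manual bookkeeping a range-set structure would hide.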

@AlfredoG87 AlfredoG87 requested a review from ata-nas January 6, 2026 04:46
@AlfredoG87 AlfredoG87 force-pushed the backfill-improvements2 branch from 0d1abe4 to b5048d2 Compare January 6, 2026 05:50
Introduce typed gaps to classify detected block gaps as HISTORICAL or
LIVE_TAIL for routing to appropriate schedulers. GapDetector now returns
TypedGap instances with proper boundary detection.

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
Replace single scheduler with two independent schedulers (historical and
live-tail) so live blocks never wait for historical backfill. Each scheduler
has bounded queue with discard-on-full semantics. Remove unused
BackfillScheduler wrapper and BackfillTask status tracking.

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
Update BackfillPlugin to orchestrate two independent schedulers with
dedicated executors. Add high-water mark deduplication for live-tail gaps.
Add configuration for queue capacities and health penalty settings.

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
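
The high-water mark deduplication mentioned in the commit above can be sketched as follows: remember the highest live-tail block already scheduled and only hand out the part of a newly detected gap that lies beyond it. The names and the exact trimming behavior are assumptions for illustration, not the plugin's code.

```java
import java.util.Optional;

/** Hypothetical sketch of high-water mark deduplication for live-tail gaps. */
final class HighWaterMarkSketch {
    record Range(long start, long end) {}

    private long highWaterMark = -1L; // highest live-tail block already scheduled

    /**
     * Returns the portion of [start, end] that has not been scheduled yet,
     * or an empty Optional when the whole range is at or below the mark.
     */
    synchronized Optional<Range> claim(long start, long end) {
        long from = Math.max(start, highWaterMark + 1);
        if (from > end) {
            return Optional.empty(); // nothing new: the gap was already scheduled
        }
        highWaterMark = end;
        return Optional.of(new Range(from, end));
    }
}
```

Repeated detection passes that report the same live-tail gap then produce at most one scheduled task per block in this sketch.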
Merge gRPC client functionality into BackfillFetcher. Use configurable
health penalty and backoff settings. Remove redundant BackfillGrpcClient.

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
Add @Timeout annotations to all test classes (30s for integration, 5s for
unit tests) to fail fast if tests hang instead of blocking indefinitely.

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
Update configuration docs with new queue capacity and health settings.
Update design docs to reflect dual scheduler architecture.

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
…gic into BackfillPersistenceAwaiter class.

Improved logs overall for the plugin.

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
Simplified TypedGap and GapType into nested classes of GapDetector for readability and code reduction

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
…ualityCheck for now as the current PBJ is

improve logging on final failed retries to a peer

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
  - Use unparsed block builder directly
  - Replace mock metrics with real TestUtils.createMetrics()
  - Replace var with explicit types
  - Split multi-scenario tests into focused single-behavior tests

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
…tenceAwaiter

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
…t names, use assertSame for reference checks

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>
  - Simplify node logging using PBJ toString()
  - Use log format replacement instead of .formatted()
  - Add missing newline at EOF (block_node_source.proto)

Signed-off-by: Alfredo Gutierrez Grajeda <alfredo@hashgraph.com>

@ata-nas ata-nas left a comment


We are in a good state here. Great work @AlfredoG87! We can proceed with what we have. Looking forward to even more improvements and seeing it work in action!

@AlfredoG87

We are in a good state here. Great work @AlfredoG87! We can proceed with what we have. Looking forward to even more improvements and seeing it work in action!

Yes, I've been doing plenty of local testing, but I'm eager to see it in the wild 💯

Thank you for your hard work 🙏

@AlfredoG87 AlfredoG87 merged commit db62ff6 into main Jan 16, 2026
20 of 23 checks passed
@AlfredoG87 AlfredoG87 deleted the backfill-improvements2 branch January 16, 2026 15:56

codecov bot commented Jan 16, 2026

Codecov Report

❌ Patch coverage is 87.56906% with 90 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../org/hiero/block/node/backfill/BackfillPlugin.java | 75.53% | 33 Missing and 13 partials ⚠️ |
| ...ro/block/node/backfill/client/BlockNodeClient.java | 77.96% | 10 Missing and 3 partials ⚠️ |
| ...org/hiero/block/node/backfill/BackfillFetcher.java | 91.60% | 8 Missing and 4 partials ⚠️ |
| ...ock/node/backfill/PriorityHealthBasedStrategy.java | 88.13% | 4 Missing and 3 partials ⚠️ |
| ...ill/client/BlockStreamSubscribeUnparsedClient.java | 42.85% | 2 Missing and 2 partials ⚠️ |
| .../org/hiero/block/node/backfill/BackfillRunner.java | 98.30% | 1 Missing and 1 partial ⚠️ |
| ...ero/block/node/backfill/BackfillTaskScheduler.java | 95.12% | 1 Missing and 1 partial ⚠️ |
| ...ava/org/hiero/block/node/backfill/GapDetector.java | 93.54% | 1 Missing and 1 partial ⚠️ |
| ...lock/node/backfill/BackfillPersistenceAwaiter.java | 98.24% | 0 Missing and 1 partial ⚠️ |
| ...ock/node/messaging/BlockMessagingFacilityImpl.java | 83.33% | 1 Missing ⚠️ |

@@             Coverage Diff              @@
##               main    #2006      +/-   ##
============================================
+ Coverage     78.99%   79.95%   +0.95%     
- Complexity     1241     1334      +93     
============================================
  Files           130      136       +6     
  Lines          5952     6250     +298     
  Branches        646      688      +42     
============================================
+ Hits           4702     4997     +295     
+ Misses          955      954       -1     
- Partials        295      299       +4     

| Files with missing lines | Coverage Δ | Complexity Δ |
|---|---|---|
| ...ero/block/node/backfill/BackfillConfiguration.java | 100.00% <ø> (ø) | 1.00 <0.00> (ø) |
| ...ero/block/node/backfill/NodeSelectionStrategy.java | 100.00% <100.00%> (ø) | 0.00 <0.00> (?) |
| ...ero/block/node/spi/historicalblocks/LongRange.java | 100.00% <100.00%> (ø) | 30.00 <5.00> (+5.00) |
| ...lock/node/backfill/BackfillPersistenceAwaiter.java | 98.24% <98.24%> (ø) | 16.00 <16.00> (?) |
| ...ock/node/messaging/BlockMessagingFacilityImpl.java | 85.93% <83.33%> (+0.16%) | 43.00 <2.00> (ø) |
| .../org/hiero/block/node/backfill/BackfillRunner.java | 98.30% <98.30%> (ø) | 31.00 <31.00> (?) |
| ...ero/block/node/backfill/BackfillTaskScheduler.java | 95.12% <95.12%> (ø) | 16.00 <16.00> (?) |
| ...ava/org/hiero/block/node/backfill/GapDetector.java | 93.54% <93.54%> (ø) | 11.00 <11.00> (?) |
| ...ill/client/BlockStreamSubscribeUnparsedClient.java | 69.62% <42.85%> (+6.95%) | 4.00 <1.00> (ø) |
| ...ock/node/backfill/PriorityHealthBasedStrategy.java | 88.13% <88.13%> (ø) | 26.00 <26.00> (?) |

... and 3 more

... and 1 file with indirect coverage changes

