Skip to content

Conversation

@dgafka
Copy link
Member

@dgafka dgafka commented Jan 9, 2026

Why is this change proposed?

Batched backfill is a feature in Ecotone's projection system that allows rebuilding projections by processing partitions in configurable batches. It supports both synchronous and asynchronous execution modes, enabling efficient processing of large datasets without overwhelming system resources.

For global tracked projection it allows to do the rebuild asynchronously.
For partitioned projections it allows to scale whole process of rebuild, and do the rebuild for different partitions concurrently. Speeding up the whole process depending on the message consumers count.

Backfill attribute

Configures backfill behavior for a projection::
backfillPartitionBatchSize: Number of partitions to process in a single batch (default: 100, minimum: 1)
asyncChannelName: Optional async channel name for asynchronous backfill execution (null = synchronous)

Example: 5 Partitions with Batch Size 2

Configuration:

#[ProjectionBackfill(backfillPartitionBatchSize: 2, asyncChannelName: 'backfill_channel')]

Execution:

  • Calculate batches: 5 partitions ÷ 2 = 3 batches (ceil)
  • Send 3 messages to backfill_channel:
  • Batch 0: offset=0, limit=2 → processes partitions [0, 1]
  • Batch 1: offset=2, limit=2 → processes partitions [2, 3]
  • Batch 2: offset=4, limit=2 → processes partition [4]
  • Consumer processes each message from the channel
  • Result: All 5 partitions processed in 3 separate batch operations

Pull Request Contribution Terms

  • I have read and agree to the contribution terms outlined in CONTRIBUTING.

#[Attribute(Attribute::TARGET_CLASS)]
class ProjectionExecution
{
public function __construct(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed from ProjectionBatchSize attribute

@dgafka dgafka merged commit cc2bc01 into main Jan 9, 2026
9 checks passed
@dgafka dgafka deleted the feat/async-batched-backfill branch January 9, 2026 19:04
Copy link
Contributor

@jlabedo jlabedo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice !

FROM {$streamTable}
WHERE metadata->>'_aggregate_type' = ?
SQL, [$this->aggregateType]);
ORDER BY aggregate_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause a full table scan for each batch.

For partitioned projections, I think we may have to handle all the partitioning in the projecting module. When you "prepareForBackfill", you read all the events from the event store once and write all aggregate ids in an ephemeral table, which acts like a queue. Then all workers can read the first unlocked row, lock it, project, then delete the row.

This kind of system may open the road to more advanced partitioning strategy. Like partition by any header.

Wdyt ?

Copy link
Member Author

@dgafka dgafka Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For partitioned projections, I think we may have to handle all the partitioning in the projecting module.

Why is it so?

In general the trick here is that prepareBackfill is to be as simple as possible. If you will take a look on it, it only need to know the count (single sql), and then generates all the messages to for rebuild within PHP process. This way, we can generate everything for rebuild within seconds even for large scale event streams.
The rest is to be done by async Message Consumers.

Therefore the case is that we need to ensure that we can go by deterministic in batches (limit offset provides next aggregate ids). So if current approach makes full table scan maybe we can do it differently, or require adding custom index for it?

*/
declare(strict_types=1);

namespace Ecotone\EventSourcing\Attribute;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, those attributes are related to the pdo event sourcing module. If you want to project from another event source (let's say another event store implementation), you would create another attribute that would register its event store as an event source

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do not hold any database (pdo) specific, they just stand what stream should be used (maybe the aggregate-type is a bit specific, but it could be delivered as separate attribute).

So I think it's fine to allow it to be part of generic Projecton API?

/**
* @deprecated Use prepareBackfill() instead. This method is kept for backward compatibility.
*/
public function backfill(): void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not public for now, we may avoid deprecation and remove the method until it's stable ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProjectingManager is internal, so won't be used by end users, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants