
[PEP] stream plugin that supports object-storage micro-batches with optional inline batch support #17331

@udaysagar2177


Summary:

Introduce a new stream plugin that can handle external files or inline batches for real-time ingestion.

Motivation:

  • Reduce the network costs of Kafka cross-AZ replication, production, and consumption.
  • Kafka delivers sub-second message latency but requires substantial infrastructure; many use cases can tolerate higher latency at lower cost.
  • High-volume traffic can exceed Kafka broker throughput or retention limits, necessitating complex operational management.

Core Mechanics

  • Partition-level consumers receive micro-batch descriptor records instead of individual events.
  • Each record triggers a controlled background fetch that downloads the referenced object via PinotFS.
  • A dedicated thread extracts events using the configured RecordReader and pushes them into a bounded in-memory queue.
  • The LLC consumer remains unchanged and reads from this in-memory queue as if it were a normal streaming source.
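
To make the flow above concrete, here is a minimal sketch, not the proposed implementation. It assumes a hypothetical descriptor shape (`uri` for an external object, `inline` for an embedded batch), uses a plain dict as a stand-in for PinotFS, and treats newline-delimited JSON as the stand-in for the configured RecordReader. The bounded queue is what provides backpressure between the fetch thread and the consumer.

```python
import json
import queue
import threading

_DONE = object()  # sentinel that marks the end of the descriptor stream


def fetch_object(uri, store):
    """Stand-in for a PinotFS download; `store` is a dict of URI -> bytes (assumption)."""
    return store[uri]


def read_records(payload):
    """Stand-in for a RecordReader; assumes newline-delimited JSON (assumption)."""
    for line in payload.decode("utf-8").splitlines():
        if line.strip():
            yield json.loads(line)


def producer(descriptors, store, out_queue):
    """Background thread: resolve each descriptor and push its events to the queue."""
    for kafka_offset, descriptor in descriptors:
        if "uri" in descriptor:  # hypothetical field: external object
            payload = fetch_object(descriptor["uri"], store)
        else:  # hypothetical field: inline batch
            payload = descriptor["inline"].encode("utf-8")
        for event_index, record in enumerate(read_records(payload)):
            # put() blocks while the queue is full, which throttles fetching.
            out_queue.put((kafka_offset, event_index, record))
    out_queue.put(_DONE)


def consume_all(descriptors, store, capacity=2):
    """Drain the bounded queue the way a consumer would read a normal stream."""
    bounded = queue.Queue(maxsize=capacity)
    worker = threading.Thread(
        target=producer, args=(descriptors, store, bounded), daemon=True
    )
    worker.start()
    consumed = []
    while True:
        item = bounded.get()
        if item is _DONE:
            break
        consumed.append(item)
    worker.join()
    return consumed
```

Each consumed item carries both the descriptor's Kafka offset and the event's position inside the batch, which is the pair the offset model below relies on.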

Offset Model

  • Offsets use a serialized JSON structure similar to that of the Kinesis stream plugin.
  • The JSON tracks:
    • The Kafka record offset carrying the micro-batch descriptor record.
    • The intra-file (or intra-batch) event offset.
  • Supports replay correctness, restart recovery, and stable start/end offset behavior.
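
A two-part offset like this could be sketched as follows. The field names `kafka_offset` and `event_index` are hypothetical, and the sketch assumes a checkpoint denotes the next event to consume (inclusive start); the actual plugin may define both differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True, order=True)
class MicroBatchOffset:
    """Composite offset; ordering compares kafka_offset first, then event_index."""

    kafka_offset: int  # offset of the Kafka record carrying the descriptor
    event_index: int   # position of the event inside the file or inline batch

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        return cls(int(data["kafka_offset"]), int(data["event_index"]))


def should_emit(checkpoint, candidate):
    """On restart, skip events that precede the checkpoint (inclusive start assumed)."""
    return candidate >= checkpoint
```

Because the ordering is lexicographic, a restart from `{"kafka_offset": 42, "event_index": 7}` would re-read the descriptor at Kafka offset 42, skip its first seven events, and resume from the eighth.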

Other advantages:

  • Enables real-time ingestion of Avro or Parquet files without complicating the architecture with Spark or a separate batch ingestion job.
  • Enables improved compression efficiency using inline batches compared to the message-per-event model.
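
The compression point can be illustrated with a rough comparison. This sketch uses zlib and JSON purely for illustration; the codec, the event schema, and the helper names are assumptions, not part of the proposal.

```python
import json
import zlib


def per_event_compressed_size(events):
    """Total bytes when each event is compressed as its own message."""
    return sum(len(zlib.compress(json.dumps(e).encode("utf-8"))) for e in events)


def batch_compressed_size(events):
    """Bytes when all events are compressed together as one inline batch."""
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return len(zlib.compress(payload))


# Hypothetical events with repeated field names and values.
sample = [
    {"user_id": i, "event_type": "click", "country": "US"} for i in range(1000)
]
```

Comparing `batch_compressed_size(sample)` with `per_event_compressed_size(sample)` shows the effect: compressing events together lets the codec exploit redundancy across events, such as repeated field names, which per-message compression cannot see.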

Micro-batch descriptor protocol

  • The micro-batch protocol defines deterministic sub-selection rules.
  • Consumers may extract:
    • The full batch, or
    • Only the assigned-partition subset.
  • Replay semantics remain fully stable.
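
One way such a sub-selection rule could work is sketched below. The rule itself (CRC32 of a key field modulo the partition count) and the `id` key field are hypothetical; the proposal does not specify them. The point of the sketch is that the rule depends only on the event content, so a replay selects the same events, and that each selected event keeps its index in the full batch so intra-batch offsets do not shift.

```python
import zlib


def partition_for(key, num_partitions):
    """Deterministic mapping; CRC32 is stable across processes, unlike hash()."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def select_events(events, assigned_partition=None, num_partitions=1, key_field="id"):
    """Yield (batch_index, event) pairs.

    With assigned_partition=None the full batch is returned. Otherwise only the
    events that map to the assigned partition are returned, each with its
    position in the full batch.
    """
    for batch_index, event in enumerate(events):
        if assigned_partition is None:
            yield batch_index, event
        elif partition_for(str(event[key_field]), num_partitions) == assigned_partition:
            yield batch_index, event
```

With this rule, the per-partition subsets are disjoint and together cover the whole batch, so every event is ingested by exactly one partition consumer.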

Expected Outcome

A stream plugin that supports object-storage micro-batches with optional inline batch support, reduces Kafka network and infrastructure costs, and simplifies ingestion pipelines for file-based formats.

If this proposal aligns with the project’s direction, I would be happy to move it forward and submit a pull request for review.
