-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Open
Labels
PEP-RequestPinot Enhancement Proposal request to be reviewed.Pinot Enhancement Proposal request to be reviewed.ingestionreal-time
Description
Summary:
Introduce a new stream plugin that can handle external files or inline batches for real-time ingestion.
Motivation:
- Reduce Kafka cross-AZ replication, production, and consumption network costs.
- Kafka delivers sub-second message latency but requires substantial infrastructure; many use cases can tolerate higher latency at lower cost.
- High-volume traffic can exceed Kafka broker throughput or retention limits, necessitating complex operational management.
Core Mechanics
- Partition-level consumers receive micro-batch descriptor records instead of individual events.
- Each record triggers a controlled background fetch logic that downloads the referenced object via PinotFS.
- A dedicated thread extracts events using the configured RecordReader and pushes them into a bounded in-memory queue.
- The LLC consumer remains unchanged and reads from this in-memory queue as if it were a normal streaming source.
Offset Model
- Offsets use a serialized JSON structure similar to the Kinesis stream plugin.
- JSON tracks:
- The Kafka record offset carrying the micro-batch descriptor record.
- The intra-file (or intra-batch) event offset.
- Supports replay correctness, restart recovery, and stable start/end offset behavior.
Other advantages:
- Enables real-time ingestion of Avro or Parquet files without complicating the architecture with Spark or a separate batch ingestion job.
- Enables improved compression efficiency using inline batches compared to the message-per-event model.
Micro-batch descriptor protocol
- The micro-batch protocol defines deterministic sub-selection rules.
- Consumers may extract:
- The full batch, or
- Only the assigned-partition subset,
- Replay semantics remain fully stable.
Expected Outcome
A stream plugin that supports object-storage micro-batches with optional inline batch support, reduces the impact of Kafka, and simplifies ingestion pipelines for file-based formats.
If this proposal aligns with the project’s direction. I would be happy to move it forward and submit a pull request for review.
ankitsultana
Metadata
Metadata
Assignees
Labels
PEP-RequestPinot Enhancement Proposal request to be reviewed.Pinot Enhancement Proposal request to be reviewed.ingestionreal-time