fresh-borzoni (Contributor) commented Jan 6, 2026

Purpose

Linked issue: close #27

New feature for analytics use cases. Provides scanner variants with compile-time mode separation between per-record and batch access patterns.

Brief change log

Scanner architecture:

  • Created RecordBatchLogScanner for batch access (separate from LogScanner)
  • Introduced LogScannerInner holding shared implementation
  • Both scanner types wrap Arc<LogScannerInner> with distinct public APIs:
    • LogScanner::poll() -> Result<ScanRecords> (per-record with offset/timestamp metadata)
    • RecordBatchLogScanner::poll() -> Result<Vec<RecordBatch>> (direct batch access)
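
The wrapper layout described above can be illustrated with a minimal, std-only sketch. It is not the PR's code: `Batch`, `Record`, `Inner`, `RecordScanner`, and `BatchScanner` are hypothetical stand-ins for `RecordBatch`, `ScanRecords`, `LogScannerInner`, `LogScanner`, and `RecordBatchLogScanner`, and the real `poll()` takes a timeout and returns a `Result`. The point is only that both public types hold an `Arc` to the same inner type while each exposes a single access pattern, so a caller cannot mix the two on one scanner value.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an Arrow RecordBatch: a base offset plus rows.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub base_offset: i64,
    pub rows: Vec<String>,
}

/// Hypothetical per-record view that carries offset metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub offset: i64,
    pub value: String,
}

/// Shared implementation (plays the role of LogScannerInner in this sketch).
struct Inner {
    fetched: Mutex<VecDeque<Batch>>,
}

impl Inner {
    fn new(batches: Vec<Batch>) -> Arc<Self> {
        Arc::new(Inner { fetched: Mutex::new(VecDeque::from(batches)) })
    }

    /// Removes and returns every buffered batch.
    fn drain(&self) -> Vec<Batch> {
        self.fetched.lock().unwrap().drain(..).collect()
    }
}

/// Per-record scanner: its only public access path yields records.
pub struct RecordScanner {
    inner: Arc<Inner>,
}

/// Batch scanner: its only public access path yields whole batches.
pub struct BatchScanner {
    inner: Arc<Inner>,
}

impl RecordScanner {
    pub fn new(batches: Vec<Batch>) -> Self {
        RecordScanner { inner: Inner::new(batches) }
    }

    /// Flattens buffered batches into records, deriving each record's offset.
    pub fn poll(&self) -> Vec<Record> {
        self.inner
            .drain()
            .into_iter()
            .flat_map(|b| {
                let base = b.base_offset;
                b.rows
                    .into_iter()
                    .enumerate()
                    .map(move |(i, value)| Record { offset: base + i as i64, value })
            })
            .collect()
    }
}

impl BatchScanner {
    pub fn new(batches: Vec<Batch>) -> Self {
        BatchScanner { inner: Inner::new(batches) }
    }

    /// Hands buffered batches to the caller without per-row iteration.
    pub fn poll(&self) -> Vec<Batch> {
        self.inner.drain()
    }
}
```

Because `BatchScanner` has no record-returning method and `RecordScanner` has no batch-returning one, misuse becomes a compile error rather than a runtime check.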

Implementation details:

  • Added TableScan::create_record_batch_log_scanner() constructor
  • Extended CompletedFetch trait with fetch_batches() method
  • Added LogRecordBatch::record_batch() for direct batch extraction
  • Type system prevents mixing access patterns (compile-time enforcement)

Cleanup:

  • Removed runtime guards (poll_mode: AtomicU8, mode checking logic)
  • Removed Error::IllegalState variant (no longer needed)
  • Backward compatible: existing LogScanner API unchanged
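
For contrast, the kind of runtime guard this cleanup removes might look roughly like the sketch below. It is an illustrative reconstruction, not the removed code: only the `poll_mode: AtomicU8` field and the `IllegalState` error name come from the change log, while `GuardedScanner`, `GuardError`, the mode constants, and `claim` are hypothetical.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical mode encoding for the guard.
const MODE_UNSET: u8 = 0;
const MODE_RECORDS: u8 = 1;
const MODE_BATCHES: u8 = 2;

/// Hypothetical error type; the variant name mirrors the removed Error::IllegalState.
#[derive(Debug, PartialEq)]
pub enum GuardError {
    IllegalState(&'static str),
}

/// A single scanner type that must check at runtime how it is being used.
pub struct GuardedScanner {
    poll_mode: AtomicU8,
}

impl GuardedScanner {
    pub fn new() -> Self {
        GuardedScanner { poll_mode: AtomicU8::new(MODE_UNSET) }
    }

    /// The first caller fixes the mode; later calls must request the same mode.
    fn claim(&self, wanted: u8) -> Result<(), GuardError> {
        match self.poll_mode.compare_exchange(
            MODE_UNSET,
            wanted,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            // The mode was unset and is now `wanted`.
            Ok(_) => Ok(()),
            // The mode was already set to the requested one.
            Err(current) if current == wanted => Ok(()),
            // The scanner was already used in the other mode.
            Err(_) => Err(GuardError::IllegalState("scanner already used in the other poll mode")),
        }
    }

    /// Per-record entry point (data fetching omitted in this sketch).
    pub fn poll_records(&self) -> Result<(), GuardError> {
        self.claim(MODE_RECORDS)
    }

    /// Batch entry point (data fetching omitted in this sketch).
    pub fn poll_batches(&self) -> Result<(), GuardError> {
        self.claim(MODE_BATCHES)
    }
}
```

With two separate scanner types, neither the atomic field nor the error path is needed, which is consistent with the `IllegalState` variant being dropped.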

Tests

Integration Tests:
Added test_poll_batches IT test with scenarios:

  • Basic functionality and data correctness
  • Timeout behavior on empty table
  • Field projection support
  • Order preservation across writes
  • Offset tracking across consecutive polls
  • Starting from a non-zero offset (the batch should be sliced)
  • Existing record-level tests continue to pass
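
The slicing scenario can be pictured with a small sketch. It assumes a batch is identified by the offset of its first row and that a scan starting mid-batch should return only rows at or after the requested offset. `slice_from_offset` is a hypothetical helper operating on a plain slice, not the PR's implementation.

```rust
/// Returns the rows whose offsets are at or after `start_offset`, given that
/// `rows[0]` sits at `base_offset`. Hypothetical helper for illustration only.
pub fn slice_from_offset<T>(base_offset: i64, rows: &[T], start_offset: i64) -> &[T] {
    // Number of leading rows to skip, clamped to the range [0, rows.len()].
    let skip = start_offset
        .saturating_sub(base_offset)
        .clamp(0, rows.len() as i64) as usize;
    &rows[skip..]
}
```

A start offset before the batch leaves it untouched, and one past its end yields an empty slice.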

API and Format

New API:

// Batch access
TableScan::create_record_batch_log_scanner() -> Result<RecordBatchLogScanner>
RecordBatchLogScanner::poll(timeout: Duration) -> Result<Vec<RecordBatch>>

// Existing record access (unchanged)
TableScan::create_log_scanner() -> Result<LogScanner>
LogScanner::poll(timeout: Duration) -> Result<ScanRecords>
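
A consumer of the batch API would typically poll in a loop. The sketch below shows one such loop under stated assumptions: `PollBatches` is a hypothetical trait standing in for anything shaped like `RecordBatchLogScanner::poll(timeout)`, `FakeScanner` is an in-memory test double, and treating an empty poll as "idle" is an assumption made here, not behavior taken from the PR.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical abstraction over a scanner whose poll yields batches.
pub trait PollBatches {
    type Batch;
    type Error;
    fn poll(&mut self, timeout: Duration) -> Result<Vec<Self::Batch>, Self::Error>;
}

/// Polls repeatedly and collects batches until one poll returns none.
pub fn drain_until_idle<S: PollBatches>(
    scanner: &mut S,
    timeout: Duration,
) -> Result<Vec<S::Batch>, S::Error> {
    let mut all = Vec::new();
    loop {
        let batches = scanner.poll(timeout)?;
        if batches.is_empty() {
            // Assumption: an empty result means nothing arrived within the timeout.
            return Ok(all);
        }
        all.extend(batches);
    }
}

/// In-memory fake: each queued entry is the result of one poll call.
pub struct FakeScanner {
    queued: VecDeque<Vec<Vec<i32>>>,
}

impl FakeScanner {
    pub fn new(polls: Vec<Vec<Vec<i32>>>) -> Self {
        FakeScanner { queued: VecDeque::from(polls) }
    }
}

impl PollBatches for FakeScanner {
    type Batch = Vec<i32>;
    type Error = String;

    fn poll(&mut self, _timeout: Duration) -> Result<Vec<Vec<i32>>, String> {
        // Once the queue is exhausted, every poll returns an empty result.
        Ok(self.queued.pop_front().unwrap_or_default())
    }
}
```

Keeping the loop generic over a trait lets it be exercised without a running cluster; a real caller would pass the scanner returned by `create_record_batch_log_scanner()` through a thin adapter.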

Breaking: None (additive only, existing API fully preserved)

Storage: No changes

@fresh-borzoni fresh-borzoni force-pushed the scan-log-records-batch branch from c9850bb to 1ea2031 Compare January 6, 2026 12:55
fresh-borzoni (Contributor, Author) commented

Hi @luoyuxia, PTAL 🙏
Thank you!

leekeiabstraction (Contributor) left a comment

Thank you for the PR! Hope you don't mind me reviewing it, left a comment.

fresh-borzoni (Contributor, Author) commented Jan 7, 2026

@leekeiabstraction I've split the logic into two different scanner types. PTAL 🙏

luoyuxia (Contributor) commented Jan 7, 2026

@fresh-borzoni Thanks for the PR. I'll have a look when I find some time.

Copilot AI left a comment

Pull request overview

This PR introduces a new batch-oriented scanning API (RecordBatchLogScanner) for analytics workloads, providing compile-time separation between per-record and batch access patterns. The implementation shares common logic via LogScannerInner while exposing distinct public APIs for each mode.

  • Created RecordBatchLogScanner for direct Arrow RecordBatch access alongside existing LogScanner
  • Implemented shared LogScannerInner containing common scanning logic for both scanner types
  • Added comprehensive integration tests covering basic functionality, projections, order preservation, and offset tracking

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • crates/fluss/src/client/table/scanner.rs: Core implementation. Added RecordBatchLogScanner and LogScannerInner with separate poll_batches() and collect_batches() methods for batch-level access
  • crates/fluss/src/client/table/mod.rs: Exports new RecordBatchLogScanner type in public API
  • crates/fluss/src/client/table/log_fetch_buffer.rs: Extended CompletedFetch trait with fetch_batches() method and added next_fetched_batch() helper for direct batch extraction
  • crates/fluss/src/record/arrow.rs: Added LogRecordBatch::record_batch() method for efficient batch-level access without row iteration
  • crates/fluss/tests/integration/table.rs: Added 5 comprehensive integration tests covering basic functionality, empty results, projection, order preservation, and offset continuation

luoyuxia (Contributor) left a comment

@fresh-borzoni Thanks for the pr. Only left minor comments. PTAL

fresh-borzoni (Contributor, Author) commented

@luoyuxia Thanks for the review.
Pushed changes, PTAL 👍

@fresh-borzoni fresh-borzoni requested a review from luoyuxia January 9, 2026 15:48
luoyuxia (Contributor) left a comment

@fresh-borzoni Thanks for quick update. LGTM!

@luoyuxia luoyuxia merged commit 6962c73 into apache:main Jan 10, 2026
13 checks passed
Development

Successfully merging this pull request may close these issues.

support scan log records as record batch directly
