@OussamaSaoudi OussamaSaoudi commented Nov 19, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

How was this change tested?

This was referenced Nov 19, 2025
@OussamaSaoudi changed the title from "add leaf reader" to "feat: Leaf Checkpoint Reader" on Nov 19, 2025
Comment on lines +21 to +25
/// # Distributability
///
/// This phase is designed to be distributable. To distribute:
/// 1. Partition `files` across N executors
/// 2. Create N `LeafCheckpointReader` instances, one per executor with its file partition

This seems like an odd place to bring up distributability. The log replay phase that consumes this reader is the thing that gets distributed, and it is the part that needs the careful choreography. This reader seems like just a detail in that bigger picture.
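
For reference, step 1 of the quoted doc comment (partitioning `files` across N executors) might look like the sketch below, wherever that documentation ends up living. `partition_files` is a hypothetical helper, not something from this PR, and it assumes contiguous, near-equal chunks are good enough; a real planner might balance by file size instead.

```rust
/// Split `files` into at most `n` contiguous, near-equal partitions.
/// Hypothetical helper for illustration only; not part of the PR.
fn partition_files<T: Clone>(files: &[T], n: usize) -> Vec<Vec<T>> {
    if n == 0 || files.is_empty() {
        return Vec::new();
    }
    // Ceiling division, so that no more than `n` chunks are produced.
    let chunk_size = (files.len() + n - 1) / n;
    files.chunks(chunk_size).map(|chunk| chunk.to_vec()).collect()
}
```

Each resulting partition would then be handed to its own reader instance, as step 2 of the doc comment describes.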

let actions = engine
    .parquet_handler()
    .read_parquet_files(&files, schema, None)?
    .map(|batch| batch.map(|b| ActionsBatch::new(b, false)));

Now that we're splitting up the phases instead of chaining everything into one big iterator, we should probably reconsider whether we still need these (batch, bool) pairs -- I suspect that whoever invokes the dedup visitor will now know, structurally, whether it's a log or checkpoint batch, and can just pass true or false accordingly?
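
A minimal sketch of what "structurally known" could mean here. Every name below (`Batch`, `Dedup`, `replay_commits`, `replay_checkpoint`) is a toy stand-in rather than the kernel's actual types, and the dedup rule (only log batches record what they have seen) is an assumption of the sketch. The point is only that each phase passes a constant, so the batches themselves no longer need to carry a `bool`.

```rust
use std::collections::HashSet;

/// Toy stand-in for a batch of actions: just a list of file paths.
/// All names in this sketch are hypothetical.
struct Batch(Vec<String>);

#[derive(Default)]
struct Dedup {
    seen: HashSet<String>,
}

impl Dedup {
    /// Returns the paths that were not seen before. In this sketch only
    /// log batches record what they saw; checkpoint batches are filtered
    /// against the set but do not add to it.
    fn visit(&mut self, batch: &Batch, is_log_batch: bool) -> Vec<String> {
        let mut fresh = Vec::new();
        for path in &batch.0 {
            if self.seen.contains(path) {
                continue;
            }
            if is_log_batch {
                self.seen.insert(path.clone());
            }
            fresh.push(path.clone());
        }
        fresh
    }
}

/// The commit phase knows it only ever sees log batches.
fn replay_commits(dedup: &mut Dedup, batches: &[Batch]) -> Vec<String> {
    batches.iter().flat_map(|b| dedup.visit(b, true)).collect()
}

/// The checkpoint phase knows it only ever sees checkpoint batches.
fn replay_checkpoint(dedup: &mut Dedup, batches: &[Batch]) -> Vec<String> {
    batches.iter().flat_map(|b| dedup.visit(b, false)).collect()
}
```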

Comment on lines +104 to +107
if log_segment.checkpoint_parts.is_empty() {
    println!("Test table has no checkpoint parts, skipping");
    return Ok(());
}

This seems very strange to have in a test. Either the test case has a checkpoint or it doesn't.

And in this case it should have a checkpoint, and panicking at L110 below is a perfectly reasonable test failure mode if the checkpoint somehow went missing.
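
One way to make that failure mode explicit, as a sketch: replace the early return with a lookup that panics with a descriptive message. `first_checkpoint_part` is a hypothetical helper, and the `&[String]` parameter is an assumption standing in for whatever type `checkpoint_parts` actually holds.

```rust
/// Hypothetical test helper: return the first checkpoint part, or fail
/// the test loudly if the fixture has none.
fn first_checkpoint_part(parts: &[String]) -> &str {
    parts
        .first()
        .map(String::as_str)
        .expect("test table must contain at least one checkpoint part")
}
```

With this shape, a fixture that silently lost its checkpoint shows up as a failed test instead of a passing one that checked nothing.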

ManifestPhase::new(manifest_file, log_segment.log_root.clone(), engine.clone())?;

// Drain manifest phase and apply processor
for batch in manifest_phase.by_ref() {

aside: TIL that Iterator::by_ref is a thing!
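
For anyone else who had not run into it: `Iterator::by_ref` returns `&mut Self`, and a mutable reference to an iterator is itself an iterator, so a `for` loop can borrow the iterator instead of moving it. A small standalone sketch (`split_drain` is a made-up function, unrelated to the PR's types):

```rust
/// Consume the first `n` items through a borrow, then keep using the
/// same iterator for the rest. Illustrative only.
fn split_drain(mut iter: impl Iterator<Item = u32>, n: usize) -> (Vec<u32>, Vec<u32>) {
    let mut head = Vec::new();
    // `by_ref()` lends `iter` to the adaptor chain instead of moving it.
    for item in iter.by_ref().take(n) {
        head.push(item);
    }
    // `iter` is still owned here and resumes where the loop stopped.
    let tail: Vec<u32> = iter.collect();
    (head, tail)
}
```

Presumably that is why the test drains `manifest_phase.by_ref()`: the phase value remains available after the loop.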

Comment on lines +118 to +119
let batch = batch?;
processor.process_actions_batch(batch)?;

nit:

Suggested change
- let batch = batch?;
- processor.process_actions_batch(batch)?;
+ processor.process_actions_batch(batch?)?;

(not sure which way is more readable?)
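
On the readability question, the two spellings behave identically, including which error surfaces first. A toy comparison, where `process` is a hypothetical stand-in for `process_actions_batch` and the batches are plain integers:

```rust
/// Hypothetical stand-in for `process_actions_batch`: rejects odd values.
fn process(batch: u32) -> Result<u32, String> {
    if batch % 2 == 0 {
        Ok(batch)
    } else {
        Err(format!("odd batch: {batch}"))
    }
}

/// Two-step form: unwrap the read result, then process it.
fn sum_two_step(batches: Vec<Result<u32, String>>) -> Result<u32, String> {
    let mut total = 0;
    for batch in batches {
        let batch = batch?;
        total += process(batch)?;
    }
    Ok(total)
}

/// Inline form: the inner `?` handles the read error, the outer `?`
/// handles the processing error.
fn sum_inline(batches: Vec<Result<u32, String>>) -> Result<u32, String> {
    let mut total = 0;
    for batch in batches {
        total += process(batch?)?;
    }
    Ok(total)
}
```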

let mut sidecar_file_paths = Vec::new();
let mut batch_count = 0;

while let Some(result) = sidecar_phase.next() {

Isn't this just

Suggested change
- while let Some(result) = sidecar_phase.next() {
+ for result in sidecar_phase {
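
One caveat worth sketching: `for result in sidecar_phase` moves the iterator into the loop, so if the test reads anything from `sidecar_phase` afterwards, the loop would need to borrow it instead (`&mut sidecar_phase` or `.by_ref()`). Whether the test does that is not visible in the quoted lines. In the toy example below, `Countdown` and `drain_and_report` are hypothetical stand-ins, not the PR's types:

```rust
/// Toy iterator that carries state worth reading after the loop.
struct Countdown {
    remaining: u32,
    yielded: u32,
}

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        self.yielded += 1;
        Some(self.remaining)
    }
}

fn drain_and_report(mut phase: Countdown) -> (Vec<u32>, u32) {
    let mut items = Vec::new();
    // Borrowing `for` loop: same behavior as
    // `while let Some(item) = phase.next()`, and `phase` is not moved.
    for item in &mut phase {
        items.push(item);
    }
    // Still accessible because the loop only borrowed it.
    (items, phase.yielded)
}
```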
