
Refactor pipeline execution for async steps #20

Merged
aisrael merged 4 commits into main from async-steps on Mar 4, 2026

@aisrael aisrael commented Mar 3, 2026

Summary

  • Convert the pipeline Step trait to async (async_trait) and update command paths (convert, head, tail) plus tests to await step execution.
  • Move DataFrame read/write primitives into src/pipeline/dataframe.rs (renamed from data_frame_reader.rs) so DataFrameSource/DataFrameWriter live with pipeline code.
  • Refactor REPL execution to separate parse/eval from execution (eval now returns stages, then execute_pipeline runs them), and print the resolved pipeline stages before execution.
  • Update column selection flow to use richer column specs while preserving existing select behavior.
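To illustrate the shape of the first bullet, here is a minimal sketch of an async step and a caller that awaits it. It is not the code from this PR: the PR uses `async_trait`, while this sketch uses native `async fn` in traits (Rust 1.75+) so it compiles without dependencies. The `Step` signature, the `Head` and `Tail` structs, `head_then_tail`, and the small `block_on` helper are all illustrative assumptions.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for the pipeline's Step trait; the real signature may differ.
trait Step {
    async fn execute(&self, rows: Vec<String>) -> Result<Vec<String>, String>;
}

struct Head { n: usize }
struct Tail { n: usize }

impl Step for Head {
    // Keep the first `n` rows.
    async fn execute(&self, rows: Vec<String>) -> Result<Vec<String>, String> {
        Ok(rows.into_iter().take(self.n).collect())
    }
}

impl Step for Tail {
    // Keep the last `n` rows.
    async fn execute(&self, rows: Vec<String>) -> Result<Vec<String>, String> {
        let skip = rows.len().saturating_sub(self.n);
        Ok(rows.into_iter().skip(skip).collect())
    }
}

// Command paths now await each step instead of calling it synchronously.
async fn head_then_tail(rows: Vec<String>, head: usize, tail: usize) -> Result<Vec<String>, String> {
    let head_step = Head { n: head };
    let tail_step = Tail { n: tail };
    let rows = head_step.execute(rows).await?;
    tail_step.execute(rows).await
}

// Tiny busy-polling executor, only so the sketch runs without an async runtime.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let rows: Vec<String> = ["a", "b", "c", "d"].iter().map(|s| s.to_string()).collect();
    let out = block_on(head_then_tail(rows, 3, 2)).unwrap();
    println!("{:?}", out); // prints ["b", "c"]
}
```

In a real binary the `block_on` helper would be replaced by whatever async runtime the project already uses.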

Why

These changes align pipeline step execution with async DataFusion usage, reduce cross-module coupling between CLI and pipeline internals, and make REPL execution flow clearer for future async stage expansion.
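The REPL split described above (eval returns stages, then `execute_pipeline` runs them) can be sketched roughly as follows. This is a simplified, synchronous toy, not the PR's implementation: the stage variants, the `read ... | head ...` input syntax, and the log-returning `execute_pipeline` are assumptions made up for illustration.

```rust
use std::fmt;

// Hypothetical stage set; the real PipelineStage enum has its own variants and payloads.
#[derive(Debug, Clone, PartialEq)]
enum PipelineStage {
    Read(String),
    Select(Vec<String>),
    Head(usize),
    Print,
}

impl fmt::Display for PipelineStage {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PipelineStage::Read(path) => write!(f, "read({path})"),
            PipelineStage::Select(cols) => write!(f, "select({})", cols.join(", ")),
            PipelineStage::Head(n) => write!(f, "head({n})"),
            PipelineStage::Print => write!(f, "print"),
        }
    }
}

// Parse/eval only: turn a line in a toy syntax into stages, with no side effects.
fn eval(line: &str) -> Result<Vec<PipelineStage>, String> {
    let mut stages = Vec::new();
    for part in line.split('|') {
        let mut words = part.split_whitespace();
        let stage = match (words.next(), words.next()) {
            (Some("read"), Some(path)) => PipelineStage::Read(path.to_string()),
            (Some("select"), Some(cols)) => {
                PipelineStage::Select(cols.split(',').map(str::to_string).collect())
            }
            (Some("head"), Some(n)) => {
                let n = n.parse::<usize>().map_err(|e| format!("bad head count: {e}"))?;
                PipelineStage::Head(n)
            }
            (Some("print"), None) => PipelineStage::Print,
            _ => return Err(format!("unrecognized stage: {}", part.trim())),
        };
        stages.push(stage);
    }
    Ok(stages)
}

// Execution only: report the resolved stages first, then walk them in order.
// Returns log lines instead of doing real work so the sketch stays testable.
fn execute_pipeline(stages: &[PipelineStage]) -> Vec<String> {
    let resolved: Vec<String> = stages.iter().map(|s| s.to_string()).collect();
    let mut log = vec![format!("pipeline: {}", resolved.join(" | "))];
    for stage in stages {
        log.push(format!("running {stage}"));
    }
    log
}

fn main() {
    match eval("read data.csv | select id,name | head 5 | print") {
        Ok(stages) => {
            for line in execute_pipeline(&stages) {
                println!("{line}");
            }
        }
        Err(e) => eprintln!("error: {e}"),
    }
}
```

Keeping `eval` free of side effects means a parse error is reported before any stage runs, and the stage list can be printed or inspected independently of execution.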

aisrael added 4 commits March 2, 2026 20:27
- Updated the `Step` trait to use async methods for execution, enhancing the pipeline's ability to handle asynchronous operations.
- Modified various pipeline steps (CSV, Avro, Parquet, ORC, JSON, XLSX, YAML) to implement the new async `execute` method.
- Adjusted command implementations in `convert`, `head`, `tail`, and other features to await the execution of steps, ensuring proper async behavior.
- Updated tests to await step execution and cover the async path across the supported file formats and commands.
- Added `async-trait` as a dependency in `Cargo.toml` to support asynchronous trait methods.
- Updated the REPL evaluation logic to construct and execute pipelines asynchronously, improving performance and responsiveness.
- Refactored the `PipelineStage` enum to include a `Print` stage and modified the `exec_select` method to handle column specifications more effectively.
- Enhanced the display functionality for pipeline stages, providing clearer output for users during REPL interactions.
- Updated the `PipelineStage` enum to replace `Vec<String>` with `Vec<ColumnSpec>` for column selection, enhancing type safety and clarity.
- Refactored the `exec_select` method and related functions to handle `ColumnSpec`, allowing for both exact and case-insensitive column matching.
- Introduced `resolve_column_specs` function to resolve column specifications against the schema, improving the selection logic.
- Updated tests to validate the new column selection behavior, ensuring robust functionality across various scenarios.
- Renamed `data_frame_reader` module to `dataframe` for consistency and clarity.
- Updated references in `pipeline.rs` and `convert.rs` to use the new `dataframe` module.
- Introduced `DataFrameSource` and `DataFrameWriter` structs in the new `dataframe.rs` file, encapsulating DataFrame reading and writing functionality.
- Enhanced the `read_to_batches` and `write_batches` functions to utilize the new DataFrame structures, improving modularity and readability.
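The column-spec resolution mentioned in the commits above (exact and case-insensitive matching against the schema) might look roughly like this. The names `ColumnSpec` and `resolve_column_specs` come from the commit messages, but the variant names, the use of a plain `&[&str]` as the schema, and the error type are assumptions for illustration only.

```rust
// Variant names are hypothetical; the commits only say both match modes exist.
#[derive(Debug, Clone, PartialEq)]
enum ColumnSpec {
    Exact(String),
    CaseInsensitive(String),
}

// Resolve each spec to the actual column name in the schema, preserving spec order.
// Fails on the first spec that matches no column.
fn resolve_column_specs(specs: &[ColumnSpec], schema: &[&str]) -> Result<Vec<String>, String> {
    specs
        .iter()
        .map(|spec| {
            let found = match spec {
                ColumnSpec::Exact(name) => schema.iter().find(|col| **col == name.as_str()),
                ColumnSpec::CaseInsensitive(name) => {
                    schema.iter().find(|col| col.eq_ignore_ascii_case(name))
                }
            };
            found
                .map(|col| col.to_string())
                .ok_or_else(|| format!("column not found: {spec:?}"))
        })
        .collect()
}

fn main() {
    let schema = ["id", "Name", "email"];
    let specs = [
        ColumnSpec::Exact("id".to_string()),
        ColumnSpec::CaseInsensitive("NAME".to_string()),
    ];
    println!("{:?}", resolve_column_specs(&specs, &schema));
    // prints Ok(["id", "Name"])
}
```

Resolving to the schema's own spelling means downstream stages always see the real column name, whichever match mode the user typed.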
@aisrael aisrael changed the title from "Refactor pipeline steps to support async execution" to "Refactor pipeline execution for async steps" Mar 4, 2026
@aisrael aisrael merged commit f599ed8 into main Mar 4, 2026
4 checks passed