Extend support for Dagster partitions #46
Description
Currently, kedro-dagster provides experimental support for Dagster partitions via the DagsterPartitionedDataset Kedro dataset type. Only static Dagster partition definitions and basic partition mappings are supported. kedro-dagster handles the fanning out of nodes associated with partitioned datasets, which enables parallel processing, but this logic is quite complex. kedro-dagster translates Kedro datasets into Dagster assets materialized by jobs corresponding to their Kedro pipelines. Dagster offers a backfill feature that allows parallel partition materialization at the asset level, but as far as I understand there is no equivalent for Dagster jobs. In practice, parallel materialization of asset partitions at the job level would be extremely useful.
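To make the fan-out idea concrete, here is a minimal plain-Python sketch. It is not kedro-dagster's actual implementation: `NodeSpec`, `fan_out`, and the `name__key` naming scheme are hypothetical, and neither Kedro nor Dagster is imported. It assumes a static list of partition keys and a one-to-one correspondence between input and output partitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical stand-in for a Kedro node: a name plus dataset names."""

    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def fan_out(node, partition_keys, partitioned):
    """Return one copy of `node` per partition key.

    Dataset names listed in `partitioned` get a `__<key>` suffix so that each
    copy reads and writes its own partition; other datasets are shared as-is.
    """

    def rename(dataset, key):
        return f"{dataset}__{key}" if dataset in partitioned else dataset

    return [
        NodeSpec(
            name=f"{node.name}__{key}",
            inputs=tuple(rename(d, key) for d in node.inputs),
            outputs=tuple(rename(d, key) for d in node.outputs),
        )
        for key in partition_keys
    ]


# Illustrative data only: "raw" and "cleaned" are partitioned, "params" is not.
clean = NodeSpec("clean", inputs=("raw", "params"), outputs=("cleaned",))
fanned = fan_out(clean, ["2024-01", "2024-02"], partitioned={"raw", "cleaned"})
for n in fanned:
    print(n.name, n.inputs, n.outputs)
```

Each fanned-out node depends only on its own partition, so the copies are independent and could in principle run in parallel within a single job.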
Context
Full support for fanning out nodes involving various Dagster partition types and partition mappings would enable more robust workflows and better integration between Kedro and Dagster. This would allow users to utilize advanced partitioning features native to Dagster, and improve parallel processing options. Additionally, it may be preferable to move the fanning out logic to the Kedro pipeline level, either performed by Kedro directly or by kedro-dagster. This could simplify the current implementation and make it more maintainable.
Possible Implementation
- Extend support beyond static partitions to all Dagster partition definitions and mappings.
- Refactor the fanning out logic so that it operates at the Kedro pipeline level (performed either by the Kedro core library or by kedro-dagster).
- Ensure compatibility with downstream assets and partition propagation in all possible cases (there are many to consider).
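As a rough illustration of what supporting several mapping types at the pipeline level could involve, the sketch below models a partition mapping as a plain function from a downstream key to the upstream keys it needs. The three mappings are loosely inspired by Dagster's identity, all-partitions, and static mappings, but every name here (`identity_mapping`, `all_partitions_mapping`, `static_mapping`, `plan_dependencies`) is hypothetical and no Dagster API is used.

```python
from typing import Callable

# Hypothetical signature: (downstream_key, all_upstream_keys) -> upstream keys needed.
PartitionMapping = Callable[[str, list[str]], list[str]]


def identity_mapping(downstream_key, upstream_keys):
    """Each downstream partition depends on the upstream partition with the same key."""
    return [downstream_key] if downstream_key in upstream_keys else []


def all_partitions_mapping(downstream_key, upstream_keys):
    """Each downstream partition depends on every upstream partition."""
    return list(upstream_keys)


def static_mapping(downstream_by_upstream):
    """Build a mapping from an explicit upstream-key -> downstream-keys table."""

    def mapping(downstream_key, upstream_keys):
        return [
            k for k in upstream_keys
            if downstream_key in downstream_by_upstream.get(k, ())
        ]

    return mapping


def plan_dependencies(downstream_keys, upstream_keys, mapping):
    """For each downstream partition, list the upstream partitions it must wait for."""
    return {key: mapping(key, upstream_keys) for key in downstream_keys}


# Illustrative partition keys only.
upstream = ["us", "fr", "de"]
regional = static_mapping({"us": ["americas"], "fr": ["europe"], "de": ["europe"]})

same_key_plan = plan_dependencies(["us", "fr"], upstream, identity_mapping)
summary_plan = plan_dependencies(["summary"], upstream, all_partitions_mapping)
regional_plan = plan_dependencies(["americas", "europe"], upstream, regional)
print(regional_plan)
```

Keeping the mapping behind a single function signature means the fan-out code only needs the resulting dependency plan, whichever mapping type produced it.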
Possible Alternatives
There is an existing Kedro plugin that appears to implement pipeline-level partition fanning out: https://github.com/kedro-partitioned/kedro-partitioned. Based on its documentation, it:
- Extends Kedro’s partitioned data support with helpers and decorators to define and operate on partitions at the pipeline/node level.
- Fans out execution by expanding nodes across partitions, enabling parallel processing of partitioned inputs and synchronized handling of outputs.
- Provides guidance on partition-aware pipeline design, mapping between partitioned inputs and outputs, and how to structure parallelism safely and predictably (e.g., avoiding race conditions, coordinating write paths).
Having had a quick look, it seems it would add quite a lot of complexity. Also, I am not sure it is actively maintained.
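For reference, the "coordinating write paths" concern mentioned above can be illustrated without any plugin. The sketch below is a generic standard-library example, not the kedro-partitioned API: `process_partition` and the one-file-per-key layout are assumptions made for illustration. Each worker writes only to a path derived from its own partition key, so no two workers share a write target.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_partition(key, rows, out_dir):
    """Write one partition's result to a file that only this key owns."""
    target = out_dir / f"{key}.txt"
    target.write_text(str(sum(rows)))
    return key, target


# Illustrative data only: partition key -> rows belonging to that partition.
partitions = {"a": [1, 2, 3], "b": [10, 20], "c": [5]}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(process_partition, key, rows, out_dir)
            for key, rows in partitions.items()
        ]
        written = dict(f.result() for f in futures)
    # Read the results back before the temporary directory is removed.
    results = {key: int(path.read_text()) for key, path in written.items()}

print(results)
```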