Investigate and propose solution to performance bottleneck in KedroSession #5276

@lrcouto

Follow up to #5247

During the spike, we profiled a performance bottleneck in KedroSession, described in detail here: #5247 (comment)

We also investigated KedroContext and the load_context() method during this spike. They likewise load the entire context into memory before the session run starts, but this is less likely to cause significant slowness in a typical use case, so we decided to focus on KedroSession initially.

KedroSession retrieves pipelines through the global pipelines object from kedro.framework.project, which is an instance of _ProjectPipelines. This object is designed as a lazily loaded, dict-like interface that defers pipeline loading until first access. However, the current lazy-loading implementation eagerly loads all pipelines on first access to any pipeline key.


The current behavior is the following:

  • _ProjectPipelines starts empty and uninitialized.
  • bootstrap_project() calls configure_project(), which sets _pipelines_module but does not load data.
  • When KedroSession.run() accesses pipelines[name], the __getitem__ call is wrapped by _load_data_wrapper.
  • _load_data() imports the pipelines registry module and calls register_pipelines().
  • register_pipelines(), by default, calls find_pipelines(), which constructs all pipelines and returns a full dictionary.
  • The entire pipelines dictionary is loaded into memory before the requested pipeline is returned.

This means that even when a single pipeline is requested, all pipelines are instantiated first.
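The steps above can be reproduced with a small stand-alone model. This is only a sketch: ToyProjectPipelines, make_pipeline, and the pipeline names are hypothetical stand-ins, and the wrapper imitates the behavior described in this issue rather than copying Kedro's actual code.

```python
import functools

constructed = []  # records every pipeline that gets built


def make_pipeline(name):
    """Hypothetical stand-in for an expensive pipeline constructor."""
    constructed.append(name)
    return f"<Pipeline {name}>"


def register_pipelines():
    """Stand-in for a registry that builds every pipeline up front."""
    return {name: make_pipeline(name) for name in ("ingest", "train", "report")}


def load_data_wrapper(func):
    """Ensure the data is loaded before any dict-like access."""
    @functools.wraps(func)
    def inner(self, *args, **kwargs):
        self._load_data()
        return func(self, *args, **kwargs)
    return inner


class ToyProjectPipelines:
    """Simplified model of a lazily loaded, dict-like pipelines object."""

    def __init__(self):
        self._is_data_loaded = False
        self._content = {}

    def _load_data(self):
        if self._is_data_loaded:
            return
        # The whole registry is materialized here, regardless of the key
        # that triggered the load.
        self._content = register_pipelines()
        self._is_data_loaded = True

    @load_data_wrapper
    def __getitem__(self, key):
        return self._content[key]


pipelines = ToyProjectPipelines()
assert constructed == []  # nothing is built until first access

requested = pipelines["train"]
print(constructed)  # ['ingest', 'train', 'report']
```

Only "train" was requested, yet all three constructors ran. In this model the laziness defers the cost to first access but does not reduce it.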


The objective of this issue is to investigate in more depth how pipelines are loaded during KedroSession initialization and execution, identify where unnecessary overhead is introduced (particularly when only a single pipeline is requested), and evaluate whether the current lazy-loading design achieves its intended performance benefits.

The expected outcome is a concrete, well-scoped proposal for improving performance, potentially by enabling more granular or truly lazy pipeline loading, along with an assessment of trade-offs.
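As one possible direction for discussion, more granular loading could map each pipeline name to a factory and build a pipeline only when its key is accessed. The sketch below is not a proposed Kedro API: LazyPipelineRegistry and the factory helper are hypothetical names, and it assumes pipelines can be constructed independently of each other.

```python
from collections.abc import Mapping


class LazyPipelineRegistry(Mapping):
    """Hypothetical registry: name -> zero-argument factory, built on demand."""

    def __init__(self, factories):
        self._factories = dict(factories)
        self._cache = {}

    def __getitem__(self, name):
        # Build the requested pipeline on first access and cache it.
        if name not in self._cache:
            self._cache[name] = self._factories[name]()
        return self._cache[name]

    def __contains__(self, name):
        # Overridden so that membership checks do not trigger construction.
        return name in self._factories

    def __iter__(self):
        return iter(self._factories)

    def __len__(self):
        return len(self._factories)


built = []  # records which pipelines were actually constructed


def factory(name):
    """Return a hypothetical factory that records when it is called."""
    def build():
        built.append(name)
        return f"<Pipeline {name}>"
    return build


registry = LazyPipelineRegistry(
    {name: factory(name) for name in ("ingest", "train", "report")}
)

selected = registry["train"]
print(built)             # ['train']
print(sorted(registry))  # ['ingest', 'report', 'train']
```

The trade-offs are visible even in this toy version. Listing names stays cheap, but anything that iterates over values() or items() still builds every pipeline, so a registry that defines a combined default pipeline from all the others would see no benefit. It would also require registries to expose factories instead of an already-built dictionary, which is likely a compatibility concern.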

Related tickets - #2879

Labels: Issue: Feature Request (New feature or improvement to existing feature)

Status: In Review
