feature - Added a Pandas based Transformation and BaseTransformation by dannymeijer · Pull Request #141 · Nike-Inc/koheesio

dannymeijer · 2024-12-01T11:48:07Z

Summary

This PR enables Spark-independent ML workflows in Koheesio by adding pandas-based transformations and an ml extra dependency. Users can now build type-safe ML pipelines without requiring PySpark, while maintaining full compatibility with Spark-based workflows when needed.

Key Features

1. ML Extra Dependency

Install with koheesio[ml] for ML workflows without Spark, or koheesio[ml,pyspark] for both.

Includes:

pandas
numpy >= 1.21.5
scikit-learn >= 1.1.1
scipy >= 1.9.1

2. Pandas Transformations

New koheesio.pandas.transformations module
Same API as Spark transformations for consistency
Full Pydantic validation
Composable via pandas' DataFrame.pipe() by passing a transformation's .transform method
Works completely independently of PySpark
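
To illustrate the .pipe() composition mentioned above, here is a minimal sketch that uses only pandas. The add_one function is a hypothetical stand-in for a transformation's .transform method and is not part of koheesio.

```python
import pandas as pd


def add_one(df: pd.DataFrame, source: str, target: str) -> pd.DataFrame:
    """Return a copy of df with `target` set to `source` + 1 (hypothetical helper)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out[target] = out[source] + 1
    return out


df = pd.DataFrame({"old_column": [0, 1, 2]})

# DataFrame.pipe(func, *args) calls func(df, *args), so steps chain left to right
result = df.pipe(add_one, "old_column", "foo").pipe(add_one, "foo", "bar")
```

Because each step returns a new DataFrame, the original df keeps only its "old_column" column.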

3. Spark-Optional Pandas Import

New koheesio.utils.pandas module provides pandas import utilities
Validates PySpark compatibility only when PySpark is available
Allows pandas functionality to work standalone
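
The Spark-optional import described above could look roughly like the following sketch. It is not the code in src/koheesio/utils/pandas.py: the function name import_with_optional_check and the is_compatible predicate are assumptions made for illustration, and the actual compatibility rule is left to the caller because the PR description does not state it.

```python
from importlib import import_module
from types import ModuleType
from typing import Callable


def import_with_optional_check(
    name: str,
    optional_peer: str,
    is_compatible: Callable[[ModuleType, ModuleType], bool],
) -> ModuleType:
    """Import `name`; validate it against `optional_peer` only if the peer is installed."""
    module = import_module(name)  # a missing primary module is a genuine error
    try:
        peer = import_module(optional_peer)
    except ImportError:
        # The peer is absent: nothing to validate, `name` works standalone.
        return module
    else:
        # The peer is present. An ImportError raised here is intentional and
        # is not swallowed, because the `else` body is not covered by `except`.
        if not is_compatible(module, peer):
            raise ImportError(
                f"{name} is not compatible with the installed {optional_peer}"
            )
        return module
```

Under these assumptions, a call such as import_with_optional_check("pandas", "pyspark", some_check) would return pandas unchanged when PySpark is missing and raise ImportError only when both are installed and the check fails.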

4. Comprehensive Documentation

Getting started guide with proper core framework positioning
ML pipeline examples without Spark
Pandas transformation API reference
Testing and development guides

Related Issue

Closes #79

Motivation and Context

Primary Goal: Enable ML workflows without Spark dependency

Build lightweight ML pipelines for local development
Create type-safe feature engineering workflows
Develop ML services without Spark overhead
Prototype before scaling to distributed processing

Bonus: DBR ML runtime compatibility (DBR 13, 14, 15)

Testing & Validation

✅ Completed

Fixed all failing tests (3 tests)
- test_import_pandas_based_on_pyspark_version (2 parameterized cases)
- test_execute_w_dbtable_and_query
Tested koheesio core without [ml] extra (108 tests passed)
Tested pandas transformations without [pyspark] (3 tests passed)
Created and validated ML pipeline example without Spark
All unit tests pass (762+ tests)

🔄 Pending (Optional)

Run integration tests on DBR 13, 14, 15 ML runtimes

Technical Details

Files Changed (22 files, +1,070/-214 lines)

New Core Files:

src/koheesio/utils/pandas.py - Spark-optional pandas import utilities
src/koheesio/models/dataframe.py - DataFrame abstraction layer
src/koheesio/pandas/transformations/__init__.py - Pandas transformation classes

Examples:

examples/ml_without_spark/simple_ml_pipeline.py - Complete ML pipeline example
examples/ml_without_spark/README.md - Usage guide

Documentation:

Updated README with concise ML integration docs
Enhanced getting-started guide
Added pandas API reference
Created comprehensive testing guide

Configuration:

pyproject.toml - Added ml extra and test matrix
.github/workflows/test.yml - CI/CD for ML tests

Key Implementation Details

Pandas Import Logic (src/koheesio/utils/pandas.py)
- Uses try-except-else pattern to handle optional PySpark
- Only validates compatibility when PySpark is present
- Raises ImportError for version mismatches

Deprecation (src/koheesio/spark/utils/common.py)
- Added proper deprecation warning to import_pandas_based_on_pyspark_version()
- Uses warnings.warn() with DeprecationWarning and stacklevel=2
- Delegates to new core utility for consistency

Backward Compatibility
- All existing Spark functionality unchanged
- No breaking changes to APIs
- Optional dependencies only installed with extras
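
The deprecation approach listed above (warn, then delegate) can be sketched as follows. The names old_utility and new_utility are hypothetical placeholders, not the functions in koheesio.

```python
import warnings


def new_utility() -> str:
    """Stand-in for the new core utility (hypothetical)."""
    return "pandas"


def old_utility() -> str:
    """Deprecated wrapper kept for backward compatibility (hypothetical)."""
    warnings.warn(
        "old_utility() is deprecated; use new_utility() instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    # Delegate so the old and new entry points cannot drift apart
    return new_utility()
```

With stacklevel=2 the reported file and line are those of the code calling old_utility(), which is usually where the fix needs to happen.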

Types of Changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist

My code follows the code style of this project
My change requires a change to the documentation
I have updated the documentation accordingly
I have read the CONTRIBUTING document
I have added tests to cover my changes
All new and existing tests passed

Installation Examples

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern

# For ML workflows without Spark
pip install "koheesio[ml]"

# For ML workflows with Spark
pip install "koheesio[ml,pyspark]"

# For just Spark (no ML)
pip install "koheesio[pyspark]"

# Core only (for libraries, API integrations, etc.)
pip install koheesio
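
After installing, one way to confirm which optional packages are importable is a small standard-library check like the sketch below. The missing_modules helper is hypothetical, and the list of names assumes the ml extra pulls in the packages listed earlier (scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names of the packages listed for the ml extra (assumption based on the list above)
ML_MODULES = ("pandas", "numpy", "sklearn", "scipy")


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ML_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

find_spec only locates a module without importing it, so the check stays cheap even for large packages.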

Usage Example

import pandas as pd

from koheesio.pandas.transformations import Transformation


class AddOne(Transformation):
    """Add a column holding the values of "old_column" incremented by one."""

    target_column: str = "new_column"

    def execute(self):
        # Work on a copy so the input DataFrame is left untouched
        self.output.df = self.df.copy()
        self.output.df[self.target_column] = self.df["old_column"] + 1


# Use with pandas DataFrames
df = pd.DataFrame({"old_column": [0, 1, 2]})
result = AddOne(target_column="incremented").transform(df)

# Or chain with .pipe(); each step adds its own column
result = (
    df.pipe(AddOne(target_column="foo").transform)
    .pipe(AddOne(target_column="bar").transform)
)

See examples/ml_without_spark/ for complete working examples.

dannymeijer · 2024-12-01T11:48:50Z

Marking this as blocked based on the To Do's I mentioned

dannymeijer · 2024-12-01T11:52:37Z

pyproject.toml

Newly added is ml - I sorted the features afterwards

- Fix pandas import logic to properly handle PySpark version compatibility
- Use try-except-else pattern to avoid catching intentional ImportErrors
- Allow pandas to work independently when PySpark is not installed
- Add proper deprecation warning to import_pandas_based_on_pyspark_version()
- Use warnings.warn() with DeprecationWarning and stacklevel=2
- Move imports to top-level for consistency with codebase
- Add Sphinx deprecation directive to docstring

Improve documentation
- Add concise ML integration documentation to README
- Properly highlight core framework value in getting-started guide
- Create comprehensive ML pipeline example without Spark
- Add pandas transformation API reference documentation

Validation
- All previously failing tests now pass (3 tests fixed)
- Core works without [ml] extra (108 tests)
- Pandas transformations work without PySpark (3 tests)
- Example ML pipeline runs successfully

Resolves test failures in:
- tests/spark/test_spark_utils.py::test_import_pandas_based_on_pyspark_version
- tests/spark/readers/test_jdbc.py::TestJdbcReader::test_execute_w_dbtable_and_query

dannymeijer · 2025-11-29T14:41:48Z

Feature is now unblocked. Marking this ready for review.

Added a Pandas based Transformtion class after abstracting Transformation to a BaseTransformation

356ea1e

dannymeijer requested a review from a team as a code owner December 1, 2024 11:48

dannymeijer linked an issue Dec 1, 2024 that may be closed by this pull request

[FEATURE] DBR ML Support #79

Open

dannymeijer added this to the 0.10.0 milestone Dec 1, 2024

dannymeijer added the enhancement and blocked labels Dec 1, 2024

dannymeijer self-assigned this Dec 1, 2024

Merge branch 'main' into feature/79-dbr-ml-support

cddc781

dannymeijer changed the title ~~[FEATURE] Added a Pandas based Transformtion and BaseTransformation~~ [FEATURE] Added a Pandas based Transformation and BaseTransformation Dec 1, 2024

dannymeijer removed the blocked label Dec 9, 2024

dannymeijer and others added 3 commits December 16, 2024 11:47

Merge branch 'main' into feature/79-dbr-ml-support

f922e82

Merge branch 'main' into feature/79-dbr-ml-support

f8f9d9b

[ENHANCEMENT] Update GitHub Actions workflow to include ML tests and refactor existing test jobs

00d7e76

dannymeijer marked this pull request as draft December 20, 2024 11:16

dannymeijer modified the milestones: 0.10, 0.11 Feb 20, 2025

dannymeijer added 5 commits October 11, 2025 18:04

Merge branch 'feature/79-dbr-ml-support' of personal.github.com:Nike-Inc/koheesio into feature/79-dbr-ml-support

9be6625

Fix BaseTransformation import paths in pandas and spark transformations

ead14a8

- Updated import from koheesio.models.transformation to koheesio.models.dataframe
- This resolves import errors for BaseTransformation class

Resolve merge conflicts from main

d9ece51

- Updated version to 0.11.0a0 from main
- Kept BaseTransformation import and updated field_validator to model_validator
- Resolved conflicts in __about__.py and spark transformations __init__.py

Merge remote-tracking branch 'origin/main' into feature/79-dbr-ml-support

5519d0f

dannymeijer marked this pull request as ready for review November 29, 2025 14:41

dannymeijer changed the title ~~[FEATURE] Added a Pandas based Transformation and BaseTransformation~~ feature - Added a Pandas based Transformation and BaseTransformation Nov 29, 2025

dannymeijer wants to merge 10 commits into main from feature/79-dbr-ml-support
