feature - Added a Pandas based Transformation and BaseTransformation #141

Open
dannymeijer wants to merge 10 commits into main from feature/79-dbr-ml-support
@dannymeijer commented Dec 1, 2024

Summary

This PR enables Spark-independent ML workflows in Koheesio by adding pandas-based transformations and an ml extra dependency. Users can now build type-safe ML pipelines without requiring PySpark, while maintaining full compatibility with Spark-based workflows when needed.

Key Features

1. ML Extra Dependency

Install with koheesio[ml] for ML workflows without Spark, or koheesio[ml,pyspark] for both.

Includes:

  • pandas
  • numpy >= 1.21.5
  • scikit-learn >= 1.1.1
  • scipy >= 1.9.1

2. Pandas Transformations

  • New koheesio.pandas.transformations module
  • Same API as Spark transformations for consistency
  • Full Pydantic validation
  • Composable with .pipe() method
  • Works completely independently of PySpark

3. Spark-Optional Pandas Import

  • New koheesio.utils.pandas module provides pandas import utilities
  • Validates PySpark compatibility only when PySpark is available
  • Allows pandas functionality to work standalone
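
The Spark-optional import described above could be sketched roughly as follows. This is a minimal illustration, not the code in src/koheesio/utils/pandas.py: the names `import_pandas` and `is_compatible`, the module-name parameters, and the specific version rule (PySpark below 3.4 requires pandas below 2) are all assumptions made for the example.

```python
import importlib
from types import ModuleType


def is_compatible(pandas_version: str, pyspark_version: str) -> bool:
    """Hypothetical rule for illustration: PySpark < 3.4 needs pandas < 2."""
    pandas_major = int(pandas_version.split(".")[0])
    spark_major, spark_minor = (int(p) for p in pyspark_version.split(".")[:2])
    if (spark_major, spark_minor) < (3, 4):
        return pandas_major < 2
    return True


def import_pandas(pandas_module: str = "pandas", pyspark_module: str = "pyspark") -> ModuleType:
    """Import pandas; validate against PySpark only if PySpark is installed."""
    # A missing pandas raises ImportError here and propagates to the caller.
    pd = importlib.import_module(pandas_module)
    try:
        spark = importlib.import_module(pyspark_module)
    except ImportError:
        # PySpark is absent, so there is nothing to validate against.
        return pd
    else:
        # Raised outside the try body, so the except clause above
        # cannot swallow this intentional ImportError.
        if not is_compatible(pd.__version__, spark.__version__):
            raise ImportError(
                f"pandas {pd.__version__} is not compatible with PySpark {spark.__version__}"
            )
        return pd
```

Passing the module names as parameters is only a convenience for exercising both branches without installing either package; a real utility would likely hard-code them.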

4. Comprehensive Documentation

  • Getting started guide with proper core framework positioning
  • ML pipeline examples without Spark
  • Pandas transformation API reference
  • Testing and development guides

Related Issue

Closes #79

Motivation and Context

Primary Goal: Enable ML workflows without Spark dependency

  • Build lightweight ML pipelines for local development
  • Create type-safe feature engineering workflows
  • Develop ML services without Spark overhead
  • Prototype before scaling to distributed processing

Bonus: DBR ML runtime compatibility (DBR 13, 14, 15)

Testing & Validation

✅ Completed

  • Fixed all failing tests (3 tests)
    • test_import_pandas_based_on_pyspark_version (2 parameterized cases)
    • test_execute_w_dbtable_and_query
  • Tested koheesio core without [ml] extra (108 tests passed)
  • Tested pandas transformations without [pyspark] (3 tests passed)
  • Created and validated ML pipeline example without Spark
  • All unit tests pass (762+ tests)

🔄 Pending (Optional)

  • Run integration tests on DBR 13, 14, 15 ML runtimes

Technical Details

Files Changed (22 files, +1,070/-214 lines)

New Core Files:

  • src/koheesio/utils/pandas.py - Spark-optional pandas import utilities
  • src/koheesio/models/dataframe.py - DataFrame abstraction layer
  • src/koheesio/pandas/transformations/__init__.py - Pandas transformation classes

Examples:

  • examples/ml_without_spark/simple_ml_pipeline.py - Complete ML pipeline example
  • examples/ml_without_spark/README.md - Usage guide

Documentation:

  • Updated README with concise ML integration docs
  • Enhanced getting-started guide
  • Added pandas API reference
  • Created comprehensive testing guide

Configuration:

  • pyproject.toml - Added ml extra and test matrix
  • .github/workflows/test.yml - CI/CD for ML tests

Key Implementation Details

  1. Pandas Import Logic (src/koheesio/utils/pandas.py)

    • Uses try-except-else pattern to handle optional PySpark
    • Only validates compatibility when PySpark is present
    • Raises ImportError for version mismatches
  2. Deprecation (src/koheesio/spark/utils/common.py)

    • Added proper deprecation warning to import_pandas_based_on_pyspark_version()
    • Uses warnings.warn() with DeprecationWarning and stacklevel=2
    • Delegates to new core utility for consistency
  3. Backward Compatibility

    • All existing Spark functionality unchanged
    • No breaking changes to APIs
    • Optional dependencies only installed with extras
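
The deprecation approach from item 2 can be illustrated with a small stand-alone sketch. The wrapper name comes from this PR; the `import_pandas` stand-in, its return value, the warning text, and the `<version>` placeholder in the docstring are assumptions, not the actual implementation.

```python
import warnings


def import_pandas():
    """Stand-in for the new core utility (hypothetical name and return value)."""
    return "pandas module placeholder"


def import_pandas_based_on_pyspark_version():
    """Legacy entry point kept for backward compatibility.

    .. deprecated:: <version>
        Use the core pandas import utility instead.
    """
    warnings.warn(
        "import_pandas_based_on_pyspark_version() is deprecated; "
        "use the core pandas import utility instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    # Delegate so both entry points share a single code path.
    return import_pandas()
```

With `stacklevel=2`, the reported file and line are those of the code that called the deprecated function, which is usually what a user needs in order to find and update the call site.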

Types of Changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have read the CONTRIBUTING document
  • I have added tests to cover my changes
  • All new and existing tests passed

Installation Examples

# For ML workflows without Spark
pip install "koheesio[ml]"

# For ML workflows with Spark
pip install "koheesio[ml,pyspark]"

# For just Spark (no ML)
pip install "koheesio[pyspark]"

# Core only (for libraries, API integrations, etc.)
pip install koheesio
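
To confirm which optional packages ended up in an environment after one of the installs above, a quick standard-library check along these lines may help. The helper names `is_installed` and `report` are made up for this sketch; the module list uses the import names of the packages listed under the ml extra (scikit-learn imports as `sklearn`), plus `pyspark`.

```python
import importlib.util

# Import names for the packages listed under the ml extra, plus pyspark.
OPTIONAL_MODULES = ["pandas", "numpy", "sklearn", "scipy", "pyspark"]


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report(modules=OPTIONAL_MODULES) -> dict:
    """Map each module name to whether it is available in this environment."""
    return {name: is_installed(name) for name in modules}


if __name__ == "__main__":
    for name, present in report().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

In a core-only environment every entry may report as missing; after installing the ml extra, everything except `pyspark` would be expected to show as installed.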

Usage Example

import pandas as pd

from koheesio.pandas.transformations import Transformation


class AddOne(Transformation):
    """Write old_column + 1 into target_column."""

    target_column: str = "new_column"

    def execute(self):
        # Work on a copy so the input DataFrame is left untouched
        self.output.df = self.df.copy()
        self.output.df[self.target_column] = self.df["old_column"] + 1


# Use with pandas DataFrames
df = pd.DataFrame({"old_column": [0, 1, 2]})
result = AddOne(target_column="incremented").transform(df)

# Or chain with .pipe()
result = (
    df.pipe(AddOne(target_column="foo").transform)
    .pipe(AddOne(target_column="bar").transform)
)

See examples/ml_without_spark/ for complete working examples.

@dannymeijer requested a review from a team as a code owner December 1, 2024 11:48
@dannymeijer linked an issue Dec 1, 2024 that may be closed by this pull request
@dannymeijer added this to the 0.10.0 milestone Dec 1, 2024
@dannymeijer added the enhancement and blocked labels Dec 1, 2024
@dannymeijer self-assigned this Dec 1, 2024
@dannymeijer (Member, Author) commented

Marking this as blocked based on the To Do's I mentioned

@dannymeijer (Member, Author) commented

Newly added is ml - I sorted the features afterwards

@dannymeijer changed the title from "[FEATURE] Added a Pandas based Transformtion and BaseTransformation" to "[FEATURE] Added a Pandas based Transformation and BaseTransformation" Dec 1, 2024
@dannymeijer marked this pull request as draft December 20, 2024 11:16
@dannymeijer changed the milestone from 0.10 to 0.11 Feb 20, 2025
- Updated import from koheesio.models.transformation to koheesio.models.dataframe
- This resolves import errors for BaseTransformation class
- Updated version to 0.11.0a0 from main
- Kept BaseTransformation import and updated field_validator to model_validator
- Resolved conflicts in __about__.py and spark transformations __init__.py
- Fix pandas import logic to properly handle PySpark version compatibility
  - Use try-except-else pattern to avoid catching intentional ImportErrors
  - Allow pandas to work independently when PySpark is not installed

- Add proper deprecation warning to import_pandas_based_on_pyspark_version()
  - Use warnings.warn() with DeprecationWarning and stacklevel=2
  - Move imports to top-level for consistency with codebase
  - Add Sphinx deprecation directive to docstring

- Improve documentation
  - Add concise ML integration documentation to README
  - Properly highlight core framework value in getting-started guide
  - Create comprehensive ML pipeline example without Spark
  - Add pandas transformation API reference documentation

- Validation
  - All previously failing tests now pass (3 tests fixed)
  - Core works without [ml] extra (108 tests)
  - Pandas transformations work without PySpark (3 tests)
  - Example ML pipeline runs successfully

Resolves test failures in:
- tests/spark/test_spark_utils.py::test_import_pandas_based_on_pyspark_version
- tests/spark/readers/test_jdbc.py::TestJdbcReader::test_execute_w_dbtable_and_query
@dannymeijer (Member, Author) commented

Feature is now unblocked. Marking this ready for review.

@dannymeijer marked this pull request as ready for review November 29, 2025 14:41
@dannymeijer changed the title from "[FEATURE] Added a Pandas based Transformation and BaseTransformation" to "feature - Added a Pandas based Transformation and BaseTransformation" Nov 29, 2025

Labels

enhancement New feature or request

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

[FEATURE] DBR ML Support

1 participant