feature - Added a Pandas based Transformation and BaseTransformation #141
Open
dannymeijer wants to merge 10 commits into main
…tion to a BaseTransformation
Marking this as blocked based on the |
dannymeijer commented Dec 1, 2024
The newly added feature is `ml`; I sorted the features afterwards.
…Inc/koheesio into feature/79-dbr-ml-support
- Updated import from `koheesio.models.transformation` to `koheesio.models.dataframe`
- This resolves import errors for the `BaseTransformation` class
- Updated version to 0.11.0a0 from main
- Kept `BaseTransformation` import and updated `field_validator` to `model_validator`
- Resolved conflicts in `__about__.py` and spark transformations `__init__.py`
- Fix pandas import logic to properly handle PySpark version compatibility
  - Use try-except-else pattern to avoid catching intentional ImportErrors
  - Allow pandas to work independently when PySpark is not installed
- Add proper deprecation warning to `import_pandas_based_on_pyspark_version()`
  - Use `warnings.warn()` with `DeprecationWarning` and `stacklevel=2`
  - Move imports to top-level for consistency with codebase
  - Add Sphinx deprecation directive to docstring
- Improve documentation
  - Add concise ML integration documentation to README
  - Properly highlight core framework value in getting-started guide
  - Create comprehensive ML pipeline example without Spark
  - Add pandas transformation API reference documentation
- Validation
  - All previously failing tests now pass (3 tests fixed)
  - Core works without `[ml]` extra (108 tests)
  - Pandas transformations work without PySpark (3 tests)
  - Example ML pipeline runs successfully

Resolves test failures in:
- `tests/spark/test_spark_utils.py::test_import_pandas_based_on_pyspark_version`
- `tests/spark/readers/test_jdbc.py::TestJdbcReader::test_execute_w_dbtable_and_query`
Feature is now unblocked. Marking this ready for review.
Summary
This PR enables Spark-independent ML workflows in Koheesio by adding pandas-based transformations and an `ml` extra dependency. Users can now build type-safe ML pipelines without requiring PySpark, while maintaining full compatibility with Spark-based workflows when needed.
Key Features
1. ML Extra Dependency
Install with `koheesio[ml]` for ML workflows without Spark, or `koheesio[ml,pyspark]` for both.
Includes:
2. Pandas Transformations
- `koheesio.pandas.transformations` module
- `.pipe()` method
3. Spark-Optional Pandas Import
- `koheesio.utils.pandas` module provides pandas import utilities
4. Comprehensive Documentation
Related Issue
Closes #79
Motivation and Context
Primary Goal: Enable ML workflows without Spark dependency
Bonus: DBR ML runtime compatibility (DBR 13, 14, 15)
Testing & Validation
✅ Completed
- `test_import_pandas_based_on_pyspark_version` (2 parameterized cases)
- `test_execute_w_dbtable_and_query`
- Core works without `[ml]` extra (108 tests passed)
- Pandas transformations work without `[pyspark]` (3 tests passed)
🔄 Pending (Optional)
Technical Details
Files Changed (22 files, +1,070/-214 lines)
New Core Files:
- `src/koheesio/utils/pandas.py` - Spark-optional pandas import utilities
- `src/koheesio/models/dataframe.py` - DataFrame abstraction layer
- `src/koheesio/pandas/transformations/__init__.py` - Pandas transformation classes
Examples:
- `examples/ml_without_spark/simple_ml_pipeline.py` - Complete ML pipeline example
- `examples/ml_without_spark/README.md` - Usage guide
Documentation:
Configuration:
- `pyproject.toml` - Added `ml` extra and test matrix
- `.github/workflows/test.yml` - CI/CD for ML tests
Key Implementation Details
Pandas Import Logic (`src/koheesio/utils/pandas.py`)
Deprecation (`src/koheesio/spark/utils/common.py`)
- `import_pandas_based_on_pyspark_version()`
- `warnings.warn()` with `DeprecationWarning` and `stacklevel=2`
Backward Compatibility
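The import and deprecation approach described in this PR (a try-except-else import guard, plus a deprecated entry point that keeps working while warning via `warnings.warn()` with `DeprecationWarning` and `stacklevel=2`) can be sketched roughly as below. This is a minimal illustration, not the code in `src/koheesio/utils/pandas.py` or `src/koheesio/spark/utils/common.py`: the names `import_with_optional_backend`, `legacy_import`, and the `check` callback are hypothetical, and generic module names stand in for pandas and PySpark so the sketch runs with the standard library alone.

```python
import importlib
import warnings


def import_with_optional_backend(target, backend, check=None):
    """Import `target`, validating against `backend` only when it is installed.

    Hypothetical helper for illustration; not part of Koheesio.
    """
    try:
        backend_module = importlib.import_module(backend)
    except ImportError:
        # Backend is not installed: nothing to validate, target works alone.
        pass
    else:
        # Backend is installed: run the compatibility check here. Because this
        # code sits in the else branch, an ImportError raised on purpose by
        # the check propagates instead of being swallowed by the except above.
        if check is not None:
            check(backend_module)
    return importlib.import_module(target)


def legacy_import(target, backend):
    """Hypothetical deprecated entry point that still works but warns."""
    warnings.warn(
        "legacy_import() is deprecated; use import_with_optional_backend()",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return import_with_optional_backend(target, backend)
```

Existing callers of the old entry point keep getting the same result; they only see a `DeprecationWarning` pointing at their own call site.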
Types of Changes
Checklist
Installation Examples
Usage Example
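As a rough idea of how a pandas transformation chained through `.pipe()` might look, here is a minimal sketch. It uses plain pandas and dataclasses only; `DropMissing` and `AddRatio` are hypothetical stand-ins and do not reflect the actual classes or signatures in `koheesio.pandas.transformations`.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class DropMissing:
    """Hypothetical transformation: drop rows with missing values in `columns`."""

    columns: list

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna(subset=self.columns)


@dataclass
class AddRatio:
    """Hypothetical transformation: add numerator / denominator as `target`."""

    numerator: str
    denominator: str
    target: str

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        # assign() returns a new DataFrame, so the input is left unchanged.
        ratio = df[self.numerator] / df[self.denominator]
        return df.assign(**{self.target: ratio})


raw = pd.DataFrame({"clicks": [10, 5, None], "views": [100, 50, 20]})

# Each step is a configured, callable object, so steps compose via .pipe().
features = (
    raw
    .pipe(DropMissing(columns=["clicks"]))
    .pipe(AddRatio(numerator="clicks", denominator="views", target="ctr"))
)
```

Keeping each step as a small configured object means the same step can be reused across pipelines and tested in isolation, with no Spark session involved.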
See `examples/ml_without_spark/` for complete working examples.