Add decorator functionality to validate_schema function #255

kunaljubce wants to merge 21 commits into mrpowers-io:planning-1.0-release

Conversation

Make validate_schema() behave both as a function and as a decorator to other functions. Also added associated changes to tests and README.
kunaljubce: @SemyonSinchenko @MrPowers @jeffbrennan This PR is ready for review now!
SemyonSinchenko left a comment:
To be honest, I do not like an implementation with deepcopy and three loops. I'm also not 100% sure how deepcopy() works with py4j.java_gateway.JavaObject.
```python
dataframe = func(*args, **kwargs)
_all_struct_fields = copy.deepcopy(dataframe.schema)
_required_schema = copy.deepcopy(required_schema)
```
SemyonSinchenko: Why not check that the lengths of both schemas are the same? I mean, before doing the deepcopy.
kunaljubce: @SemyonSinchenko Are you suggesting that we check if the lengths match, and only then do the deepcopy?
SemyonSinchenko: Exactly! Deep copy may be quite expensive for big objects.
```python
raise DataFrameMissingStructFieldError(error_message)
missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
error_message = (
    f"The {missing_struct_fields} StructFields are not included in the DataFrame with the following StructFields {_all_struct_fields}"
```
SemyonSinchenko: In the case of nested schemas, this error message will be unreadable.
kunaljubce: @SemyonSinchenko How do you suggest we put this?
SemyonSinchenko: I would suggest flattening all fields first into the form parent.parent.field, and only after that rendering all the missing fields. I would put the flattening logic in a separate function.
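A minimal sketch of such a flattening helper (the name `_flatten_field_names` is hypothetical):

```python
from pyspark.sql.types import StructType


def _flatten_field_names(schema: StructType, prefix: str = "") -> list[str]:
    """Flatten a possibly nested schema into dotted names like parent.child.field."""
    names: list[str] = []
    for field in schema.fields:
        full_name = f"{prefix}{field.name}"
        if isinstance(field.dataType, StructType):
            # Recurse into nested structs, extending the dotted prefix.
            names.extend(_flatten_field_names(field.dataType, prefix=f"{full_name}."))
        else:
            names.append(full_name)
    return names
```

Rendering the missing fields as dotted names keeps the error message readable even for deeply nested schemas.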
```python
def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
    def wrapper(*args: object, **kwargs: object) -> DataFrame:
        dataframe = func(*args, **kwargs)
        _all_struct_fields = copy.deepcopy(dataframe.schema)
```
SemyonSinchenko: We are doing a deep copy of two potentially long schemas on each call. I do not like it.
kunaljubce: @SemyonSinchenko I took a step back and re-evaluated the implementation. The question is: do we even need deepcopy here? We are not changing the original DataFrames or their schemas. Is a shallow copy a better idea here? Let me know your thoughts!
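For illustration, a small sketch of the trade-off (assuming the validation only reads the schemas and never mutates them, in which case no copy may be needed at all):

```python
import copy

from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([StructField("name", StringType(), True)])

shallow = copy.copy(schema)   # new StructType wrapper, but the StructField objects are shared
deep = copy.deepcopy(schema)  # recursively copies every field; costly for wide schemas

# If validation is read-only, referencing the schema directly avoids both copies:
field_names = schema.fieldNames()
```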
```python
if missing_struct_fields:
    raise DataFrameMissingStructFieldError(error_message)
missing_struct_fields = [x for x in _required_schema if x not in _all_struct_fields]
```
SemyonSinchenko: I would like to transform this comprehension into a loop, like:

```python
for field_name in _required_schema.fieldNames():
    if field_name not in ...:
        ...  # field is missing
    else:
        if ignore_nullable:
            ...  # compare name, dataType
        else:
            ...  # compare name, dataType, nullable
```
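A hedged, runnable take on that suggestion (the helper name `_find_missing_fields` is hypothetical, and the exact matching semantics are an assumption):

```python
from pyspark.sql.types import StructField, StructType


def _find_missing_fields(
    df_schema: StructType,
    required_schema: StructType,
    ignore_nullable: bool,
) -> list[StructField]:
    """Single pass over the required fields, with no copies of either schema."""
    missing: list[StructField] = []
    for required_field in required_schema.fields:
        match = next((f for f in df_schema.fields if f.name == required_field.name), None)
        if match is None:
            # Field is absent from the DataFrame schema entirely.
            missing.append(required_field)
        elif ignore_nullable:
            # Compare name and dataType only.
            if match.dataType != required_field.dataType:
                missing.append(required_field)
        else:
            # Compare name, dataType, and nullable.
            if match != required_field:
                missing.append(required_field)
    return missing
```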
Commits

…s setup

Add details to `CONTRIBUTING.md` on auto-assigning issues, pre-commit installation, and GitHub Actions local setup using 'act'.

* **Auto-assigning issues**: Add a section explaining auto-assigning issues on the comment 'take', referencing the configuration in `.github/workflows/assign-on-comment.yml`.
* **Pre-commit installation and execution**: Add a section detailing pre-commit installation and execution, referencing the configuration in `.pre-commit-config.yaml`.
* **GitHub Actions local setup using 'act'**: Add a section providing instructions for GitHub Actions local setup using 'act', referencing the configuration in `.github/workflows/ci.yml`. Include instructions for running specific jobs and handling MacBooks with M1 processors.
According to the review comment, change the pip installs to poetry installs.
Bumps [jupyterlab](https://github.com/jupyterlab/jupyterlab) from 3.6.7 to 3.6.8.

- [Release notes](https://github.com/jupyterlab/jupyterlab/releases)
- [Changelog](https://github.com/jupyterlab/jupyterlab/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyterlab/jupyterlab/compare/@jupyterlab/vdom@3.6.7...@jupyterlab/vdom@3.6.8)

---
updated-dependencies:
- dependency-name: jupyterlab
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
- Update deps
- Update Ruff
- Corresponding updates of pyproject
- Slightly update Makefile
- Drop support of Spark 2
- Drop support of Spark 3.1-3.2
- Minimal Python is 3.10 from now on
- Apply new ruff rules
- Apply ruff linter

On branch 202-drop-pyspark2
Changes to be committed:
    modified: .github/workflows/ci.yml
    modified: Makefile
    modified: poetry.lock
    modified: pyproject.toml
    modified: quinn/append_if_schema_identical.py
    modified: quinn/dataframe_helpers.py
    modified: quinn/keyword_finder.py
    modified: quinn/math.py
    modified: quinn/schema_helpers.py
    modified: quinn/split_columns.py
    modified: quinn/transformations.py
Related to mrpowers-io#237

Remove deprecated extension functions from the codebase.

* Delete `quinn/extensions/dataframe_ext.py` and `quinn/extensions/spark_session_ext.py`
* Remove import statements for `dataframe_ext` and `spark_session_ext` in `quinn/extensions/__init__.py`
* Remove per-file ignores for `quinn/extensions/dataframe_ext.py` and `quinn/extensions/__init__.py` in `pyproject.toml`
* Delete test files `tests/extensions/test_dataframe_ext.py` and `tests/extensions/test_spark_session_ext.py`
* Create the `extensions/` directory.
* Create the `__init__.py` file within it with the license header.
* Create the `spark_session_ext.py` file within the extensions directory.
* Update the doc-string in `create_df` to align with the parameter list.
* Format modified files & lint the code using ruff (successful).
As a matter of fact, a `create_df` function already exists under `dataframe_helpers.py`. Following are the changes introduced by this commit:

* Update `dataframe_helpers.py` by updating the doc-string in the `create_df` function.
* Remove the quinn/extensions/ directory along with the two `.py` files involved.
* Remove the ruff check from `pyproject.toml`.
* Finally, format `dataframe_helpers.py` with ruff.
* Remove the `test_spark_connect.py` file as it is not relevant anymore.
* Update the Makefile to remove the spark-connect test.
* Hardcode the Hadoop version to 3, as 2 is EOL.
apply hotfix

update lock file
update makefile

add make command for ruff
Proposed changes
Make `validate_schema()` behave both as a function and as a decorator to other functions. Also added associated changes to tests and README. Took another stab at #140, as an extension to #144.
Types of changes
What types of changes does your code introduce to quinn?
Put an `x` in the boxes that apply.

Further comments

Implementation details for `validate_schema`

So we are implementing a decorator factory here so that our function can be used both as a decorator and as a callable function. In this implementation:
* The `validate_schema` function acts as both a decorator factory and a decorator. It takes `required_schema`, `ignore_nullable`, and an optional `df_to_be_validated` argument.
* If `df_to_be_validated` is None, it means the function is being used as a decorator factory, and it returns the `decorator` decorator.
* If `df_to_be_validated` is not None, it means the function is being called directly with a DataFrame, and it applies the decorator to `df_to_be_validated` immediately.

When `validate_schema` is called directly with a DataFrame, the validation logic gets executed by wrapping the DataFrame in a lambda function and immediately calling the decorator, as sketched below.
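A minimal sketch of that dual-use pattern, with the validation body elided (parameter names follow the description above; the exact signature is an assumption, not the PR's final code):

```python
from functools import wraps
from typing import Callable, Optional

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType


def validate_schema(
    required_schema: StructType,
    ignore_nullable: bool = False,
    df_to_be_validated: Optional[DataFrame] = None,
) -> Callable:
    def decorator(func: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
        @wraps(func)
        def wrapper(*args: object, **kwargs: object) -> DataFrame:
            dataframe = func(*args, **kwargs)
            # ... validate dataframe.schema against required_schema here,
            # honoring ignore_nullable ...
            return dataframe

        return wrapper

    if df_to_be_validated is None:
        # No DataFrame given: act as a decorator factory and return the decorator.
        return decorator
    # DataFrame given: wrap it in a lambda and apply the decorator immediately.
    return decorator(lambda: df_to_be_validated)()
```

With this shape, `@validate_schema(schema)` decorates a DataFrame-returning function, while `validate_schema(schema, df_to_be_validated=df)` validates `df` in a single direct call.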