Extractor for different sources (not only filesystem paths)#208

Merged
PeterKraus merged 24 commits into dgbowl:main from carla-terboven:extract-raw-bytes
Apr 9, 2025

@carla-terboven
Contributor

As mentioned in #207, it would be helpful for my application to pass the raw binary data from the mpr files instead of a file path.

I was not sure how to handle this in a way that would also be useful for the other file extractors in yadg. This is my first attempt.
Another option for my use case would be to create a new raw_mpr.py file with a new extractor function that simply misuses the fn parameter for raw data and does not need the changes in https://github.com/dgbowl/yadg/blob/main/src/yadg/extractors/__init__.py

Which do you prefer?

Contributor

@PeterKraus PeterKraus left a comment

Good start!

I wouldn't really mess with yadg.extractors.extract() for now, as we should define the interface first (which should be shared by all extractors, as it is now).

I think a good place to start would be to add a yadg.extractors.extract_from_bytes() function in parallel to the existing yadg.extractors.extract_from_path(), as you are doing here, and then figure out how each extractor announces what it supports (i.e. modify extract() to do some magic with dispatching: https://docs.python.org/3/library/functools.html#functools.singledispatch).
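
For illustration, a minimal sketch of the dispatching idea using functools.singledispatch. This is not yadg's code: the handler bodies, the string return value and the default timezone are assumptions made only to keep the example self-contained.

```python
from functools import singledispatch
from pathlib import Path


@singledispatch
def extract(source, *, timezone: str = "UTC") -> str:
    # Fallback for source types that no handler has been registered for.
    raise TypeError(f"unsupported source type: {type(source).__name__}")


@extract.register
def _(source: Path, *, timezone: str = "UTC") -> str:
    # Hypothetical path handler: read the file and reuse the bytes handler.
    return extract(source.read_bytes(), timezone=timezone)


@extract.register
def _(source: bytes, *, timezone: str = "UTC") -> str:
    # Hypothetical bytes handler: a real extractor would parse the payload here.
    return f"parsed {len(source)} bytes ({timezone})"
```

With this layout an extractor "announces" what it supports simply by registering a handler for that type; anything else falls through to the default function.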

@carla-terboven
Contributor Author

Maybe like this?

Contributor

@PeterKraus PeterKraus left a comment

Hi, I think it's getting there. Could you please also add a test?

@carla-terboven
Contributor Author

Hi, thank you for the feedback. I added a test, but it feels like a lot of copy/paste. If you also find it somewhat cumbersome, I can think about it again on Monday.

Contributor

@PeterKraus PeterKraus left a comment

Good work, I think a few minor changes and it can be merged.

Comment on lines +72 to +78
if path is not None:
    logger.warning(
        "The parameter 'path' is deprecated and has been replaced by 'source'. "
        "Please use 'source' instead.",
        DeprecationWarning,
    )
    source = path
Contributor

This should probably be handled via a decorator, otherwise the function will fail when source is not specified. We can then re-use the decorator to process the fn -> source transition in each extract() function of the individual extractors.
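
A rough sketch of what such a reusable decorator could look like. The name deprecate_kwargs and its signature are hypothetical, not the ones used in this PR, and the sketch raises the warning through warnings.warn() rather than through the logger used in the diff above.

```python
import warnings
from functools import wraps


def deprecate_kwargs(*old_names: str, new_name: str = "source"):
    """Map deprecated keyword names onto ``new_name`` and emit a warning."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for old in old_names:
                if old in kwargs:
                    warnings.warn(
                        f"The parameter '{old}' is deprecated; "
                        f"use '{new_name}' instead.",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                    # Do not overwrite an explicitly supplied new-style argument.
                    kwargs.setdefault(new_name, kwargs.pop(old))
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical extractor entry point, used only to exercise the decorator.
@deprecate_kwargs("fn", "path")
def extract(*, source, timezone: str = "UTC"):
    return source, timezone
```

Calling extract(fn="a.mpr") then behaves like extract(source="a.mpr") and emits a DeprecationWarning, so the same decorator can cover both the path -> source and the fn -> source renames.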



@singledispatch
def extract_source(source, timezone):
Contributor

I think this should be just extract(source: Any, timezone: str, **kwargs) decorated by the path -> source or fn -> source deprecation warning.

@carla-terboven
Contributor Author

Hi, thank you for the feedback. I am not sure if I have understood/implemented your idea of the decorator correctly. Please check and I can make changes if you have a different idea.

Comment on lines +570 to +575
@singledispatch
def extract_source(source: Any, timezone: str, **kwargs):
    logger.warning(
        "The selected extractor does not support the provided source. "
        "Please check the available extractors or enter a valid file path."
    )
Contributor

Is there a way to get rid of this function and just use extract() with two decorators?

Contributor Author

@carla-terboven carla-terboven Apr 8, 2025

I am not sure I understand what you are aiming for.
Would you like to use extract() itself as the singledispatch function, and then use a decorator to prepare the kwargs so that we don't run into the "requires at least 1 positional argument" error? Then we could have a decorator like this:

def prepare_args_dispatch(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        source = kwargs.pop("source")
        new_args = (source,) + args

        return func(*new_args, **kwargs)

    return wrapper

And to avoid changing the API, extract() would look like this:

@deprecate_fn_path
@prepare_args_dispatch
@singledispatch
def extract(
    *,
    source: Any,
    timezone: str,
    **kwargs: dict,
) -> DataTree:

But with this function we could not use the default (fallback) dispatcher, because for unsupported types we would get the error "extract() takes 0 positional arguments but 1 positional argument (and 1 keyword-only argument) were given".
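
Both failure modes mentioned in this comment can be reproduced in isolation. The snippet below is a stripped-down stand-in for extract(), not the PR's code, and the error_of() helper is hypothetical.

```python
from functools import singledispatch


@singledispatch
def extract(*, source, timezone: str = "UTC"):
    return source, timezone


def error_of(call):
    """Run ``call`` and return the TypeError message it raises, if any."""
    try:
        call()
    except TypeError as exc:
        return str(exc)
    return None


# Keyword-only call: the dispatcher has no first positional argument to inspect.
keyword_error = error_of(lambda: extract(source=b"raw"))

# Positional call: dispatch succeeds, but the keyword-only fallback rejects it.
positional_error = error_of(lambda: extract(b"raw"))
```

The first call fails inside singledispatch itself, the second inside the fallback function, which is why a keyword-only source cannot be combined with a working default dispatcher.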

Contributor Author

But maybe you mean to use the decorator instead of the singledispatch?
Then we could do something like

def handle_source_type(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        source = kwargs.get("source")
        if isinstance(source, (str, Path)):
            with open(source, "rb") as mpr_file:
                mpr_bytes = mpr_file.read()
            kwargs["source"] = mpr_bytes
        elif not isinstance(source, bytes):
            logger.warning(
                "The selected extractor does not support the provided source. "
                "Please check the available extractors or enter a valid file path."
            )

        return func(*args, **kwargs)

    return wrapper

But then we could also just check for the type inside the extract function.

Contributor

Would something like this work:

@deprecate_fn_path    # we handle `fn=` as `kwargs` here, deprecate them, pass them as positional `source: Any`
@singledispatch       # dispatch is happy because we now have a positional arg `source: Any`
def extract(
    source: Any,
    *,
    timezone: str,
    **kwargs: dict,
) -> DataTree:

You might have to change the deprecate_fn_path() decorator to explicitly pass arguments as source=fn, **kwargs instead of storing kwargs["source"] = fn and passing just **kwargs.
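
A self-contained sketch of how this stacking could work. The body of deprecate_fn_path() here is a guess written for illustration, not the implementation merged in this PR, and the registered handlers return placeholder strings instead of a DataTree.

```python
import warnings
from functools import singledispatch, wraps
from typing import Any


def deprecate_fn_path(func):
    """Hypothetical stand-in: forward fn=/path=/source= as positional source."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        for old in ("fn", "path"):
            if old in kwargs:
                warnings.warn(
                    f"'{old}' is deprecated; use 'source' instead.",
                    DeprecationWarning,
                    stacklevel=2,
                )
                kwargs.setdefault("source", kwargs.pop(old))
        if "source" in kwargs:
            # singledispatch inspects args[0], so source must be positional.
            args = (kwargs.pop("source"), *args)
        return func(*args, **kwargs)

    return wrapper


@deprecate_fn_path
@singledispatch
def extract(source: Any, *, timezone: str, **kwargs) -> str:
    # Fallback for unsupported source types.
    raise TypeError(f"unsupported source type: {type(source).__name__}")


# functools.wraps copies the dispatcher's attributes, so register() is
# still reachable on the decorated function.
@extract.register
def _(source: bytes, *, timezone: str, **kwargs) -> str:
    return f"bytes:{len(source)}"


@extract.register
def _(source: str, *, timezone: str, **kwargs) -> str:
    return f"path:{source}"
```

In this sketch extract(fn="x.mpr", timezone="UTC"), extract(source="x.mpr", timezone="UTC") and extract("x.mpr", timezone="UTC") all reach the same str handler; only the first emits a DeprecationWarning.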

Contributor Author

I thought changing the positional arguments of extract() was not an option because of this earlier comment (#208 (comment)). But if we can have source: Any as a positional argument, that works.

Contributor Author

Changed in a683a53. Please note that I also had to change the positional argument of extract() in mpt.py to get all tests to run.

Contributor

@PeterKraus PeterKraus Apr 9, 2025

You're right, I said that - but now we have the decorator that takes care of this issue for us, without breaking the API.

@PeterKraus PeterKraus merged commit c333073 into dgbowl:main Apr 9, 2025
12 checks passed
@PeterKraus
Contributor

Great job, thanks for the contribution.

Do you need a new release with this feature urgently, or are you happy to use the git+https build for now?

@carla-terboven
Contributor Author

Thank you for your help!

We don't urgently need the new release. The git+https build is ok for now.
