Extractor for different sources (not only filesystem paths)#208
Extractor for different sources (not only filesystem paths)#208PeterKraus merged 24 commits intodgbowl:mainfrom
Conversation
There was a problem hiding this comment.
Good start!
I wouldn't really mess with yadg.extractors.extract() for now, as we should define the interface first (which should be shared by all extractors, as it is now).
I think a good place to start would be to add a yadg.extractors.extract_from_bytes() function in parallel to the existing yadg.extractors.extract_from_path(), as you are doing here, and then figure out a way how each extractor announces what it supports (i.e. modify extract() to do some magic with dispatching: https://docs.python.org/3/library/functools.html#functools.singledispatch).
|
Maybe like this? |
PeterKraus
left a comment
There was a problem hiding this comment.
Hi, I think it's getting there. Could you please also add a test?
|
Hi, thank you for the feedback. I added a test, but it feels like a lot of copy/paste. If you also find it somewhat cumbersome, I can think about it again on Monday. |
PeterKraus
left a comment
There was a problem hiding this comment.
Good work, I think a few minor changes and it can be merged.
src/yadg/extractors/__init__.py
Outdated
| if path is not None: | ||
| logger.warning( | ||
| "The parameter 'path' is deprecated and has been replaced by 'source'. " | ||
| "Please use 'source' instead.", | ||
| DeprecationWarning, | ||
| ) | ||
| source = path |
There was a problem hiding this comment.
This should probably be handled via a decorator, otherwise the function will fail when source is not specified. We can then re-use the decorator to process then fn -> source transition in each extract() function of the individual extractors.
src/yadg/extractors/eclab/mpr.py
Outdated
|
|
||
|
|
||
| @singledispatch | ||
| def extract_source(source, timezone): |
There was a problem hiding this comment.
I think this should be just extract(source: Any, timezone: str, **kwargs) decorated by the path -> source or fn -> source deprecation warning.
|
Hi, thank you for the feedback. I am not sure if I have understood/implemented your idea of the decorator correctly. Please check and I can make changes if you have a different idea. |
| @singledispatch | ||
| def extract_source(source: Any, timezone: str, **kwargs): | ||
| logger.warning( | ||
| "The selected extractor does not support the provided source. " | ||
| "Please check the available extractors or enter a valid file path." | ||
| ) |
There was a problem hiding this comment.
Is there a way to get rid of this function and just use extract() with two decorators?
There was a problem hiding this comment.
I am not sure if I get what you are aiming for.
Would you like to use extract() as the singeldispatch? And then using a decorator to prepare the kwargs to don't run in the requires at least 1 positional argument? Then we could have a decorator like this:
def prepare_args_dispatch(func):
@wraps(func)
def wrapper(*args, **kwargs):
source = kwargs.pop("source")
new_args = (source,) + args
return func(*new_args, **kwargs)
return wrapper
And to not change the API extract would look like this:
@deprecate_fn_path
@prepare_args_dispatch
@singledispatch
def extract(
*,
source: Any,
timezone: str,
**kwargs: dict,
) -> DataTree:
But with this function we could not use the functionality of a default dispatcher because for unsupported types we would have the error extract() takes 0 positional arguments but 1 positional argument (and 1 keyword-only argument) were given.
There was a problem hiding this comment.
But maybe you mean to use the decorator instead of the singledispatch?
Then we could do something like
def handle_source_type(func):
@wraps(func)
def wrapper(*args, **kwargs):
source = kwargs.get("source")
if isinstance(source, (str, Path)):
with open(source, "rb") as mpr_file:
mpr_bytes = mpr_file.read()
kwargs["source"] = mpr_bytes
elif not isinstance(source, bytes):
logger.warning(
"The selected extractor does not support the provided source. "
"Please check the available extractors or enter a valid file path."
)
return func(*args, **kwargs)
return wrapper
But then we could also just check for the type inside the extract function.
There was a problem hiding this comment.
Would something like this work:
@deprecate_fn_path # we handle `fn=` as `kwargs` here, deprecate them, pass them as positional `source: Any`
@singledispatch # dispatch is happy because we now have a positional arg `source: Any`
def extract(
source: Any,
*,
timezone: str,
**kwargs: dict,
) -> DataTree:
You might have to change the deprecate_fn_path() decorator to explicitly pass arguments as source=fn, **kwargs instead of storing kwargs["source"] = fn and passing just **kwargs.
There was a problem hiding this comment.
I thought changing the positional arguments of extract was not an option because of this earlier comment (#208 (comment)). But if we can have source: Any as a positional argument that works.
There was a problem hiding this comment.
Changed in a683a53 . Please note that I also had to change the positional argument of the extract in mpt.py to run all tests.
There was a problem hiding this comment.
You're right, I said that - but now we have the decorator that takes care of this issue for us, without breaking the API.
|
Great job, thanks for the contribution. Do you need a new release with this feature urgently, or are you happy to use the |
|
Thank you for your help! We don't urgently need the new release. The |
As mentioned in #207 it would be helpful for my application to use the raw binary data from the mpr files instead of the file path.
I was not sure how to handle this so it could also be useful for all the other file extractors in yadg. This would be my first attempt.
Another option for my usecase would be to create a new raw_mpr.py file with a new extractor function that simply misuses the fn parameter for raw data and does not need the changes in the https://github.com/dgbowl/yadg/blob/main/src/yadg/extractors/__init__.py
What do you like better?