feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination #424
Conversation
Thanks for making a pull request! 😃
dushyantbehl left a comment:
Overall looks good to me, but requesting minor clarifications.
@ashokponkumar requesting you to review this PR too.
tuning/data/data_config.py (outdated):
```python
class DataSetConfig:
    name: str
    data_paths: List[str]
    builder: Optional[str] = None
```
Could you please add a one-line comment here noting that this refers to the builder name in HF?
Added. Thanks!
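For reference, a minimal sketch of what the annotated field could look like (the exact comment wording is an assumption, not the merged code):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataSetConfig:
    name: str
    data_paths: List[str]
    # Name of the Hugging Face dataset builder (e.g. "json" or "parquet")
    # to use when loading the files in data_paths.
    builder: Optional[str] = None
```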
tuning/data/data_config.py (outdated):
```python
builder = kwargs["builder"]
assert isinstance(
    builder, str
), f"builder: {builder} should be str with values in (json, text, parquet, arrow)"
```
Do we need to call out the supported values? Can we just say it's a builder name for supported HF data formats?
Makes sense, added it.
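A possible revision along those lines, phrased in terms of supported HF data formats rather than a hard-coded list (wording is an assumption):

```python
def check_builder(builder):
    # Hypothetical helper: keep the message generic instead of
    # enumerating builder names like json, text, parquet, arrow.
    assert isinstance(
        builder, str
    ), f"builder: {builder} should be a builder name for a supported HF data format"
```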
tuning/data/data_processors.py (outdated):
```python
with standardized exception handling.

Args:
    path: The path argument for load_dataset (could be a directory, file, builder, etc.)
```
If path can be a builder, why do we need a builder separately? Should we rename path here to just data_path and say it is a path, be it a directory, file, pattern, or dataset id?
Sure.
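A sketch of the suggested rename, assuming the helper wraps datasets.load_dataset (the helper name and docstring are illustrative):

```python
import datasets

def load_dataset_from_path(data_path, builder=None, **kwargs):
    """Wrap datasets.load_dataset with standardized exception handling.

    Args:
        data_path: A directory, file, glob pattern, or HF dataset id.
        builder: Optional HF dataset builder name (e.g. "json", "parquet").
    """
    if builder:
        # With a builder, treat data_path as the data files for that format.
        return datasets.load_dataset(builder, data_files=data_path, **kwargs)
    # Without a builder, let load_dataset resolve data_path itself.
    return datasets.load_dataset(data_path, **kwargs)
```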
```python
except FileNotFoundError as e:
    # Handle file/directory not found
    context = f"builder {builder}" if builder else f"path {path}"
    raise ValueError(f"Data loading failed: invalid {context}.") from e
```
For the builder case, shouldn't the context still include the data path?
Makes sense.
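A sketch of the adjusted context string, so the path is reported even when a builder is set (values are illustrative):

```python
builder = "json"    # illustrative
data_path = "data/" # illustrative

# Mention the builder only when one was given, but always include the path.
context = f"builder {builder}, path {data_path}" if builder else f"path {data_path}"
print(f"Data loading failed: invalid {context}.")
```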
tuning/data/data_processors.py (outdated):
```python
all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
```
Can we call this function something else? Maybe a name that specifically conveys we are checking whether they are mergeable.
Maybe validate_mergeable_datasets?
Based on the discussion here, the function `validate_mergeable_datasets` now just logs warnings, and dataset loading allows concatenation of any datasets.
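A sketch of a warn-only check along these lines (the body is an assumption based on the discussion, not the merged code):

```python
import logging

logger = logging.getLogger(__name__)

def validate_mergeable_datasets(all_datasets):
    """Log a warning if the given HF datasets do not share the same columns."""
    if not all_datasets:
        return
    reference = set(all_datasets[0].column_names)
    for i, ds in enumerate(all_datasets[1:], start=1):
        if set(ds.column_names) != reference:
            logger.warning(
                "Dataset %d has columns %s, which differ from %s; "
                "concatenation may behave unexpectedly.",
                i, sorted(ds.column_names), sorted(reference),
            )
```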
tuning/data/data_processors.py (outdated):
```python
    raise ValueError(f"data path is invalid [{', '.join(files)}]") from e
except Exception as e:
    raise ValueError(
        f"An error occurred while concatenating datasets: {e}"
```
Since you already have the dataset config, can you add the dataset name to this error, to indicate which dataset definition in the config failed?
Done.
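A sketch of the resulting error, assuming the dataset name from the config is in scope (the function and variable names are illustrative):

```python
import datasets

def concatenate_for_config(all_datasets, dataset_name):
    try:
        return datasets.concatenate_datasets(all_datasets)
    except Exception as e:
        # Name the failing dataset definition from the config in the error.
        raise ValueError(
            f"An error occurred while concatenating datasets "
            f"for dataset '{dataset_name}': {e}"
        ) from e
```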
| ( | ||
| [ | ||
| TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSON, | ||
| TWITTER_COMPLAINTS_DATA_DIR_JSON, |
Nice test case, thanks!
```python
AssertionError,
ValueError,
datasets.exceptions.DatasetGenerationCastError,
pyarrow.lib.ArrowInvalid,
```
In which case is the pyarrow error thrown?
I removed this unit test, as `test_process_dataconfig_multiple_files_varied_data_formats` will no longer throw an error: with the `validate_datasets` function removed, datasets with any columns (varied formats) will get concatenated without error. A successful e2e run with varied formats, though, depends on which handler the user passes. @dushyantbehl
CC: @willmj
tuning/utils/utils.py (outdated):
```python
    return None


def validate_datasets(datasets):
```
I know it is already exercised inside the code path, but could we have a test case for this? A simple one should suffice. Thanks!
Based on the discussion here, the function `validate_mergeable_datasets` now just logs warnings, and dataset loading allows concatenation of any datasets.
tuning/data/data_processors.py (outdated):
```python
all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
```
The `concatenate_datasets` function itself will throw an error, won't it? Isn't it better to handle this in its except block? Otherwise it involves some unnecessary computation.
Based on the discussion here, the function `validate_mergeable_datasets` now just logs warnings, and dataset loading allows concatenation of any datasets.
willmj left a comment:
LGTM, can review again after Dushyant's comments
tuning/data/data_processors.py (outdated):
```python
all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
```
Maybe validate_mergeable_datasets?
LGTM!
ashokponkumar left a comment:
One more nit
tuning/data/data_processors.py (outdated):
```python
return all_datasets[0]

raw_datasets = datasets.concatenate_datasets(all_datasets)
logging.info(
```
Logger instead of logging
Modified all.
Great, thanks!
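For context, the nit refers to the usual pattern of logging through a module-level logger rather than the root logging module; a minimal sketch:

```python
import logging

logger = logging.getLogger(__name__)

def log_concatenation(num_datasets):
    # logger.info attributes the record to this module and respects
    # per-module log levels, unlike a bare logging.info call.
    logger.info("Concatenated %d datasets into one dataset.", num_datasets)
```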
willmj left a comment:
LGTM! Thanks Abhishek!
@Abhishek-TAMU LGTM too
Description of the change
Added support to handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination of each of them.
Multiple files with different extensions can now also be passed (but the columns of all files should be similar).
The user can pass `builder` in `DataSetConfig` to specify the loader for the passed file/folder/pattern, as illustrated in the sketch below.
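As an illustration only (the name and paths are hypothetical), a config entry combining files, a folder, and a glob pattern with an explicit builder might be constructed with the `DataSetConfig` shown earlier like this:

```python
dataset_config = DataSetConfig(
    name="twitter_complaints_mixed",
    data_paths=[
        "data/train_a.json",        # single file
        "data/more_json_files/",    # folder
        "data/shards/part-*.json",  # glob pattern
    ],
    builder="json",  # force the HF json builder for all paths
)
```

The unit test cases below were tested (some already existed and some are added):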
1- Passing of multiple files and e2e unit tests:
Without builder:
With builder:
2- Passing of multiple folders and e2e unit tests:
Without builder:
With builder:
3- Passing combination of files and folders with e2e unit test:
4- Passing of files with pattern with/without builder:
Supported:
Unit test `test_process_dataconfig_multiple_files_folders_with_globbing`
Not supported:
Unit test `test_process_dataconfig_multiple_files_folders_without_builder`: a file passed via a wildcard pattern, without a builder and without an extension, goes to `datasets.load_dataset(path)` as a plain path; `datasets.load_dataset` then searches `huggingface_hub` for this dataset, and since it doesn't accept a `huggingface_hub` dataset id as a wildcard pattern, it raises an error (see the sketch after this list).
5- Passing of folder with pattern with/without builder:
Supported:
Unit test `test_process_dataconfig_multiple_files_folders_with_globbing`
Not supported:
Unit test `test_process_dataconfig_multiple_files_folders_without_builder`
6- Passing of HF dataset:
Unit test `test_load_dataset_with_hf_dataset`
7- Passing of non-existing files and folders:
Unit test `test_load_dataset_with_non_exist_path`
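A sketch of the pattern-handling distinction described in items 4 and 5, using the public datasets API (file paths are hypothetical):

```python
import datasets

# Supported: with a builder, the glob pattern is expanded via data_files.
ds = datasets.load_dataset("json", data_files="data/shards/part-*.json")

# Not supported: without a builder, the wildcard string is treated as a
# dataset id and resolved against huggingface_hub, which raises an error.
ds = datasets.load_dataset("data/shards/part-*.json")
```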
Related issue number
Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1487
Closes PR:
#416
#417
How to verify the PR
Was the PR tested