Conversation

@Abhishek-TAMU Abhishek-TAMU commented Dec 18, 2024

Description of the change

Added support for passing multiple files, multiple folders, paths with patterns, HF datasets, and any combination of these.
Multiple files with different extensions can now also be passed (but the columns of all files should be similar).

Users can pass builder in DataSetConfig to specify the loader for the given file/folder/pattern.
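As a rough sketch of how such a config entry could be constructed (the field names follow the DataSetConfig dataclass reviewed in this PR; the values and the choice of the json builder are illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataSetConfig:
    name: str
    data_paths: List[str]          # files, folders, glob patterns, or HF dataset ids
    builder: Optional[str] = None  # HF builder name, e.g. "json", "parquet", "arrow", "text"

# One entry mixing a single file, a folder, and a wildcard pattern;
# builder="json" forces the HF json loader for every path in the list.
cfg = DataSetConfig(
    name="twitter_complaints",
    data_paths=["data/train.jsonl", "data/extra/", "data/shards/*.jsonl"],
    builder="json",
)
```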

The following unit test cases were run (some already existed, some were added):

1- Passing multiple files, with e2e unit tests:

Without builder:

test_process_dataconfig_multiple_files
test_process_dataconfig_multiple_datasets_datafiles_sampling
test_process_dataconfig_multiple_files_varied_data_formats
test_run_causallm_ft_and_inference_with_multiple_dataset

With builder:

test_load_dataset_with_datasetconfig
test_load_dataset_with_datasetconfig_incorrect_builder

2- Passing multiple folders, with e2e unit tests:

Without builder:

test_load_dataset_with_dataconfig_and_datafolder
test_run_causallm_ft_and_inference_with_multiple_dataset

With builder:

test_load_dataset_with_dataconfig_and_datafolder
test_load_dataset_with_dataconfig_and_datafolder_incorrect_builder

3- Passing a combination of files and folders, with e2e unit tests:

test_load_dataset_with_datasetconfig_files_folders
test_load_dataset_with_datasetconfig_files_folders_incorrect_format
test_run_causallm_ft_and_inference_with_multiple_dataset

4- Passing files with a pattern, with/without builder:

Supported:

Unit test test_process_dataconfig_multiple_files_folders_with_globbing:

  • With builder: file passed with or without extension
  • Without builder: file passed with extension

Not Supported:

Unit test test_process_dataconfig_multiple_files_folders_without_builder:
File passed via wildcard pattern, without extension and without builder.

  • Without builder and without extension: this is not supported because, when a file path is passed via a pattern without an extension and without a builder, our code cannot infer the builder and therefore passes the path directly to datasets.load_dataset(path). load_dataset then searches huggingface_hub for a dataset with that name, and since it does not accept a huggingface_hub dataset id in the form of a wildcard pattern, it raises an error.
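To make the failure mode concrete, here is a minimal, hypothetical sketch of extension-based builder inference (the mapping and function names are illustrative, not the PR's actual code):

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-builder mapping; the real code may differ.
EXT_TO_BUILDER = {
    ".json": "json", ".jsonl": "json",
    ".csv": "csv", ".parquet": "parquet",
    ".arrow": "arrow", ".txt": "text",
}

def resolve_builder(data_path: str, builder: Optional[str] = None) -> Optional[str]:
    """Return the HF builder to use; None means the path would be handed
    to datasets.load_dataset(path) as-is and treated as a hub dataset id."""
    if builder:
        return builder
    return EXT_TO_BUILDER.get(Path(data_path).suffix.lower())

resolve_builder("data/*.jsonl")          # "json": extension is visible in the pattern
resolve_builder("data/part-*", "json")   # "json": builder given explicitly
resolve_builder("data/part-*")           # None: no extension, no builder -> hub lookup, which rejects wildcards
```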

5- Passing a folder with a pattern, with/without builder:

Supported:

Unit test test_process_dataconfig_multiple_files_folders_with_globbing:

  • Folder passed with builder
  • Folder passed without builder

Not Supported:

Unit test test_process_dataconfig_multiple_files_folders_without_builder:

  • Folder passed via wildcard pattern without builder: HF datasets.load_dataset doesn't accept a directory path as a wildcard.

6- Passing an HF dataset:

test_load_dataset_with_hf_dataset

7- Passing non-existent files and folders:

test_load_dataset_with_non_exist_path

Related issue number

Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1487

Closes PR:
#416
#417

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Abhishek <[email protected]>
@github-actions

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Dec 18, 2024
@Abhishek-TAMU Abhishek-TAMU changed the title feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination [WIP] feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination Dec 18, 2024
@Abhishek-TAMU Abhishek-TAMU changed the title [WIP] feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination Dec 19, 2024
@Abhishek-TAMU Abhishek-TAMU marked this pull request as ready for review December 19, 2024 02:51
@Abhishek-TAMU Abhishek-TAMU requested review from ashokponkumar and willmj and removed request for Ssukriti, aluu317, anhuong and fabianlim December 19, 2024 02:51
Signed-off-by: Abhishek <[email protected]>

@dushyantbehl dushyantbehl left a comment

Overall looks good to me but requesting minor clarifications.

@ashokponkumar requesting you to review this PR too.

class DataSetConfig:
name: str
data_paths: List[str]
builder: Optional[str] = None
Collaborator

Could you please add one line comment here that this is referring to the builder name in HF.

Collaborator Author

Added. Thanks!

builder = kwargs["builder"]
assert isinstance(
builder, str
), f"builder: {builder} should be str with values in (json, text, parquet, arrow)"
Collaborator

do we need to call out the values supported? can we just say builder name for supported HF data formats

Collaborator Author

Makes sense, added it.

with standardized exception handling.
Args:
path: The path argument for load_dataset (could be a directory, file, builder, etc.)
Collaborator

if Path can be a builder why do we need a builder separately...should we rename path here to just data_path and say it is a path be it directory, file, pattern or dataset id.

Collaborator Author

Sure.

except FileNotFoundError as e:
# Handle file/directory not found
context = f"builder {builder}" if builder else f"path {path}"
raise ValueError(f"Data loading failed: invalid {context}.") from e
Collaborator

for the case of builder shouldn't the context still include data path?

Collaborator Author

Makes sense.

all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
Collaborator

can we call this function something else? maybe a name that specifically conveys we are checking whether they are mergeable or not.

Collaborator

Maybe validate_mergeable_datasets?

Collaborator Author

@Abhishek-TAMU Abhishek-TAMU Dec 19, 2024

Based on this discussion here, function validate_mergeable_datasets now just logs warnings and load_dataset allow concatenation of any datasets.
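A warn-only check along those lines might look like the following sketch (illustrative, not the PR's exact code; it only assumes each dataset exposes a column_names attribute, as HF Dataset does):

```python
import logging

logger = logging.getLogger(__name__)

def validate_mergeable_datasets(all_datasets):
    """Log a warning, rather than raising, when datasets to be
    concatenated do not share the same columns."""
    if len(all_datasets) < 2:
        return
    ref_cols = set(all_datasets[0].column_names)
    for i, ds in enumerate(all_datasets[1:], start=1):
        cols = set(ds.column_names)
        if cols != ref_cols:
            logger.warning(
                "Dataset %d has columns %s, which differ from the first "
                "dataset's columns %s; concatenation may not behave as expected.",
                i, sorted(cols), sorted(ref_cols),
            )
```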

raise ValueError(f"data path is invalid [{', '.join(files)}]") from e
except Exception as e:
raise ValueError(
f"An error occurred while concatenating datasets: {e}"
Collaborator

since you already have the dataset config can you add the dataset name to this error...to tell it failed for which dataset definition in the config.

Collaborator Author

Done.

(
[
TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSON,
TWITTER_COMPLAINTS_DATA_DIR_JSON,
Collaborator

Nice test case. Thanks

AssertionError,
ValueError,
datasets.exceptions.DatasetGenerationCastError,
pyarrow.lib.ArrowInvalid,
Collaborator

in which case is pyarrow error thrown?

Collaborator Author

I removed this unit test, as test_process_dataconfig_multiple_files_varied_data_formats will not throw an error now: we are removing the validate_dataset function, so datasets with any columns (varied formats) will get concatenated without error. A successful e2e run with varied formats, though, depends on which handler the user passes. @dushyantbehl

CC: @willmj

return None


def validate_datasets(datasets):
Collaborator

I know it is already tested inside the code path but could we have a testcase for this...a simple one should suffice. Thanks

Collaborator Author

@Abhishek-TAMU Abhishek-TAMU Dec 19, 2024

Based on this discussion here, function validate_mergeable_datasets now just logs warnings and load_dataset allow concatenation of any datasets.

all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
Collaborator

@ashokponkumar ashokponkumar Dec 19, 2024

The concatenate_datasets function itself will throw an error, won't it? Isn't it better to handle this in the except block of that call? Otherwise it involves some unnecessary computation.

Collaborator Author

Based on this discussion here, function validate_mergeable_datasets now just logs warnings and load_dataset allow concatenation of any datasets.

Collaborator

@willmj willmj left a comment

LGTM, can review again after Dushyant's comments

all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
Collaborator

Maybe validate_mergeable_datasets?

Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Abhishek <[email protected]>
@dushyantbehl
Collaborator

LGTM!

Collaborator

@ashokponkumar ashokponkumar left a comment

One more nit

return all_datasets[0]

raw_datasets = datasets.concatenate_datasets(all_datasets)
logging.info(
Collaborator

Logger instead of logging

Signed-off-by: Abhishek <[email protected]>
@Abhishek-TAMU
Collaborator Author

Abhishek-TAMU commented Dec 19, 2024

Modified all logging used in any Data preprocessor code to logger. @ashokponkumar @dushyantbehl
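For context, the change described here is the usual module-level-logger pattern; a generic sketch (names are illustrative, not the repository's actual code):

```python
import logging

# A module-level logger tags records with the module name and lets callers
# configure data-preprocessor logging independently of the root logger.
logger = logging.getLogger(__name__)

def log_concatenation(num_datasets: int) -> None:
    # Previously this would have been logging.info(...) on the root logger.
    logger.info("Concatenating %d datasets.", num_datasets)
```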

@dushyantbehl
Collaborator

Modified all logging used in any Data preprocessor code to logger. @ashokponkumar @dushyantbehl

Great Thanks!

ashokponkumar
ashokponkumar previously approved these changes Dec 19, 2024
willmj
willmj previously approved these changes Dec 19, 2024
Collaborator

@willmj willmj left a comment

LGTM! Thanks Abhishek!

@dushyantbehl
Collaborator

@Abhishek-TAMU LGTM too
Please merge once the test cases succeed.

@Abhishek-TAMU Abhishek-TAMU dismissed stale reviews from ashokponkumar and willmj via d275b43 December 19, 2024 20:49
@ashokponkumar ashokponkumar merged commit d7f06f5 into foundation-model-stack:main Dec 19, 2024
8 checks passed
@Abhishek-TAMU Abhishek-TAMU deleted the files_folder_pattern branch December 19, 2024 21:00