Conversation

@Abhishek-TAMU Abhishek-TAMU commented Dec 18, 2024

Description of the change

Added support for passing multiple files, multiple folders, paths with patterns, HF datasets, and any combination of these.
Multiple files with different extensions can now also be passed (but the columns of all files should be similar).

Users can pass builder in DataSetConfig to specify the loader for the given file/folder/pattern.
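As a rough sketch of how such a config entry could be constructed (the field names follow the DataSetConfig dataclass reviewed in this PR; the values and the choice of the json builder are illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataSetConfig:
    name: str
    data_paths: List[str]          # files, folders, glob patterns, or HF dataset ids
    builder: Optional[str] = None  # HF builder name, e.g. "json", "parquet", "arrow", "text"

# One entry mixing a single file, a folder, and a wildcard pattern;
# builder="json" forces the HF json loader for every path in the list.
cfg = DataSetConfig(
    name="twitter_complaints",
    data_paths=["data/train.jsonl", "data/extra/", "data/shards/*.jsonl"],
    builder="json",
)
```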

The following unit test cases were run (some already existed, some were added):

1- Passing multiple files, with e2e unit tests:

Without builder:

test_process_dataconfig_multiple_files
test_process_dataconfig_multiple_datasets_datafiles_sampling
test_process_dataconfig_multiple_files_varied_data_formats
test_run_causallm_ft_and_inference_with_multiple_dataset

With builder:

test_load_dataset_with_datasetconfig
test_load_dataset_with_datasetconfig_incorrect_builder

2- Passing multiple folders, with e2e unit tests:

Without builder:

test_load_dataset_with_dataconfig_and_datafolder
test_run_causallm_ft_and_inference_with_multiple_dataset

With builder:

test_load_dataset_with_dataconfig_and_datafolder
test_load_dataset_with_dataconfig_and_datafolder_incorrect_builder

3- Passing a combination of files and folders, with e2e unit tests:

test_load_dataset_with_datasetconfig_files_folders
test_load_dataset_with_datasetconfig_files_folders_incorrect_format
test_run_causallm_ft_and_inference_with_multiple_dataset

4- Passing files with a pattern, with/without builder:

Supported:

Unit test test_process_dataconfig_multiple_files_folders_with_globbing:

  • With builder: file passed with or without extension
  • Without builder: file passed with extension

Not Supported:

Unit test test_process_dataconfig_multiple_files_folders_without_builder:
File passed via wildcard pattern, without extension and without builder.

  • Without builder and without extension: this is not supported because, when a file path is passed via a pattern without an extension and without a builder, our code cannot infer the builder and therefore passes the path directly to datasets.load_dataset(path). load_dataset then searches huggingface_hub for a dataset with that name, and since it does not accept a huggingface_hub dataset id in the form of a wildcard pattern, it raises an error.
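To make the failure mode concrete, here is a minimal, hypothetical sketch of extension-based builder inference (the mapping and function names are illustrative, not the PR's actual code):

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-builder mapping; the real code may differ.
EXT_TO_BUILDER = {
    ".json": "json", ".jsonl": "json",
    ".csv": "csv", ".parquet": "parquet",
    ".arrow": "arrow", ".txt": "text",
}

def resolve_builder(data_path: str, builder: Optional[str] = None) -> Optional[str]:
    """Return the HF builder to use; None means the path would be handed
    to datasets.load_dataset(path) as-is and treated as a hub dataset id."""
    if builder:
        return builder
    return EXT_TO_BUILDER.get(Path(data_path).suffix.lower())

resolve_builder("data/*.jsonl")          # "json": extension is visible in the pattern
resolve_builder("data/part-*", "json")   # "json": builder given explicitly
resolve_builder("data/part-*")           # None: no extension, no builder -> hub lookup, which rejects wildcards
```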

5- Passing a folder with a pattern, with/without builder:

Supported:

Unit test test_process_dataconfig_multiple_files_folders_with_globbing:

  • Folder passed with builder
  • Folder passed without builder

Not Supported:

Unit test test_process_dataconfig_multiple_files_folders_without_builder:

  • Folder passed via wildcard pattern without builder: HF datasets.load_dataset doesn't accept a directory path as a wildcard.

6- Passing an HF dataset:

test_load_dataset_with_hf_dataset

7- Passing non-existent files and folders:

test_load_dataset_with_non_exist_path

Related issue number

Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1487

Closes PR:
#416
#417

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Abhishek <[email protected]>
@github-actions

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Dec 18, 2024
@Abhishek-TAMU Abhishek-TAMU changed the title feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination [WIP] feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination Dec 18, 2024
@Abhishek-TAMU Abhishek-TAMU changed the title [WIP] feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination Dec 19, 2024
@Abhishek-TAMU Abhishek-TAMU marked this pull request as ready for review December 19, 2024 02:51
@Abhishek-TAMU Abhishek-TAMU requested review from ashokponkumar and willmj and removed request for Ssukriti, aluu317, anhuong and fabianlim December 19, 2024 02:51
Signed-off-by: Abhishek <[email protected]>

@dushyantbehl dushyantbehl left a comment

Overall looks good to me but requesting minor clarifications.

@ashokponkumar requesting you to review this PR too.

class DataSetConfig:
name: str
data_paths: List[str]
builder: Optional[str] = None
Collaborator

Could you please add one line comment here that this is referring to the builder name in HF.

Collaborator Author

Added. Thanks!

builder = kwargs["builder"]
assert isinstance(
builder, str
), f"builder: {builder} should be str with values in (json, text, parquet, arrow)"
Collaborator

do we need to call out the values supported? can we just say builder name for supported HF data formats

Collaborator Author

Makes sense, added it.

with standardized exception handling.
Args:
path: The path argument for load_dataset (could be a directory, file, builder, etc.)
Collaborator

if Path can be a builder why do we need a builder separately...should we rename path here to just data_path and say it is a path be it directory, file, pattern or dataset id.

Collaborator Author

Sure.

except FileNotFoundError as e:
# Handle file/directory not found
context = f"builder {builder}" if builder else f"path {path}"
raise ValueError(f"Data loading failed: invalid {context}.") from e
Collaborator

for the case of builder shouldn't the context still include data path?

Collaborator Author

Makes sense.

all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
Collaborator

can we call this function something else? maybe a name that specifically conveys we are checking whether they are mergeable or not.

Collaborator

Maybe validate_mergeable_datasets?

Collaborator Author

@Abhishek-TAMU Abhishek-TAMU Dec 19, 2024

Based on this discussion here, function validate_mergeable_datasets now just logs warnings and load_dataset allow concatenation of any datasets.
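A warn-only check along those lines might look like the following sketch (illustrative, not the PR's exact code; it only assumes each dataset exposes a column_names attribute, as HF Dataset does):

```python
import logging

logger = logging.getLogger(__name__)

def validate_mergeable_datasets(all_datasets):
    """Log a warning, rather than raising, when datasets to be
    concatenated do not share the same columns."""
    if len(all_datasets) < 2:
        return
    ref_cols = set(all_datasets[0].column_names)
    for i, ds in enumerate(all_datasets[1:], start=1):
        cols = set(ds.column_names)
        if cols != ref_cols:
            logger.warning(
                "Dataset %d has columns %s, which differ from the first "
                "dataset's columns %s; concatenation may not behave as expected.",
                i, sorted(cols), sorted(ref_cols),
            )
```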

raise ValueError(f"data path is invalid [{', '.join(files)}]") from e
except Exception as e:
raise ValueError(
f"An error occurred while concatenating datasets: {e}"
Collaborator

since you already have the dataset config can you add the dataset name to this error...to tell it failed for which dataset definition in the config.

Collaborator Author

Done.

(
[
TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSON,
TWITTER_COMPLAINTS_DATA_DIR_JSON,
Collaborator

Nice test case. Thanks

AssertionError,
ValueError,
datasets.exceptions.DatasetGenerationCastError,
pyarrow.lib.ArrowInvalid,
Collaborator

in which case is pyarrow error thrown?

Collaborator Author

I removed this unit test, as test_process_dataconfig_multiple_files_varied_data_formats will not throw an error now: we are removing the validate_dataset function, so datasets with any columns (varied formats) will get concatenated without error. A successful e2e run with varied formats, though, depends on which handler the user passes. @dushyantbehl

CC: @willmj

return None


def validate_datasets(datasets):
Collaborator

I know it is already tested inside the code path but could we have a testcase for this...a simple one should suffice. Thanks

Collaborator Author

@Abhishek-TAMU Abhishek-TAMU Dec 19, 2024

Based on this discussion here, function validate_mergeable_datasets now just logs warnings and load_dataset allow concatenation of any datasets.

all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
Collaborator

@ashokponkumar ashokponkumar Dec 19, 2024

The concatenate_datasets function itself will throw an error, won't it? Isn't it better to handle this in the except block of that call? Otherwise it involves some unnecessary computation.

Collaborator Author

Based on this discussion here, function validate_mergeable_datasets now just logs warnings and load_dataset allow concatenation of any datasets.

Collaborator

@willmj willmj left a comment

LGTM, can review again after Dushyant's comments

all_datasets.append(dataset)

# Validate all datasets to have same columns
validate_datasets(all_datasets)
Collaborator

Maybe validate_mergeable_datasets?

Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Abhishek <[email protected]>
@dushyantbehl
Collaborator

LGTM!

Collaborator

@ashokponkumar ashokponkumar left a comment

One more nit

return all_datasets[0]

raw_datasets = datasets.concatenate_datasets(all_datasets)
logging.info(
Collaborator

Logger instead of logging

Signed-off-by: Abhishek <[email protected]>
@Abhishek-TAMU
Collaborator Author

Abhishek-TAMU commented Dec 19, 2024

Modified all logging used in any Data preprocessor code to logger. @ashokponkumar @dushyantbehl
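For context, the change described here is the usual module-level-logger pattern; a generic sketch (names are illustrative, not the repository's actual code):

```python
import logging

# A module-level logger tags records with the module name and lets callers
# configure data-preprocessor logging independently of the root logger.
logger = logging.getLogger(__name__)

def log_concatenation(num_datasets: int) -> None:
    # Previously this would have been logging.info(...) on the root logger.
    logger.info("Concatenating %d datasets.", num_datasets)
```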

@dushyantbehl
Collaborator

Modified all logging used in any Data preprocessor code to logger. @ashokponkumar @dushyantbehl

Great Thanks!

ashokponkumar
ashokponkumar previously approved these changes Dec 19, 2024
willmj
willmj previously approved these changes Dec 19, 2024
Collaborator

@willmj willmj left a comment

LGTM! Thanks Abhishek!

@dushyantbehl
Collaborator

@Abhishek-TAMU LGTM too
Please merge once the test cases succeed.

@Abhishek-TAMU Abhishek-TAMU dismissed stale reviews from ashokponkumar and willmj via d275b43 December 19, 2024 20:49
@ashokponkumar ashokponkumar merged commit d7f06f5 into foundation-model-stack:main Dec 19, 2024
8 checks passed
@Abhishek-TAMU Abhishek-TAMU deleted the files_folder_pattern branch December 19, 2024 21:00