feat: data folder processing in datapreprocessor #417

willmj · 2024-12-12T20:41:04Z

Description of the change

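Changes how the data paths are resolved. Previously the processor used files = datasetconfig.data_paths directly; now each folder in data_paths is expanded into the files it contains: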

            files = []
            for path in datasetconfig.data_paths:
                if os.path.isdir(path):
                    # If the path is a folder, collect all files within it
                    folder_files = [
                        os.path.join(path, file)
                        for file in os.listdir(path)
                        if os.path.isfile(os.path.join(path, file))
                    ]
                    files.extend(folder_files)
                else:
                    files.append(path)
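
For reference, the same logic can be exercised on its own as a small helper. This is only a sketch: collect_files is a hypothetical name used for illustration, not a function in the repository, and the expansion is non-recursive, as in the snippet above.

```python
import os


def collect_files(data_paths):
    """Expand any folder in data_paths into the files directly inside it."""
    files = []
    for path in data_paths:
        if os.path.isdir(path):
            # Non-recursive: only regular files at the top level of the folder.
            files.extend(
                os.path.join(path, name)
                for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            )
        else:
            # A file path (or any path that is not a folder) is kept as-is.
            files.append(path)
    return files
```

Note that os.listdir returns entries in arbitrary order, so callers that need a stable file order would have to sort the result themselves.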

To be rebased on top of #412

Related issue number

How to verify the PR

Run the unit tests, or run training with a data config that passes a data folder as the data_paths.

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

Signed-off-by: Will Johnson <[email protected]>

github-actions · 2024-12-12T20:41:19Z

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.


Abhishek-TAMU

Thanks for the PR @willmj. We might need to support files/folder as per our discussion here.

Abhishek-TAMU · 2024-12-13T20:07:49Z

tuning/data/data_processors.py

+            files = []
+            for path in datasetconfig.data_paths:
+                if os.path.isdir(path):
+                    # If the path is a folder, collect all files within it
+                    folder_files = [
+                        os.path.join(path, file)
+                        for file in os.listdir(path)
+                        if os.path.isfile(os.path.join(path, file))
+                    ]
+                    files.extend(folder_files)
+                else:
+                    files.append(path)


Based on this discussion and comments by @dushyantbehl, we are looking to support files and folders like this:

If an extension is found, use it to select the loader.

If the path is a folder, pass the folder directly.

Otherwise, fall back on the HF dataset ID.
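
A minimal sketch of that resolution order, assuming a hypothetical helper resolve_source and a hypothetical EXT_TO_LOADER table (neither is taken from the repository, and the set of extensions is only illustrative):

```python
import os

# Hypothetical mapping from file extension to a loader name.
EXT_TO_LOADER = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}


def resolve_source(path):
    """Classify a data path using the order: extension, folder, HF dataset ID."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXT_TO_LOADER:
        # 1. A known extension selects the loader.
        return ("file", EXT_TO_LOADER[ext])
    if os.path.isdir(path):
        # 2. A folder is passed through directly, without listing its contents.
        return ("folder", path)
    # 3. Anything else is treated as a Hugging Face dataset ID.
    return ("hf_dataset_id", path)
```

The folder branch costs a single os.path.isdir call, so nothing in this sketch walks the directory.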

One reason, as discussed by Ashok here, is that glob.glob or os.listdir can be a performance bottleneck, since each has to iterate through every file in the folder once. Passing the folder directly lets us avoid that.

As mentioned here for datasets.load_dataset, you can pass the directory path directly.
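
As a rough illustration of passing a directory straight through, the sketch below builds the keyword arguments for datasets.load_dataset without listing the folder. The helper names folder_load_kwargs and load_folder and the split="train" default are assumptions made for this example; path and split are existing load_dataset parameters.

```python
import os


def folder_load_kwargs(folder, split="train"):
    """Build load_dataset keyword arguments for a local data folder."""
    if not os.path.isdir(folder):
        raise NotADirectoryError(f"not a folder: {folder}")
    # No os.listdir / glob.glob here: file discovery is left to the library.
    return {"path": folder, "split": split}


def load_folder(folder, split="train"):
    """Load the data files in `folder` as a single dataset."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**folder_load_kwargs(folder, split))
```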

willmj · Dec 13, 2024

@Abhishek-TAMU thanks for the explanation! I missed this thread; that makes sense. I pushed up some changes so that it works when the user passes a single folder. I have some additional questions about how we plan to support this now, which might be better answered by @ashokponkumar and @dushyantbehl:

Are we assuming the user will only pass 1 folder if a folder is passed?

If not, are we assuming the user will only pass folders or files and not a combination?

Does our current implementation work with the HF dataset ID or does additional functionality need to be added for this?

dushyantbehl · Dec 17, 2024

@willmj Thanks for the PR.
Yes, the user will pass just one folder per dataset. We will support only one (for simplicity), as HF seems to support only one.

The user has to pass either a folder or files. You can assume that if a single path is specified, it can be checked with an isfile or isdir check.

I don't see any reason why our code won't be able to handle an HF dataset ID, so supporting that would be great!
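
The rules in this reply could be checked up front with something like the sketch below. classify_data_paths is a hypothetical helper, and treating a single path that is neither a file nor a folder as an HF dataset ID is an assumption based on the fallback discussed earlier in this thread.

```python
import os


def classify_data_paths(data_paths):
    """Return "folder", "files", or "hf_dataset_id" for one dataset's data_paths."""
    if not data_paths:
        raise ValueError("data_paths must not be empty")
    if len(data_paths) == 1:
        path = data_paths[0]
        if os.path.isdir(path):
            return "folder"
        if os.path.isfile(path):
            return "files"
        # Assumption: a single path that is not on disk is an HF dataset ID.
        return "hf_dataset_id"
    # Several paths: they must all be files, since only one folder per
    # dataset is supported and folders cannot be mixed with files.
    if any(os.path.isdir(path) for path in data_paths):
        raise ValueError("pass either a single folder or a list of files")
    return "files"
```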


willmj added 3 commits December 11, 2024 10:21

feat: first pass at data folder functionality

e76f798

Signed-off-by: Will Johnson <[email protected]>

fix: remove files set after condition

7f5451d

Signed-off-by: Will Johnson <[email protected]>

fix: remove files set after condition

7f9fff1

Signed-off-by: Will Johnson <[email protected]>

github-actions bot added the feat label Dec 12, 2024

willmj added 3 commits December 13, 2024 09:52

test: for data folder

922856d

Signed-off-by: Will Johnson <[email protected]>

merge: branch 'main' into data folder branch

71dcc5c

Signed-off-by: Will Johnson <[email protected]>

fmt

2fe3304

Signed-off-by: Will Johnson <[email protected]>

willmj changed the title ~~feat: [WIP] data folder processing in datapreprocessor~~ feat: data folder processing in datapreprocessor Dec 13, 2024

willmj marked this pull request as ready for review December 13, 2024 19:32

willmj requested review from Ssukriti, aluu317, anhuong, fabianlim and kmehant as code owners December 13, 2024 19:32

Abhishek-TAMU reviewed Dec 13, 2024


willmj added 2 commits December 13, 2024 16:04

fix: loading folder

a9b2644

Signed-off-by: Will Johnson <[email protected]>

fix: assume only 1 dataset

865c5fa

Signed-off-by: Will Johnson <[email protected]>

Abhishek-TAMU mentioned this pull request Dec 18, 2024

feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination #424

Merged

2 tasks

willmj closed this Dec 19, 2024
