@willmj
Collaborator

@willmj willmj commented Dec 12, 2024

Description of the change

Changes how the data paths are processed,
from files = datasetconfig.data_paths to:

            files = []
            for path in datasetconfig.data_paths:
                if os.path.isdir(path):
                    # If the path is a folder, collect all files within it
                    folder_files = [
                        os.path.join(path, file)
                        for file in os.listdir(path)
                        if os.path.isfile(os.path.join(path, file))
                    ]
                    files.extend(folder_files)
                else:
                    files.append(path)

To be rebased on top of #412

Related issue number

How to verify the PR

Run the unit tests, or run training with a data config whose data_paths points to a data folder.
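
A minimal sketch of such a check, assuming the loop from the description is factored into a helper. The names expand_data_paths and check_folder_expansion are hypothetical and not part of this PR, and sorted() is added here only to make the result order deterministic, since os.listdir returns entries in arbitrary order.

```python
import os
import tempfile


def expand_data_paths(data_paths):
    """Hypothetical helper mirroring the loop in this PR.

    Folders are expanded to the regular files directly inside them;
    any other path is kept unchanged.
    """
    files = []
    for path in data_paths:
        if os.path.isdir(path):
            # sorted() is an addition for reproducible ordering in tests.
            files.extend(
                os.path.join(path, name)
                for name in sorted(os.listdir(path))
                if os.path.isfile(os.path.join(path, name))
            )
        else:
            files.append(path)
    return files


def check_folder_expansion():
    """Build a throwaway data folder and verify how it is expanded."""
    with tempfile.TemporaryDirectory() as folder:
        for name in ("a.jsonl", "b.jsonl"):
            with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
                f.write('{"text": "hello"}\n')
        # Sub-folders are not regular files, so they are skipped.
        os.mkdir(os.path.join(folder, "nested"))

        result = expand_data_paths([folder, "standalone.json"])
        expected = [
            os.path.join(folder, "a.jsonl"),
            os.path.join(folder, "b.jsonl"),
            "standalone.json",
        ]
        return result == expected
```

Note that the expansion is not recursive: files inside nested sub-folders are not collected.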

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Dec 12, 2024
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
@willmj willmj changed the title feat: [WIP] data folder processing in datapreprocessor feat: data folder processing in datapreprocessor Dec 13, 2024
@willmj willmj marked this pull request as ready for review December 13, 2024 19:32
Collaborator

@Abhishek-TAMU Abhishek-TAMU left a comment

Thanks for the PR @willmj. We might need to support files/folder as per our discussion here.

Comment on lines 88 to 99
            files = []
            for path in datasetconfig.data_paths:
                if os.path.isdir(path):
                    # If the path is a folder, collect all files within it
                    folder_files = [
                        os.path.join(path, file)
                        for file in os.listdir(path)
                        if os.path.isfile(os.path.join(path, file))
                    ]
                    files.extend(folder_files)
                else:
                    files.append(path)
Collaborator

Based on this discussion and comments by @dushyantbehl, we are looking to support files/folder like this:

  • if extension is found then use that as the loader
  • if it is a folder then pass the folder directly
  • else fallback on the hf dataset id

One reason, as Ashok discussed here, is that glob.glob or os.listdir can be a performance bottleneck, because each one makes an extra pass over the files in the folder. Passing the folder directly avoids that.

As mentioned here for datasets.load_dataset, you can pass the directory path directly.
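
The three rules above could be sketched roughly as follows. The function name resolve_load_kwargs, the returned "kind" labels, and the extension-to-loader table are assumptions for illustration, not the project's actual code. The returned keyword arguments are intended to be passed on to datasets.load_dataset, which this sketch deliberately does not import or call.

```python
import os

# Assumed mapping from file extension to a Hugging Face `datasets`
# builder name; the real project may support a different set.
EXTENSION_TO_LOADER = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".txt": "text",
}


def resolve_load_kwargs(data_path):
    """Return (kind, kwargs), where kwargs can go to datasets.load_dataset.

    Rules, in order:
      1. known extension -> use the matching loader with data_files
      2. existing folder -> pass the folder path directly (no listing)
      3. anything else   -> treat the string as a HF dataset ID
    """
    ext = os.path.splitext(data_path)[1].lower()
    if ext in EXTENSION_TO_LOADER:
        return "file", {"path": EXTENSION_TO_LOADER[ext], "data_files": data_path}
    if os.path.isdir(data_path):
        # No os.listdir or glob here: the library resolves the contents.
        return "folder", {"path": data_path}
    return "hf_dataset_id", {"path": data_path}
```

A caller would then do something like `kind, kwargs = resolve_load_kwargs(path)` followed by `datasets.load_dataset(**kwargs)`. Only a single isdir check touches the filesystem, so no per-file iteration happens on our side.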

Collaborator Author

@Abhishek-TAMU thanks for the explanation! I missed this thread; that makes sense. I pushed some changes so that it works when the user passes a single folder. I have some additional questions about how we plan to support this now, which might be better answered by @ashokponkumar and @dushyantbehl:

  • Are we assuming the user will only pass 1 folder if a folder is passed?
  • If not, are we assuming the user will only pass folders or files and not a combination?
  • Does our current implementation work with the HF dataset ID or does additional functionality need to be added for this?

Collaborator

@willmj Thanks for the PR.
Yes, the user will pass just one folder per dataset. We will support only one (for simplicity), as HF seems to support only one.

The user has to pass either a folder or files. You can assume that if a single path is specified, it can be checked with isfile or isdir.

I don't see any reason why our code won't be able to handle an HF dataset ID, so supporting that would be great!
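
The constraints in this reply could be enforced with a small check along these lines. This is only a sketch: classify_data_paths and its return labels are hypothetical, and treating a single path that is neither a file nor a folder as a HF dataset ID is an assumption drawn from the fallback rule discussed earlier in the thread.

```python
import os


def classify_data_paths(data_paths):
    """Hypothetical validation of a dataset's data_paths.

    Allowed shapes: exactly one folder, one or more files, or a single
    string that is neither (assumed here to be a HF dataset ID).
    """
    if not data_paths:
        raise ValueError("data_paths must contain at least one entry")

    if len(data_paths) == 1:
        path = data_paths[0]
        if os.path.isdir(path):
            return "folder"
        if os.path.isfile(path):
            return "files"
        # Assumption: not on disk, so fall back to a HF dataset ID.
        return "dataset_id"

    # Several paths: folders are rejected, since only one folder per
    # dataset is supported and folders cannot be mixed with files.
    if any(os.path.isdir(p) for p in data_paths):
        raise ValueError(
            "pass either a single folder or a list of files, not a mix"
        )
    return "files"
```

Failing early with a clear error keeps the "one folder per dataset" rule visible to the user before any loading is attempted.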

Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
