-
Notifications
You must be signed in to change notification settings - Fork 65
feat: data folder processing in datapreprocessor #417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
|
Thanks for making a pull request! 😃 |
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Abhishek-TAMU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @willmj. We might need to support files/folder as per our discussion here.
tuning/data/data_processors.py
Outdated
| files = [] | ||
| for path in datasetconfig.data_paths: | ||
| if os.path.isdir(path): | ||
| # If the path is a folder, collect all files within it | ||
| folder_files = [ | ||
| os.path.join(path, file) | ||
| for file in os.listdir(path) | ||
| if os.path.isfile(os.path.join(path, file)) | ||
| ] | ||
| files.extend(folder_files) | ||
| else: | ||
| files.append(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on this discussion and comments by @dushyantbehl, we are looking to support files/folder like this:
- if
extensionis found then use that as theloader - if it is a folder then pass the folder directly
- else fallback on the hf dataset id
One reason is as discussed by Ashok here that glob.glob OR os.listdir can be a performance bottleneck as it iterate through files in a folder once, hence we can avoid that.
As mentioned here in datasets.load_dataset, you can directly pass the directory path here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Abhishek-TAMU thanks for the explanation! I missed this thread, that makes sense. I pushed up some changes to have it work if the user is passing a single folder. I have some additional questions about how we plan to support this now, which might be better answered by @ashokponkumar and @dushyantbehl:
- Are we assuming the user will only pass 1 folder if a folder is passed?
- If not, are we assuming the user will only pass folders or files and not a combination?
- Does our current implementation work with the HF dataset ID or does additional functionality need to be added for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@willmj Thanks for the PR
Yes the user will pass just one folder per dataset we will support only 1 (for simplicity) as HF seems to support only 1.
User has to pass either a folder or files, you can assume that if a single path is specified it can be checked with a isfile or isdir checks
I don't see any reason why our code won't be able to handle a HF dataset ID so supporting that would be great!
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Description of the change
Changes the way data is processed:
from
files = datasetconfig.data_pathstoTo be rebased on top of #412
Related issue number
How to verify the PR
Unit tests or run training passing in a data config with a data folder as the
data_pathsWas the PR tested