-
Notifications
You must be signed in to change notification settings - Fork 19
feat: ODM without categories #162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ODM without categories #162
Conversation
|
Okay, I forgot to run the tests, so some linting tests may fail. If it fails, I will fix them! |
|
@romitjain please fix the tests and let me know |
|
@kmehant the builds failed due to an OSError Do you know how to fix it? |
|
Please amend or remove this commit - 0adf9c0 |
|
@kmehant How would that resolve the OSError? |
|
@romitjain thats for DCO. For no space, we need to see how to make the runner slim if we can avoid installation of any dependencies or removing existing files. Do you want to do it as part of this PR like profiling the disk? |
Signed-off-by: romit <[email protected]> Co-authored-by: Mehant Kammakomati <[email protected]> Co-authored-by: Padmanabha Venkatagiri Seshadri <[email protected]>
0a846a5 to
931dfe2
Compare
|
@kmehant I will figure out why the linting github actions are not running, but meanwhile, you can review the logic and PR |
Signed-off-by: romit <[email protected]>
Signed-off-by: romit <[email protected]>
| # distributed setup | ||
| dataloader_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True) | ||
| accelerator = Accelerator(split_batches=True, dataloader_config=dataloader_config) | ||
| accelerator = Accelerator(dataloader_config=dataloader_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the reason to remove split_batches=True ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataLoaderConfiguration already includes split_batches. Do we still need to add it to Accelerator?
Signed-off-by: romit <[email protected]>
| tokenizer: Optional[Any] = None | ||
|
|
||
|
|
||
| class DatasetAutoCategorizer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on supporting when the dataset is iterable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, at least for the current implementation - no. Clustering would require all the data to be in memory, so with iterable datasets, we would need to fetch all the records and then run clustering.
This auto categorization is only suitable for smaller datasets (sub million).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, do you want to raise an error or something that iterable is currently not supported if the dataset is of that type?
Signed-off-by: romit <[email protected]>
DatasetAutoCategorizerclass that creates an embedding out of text data and clusters it using KMeans (GPU accelerated)python -m pytest tests/test_auto_categorization.py