@romitjain
Collaborator

@romitjain romitjain commented Nov 18, 2025

  1. Added a new DatasetAutoCategorizer class that creates an embedding out of text data and clusters it using KMeans (GPU accelerated)
  2. The above is automatically triggered if ODM is used with a single category (i.e., a dataset dict with a single key/value pair)
  3. To run the tests: python -m pytest tests/test_auto_categorization.py
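
For readers who want a feel for items 1 and 2, the following is a minimal, CPU-only sketch of the idea. It is not the PR's `DatasetAutoCategorizer`: the hashed bag-of-words `embed`, the plain-Python `kmeans`, the `auto_categorize` helper, and the `text` key are all hypothetical stand-ins, whereas the actual class uses a real embedding step and GPU-accelerated KMeans.

```python
import math
import random
import zlib


def embed(text, dim=32):
    """Toy embedding: L2-normalised hashed bag-of-words (stand-in for a real model)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def auto_categorize(dataset_dict, k=2, text_key="text"):
    """Split a single-category dataset dict into up to k cluster-based categories."""
    if len(dataset_dict) != 1:
        return dataset_dict  # categories already provided; nothing to do
    (rows,) = dataset_dict.values()
    labels = kmeans([embed(row[text_key]) for row in rows], k)
    categorized = {}
    for row, lab in zip(rows, labels):
        categorized.setdefault(f"cluster_{lab}", []).append(row)
    return categorized
```

Calling `auto_categorize({"all": rows}, k=4)` on a list of `{"text": ...}` rows returns a dict keyed by hypothetical names such as `cluster_0`, while a dict that already has several keys is returned unchanged.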

@romitjain
Collaborator Author

Okay, I forgot to run the tests, so some linting tests may fail. If they fail, I will fix them!

@kmehant
Collaborator

kmehant commented Nov 24, 2025

@romitjain please fix the tests and let me know

@kmehant kmehant self-requested a review November 24, 2025 07:03
@romitjain
Collaborator Author

@kmehant the builds failed due to an OSError

ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device

Do you know how to fix it?

@kmehant
Collaborator

kmehant commented Nov 24, 2025

Please amend or remove this commit - 0adf9c0

@romitjain
Collaborator Author

@kmehant How would that resolve the OSError?

@kmehant
Collaborator

kmehant commented Nov 24, 2025

@romitjain that's for DCO.

For the no-space error, we need to see how to slim down the runner, either by avoiding the installation of some dependencies or by removing existing files. Do you want to do that as part of this PR, for example by profiling the disk usage?
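
One way to profile the disk on a runner is sketched below, using only the Python standard library. The helper names `dir_size` and `largest_children` are hypothetical and not part of this PR, and which directories are worth inspecting on the CI image is an assumption left to whoever runs it.

```python
import os
import shutil


def dir_size(path):
    """Total size in bytes of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_children(path, top=10):
    """Return (size_bytes, child_path) pairs for the biggest direct children of path."""
    sizes = []
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_size(entry.path), entry.path))
        elif entry.is_file(follow_symlinks=False):
            sizes.append((entry.stat(follow_symlinks=False).st_size, entry.path))
    return sorted(sizes, reverse=True)[:top]


def free_gib(path="/"):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30
```

Printing `free_gib()` before and after the dependency installation step, and `largest_children(...)` for a few candidate directories, would show where the space goes before deciding what to remove.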

Signed-off-by: romit <[email protected]>
Co-authored-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Padmanabha Venkatagiri Seshadri <[email protected]>
@romitjain romitjain force-pushed the feat/odm-without-category branch from 0a846a5 to 931dfe2 Compare November 24, 2025 10:17
@romitjain
Collaborator Author

@kmehant I will figure out why the linting GitHub Actions are not running, but meanwhile, you can review the logic and the PR

  # distributed setup
  dataloader_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
- accelerator = Accelerator(split_batches=True, dataloader_config=dataloader_config)
+ accelerator = Accelerator(dataloader_config=dataloader_config)
Collaborator

What was the reason for removing split_batches=True?

Collaborator Author

DataLoaderConfiguration already includes split_batches. Do we still need to add it to Accelerator?

Signed-off-by: romit <[email protected]>
@romitjain romitjain requested a review from kmehant November 25, 2025 08:22
tokenizer: Optional[Any] = None


class DatasetAutoCategorizer:
Collaborator

Any thoughts on supporting when the dataset is iterable?

Collaborator Author

Good catch. At least for the current implementation, no. Clustering requires all the data to be in memory, so with iterable datasets, we would need to fetch all the records first and then run clustering.

This auto-categorization is only suitable for smaller datasets (under a million records).

Collaborator

Sure. Do you want to raise an error stating that iterable datasets are currently not supported when the dataset is of that type?
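
A guard along these lines could be sketched as follows. The function name `ensure_map_style` is hypothetical, and the check rests on the assumption that a streamed (iterable) dataset does not expose `__len__`; an explicit `isinstance` check against the dataset library's iterable class may be more robust in the actual code.

```python
from collections.abc import Sized


def ensure_map_style(dataset):
    """Raise if the dataset cannot report its length (assumed to mean it is streamed)."""
    if not isinstance(dataset, Sized):
        raise TypeError(
            "Auto-categorization needs the full dataset in memory; "
            f"iterable dataset of type {type(dataset).__name__} is not supported."
        )
    return dataset
```

Calling this at the start of the categorizer would fail fast with a clear message instead of exhausting a stream or running out of memory partway through clustering.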

@romitjain romitjain requested a review from kmehant November 26, 2025 03:49
@kmehant kmehant merged commit a741e5d into foundation-model-stack:main Nov 26, 2025
8 checks passed

2 participants