feat: ODM without categories #162

romitjain · 2025-11-18T09:51:25Z

Added a new DatasetAutoCategorizer class that embeds the text data and clusters the embeddings using KMeans (GPU accelerated).
Auto-categorization is triggered automatically when ODM is used with a single category (i.e., a dataset dict with a single key/value pair).
To run the tests: python -m pytest tests/test_auto_categorization.py
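
A minimal sketch of the embed-then-cluster idea described above, using only the standard library. It is not the code in this PR: the hashed bag-of-words `embed`, the plain-Python `kmeans`, the `auto_categorize` helper, the `text_field` parameter, and the `_cluster_N` key naming are all hypothetical stand-ins for illustration. The actual class presumably uses a real embedding model and an accelerated KMeans.

```python
import hashlib
import math
import random


def embed(text, dim=16):
    """Toy stand-in for a sentence embedding: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Stable hash so results do not depend on PYTHONHASHSEED.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def auto_categorize(dataset_dict, text_field="text", k=2):
    """Split a single-category dataset dict into up to k pseudo-categories."""
    if len(dataset_dict) != 1:
        return dataset_dict  # categories already provided; nothing to do
    ((name, rows),) = dataset_dict.items()
    labels = kmeans([embed(row[text_field]) for row in rows], k)
    out = {}
    for row, label in zip(rows, labels):
        out.setdefault(f"{name}_cluster_{label}", []).append(row)
    return out
```

Calling `auto_categorize({"all": rows}, k=2)` on a list of `{"text": ...}` rows returns a dict with at most two keys, while a dict that already has several keys is passed through untouched, mirroring the single-category trigger described above.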

romitjain · 2025-11-18T09:53:50Z

Okay, I forgot to run the tests, so some linting tests may fail. If they fail, I will fix them!

kmehant · 2025-11-24T07:03:49Z

@romitjain please fix the tests and let me know

romitjain · 2025-11-24T08:33:55Z

@kmehant the builds failed due to an OSError

ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device

Do you know how to fix it?

kmehant · 2025-11-24T08:37:36Z

Please amend or remove this commit - 0adf9c0

romitjain · 2025-11-24T08:45:00Z

@kmehant How would that resolve the OSError?

kmehant · 2025-11-24T08:53:16Z

@romitjain that's for DCO.

For the no-space error, we need to see how to slim down the runner, either by avoiding installation of some dependencies or by removing existing files. Do you want to do that as part of this PR, for example by profiling the disk?
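
One way the disk profiling mentioned above could be approached is sketched below, using only the standard library. The helper names `dir_size_bytes` and `largest_subdirs` are hypothetical, and which directories are worth scanning on the CI runner is an assumption that would need checking against the actual image.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size of regular files under path, skipping unreadable entries."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total


def largest_subdirs(path, top=5):
    """Return up to `top` (size_in_bytes, path) pairs for immediate subdirectories."""
    sizes = []
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_size_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print(f"free: {usage.free / 2**30:.1f} GiB of {usage.total / 2**30:.1f} GiB")
    # The home directory is only an example starting point.
    for size, path in largest_subdirs(os.path.expanduser("~")):
        print(f"{size / 2**20:10.1f} MiB  {path}")
```

Running this as an early step in the workflow would show how much space is free before dependencies are installed and which directories are the best candidates for removal.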

romitjain · 2025-11-24T10:19:14Z

@kmehant I will figure out why the linting GitHub Actions are not running. Meanwhile, you can review the logic and the PR.

plugins/online-data-mixing/pyproject.toml

kmehant · 2025-11-24T16:21:32Z

plugins/online-data-mixing/artifacts/custom_loop_usage.py

 # distributed setup
 dataloader_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dataloader_config=dataloader_config)
+accelerator = Accelerator(dataloader_config=dataloader_config)


What was the reason to remove split_batches=True?

DataLoaderConfiguration already includes split_batches. Do we still need to add it to Accelerator?

kmehant · 2025-11-25T10:28:38Z

plugins/online-data-mixing/src/fms_acceleration_odm/odm/auto_categorizer.py

+    tokenizer: Optional[Any] = None
+
+
+class DatasetAutoCategorizer:


Any thoughts on supporting when the dataset is iterable?

Good catch. At least for the current implementation, no. Clustering requires all the data to be in memory, so with iterable datasets we would need to fetch all the records first and then run clustering.

This auto-categorization is only suitable for smaller datasets (sub-million).

Sure. Do you want to raise an error stating that iterable datasets are currently not supported when the dataset is of that type?
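
A rough sketch of the kind of guard being discussed here, assuming nothing about how the PR actually implements it. The function name `ensure_map_style` and the choice of `NotImplementedError` are hypothetical. The check treats any object without a length or index access as iterable-only, whereas the real code may instead test against a specific dataset class.

```python
from collections.abc import Sized


def ensure_map_style(dataset):
    """Raise if the dataset cannot report its length or be indexed."""
    if not isinstance(dataset, Sized) or not hasattr(dataset, "__getitem__"):
        raise NotImplementedError(
            "Auto-categorization needs the full dataset in memory; "
            "iterable (streaming) datasets are not supported."
        )
    return dataset
```

A list of records passes through unchanged, while a generator or other streaming source fails fast before any embedding work starts.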

romitjain mentioned this pull request Nov 24, 2025

feat: ODM without categories integration with fms-accel foundation-model-stack/fms-hf-tuning#641

Merged

2 tasks

kmehant self-requested a review November 24, 2025 07:03

ODM without categories

931dfe2

Signed-off-by: romit <[email protected]> Co-authored-by: Mehant Kammakomati <[email protected]> Co-authored-by: Padmanabha Venkatagiri Seshadri <[email protected]>

romitjain force-pushed the feat/odm-without-category branch from 0a846a5 to 931dfe2 Compare November 24, 2025 10:17

romitjain added 2 commits November 24, 2025 14:04

Removed cuml

1e7a679

Signed-off-by: romit <[email protected]>

Making ci cd gods happy

9a24acb

Signed-off-by: romit <[email protected]>

kmehant reviewed Nov 24, 2025

this is it

ec9e831

Signed-off-by: romit <[email protected]>

romitjain requested a review from kmehant November 25, 2025 08:22

kmehant reviewed Nov 25, 2025

Added iterable dataset error

73b3497

Signed-off-by: romit <[email protected]>

romitjain requested a review from kmehant November 26, 2025 03:49

kmehant approved these changes Nov 26, 2025

kmehant merged commit a741e5d into foundation-model-stack:main Nov 26, 2025
8 checks passed
