[Dataset Performance] Add num workers on dataset processing - labels, tokenization #1189

Merged: dsikka merged 3 commits into main from num-proc-dataset on Feb 25, 2025

@horheynm

SUMMARY:

  • Add preprocessing_num_workers so that dataset processing (tokenization and label addition) runs in parallel in the 2:4 example.

Before:
Tokenizing: 371.12 examples/s,
Adding labels: 1890.18 examples/s,
Tokenizing: 333.39 examples/s

Tokenizing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:34<00:00, 371.12 examples/s]
Adding labels: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:06<00:00, 1890.18 examples/s]
Tokenizing:   9%|█████████▌                                                                                                     | 22077/256032 [00:59<11:41, 333.39 examples/s

After (num_proc=8):
Tokenizing: 2703.93 examples/s,
Adding labels: 5524.98 examples/s,
Tokenizing: 2925.98 examples/s

Tokenizing (num_proc=8): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:04<00:00, 2703.93 examples/s]
Adding labels (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:02<00:00, 5524.98 examples/s]
Tokenizing (num_proc=8): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 256032/256032 [01:27<00:00, 2925.98 examples/s]
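
To put the figures above side by side, the short script below computes the after/before throughput ratio for each stage. It is an illustrative calculation only: the numbers are copied from the progress bars, and the stage names and the `speedups` helper are made up for this sketch. Note that the third "before" rate is a snapshot taken at 9% of the run (0:59 elapsed, 11:41 projected remaining), so its ratio is approximate.

```python
# Throughput in examples/s, copied from the progress bars above.
STAGES = [
    "Tokenizing (12802 examples)",
    "Adding labels (12802 examples)",
    "Tokenizing (256032 examples)",
]
BEFORE = [371.12, 1890.18, 333.39]
AFTER = [2703.93, 5524.98, 2925.98]  # measured with num_proc=8


def speedups(before, after):
    """Return the after/before throughput ratio for each stage."""
    return [a / b for b, a in zip(before, after)]


if __name__ == "__main__":
    for stage, ratio in zip(STAGES, speedups(BEFORE, AFTER)):
        print(f"{stage}: {ratio:.2f}x")
```

Tokenization comes out at roughly 7x or more with 8 processes, while label addition gains about 3x, which is consistent with a cheaper per-example step where process overhead takes a larger share.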

TEST PLAN:

  • Pass existing tests

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@horheynm added the ready label ("When a PR is ready for review") on Feb 25, 2025

@kylesayrs (Collaborator) left a comment

Nice, thanks

@dsikka enabled auto-merge (squash) February 25, 2025 20:33
@dsikka merged commit 77e4f4c into main Feb 25, 2025
7 checks passed
@dsikka deleted the num-proc-dataset branch February 25, 2025 21:40