Allow non-Tensor values in a batch with dispatch_batches=True #3850

Merged
SunMarc merged 3 commits into huggingface:main from tomaarsen:feat/dispatch_batches_non_tensor_samples on Nov 26, 2025

Conversation

@tomaarsen (Member)

Resolves #3849

What does this PR do?

  • Allow non-Tensor values in a batch with dispatch_batches=True, matching the behaviour when dispatch_batches=False or when accelerate is not used

Details

Rerunning the script from #3849 now gives the following output for the previously broken case as well:

Accelerator, with IterableDataset

Batch:
query_input_ids: <class 'torch.Tensor'> with shape torch.Size([4, 13])
query_token_type_ids: <class 'torch.Tensor'> with shape torch.Size([4, 13])
query_attention_mask: <class 'torch.Tensor'> with shape torch.Size([4, 13])
query_str_parameter: <class 'str'> parameter_value
query_bool_parameter: <class 'bool'> True
query_str_list: <class 'list'> ['list_item_1', 'list_item_2']
answer_input_ids: <class 'torch.Tensor'> with shape torch.Size([4, 328])
answer_token_type_ids: <class 'torch.Tensor'> with shape torch.Size([4, 328])
answer_attention_mask: <class 'torch.Tensor'> with shape torch.Size([4, 328])
answer_str_parameter: <class 'str'> parameter_value
answer_bool_parameter: <class 'bool'> True
answer_str_list: <class 'list'> ['list_item_1', 'list_item_2']

Batch:
query_input_ids: <class 'torch.Tensor'> with shape torch.Size([1, 11])
query_token_type_ids: <class 'torch.Tensor'> with shape torch.Size([1, 11])
query_attention_mask: <class 'torch.Tensor'> with shape torch.Size([1, 11])
query_str_parameter: <class 'str'> parameter_value
query_bool_parameter: <class 'bool'> True
query_str_list: <class 'list'> ['list_item_1', 'list_item_2']
answer_input_ids: <class 'torch.Tensor'> with shape torch.Size([1, 164])
answer_token_type_ids: <class 'torch.Tensor'> with shape torch.Size([1, 164])
answer_attention_mask: <class 'torch.Tensor'> with shape torch.Size([1, 164])
answer_str_parameter: <class 'str'> parameter_value
answer_bool_parameter: <class 'bool'> True
answer_str_list: <class 'list'> ['list_item_1', 'list_item_2']

As expected!
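For context, here is a minimal sketch of the kind of reproduction script described in #3849 (hypothetical: the dataset, collator, and field names below are illustrative stand-ins, not the original script, and it assumes a recent accelerate with the DataLoaderConfiguration API for enabling dispatch_batches):

import torch
from torch.utils.data import DataLoader, IterableDataset

from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration


class ToyIterableDataset(IterableDataset):
    # Hypothetical stand-in for the query/answer dataset in #3849.
    def __iter__(self):
        for _ in range(8):
            yield {"text": "some example text"}


def collate_fn(features):
    # The collator mixes Tensors with plain Python values (str, bool, list),
    # which is the case dispatch_batches=True previously rejected.
    return {
        "input_ids": torch.ones(len(features), 13, dtype=torch.long),
        "str_parameter": "parameter_value",
        "bool_parameter": True,
        "str_list": ["list_item_1", "list_item_2"],
    }


accelerator = Accelerator(dataloader_config=DataLoaderConfiguration(dispatch_batches=True))
dataloader = accelerator.prepare(
    DataLoader(ToyIterableDataset(), batch_size=4, collate_fn=collate_fn)
)

for batch in dataloader:
    print("Batch:")
    for key, value in batch.items():
        print(f"{key}: {type(value)} {getattr(value, 'shape', value)}")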

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@BenjaminBossan @SunMarc

  • Tom Aarsen

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc (Member) left a comment

Thanks for this! I left a comment, let me know if it's unclear.

-    return torch.cat(data, dim=dim)
+    elif isinstance(data[0], torch.Tensor):
+        return torch.cat(data, dim=dim)
+    return data[0]
@SunMarc (Member) · Nov 25, 2025

This is a fix that only works when we have one GPU (only one batch is passed to concatenate). The issue happens when we have multiple GPUs: for 2 batches we might have the following situation:
[{'key1': ["str1", "str2"]}, {'key1': ["str3", "str4"]}], and we would get the following result:
{'key1': ["str1", "str2"]}. Not sure if this is what we want, unless str1 == str3 and str2 == str4.

For now, what we can do is check whether len(data) >= 2 when the data is not a tensor, list, or mapping. If it is 1, we return data[0] as before; otherwise we raise an error saying that we can only concatenate tensors.

Also, can you add some simple tests for concatenate?
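To make the failure mode concrete, here is a toy re-implementation of the relevant branches (not accelerate's actual code) showing how the initial fix would silently drop the second process's values:

import torch


def toy_concatenate(data, dim=0):
    # Mirrors the branch structure under discussion: recurse into lists/tuples
    # and dicts, concatenate Tensors, and otherwise fall back to data[0].
    if isinstance(data[0], (tuple, list)):
        return type(data[0])(toy_concatenate([d[i] for d in data], dim=dim) for i in range(len(data[0])))
    elif isinstance(data[0], dict):
        return {k: toy_concatenate([d[k] for d in data], dim=dim) for k in data[0]}
    elif isinstance(data[0], torch.Tensor):
        return torch.cat(data, dim=dim)
    return data[0]


# Two per-process batches, as in the multi-GPU scenario above:
batches = [{"key1": ["str1", "str2"]}, {"key1": ["str3", "str4"]}]
print(toy_concatenate(batches))
# {'key1': ['str1', 'str2']} -- "str3" and "str4" are silently lost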

@tomaarsen (Member, Author)

Damn, you're right. Apologies, I was thinking that the elif isinstance(data[0], Mapping) branch would take care of it, but the recursive call with ['str1', 'str3'] and ['str2', 'str4'] will result in 'str1' and 'str2', dropping the other two.

I think the len(data) check is smart; I'll incorporate it and run some tests.

@SunMarc (Member)

Sounds good! Yeah, this part is quite tricky, so I prefer to be extra cautious.

@tomaarsen (Member, Author)

Suggested change
-    return data[0]
+    elif isinstance(data, (tuple, list)) and len(data) == 1:
+        return data[0]
+    else:
+        raise TypeError(f"Can only concatenate tensors but got {type(data[0])}")

The suggestion here works fine in single-process settings, but len(data) is simply equal to the number of processes. In short: it will always fail in multi-process settings. That tells me that perhaps it's simply not viable to pass string parameters from tokenization or collation to the model during training in this way. Bools are simpler: I can just turn those into a singleton bool tensor that gets concatenated.
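A minimal sketch of that bool workaround (the collator emits a singleton bool tensor instead of a Python bool, so the per-process values concatenate like any other tensor):

import torch

# One value per process; torch.cat merges them where a plain Python bool could not be concatenated.
per_process_values = [torch.tensor([True]), torch.tensor([True])]
print(torch.cat(per_process_values, dim=0))  # tensor([True, True])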

Perhaps we should leave this PR be?

@SunMarc (Member)

Still, I think it might be worth making it work for single-process, no? If you think this will create more issues, then we can leave this PR be.
We can add a note to the docstring that if we receive only one batch of data, we return it as-is.
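A sketch of the kind of docstring note suggested here (illustrative wording, not the exact text added in the PR):

def concatenate(data, dim=0):
    """
    Recursively concatenate the tensors in a nested list/tuple/dictionary of lists of tensors with the same shape.

    Note:
        If `data` contains a single batch of non-Tensor values, that batch is returned as-is;
        concatenating non-Tensor values across multiple batches is not supported.
    """
    ...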

@tomaarsen (Member, Author)

Agreed, it's a step in the right direction, even though multi-process support is not possible. We can still merge it. I've also added some tests and updated the docstring slightly.

I do think I'll have to reconsider some things in Sentence Transformers, e.g. whether I want to use string values in my batches to pass parameters during training. Not supporting IterableDataset + multi-GPU is a bit annoying.
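For reference, a sketch of what such tests might look like (hypothetical test names; the actual tests live in the PR). It assumes accelerate.utils.concatenate and the single-process behaviour discussed above: a lone batch with non-Tensor values is returned as-is, while multiple batches with non-Tensor values raise a TypeError.

import pytest
import torch

from accelerate.utils import concatenate


def test_concatenate_single_batch_with_non_tensor_values():
    # A single batch (e.g. a single process) should pass non-Tensor values through unchanged.
    batch = {"input_ids": torch.ones(2, 4), "str_parameter": "parameter_value"}
    result = concatenate([batch])
    assert torch.equal(result["input_ids"], batch["input_ids"])
    assert result["str_parameter"] == "parameter_value"


def test_concatenate_multiple_batches_with_non_tensor_values_raises():
    # With several batches (e.g. multiple processes), non-Tensor values cannot be merged.
    with pytest.raises(TypeError):
        concatenate([{"str_parameter": "a"}, {"str_parameter": "b"}])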

@SunMarc (Member) left a comment

Awesome, thanks for adding these nice tests

@SunMarc merged commit b521400 into huggingface:main on Nov 26, 2025
25 checks passed

Development

Successfully merging this pull request may close these issues.

DataLoaderDispatcher doesn't accept non-Tensor values from the data collator
