
Conversion of Donut (vision-encoder-decoder type of model) fails if we don't pass the --task argument to the conversion script #1453

@felladrin

System Info

Transformers.js v3.7.6 running on Linux.
The issue is related to the conversion script only, which runs on Python 3.12.

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

As discussed in this thread on huggingface.co/spaces/onnx-community/convert-to-onnx, the conversion of Norm/nougat-latex-base works fine when passing --task image-to-text.

But if we run the script without the --task argument, the conversion fails with the following message:

```
Conversion failed: Config of the encoder: <class 'transformers.models.donut.modeling_donut_swin.DonutSwinModel'> is overwritten by shared encoder config: DonutSwinConfig {
  "attention_probs_dropout_prob": 0.0,
  "depths": [2, 2, 14, 2],
  "drop_path_rate": 0.1,
  "embed_dim": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 1024,
  "image_size": [224, 560],
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-05,
  "mlp_ratio": 4.0,
  "model_type": "donut-swin",
  "num_channels": 3,
  "num_heads": [4, 8, 16, 32],
  "num_layers": 4,
  "patch_size": 4,
  "qkv_bias": true,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "use_absolute_embeddings": false,
  "window_size": 7
}

Config of the decoder: <class 'transformers.models.mbart.modeling_mbart.MBartForCausalLM'> is overwritten by shared decoder config: MBartConfig {
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_cross_attention": true,
  "add_final_layer_norm": true,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 10,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "init_std": 0.02,
  "is_decoder": true,
  "is_encoder_decoder": false,
  "max_length": 800,
  "max_position_embeddings": 4096,
  "model_type": "mbart",
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "scale_embedding": true,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "vocab_size": 50000
}

Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if width % self.patch_size[1] != 0:
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:268: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if height % self.patch_size[0] != 0:
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:575: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if min(input_resolution) <= self.window_size:
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:669: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  was_padded = pad_values[3] > 0 or pad_values[5] > 0
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:670: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if was_padded:
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:307: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  should_pad = (height % 2 == 1) or (width % 2 == 1)
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:308: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if should_pad:
/usr/local/lib/python3.12/site-packages/transformers/models/donut/modeling_donut_swin.py:579: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  torch.min(torch.tensor(input_resolution)) if torch.jit.is_tracing() else min(input_resolution)
/usr/local/lib/python3.12/site-packages/transformers/modeling_attn_mask_utils.py:88: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[-1] > 1 or self.sliding_window is not None:
/usr/local/lib/python3.12/site-packages/transformers/modeling_attn_mask_utils.py:164: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if past_key_values_length > 0:
/usr/local/lib/python3.12/site-packages/transformers/models/mbart/modeling_mbart.py:503: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, tgt_len, self.head_dim):
/usr/local/lib/python3.12/site-packages/transformers/models/mbart/modeling_mbart.py:490: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  is_causal = True if self.is_causal and attention_mask is None and tgt_len > 1 else False
/usr/local/lib/python3.12/site-packages/transformers/models/mbart/modeling_mbart.py:455: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  and past_key_value[0].shape[2] == key_value_states.shape[1]
```

Reproduction

  1. Access https://huggingface.co/spaces/onnx-community/convert-to-onnx
  2. Insert "Norm/nougat-latex-base" as the model name to convert and hit Enter
  3. Fill in your Hugging Face write token so the converted model can be uploaded to your account; otherwise the "Proceed [with conversion]" button won't be displayed. Don't change any other option.
  4. Click the "Proceed" button.
  5. Wait a few minutes and the conversion error will be displayed.

Note: The same error should happen if you run the conversion script manually.
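For reference, a manual run would look roughly like this. This is a sketch only: it assumes the Transformers.js repository's conversion script entry point (scripts/convert.py, run from the repo root) and its --model_id/--task flags; check the script's --help for the exact options in your checkout.

```shell
# Sketch, assuming a transformers.js checkout with its Python
# conversion dependencies installed. Run from the repo root.

# Works: the task is declared explicitly.
python -m scripts.convert --model_id Norm/nougat-latex-base --task image-to-text

# Fails with the shared encoder/decoder config error shown above,
# because the --task argument is omitted and has to be inferred.
python -m scripts.convert --model_id Norm/nougat-latex-base
```

The same --task value is what the convert-to-onnx Space would need to pass through for vision-encoder-decoder models like this one.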

Labels: bug (Something isn't working)