
Conversation

@zucchini-nlp (Member) commented Oct 6, 2025

What does this PR do?

Branches out from #40884 (comment) to make review and merging faster.

Adds a class attribute to each model to indicate the supported input and output modalities. Output modalities will be None when the model is not generative and "text" in most other cases; only a few models can generate audio or images as output. Note that for encoder-decoder models such as Whisper, the input modalities contain both the encoder ("audio") and the decoder ("text") modalities.

This will be used first for the pipeline, and we can later extend usage to a better testing suite and to preparing inputs in generation with multimodal LLMs (e.g. if we move multimodal encoding to GenerationMixin._prepare_multimodal_encodings). No tests are added at this point, because there is nothing to test yet.

github-actions bot (Contributor) commented Oct 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, align, altclip, aria, audio_spectrogram_transformer, autoformer, aya_vision, bark, beit, bit, blip, blip_2, blt, bridgetower, chameleon, chinese_clip

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante (Member) left a comment


I wonder if we can automate these variables, instead of having to manually define them. E.g. can we look at the signature of forward and, based on arguments present / type hints, determine modalities?

(fewer manual flags = smaller odds of human error = fewer bugs)
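For illustration, a minimal sketch of that automation, assuming a hypothetical helper and a hand-picked mapping from common forward() argument names to modalities (neither the helper nor the mapping is part of this PR):

import inspect

# Assumed mapping from conventional forward() argument names to modalities.
_ARG_TO_MODALITY = {
    "input_ids": "text",
    "inputs_embeds": "text",
    "pixel_values": "image",
    "pixel_values_videos": "video",
    "input_features": "audio",
    "input_values": "audio",
}

def infer_input_modalities(model_class) -> tuple[str, ...]:
    """Guess a model's input modalities from the names of its forward() arguments."""
    params = inspect.signature(model_class.forward).parameters
    found = {_ARG_TO_MODALITY[name] for name in params if name in _ARG_TO_MODALITY}
    return tuple(sorted(found))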


config: Aimv2Config
base_model_prefix = "aimv2"
input_modalities = "image"

suggestion: use a tuple always, even when length = 1

  • single possible type = simpler usage
  • immutable
  • tuples can be used as dictionary keys. In the future, this might be useful to do modality-specific operations (e.g. SOME_MAPPING_OF_FUNCTIONS[model.input_modalities](**kwargs))
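As a sketch of that dispatch pattern (the mapping and the preprocessing functions below are placeholders, not existing Transformers code):

def _prepare_text_inputs(**kwargs): ...
def _prepare_image_text_inputs(**kwargs): ...

# Tuples are hashable, so they can key a modality-specific dispatch table.
SOME_MAPPING_OF_FUNCTIONS = {
    ("text",): _prepare_text_inputs,
    ("image", "text"): _prepare_image_text_inputs,
}

# usage: SOME_MAPPING_OF_FUNCTIONS[model.input_modalities](**kwargs)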

Comment on lines +2231 to +2243
@classmethod
def output_modalities(cls) -> Optional[Union[str, list[str]]]:
    """
    Returns the output modalities that a model can generate. For non-generative models
    returns `None`. Multimodal models that can output several modalities or non-text
    modalities should overwrite this method.

    Returns:
        `Union[str, list[str]]`: Output modalities supported for models that can call `.generate()`.
    """
    if cls.can_generate():
        return "text"
    return None
@gante (Member) Oct 6, 2025


I think one of two things should happen:

  1. [my preference] output_modalities can be used on all models, i.e. it is not limited to generative models. I suspect this is a useful piece of info to have in model-agnostic code, enabling better error handling and other functionality. (unimplemented cases could throw an exception for now?)
  2. If we truly only want to use this in models that inherit GenerationMixin, then this function should be moved to GenerationMixin. Otherwise, we're tangling the classes (= bad practice).
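A rough sketch of option 1, mirroring the method from the diff above; the exception for uncovered cases is an assumption, not part of this PR:

@classmethod
def output_modalities(cls) -> Optional[Union[str, list[str]]]:
    if cls.can_generate():
        return "text"  # generative default; audio/image generators would overwrite this
    # Non-generative models (e.g. CLIP-style encoders) are not covered yet, so raising
    # makes the gap explicit instead of silently returning None.
    raise NotImplementedError(f"{cls.__name__} does not declare its output modalities yet.")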

@zucchini-nlp (Member, Author)

Hmm, if we define it for all models, what would it mean for models that output encodings (e.g. CLIP)? I couldn't think of a use case for that, tbh.

@zucchini-nlp (Member, Author)

E.g. can we look at the signature of forward and, based on arguments present / type hints, determine modalities?

Yeah, I also thought of that. It is doable for most models, but there are some tricky ones as well. For example, we don't have a consistent naming convention for the video modality, and we have no way to say what is output by a model that has an overwritten generate(). We could have a default for input_modalities as well, similar to output_modalities, and then manually overwrite all models where the pattern does not match.
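A minimal sketch of that default-plus-override pattern, assuming plain class attributes as in this PR (the class names mirror real Transformers classes, but the snippet is standalone and illustrative):

from typing import Union

class PreTrainedModel:
    # Default that covers most text-only models; models where the pattern
    # does not match overwrite the attribute manually.
    input_modalities: Union[str, tuple[str, ...]] = "text"

class WhisperModel(PreTrainedModel):
    # Encoder-decoder: the encoder consumes audio, the decoder consumes text.
    input_modalities = ("audio", "text")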

@gante (Member) commented Oct 6, 2025

We could have a default for input_modalities as well, similar to output_modalities, and then manually overwrite all models where the pattern does not match

imo this would be an improvement :) also an incentive to nudge contributors towards standard names and definitions!

@gante (Member) commented Oct 6, 2025

But check with @ArthurZucker before committing code!
