
Conversation

@zucchini-nlp (Member) commented Oct 6, 2025

What does this PR do?

Branches out from #40884 (comment) to make review and merging faster.

Adds a class attribute to each model to indicate the supported input and output modalities. Output modalities will be None when the model is not generative and "text" in most other cases; only a few models can generate audio or images as output. Note that for encoder-decoder models such as Whisper, the input modalities contain both the encoder ("audio") and the decoder ("text") modalities.

This will be used first for the pipeline, and we can later extend usage to a better testing suite and to preparing inputs in generation with multimodal LLMs (e.g. if we move multimodal encoding to GenerationMixin._prepare_multimodal_encodings). No tests are added at this point, because there is nothing to test yet.

github-actions bot (Contributor) commented Oct 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, align, altclip, aria, audio_spectrogram_transformer, autoformer, aya_vision, bark, beit, bit, blip, blip_2, blt, bridgetower, chameleon, chinese_clip

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante (Member) left a comment


I wonder if we can automate these variables, instead of having to manually define them. E.g. can we look at the signature of forward and, based on arguments present / type hints, determine modalities?

(fewer manual flags = smaller odds of human error = fewer bugs)
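For illustration, a minimal sketch of that automation, assuming a hypothetical helper and a hand-picked mapping from common forward() argument names to modalities (neither the helper nor the mapping is part of this PR):

import inspect

# Assumed mapping from conventional forward() argument names to modalities.
_ARG_TO_MODALITY = {
    "input_ids": "text",
    "inputs_embeds": "text",
    "pixel_values": "image",
    "pixel_values_videos": "video",
    "input_features": "audio",
    "input_values": "audio",
}

def infer_input_modalities(model_class) -> tuple[str, ...]:
    """Guess a model's input modalities from the names of its forward() arguments."""
    params = inspect.signature(model_class.forward).parameters
    found = {_ARG_TO_MODALITY[name] for name in params if name in _ARG_TO_MODALITY}
    return tuple(sorted(found))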


config: Aimv2Config
base_model_prefix = "aimv2"
input_modalities = "image"

suggestion: use a tuple always, even when length = 1

  • single possible type = simpler usage
  • immutable
  • tuples can be used as dictionary keys. In the future, this might be useful to do modality-specific operations (e.g. SOME_MAPPING_OF_FUNCTIONS[model.input_modalities](**kwargs))
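As a sketch of that dispatch pattern (the mapping and the preprocessing functions below are placeholders, not existing Transformers code):

def _prepare_text_inputs(**kwargs): ...
def _prepare_image_text_inputs(**kwargs): ...

# Tuples are hashable, so they can key a modality-specific dispatch table.
SOME_MAPPING_OF_FUNCTIONS = {
    ("text",): _prepare_text_inputs,
    ("image", "text"): _prepare_image_text_inputs,
}

# usage: SOME_MAPPING_OF_FUNCTIONS[model.input_modalities](**kwargs)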

Comment on lines +2231 to +2243
@classmethod
def output_modalities(cls) -> Optional[Union[str, list[str]]]:
    """
    Returns the output modalities that a model can generate. For non-generative models
    returns `None`. Multimodal models that can output several modalities or non-text
    modalities should overwrite this method.

    Returns:
        `Union[str, list[str]]`: Output modalities supported for models that can call `.generate()`.
    """
    if cls.can_generate():
        return "text"
    return None
@gante (Member) Oct 6, 2025


I think one of two things should happen:

  1. [my preference] output_modalities can be used on all models, i.e. it is not limited to generative models. I suspect this is a useful piece of info to have in model-agnostic code, enabling better error handling and other functionality. (unimplemented cases could throw an exception for now?)
  2. If we truly only want to use this in models that inherit GenerationMixin, then this function should be moved to GenerationMixin. Otherwise, we're tangling the classes (= bad practice).
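A rough sketch of option 1, mirroring the method from the diff above; the exception for uncovered cases is an assumption, not part of this PR:

@classmethod
def output_modalities(cls) -> Optional[Union[str, list[str]]]:
    if cls.can_generate():
        return "text"  # generative default; audio/image generators would overwrite this
    # Non-generative models (e.g. CLIP-style encoders) are not covered yet, so raising
    # makes the gap explicit instead of silently returning None.
    raise NotImplementedError(f"{cls.__name__} does not declare its output modalities yet.")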

@zucchini-nlp (Member, Author)

Hmm, if we define it for all models, what would it mean for models that output encodings (e.g. CLIP)? I couldn't think of a use case for that, tbh.

@zucchini-nlp (Member, Author)

E.g. can we look at the signature of forward and, based on arguments present / type hints, determine modalities?

Yeah, I also thought of that. It is doable for most models, but there are some tricky ones as well. For example, we don't have a consistent naming convention for the video modality, and we have no way to say what is output by a model that has an overwritten generate(). We could have a default for input_modalities as well, similar to output_modalities, and then manually overwrite all models where the pattern does not match.
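A minimal sketch of that default-plus-override pattern, assuming plain class attributes as in this PR (the class names mirror real Transformers classes, but the snippet is standalone and illustrative):

from typing import Union

class PreTrainedModel:
    # Default that covers most text-only models; models where the pattern
    # does not match overwrite the attribute manually.
    input_modalities: Union[str, tuple[str, ...]] = "text"

class WhisperModel(PreTrainedModel):
    # Encoder-decoder: the encoder consumes audio, the decoder consumes text.
    input_modalities = ("audio", "text")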

@gante (Member) commented Oct 6, 2025

We could have a default for input_modalities as well, similar to output_modalities, and then manually overwrite all models where the pattern does not match

imo this would be an improvement :) also an incentive to nudge contributors towards standard names and definitions!

@gante (Member) commented Oct 6, 2025

But check with @ArthurZucker before committing code!
