Add in-out modalities as class attribute per model #41366
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: aimv2, align, altclip, aria, audio_spectrogram_transformer, autoformer, aya_vision, bark, beit, bit, blip, blip_2, blt, bridgetower, chameleon, chinese_clip

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I wonder if we can automate these variables, instead of having to manually define them. E.g. can we look at the signature of `forward` and, based on the arguments present / type hints, determine the modalities? (Fewer manual flags = smaller odds of human error = fewer bugs.)
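A rough sketch of the kind of introspection this suggests. The argument-name-to-modality mapping below is an assumption for illustration, not an existing transformers API:

```python
import inspect

# Hypothetical heuristic: map common forward() argument names to the
# modality they carry. The mapping is an assumption, not part of the PR.
_ARG_NAME_TO_MODALITY = {
    "input_ids": "text",
    "inputs_embeds": "text",
    "pixel_values": "image",
    "pixel_values_videos": "video",
    "input_features": "audio",
    "input_values": "audio",
}

def infer_input_modalities(model_cls) -> tuple[str, ...]:
    """Guess a model's input modalities from the names in forward()'s signature."""
    params = inspect.signature(model_cls.forward).parameters
    found = {_ARG_NAME_TO_MODALITY[p] for p in params if p in _ARG_NAME_TO_MODALITY}
    return tuple(sorted(found))
```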
```python
config: Aimv2Config
base_model_prefix = "aimv2"
input_modalities = "image"
```
Suggestion: use a tuple always, even when length = 1:
- single possible type = simpler usage
- immutable
- tuples can be used as dictionary keys. In the future, this might be useful to do modality-specific operations (e.g. `SOME_MAPPING_OF_FUNCTIONS[model.input_modalities](**kwargs)`)
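A minimal sketch of the tuple form and the dispatch idea from the comment above; the class, functions, and mapping are illustrative stand-ins, not code from the PR:

```python
# Stand-in for the real model class; a tuple even for a single modality.
class Aimv2Model:
    input_modalities = ("image",)

def _prepare_image(**kwargs):
    ...

def _prepare_image_text(**kwargs):
    ...

# Tuples are hashable, so they can key modality-specific dispatch.
SOME_MAPPING_OF_FUNCTIONS = {
    ("image",): _prepare_image,
    ("image", "text"): _prepare_image_text,
}

prepare_fn = SOME_MAPPING_OF_FUNCTIONS[Aimv2Model.input_modalities]
```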
```python
@classmethod
def output_modalities(cls) -> Optional[Union[str, list[str]]]:
    """
    Returns a list of output modalities that a model can generate. For non-generative models,
    returns `None`. Multimodal models that can output several modalities or non-text modalities
    should overwrite this method.

    Returns:
        `Union[str, list[str]]`: Output modalities supported for models that can call `.generate()`.
    """
    if cls.can_generate():
        return "text"
    return None
```
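For illustration, a non-text generative model would overwrite this default as the docstring asks. The model class below is hypothetical and not part of this PR:

```python
from typing import Optional, Union

from transformers import GenerationMixin, PreTrainedModel

# Hypothetical text-to-audio model declaring its non-text output.
class MyTextToAudioModel(PreTrainedModel, GenerationMixin):
    @classmethod
    def output_modalities(cls) -> Optional[Union[str, list[str]]]:
        return "audio"
```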
I think one of two things should happen:
- [my preference] `output_modalities` can be used on all models, i.e. it is not limited to generative models. I suspect this is a useful piece of info to have in model-agnostic code, enabling better error handling and other functionality. (Unimplemented cases could throw an exception for now? See the sketch after this list.)
- If we truly only want to use this in models that inherit `GenerationMixin`, then this function should be moved to `GenerationMixin`. Otherwise, we're tangling the classes (= bad practice).
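A rough sketch of the first option, under the assumption that unimplemented cases raise; the class below is a stand-in for the real base class, not code from the PR:

```python
from typing import Optional, Union

# Stand-in for the real PreTrainedModel base class.
class PreTrainedModel:
    @classmethod
    def can_generate(cls) -> bool:
        # Placeholder; the real PreTrainedModel already implements this.
        return False

    @classmethod
    def output_modalities(cls) -> Optional[Union[str, list[str]]]:
        # Defined for every model; unmapped cases raise instead of
        # silently returning None.
        if cls.can_generate():
            return "text"
        raise NotImplementedError(
            f"{cls.__name__} has not declared its output modalities yet."
        )
```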
Hmm, if we define it for all models, what would it mean for models that output encodings (e.g. CLIP)? I could not think of a use case for that, tbh.

Yeah, also thought of it. It is doable for most models, but there are some tricky ones as well. For example, we don't have a consistent naming habit for the video modality, and we have no way to say what is being output by a model that has an overwritten …

imo this would be an improvement :) also an incentive to nudge contributors towards standard names and definitions!

But check with @ArthurZucker before committing code!
What does this PR do?
Branches out from #40884 (comment) to make review and merge faster.

Adds a class attribute for each model to indicate the supported input and output modalities. Output modalities will be `None` in case the model is not generative and `"text"` in most other cases; we have only a few models that can generate audio and images in the output. Note that for encoder-decoder models like Whisper, the input modalities will contain both the encoder ("audio") and decoder ("text") modalities.

This will be used first for the pipeline, and we can extend usage later to a better testing suite and to preparing inputs better in generation with multimodal LLMs (e.g. if we move multimodal encoding to `GenerationMixin._prepare_multimodal_encodings`). No test added at this point, because there is nothing to test.
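As an illustration of the pipeline use case, a minimal sketch of input validation against the declared modalities. The helper name and error message are assumptions, not code from this PR:

```python
# Hypothetical pipeline-side check: reject inputs whose modality the
# model does not declare in its input_modalities attribute.
def validate_pipeline_inputs(model, provided: set[str]) -> None:
    declared = model.input_modalities
    supported = set(declared) if isinstance(declared, (list, tuple)) else {declared}
    unsupported = provided - supported
    if unsupported:
        raise ValueError(
            f"{type(model).__name__} does not accept {sorted(unsupported)} inputs; "
            f"supported input modalities: {sorted(supported)}."
        )
```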