@@ -151,7 +151,7 @@ def retrieve_latents(
 
 class Cosmos2_5_TransferPipeline(DiffusionPipeline):
     r"""
-    Pipeline for Cosmos Transfer2.5 base model.
+    Pipeline for Cosmos Transfer2.5, supporting auto-regressive inference.
 
     This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
     implemented for all pipelines (downloading, saving, running on a particular device, etc.).
@@ -538,18 +538,25 @@ def __call__(
         num_latent_conditional_frames: Optional[int] = None,
     ):
         r"""
-        The call function supports a predict-compatible path when `controls` is `None` (or `self.controlnet` is
-        `None`). In that mode it follows the same input semantics as `Cosmos2_5_PredictPipeline`:
+        The call function can be used in two modes: with or without controls.
 
+        When controls are not provided (`controls is None`), inference works in the same manner as predict2.5 (see
+        `Cosmos2_5_PredictPipeline`). This mode strictly uses the base transformer (`self.transformer`) to perform
+        inference and accepts as input an optional `image` or `video` along with a `prompt` / `negative_prompt`, and
+        can be used in the following ways:
         - **Text2World**: `image=None`, `video=None`, `prompt` provided.
         - **Image2World**: `image` provided, `video=None`, `prompt` provided.
         - **Video2World**: `video` provided, `image=None`, `prompt` provided.
 
         When `controls` are provided and a ControlNet is attached, `controls` drive the conditioning and `video` &
-        `image` is ignored.
+        `image` are ignored. Controls are assumed to be pre-processed, e.g. edge maps are pre-computed.
 
         Setting `num_frames` will restrict the total number of frames output, if not provided or assigned to None
         (default) then the number of output frames will match the input `video`, `image` or `controls` respectively.
+        Auto-regressive inference is supported and thus a sliding window of `num_frames_per_chunk` frames is used per
+        denoising loop. In addition, when auto-regressive inference is performed, the previous
+        `num_latent_conditional_frames` or `num_conditional_frames` are used to condition the following denoising
+        inference loops.
 
         Args:
             image (`PipelineImageInput`, *optional*):
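
The sliding-window behaviour described in the `__call__` docstring can be illustrated with a small standalone sketch. This is not the pipeline's implementation: the helper name `chunk_windows` is hypothetical, it works on pixel-frame indices only, and it assumes each new chunk re-uses the last `num_conditional_frames` frames of the previous chunk as conditioning. How the real pipeline maps these onto latent frames is not shown in this diff.

```python
from typing import List, Tuple


def chunk_windows(
    total_frames: int,
    num_frames_per_chunk: int,
    num_conditional_frames: int,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges, one per denoising loop.

    Hypothetical helper: every window after the first starts
    `num_conditional_frames` frames before the previous window ended, so
    those overlapping frames can condition the next chunk.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= num_conditional_frames < num_frames_per_chunk:
        raise ValueError("need 0 <= num_conditional_frames < num_frames_per_chunk")

    # Each chunk contributes this many new frames beyond the overlap.
    stride = num_frames_per_chunk - num_conditional_frames
    windows: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + num_frames_per_chunk, total_frames)
        windows.append((start, end))
        if end >= total_frames:
            break
        start += stride
    return windows


if __name__ == "__main__":
    for start, end in chunk_windows(10, 4, 1):
        print(f"denoise frames [{start}, {end})")
```

With 10 frames, 4 frames per chunk and 1 conditional frame, consecutive windows overlap by exactly one frame, which is the frame assumed here to condition the next loop.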