Commit 930e30b

nits on any-to-any task (#1372)
1 parent 457119e commit 930e30b

File tree

1 file changed: +27 additions, -14 deletions

  • packages/tasks/src/tasks/any-to-any

packages/tasks/src/tasks/any-to-any/about.md

Lines changed: 27 additions & 14 deletions
````diff
@@ -6,47 +6,60 @@ Any-to-any models can help embodied agents operate in multi-sensory environments
 
 ### Real-time Accessibility Systems
 
-Vision-language based any-to-any models can be used aid visually impaired people. A real-time on-device any-to-any model can take a real-world video stream from wearable glasses, and describe the scene in audio (e.g., "A person in a red coat is walking toward you") or provide real-time closed captions and environmental sound cues.
+Vision-language based any-to-any models can be used to aid visually impaired people. A real-time on-device any-to-any model can take a real-world video stream from wearable glasses, and describe the scene in audio (e.g., "A person in a red coat is walking toward you"), or provide real-time closed captions and environmental sound cues.
 
 ### Multimodal Content Creation
 
 One can use any-to-any models to generate multimodal content. For example, given a video and an outline, the model can generate speech, better videos, or a descriptive blog post. Moreover, these models can sync narration timing with visual transitions.
 
 ## Inference
 
-You can infer with any-to-any models using transformers. Below is an example to infer Qwen2.5-Omni-7B model, make sure to check the model you're inferring with.
+You can infer with any-to-any models using transformers. Below is an example that passes a video as part of a chat conversation to the Qwen2.5-Omni-7B model, and retrieves text and audio responses. Make sure to check the model you're inferring with.
 
 ```python
 import soundfile as sf
-from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
-from qwen_omni_utils import process_mm_info
-
-model = Qwen2_5OmniModel.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
 
+model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen2.5-Omni-7B",
+    torch_dtype="auto",
+    device_map="auto",
+    attn_implementation="flash_attention_2",
+)
 processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
 
 conversation = [
     {
         "role": "system",
-        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
+        "content": [
+            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
+        ],
     },
     {
         "role": "user",
         "content": [
             {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
+            {"type": "text", "text": "What can you hear and see in this video?"},
         ],
     },
 ]
 
-USE_AUDIO_IN_VIDEO = True
-
-text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
-audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
-inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
-inputs = inputs.to(model.device).to(model.dtype)
+inputs = processor.apply_chat_template(
+    conversation,
+    load_audio_from_video=True,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+    video_fps=2,
+
+    # kwargs to be passed to `Qwen2-5-OmniProcessor`
+    padding=True,
+    use_audio_in_video=True,
+)
 
 # Inference: Generation of the output text and audio
-text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
 
 text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
 print(text)
````
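Note: the updated snippet imports `soundfile`, but the hunk shown here ends before the generated audio is written to disk. A minimal continuation sketch, assuming `audio` is the waveform tensor returned by `model.generate` above and that the model emits mono audio at a 24 kHz sample rate (the output file name is illustrative):

```python
# Continuation sketch (not part of the committed diff): save the generated speech.
# Assumes `audio` is a waveform tensor and a 24 kHz output sample rate.
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),  # flatten to raw float samples
    samplerate=24000,
)
```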
