packages/tasks/src/tasks/any-to-any/about.md

Any-to-any models can help embodied agents operate in multi-sensory environments.

### Real-time Accessibility Systems

Vision-language-based any-to-any models can be used to aid visually impaired people. A real-time, on-device any-to-any model can take a real-world video stream from wearable glasses and describe the scene in audio (e.g., "A person in a red coat is walking toward you"), or provide real-time closed captions and environmental sound cues.

### Multimodal Content Creation
One can use any-to-any models to generate multimodal content. For example, given a video and an outline, the model can generate speech, better videos, or a descriptive blog post. Moreover, these models can sync narration timing with visual transitions.
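
As a sketch of what such a request can look like in practice, the hypothetical conversation below pairs a video with a text outline in the chat-message format that any-to-any models such as Qwen2.5-Omni expect. The file name and prompt are placeholders; the Inference section below shows how to actually run a conversation like this.

```python
# Hypothetical multimodal request: a video plus a text outline in one user turn.
# The structure mirrors the chat format used in the inference example below;
# "product_demo.mp4" and the outline text are placeholders.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "product_demo.mp4"},
            {
                "type": "text",
                "text": "Using this outline (intro, key features, pricing), "
                "write a short blog post describing the video and narrate it as speech.",
            },
        ],
    }
]
```
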
## Inference

You can run inference with any-to-any models using the transformers library. Below is an example that passes a video as part of a chat conversation to the Qwen2.5-Omni-7B model and retrieves text and audio responses; make sure to check the usage details of the specific model you're inferring with.

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
    {"role": "user", "content": [{"type": "video", "video": "video.mp4"}]},  # replace with your own video file or URL
]

# Turn the conversation into model inputs: the text prompt plus extracted audio/image/video tensors
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=True)
inputs = inputs.to(model.device).to(model.dtype)

# Generate both text tokens and a speech waveform, then decode the text and save the audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```