
Commit ea824b7

suluyan authored
Feat/multimodal model (#879)
Co-authored-by: suluyan <suluyan.sly@alibaba-inc.com>
1 parent 5f7178b commit ea824b7

File tree

5 files changed (+649 −12 lines)
Lines changed: 299 additions & 0 deletions
@@ -0,0 +1,299 @@
---
slug: multimodal-support
title: Multimodal Support
description: A guide to multimodal conversation with Ms-Agent: configuring and using image understanding and analysis.
---

# Multimodal Support

This document describes how to use ms-agent for multimodal conversation, including image understanding and analysis.

## Overview

ms-agent supports multimodal models such as Alibaba Cloud's `qwen3.5-plus`. Multimodal models can:
- Analyze image content
- Recognize objects, scenes, and text in images
- Hold conversations grounded in image content

## Prerequisites

### 1. Install Dependencies

Make sure the required packages are installed:

```bash
pip install openai
```

### 2. Configure the API Key

Using `qwen3.5-plus` as an example, obtain a DashScope API Key and set it as an environment variable:

```bash
export DASHSCOPE_API_KEY='your-dashscope-api-key'
```

Alternatively, set `dashscope_api_key` directly in the configuration file.
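A missing key typically only surfaces as an authentication error at request time, so it can help to fail fast at startup. A minimal sketch, assuming the environment-variable approach above (`require_dashscope_key` is an illustrative helper, not part of the ms-agent API):

```python
import os

def require_dashscope_key() -> str:
    """Return the DashScope API key, failing fast if it is not configured."""
    key = os.environ.get('DASHSCOPE_API_KEY', '').strip()
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; "
            "export it or set dashscope_api_key in the config file."
        )
    return key
```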

## Configuring a Multimodal Model

Multimodal capability depends on two things:
1. **Choosing a model that supports multimodal input** (such as `qwen3.5-plus`)
2. **Using the correct message format** (content blocks that include `image_url`)

You can start from an existing configuration file and override the model settings in code:

```python
import os

from ms_agent import LLMAgent
from ms_agent.config import Config

# Start from an existing configuration file (e.g. ms_agent/agent/agent.yaml)
config = Config.from_task('ms_agent/agent/agent.yaml')

# Override the configuration to use a multimodal model
config.llm.model = 'qwen3.5-plus'
config.llm.service = 'dashscope'
config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

# Create the LLMAgent
agent = LLMAgent(config=config)
```

## Multimodal Conversation with LLMAgent

`LLMAgent` is the recommended way to run multimodal conversations: it provides a more complete feature set, including memory management, tool calling, and callback support.

### Basic Usage

```python
import asyncio
import os

from ms_agent import LLMAgent
from ms_agent.config import Config
from ms_agent.llm.utils import Message

async def multimodal_chat():
    # Build the configuration
    config = Config.from_task('ms_agent/agent/agent.yaml')
    config.llm.model = 'qwen3.5-plus'
    config.llm.service = 'dashscope'
    config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
    config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

    # Create the LLMAgent
    agent = LLMAgent(config=config)

    # Build a multimodal message
    multimodal_content = [
        {"type": "text", "text": "Please describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]

    # Run the agent
    response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
    print(response[-1].content)

asyncio.run(multimodal_chat())
```

### Non-streaming Mode

```python
# Disable streaming in the configuration
config.generation_config.stream = False

agent = LLMAgent(config=config)

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Non-streaming mode: returns the complete response directly
response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
print(f"[Reply] {response[-1].content}")
print(f"[Token usage] input: {response[-1].prompt_tokens}, output: {response[-1].completion_tokens}")
```

### Streaming Mode

```python
# Enable streaming in the configuration
config.generation_config.stream = True

agent = LLMAgent(config=config)

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Streaming mode: returns an async generator
generator = await agent.run(
    messages=[Message(role="user", content=multimodal_content)],
    stream=True
)

full_response = ""
async for response_chunk in generator:
    if response_chunk and len(response_chunk) > 0:
        last_msg = response_chunk[-1]
        if last_msg.content:
            # Print only the newly generated text
            print(last_msg.content[len(full_response):], end='', flush=True)
            full_response = last_msg.content

print(f"\n[Full reply] {full_response}")
```

### Multi-turn Conversation

LLMAgent supports multi-turn conversation, and image and text messages can be mixed within one conversation:

```python
agent = LLMAgent(config=config, tag="multimodal_conversation")

# First turn: send an image
multimodal_content = [
    {"type": "text", "text": "How many people are in this image?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

messages = [Message(role="user", content=multimodal_content)]
response = await agent.run(messages=messages)
print(f"[First reply] {response[-1].content}")

# Second turn: follow up with plain text, keeping the context
messages = response  # reuse the previous turn's messages as context
messages.append(Message(role="user", content="What are they doing?"))
response = await agent.run(messages=messages)
print(f"[Second reply] {response[-1].content}")
```

## Multimodal Message Format

ms-agent uses the OpenAI-compatible multimodal message format. Images can be supplied in three ways (in the snippets below, `llm` is assumed to be an already-configured LLM client):

### 1. Image URL

```python
from ms_agent.llm.utils import Message

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

messages = [
    Message(role="user", content=multimodal_content)
]

response = llm.generate(messages=messages)
```

### 2. Base64 Encoding

```python
import base64

# Read and encode the image
with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "What is this?"},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{image_data}"
        }
    }
]

messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```

### 3. Local File Path

```python
import base64
import os

image_path = 'path/to/image.png'

# Determine the MIME type from the file extension
ext = os.path.splitext(image_path)[1].lower()
mime_type = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.gif': 'image/gif',
    '.webp': 'image/webp'
}.get(ext, 'image/png')

# Read and encode the image
with open(image_path, 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "Describe this image."},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{image_data}"
        }
    }
]

messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```

## Running the Examples

### Run the Agent Example

```bash
# Run the full test suite (both streaming and non-streaming modes)
python examples/agent/test_llm_agent_multimodal.py
```

## FAQ

### Q: Is there a limit on image size?

A: Yes, and the limits vary by model:
- qwen3.5-plus: images of at most 4MB are recommended
- A resolution of at most 2048x2048 is recommended
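The byte limit can be checked locally before encoding, to avoid spending a request on an oversized image. A minimal sketch, assuming the 4MB recommendation above (checking resolution would additionally require an image library such as Pillow):

```python
import os

MAX_IMAGE_BYTES = 4 * 1024 * 1024  # 4MB, per the qwen3.5-plus recommendation above

def fits_size_limit(image_path: str, limit: int = MAX_IMAGE_BYTES) -> bool:
    """Return True if the image file is within the recommended byte limit."""
    return os.path.getsize(image_path) <= limit
```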

### Q: Which image formats are supported?

A: Typically:
- JPEG / JPG
- PNG
- GIF
- WebP

### Q: Can I send multiple images at once?

A: Yes. Add multiple `image_url` blocks to the message content:

```python
multimodal_content = [
    {"type": "text", "text": "Compare these two images."},
    {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}}
]
```

### Q: Is streaming output supported?

A: Yes, multimodal conversations support streaming output. Set `stream: true`:

```python
config.generation_config.stream = True
response = llm.generate(messages=messages, stream=True)
```

docs/zh/Components/supported-models.md

Lines changed: 4 additions & 0 deletions
@@ -48,3 +48,7 @@ llm:
```

> If you use another model provider, please help update this document.

## Multimodal Support

For how to use multimodal models (e.g. image understanding and analysis), see the [Multimodal Support guide](./multimodal-support.md).
