| slug | multimodal-support |
|---|---|
| title | Multimodal Support |
| description | A guide to multimodal conversations with Ms-Agent: configuring and using image understanding and analysis. |
This document describes how to use ms-agent for multimodal conversations, including image understanding and analysis.
ms-agent supports multimodal models, such as Alibaba Cloud's qwen3.5-plus. Multimodal models can:
- Analyze image content
- Recognize objects, scenes, and text in images
- Carry on a conversation grounded in image content
Make sure the required dependency is installed:

```bash
pip install openai
```

Using qwen3.5-plus as an example, obtain a DashScope API Key and set it as an environment variable:

```bash
export DASHSCOPE_API_KEY='your-dashscope-api-key'
```

Alternatively, set dashscope_api_key directly in the configuration file.
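For the config-file route, the key would live alongside the other llm settings. The sketch below is an assumption: it mirrors the `config.llm.*` fields used in the Python examples later in this document, so check your ms-agent version for the authoritative schema:

```yaml
# Assumed agent.yaml layout, mirroring the config.llm.* fields accessed in Python below
llm:
  service: dashscope
  model: qwen3.5-plus
  dashscope_api_key: your-dashscope-api-key
```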
Multimodal support mainly comes down to two things:
- Choosing a model that supports multimodal input (e.g. qwen3.5-plus)
- Using the correct message format (content blocks that include image_url), as sketched below
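Concretely, the two requirements combine into a message shaped like the sketch below (the URL is a placeholder):

```python
# Minimal message shape: a text block plus an image_url block,
# sent to a multimodal-capable model such as qwen3.5-plus.
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this picture?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
    ],
}
```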
You can start from an existing configuration and override the model settings in code:
```python
from ms_agent.config import Config
from ms_agent import LLMAgent
import os

# Start from an existing configuration file (e.g. ms_agent/agent/agent.yaml)
config = Config.from_task('ms_agent/agent/agent.yaml')

# Override the settings with a multimodal model
config.llm.model = 'qwen3.5-plus'
config.llm.service = 'dashscope'
config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

# Create the LLMAgent
agent = LLMAgent(config=config)
```

We recommend LLMAgent for multimodal conversations: it provides more complete functionality, including memory management, tool calling, and callback support.
```python
import asyncio
import os

from ms_agent import LLMAgent
from ms_agent.config import Config
from ms_agent.llm.utils import Message


async def multimodal_chat():
    # Build the configuration
    config = Config.from_task('ms_agent/agent/agent.yaml')
    config.llm.model = 'qwen3.5-plus'
    config.llm.service = 'dashscope'
    config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
    config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

    # Create the LLMAgent
    agent = LLMAgent(config=config)

    # Build a multimodal message
    multimodal_content = [
        {"type": "text", "text": "Please describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]

    # Run the agent
    response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
    print(response[-1].content)


asyncio.run(multimodal_chat())
```

In non-streaming mode, the complete response is returned at once:

```python
# Disable streaming in the configuration
config.generation_config.stream = False
agent = LLMAgent(config=config)

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Non-streaming mode: the full response is returned directly
response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
print(f"[Reply] {response[-1].content}")
print(f"[Token usage] input: {response[-1].prompt_tokens}, output: {response[-1].completion_tokens}")
```

In streaming mode, `agent.run` returns an async generator:

```python
# Enable streaming in the configuration
config.generation_config.stream = True
agent = LLMAgent(config=config)

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Streaming mode: an async generator is returned
generator = await agent.run(
    messages=[Message(role="user", content=multimodal_content)],
    stream=True
)

full_response = ""
async for response_chunk in generator:
    if response_chunk and len(response_chunk) > 0:
        last_msg = response_chunk[-1]
        if last_msg.content:
            # Print only the newly generated portion of the content
            print(last_msg.content[len(full_response):], end='', flush=True)
            full_response = last_msg.content
print(f"\n[Full reply] {full_response}")
```

LLMAgent supports multi-turn conversations, and you can mix images and plain text across turns:

```python
agent = LLMAgent(config=config, tag="multimodal_conversation")

# Turn 1: send an image
multimodal_content = [
    {"type": "text", "text": "How many people are in this image?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
messages = [Message(role="user", content=multimodal_content)]
response = await agent.run(messages=messages)
print(f"[Turn 1 reply] {response[-1].content}")

# Turn 2: follow up in plain text, keeping the context
messages = response  # reuse the previous turn's reply as context
messages.append(Message(role="user", content="What are they doing?"))
response = await agent.run(messages=messages)
print(f"[Turn 2 reply] {response[-1].content}")
```

ms-agent uses the OpenAI-compatible multimodal message format. Images can be provided in three ways:
Via a publicly accessible URL:

```python
from ms_agent.llm.utils import Message

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
messages = [
    Message(role="user", content=multimodal_content)
]
# `llm` is assumed to be an initialized LLM client built from the configuration above
response = llm.generate(messages=messages)
```

Via a Base64-encoded data URL:

```python
import base64
# Read and Base64-encode the image
with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "What is this?"},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{image_data}"
        }
    }
]
messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```

Via a local file, deriving the MIME type from the extension:

```python
import base64
import os
image_path = 'path/to/image.png'

# Determine the MIME type from the file extension
ext = os.path.splitext(image_path)[1].lower()
mime_type = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.gif': 'image/gif',
    '.webp': 'image/webp'
}.get(ext, 'image/png')

# Read and Base64-encode the image
with open(image_path, 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "Describe this image."},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{image_data}"
        }
    }
]
messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```

Run the complete test suite (covering both streaming and non-streaming modes):

```bash
python examples/agent/test_llm_agent_multimodal.py
```

Q: Are there limits on image size?

A: Yes, different models have different limits:
- qwen3.5-plus: a recommended maximum image size of 4MB
- a recommended maximum resolution of 2048x2048
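If you want to enforce these recommendations before sending a request, a local pre-check is easy to add. The sketch below is one way to do it, using Pillow (not an ms-agent dependency); the thresholds are the recommendations quoted above:

```python
import os

from PIL import Image  # pip install pillow

MAX_BYTES = 4 * 1024 * 1024  # 4MB recommendation above
MAX_SIDE = 2048              # 2048x2048 recommendation above


def check_image(path: str) -> None:
    """Raise ValueError if the image exceeds the recommended limits."""
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f'{path} is larger than 4MB')
    with Image.open(path) as img:
        width, height = img.size
        if max(width, height) > MAX_SIDE:
            raise ValueError(f'{path} exceeds 2048x2048 ({width}x{height})')
```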
Q: Which image formats are supported?

A: Typically:
- JPEG / JPG
- PNG
- GIF
- WebP
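When working with local files, you can reject unsupported formats before encoding them; a minimal sketch based on the list above:

```python
import os

SUPPORTED_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.webp'}


def is_supported_image(path: str) -> bool:
    """Check the file extension against the formats listed above."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTS
```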
Q: Can a single message contain multiple images?

A: Yes, add multiple image_url blocks to the message content:

```python
multimodal_content = [
    {"type": "text", "text": "Compare these two images."},
    {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}}
]
```

Q: Does multimodal conversation support streaming output?

A: Yes, just set `stream: true`:
```python
config.generation_config.stream = True
response = llm.generate(messages=messages, stream=True)
```