Commit e2a8400

alcholiclg and wangxingjun778 authored
refine docs (#890)
Co-authored-by: 班扬 <xingjun.wxj@alibaba-inc.com>
Co-authored-by: alcholiclg <ligongshengzju@foxmail.com>
1 parent 30d4fc1 commit e2a8400

File tree

9 files changed: +1371 −454 lines changed

docs/en/Components/AgentSkills.md

Lines changed: 516 additions & 0 deletions
Large diffs are not rendered by default.

docs/en/Components/Config.md

Lines changed: 18 additions & 0 deletions

@@ -106,6 +106,24 @@ tools:

For the complete list of supported tools and custom tools, please refer to [here](./Tools.md)

## Skills Configuration

> Optional, used when enabling Agent Skills

```yaml
skills:
  # Path to skills directory or ModelScope repo ID
  path: /path/to/skills
  # Whether to auto-execute skills (default: True)
  auto_execute: true
  # Working directory for outputs
  work_dir: /path/to/workspace
  # Whether to use Docker sandbox for execution (default: True)
  use_sandbox: false
```

For the complete skill module documentation (including architecture, directory structure, API reference, and security mechanisms), see [Agent Skills](./AgentSkills).

## Memory Compression Configuration

> Optional, for context management in long conversations
docs/en/Components/MultimodalSupport.md

Lines changed: 299 additions & 0 deletions

@@ -0,0 +1,299 @@

---
slug: MultimodalSupport
title: Multimodal Support
description: Ms-Agent multimodal conversation guide - image understanding and analysis configuration and usage.
---

# Multimodal Support

This document describes how to use ms-agent for multimodal conversations, including image understanding and analysis capabilities.

## Overview

ms-agent supports multimodal models such as Alibaba Cloud's `qwen3.5-plus`. Multimodal models can:

- Analyze image content
- Recognize objects, scenes, and text in images
- Engage in conversations based on image content

## Prerequisites

### 1. Install Dependencies

Ensure the required packages are installed:

```bash
pip install openai
```

### 2. Configure API Key

Obtain a DashScope API Key (using `qwen3.5-plus` as an example) and set the environment variable:

```bash
export DASHSCOPE_API_KEY='your-dashscope-api-key'
```

Or set `dashscope_api_key` directly in the configuration file.
## Configure Multimodal Models

Multimodal functionality depends on two factors:

1. **Choose a model that supports multimodal input** (e.g. `qwen3.5-plus`)
2. **Use the correct message format** (containing `image_url` blocks)

You can dynamically modify the model configuration in code on top of an existing config:

```python
import os

from ms_agent import LLMAgent
from ms_agent.config import Config

# Use an existing configuration file (e.g. ms_agent/agent/agent.yaml)
config = Config.from_task('ms_agent/agent/agent.yaml')

# Override the configuration for the multimodal model
config.llm.model = 'qwen3.5-plus'
config.llm.service = 'dashscope'
config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

# Create the LLMAgent
agent = LLMAgent(config=config)
```
## Using LLMAgent for Multimodal Conversations

Using `LLMAgent` for multimodal conversations is recommended, as it provides more complete features, including memory management, tool calling, and callback support.

### Basic Usage

```python
import asyncio
import os

from ms_agent import LLMAgent
from ms_agent.config import Config
from ms_agent.llm.utils import Message


async def multimodal_chat():
    # Create the configuration
    config = Config.from_task('ms_agent/agent/agent.yaml')
    config.llm.model = 'qwen3.5-plus'
    config.llm.service = 'dashscope'
    config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
    config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'

    # Create the LLMAgent
    agent = LLMAgent(config=config)

    # Build a multimodal message
    multimodal_content = [
        {"type": "text", "text": "Please describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]

    # Call the agent
    response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
    print(response[-1].content)


asyncio.run(multimodal_chat())
```
101+
### Non-Stream Mode
102+
103+
```python
104+
# Disable stream in configuration
105+
config.generation_config.stream = False
106+
107+
agent = LLMAgent(config=config)
108+
109+
multimodal_content = [
110+
{"type": "text", "text": "Please describe this image."},
111+
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
112+
]
113+
114+
# Non-stream mode: returns complete response directly
115+
response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
116+
print(f"[Response] {response[-1].content}")
117+
print(f"[Token Usage] Input: {response[-1].prompt_tokens}, Output: {response[-1].completion_tokens}")
118+
```
119+
120+
### Stream Mode

```python
# Enable streaming in the configuration
config.generation_config.stream = True

agent = LLMAgent(config=config)

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Stream mode: returns a generator
generator = await agent.run(
    messages=[Message(role="user", content=multimodal_content)],
    stream=True
)

full_response = ""
async for response_chunk in generator:
    if response_chunk and len(response_chunk) > 0:
        last_msg = response_chunk[-1]
        if last_msg.content:
            # Print only the newly streamed content
            print(last_msg.content[len(full_response):], end='', flush=True)
            full_response = last_msg.content

print(f"\n[Full Response] {full_response}")
```
### Multi-Turn Conversations

LLMAgent supports multi-turn conversations, allowing you to mix images and text:

```python
agent = LLMAgent(config=config, tag="multimodal_conversation")

# Turn 1: send an image
multimodal_content = [
    {"type": "text", "text": "How many people are in this image?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

messages = [Message(role="user", content=multimodal_content)]
response = await agent.run(messages=messages)
print(f"[Turn 1 Response] {response[-1].content}")

# Turn 2: follow-up question (text only, preserving context)
messages = response  # Use the previous response as context
messages.append(Message(role="user", content="What are they doing?"))
response = await agent.run(messages=messages)
print(f"[Turn 2 Response] {response[-1].content}")
```
## Multimodal Message Format

ms-agent uses the OpenAI-compatible multimodal message format. Images can be provided in three ways:

### 1. Image URL

```python
from ms_agent.llm.utils import Message

multimodal_content = [
    {"type": "text", "text": "Please describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

messages = [
    Message(role="user", content=multimodal_content)
]

response = llm.generate(messages=messages)
```
### 2. Base64 Encoding

```python
import base64

# Read and encode the image
with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "What is this?"},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{image_data}"
        }
    }
]

messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```
### 3. Local File Path

```python
import base64
import os

image_path = 'path/to/image.png'

# Determine the MIME type from the file extension
ext = os.path.splitext(image_path)[1].lower()
mime_type = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.gif': 'image/gif',
    '.webp': 'image/webp'
}.get(ext, 'image/png')

# Read and encode the image
with open(image_path, 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

multimodal_content = [
    {"type": "text", "text": "Describe this image."},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{image_data}"
        }
    }
]

messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
```
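The three approaches above can be folded into one small helper that accepts either a remote URL or a local file path and returns a ready-to-use content block. This is a sketch of our own rather than part of the ms-agent API (the `image_block` name is hypothetical); it reuses the base64 and MIME-type logic shown in the examples:

```python
import base64
import mimetypes


def image_block(source: str) -> dict:
    """Build an OpenAI-style image_url content block.

    A remote URL (or data: URL) is passed through as-is; a local file
    path is read and inlined as a base64 data URL. Illustrative helper,
    not part of ms-agent.
    """
    if source.startswith(('http://', 'https://', 'data:')):
        return {"type": "image_url", "image_url": {"url": source}}
    # Local file: guess the MIME type from the extension, default to PNG
    mime_type = mimetypes.guess_type(source)[0] or 'image/png'
    with open(source, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{image_data}"}
    }


# Mix text and image blocks in a single message payload
multimodal_content = [
    {"type": "text", "text": "Describe this image."},
    image_block("https://example.com/image.jpg"),
]
```

A list built this way can be passed to `Message(role="user", content=multimodal_content)` exactly as in the examples above.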
## Running Examples

### Running the Agent Example

```bash
# Run the complete test suite (including stream and non-stream modes)
python examples/agent/test_llm_agent_multimodal.py
```
## FAQ

### Q: Are there image size limits?

A: Yes, different models have different limits. For `qwen3.5-plus`:

- Recommended image size under 4 MB
- Recommended resolution not exceeding 2048x2048
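These limits can be checked before an image is encoded and sent. The helper below is our own sketch, not part of ms-agent: it checks the file size against the ~4 MB suggestion, and for PNG files reads the width and height straight from the IHDR header with the standard library (other formats would need a library such as Pillow):

```python
import os
import struct

MAX_BYTES = 4 * 1024 * 1024  # ~4 MB size suggestion
MAX_SIDE = 2048              # recommended max width/height


def check_png(path: str) -> None:
    """Raise ValueError if a PNG exceeds the recommended limits."""
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError('image larger than 4 MB')
    with open(path, 'rb') as f:
        header = f.read(24)
    # PNG layout: 8-byte signature, 4-byte chunk length, b'IHDR',
    # then 4-byte big-endian width and height
    if len(header) < 24 or header[:8] != b'\x89PNG\r\n\x1a\n':
        raise ValueError('not a PNG file')
    width, height = struct.unpack('>II', header[16:24])
    if width > MAX_SIDE or height > MAX_SIDE:
        raise ValueError(f'resolution {width}x{height} exceeds {MAX_SIDE}x{MAX_SIDE}')
```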
### Q: What image formats are supported?

A: Commonly supported formats:

- JPEG / JPG
- PNG
- GIF
- WebP

### Q: Can I send multiple images at once?

A: Yes, you can add multiple `image_url` blocks in a single message:

```python
multimodal_content = [
    {"type": "text", "text": "Compare these two images."},
    {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}}
]
```

### Q: Is streaming output supported?

A: Yes, multimodal conversations support streaming output. Set `stream` to `True`:

```python
config.generation_config.stream = True
response = llm.generate(messages=messages, stream=True)
```

docs/en/index.rst

Lines changed: 3 additions & 1 deletion

```diff
@@ -20,14 +20,16 @@ MS-Agent DOCUMENTATION
    Components/LLMAgent
    Components/Workflow
    Components/SupportedModels
+   Components/MultimodalSupport
    Components/Tools
+   Components/AgentSkills
    Components/ContributorGuide

 .. toctree::
    :maxdepth: 2
    :caption: 📁 Projects

-   Projects/AgentSkills
+   Projects/CodeGenesis
    Projects/DeepResearch
    Projects/FinResearch
    Projects/VideoGeneration
```
