Commit d4f10e2

feat: add qwen audio sample (#151)
* chore: update xml documents
* docs: update docs
* feat: group samples by type
* feat: add audio caption sample
* feat: add audio understanding sample
1 parent 2b9065d commit d4f10e2

47 files changed: +2830 -218 lines changed

README.md

Lines changed: 75 additions & 3 deletions
@@ -109,7 +109,8 @@ public class YourService(IDashScopeClient client)
 - [Formula Recognition](#formula-recognition)
 - [Text Recognition](#text-recognition)
 - [Multilanguage](#multilanguage)
-
+- [GUI](#gui)
+- [Audio Understanding](#audio-understanding)
 - [Text-to-Speech](#text-to-speech) - CosyVoice, Sambert, etc. For TTS applications
 - [Image Generation](#image-generation) - wanx2.1, etc. For text-to-image and portrait style transfer
 - [Application Call](#application-call)
@@ -499,7 +500,7 @@ new ModelRequest<TextGenerationInput, ITextGenerationParameters>()
 For Qwen-Long models:
 ```csharp
 var file = new FileInfo("test.txt");
-var uploadedFile = await dashScopeClient.UploadFileAsync(file.OpenRead(), file.Name);
+var uploadedFile = await dashScopeClient.OpenAiCompatibleUploadFileAsync(file.OpenRead(), file.Name);
 var history = new List<ChatMessage> { ChatMessage.File(uploadedFile.Id) };
 var completion = await client.client.GetTextCompletionAsync(
     new ModelRequest<TextGenerationInput, ITextGenerationParameters>()
@@ -514,7 +515,7 @@ var completion = await client.client.GetTextCompletionAsync(
     });
 Console.WriteLine(completion.Output.Choices[0].Message.Content);
 // Cleanup
-await dashScopeClient.DeleteFileAsync(uploadedFile.Id);
+await dashScopeClient.OpenAiCompatibleDeleteFileAsync(uploadedFile.Id);
 ```
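The snippet in this hunk deletes the uploaded file only when the completion call returns normally. The sketch below shows one way to make the cleanup unconditional with `try`/`finally`. `FakeFileClient`, `UploadAsync`, `DeleteAsync`, and `Workflow` are hypothetical stand-ins written for this example, not SDK types; with the real client the two calls would presumably be the `OpenAiCompatibleUploadFileAsync` and `OpenAiCompatibleDeleteFileAsync` methods shown above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;

var files = new FakeFileClient();
try
{
    // Simulate a completion call that fails after the upload succeeded.
    await Workflow.RunWithCleanupAsync(
        files,
        _ => throw new InvalidOperationException("completion failed"));
}
catch (InvalidOperationException ex)
{
    Console.WriteLine($"Request failed: {ex.Message}");
}

Console.WriteLine($"Files left on the server: {files.Count}");

public static class Workflow
{
    // Uploads a file, runs the caller's work, and always deletes the file afterwards.
    public static async Task RunWithCleanupAsync(FakeFileClient files, Func<string, Task> useFile)
    {
        await using var content = new MemoryStream(Encoding.UTF8.GetBytes("hello"));
        var fileId = await files.UploadAsync(content, "test.txt");
        try
        {
            await useFile(fileId);
        }
        finally
        {
            // Runs on both the success path and the exception path.
            await files.DeleteAsync(fileId);
        }
    }
}

// Hypothetical in-memory stand-in for the file API (not part of the SDK).
public sealed class FakeFileClient
{
    private readonly HashSet<string> _ids = new();

    public int Count => _ids.Count;

    public Task<string> UploadAsync(Stream content, string fileName)
    {
        var id = $"file-{Guid.NewGuid():N}";
        _ids.Add(id);
        return Task.FromResult(id);
    }

    public Task DeleteAsync(string id)
    {
        _ids.Remove(id);
        return Task.CompletedTask;
    }
}
```

The exception still propagates to the caller; the `finally` block only guarantees that the delete call is attempted first.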
 
 ## Multimodal
@@ -1442,6 +1443,77 @@ Response:

 Then you can execute the command that the model returns, and reply with the next screenshot and intention.

+### Audio Understanding
+
+Example (using `Qwen3-Omni-Captioner`):
+
+```csharp
+// Upload the audio file
+await using var audio = File.OpenRead("noise.wav");
+var ossLink = await client.UploadTemporaryFileAsync("qwen3-omni-30b-a3b-captioner", audio, "noise.wav");
+Console.WriteLine($"File uploaded: {ossLink}");
+var messages = new List<MultimodalMessage>
+{
+    MultimodalMessage.User(
+    [
+        // A public URL can also be passed in directly
+        MultimodalMessageContent.AudioContent(ossLink),
+    ])
+};
+var completion = client.GetMultimodalGenerationStreamAsync(
+    new ModelRequest<MultimodalInput, IMultimodalParameters>()
+    {
+        Model = "qwen3-omni-30b-a3b-captioner",
+        Input = new MultimodalInput() { Messages = messages },
+        Parameters = new MultimodalParameters() { IncrementalOutput = true, }
+    });
+var reply = new StringBuilder();
+var first = true;
+MultimodalTokenUsage? usage = null;
+await foreach (var chunk in completion)
+{
+    var choice = chunk.Output.Choices[0];
+    if (first)
+    {
+        first = false;
+        Console.WriteLine();
+        Console.Write("Assistant > ");
+    }
+
+    if (choice.Message.Content.Count == 0)
+    {
+        continue;
+    }
+
+    Console.Write(choice.Message.Content[0].Text);
+    reply.Append(choice.Message.Content[0].Text);
+    usage = chunk.Usage;
+}
+
+Console.WriteLine();
+messages.Add(MultimodalMessage.Assistant([MultimodalMessageContent.TextContent(reply.ToString())]));
+if (usage != null)
+{
+    Console.WriteLine(
+        $"Usage: in({usage.InputTokens})/out({usage.OutputTokens})/audio({usage.InputTokensDetails?.AudioTokens})/total({usage.TotalTokens})");
+}
+```
+
+Sample output:
+
+```csharp
+Assistant > The audio clip opens with a rapid, percussive metallic clatter, reminiscent of a typewriter or similar mechanical device, which continues in a steady rhythm throughout the recording. This clatter is slightly left-of-center in the stereo field and is accompanied by a faint, low-frequency hum, likely from a household appliance or HVAC system. The acoustic environment is a small, enclosed room with hard surfaces, indicated by the short, bright reverberation of both the clatter and the speaker’s voice. The audio quality is moderate, with a noticeable electronic hiss and some loss of high-frequency detail, but no digital distortion or clipping.
+
+At the one-second mark, a male voice enters, positioned slightly right-of-center and closer to the microphone. He speaks in standard Mandarin, with a tone of weary exasperation: “哎 呀,这样我还怎么安静工作啊?” (“Aiyā, zěnyàng wǒ hái zěnme ānjìng gōngzuò a?”), which translates to “Oh, how can I possibly work quietly like this?” His speech is clear, with a slightly rising pitch on “安静” (“quietly”) and a falling pitch on “啊” (“a”), conveying a sense of complaint and fatigue. The accent is standard, with no regional inflection, and the voice is that of a young to middle-aged adult male.
+
+Throughout the clip, the mechanical clatter remains constant and prominent, occasionally competing with the voice for clarity. There are no other sounds, such as footsteps, additional voices, or environmental noises, and the background is otherwise quiet. The interplay between the persistent mechanical noise and the speaker’s complaint creates a vivid sense of disruption and frustration, suggesting an environment where work is being impeded by an external, uncontrolled sound source.
+
+Culturally, the use of Mandarin, standard pronunciation, and modern recording quality indicate a contemporary, urban Chinese setting. The language and tone are universally relatable, reflecting a common experience of being disturbed during work. The lack of regional markers or distinctive background noises suggests a generic, possibly domestic or office-like space, but with no clear indicators of a specific location or social context.
+
+In summary, the audio portrays a modern Mandarin-speaking man, exasperated by a constant, distracting mechanical noise (likely a typewriter or similar device), attempting to work in a small, reverberant room. The recording’s technical and acoustic features reinforce the sense of disruption and frustration, while the language and setting suggest a contemporary, urban Chinese context.
+Usage: in(160)/out(514)/audio(152)/total(674)
+```
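The loop in the sample above mixes console output with accumulation, and it records `usage` only for chunks that carry text. As a rough sketch, the accumulation step can be isolated and exercised without a network call. `Chunk`, `ReplyCollector`, and `FakeStream` are hypothetical stand-ins for this example, not SDK types, and the assumption that a trailing chunk may carry usage but no text is ours, not something the README states.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

var (reply, totalTokens) = await ReplyCollector.CollectAsync(FakeStream.Chunks());
Console.WriteLine($"Assistant > {reply}");
Console.WriteLine($"Total tokens: {totalTokens}");

// Hypothetical chunk shape: a text delta plus an optional running token count.
public sealed record Chunk(string? Text, int? TotalTokens);

public static class ReplyCollector
{
    // Concatenates incremental text deltas and keeps the last usage value seen.
    public static async Task<(string Reply, int? TotalTokens)> CollectAsync(IAsyncEnumerable<Chunk> stream)
    {
        var reply = new StringBuilder();
        int? totalTokens = null;
        await foreach (var chunk in stream)
        {
            // Record usage before the empty-text check so a text-less chunk still counts.
            if (chunk.TotalTokens is not null)
            {
                totalTokens = chunk.TotalTokens;
            }

            if (string.IsNullOrEmpty(chunk.Text))
            {
                continue;
            }

            reply.Append(chunk.Text);
        }

        return (reply.ToString(), totalTokens);
    }
}

// Hypothetical stream that imitates incremental output, including empty chunks.
public static class FakeStream
{
    public static async IAsyncEnumerable<Chunk> Chunks()
    {
        yield return new Chunk(null, null);
        await Task.Yield();
        yield return new Chunk("The audio clip opens ", 170);
        yield return new Chunk("with a metallic clatter.", 180);
        yield return new Chunk("", 182);
    }
}
```

This only works as written when each chunk carries a delta, as with `IncrementalOutput = true` in the sample; if chunks carried the full text so far, appending them would duplicate content.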
+
 ## Text-to-Speech

 Create a speech synthesis session using `dashScopeClient.CreateSpeechSynthesizerSocketSessionAsync()`.

README.zh-Hans.md

Lines changed: 98 additions & 16 deletions
@@ -108,6 +108,7 @@ public class YourService(IDashScopeClient client)
 - [Text Recognition](#通用文本识别)
 - [Multilanguage](#多语言识别)
 - [GUI](#界面交互)
+- [Audio Understanding](#音频理解)
 - [Text-to-Speech](#语音合成) - CosyVoice, Sambert, etc. For TTS applications
 - [Image Generation](#图像生成) - wanx2.1, etc. For text-to-image and portrait style transfer
 - [Application Call](#应用调用)
@@ -1385,14 +1386,6 @@ Deleting file2...Success
 */
 ```

-**Remember to delete uploaded files promptly; this API limits both the total number of files (10,000) and their total size (100 GB).**
-
-You can use `ListFileAsync` to get the full file list and delete the files you no longer need.
-
-Example:
-
-
-
 ### Translation (Qwen-MT)

 Translation is configured mainly through `TranslationOptions` in `Parameters`.
@@ -1614,6 +1607,23 @@ var completion = client.GetTextCompletionStreamAsync(
     });
 ```

+If the files come from public URLs, you can also pass them in with `TextChatMessage.DocUrl`; in that case no additional User message is needed.
+
+Example:
+
+```csharp
+var messages = new List<TextChatMessage>
+{
+    TextChatMessage.System("You are a helpful assistant"),
+    TextChatMessage.DocUrl(
+        "Extract all product information from these two product manuals and organize it into a standard JSON array. Each object must contain: model (the product model), name (the product name), price (the price, with currency symbols and commas removed)",
+        [
+            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/jockge/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CA.docx",
+            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/ztwxzr/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CB.docx"
+        ]),
+};
+```
+
 Full sample code:

 ````csharp
@@ -2680,7 +2690,7 @@ foreach (var info in completion.Output.Choices[0].Message.Content[0].OcrResult!.
 Output:

 ````csharp
-Text:
+Text:
 ```json
 [
 {"rotate_rect": [236, 254, 115, 299, 90], "text": "OpenAI 兼容"},
@@ -2690,7 +2700,7 @@ Text:
 {"rotate_rect": [712, 684, 115, 85, 90], "text": "curl"}
 ]
 ```
-WordsInfo:
+WordsInfo:
 OpenAI 兼容
 Location: [46,55,205,55,205,87,46,87]
 RotateRect: [125,71,159,32,0]
@@ -3309,44 +3319,44 @@ Salam!
 "y": <integer>,
 "description": "<string, optional: a short string describing what you are clicking, e.g. "Chrome browser icon" or "Login button".>"
 }
-
+
 ### TYPE
 - **Function**: Type text.
 - **Parameters template**:
 {
 "text": "<string>",
 "needs_enter": <boolean>
 }
-
+
 ### SCROLL
 - **Function**: Scroll the window.
 - **Parameters template**:
 {
 "direction": "<'up' or 'down'>",
 "amount": "<'small', 'medium', or 'large'>"
 }
-
+
 ### KEY_PRESS
 - **Function**: Press a function key.
 - **Parameters template**:
 {
 "key": "<string: e.g., 'enter', 'esc', 'alt+f4'>"
 }
-
+
 ### FINISH
 - **Function**: The task completed successfully.
 - **Parameters template**:
 {
 "message": "<string: summarize how the task was completed>"
 }
-
+
 ### FAILE
 - **Function**: The task cannot be completed.
 - **Parameters template**:
 {
 "reason": "<string: clearly explain why the task failed>"
 }
-
+
 ## 4. Thinking and Decision Framework
 Before generating each action, strictly follow the think-and-verify process below:

@@ -3400,6 +3410,78 @@ var completion = client.GetMultimodalGenerationStreamAsync(

 You then need to carry out the action returned by the model yourself (here, clicking a position on the screen), and send back the next screenshot and intent.

+### Audio Understanding
+
+`qwen-audio` cannot be used in production, so this example uses `Qwen3-Omni-Captioner`:
+
+Sample request:
+
+```csharp
+// Upload the audio file
+await using var audio = File.OpenRead("noise.wav");
+var ossLink = await client.UploadTemporaryFileAsync("qwen3-omni-30b-a3b-captioner", audio, "noise.wav");
+Console.WriteLine($"File uploaded: {ossLink}");
+var messages = new List<MultimodalMessage>
+{
+    MultimodalMessage.User(
+    [
+        // A public URL can also be passed in directly
+        MultimodalMessageContent.AudioContent(ossLink),
+    ])
+};
+var completion = client.GetMultimodalGenerationStreamAsync(
+    new ModelRequest<MultimodalInput, IMultimodalParameters>()
+    {
+        Model = "qwen3-omni-30b-a3b-captioner",
+        Input = new MultimodalInput() { Messages = messages },
+        Parameters = new MultimodalParameters() { IncrementalOutput = true, }
+    });
+```
+
+Streaming incremental output is enabled here, so iterating over the returned `IAsyncEnumerable` yields the model's reply. Example:
+
+```csharp
+var reply = new StringBuilder();
+var first = true;
+MultimodalTokenUsage? usage = null;
+await foreach (var chunk in completion)
+{
+    var choice = chunk.Output.Choices[0];
+    if (first)
+    {
+        first = false;
+        Console.WriteLine();
+        Console.Write("Assistant > ");
+    }
+
+    if (choice.Message.Content.Count == 0)
+    {
+        continue;
+    }
+
+    Console.Write(choice.Message.Content[0].Text);
+    reply.Append(choice.Message.Content[0].Text);
+    usage = chunk.Usage;
+}
+
+Console.WriteLine();
+messages.Add(MultimodalMessage.Assistant([MultimodalMessageContent.TextContent(reply.ToString())]));
+```
+
+Sample output:
+
+```
+Assistant > The audio clip opens with a rapid, percussive metallic clatter, reminiscent of a typewriter or similar mechanical device, which continues in a steady rhythm throughout the recording. This clatter is slightly left-of-center in the stereo field and is accompanied by a faint, low-frequency hum, likely from a household appliance or HVAC system. The acoustic environment is a small, enclosed room with hard surfaces, indicated by the short, bright reverberation of both the clatter and the speaker’s voice. The audio quality is moderate, with a noticeable electronic hiss and some loss of high-frequency detail, but no digital distortion or clipping.
+
+At the one-second mark, a male voice enters, positioned slightly right-of-center and closer to the microphone. He speaks in standard Mandarin, with a tone of weary exasperation: “哎 呀,这样我还怎么安静工作啊?” (“Aiyā, zěnyàng wǒ hái zěnme ānjìng gōngzuò a?”), which translates to “Oh, how can I possibly work quietly like this?” His speech is clear, with a slightly rising pitch on “安静” (“quietly”) and a falling pitch on “啊” (“a”), conveying a sense of complaint and fatigue. The accent is standard, with no regional inflection, and the voice is that of a young to middle-aged adult male.
+
+Throughout the clip, the mechanical clatter remains constant and prominent, occasionally competing with the voice for clarity. There are no other sounds, such as footsteps, additional voices, or environmental noises, and the background is otherwise quiet. The interplay between the persistent mechanical noise and the speaker’s complaint creates a vivid sense of disruption and frustration, suggesting an environment where work is being impeded by an external, uncontrolled sound source.
+
+Culturally, the use of Mandarin, standard pronunciation, and modern recording quality indicate a contemporary, urban Chinese setting. The language and tone are universally relatable, reflecting a common experience of being disturbed during work. The lack of regional markers or distinctive background noises suggests a generic, possibly domestic or office-like space, but with no clear indicators of a specific location or social context.
+
+In summary, the audio portrays a modern Mandarin-speaking man, exasperated by a constant, distracting mechanical noise (likely a typewriter or similar device), attempting to work in a small, reverberant room. The recording’s technical and acoustic features reinforce the sense of disruption and frustration, while the language and setting suggest a contemporary, urban Chinese context.
+```
+
 ## Text-to-Speech

 Create a speech synthesis session using `dashScopeClient.CreateSpeechSynthesizerSocketSessionAsync()`.

sample/Cnblogs.DashScope.Sample/Cnblogs.DashScope.Sample.csproj

Lines changed: 3 additions & 0 deletions
@@ -50,6 +50,9 @@
     <None Update="multilanguage.jpg">
       <CopyToOutputDirectory>Always</CopyToOutputDirectory>
     </None>
+    <None Update="noise.wav">
+      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
+    </None>
   </ItemGroup>

   <ItemGroup>

sample/Cnblogs.DashScope.Sample/Files/FileUploadSample.cs

Lines changed: 3 additions & 3 deletions
@@ -3,13 +3,13 @@

 namespace Cnblogs.DashScope.Sample.Files;

-public class FileUploadSample : ISample
+public class FileUploadSample : FilesSample
 {
     /// <inheritdoc />
-    public string Description => "Upload File Sample";
+    public override string Description => "Upload File Sample";

     /// <inheritdoc />
-    public async Task RunAsync(IDashScopeClient client)
+    public override async Task RunAsync(IDashScopeClient client)
     {
         var json = new JsonSerializerOptions(JsonSerializerDefaults.Web) { WriteIndented = true };
         var file = new FileInfo("Lenna.jpg");
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+using Cnblogs.DashScope.Core;
+
+namespace Cnblogs.DashScope.Sample;
+
+public abstract class FilesSample : ISample
+{
+    /// <inheritdoc />
+    public string Group => "Files";
+
+    /// <inheritdoc />
+    public abstract string Description { get; }
+
+    /// <inheritdoc />
+    public abstract Task RunAsync(IDashScopeClient client);
+}

sample/Cnblogs.DashScope.Sample/ISample.cs

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ namespace Cnblogs.DashScope.Sample;

 public interface ISample
 {
+    string Group { get; }
     string Description { get; }
     Task RunAsync(IDashScopeClient client);
 }
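
The new `Group` property makes it possible to present samples by category. The commit's own menu code is not visible in this excerpt, so the following is only a sketch of how a runner might group and order samples with LINQ. `SampleMenu`, `UploadSample`, `DeleteSample`, `AudioCaptionSample`, the `"Multimodal"` group name, and the empty `IDashScopeClient` stub are all hypothetical and exist only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical sample list; the real project may discover its samples differently.
var samples = new ISample[] { new UploadSample(), new AudioCaptionSample(), new DeleteSample() };

foreach (var group in SampleMenu.Build(samples))
{
    Console.WriteLine(group.Key);
    foreach (var sample in group)
    {
        Console.WriteLine($"  {sample.Description}");
    }
}

// Empty stand-in for the SDK client type so the sketch compiles on its own.
public interface IDashScopeClient { }

// Same members as the interface in the diff above.
public interface ISample
{
    string Group { get; }
    string Description { get; }
    Task RunAsync(IDashScopeClient client);
}

public static class SampleMenu
{
    // Sorts samples by description, buckets them by Group, then sorts the buckets by name.
    public static List<IGrouping<string, ISample>> Build(IEnumerable<ISample> samples) =>
        samples
            .OrderBy(s => s.Description, StringComparer.Ordinal)
            .GroupBy(s => s.Group)
            .OrderBy(g => g.Key, StringComparer.Ordinal)
            .ToList();
}

// Base class in the style of FilesSample: the group is fixed once for all file samples.
public abstract class FilesSampleBase : ISample
{
    public string Group => "Files";
    public abstract string Description { get; }
    public abstract Task RunAsync(IDashScopeClient client);
}

public sealed class UploadSample : FilesSampleBase
{
    public override string Description => "Upload File Sample";
    public override Task RunAsync(IDashScopeClient client) => Task.CompletedTask;
}

public sealed class DeleteSample : FilesSampleBase
{
    public override string Description => "Delete File Sample";
    public override Task RunAsync(IDashScopeClient client) => Task.CompletedTask;
}

public sealed class AudioCaptionSample : ISample
{
    public string Group => "Multimodal"; // hypothetical group name
    public string Description => "Audio Caption Sample";
    public Task RunAsync(IDashScopeClient client) => Task.CompletedTask;
}
```

Putting `Group` on an abstract base class, as `FilesSample` does, keeps each concrete sample from repeating the group name.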
