add Qwen tts model support. #541
base: main
Change-Id: Ie83648211c2578fa814b351f7a106be0f4387d85
Change-Id: I1bbfffc706544db8abc8188d613f2853efde4427
Pull request overview
This PR adds Qwen TTS (Text-to-Speech) model support to AgentScope Java, enabling agents to "speak" their responses in real time. The implementation supports the qwen3-tts-flash and qwen-tts models alongside the existing Sambert models.
Changes:
- Adds core TTS infrastructure (TTSModel interface, TTSOptions, TTSResponse, TTSException)
- Implements DashScopeTTSModel for non-streaming TTS and DashScopeRealtimeTTSModel for streaming synthesis
- Provides TTSHook for automatic agent speech synthesis and AudioPlayer for local audio playback
- Updates DashScopeMultiModalTool to support Qwen TTS models alongside existing Sambert models
- Includes comprehensive documentation in both English and Chinese with usage examples
- Adds example applications (CLI and web-based) demonstrating all three TTS usage patterns
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/zh/task/tts.md | Chinese documentation for TTS functionality |
| docs/en/task/tts.md | English documentation for TTS functionality |
| TTSExample.java | Quickstart example demonstrating three TTS usage patterns |
| ReActAgentWithTTSDemo.java | Interactive CLI demo with real-time TTS |
| ChatTTSSpringBootApplication.java | Spring Boot web application entry point |
| ChatController.java | REST controller for TTS-enabled chat with SSE streaming |
| index.html | Frontend UI for real-time chat with audio playback |
| TTSModel.java | Base interface for all TTS model implementations |
| TTSOptions.java | Configuration options for TTS synthesis |
| TTSResponse.java | Response object encapsulating TTS synthesis results |
| TTSException.java | Exception class for TTS operation failures |
| DashScopeTTSModel.java | Non-streaming TTS model implementation using DashScope API |
| DashScopeRealtimeTTSModel.java | Streaming TTS model with incremental input support |
| AudioPlayer.java | Local audio playback using Java Sound API |
| TTSHook.java | Hook for real-time TTS during agent execution |
| DashScopeMultiModalTool.java | Extended to support Qwen TTS models via multimodal API |
| DashScopeTTSModelTest.java | Unit tests for DashScopeTTSModel |
| Test files | Updated test cases for DashScopeMultiModalTool |
| POM files | Added dependencies and module configurations |
    /**
     * Synthesize audio using Qwen TTS models via multimodal-generation API.
     */
    private Mono<ToolResultBlock> synthesizeWithQwenTTS(
            String text, String model, String voice, String language) {
        String finalVoice =
                Optional.ofNullable(voice).filter(s -> !s.trim().isEmpty()).orElse("Cherry");
        String finalLanguage =
                Optional.ofNullable(language).filter(s -> !s.trim().isEmpty()).orElse("Chinese");

        return Mono.fromCallable(
                        () -> {
                            // Build request for Qwen TTS API
                            Map<String, Object> input = new java.util.HashMap<>();
                            input.put("text", text);
                            input.put("voice", finalVoice);
                            input.put("language_type", finalLanguage);

                            Map<String, Object> request = new java.util.HashMap<>();
                            request.put("model", model);
                            request.put("input", input);

                            String requestBody =
                                    io.agentscope.core.util.JsonUtils.getJsonCodec()
                                            .toJson(request);

                            // Call DashScope API using Java HttpClient
                            java.net.http.HttpClient client =
                                    java.net.http.HttpClient.newHttpClient();
                            java.net.http.HttpRequest httpRequest =
                                    java.net.http.HttpRequest.newBuilder()
                                            .uri(
                                                    URI.create(
                                                            "https://dashscope.aliyuncs.com/api/v1/services"
                                                                    + "/aigc/multimodal-generation/generation"))
                                            .header("Authorization", "Bearer " + this.apiKey)
                                            .header("Content-Type", "application/json")
                                            .header("User-Agent", Version.getUserAgent())
                                            .POST(
                                                    java.net.http.HttpRequest.BodyPublishers
                                                            .ofString(requestBody))
                                            .build();

                            java.net.http.HttpResponse<String> response =
                                    client.send(
                                            httpRequest,
                                            java.net.http.HttpResponse.BodyHandlers.ofString());

                            if (response.statusCode() != 200) {
                                log.error(
                                        "Qwen TTS API failed: status={}, body={}",
                                        response.statusCode(),
                                        response.body());
                                return ToolResultBlock.error(
                                        "TTS API failed: " + response.statusCode());
                            }

                            return parseQwenTTSResponse(response.body());
                        })
                .onErrorResume(
                        e -> {
                            log.error(
                                    "Failed to generate audio with Qwen TTS: '{}'",
                                    e.getMessage(),
                                    e);
                            return Mono.just(ToolResultBlock.error(e.getMessage()));
                        });
    }
Copilot AI, Jan 13, 2026
The parseQwenTTSResponse and synthesizeWithQwenTTS methods are private but lack Javadoc documentation. While the guideline focuses on public methods, these are substantial private methods that would benefit from documentation explaining their purpose and behavior, especially since they handle complex API response parsing.
    // Streaming input support
    private final boolean supportsStreamingInput = true;
    private final StringBuilder textBuffer = new StringBuilder();
    private final BlockingQueue<AudioBlock> audioQueue = new LinkedBlockingQueue<>();
    private final Sinks.Many<AudioBlock> audioSink =
            Sinks.many().multicast().onBackpressureBuffer();
    private final AtomicBoolean sessionActive = new AtomicBoolean(false);
    private Thread synthesisThread;
Copilot AI, Jan 13, 2026
The sessionActive, textBuffer, and audioQueue fields are accessed from multiple threads (main thread and synthesis thread) without proper synchronization. While AtomicBoolean is used for sessionActive, the textBuffer uses synchronized blocks, but audioQueue operations and reading sessionActive in multiple places could lead to race conditions. Consider using more comprehensive synchronization or documented thread-safety guarantees.
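One way to make the thread-safety guarantee explicit is to put the shared text buffer and the session flag behind a single small class with one lock. The sketch below is only an illustration of that idea using the JDK; `SessionTextBuffer` and its method names are hypothetical and are not part of this PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper (not from the PR): one lock guards both the buffer and the session state. */
final class SessionTextBuffer {
    private final Object lock = new Object();
    private final StringBuilder buffer = new StringBuilder();
    private final AtomicBoolean active = new AtomicBoolean(false);

    /** Returns true only for the single caller that actually starts the session. */
    boolean start() {
        return active.compareAndSet(false, true);
    }

    /** Appends text while the session is active; returns false if the session is not active. */
    boolean append(String text) {
        synchronized (lock) {
            if (!active.get()) {
                return false;
            }
            buffer.append(text);
            return true;
        }
    }

    /** Atomically takes everything buffered so far and clears the buffer. */
    String drain() {
        synchronized (lock) {
            String pending = buffer.toString();
            buffer.setLength(0);
            return pending;
        }
    }

    /** Ends the session and returns the remaining text; later appends are rejected. */
    String finish() {
        synchronized (lock) {
            active.set(false);
            return drain(); // the lock is reentrant, so the nested call is safe
        }
    }
}
```

Because `append` checks the flag while holding the same lock that `finish` uses to clear it, no text can slip into the buffer after `finish` has returned. The queue in the snippet above is already a `BlockingQueue`, which is safe for concurrent producers and consumers on its own.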
    public void stop() {
        if (audioPlayer != null && playerStarted) {
            audioPlayer.stop();
            playerStarted = false;
        }
        sessionStarted = false;
        audioSink.tryEmitComplete();
    }
Copilot AI, Jan 13, 2026
In the stop() method, audioSink.tryEmitComplete() is called without checking the result. If emission fails, the sink may not be properly closed. Consider logging a warning if tryEmitComplete() returns a failure result.
    private void emitAudio(AudioBlock audio) {
        // 1. Emit to reactive stream (for SSE/WebSocket consumers)
        audioSink.tryEmitNext(audio);

        // 2. Call callback if provided
        if (audioCallback != null) {
            audioCallback.accept(audio);
        }

        // 3. Play locally if player is configured
        if (audioPlayer != null) {
            audioPlayer.play(audio);
        }
    }
Copilot AI, Jan 13, 2026
The method emitAudio calls audioSink.tryEmitNext(audio) without checking the return value. If the emission fails (e.g., due to backpressure or sink termination), the audio block will be silently dropped. Consider logging a warning when emission fails to help with debugging.
        // Start background playback thread
        playbackThread = new Thread(this::playbackLoop, "audio-player");
        playbackThread.setDaemon(true);
Copilot AI, Jan 13, 2026
The playbackThread is started as a daemon thread, which means it will be abruptly terminated when the JVM exits, potentially cutting off audio playback mid-stream. Consider implementing graceful shutdown or documenting this behavior, especially since the stop() method exists but may not always be called.
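A graceful shutdown along the lines suggested here could look like the sketch below: the worker stays a daemon thread, but `stop` enqueues an end-of-input marker and waits (with a timeout) for the queued audio to be consumed before giving up. `GracefulPlaybackWorker` is a hypothetical stand-in written against the JDK only; it does not reproduce the PR's `AudioPlayer`, and raw `byte[]` chunks stand in for audio blocks.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical stand-in for a player: drains queued chunks before the thread exits. */
final class GracefulPlaybackWorker {
    // Sentinel compared by identity, so an empty chunk from a caller is never mistaken for it.
    private static final byte[] END_OF_INPUT = new byte[0];

    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final Thread thread;

    GracefulPlaybackWorker(Consumer<byte[]> output) {
        thread = new Thread(() -> playbackLoop(output), "audio-player");
        thread.setDaemon(true); // still a daemon; stop() provides the orderly exit
    }

    void start() {
        thread.start();
    }

    void play(byte[] chunk) {
        queue.add(chunk);
    }

    /** Signals end of input and waits for queued chunks; returns true if the thread exited in time. */
    boolean stop(long timeout, TimeUnit unit) {
        queue.add(END_OF_INPUT);
        try {
            thread.join(Math.max(1, unit.toMillis(timeout))); // join(0) would wait forever
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (thread.isAlive()) {
            thread.interrupt(); // give up on the remaining audio
            return false;
        }
        return true;
    }

    private void playbackLoop(Consumer<byte[]> output) {
        try {
            while (true) {
                byte[] chunk = queue.take();
                if (chunk == END_OF_INPUT) {
                    return;
                }
                output.accept(chunk);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If callers cannot be relied on to call `stop`, it could also be registered through `Runtime.getRuntime().addShutdownHook(...)`, so that normal JVM exit waits briefly for pending audio instead of cutting it off.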
                                                    URI.create(
                                                            "https://dashscope.aliyuncs.com/api/v1/services"
                                                                    + "/aigc/multimodal-generation/generation"))
                                            .header("Authorization", "Bearer " + this.apiKey)
Copilot AI, Jan 13, 2026
The API key is passed directly to the HttpClient without any validation or sanitization. Consider adding validation to ensure the API key is not empty/null before making the HTTP request, or at least document the expected format.
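A minimal sketch of the validation suggested here, assuming a fail-fast check before the request is built. `ApiKeyGuard` and its methods are hypothetical names introduced for illustration; only the endpoint URL and header names are taken from the snippet above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper (not from the PR): rejects a missing key before any HTTP request is built. */
final class ApiKeyGuard {
    private ApiKeyGuard() {}

    /** Returns the trimmed key, or throws if it is null or blank. */
    static String requireApiKey(String apiKey) {
        if (apiKey == null || apiKey.trim().isEmpty()) {
            throw new IllegalArgumentException("DashScope API key must not be null or blank");
        }
        return apiKey.trim();
    }

    /** Builds the POST request only after the key has passed validation. */
    static HttpRequest buildRequest(String apiKey, String requestBody) {
        String key = requireApiKey(apiKey);
        return HttpRequest.newBuilder()
                .uri(URI.create(
                        "https://dashscope.aliyuncs.com/api/v1/services"
                                + "/aigc/multimodal-generation/generation"))
                .header("Authorization", "Bearer " + key)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody))
                .build();
    }
}
```

Failing early this way turns a missing key into a clear local error rather than an authentication failure reported by the remote API. Whether the check belongs in the tool's constructor or at call time is a design choice left to the PR author.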
…hScopeRealtimeTTSModel.java Co-authored-by: Copilot <[email protected]>
…hScopeRealtimeTTSModel.java Co-authored-by: Copilot <[email protected]>
…ioPlayer.java Co-authored-by: Copilot <[email protected]>
Change-Id: Ia1f57f71a15d0e7ae6c856c2e70af580d1e86c1a
Change-Id: Id22dae41979765f5ac7b95bea69cdec49966bf8b
Change-Id: I6d25a1ffc056b87afcf0f7ccdc50e71e357538c1
Change-Id: I2c2760b5268279690992b329b5a779973f7e4cc9
Change-Id: I42142bfb331d73ed859f19e9a8d18b9c852d52e3
Change-Id: Id92fffd580ef06f8430a9dce8923aabde6500efd
Change-Id: I1a335080359c2c4de5f14c6f46bf5eb3cf802b4b
Change-Id: Ic3cd684fb3ac9d327d5e417f8ba4a8e95478abbe
Change-Id: I6a1dbc5c37ad71ddb936c9896ddfc6756192d202
Change-Id: I3ca32bfc9e83d2c7f6121f86a0affc4c4433ee4f
Change-Id: I638f53187f4facac5568103649b3034d7269c44d
Change-Id: I6fc008d9337c6dccbf01b2d07aa177667b2dc08a
Change-Id: I39283aebefa6978c6cc9829f156e345aa47f3c31
AgentScope-Java Version
1.0.7-SNAPSHOT
Description
Add Qwen TTS model support.
AgentScope provides three ways to use TTS:
Checklist
Please check the following items before code is ready to be reviewed.
- `mvn spotless:apply`
- `mvn test`