@flystar32
Collaborator

AgentScope-Java Version

1.0.7-SNAPSHOT

Description

Add Qwen TTS model support.

AgentScope provides three ways to use TTS:

  1. TTSHook - Speaks every Agent response automatically (non-invasive; speech is synthesized while the response is still being generated)
  2. TTSModel - Standalone speech synthesis (independent of the Agent; can be called whenever needed)
  3. DashScopeMultiModalTool - The Agent actively invokes TTS as a tool (it converts text to speech when it decides it is needed)

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with mvn spotless:apply
  • All tests are passing (mvn test)
  • Javadoc comments are complete and follow project conventions
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

Change-Id: Ie83648211c2578fa814b351f7a106be0f4387d85
Change-Id: I1bbfffc706544db8abc8188d613f2853efde4427
Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive Qwen TTS (Text-to-Speech) model support to AgentScope Java, enabling agents to "speak" their responses in real-time. The implementation supports the qwen3-tts-flash and qwen-tts models alongside existing Sambert models.

Changes:

  • Adds core TTS infrastructure (TTSModel interface, TTSOptions, TTSResponse, TTSException)
  • Implements DashScopeTTSModel for non-streaming TTS and DashScopeRealtimeTTSModel for streaming synthesis
  • Provides TTSHook for automatic agent speech synthesis and AudioPlayer for local audio playback
  • Updates DashScopeMultiModalTool to support Qwen TTS models alongside existing Sambert models
  • Includes comprehensive documentation in both English and Chinese with usage examples
  • Adds example applications (CLI and web-based) demonstrating all three TTS usage patterns

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| docs/zh/task/tts.md | Chinese documentation for TTS functionality |
| docs/en/task/tts.md | English documentation for TTS functionality |
| TTSExample.java | Quickstart example demonstrating three TTS usage patterns |
| ReActAgentWithTTSDemo.java | Interactive CLI demo with real-time TTS |
| ChatTTSSpringBootApplication.java | Spring Boot web application entry point |
| ChatController.java | REST controller for TTS-enabled chat with SSE streaming |
| index.html | Frontend UI for real-time chat with audio playback |
| TTSModel.java | Base interface for all TTS model implementations |
| TTSOptions.java | Configuration options for TTS synthesis |
| TTSResponse.java | Response object encapsulating TTS synthesis results |
| TTSException.java | Exception class for TTS operation failures |
| DashScopeTTSModel.java | Non-streaming TTS model implementation using DashScope API |
| DashScopeRealtimeTTSModel.java | Streaming TTS model with incremental input support |
| AudioPlayer.java | Local audio playback using Java Sound API |
| TTSHook.java | Hook for real-time TTS during agent execution |
| DashScopeMultiModalTool.java | Extended to support Qwen TTS models via multimodal API |
| DashScopeTTSModelTest.java | Unit tests for DashScopeTTSModel |
| Test files | Updated test cases for DashScopeMultiModalTool |
| POM files | Added dependencies and module configurations |

Comment on lines 393 to 460
/**
 * Synthesize audio using Qwen TTS models via multimodal-generation API.
 */
private Mono<ToolResultBlock> synthesizeWithQwenTTS(
        String text, String model, String voice, String language) {
    String finalVoice =
            Optional.ofNullable(voice).filter(s -> !s.trim().isEmpty()).orElse("Cherry");
    String finalLanguage =
            Optional.ofNullable(language).filter(s -> !s.trim().isEmpty()).orElse("Chinese");

    return Mono.fromCallable(
                    () -> {
                        // Build request for Qwen TTS API
                        Map<String, Object> input = new java.util.HashMap<>();
                        input.put("text", text);
                        input.put("voice", finalVoice);
                        input.put("language_type", finalLanguage);

                        Map<String, Object> request = new java.util.HashMap<>();
                        request.put("model", model);
                        request.put("input", input);

                        String requestBody =
                                io.agentscope.core.util.JsonUtils.getJsonCodec()
                                        .toJson(request);

                        // Call DashScope API using Java HttpClient
                        java.net.http.HttpClient client =
                                java.net.http.HttpClient.newHttpClient();
                        java.net.http.HttpRequest httpRequest =
                                java.net.http.HttpRequest.newBuilder()
                                        .uri(
                                                URI.create(
                                                        "https://dashscope.aliyuncs.com/api/v1/services"
                                                                + "/aigc/multimodal-generation/generation"))
                                        .header("Authorization", "Bearer " + this.apiKey)
                                        .header("Content-Type", "application/json")
                                        .header("User-Agent", Version.getUserAgent())
                                        .POST(
                                                java.net.http.HttpRequest.BodyPublishers
                                                        .ofString(requestBody))
                                        .build();

                        java.net.http.HttpResponse<String> response =
                                client.send(
                                        httpRequest,
                                        java.net.http.HttpResponse.BodyHandlers.ofString());

                        if (response.statusCode() != 200) {
                            log.error(
                                    "Qwen TTS API failed: status={}, body={}",
                                    response.statusCode(),
                                    response.body());
                            return ToolResultBlock.error(
                                    "TTS API failed: " + response.statusCode());
                        }

                        return parseQwenTTSResponse(response.body());
                    })
            .onErrorResume(
                    e -> {
                        log.error(
                                "Failed to generate audio with Qwen TTS: '{}'",
                                e.getMessage(),
                                e);
                        return Mono.just(ToolResultBlock.error(e.getMessage()));
                    });
}

Copilot AI Jan 13, 2026

The parseQwenTTSResponse and synthesizeWithQwenTTS methods are private but lack Javadoc documentation. While the guideline focuses on public methods, these are substantial private methods that would benefit from documentation explaining their purpose and behavior, especially since they handle complex API response parsing.

Copilot uses AI. Check for mistakes.
Comment on lines 99 to 106
// Streaming input support
private final boolean supportsStreamingInput = true;
private final StringBuilder textBuffer = new StringBuilder();
private final BlockingQueue<AudioBlock> audioQueue = new LinkedBlockingQueue<>();
private final Sinks.Many<AudioBlock> audioSink =
        Sinks.many().multicast().onBackpressureBuffer();
private final AtomicBoolean sessionActive = new AtomicBoolean(false);
private Thread synthesisThread;

Copilot AI Jan 13, 2026

The sessionActive, textBuffer, and audioQueue fields are accessed from multiple threads (the main thread and the synthesis thread) without consistent synchronization. sessionActive is an AtomicBoolean and textBuffer is guarded by synchronized blocks, but audioQueue operations and the reads of sessionActive in multiple places could still lead to race conditions. Consider using more comprehensive synchronization or documenting the thread-safety guarantees.
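
One way to make the thread-safety contract explicit is sketched below, under stated assumptions: the text buffer and the session flag are changed only while holding a single lock, state transitions use compareAndSet so that only one caller wins, and audio is handed over through a BlockingQueue, which is already safe for concurrent use. The class and method names (StreamingSessionSketch, push, drainText, and so on) are hypothetical and are not taken from DashScopeRealtimeTTSModel; byte[] stands in for AudioBlock.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; not the PR's DashScopeRealtimeTTSModel. */
final class StreamingSessionSketch {
    private final Object bufferLock = new Object();
    // Guarded by bufferLock.
    private final StringBuilder textBuffer = new StringBuilder();
    // BlockingQueue implementations are thread-safe on their own.
    private final BlockingQueue<byte[]> audioQueue = new LinkedBlockingQueue<>();
    private final AtomicBoolean sessionActive = new AtomicBoolean(false);

    /** Returns true only for the one caller that actually starts the session. */
    boolean start() {
        synchronized (bufferLock) {
            return sessionActive.compareAndSet(false, true);
        }
    }

    /** Returns true only for the one caller that actually stops the session. */
    boolean stop() {
        synchronized (bufferLock) {
            return sessionActive.compareAndSet(true, false);
        }
    }

    /** Appends text if the session is active; returns whether it was accepted. */
    boolean push(String text) {
        synchronized (bufferLock) {
            if (!sessionActive.get()) {
                return false;
            }
            textBuffer.append(text);
            return true;
        }
    }

    /** Takes and clears the buffered text as one atomic step. */
    String drainText() {
        synchronized (bufferLock) {
            String pending = textBuffer.toString();
            textBuffer.setLength(0);
            return pending;
        }
    }

    void offerAudio(byte[] chunk) {
        audioQueue.add(chunk);
    }

    /** Returns the next audio chunk, or null if none is queued. */
    byte[] pollAudio() {
        return audioQueue.poll();
    }
}
```

Because push and stop share the same lock, text can never be appended after stop has returned, which is the kind of guarantee the Javadoc could then state.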

Comment on lines 235 to 242
public void stop() {
    if (audioPlayer != null && playerStarted) {
        audioPlayer.stop();
        playerStarted = false;
    }
    sessionStarted = false;
    audioSink.tryEmitComplete();
}

Copilot AI Jan 13, 2026

In the stop() method, audioSink.tryEmitComplete() is called without checking the result. If emission fails, the sink may not be properly closed. Consider logging a warning if tryEmitComplete() returns a failure result.

Comment on lines 180 to 193
private void emitAudio(AudioBlock audio) {
    // 1. Emit to reactive stream (for SSE/WebSocket consumers)
    audioSink.tryEmitNext(audio);

    // 2. Call callback if provided
    if (audioCallback != null) {
        audioCallback.accept(audio);
    }

    // 3. Play locally if player is configured
    if (audioPlayer != null) {
        audioPlayer.play(audio);
    }
}

Copilot AI Jan 13, 2026

The method emitAudio calls audioSink.tryEmitNext(audio) without checking the return value. If the emission fails (e.g., due to backpressure or sink termination), the audio block will be silently dropped. Consider logging a warning when emission fails to help with debugging.
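
The "check the result of a non-blocking emit and log a warning on failure" pattern can be illustrated without extra dependencies. The sketch below deliberately swaps in a bounded java.util.concurrent.BlockingQueue for the Reactor sink: offer returns false when the queue is full, playing the role of a failed tryEmitNext. The class name CheckedEmitSketch and its methods are hypothetical, and byte[] stands in for AudioBlock.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

/** Hypothetical stand-in: a bounded queue plays the role of the audio sink. */
final class CheckedEmitSketch {
    private static final Logger LOG = Logger.getLogger(CheckedEmitSketch.class.getName());

    private final BlockingQueue<byte[]> sink;
    private final AtomicLong dropped = new AtomicLong();

    CheckedEmitSketch(int capacity) {
        this.sink = new ArrayBlockingQueue<>(capacity);
    }

    /** Tries to emit without blocking; logs and counts the chunk if it is rejected. */
    boolean emit(byte[] chunk) {
        boolean accepted = sink.offer(chunk); // false when the queue is full
        if (!accepted) {
            long total = dropped.incrementAndGet();
            LOG.warning(
                    "Audio chunk dropped (" + chunk.length + " bytes), total dropped: " + total);
        }
        return accepted;
    }

    /** Number of chunks rejected so far. */
    long droppedCount() {
        return dropped.get();
    }
}
```

With Reactor itself, the analogous check would inspect the Sinks.EmitResult returned by tryEmitNext (for example via isFailure()) instead of a boolean.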


// Start background playback thread
playbackThread = new Thread(this::playbackLoop, "audio-player");
playbackThread.setDaemon(true);

Copilot AI Jan 13, 2026

The playbackThread is started as a daemon thread, which means it will be abruptly terminated when the JVM exits, potentially cutting off audio playback mid-stream. Consider implementing graceful shutdown or documenting this behavior, especially since the stop() method exists but may not always be called.

Suggested change
playbackThread.setDaemon(true);
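
A graceful shutdown could look roughly like the following sketch, which is an illustration and not the PR's AudioPlayer. It keeps the worker a daemon thread, so a forgotten stop cannot block JVM exit, but stop enqueues an end-of-stream sentinel and waits a bounded time for already queued audio to finish before falling back to an interrupt. GracefulPlayerSketch, its methods, and the Consumer<byte[]> output are all hypothetical names; byte[] stands in for AudioBlock.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch of a playback worker with a graceful stop. */
final class GracefulPlayerSketch {
    // Sentinel marking end of stream; compared by reference, never played.
    private static final byte[] POISON = new byte[0];

    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    GracefulPlayerSketch(Consumer<byte[]> output) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    byte[] chunk = queue.take();
                    if (chunk == POISON) {
                        return; // everything queued before stop() has been played
                    }
                    output.accept(chunk);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // forced shutdown: exit promptly
            }
        }, "audio-player");
        worker.setDaemon(true);
    }

    void start() {
        worker.start();
    }

    void play(byte[] chunk) {
        queue.add(chunk);
    }

    /**
     * Lets queued audio finish, waiting up to timeoutMillis (must be > 0).
     * Returns true if the worker exited on its own, false if it had to be interrupted.
     */
    boolean stop(long timeoutMillis) {
        queue.add(POISON);
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (worker.isAlive()) {
            worker.interrupt();
            return false;
        }
        return true;
    }
}
```

If callers cannot be relied on to call stop, registering it with Runtime.getRuntime().addShutdownHook is one option; whether that fits this project is a design decision for the authors.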

        URI.create(
                "https://dashscope.aliyuncs.com/api/v1/services"
                        + "/aigc/multimodal-generation/generation"))
.header("Authorization", "Bearer " + this.apiKey)

Copilot AI Jan 13, 2026

The API key is passed directly to the HttpClient without any validation or sanitization. Consider adding validation to ensure the API key is not empty/null before making the HTTP request, or at least document the expected format.
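
A minimal sketch of such a check follows. The helper name ApiKeySketch and the rules it enforces (not null, not blank, no embedded whitespace or control characters) are illustrative assumptions; the actual DashScope key format is not specified in this PR.

```java
/** Hypothetical helper; the name and the rules it enforces are illustrative assumptions. */
final class ApiKeySketch {
    private ApiKeySketch() {}

    /**
     * Returns the key with surrounding whitespace removed, or throws
     * IllegalArgumentException if it is null, blank, or contains embedded
     * whitespace or control characters.
     */
    static String requireApiKey(String apiKey) {
        if (apiKey == null || apiKey.isBlank()) {
            throw new IllegalArgumentException("API key must not be null or blank");
        }
        String trimmed = apiKey.trim();
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (Character.isWhitespace(c) || Character.isISOControl(c)) {
                throw new IllegalArgumentException(
                        "API key contains whitespace or control characters");
            }
        }
        return trimmed;
    }

    /** Builds the Authorization header value from a validated key. */
    static String bearerHeader(String apiKey) {
        return "Bearer " + requireApiKey(apiKey);
    }
}
```

Running the check once, for example when the tool is constructed, would surface a misconfigured key before any HTTP request is attempted.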

flystar32 and others added 17 commits January 13, 2026 18:57
Change-Id: Ia1f57f71a15d0e7ae6c856c2e70af580d1e86c1a
Change-Id: Id22dae41979765f5ac7b95bea69cdec49966bf8b
Change-Id: I6d25a1ffc056b87afcf0f7ccdc50e71e357538c1
Change-Id: I2c2760b5268279690992b329b5a779973f7e4cc9
Change-Id: I42142bfb331d73ed859f19e9a8d18b9c852d52e3
Change-Id: Id92fffd580ef06f8430a9dce8923aabde6500efd
Change-Id: I1a335080359c2c4de5f14c6f46bf5eb3cf802b4b
Change-Id: Ic3cd684fb3ac9d327d5e417f8ba4a8e95478abbe
Change-Id: I6a1dbc5c37ad71ddb936c9896ddfc6756192d202
Change-Id: I3ca32bfc9e83d2c7f6121f86a0affc4c4433ee4f
Change-Id: I638f53187f4facac5568103649b3034d7269c44d
Change-Id: I6fc008d9337c6dccbf01b2d07aa177667b2dc08a
Change-Id: I39283aebefa6978c6cc9829f156e345aa47f3c31