This document explains the implementation of streaming responses in Autumn AI Assistant and the potential challenges you might face.
When you set "stream": True in the Ollama API request, the model sends back tokens (words or word pieces) in real time as they are generated, instead of waiting for the complete response.
- Better User Experience: Users see text appearing character by character (like ChatGPT)
- Perceived Speed: Feels faster even though total generation time is the same
- Early Interruption: Users can stop generation if they don't like the direction
- Real-time Feedback: Can process sentences as they complete for TTS
- Better Engagement: More interactive and dynamic conversation flow
Non-streaming request (wait for the full response):

```python
response = await self.http_client.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:1b-it-q4_K_M",
        "prompt": prompt,
        "stream": False,  # Wait for the complete response
        "options": {
            "temperature": 0.7,
            "top_p": 0.9,
            "num_predict": 500
        }
    }
)
```

Streaming request (process tokens as they arrive):

```python
async with self.http_client.stream(
    "POST", "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:1b-it-q4_K_M",
        "prompt": prompt,
        "stream": True,  # Enable streaming
        "options": {  # same options as above
            "temperature": 0.7,
            "top_p": 0.9,
            "num_predict": 500
        }
    }
) as response:
    async for chunk in response.aiter_lines():
        chunk_data = json.loads(chunk)
        token = chunk_data.get("response", "")
        # Process the token immediately
```

Problem: Each streaming chunk is a separate JSON object, and some might be malformed.
{"response": "Hello"}
{"response": " world"}
{"response": "!", "done": true}Solution: Wrap JSON parsing in try-catch and skip invalid chunks:
```python
try:
    chunk_data = json.loads(chunk)
except json.JSONDecodeError:
    logger.warning(f"Invalid JSON chunk: {chunk}")
    continue
```

Problem: Network issues can break the stream mid-response.
Solution: Implement retry logic and fall back to non-streaming:

```python
# Inside the stream-reading try block:
except httpx.ConnectError:
    logger.error("Stream connection lost, falling back to non-streaming")
    return await self._process_local_non_streaming(prompt)
```

Problem: Tokens might arrive as incomplete words or sentences.
Solution: Implement sentence buffering for TTS:

```python
sentence_buffer += token
if self._is_sentence_complete(sentence_buffer):
    # Send the complete sentence to TTS
    await self._speak_sentence(sentence_buffer)
    sentence_buffer = ""
```

Problem: Errors are harder to detect mid-stream than in complete responses.
Solution: Monitor for error indicators in the stream:

```python
if chunk_data.get("error"):
    logger.error(f"Stream error: {chunk_data['error']}")
    break
```

Problem: The UI needs to update in real time with streaming tokens.
Solution: Use async callbacks for UI updates:

```python
async def stream_callback(data):
    if data["type"] == "token":
        ui.append_text(data["content"])
    elif data["type"] == "sentence":
        tts.speak(data["content"])
```

Problem: Long responses might consume excessive memory with token buffering.
Solution: Implement a sliding window for very long responses:

```python
if len(full_response) > MAX_RESPONSE_LENGTH:
    # Trim older content but keep recent context
    full_response = full_response[-KEEP_LENGTH:]
```

Problem: TTS needs complete sentences, but streaming delivers individual tokens.
Solution: Buffer tokens until sentence completion:

```python
def _is_sentence_complete(text):
    endings = ['.', '!', '?', ':', ';']
    return (len(text.strip()) > 10 and
            any(text.strip().endswith(end) for end in endings))
```

Problem: Streaming can hang if the model stops generating without a "done" signal.
Solution: Implement a timeout with heartbeat monitoring:

```python
async with asyncio.timeout(60.0):  # 1 minute max (asyncio.timeout requires Python 3.11+)
    async for chunk in response.aiter_lines():
        ...  # process chunks
```

Problem: Multi-byte characters might be split across chunks.
Solution: When consuming raw bytes, use an incremental UTF-8 decoder, which buffers the trailing bytes of an incomplete character until the next chunk completes it:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
async for raw_chunk in response.aiter_raw():
    text = decoder.decode(raw_chunk)  # incomplete trailing bytes stay buffered
    # 'text' contains only complete characters; process it as usual
```

(Note that `response.aiter_lines()` already yields decoded text, so this only applies when reading the raw byte stream.)

Problem: Processing many small chunks is less efficient than processing one large response.
Solution: Batch small tokens before processing:

```python
token_batch = []
for token in tokens:
    token_batch.append(token)
    if len(token_batch) >= BATCH_SIZE:
        process_batch(token_batch)
        token_batch = []
```

Testing recommendations:

- Test with various query lengths
- Test with Unicode/emoji content
- Test connection interruption scenarios
- Test timeout scenarios
- Compare streaming vs non-streaming response times
- Memory usage monitoring during long responses
- Network usage patterns
Edge cases to handle:

- Invalid JSON chunks
- Network disconnection mid-stream
- Model timeout scenarios
- Empty responses
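Several of these edge cases (invalid JSON chunks, empty responses, mid-stream errors) can be exercised offline with a small self-contained sketch; `parse_stream_lines` is a hypothetical helper for illustration, not part of the actual codebase:

```python
import json

def parse_stream_lines(lines):
    """Join 'response' fields from NDJSON chunks, skipping malformed or empty ones."""
    parts = []
    for line in lines:
        if not line.strip():          # empty line between chunks
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:  # malformed chunk: skip, don't crash
            continue
        if data.get("error"):         # mid-stream error indicator
            break
        parts.append(data.get("response", ""))
        if data.get("done"):          # model signalled completion
            break
    return "".join(parts)

chunks = [
    '{"response": "Hello"}',
    '{bad json',
    '',
    '{"response": " world"}',
    '{"response": "!", "done": true}',
]
print(parse_stream_lines(chunks))  # Hello world!
```

Feeding recorded chunk logs through a pure function like this makes the edge-case tests fast and deterministic, with no model or network involved.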
Fall back to non-streaming when streaming fails:

```python
if streaming_failed:
    return await self._process_non_streaming(prompt)
```

Temporarily disable streaming after repeated errors:

```python
if streaming_error_count > MAX_ERRORS:
    self.disable_streaming_temporarily()
```

Compare streaming performance against the non-streaming baseline:

```python
stream_start = time.time()
# ... streaming logic ...
stream_duration = time.time() - stream_start
if stream_duration > non_stream_duration * 1.5:
    logger.warning("Streaming slower than non-streaming")
```

Allow users to toggle streaming:

```python
settings.enable_streaming = user_preference
```

Basic usage with a streaming callback:

```python
async def stream_callback(data):
    if data["type"] == "token":
        print(data["content"], end="", flush=True)
    elif data["type"] == "complete":
        print("\n[Done]")

response = await brain.process_streaming(
    "Tell me about AI",
    stream_callback
)
```

Advanced callback with TTS and UI integration:

```python
async def advanced_callback(data):
    if data["type"] == "token":
        ui.update_text(data["content"])
    elif data["type"] == "sentence":
        await tts_engine.speak(data["content"])
    elif data["type"] == "complete":
        ui.mark_complete()
```

Streaming makes Autumn feel much more responsive and engaging, but requires careful handling of:
- Network issues
- JSON parsing errors
- Buffer management
- TTS integration
- Error recovery
The implementation includes comprehensive error handling and fallback mechanisms to ensure reliability while providing the benefits of real-time streaming.
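As a closing illustration, the buffering pieces above (incremental UTF-8 decoding plus sentence buffering) can be combined in one self-contained sketch. The byte chunks here are simulated rather than fetched from Ollama, and `sentences_from_chunks` is a hypothetical helper, not the actual implementation:

```python
import codecs

SENTENCE_ENDINGS = ('.', '!', '?', ':', ';')

def sentences_from_chunks(byte_chunks):
    """Decode raw byte chunks incrementally and yield complete sentences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in byte_chunks:
        buffer += decoder.decode(chunk)  # partial characters stay buffered
        while True:
            # find the earliest sentence-ending character, if any
            idx = min((buffer.find(e) for e in SENTENCE_ENDINGS if e in buffer),
                      default=-1)
            if idx == -1:
                break
            yield buffer[:idx + 1].strip()
            buffer = buffer[idx + 1:]
    # flush whatever remains after the stream ends
    tail = (buffer + decoder.decode(b"", final=True)).strip()
    if tail:
        yield tail

# Simulate a stream where a multi-byte character is split across chunks.
emoji = "😀".encode("utf-8")
stream = [b"Hi there. Nice ", emoji[:2], emoji[2:], b" day! Bye"]
print(list(sentences_from_chunks(stream)))  # ['Hi there.', 'Nice 😀 day!', 'Bye']
```

A generator like this slots naturally between the network layer and the TTS engine: each yielded sentence can be handed straight to `_speak_sentence` while the rest of the response is still streaming.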