Tai is a fully local, real-time multimodal AI assistant built as a modular system of specialized services.
The project aims to provide a voice-first assistant capable of:
- understanding spoken input
- reasoning with a local LLM
- answering through local speech synthesis
- supporting barge-in when the user starts speaking
- exposing observable runtime state and performance metrics
- evolving toward a real-time UI, an expressive 2D avatar, and screen vision capabilities
Tai is designed around a simple idea:
Keep intelligence local, split responsibilities cleanly, and make every component replaceable.
Tai is not a monolithic AI application.
It is a coordinated system of specialized services:
voice input
→ STT pipeline
→ Orchestrator
→ LLM service
→ TTS service
→ audio output
→ UI / avatar feedback
The architecture favors:
- local execution
- service isolation
- event-driven communication
- clear ownership of state and decisions
- measurable performance at each step of the pipeline
Tai is built around events and callbacks.
Services emit facts:
- speech started
- transcript accepted
- LLM response completed
- TTS playback started
- TTS playback completed
The orchestrator receives those facts, updates the session state, and decides what should happen next.
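As a rough illustration (the real orchestrator is a Spring Boot service; the class and event names below are hypothetical), this pattern boils down to services posting small fact objects and a single dispatcher deciding what to do with them:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict
import time

@dataclass
class Event:
    """A fact emitted by a service, e.g. 'stt.transcript_accepted'."""
    type: str
    payload: Dict[str, Any] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

class Orchestrator:
    """Routes facts to handlers; owns the decision of what happens next."""
    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[Event], None]] = {}

    def on(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self.handlers[event_type] = handler

    def emit(self, event: Event) -> None:
        handler = self.handlers.get(event.type)
        if handler:
            handler(event)

# Example wiring: a service reports a fact, the orchestrator decides.
orchestrator = Orchestrator()
orchestrator.on("stt.transcript_accepted",
                lambda e: print("start LLM turn for:", e.payload["text"]))
orchestrator.emit(Event("stt.transcript_accepted", {"text": "hello"}))
```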
Each service has one primary responsibility:
- STT listener → capture and segment speech
- STT transcription → transcribe audio
- Orchestrator → own conversation decisions
- LLM service → generate assistant text
- TTS service → synthesize and play speech
- UI → display state and controls
- Avatar → render expressive 2D visual presence
- Vision → analyze the screen on demand
The system is designed so individual capabilities can evolve independently:
- Whisper can be replaced by another STT engine.
- Piper can be replaced by another TTS engine.
- Ollama models can be swapped.
- UI and avatar layers can evolve without changing the conversation core.
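As a sketch of what this replaceability means at the code level (the Protocol names below are illustrative, not the project's actual interfaces):

```python
from typing import Protocol

class STTEngine(Protocol):
    """Anything that can turn audio bytes into text can back the STT service."""
    def transcribe(self, audio: bytes) -> str: ...

class TTSEngine(Protocol):
    """Anything that can turn text into a playable WAV can back the TTS service."""
    def synthesize(self, text: str) -> bytes: ...

# Swapping Whisper for another engine, or Piper for another voice, means
# providing a different object that satisfies the protocol; the orchestrator
# never depends on the concrete engine.
```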
┌────────────────────┐
│  Microphone / UI   │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│    STT Pipeline    │
│ capture + whisper  │
└─────────┬──────────┘
          │ events
          ▼
┌────────────────────┐
│    Orchestrator    │
│ state + decisions  │
└─────────┬──────────┘
          │ service call
          ▼
┌────────────────────┐
│        LLM         │
│      service       │
└─────────┬──────────┘
          │ events
          ▼
┌────────────────────┐
│    Orchestrator    │
│ state + decisions  │
└─────┬──────────┬───┘
      │          │
      ▼          ▼
┌────────────┐  ┌──────────┐
│ UI/Avatar  │  │   TTS    │
│   layer    │  │ service  │
└─────┬──────┘  └────┬─────┘
      │              │
      ▼              ▼
 UI / avatar   audio output
  feedback
The orchestrator is the central non-AI decision layer.
It owns:
- event routing
- session state
- active conversation turn
- conversation history
- barge-in decisions
- LLM request orchestration
- TTS request and stop orchestration
- conversation logs
- performance metrics aggregation
The orchestrator does not perform transcription, generation or speech synthesis itself. It coordinates specialized services.
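A minimal Python sketch of the kind of state the orchestrator owns (field names are hypothetical; the actual implementation lives in the Spring Boot orchestrator):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ConversationTurn:
    """One user utterance and the assistant's reply, plus turn-level timing."""
    turn_id: int
    user_text: str = ""
    assistant_text: str = ""
    speech_started_at: Optional[float] = None
    transcript_at: Optional[float] = None
    llm_done_at: Optional[float] = None
    playback_done_at: Optional[float] = None

@dataclass
class Session:
    """State owned by the orchestrator, never by the AI services."""
    history: List[ConversationTurn] = field(default_factory=list)
    active_turn: Optional[ConversationTurn] = None
    speaking: bool = False
```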
The STT pipeline converts microphone input into accepted or rejected text events.
The pipeline is split into two responsibilities:
- continuous microphone listening and speech segmentation
- pure Whisper transcription
This separation keeps audio capture, gatekeeping and transcription independent.
Key concepts:
- continuous microphone listening
- speech-start detection
- speech-end detection
- pre-filtering before transcription
- Whisper transcription
- post-filtering after transcription
- STT events sent to the orchestrator
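A minimal transcription sketch using faster-whisper; the model size, language, and filter thresholds below are illustrative assumptions, not the project's actual configuration:

```python
from faster_whisper import WhisperModel

# Small model on a CUDA GPU for low-latency local transcription.
model = WhisperModel("small", device="cuda", compute_type="float16")

def transcribe_segment(wav_path: str) -> str | None:
    """Transcribe one captured speech segment, with simple post-filtering."""
    segments, info = model.transcribe(wav_path, vad_filter=True, language="en")
    text = " ".join(seg.text.strip() for seg in segments).strip()
    # Post-filter: drop empty or clearly spurious transcripts before emitting
    # an 'accepted transcript' event to the orchestrator.
    if len(text) < 2:
        return None
    return text
```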
The LLM service is responsible for text generation.
Tai targets local models through Ollama.
The model is treated as a replaceable reasoning backend. The rest of the system interacts with it through a service boundary rather than embedding model-specific logic in the orchestrator.
Target model family:
- 7B–8B local models for a balance of latency and quality
- larger or specialized models when hardware allows
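A minimal sketch of the service boundary against Ollama's local HTTP API (the model name is an illustrative 8B choice, not a commitment):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

def generate_reply(history: list[dict]) -> str:
    """Ask the local model for the next assistant message (non-streaming)."""
    response = requests.post(OLLAMA_URL, json={
        "model": "llama3.1:8b",     # illustrative 8B model; swappable
        "messages": history,         # [{"role": "user", "content": "..."}]
        "stream": False,
    }, timeout=120)
    response.raise_for_status()
    return response.json()["message"]["content"]
```

Because the call goes through a plain HTTP boundary, swapping the model is a configuration change rather than a code change.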
The TTS service converts assistant replies into spoken audio.
The current direction is local speech synthesis through Piper.
Key concepts:
- local voice model
- local WAV synthesis
- local audio playback
- playback lifecycle callbacks
- stop command support for barge-in
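A minimal synthesis-and-playback sketch around the Piper CLI (the voice model name and the ALSA `aplay` player are assumptions; the real service also reports playback lifecycle callbacks):

```python
import subprocess

def speak(text: str, voice_model: str = "en_US-lessac-medium.onnx") -> subprocess.Popen:
    """Synthesize text to a WAV with Piper, then play it.

    Returns the playback process so the orchestrator can terminate it on barge-in.
    """
    subprocess.run(
        ["piper", "--model", voice_model, "--output_file", "reply.wav"],
        input=text.encode("utf-8"),
        check=True,
    )
    # Playback as a child process that can be stopped mid-utterance.
    return subprocess.Popen(["aplay", "reply.wav"])

# Barge-in: player = speak("..."); later, player.terminate() stops audio output.
```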
The UI is the real-time control and monitoring surface for Tai.
Capabilities:
- display global system state
- display component health
- show current conversation turn
- show latest user transcript
- show latest assistant reply
- expose conversation history
- expose runtime toggles and diagnostics
The avatar layer is Tai's visual presence.
The target is an expressive 2D avatar with a VTuber-like rendering style.
Capabilities:
- 2D rigged character rendering
- idle animation
- speech animation
- lip sync
- facial expressions
- emotion and state visualization
- subtle pseudo-3D head/body orientation
- limited left/right turning without requiring a full 3D model
The avatar remains a rendering layer, not a business logic layer.
Screen vision is an on-demand capability.
Pipeline:
screen capture
→ OCR / visual extraction
→ structured text summary
→ LLM context
Screen analysis is intentionally modeled as an explicit capability, not as a permanent background observer.
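A minimal on-demand capture sketch (the OCR backend here, pytesseract, is an illustrative choice):

```python
from PIL import ImageGrab   # screen capture, platform permitting
import pytesseract          # illustrative OCR backend

def capture_screen_context(max_chars: int = 2000) -> str:
    """Grab the screen once, extract text, and return a bounded summary
    suitable for injecting into the LLM context."""
    screenshot = ImageGrab.grab()
    text = pytesseract.image_to_string(screenshot)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)[:max_chars]
```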
1. User starts speaking
2. STT detects speech start
3. Orchestrator handles possible barge-in
4. STT captures and transcribes the utterance
5. Orchestrator creates a conversation turn
6. LLM generates an assistant response
7. TTS synthesizes and plays speech
8. Orchestrator finalizes the conversation turn
9. Metrics and conversation logs are written
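Compressed into a hedged Python sketch (where `session` stands for the hypothetical state holder sketched earlier, and all names are illustrative), the interesting decision is the barge-in at steps 2–3:

```python
def on_speech_started(session, tts_player):
    """Steps 1-3: a new utterance may interrupt the assistant mid-reply."""
    if session.speaking and tts_player is not None:
        tts_player.terminate()      # barge-in: stop TTS playback immediately
        session.speaking = False    # the turn now belongs to the user again

def on_transcript_accepted(session, text, generate_reply, speak):
    """Steps 5-8: open a turn, get a reply, speak it, finalize."""
    reply = generate_reply(text)    # LLM service call
    speak(reply)                    # TTS service call
    session.history.append((text, reply))
```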
Tai tracks several runtime states to keep the system observable and controllable:
- listening state
- thinking state
- speaking state
- interruption state
- active conversation turn
- session context
- health status
- performance metrics
- UI state
- avatar state
- emotion state
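For illustration, the coarse assistant states could be modeled as a simple enum (names are hypothetical):

```python
from enum import Enum, auto

class AssistantState(Enum):
    """Coarse runtime states surfaced to the UI and the avatar."""
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
```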
Tai is designed to make latency visible.
The orchestrator centralizes turn-level performance metrics, including:
- STT transcription duration
- speech-to-transcript latency
- LLM generation duration
- TTS synthesis duration
- first-audio latency
- TTS speech duration
- total turn duration
This makes it possible to identify whether a delay comes from:
- speech capture
- STT
- LLM
- TTS synthesis
- audio playback
- orchestration
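A sketch of how turn-level metrics can be derived once the timestamps above are recorded (field names follow the hypothetical ConversationTurn sketch earlier):

```python
def turn_metrics(turn) -> dict:
    """Derive per-turn latencies from timestamps recorded by the orchestrator."""
    return {
        "speech_to_transcript_s": turn.transcript_at - turn.speech_started_at,
        "llm_generation_s":       turn.llm_done_at - turn.transcript_at,
        "total_turn_s":           turn.playback_done_at - turn.speech_started_at,
    }
```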
Health and runtime state are exposed through service-level health endpoints and consolidated by the orchestration layer.
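For the Python services, a FastAPI health endpoint of roughly this shape is enough for the orchestration layer to poll (fields are illustrative); the Java side relies on Spring Actuator:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    """Liveness snapshot consumed by the orchestration layer."""
    return {"status": "ok", "service": "tts", "model_loaded": True}
```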
| Area | Technology |
|---|---|
| Language runtime | Java 21, Python |
| Java framework | Spring Boot |
| Python API framework | FastAPI |
| Build tool | Maven |
| STT transcription | faster-whisper / Whisper |
| STT acceleration | CUDA-capable GPU |
| LLM runtime | Ollama |
| TTS engine | Piper |
| API documentation | OpenAPI / Swagger UI |
| Health checks | Spring Actuator / service health endpoints |
| Logging | Logback and dedicated domain loggers |
| UI target | Web frontend |
| Avatar target | Live2D / Cubism-style 2D avatar, or equivalent 2D rigging runtime |
- Voice input
- STT pipeline
- Local LLM response generation
- Local TTS playback
- Event-driven orchestrator
- Session management
- Barge-in support
- Conversation logs
- Performance metrics
- Basic monitoring surface
- System health dashboard
- Conversation history
- Current turn visualization
- Thinking / speaking / listening states
- Runtime controls
- Debug panels
- 2D rigged avatar rendering
- Lip sync
- Idle and speech animations
- Facial expressions
- State-driven visual reactions
- Subtle pseudo-3D orientation
- Avatar integration with assistant state
- On-demand screen capture
- OCR / visual extraction
- Structured screen analysis
- LLM integration
- Long-term memory
- Retrieval-augmented generation
- Domain-specific behavior
- Lightweight model adaptation
- Curated dataset processing
Tai uses local models only.
The project favors:
- small-to-medium local models
- careful prompt design
- contextual orchestration
- performance-aware model selection
- retrieval and memory over full fine-tuning
The goal is not to depend on one specific model, but to build a system that can swap models as local AI tooling improves.
Tai keeps model behavior configurable at the application level.
Filtering and policies are expected to be handled outside the model through:
- configurable behavior rules
- optional text filters
- runtime toggles
- orchestration policies
This keeps model execution local while allowing the application to decide how strict or permissive the assistant should be.
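A hedged sketch of what an application-level output policy could look like (patterns and toggle names are purely illustrative):

```python
import re

# Illustrative application-level policy: filtering happens outside the model.
FILTERS_ENABLED = True                                  # runtime toggle
BLOCKED_PATTERNS = [re.compile(r"(?i)\bforbidden\b")]   # configurable rules

def apply_output_policy(text: str) -> str:
    """Post-process assistant text according to application policy."""
    if not FILTERS_ENABLED:
        return text
    for pattern in BLOCKED_PATTERNS:
        text = pattern.sub("[filtered]", text)
    return text
```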
- Fully local operation
- Real-time voice interaction
- Modular service architecture
- Replaceable AI components
- Observable runtime behavior
- Clear separation between AI execution and orchestration decisions
- Developer-friendly experimentation
Primary hardware target:
- NVIDIA RTX 4070 with 12GB VRAM
- 64GB RAM
The architecture is designed to support local real-time interaction under multi-service load.
Tai prioritizes architecture over model size.
The goal is not to build the biggest AI, but a responsive, modular and controllable assistant that can evolve service by service.
The Tai name and project identity are not licensed for use in a way that suggests official endorsement or affiliation without permission.