
Voice Response System with Multi-Layer Intent Recognition Pipeline

A hybrid intent recognition system combining classical NLP pattern matching, neural semantic embeddings, and large language models for robust conversational AI. Features intelligent RAG and a novel three-layer architecture that balances speed, accuracy, and cost-efficiency.

Use Case: Domain-agnostic customer service with voice interaction

Current Domain: Pizza Restaurant

Research Focus: Hybrid NLP approach combining rule-based, embedding-based, and generative techniques with intelligent context management

Key Features

Intent Recognition Pipeline

  • Multi-Layer Cascading Architecture: Three-layer confidence-driven pipeline achieving 97.5% accuracy (a minimal sketch follows this list)

    • Layer 1 - Algorithmic (72% of queries, 0.5ms): TF-IDF indexing, Levenshtein/Jaccard similarity, contextual boost engine
    • Layer 2 - Semantic (18% of queries, 13.7ms): Sentence-BERT embeddings with persistent caching
    • Layer 3 - LLM Fallback (10% of queries): Ollama local/cloud models with conversation history
  • Contextual Boost Engine: Domain-specific heuristics for handling sarcasm, negative sentiment, multi-intent queries, and edge cases
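A minimal sketch of how such a confidence-driven cascade can be wired. The IntentResult dataclass and the layer callables here are illustrative placeholders, not the project's actual classes; only the thresholds mirror the configuration documented below:

from dataclasses import dataclass
from typing import Callable

@dataclass
class IntentResult:
    intent: str
    confidence: float
    layer: str

def classify(
    query: str,
    algorithmic: Callable[[str], IntentResult],
    semantic: Callable[[str], IntentResult],
    llm: Callable[[str], IntentResult],
    algo_threshold: float = 0.65,   # ALGORITHMIC_THRESHOLD in app.py
    sem_threshold: float = 0.5,     # SEMANTIC_THRESHOLD in app.py
) -> IntentResult:
    """Run the layers in order; fall through whenever confidence is too low."""
    result = algorithmic(query)                 # Layer 1: ~0.5ms pattern matching
    if result.confidence >= algo_threshold:
        return result
    result = semantic(query)                    # Layer 2: ~13.7ms SBERT embeddings
    if result.confidence >= sem_threshold:
        return result
    try:
        return llm(query)                       # Layer 3: LLM fallback
    except RuntimeError:
        return result                           # graceful fallback to the Layer 2 result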

Intelligent Response Generation

  • Selective RAG Implementation: Intent-driven context injection reducing LLM token usage by ~53% (sketched after this list)

    • Only relevant business data sent based on classified intent
    • Dynamic prompt construction with conversation history awareness
  • Hybrid Response Strategy: Template-based responses for deterministic queries, LLM generation for complex interactions

    • Configurable template system (response_templates.json) for instant replies
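A sketch of the selective-RAG idea behind the bullets above. The intent-to-section mapping is a hypothetical reduction of res_info.json, and the prompt layout is illustrative; the project's actual keys and templates live in its JSON and prompt files:

import json

# Hypothetical mapping from classified intent to the business-data sections
# that get injected into the LLM prompt; everything else is withheld.
CONTEXT_MAP = {
    "menu_inquiry":   ["menu"],
    "hours_location": ["hours", "location"],
    "delivery":       ["delivery"],
    "order":          ["menu", "delivery"],
}

def build_prompt(intent: str, query: str, history: list[str],
                 business_info: dict) -> str:
    # Only the sections relevant to the intent are serialized into the prompt,
    # which is where the ~53% token reduction comes from.
    sections = CONTEXT_MAP.get(intent, [])
    context = {k: business_info[k] for k in sections if k in business_info}
    return (
        f"Context:\n{json.dumps(context, indent=2)}\n\n"
        "History:\n" + "\n".join(history[-6:]) +
        f"\n\nUser: {query}"
    )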

End-to-End Voice Interaction

  • ASR (Automatic Speech Recognition): OpenAI Whisper with audio preprocessing

    • Noise reduction, normalization, silence trimming
  • TTS (Text-to-Speech): Coqui TTS neural vocoder (VITS architecture)

    • Response sanitization for natural speech (currency formatting, symbol replacement)
  • Full conversational loop: Speech → Transcription → Intent Classification → RAG → Response Generation → Speech Synthesis
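A condensed sketch of this loop, assuming openai-whisper and coqui-tts are installed. The classify and respond parameters stand in for the intent pipeline and response strategy described above; the sanitization rules shown are examples, not the project's full set:

import re
import whisper               # pip install openai-whisper
from TTS.api import TTS      # pip install coqui-tts

asr = whisper.load_model("tiny.en")                 # ASR_MODEL from app.py
tts = TTS("tts_models/en/ljspeech/vits")            # TTS_MODEL from app.py

def sanitize_for_speech(text: str) -> str:
    """Make replies TTS-friendly, e.g. '$12.50' -> '12 dollars 50 cents'."""
    text = re.sub(r"\$(\d+)\.(\d{2})", r"\1 dollars \2 cents", text)
    return text.replace("&", " and ")

def handle_turn(audio_path: str, classify, respond) -> str:
    """One conversational turn: speech -> text -> intent -> reply -> speech."""
    query = asr.transcribe(audio_path)["text"]      # ASR
    intent = classify(query)                        # 3-layer intent pipeline
    reply = respond(intent, query)                  # template or LLM + RAG
    tts.tts_to_file(text=sanitize_for_speech(reply),
                    file_path="reply.wav")          # TTS
    return reply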

Architecture Highlights

Engineering Core

Reliability & Error Handling

  • Comprehensive retry logic with graceful fallbacks (2 retries for LLM failures)
  • Full error handling across all components with meaningful error messages
  • Automatic fallback to previous layer results on LLM failure

Observability & Monitoring

  • Detailed per-component statistical tracking (accuracy, latency, token usage, layer distribution)
  • Real-time statistics endpoint (/statistics) for monitoring
  • Structured logging with color-coded module identification

Performance Optimization

  • Persistent embedding cache with MD5-based invalidation for near-instant cold starts (see the sketch below)
  • TF-IDF inverted index for O(1) candidate selection
  • Early exit thresholds to prevent unnecessary computation
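A sketch of the cache-invalidation idea: key the stored embeddings by an MD5 digest of the pattern file, so a stale cache is rebuilt automatically whenever the patterns change. Paths and the encode callable are illustrative, not the project's actual interfaces:

import hashlib
import pickle
from pathlib import Path

def load_pattern_embeddings(patterns_path: Path, cache_dir: Path, encode):
    """Return cached embeddings if the pattern file is unchanged, else rebuild."""
    digest = hashlib.md5(patterns_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"embeddings_{digest}.pkl"
    if cache_file.exists():                      # cache hit: near-instant start
        return pickle.loads(cache_file.read_bytes())
    embeddings = encode(patterns_path)           # cache miss: re-encode patterns
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(pickle.dumps(embeddings))
    return embeddings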

Code Quality

  • Type hints and dataclasses for type safety
  • Modular architecture with clear separation of concerns
  • Comprehensive test suite: 76 unit tests + integration tests

Domain-Agnostic Design

Configuration-Driven Adaptation

  • Intent patterns in intent_patterns.json (290 patterns across 6 intents)
  • Linguistic resources in linguistic_resources.json (synonyms, critical keywords, filler words)
  • Business information in res_info.json (menu, hours, location, delivery info)
  • Response templates in response_templates.json (templated replies for common queries)
  • No core code changes required for domain adaptation

Highly Configurable

  • Switchable models: ASR (5 Whisper variants), TTS (Coqui models), LLM (Ollama models), Semantic (SBERT models)
  • Tunable thresholds: Algorithmic (0.65), Semantic (0.5), Minimum confidence (0.5)
  • Modular toggles: Enable/disable any layer independently for testing and optimization
  • Configurable boost engine rules via boostEngine.py

Quick Start

Get the system running in 3 steps:

1. Install Dependencies

# Install system dependencies
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.11 python3.11-venv ffmpeg espeak-ng

# Clone and setup
git clone https://github.com/atpritam/Voice-Assistant.git
cd Voice-Assistant
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. LLM Backend

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

Option A: Local Ollama LLM (Recommended - Free & Private)

# Pull model and start service
ollama pull llama3.2:3b-instruct-q4_K_M
ollama serve &

Option B: Cloud Ollama Models (Faster setup, requires Ollama Signin)

ollama pull gpt-oss:120b-cloud
ollama signin

Sign in to Ollama with the link provided in the terminal.

Edit app.py and set LLM_MODEL = "gpt-oss:120b-cloud" (cloud models are auto-detected by the -cloud suffix)
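The auto-detection can be as simple as checking that suffix; a hypothetical one-liner (see app.py for the actual logic):

LLM_MODEL = "gpt-oss:120b-cloud"
IS_CLOUD_MODEL = LLM_MODEL.endswith("-cloud")  # cloud models carry the -cloud suffix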

3. Run the Application

# Create .env file with generated SECRET_KEY
echo "SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')" > .env

# Start the application
python app.py

Access the web interface at http://localhost:5000

Windows Installation

The project runs on Windows, but some dependencies (e.g., audio processing) may require additional setup. For best performance, use WSL 2 (Windows Subsystem for Linux) to follow the Ubuntu instructions above. Alternatively, install natively:

  1. Install Prerequisites:

    • Install Python 3.11 from the official site (include PATH option).
    • Install Git from git-scm.com.
    • Install Chocolatey (package manager): Run PowerShell as Admin and execute Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1')).
    • Install FFmpeg: choco install ffmpeg.
    • Install Espeak-ng (for Coqui TTS): choco install espeak-ng.
  2. Clone Repository:

    git clone https://github.com/atpritam/Voice-Assistant.git
    cd Voice-Assistant
  3. Virtual Environment:

    python -m venv .venv
    .venv\Scripts\activate
  4. Install Dependencies:

    pip install -r requirements.txt
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    pip install git+https://github.com/openai/whisper.git
    pip install coqui-tts
  5. Ollama Setup:

    • Download and run OllamaSetup.exe from ollama.com/download/windows.
    • Pull model: ollama pull llama3.2:3b-instruct-q4_K_M.
  6. Run the System: Create a .env file with a SECRET_KEY (see step 3 of the Quick Start), run python app.py, and access the interface at http://127.0.0.1:5000.

Note: For full Linux compatibility, enable WSL 2 via wsl --install in PowerShell, install Ubuntu, and follow the Linux steps above.

Web Interface

Voice System Interface

The web interface provides a simple and intuitive way to interact with the voice assistant, showing real-time transcription and system responses.

System Architecture

Intent Recognition Pipeline

The system implements a cascading three-layer architecture where each layer applies progressively sophisticated techniques until confident intent recognition is achieved:

  1. Algorithmic Layer: Fast pattern matching using NLP preprocessing, TF-IDF candidate ranking, synonym expansion, and fuzzy string similarity
  2. Semantic Layer: Neural semantic matching using transformer-based sentence embeddings (SBERT) for distributional semantic similarity
  3. LLM Layer: Fallback to large language models for ambiguous queries and natural response generation

Each layer can be independently enabled or disabled with configurable confidence thresholds.

User Voice Input
     |
[ASR - Whisper]
     |
Text Query
     |
+-------------------------+
| Intent Recognition      |
|                         |
| 1. Algorithmic (fast)   | <- 72% of queries
|    |- Pattern matching  |
|   |- Levenshtein/Jaccard|
|    +- Boost Engine      |
|         |               |
| 2. Semantic (accurate)  | <- 18% of queries
|    +- Neural embeddings |
|         |               |
| 3. LLM (fallback)       | <- 10% of queries
|    +- Ollama/Cloud      |
+-------------------------+
     |
     v
[Selective RAG]
     |
     +-- Intent-Based Context Injection
     |   (menu_inquiry -> menu data only)
     |   (~53% token reduction)
     |
     v
[Response Strategy Decision]
     |
     +-- Template Match? -> Direct Response
     |   (greetings, hours, fixed facts)
     |
     +-- Complex Query? -> LLM Generation
         (with conversation history)
     |
     v
Response Generation
     |
[TTS - Coqui VITS]
     |
Voice Output

Full Pipeline vs. LLM-Only

Intent Recognition Performance Comparison (600 queries, test mode):

Configuration                     Accuracy  Latency   Q/s   Token Usage Δ
Full Pipeline (Llama3.2 3B)       97.50%    31.4ms    31.8  baseline
Full Pipeline (GPT-OSS 120B)      97.50%    106.8ms   9.4   +53%
LLM-Only (Llama3.2 3B, local)     87.17%    278.9ms   3.6   +981%
LLM-Only (GPT-OSS 120B, cloud)    94.83%    1.03s     1.0   +1,524%

*GPT-OSS 120B is an open-weight GPT model released by OpenAI on August 5, 2025.

See testResults/comparativeTest/ for detailed comparative analysis.

System Requirements

Minimum:

  • CPU: Modern multi-core processor
  • RAM: 8GB
  • Storage: 8GB free space
  • Python: 3.11 (Required) - Coqui TTS limitation

Recommended:

  • CPU: 4+ cores
  • RAM: 16GB
  • GPU: NVIDIA GPU with 4GB+ VRAM (CUDA-compatible)
  • Storage: 10GB free space

Note: All components work on CPU-only systems. GPU acceleration is optional but significantly improves performance.

Configuration

Environment Configuration

Create a .env file in the project root:

# Required
SECRET_KEY=your_generated_secret_key_here

Generate a secure SECRET_KEY:

python3 -c 'import secrets; print(secrets.token_hex(32))'

Pipeline Configuration

Edit app.py to configure the intent recognition pipeline:

# Layer enabling
ENABLE_ALGORITHMIC = True
ENABLE_SEMANTIC = True
ENABLE_LLM = True

# Confidence thresholds
ALGORITHMIC_THRESHOLD = 0.65
SEMANTIC_THRESHOLD = 0.5

# Model selection
SEMANTIC_MODEL = "all-MiniLM-L6-v2" # Options: all-mpnet-base-v2
LLM_MODEL = "llama3.2:3b-instruct-q4_K_M" # Options: "gpt-oss:120b-cloud", "gemma3:4b-it-qat"

TTS Configuration

ENABLE_TTS = True
TTS_MODEL = "tts_models/en/ljspeech/vits"

ASR Configuration

ENABLE_ASR = True
ASR_MODEL = "tiny.en"  # Options: tiny.en, base.en, small.en, medium.en, large
ENABLE_AUDIO_PREPROCESSING = True

Intent Patterns

The system's intent classification is driven by the utils/intent_patterns.json file, which contains 290 training patterns across 6 intent categories:

{
  "intent_name": {
    "patterns": ["example phrase 1", "example phrase 2", ...],
    "similarity_threshold": 0.40,
    "default_response": "Response text"
  }
}

Pattern Distribution:

  • order: 44 patterns - Purchase and ordering intents
  • complaint: 50 patterns - Issues, refunds, and service problems
  • menu_inquiry: 55 patterns - Menu questions and product information
  • delivery: 54 patterns - Delivery status, tracking, and options
  • hours_location: 46 patterns - Business hours and location queries
  • general: 41 patterns - Greetings, thanks, and general conversation
  • unknown: 0 patterns - Fallback for unrecognized intents

Diversity score: 0.970/1.000

Each intent has:

  • patterns: Training examples for pattern matching and embedding generation
  • similarity_threshold: Minimum confidence score required to accept the intent for algorithmic/semantic layer classification
  • default_response: Templated response when no LLM is available

Check the pattern file and test dataset distribution and diversity scores with: python test/data.py
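For illustration, a minimal snippet that reproduces the per-intent counts from the pattern file (the project's own, fuller check is python test/data.py):

import json
from pathlib import Path

def pattern_distribution(path: str = "utils/intent_patterns.json") -> dict[str, int]:
    """Count training patterns per intent, mirroring the distribution above."""
    intents = json.loads(Path(path).read_text(encoding="utf-8"))
    return {name: len(spec.get("patterns", [])) for name, spec in intents.items()}

if __name__ == "__main__":
    for intent, count in sorted(pattern_distribution().items()):
        print(f"{intent}: {count} patterns")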

Usage

Running the Application (inside the .venv)

python app.py

Access the web interface at http://localhost:5000

Using Ollama (Local LLM)

  1. Install Ollama from https://ollama.ai
    curl -fsSL https://ollama.com/install.sh | sh
    
  2. Pull the desired model:
    ollama pull llama3.2:3b-instruct-q4_K_M
  3. Ensure Ollama is running on http://localhost:11434
    ollama serve &
    # test local llm
    ollama run llama3.2:3b-instruct-q4_K_M "Say hello in 5 words"

Running Tests

The project includes comprehensive test suites for the intent recognition pipeline. Tests are located in the test/ directory.

Basic Comprehensive Test

Run a full evaluation of the current pipeline configuration:

python -m test.runtest

Comparative Analysis

Compare multiple pipeline configurations side-by-side:

python -m test.runtest -c

Boost Engine Analysis

Evaluate the impact of the contextual boost engine:

python -m test.runtest -b

Confusion Matrix and Error Analysis

Generate confusion matrix with per-intent metrics:

python -m test.runtest -mx

Failure Analysis

Generate detailed logs and score breakdown of misclassifications:

python -m test.runtest -f

Custom Test Configurations

# Test without semantic layer
python -m test.runtest --no-semantic

# Test algorithmic layer only
python -m test.runtest --no-semantic --no-llm

# Comparative test without boost engine
python -m test.runtest -c --no-boost

# Test without edge cases (standard dataset only)
python -m test.runtest --no-edge

# Single query test
python -m test.runtest "where is my pizza?" --exp delivery

Unit Tests

Run pytest-based checks for the algorithmic recognizer, similarity scoring, and text processing (76 tests):

python -m test.runtest -unit 

Performance Benchmarks

All tests use semantic model all-mpnet-base-v2 and LLM model llama3.2:3b-instruct-q4_K_M.

Extended Test Dataset (600 queries - with edge cases)

Comprehensive testing with 600 queries including 125 edge cases:

Configuration            Accuracy  Avg Time  Queries/s
Full Pipeline            97.33%    31.4ms    31.8
Algorithmic + Semantic   91.67%    4.1ms     244.8
Algorithmic + LLM        96.17%    75.3ms    13.3
Semantic + LLM           93.00%    56.1ms    17.8
Algorithmic Only         85.00%    0.5ms     1824.2
Semantic Only            86.17%    13.7ms    72.8
LLM Only                 87.17%    278.9ms   3.6

Note: Because LLM inference is non-deterministic, full-pipeline accuracy varies across repeated runs, ranging from 97.17% to 97.50% (±0.33 percentage points). This is expected behavior from stochastic sampling in the LLM layer of a hybrid architecture with generative components.

Layer Distribution (Full Pipeline)

  • Algorithmic layer: 72.7% of queries (436/600)
  • Semantic layer: 18.0% of queries (108/600)
  • LLM layer: 9.3% of queries (56/600)

Boost Engine Impact

Comparison with and without contextual boost rules on full pipeline (600 queries with edge cases):

Metric               Without Boost  With Boost  Improvement
Accuracy             94.00%         97.50%      +3.50 pp
Correct Predictions  564            585         +21
Query Time           41.0ms         30.8ms      25% faster
Algorithmic Usage    366            436         +70
Semantic Usage       164            108         -56
LLM Fallback         70             56          -14

Confusion Matrix Results (Full Pipeline - 600 queries)

Per-Intent Performance

Intent          Precision  Recall   F1-Score  Support
complaint       99.10%     95.65%   97.35%    115
delivery        95.83%     97.87%   96.84%    94
general         96.67%     98.31%   97.48%    59
hours_location  97.87%     97.87%   97.87%    94
menu_inquiry    98.40%     98.40%   98.40%    125
order           96.49%     97.35%   96.92%    113
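Per-intent tables like this can be reproduced from logged (true, predicted) intent pairs. A sketch using scikit-learn, which is an assumption here rather than necessarily what the test suite uses:

from sklearn.metrics import classification_report

def per_intent_report(y_true: list[str], y_pred: list[str]) -> str:
    # Precision, recall, F1, and support per intent, as in the table above
    return classification_report(y_true, y_pred, digits=4)

# Example: print(per_intent_report(["order", "delivery"], ["order", "order"]))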

See testResults/ directory for detailed analyses.

Test Dataset Quality

The benchmark results above are validated against a dataset with:

  • 600 total queries (475 normal + 125 edge cases)
  • 6 intent categories: order (113), complaint (115), menu_inquiry (125), hours_location (94), delivery (94), general (59)
  • Diversity score: 0.968/1.000 - High lexical variety, not repetitive memorization
  • Edge cases: multi-intent queries, sarcasm, typos, slang, very short queries, ambiguous phrasing, and noise/formatting issues
  • Zero duplicates - Unbiased evaluation

Key Technical Features

Algorithmic Layer

Core Pattern Matching Algorithms (a sketch follows this list):

  1. Levenshtein Distance (50% weight)

    • Purpose: Character-level string similarity
    • Function: Measures minimum edit operations (insertions, deletions, substitutions)
  2. Jaccard Similarity (50% weight)

    • Purpose: Word-level set similarity
    • Function: Compares word overlap between query and pattern using set intersection/union
    • Sub-components:
      • Exact word overlap: 70% weight
      • Synonym-expanded overlap: 30% weight
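A sketch of the 50/50 blend described above, with the 70/30 split between exact and synonym-expanded word overlap. The implementations are illustrative stand-ins for the project's scorer:

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity via dynamic programming."""
    if not a or not b:
        return float(a == b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

def jaccard_similarity(query: set[str], pattern: set[str],
                       synonyms: dict[str, set[str]]) -> float:
    """Word-set overlap: 70% exact, 30% synonym-expanded."""
    if not query or not pattern:
        return 0.0
    exact = len(query & pattern) / len(query | pattern)
    expanded = query | {s for w in query for s in synonyms.get(w, set())}
    syn = len(expanded & pattern) / len(expanded | pattern)
    return 0.7 * exact + 0.3 * syn

def pattern_score(query: str, pattern: str, synonyms: dict) -> float:
    """50% character-level + 50% word-level similarity."""
    return 0.5 * levenshtein_similarity(query, pattern) \
         + 0.5 * jaccard_similarity(set(query.split()), set(pattern.split()), synonyms)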

Additional Features:

  • Faster Processing: Inverted index with TF-IDF weighting for efficient candidate selection and ranking
  • Synonym Expansion: Domain-specific synonym dictionaries for lexical variation handling
  • N-gram Matching: Multi-word phrase detection with contextual bonuses
  • Contextual Boost Engine:
    • Negative sentiment keyword detection for complaint classification
    • Keyword co-occurrence patterns and contextual heuristics
    • Sarcasm detection for implicit complaint identification

Semantic Layer

  • Transformer Embeddings: Sentence-BERT models (MPNet, MiniLM) for semantic similarity
  • Embedding Caching: Persistent storage with MD5 invalidation for instant initialization and reduced cold-start latency
  • Batch Processing: Optimized pattern encoding
  • Top-K Averaging: Weighted average of top 3 pattern matches (50%, 30%, 20%) for robust similarity scoring
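A sketch of the top-3 weighted averaging using sentence-transformers, with the 50/30/20 weights listed above; the model name comes from the configuration section, and the scoring function is illustrative:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # SEMANTIC_MODEL from app.py

def semantic_score(query: str, pattern_embeddings: np.ndarray) -> float:
    """Weighted average of the top-3 cosine similarities (50/30/20)."""
    q = model.encode(query, normalize_embeddings=True)
    sims = pattern_embeddings @ q                   # cosine sim for unit vectors
    top3 = np.sort(sims)[::-1][:3]
    weights = np.array([0.5, 0.3, 0.2])[: len(top3)]
    return float(np.dot(top3, weights) / weights.sum())

# pattern_embeddings would be built once per intent (and cached), e.g.:
# pattern_embeddings = model.encode(patterns, normalize_embeddings=True)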

LLM Layer

  • Conversation Context: Multi-turn conversation history tracking for context-aware classification and response generation
  • Retry Logic: 2 retries with graceful fallback on transient failures
  • Support for both cloud and local Ollama models
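A sketch of the retry-with-fallback behavior against Ollama's documented /api/chat endpoint; the fallback argument stands in for whatever the previous layer produced, and the payload shape is the standard Ollama chat request:

import requests

def llm_classify(query: str, history: list[dict], fallback, retries: int = 2):
    """Call the local Ollama chat endpoint; fall back after `retries` failures."""
    payload = {
        "model": "llama3.2:3b-instruct-q4_K_M",
        "messages": history + [{"role": "user", "content": query}],
        "stream": False,
    }
    for attempt in range(retries + 1):
        try:
            r = requests.post("http://localhost:11434/api/chat",
                              json=payload, timeout=30)
            r.raise_for_status()
            return r.json()["message"]["content"]
        except requests.RequestException:
            if attempt == retries:
                return fallback          # graceful fallback to the previous layer's result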

Development

Extending to Other Domains

The system is designed for easy adaptation to new domains:

  1. Update Business Information (utils/res_info.json)

  2. Define Intent Patterns (utils/intent_patterns.json)

    • Add intent categories with example phrases
    • Set similarity thresholds per intent
    • Provide default responses
  3. Configure Linguistic Resources (utils/linguistic_resources.json)

    • Domain-specific synonyms
    • Critical keywords per intent
    • Filler words to filter
  4. Customize Response Templates (utils/response_templates.json)

    • Template responses for common queries
    • Exclusion/inclusion phrase rules
    • Response variations
  5. Adjust Boost Rules (intentRecognizer/algorithmic/boostEngine.py)

    • Domain-specific contextual heuristics
    • Sentiment detection rules
    • Multi-intent handling logic
  6. Update LLM Prompts (intentRecognizer/llm/templates.py)

    • Intent descriptions
    • Business workflows
    • RAG context mapping
  7. Create Test Dataset (test/data.py)

    • Representative queries for each intent
    • Edge cases specific to domain

No core code changes required - all adaptation is configuration-driven.

License

This project is part of a Bachelor's Thesis. All rights reserved.

Acknowledgments

  • OpenAI Whisper for speech recognition
  • Coqui TTS for speech synthesis
  • Sentence Transformers for semantic embeddings
  • Ollama for LLM inference

Contact

Author: Pritam Chakraborty
Supervisor: Dr. Eng. Jarosław Kozik
Institution: AGH University of Science and Technology, Krakow, Poland
Year: 2026
