Lightweight wake word detection that runs locally and is suitable for resource-constrained devices like the Raspberry Pi. It requires no model training to support custom wake words and can be fully configured by end users on their devices. The system is based on feature extraction combined with time-warping comparison against a user-defined reference set.
- Python 3.9 or later
- pip (Python package manager)
- Audio input device (e.g., microphone)
pip install local-wakesounddevice package depends on PortAudio which is not installed automatically on Linux. You can install it manually on Ubuntu with this:
sudo apt install libportaudio2git clone https://github.com/st-matskevich/local-wake.git
cd local-wake
pip install .The reference set is a collection of wake word recordings used as templates during detection. Usually, 3-4 samples are sufficient to achieve reliable detection performance.
lwake record ref/sample-1.wavref/sample-1.wav- Path for the recorded file
Optional arguments:
--duration(default: 3) - Duration in seconds--no-vad- Skip Voice Activity Detection silence trimming
Alternatively, you may use any recording tool of your choice. However, make sure that appropriate preprocessing is applied - specifically, silence must be trimmed from the recordings to achieve proper detection performance. A simple example of recording on Linux is:
arecord -d 3 -r 16000 -c 1 -f S16_LE output.wav
Notes:
- Current Voice Activity Detection may occasionally be too aggressive, therefore, verify your recordings to ensure the wake word is fully captured and not inadvertently trimmed.
To evaluate comparison and determine a suitable detection threshold:
lwake compare ref/sample-1.wav ref/sample-2.wavref/sample-1.wav- Path to the first file for comparisonref/sample-2.wav- Path to the second file for comparison
Optional arguments:
--method(default:embedding) - Feature extraction method:embeddingormfcc
Once reference set is ready and threshold value has been identified:
lwake listen reference/folder 0.1 reference/folder- Directory containing your reference wake word .wav files0.1- Detection threshold. Adjust this value based on your comparison tests to balance sensitivity and false positives
Optional arguments:
--method(default:embedding) - Feature extraction method:embeddingormfcc--buffer-size(default: 2.0) - Audio buffer size in seconds--slide-size(default: 0.25) - Step size in seconds for the sliding window--debug- Enable debug logs to observe real-time scores for incoming audio chunks
All logs are printed to stderr, while detection events are printed in JSON format to stdout:
{"timestamp": 1754947173771, "wakeword": "sample-01.wav", "distance": 0.00943875619501332}Notes:
- Buffer size should be similar to or slightly larger than your reference recording length
- Slide size can be set lower for better precision at the cost of higher CPU usage
You can also use local-wake as a Python library:
import lwake
# Record audio sample
lwake.record("sample.wav", duration=3, trim_silence=True)
# Compare two audio files
distance = lwake.compare("file1.wav", "file2.wav", method="embedding")
print(f"Distance: {distance}")
# Real-time detection. Callback blocks further listening until return.
# Callback also exposes underlying sounddevice stream if you need to read more audio
def handle_detection(detection, stream):
print(f"Detected '{detection['wakeword']}' at {detection['timestamp']}")
# audio, _ = stream.read(16000) # Read 1 second of audio
# soundfile.write("input.wav", audio, samplerate=16000) # Save recording
lwake.listen("reference/folder", threshold=0.1, method="embedding", callback=handle_detection)This repository includes several pre-recorded examples for experimenting with the project. You can find them in the examples directory. While each example provides a suggested detection threshold, this value may require adjustment based on differences in microphone quality and environment.
Existing solutions for wake word detection can generally be divided into two categories:
-
Classical deterministic, speaker-dependent approaches - Typically based on MFCC feature extraction combined with DTW, as used in projects such as Rhasspy Raven or Snips.
- Advantages: Support for user-defined wake words with minimal development effort.
- Limitations: Strongly speaker-dependent, requiring sample collection from all intended users. Highly sensitive to background noise.
-
Modern model-based, speaker-independent approaches - Use neural models to classify wake words directly, as in openWakeWord or Porcupine.
- Advantages: High precision across multiple speakers without additional sample collection.
- Limitations: Do not support arbitrary user-defined wake words. Adapting to product-specific wake words requires model retraining or fine-tuning, which, depending on the solution, can be complex and typically requires at least a basic understanding of machine learning concepts and dataset preparation.
Choosing either category imposes strict limitations: deterministic methods sacrifice robustness, while model-based methods sacrifice adaptability.
local-wake combines neural feature extraction with classical sequence matching to achieve flexible and robust wake word detection. It uses a pretrained Google's speech-embedding model (converted to ONNX format for efficient inference) to extract speech features, then applies Dynamic Time Warping to compare incoming audio against a user-defined reference set of wake word samples.
This approach merges the advantages of both categories described above: it supports user-defined wake words like traditional deterministic methods, while benefiting from the enhanced feature representations and noise robustness provided by neural models. The result is a system that delivers good precision and flexibility without requiring extensive model training or large datasets.
local-wake achieves 98.6% detection accuracy on clean same-speaker audio using the Qualcomm Keyword Speech Dataset.
For detailed evaluation results, see the benchmark documentation.
- Consider using a small model on top of feature extraction for comparison instead of DTW
- Consider using noise suppression for audio preprocessing
Distributed under the MIT License. See LICENSE for more information.
Want a new feature added? Found a bug? Go ahead and open a new issue or feel free to submit a pull request.